Google Open Sources Its Data Interchange Format

A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."
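
For the curious, here is a minimal sketch of what a round trip looks like in practice. It assumes the Person message from Google's published example and the Java class that protoc generates from it; nothing below exists until protoc has produced the generated code:

    // Assumes a class generated by:  protoc --java_out=. person.proto
    // where person.proto contains (per Google's published example):
    //
    //   message Person {
    //     required string name  = 1;
    //     required int32  id    = 2;
    //     optional string email = 3;
    //   }
    import com.google.protobuf.InvalidProtocolBufferException;

    public class ProtoDemo {
        public static void main(String[] args) throws InvalidProtocolBufferException {
            // The generated builder API gives typed setters for every field.
            Person p = Person.newBuilder()
                    .setName("John Doe")
                    .setId(1234)
                    .setEmail("jdoe@example.com")
                    .build();

            byte[] wire = p.toByteArray();         // compact binary encoding
            Person copy = Person.parseFrom(wire);  // ...and back
            System.out.println(copy.getName());
        }
    }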
  • by Anonymous Coward on Tuesday July 08, 2008 @04:28PM (#24105431)

    It looks like Google has taken some of the good elements of CORBA and IIOP into its own interchange format.
    While CORBA certainly is bloated in a lot of ways, the IIOP wire protocol it uses is vastly faster and more efficient than any XML out there. And yes, it is just as "open" (publicly documented and freely available for use in any open source application) as any XML schema out there. J2EE uses IIOP as well, and it is technically possible to interoperate (although the problem with CORBA is that different implementations never really interoperated as they were supposed to).
        As a side note, I'd rather write IDL code than an XML schema any day of the week too, but that's another rant.

  • Fast (Score:5, Interesting)

    by JamesP ( 688957 ) on Tuesday July 08, 2008 @04:30PM (#24105457)

    "And, yes, it is very fast â" at least an order of magnitude faster than XML."

    Just wait for the XML zealots to come crashing in, refusing to believe that XML isn't the fastest, best solution to all the world's problems (including cancer), and of course the people at Google are amateurs and id10ts, and WHY DO YOU HATE XML, that kind of stuff.

    Or, as Joel Spolsky once said: http://www.joelonsoftware.com/articles/fog0000000296.html [joelonsoftware.com]

    No, there is nothing wrong with XML per se, except for the fans...

  • by dedazo ( 737510 ) on Tuesday July 08, 2008 @04:34PM (#24105539) Journal

    Looks like Google just invented the IIOP [wikipedia.org] wire protocol, which is also platform agnostic and an open standard.

    I guess the main difference here is that their "compiler" can generate the actual language-domain classes off of the descriptor files, which is a definite advantage over "classic" IDL.

    "Google protocol Buffers" is cooler than the OMG terminology, but this kind of thing has been around for 20 years.

  • JSON (Score:5, Interesting)

    by hey ( 83763 ) on Tuesday July 08, 2008 @04:49PM (#24105729) Journal

    Looks kinda like JSON to me.

  • Ok, I'll bite... (Score:5, Interesting)

    by Dutch Gun ( 899105 ) on Tuesday July 08, 2008 @05:03PM (#24105961)

    Obviously, those at Google felt XML didn't work well for them. They have the resources to invent a protocol and libraries to support it. And, they are big enough to be their own ecosystem, which means as long as everyone at Google is using their formats, interop is no biggie. Good for them, I don't begrudge that decision.

    I'm actually a game developer, not a web developer, so I'll speak to XML's use as a file format in general. Here are a few points regarding our use of XML:

    * We only use it as a source format for our tools. XML is far too inefficient and verbose to use in the final game - all our XML data is packed into our own proprietary binary data format.
    * We also only use it as a meta-data format, not a primary container type. For instance, we store gameplay scripts, audio script, and cinematic meta-data in XML format. We're not foolish enough to store images, sounds, or maps in a highly-verbose, text-based format. XML's value to us is in how well it can glue large pieces of our game together.
    * All our latest tools are written in C# on the .NET platform (Windows is our development platform, of course). It's astoundingly easy to serialize data structures to XML using the .NET libraries - just a few lines of code (see the sketch after this list).
    * Because it's a text-based format and human readable, if a file breaks in any way, we can just do a diff in source control to see what changed, and why it's breaking.
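
    A rough analogue of that few-lines serialization, sketched in Java since the rest of this thread leans that way: JAXB (bundled with Java 6) stands in for the .NET XmlSerializer the poster describes, and AudioCue is an invented example type, not anything from a real toolchain.

        // JAXB sketch; a stand-in for the .NET XmlSerializer, not the poster's actual tooling.
        import javax.xml.bind.JAXBContext;
        import javax.xml.bind.Marshaller;
        import javax.xml.bind.annotation.XmlRootElement;

        @XmlRootElement
        public class AudioCue {
            public String name = "explosion_big";  // invented example fields
            public float volume = 0.8f;

            public static void main(String[] args) throws Exception {
                Marshaller m = JAXBContext.newInstance(AudioCue.class).createMarshaller();
                m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
                m.marshal(new AudioCue(), System.out);  // emits <audioCue><name>...</name>...
            }
        }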

    I'll concede that I've heard of some pretty awful uses of XML. But those who dismiss XML as a valuable tool in the toolchest are just as foolish as those who believe it's the be-all and end-all of programming (I'm not saying that's true of you, just pointing out foolishness on both sides). Like any tool, it's most valuable when used in its optimal role, not when shoehorned into projects as a solution to everything.

  • by hattig ( 47930 ) on Tuesday July 08, 2008 @06:15PM (#24107145) Journal

    Looking (lightly) at the ProtoBuf documentation, it looks like stuff that any lazy programmer has implemented to make their life easier. For instance, I have written code that will take a description file* (like a .proto file) and generate (a) the Java class file, (b) the SQL schema, and (c) the DAO code in between. It did the camel-case conversion just like this .proto thing, etc. I'm sure the Google thing is far more polished and proven, of course, but hey ...

    Adding on custom binary serialisation probably wouldn't take that long, although if I was to do it I would probably mimic ProtoBuf 'cos why reinvent the wheel (Google, take note). On the other hand, generating XML from an object is as simple as appending to a StringBuilder with some utility methods to do tags and attributes, and SAXParsers aren't the most inefficient things either.
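
    A sketch of that StringBuilder approach, for the record; the XmlBuilder name and its methods are invented for illustration:

        // Hand-rolled XML generation, as described above. Escaping is minimal on purpose.
        public class XmlBuilder {
            private final StringBuilder sb = new StringBuilder();

            public XmlBuilder open(String tag)  { sb.append('<').append(tag).append('>'); return this; }
            public XmlBuilder close(String tag) { sb.append("</").append(tag).append('>'); return this; }
            public XmlBuilder text(String s) {
                sb.append(s.replace("&", "&amp;").replace("<", "&lt;"));  // escape & before <
                return this;
            }
            @Override public String toString() { return sb.toString(); }

            public static void main(String[] args) {
                System.out.println(new XmlBuilder()
                        .open("person").open("name").text("Ada").close("name").close("person"));
                // prints: <person><name>Ada</name></person>
            }
        }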

    However it clearly solves a problem for Google, and it looks simple to use.

    (* well, actually I used Java 5 annotations on a barebones class object rather than having to parse a text file)
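
    A rough sketch of that annotation-driven generation; the @Column annotation and both classes are invented for illustration, not the poster's actual code:

        import java.lang.annotation.*;
        import java.lang.reflect.Field;

        // Column metadata lives on the bare class; a generator reflects over it.
        @Retention(RetentionPolicy.RUNTIME)
        @Target(ElementType.FIELD)
        @interface Column { String sqlType(); }

        class PersonRecord {
            @Column(sqlType = "VARCHAR(64)") String name;
            @Column(sqlType = "INTEGER")     int id;
        }

        public class SchemaGen {
            static String createTable(Class<?> c) {
                StringBuilder sql = new StringBuilder("CREATE TABLE " + c.getSimpleName() + " (");
                String sep = "";
                for (Field f : c.getDeclaredFields()) {
                    Column col = f.getAnnotation(Column.class);
                    if (col == null) continue;  // skip unannotated fields
                    sql.append(sep).append(f.getName()).append(' ').append(col.sqlType());
                    sep = ", ";
                }
                return sql.append(");").toString();
            }

            public static void main(String[] args) {
                // prints: CREATE TABLE PersonRecord (name VARCHAR(64), id INTEGER);
                System.out.println(createTable(PersonRecord.class));
            }
        }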

  • Re:JSON (Score:5, Interesting)

    by 0xABADC0DA ( 867955 ) on Tuesday July 08, 2008 @06:35PM (#24107449)

    Modify JSON so unquoted attributes are 'type labels' and define the type of an attribute by giving a label or a default value. For instance:

    phoneType: { MOBILE: 0, HOME: 1, WORK: 2 }

    phoneNumber: { "number": "", "type": phoneType }

    person: {
      "name": "",
      "id": 0,
      "email": "",
      "phone": [ phoneNumber ]
    }

    ... now you have pretty much exactly the same message definition as protocol buffers, but in pure JSON. It could also use some convention like "@WORK" for labels/classes so that a normal JSON parser can parse the message definitions. You can write a code generator to make access classes for messages just by walking the JSON and looking at the types. I don't see that the 'required' and 'optional' keywords help much... imo defaults are generally better (even if they are nil). But this could easily be expressed in a JSON message definition.
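
    A sketch of such a generator, assuming the org.json parser (an assumption; any JSON library would do): it walks the definition and infers each field's type from its default value.

        import java.util.Iterator;
        import org.json.JSONObject;

        public class JsonMsgGen {
            public static void main(String[] args) throws Exception {
                // The "person" definition from above, minus the nested types for brevity.
                JSONObject person = new JSONObject("{ \"name\": \"\", \"id\": 0, \"email\": \"\" }");

                Iterator<?> keys = person.keys();
                while (keys.hasNext()) {
                    String field = (String) keys.next();
                    // A default of 0 implies int, "" implies String, and so on.
                    String type = (person.get(field) instanceof Integer) ? "int" : "String";
                    System.out.println("public " + type + " get"
                            + Character.toUpperCase(field.charAt(0)) + field.substring(1) + "();");
                }
            }
        }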

    It's easy to make a binary JSON format that is fast and also small, so there is little advantage to protocol buffers there. It's also easy and ridiculously fast to compress JSON text using, say, character-based LZO (Oberhumer).
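
    A sketch of that compression step; the poster suggests character-based LZO, but the JDK's built-in Deflater is used here as a stand-in, since LZO needs a third-party binding:

        import java.util.zip.Deflater;

        public class JsonSquash {
            public static void main(String[] args) {
                byte[] json = "{\"name\":\"John Doe\",\"id\":1234}".getBytes();

                Deflater d = new Deflater(Deflater.BEST_SPEED);
                d.setInput(json);
                d.finish();
                byte[] out = new byte[json.length * 2 + 64];  // ample room; tiny inputs may expand
                int n = d.deflate(out);
                d.end();
                System.out.println(json.length + " bytes in, " + n + " bytes out");
            }
        }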

    Maybe somebody can explain, but it doesn't seem like protocol buffers really have many advantages over JSON. It sounds like it is effectively just a binary format for JSON-like data (name-value pairs, they say) along with a code generator to access it. The code generator is nice, but this is like a day's work, max. Maybe I'm not understanding Google's problems, but I'll stick with JSON since it actually is a cross-platform, language-neutral data format... and you can always optimize it if actually needed.

  • by joelwyland ( 984685 ) on Tuesday July 08, 2008 @08:24PM (#24108765)

    Just imagine how far ahead we would be today if Google had put the same effort into creating tools the rest of the SQL-writing, open(2)-using world could use.

    We wouldn't be ahead at all. We use different tools than they do because they are dealing with different volumes of traffic, data, and demands. Let's take a moment and look at your specific complaints.

    You say Google suffers from NIH syndrome. Having previously worked at Google, I think you are half right. The difference is that Google both benefits _and_ suffers from NIH syndrome. Sometimes the company spends too much time reinventing the wheel, but sometimes the tools out there aren't (and shouldn't be) useful to Google. Apache shouldn't be changed to support the kind of traffic that Google handles, because then it wouldn't be nearly as good for the rest of the world. General software is great because it solves so many problems. However, general software isn't the right solution for all problems, especially extreme ones. Just about all of Google's needs are extreme ones due to the volume of traffic.

    You dislike the idea of BigTable. Why not use the right tool for the right job? BigTable is a ridiculously fast database system that works beautifully with petabyte-sized databases. SQL isn't the right answer to every problem. They DO use SQL... but when it is the appropriate solution. They have some really sexy internal tools for dealing with SQL and such, and I'm hoping those are coming down the open source pipeline soon. :)

    You claim the Protocol Buffers are clunky. I've used them and developed with them extensively. They aren't clunky at all; they are actually quite elegant and easy to use. They streamline development, are incredibly reliable, and are incredibly fast.

    You are obviously confused by GFS as well. The system is transparent to the application by using standard I/O stream classes. It is inherently redundant to ensure data security. It is so fast in its response time that Google search is the fastest of any major player. The list goes on and on. I don't really see how you can be upset at Google for making awesome software and then giving us access to it.

  • by osu-neko ( 2604 ) on Tuesday July 08, 2008 @08:31PM (#24108841)

    Definitions based on observed usage:

    NIH syndrome (n): A condition suffered by individuals or organizations that roll their own solutions tailored specifically for their needs, rather than using the most recently hyped hammer on every nail.

  • by mmurphy000 ( 556983 ) on Tuesday July 08, 2008 @09:12PM (#24109197)

    And all of them "check" the format, wasting CPU time, memory, and cache, then can do nothing but crash (oh, sorry, throw an exception for which there is no valid logic to handle) in the impossible case of the format being invalid, and do nothing if the actual data is semantically invalid (because semantic processing is done by a program written by a programmer who knows that it can't verify the data). Validation solves a problem that does not exist; it makes as much sense as accompanying data structures in memory with a CRC -- if it ever does not match, what are you going to do, send a message "Stand by for imminent crash" to the log? It's a completely wrong place for verification unless your application development model is "perma-debugging".

    In the world I live in, data is frequently valid, but not always:

    • Data corruption in a communications link (e.g., this series of tubes we're using)
    • Data corruption in a storage medium (e.g., hardware hiccup, bit flip due to cosmic ray)
    • Version differences between sender and receiver conception of the data format
    • Malware that pretends to be a legitimate sender but, instead, sends invalid data

    Many of those can be caught by the general-purpose validators that you decry, and that limits the number of validation routines programmers have to deal with. And your complaints re: CPU, memory, and cache place a value on them that may or may not be proper in every context. Or, as my former business partner put it, "in six months' time, computers will be faster and cheaper, but programmers will be neither".
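
    For reference, this is roughly what such a general-purpose validator looks like with the JDK's built-in XSD support; the file names are placeholders:

        import java.io.File;
        import javax.xml.XMLConstants;
        import javax.xml.transform.stream.StreamSource;
        import javax.xml.validation.SchemaFactory;
        import javax.xml.validation.Validator;

        public class Validate {
            public static void main(String[] args) throws Exception {
                SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                Validator v = sf.newSchema(new File("person.xsd")).newValidator();
                v.validate(new StreamSource(new File("person.xml")));  // throws on invalid input
                System.out.println("person.xml is valid");
            }
        }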

    Most of the data in anything that is actually used for some practical purpose is of a "streaming" kind; the request-response cycle is more often the exception than the rule. It only became popular because it's easy to implement with crappy tools.

    You obviously have a very different definition of "streaming" than I do, as I'd argue virtually nothing uses streaming, from the days of FORTRAN and COBOL to the present day.

    By definition, if you don't know semantics, data is meaningless (get it -- semantics, meaning).

    Precisely. Decomposable formats, like XML, allow programs to have semantics for part, but not all, of a data structure. Non-decomposable formats, like C structs, require semantics for all of a data structure. In situations where you know 100% of all use cases for a data structure, non-decomposable formats are fine. If, however, you want to allow for what Jonathan Zittrain refers to as "generativity" (i.e., unanticipated uses for existing technology as a means of advancing said technology), decomposable formats can be a benefit.

    Take, for example, ODT vs. classic binary Word documents, which are pretty much just a serialization of a big-ass binary structure as I understand it. I've written programs that parse and generate ODT, or, more precisely, the portions of ODT that I need. Frankly, I don't care what the rest of it is, so long as my generated documents work properly. And I didn't need to refer to the ODT documentation on OASIS or anything to write them, as the XML was sufficiently human-readable that, accompanied with experimentation, I was able to determine how to generate valid ODT. With Word, even if there were OOXML-sized documentation for it, I'd have to hand-roll my own parser for the whole damn format, just to pick out the pieces I need to work with. Now, if I worked for Microsoft on the Word team, I wouldn't have that problem, because I'd already have the parser. However, I, like most people, don't work for Microsoft, and even if Microsoft's parsers were available, they might not fit my environment (e.g., won't run on Linux).
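
    A sketch of that partial parsing, using the JDK's XPath support; the file and element names are placeholders rather than the actual (namespaced) ODT schema:

        import java.io.File;
        import javax.xml.parsers.DocumentBuilderFactory;
        import javax.xml.xpath.XPath;
        import javax.xml.xpath.XPathFactory;
        import org.w3c.dom.Document;

        public class PartialParse {
            public static void main(String[] args) throws Exception {
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(new File("content.xml"));
                XPath xp = XPathFactory.newInstance().newXPath();
                // Pull out the one value we care about; the rest of the document is ignored.
                System.out.println(xp.evaluate("//title", doc));
            }
        }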

    Don't get me wrong, XML definitely gets overused. That's a problem with the uses of XML, not XML itself.

  • by Alex Belits ( 437 ) * on Tuesday July 08, 2008 @09:26PM (#24109447) Homepage

    I think you are missing the point. XML is good where you want to receive data from other systems over which you have no control. So it doesn't matter how good you are as a programmer, and how well you write YOUR program, the issue is that you've got cabbages (or programmers who resemble cabbages) upstream sending you data.

    So XML is good for talking to systems that use XML, and not for actually developing efficient or usable software!

    That's my whole point -- its only value is that it's some standard that replaced the situation where no common standard existed. The actual quality of its design is still crap; it's written by the wrong people, derived from the wrong theoretical base, and implemented using the wrong tools and techniques. I am not claiming that it's completely unusable, or that people shouldn't use it for user-oriented applications and interoperability. I claim that the quality of the standard is total shit, and the people who developed it are self-serving, ideologically blinded, dishonest, incompetent hacks, so it's no wonder that those who actually needed a good data interchange format had to develop something different.

  • by Anonymous Coward on Wednesday July 09, 2008 @04:12AM (#24113513)
    OK, CORBA IDL and IIOP have some quirks, but they work very well. There are excellent open source implementations like JacORB, TAO, or IIOP.NET that interoperate very well with each other and with J2EE. Google could have been compatible with all this instead of going their own way.
  • by Alex Belits ( 437 ) * on Wednesday July 09, 2008 @07:43AM (#24114653) Homepage

    XML is a system of grammar that is used to create defined formats.

    ...made for people who slept through compiler courses.

    You can't use XML to mark up data. You have to use a defined grammar to create a format. You might say that this is an issue of semantics, but that is the point. If your only use/understanding of XML is as a static data format, then you're doing it [XML/XSLT/...] wrong.

    No, you can't "create" a format with XML. To "create" anything but the most trivial formats you have to provide a definition of both syntax and semantics. XML provides ridiculously complex, stupidly designed means to define a syntax, and absolutely nothing to define semantics, so you still have to either document it or, more likely, provide an implementation.

    Guess what? The syntax is such a microscopic part of your task that the amount of work you have just placed into your reference implementation of the semantics is multiple orders of magnitude higher than whatever you "saved" by not implementing a syntax parser from scratch, let alone implementing it using tools that existed long before XML was introduced. The problem is, people who "learn XML" never learn how dead simple parsing in general is, so they use those "frameworks" and "tools" to save what otherwise would be literally seconds of their mental work.

    I am not against further simplifying tasks that are already simple, if it serves a valid purpose. The problem with XML is that it does not really simplify anything; it provides a ridiculously amateurish solution to a common, easy problem without even the slightest attempt to help with the truly complex part of the work.

    XML is a crappy tool for static storage. If the data is being read/written by the same program, there are faster/simpler ways to encode that data. But that isn't what XML is meant for. To repeat my previous post: XML documents are abstracted semantic models that are designed to be transformed and dynamically interpreted.

    Words "XML", "abstract" and "semantic" do not belong in the same phrase -- XML is developed at the level of a second-year CS student who managed to completely miss what "abstract" and "semantics" mean. It's not "abstract", it's artificial and irrelevant. The only value of XML is the fact that it's some standard, however this does not change the fact that it's nearly the worst possible solution for any imaginable problem.

    Here is a link to an example of how XML/XSLT can be used to extend and enhance an existing XML-based web service [Generating RSS with XSLT and Amazon ECS]. This is a perfect example of the agnostic-client scenario that XML was designed for (i.e., the service couldn't care less how the data is represented or transformed).

    Have you read anything I wrote? XML is useful for interoperability with things that already use XML, and for making representations of pretty pictures/UI. This has nothing to do with the fact that it's crap, and that we would all be better off with a standard created by someone competent. For the values of "competent" as in "anyone who actually studied CS".
