
Google Open Sources Its Data Interchange Format

A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."

  • Re:Likely story! (Score:4, Informative)

    by caerwyn ( 38056 ) on Tuesday July 08, 2008 @04:26PM (#24105389)

    Are you serious? XML is great for certain applications, but the one thing it *isn't* is fast. It's very believable that something like this could be an order of magnitude faster.

  • by Anonymous Coward on Tuesday July 08, 2008 @04:29PM (#24105439)

    Both are really from the same design sheet, but Thrift has been open-sourced for over a year and has many more language bindings. It's been in use in several open source projects (Thrudb comes to mind), and has much more in the way of articles and documentation.

    http://developers.facebook.com/thrift/

  • by yknott ( 463514 ) on Tuesday July 08, 2008 @04:40PM (#24105601) Homepage Journal

    According to Brad Fitzpatrick's (of LiveJournal fame) blog [livejournal.com], he's working on Perl support.

  • Re:WTF am I missing (Score:5, Informative)

    by jandrese ( 485 ) <kensama@vt.edu> on Tuesday July 08, 2008 @04:47PM (#24105701) Homepage Journal
    They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.
  • by MightyMartian ( 840721 ) on Tuesday July 08, 2008 @04:48PM (#24105721) Journal

    It's not hard because XML has to be the most bloated (and yet still, ironically, nowhere near human-readable) format ever invented. That it has not only not been discarded, but is now being used to store binary blobs by guys like Microsoft and OO.org is testimony to the sheer overwhelming stupidity of a lot of developers.

  • Re:What? (Score:3, Informative)

    by merreborn ( 853723 ) on Tuesday July 08, 2008 @05:00PM (#24105927) Journal

    1) It has a binary format, far more compact (and faster to unserialize) than PHP's text-based serialized format.
    2) It handles multiple versions of the same objects (e.g., your server can interact with both PhoneNumber 2.0 and PhoneNumber 3.0 objects relatively trivially)
    3) It generates code for converting each format into objects in their 3 supported languages.

    So, no, not really.

  • Re:JSON (Score:5, Informative)

    by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:20PM (#24106247) Journal

    Structurally Protocol Buffers are similar to JSON, yes. In fact, you could use the classes generated by the Protocol Buffer compiler together with some code that encodes and decodes them in JSON. This is something some Google projects do internally since it's useful for communicating with AJAX apps. Writing a custom encoding that operates on arbitrary protocol buffer classes is actually pretty easy since all protocol message objects have a reflection interface (even in C++).

    The advantage of using the protocol buffer format instead of JSON is that it's smaller and faster, but you sacrifice human-readability.
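
    To illustrate (this is just a sketch, not any of Google's internal code), a naive JSON encoder that works on any generated message class via the Python reflection interface might look something like this:

    import json
    from google.protobuf.descriptor import FieldDescriptor

    def message_to_dict(msg):
        # ListFields() returns (descriptor, value) pairs for the fields
        # that are actually set on this message instance.
        out = {}
        for field, value in msg.ListFields():
            if field.label == FieldDescriptor.LABEL_REPEATED:
                out[field.name] = [convert(field, v) for v in value]
            else:
                out[field.name] = convert(field, value)
        return out

    def convert(field, value):
        # Recurse into nested messages; leave scalars as-is.
        # (A real encoder would also base64 bytes fields, etc.)
        if field.type == FieldDescriptor.TYPE_MESSAGE:
            return message_to_dict(value)
        return value

    def message_to_json(msg):
        return json.dumps(message_to_dict(msg))

    Nothing here is specific to any one message type, which is the whole point of the reflection interface.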

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:28PM (#24106361) Journal

    It's worth noting that writing alternative encoders and decoders for protocol buffers is really easy (since protocol message objects have a reflection interface, even in C++), so you can use the friendly generated code without being tied to the format.

  • by Alex Belits ( 437 ) * on Tuesday July 08, 2008 @05:32PM (#24106425) Homepage

    Actually it handles languages EXTREMELY POORLY, because one of the design goals was to make Unicode mandatory. If XML were truly designed for handling multilingual data, every tag would be able to have attributes for language, charset and encoding, and those attributes would default to "undefined, treat as opaque", to ensure a safe round trip of untagged data from/to other formats.

    Now it's impossible to use non-Unicode charsets when using multiple languages in the same program, because THE WHOLE FREAKING DOCUMENT (why is it even mandatory to have a "document"? A log file, for example, may contain records that should be readable before the file reaches its final length -- if it is written in XML, it's formally invalid until the moment the logger stops writing to it and closes its last tag, even though that tag has no semantic meaning for a log, which is a collection of records and records only) has to have one and only one charset/encoding, even though it can contain text in multiple languages. Since most charsets only support one language or a few related ones, all implementations of XML-using software ended up hardcoding Unicode as the only supported charset.

  • by Abcd1234 ( 188840 ) on Tuesday July 08, 2008 @05:33PM (#24106447) Homepage

    You think? Take BigTable. Wikipedia describes it as: '"a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases'. Sounds, to me, like a specialized solution to a very specialized problem, a problem that, I presume, didn't fit with any existing solution. Same goes with GFS. After all, do you really think they didn't evaluate existing solutions before embarking on building an entirely new distributed filesystem? Do you really think they're that stupid?

    As for Protocol Buffers, given the existing solutions out there (such as ASN.1 and CORBA) are generally ugly and/or over-engineered, it sounds to me like they're simply addressing a gap in the industry... after all, XML and SOAP aren't the end-all and be-all of generic object-passing protocols.

  • Re:Why XML is good (Score:3, Informative)

    by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:39PM (#24106535) Journal

    XML and this protocol differ in only one way: one is plain text, the other is binary.

    They also differ in that XML has a *lot* more features. For example, protocol buffers have no concept of entities, or even interleaved text. Those can be useful when your data is a text document with markup -- e.g. HTML -- but they tend to get in the way when you just want to pass around something like a struct.

  • Re:C# (Score:3, Informative)

    by jrumney ( 197329 ) on Tuesday July 08, 2008 @05:50PM (#24106719)

    Can you still read serialized objects created by older versions of your software?

    As long as all you have done is added new fields, then you can tag the new fields as OptionalField or NonSerialized to maintain backwards compatibility. The advantage of using Google's library is that it works across languages and runtimes. Java, .NET, PHP and Python all have serialization built in, but they are all incompatible, so you can't use it to pass an object from your Java backend to a C# client then on to Python for some final processing before displaying in a PHP generated webpage.

  • by Animats ( 122034 ) on Tuesday July 08, 2008 @05:51PM (#24106741) Homepage

    ASN.1, from 1985, really is very similar. Here's a message defined in ASN.1 form:

    Order ::= SEQUENCE {
    header Order-header,
    items SEQUENCE OF Order-line}

    Order-header ::= SEQUENCE {
    number Order-number,
    date Date,
    client Client,
    payment Payment-method }

    Order-number ::= NumericString (SIZE (12))
    Date ::= NumericString (SIZE (8)) -- MMDDYYYY

    Client ::= SEQUENCE {
    name PrintableString (SIZE (1..20)),
    street PrintableString (SIZE (1..50)) OPTIONAL,
    postcode NumericString (SIZE (5)),
    town PrintableString (SIZE (1..30)),
    country PrintableString (SIZE (1..20))
    DEFAULT default-country }
    default-country PrintableString ::= "France"

    Payment-method ::= CHOICE {
    check NumericString (SIZE (15)),
    credit-card Credit-card,
    cash NULL }

    Credit-card ::= SEQUENCE {
    type Card-type,
    number NumericString (SIZE (20)),
    expiry-date NumericString (SIZE (6)) -- MMYYYY -- }

    Card-type ::= ENUMERATED { cb(0), visa(1), eurocard(2),
    diners(3), american-express(4) }

    Note that this has almost exactly the same feature set as Google's representation. There are named, typed fields which can be optional or repeated. It just looks more like Pascal, while Google's syntax looks more like C.

  • Re:JSON (Score:4, Informative)

    by pavon ( 30274 ) on Tuesday July 08, 2008 @05:57PM (#24106865)

    The major difference between this and something like JSON or YAML or even XML is that those formats all include the format information (variable names, nesting, etc) along with the data. This does not.

    message Person {
        required int32 id = 1;
        required string name = 2;
        optional string email = 3;
    }

    What you are looking at above is the protocol format (.proto file) for a single message, which is analogous to an XML schema. No data is stored in that file -- the numbers you see are unique IDs for the different fields, and they are used in the low-level wire representation of the data (not all fields have to be included in every instance of a message).

    The actual data is serialized using a compact binary format, not plain text like JSON/YAML/XML, which makes it much more efficient both to transfer over a network and to parse.
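
    For a sense of what using it looks like, here's a minimal Python sketch, assuming the message above was compiled with protoc --python_out into a (hypothetical) module named person_pb2:

    import person_pb2  # generated by: protoc --python_out=. person.proto

    p = person_pb2.Person()
    p.id = 1234
    p.name = "John Doe"
    p.email = "jdoe@example.com"

    data = p.SerializeToString()   # compact binary wire format, a few dozen bytes

    q = person_pb2.Person()
    q.ParseFromString(data)        # decodes back into an equivalent object
    assert q.name == "John Doe"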

  • Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now?

    It's portable and language-independent?

  • by jd ( 1658 ) <imipak@ y a hoo.com> on Tuesday July 08, 2008 @06:12PM (#24107089) Homepage Journal
    Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.

    There have been all kinds of attempts to produce this sort of stuff. RPC, DCE, CORBA, DCOM, etc., are programmatic interfaces and handle function calls, synchronization, and so on. OPeNDAP is probably the closest to Google's architecture in that it is ONLY data. It's more sophisticated, as it handles much more complex data types than mere structures, but it has its own overhead issues. It isn't designed to scale to terabyte databases, although it DOES scale extremely well and is definitely the preferred method of delivering high-volume structured scientific data - at least when compared to the RPC family of methods, or indeed the XML family. I wouldn't use it for the kind of volume of data Google handles, though; you'd kill the servers.

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @06:31PM (#24107367) Journal

    This is 49 bytes: <person name="John Doe" email="jdoe@example.com">

    The equivalent Protocol Buffer is 28 bytes. In addition to the 24 bytes of text, each field has a 1-byte tag and a 1-byte length. The example you quoted is protocol buffer *text* format, which is used mostly for debugging, not for actual interchange.

  • by Alex Belits ( 437 ) * on Tuesday July 08, 2008 @07:14PM (#24108011) Homepage

    You still have all conversion routines built into language support, so all non-Unicode charsets still carry their support code into software. And it would be very easy to switch between charsets -- this happens anyway when you deal with character ranges that are not present in the fonts you use for your output. It all happens behind the scenes anyway.

    The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.

    This came at a pretty high price -- "simplified" processing allowed text to be handled as if languages don't matter, so whenever I write a document in both Russian and English I actually get either a document in English with sequences of Cyrillic characters that look like Russian words, or a document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).

    Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.

    So this is what Unicode produced -- it allowed people to write software that looks like it draws all those pretty characters, but in fact does not handle languages in a way that a native speaker would recognize as such. Good job, indeed.

  • by refactored ( 260886 ) <cyent.xnet@co@nz> on Tuesday July 08, 2008 @08:12PM (#24108645) Homepage Journal
    http://www.w3.org/XML/EXI/ [w3.org]

    The development of the Efficient XML Interchange (EXI) format was guided by five design principles, namely, the format had to be general, minimal, efficient, flexible, and interoperable. The format satisfies these prerequisites, achieving generality, flexibility, and performance while at the same time keeping complexity in check.

    Many of the concepts employed by the EXI format are applicable to the encoding of arbitrary languages that can be described by a grammar. Even though EXI utilizes schema information to improve compactness and processing efficiency, it does not depend on accurate, complete or current schemas to work.

  • Re:Likely story! (Score:5, Informative)

    by cnettel ( 836611 ) on Tuesday July 08, 2008 @08:13PM (#24108675)
    The problem is that, in my experience, it is easy to write a 99% XML-compliant parser that is 10 times faster. That last percent, though...
  • by Onan ( 25162 ) on Tuesday July 08, 2008 @08:15PM (#24108695)

    Uh, no. Google officially deems Perl unmaintainable, and its internal use is completely verboten.

    You're quite welcome to write your own if you want it, but it's not something we'd ever use ourselves.

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @08:47PM (#24108991) Journal

    Note that protocol buffers give you the equivalent of a DOM -- an object representing the parsed message. This is usually much more convenient to use than SAX parsing (depending on your use case, of course). So I'm not sure that comparing against SAX is entirely fair. Though I think protocol buffers would still win, just because there is less to parse, and parsing length-delimited chunks is faster than parsing character-delimited ones.

  • by Ant P. ( 974313 ) on Tuesday July 08, 2008 @08:56PM (#24109067)

    XML was created to look like SGML but with more strict parsing rules. The rest of those TLAs you list were created out of sadism.

  • by cryptoluddite ( 658517 ) on Tuesday July 08, 2008 @09:19PM (#24109323)

    The big difference is that a protocol buffer cannot be understood without the message format (.proto file). Now let's actually take a look at a real list, say the developers for Apache [sourceforge.net] (as a list of {name:, email:} objects):

    protobuf: ~1654 bytes

    json: 1915 bytes

    protobuf.lzop: ~744 bytes

    json.lzop: 809 bytes

    What you see is precious little difference in the size of the data, even though the JSON is self-describing. The LZO-compressed versions are essentially identical in size, and compressing and decompressing with LZO is wicked fast. So size is not a reason to use protocol buffers.

    Maybe speed is? Instead of using LZO compression, just create a binary JSON format. This is trivial, and provides essentially the same size and speed benefits as protocol buffers while still being JSON in nature.

    The only advantage to protocol buffers, then, is that they generate accessor and validation classes for you in your favorite language (if that language is C++, Java, or Python). Big deal; again, this is absolutely trivial.

    To me what this demonstrates is premature optimization. Instead, first use a simple text format like JSON; then, if that is too large, compress it (see the sketch below). Then, if that is still too slow, send it in binary.

    Note: I approximated the size of the proto buffers based on the descriptions of the binary format since I haven't downloaded the code (it actually compresses less well since I did not vary the 'length' bytes in my test file).
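
    For reference, the "compress the JSON" route is only a couple of lines in Python (a sketch; gzip stands in for LZO, and the developers list is placeholder data, not the real Apache list):

    import gzip, json

    developers = [{"name": "Jane Example", "email": "jane@example.org"}]  # placeholder data

    payload = json.dumps(developers).encode("utf-8")   # self-describing text
    compressed = gzip.compress(payload)                # comparable in size to the binary format, per the numbers above
    restored = json.loads(gzip.decompress(compressed).decode("utf-8"))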

  • by vrmlguy ( 120854 ) <samwyse AT gmail DOT com> on Tuesday July 08, 2008 @11:25PM (#24111095) Homepage Journal

    The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.

    A-ha! I found it! [google.com] "Thus, the classes in this file allow protocol type definitions to be communicated efficiently between processes."

    Why do you need this? Well, you may not. "Most users will not care about descriptors, because they will write code specific to certain protocol types and will simply use the classes generated by the protocol compiler directly. Advanced users who want to operate on arbitrary types (not known at compile time) may want to read descriptors in order to learn about the contents of a message."
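
    In Python terms, a rough sketch of that (again assuming a hypothetical generated person_pb2 module) would be something like:

    from google.protobuf import descriptor_pb2
    import person_pb2

    # Copy the schema of the Person message's .proto file into a
    # FileDescriptorProto -- itself just another protocol buffer --
    # which can then be serialized and shipped alongside the data.
    fdp = descriptor_pb2.FileDescriptorProto()
    person_pb2.Person.DESCRIPTOR.file.CopyToProto(fdp)
    schema_bytes = fdp.SerializeToString()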

  • by pikine ( 771084 ) on Wednesday July 09, 2008 @02:23AM (#24112695) Journal

    Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).

    If you had taken a compiler class, you'd have learned about "compiler compilers", which are parser generators. He's just talking about the concept of parsing in general, and arguing that XML is for people who don't understand how to write parsers.

    I don't agree with everything he says, but I think you need to know some context to understand that it's not off-the-chart nonsense.

  • by Nynaeve ( 163450 ) on Wednesday July 09, 2008 @12:01PM (#24118875)

    I got a 404 on your link. Try this one [google.com].

  • by shutdown -p now ( 807394 ) on Wednesday July 09, 2008 @01:09PM (#24120031) Journal

    The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.

    You might not have noticed, but it's not just XML. Almost everyone has moved to Unicode now, and those who haven't yet (Ruby, PHP) are being mocked for just that, and have the move at the top of their TODO lists. Learn to live with it already.

    This came at a pretty high price -- "simplified" processing allowed text to be handled as if languages don't matter, so whenever I write a document in both Russian and English I actually get either a document in English with sequences of Cyrillic characters that look like Russian words, or a document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).

    Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.

    You are extremely confused here. None of these -- "capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting" -- has anything to do with charset or encoding; none whatsoever. That is because encoding has nothing to do with language. UTF-8 is an encoding which can handle hundreds of languages. Latin-1 is another that can handle perhaps several dozen. Windows-1251, IIRC, can handle both Russian and Belarusian. It really does not matter. What matters is the language of the text and the associated culture (aka "locale") -- trying to infer it from the encoding or charset is silly and, in the end, futile.

    And, surprise surprise, XML has a standard mechanism to associate the content of an element with a specific language -- the xml:lang attribute. So, whenever you write, for example, an XHTML document that contains both English and Russian, all you need to do is surround the parts of text in the other language with <span> or <div> and mark them with xml:lang. It's even specifically mentioned in the XHTML spec.
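
    Something like this (a made-up fragment for illustration, not taken from the spec):

    <p xml:lang="en">This paragraph is in English,
       but <span xml:lang="ru">этот фрагмент написан по-русски</span>.</p>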
