
Google Open Sources Its Data Interchange Format

A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast — at least an order of magnitude faster than XML."

  • Re:Likely story! (Score:4, Informative)

    by caerwyn ( 38056 ) on Tuesday July 08, 2008 @04:26PM (#24105389)

    Are you serious? XML is great for certain applications, but the one thing it *isn't* is fast. It's very believable that something like this could be an order of magnitude faster.

  • by Anonymous Coward on Tuesday July 08, 2008 @04:29PM (#24105439)

    Both are really from the same design sheet, but Thrift has been open-sourced for over a year and has many more language bindings. It's been in use in several open source projects (Thrudb comes to mind), and has much more in the way of articles and documentation.

    http://developers.facebook.com/thrift/

  • by yknott ( 463514 ) on Tuesday July 08, 2008 @04:40PM (#24105601) Homepage Journal

    According to Brad Fitzpatrick's (of LiveJournal fame) blog [livejournal.com], he's working on Perl support.

  • Re:WTF am I missing (Score:5, Informative)

    by jandrese ( 485 ) <kensama@vt.edu> on Tuesday July 08, 2008 @04:47PM (#24105701) Homepage Journal
    They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.
  • by MightyMartian ( 840721 ) on Tuesday July 08, 2008 @04:48PM (#24105721) Journal

    It's not hard because XML has to be the most bloated (and yet still, ironically, nowhere near human-readable) format ever invented. That it has not only not been discarded, but is now being used to store binary blobs by guys like Microsoft and OO.org is testimony to the sheer overwhelming stupidity of a lot of developers.

  • Re:What? (Score:3, Informative)

    by merreborn ( 853723 ) on Tuesday July 08, 2008 @05:00PM (#24105927) Journal

    1) It has a binary format, far more compact (and faster to unserialize) than PHP's text-based serialized format.
    2) It handles multiple versions of the same objects (e.g., your server can interact with both PhoneNumber 2.0 and PhoneNumber 3.0 objects relatively trivially)
    3) It generates code for converting each format into objects in their 3 supported languages.

    So, no, not really.

  • Re:JSON (Score:5, Informative)

    by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:20PM (#24106247) Journal

    Structurally Protocol Buffers are similar to JSON, yes. In fact, you could use the classes generated by the Protocol Buffer compiler together with some code that encodes and decodes them in JSON. This is something some Google projects do internally since it's useful for communicating with AJAX apps. Writing a custom encoding that operates on arbitrary protocol buffer classes is actually pretty easy since all protocol message objects have a reflection interface (even in C++).

    The advantage of using the protocol buffer format instead of JSON is that it's smaller and faster, but you sacrifice human-readability.
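
    To illustrate (this is just a sketch, not any of Google's internal code), a naive JSON encoder that works on any generated message class via the Python reflection interface might look something like this:

    import json
    from google.protobuf.descriptor import FieldDescriptor

    def message_to_dict(msg):
        # ListFields() returns (descriptor, value) pairs for the fields
        # that are actually set on this message instance.
        out = {}
        for field, value in msg.ListFields():
            if field.label == FieldDescriptor.LABEL_REPEATED:
                out[field.name] = [convert(field, v) for v in value]
            else:
                out[field.name] = convert(field, value)
        return out

    def convert(field, value):
        # Recurse into nested messages; leave scalars as-is.
        # (A real encoder would also base64 bytes fields, etc.)
        if field.type == FieldDescriptor.TYPE_MESSAGE:
            return message_to_dict(value)
        return value

    def message_to_json(msg):
        return json.dumps(message_to_dict(msg))

    Nothing here is specific to any one message type, which is the whole point of the reflection interface.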

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:28PM (#24106361) Journal

    It's worth noting that writing alternative encoders and decoders for protocol buffers is really easy (since protocol message objects have a reflection interface, even in C++), so you can use the friendly generated code without being tied to the format.

  • by Alex Belits ( 437 ) * on Tuesday July 08, 2008 @05:32PM (#24106425) Homepage

    Actually it handles languages EXTREMELY POORLY, because one of the design goals was to make Unicode mandatory. If XML were truly designed for handling multilingual data, every tag would be able to have attributes for language, charset and encoding, and those attributes would default to "undefined, treat as opaque", to ensure a safe round trip of untagged data from/to other formats.

    Now it's impossible to use non-Unicode charsets when using multiple languages in the same program, because THE WHOLE FREAKING DOCUMENT (why is it even mandatory to have a "document"? A log file, for example, may contain records that should be readable before the file reaches its final length -- if it is written in XML, it's formally invalid until the moment the logger stops writing to it and closes its last tag, even though that tag has no semantic meaning for a log, which is a collection of records and records only) has to have one and only one charset/encoding, even though it can contain text in multiple languages. Since most charsets only support one language or a few related ones, all implementations of XML-using software ended up hardcoding Unicode as the only supported charset.

  • by Abcd1234 ( 188840 ) on Tuesday July 08, 2008 @05:33PM (#24106447) Homepage

    You think? Take BigTable. Wikipedia describes it as: '"a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases'. Sounds, to me, like a specialized solution to a very specialized problem, a problem that, I presume, didn't fit with any existing solution. Same goes with GFS. After all, do you really think they didn't evaluate existing solutions before embarking on building an entirely new distributed filesystem? Do you really think they're that stupid?

    As for Protocol Buffers, given the existing solutions out there (such as ASN.1 and CORBA) are generally ugly and/or over-engineered, it sounds to me like they're simply addressing a gap in the industry... after all, XML and SOAP aren't the end-all and be-all of generic object-passing protocols.

  • Re:Why XML is good (Score:3, Informative)

    by Temporal ( 96070 ) on Tuesday July 08, 2008 @05:39PM (#24106535) Journal

    XML and this protocol differ in only one way: one is plain text, the other is binary.

    They also differ in that XML has a *lot* more features. For example, protocol buffers have no concept of entities, or even interleaved text. Those can be useful when your data is a text document with markup -- e.g. HTML -- but they tend to get in the way when you just want to pass around something like a struct.

  • Re:C# (Score:3, Informative)

    by jrumney ( 197329 ) on Tuesday July 08, 2008 @05:50PM (#24106719)

    Can you still read serialized objects created by older versions of your software?

    As long as all you have done is added new fields, then you can tag the new fields as OptionalField or NonSerialized to maintain backwards compatibility. The advantage of using Google's library is that it works across languages and runtimes. Java, .NET, PHP and Python all have serialization built in, but they are all incompatible, so you can't use it to pass an object from your Java backend to a C# client then on to Python for some final processing before displaying in a PHP generated webpage.

  • by Animats ( 122034 ) on Tuesday July 08, 2008 @05:51PM (#24106741) Homepage

    ASN.1, from 1985, really is very similar. Here's a message defined in ASN.1 form:

    Order ::= SEQUENCE {
    header Order-header,
    items SEQUENCE OF Order-line}

    Order-header ::= SEQUENCE {
    number Order-number,
    date Date,
    client Client,
    payment Payment-method }

    Order-number ::= NumericString (SIZE (12))
    Date ::= NumericString (SIZE (8)) -- MMDDYYYY

    Client ::= SEQUENCE {
    name PrintableString (SIZE (1..20)),
    street PrintableString (SIZE (1..50)) OPTIONAL,
    postcode NumericString (SIZE (5)),
    town PrintableString (SIZE (1..30)),
    country PrintableString (SIZE (1..20))
    DEFAULT default-country }
    default-country PrintableString ::= "France"

    Payment-method ::= CHOICE {
    check NumericString (SIZE (15)),
    credit-card Credit-card,
    cash NULL }

    Credit-card ::= SEQUENCE {
    type Card-type,
    number NumericString (SIZE (20)),
    expiry-date NumericString (SIZE (6)) -- MMYYYY -- }

    Card-type ::= ENUMERATED { cb(0), visa(1), eurocard(2),
    diners(3), american-express(4) }

    Note that this has almost exactly the same feature set as Google's representation. There are named, typed fields which can be optional or repeated. It just looks more like Pascal, while Google's syntax looks more like C.

  • Re:JSON (Score:4, Informative)

    by pavon ( 30274 ) on Tuesday July 08, 2008 @05:57PM (#24106865)

    The major difference between this and something like JSON or YAML or even XML is that those formats all include the format information (variable names, nesting, etc) along with the data. This does not.

    message Person {
        required int32 id = 1;
        required string name = 2;
        optional string email = 3;
    }

    What you are looking at above is the protocol format (.proto file) for a single message, which is analogous to an XML schema. No data is stored in that file -- the numbers you see are unique IDs for the different fields, and they are used in the low-level wire representation of the data (not all fields have to be included in every instance of a message).

    The actual data is serialized using a compact binary format, not plain text like JSON/YAML/XML, which makes it much more efficient both to transfer over a network and to parse.
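
    For a sense of what using it looks like, here's a minimal Python sketch, assuming the message above was compiled with protoc --python_out into a (hypothetical) module named person_pb2:

    import person_pb2  # generated by: protoc --python_out=. person.proto

    p = person_pb2.Person()
    p.id = 1234
    p.name = "John Doe"
    p.email = "jdoe@example.com"

    data = p.SerializeToString()   # compact binary wire format, a few dozen bytes

    q = person_pb2.Person()
    q.ParseFromString(data)        # decodes back into an equivalent object
    assert q.name == "John Doe"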

  • Anyway, can someone shed some light on how this is different than binary serialization I've been using to pass C# objects around for quite some time now?

    It's portable and language-independent?

  • by jd ( 1658 ) <imipak@ y a hoo.com> on Tuesday July 08, 2008 @06:12PM (#24107089) Homepage Journal
    Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.

    There have been all kinds of attempts to produce this sort of stuff. RPC, DCE, CORBA, DCOM, etc., are programmatic interfaces and handle function calls, synchronization, and so on. OPeNDAP is probably the closest to Google's architecture in that it is ONLY data. It's more sophisticated, as it handles much more complex data types than mere structures, but it has its own overhead issues. It isn't designed to scale to terabyte databases, although it DOES scale extremely well and is definitely the preferred method of delivering high-volume structured scientific data - at least when compared to the RPC family of methods, or indeed the XML family. I wouldn't use it for the kind of volume of data Google handles, though; you'd kill the servers.

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @06:31PM (#24107367) Journal

    This is 49 bytes: <person name="John Doe" email="jdoe@example.com">

    The equivalent Protocol Buffer is 28 bytes. In addition to the 24 bytes of text, each field has a 1-byte tag and a 1-byte length. The example you quoted is protocol buffer *text* format, which is used mostly for debugging, not for actual interchange.

  • by Alex Belits ( 437 ) * on Tuesday July 08, 2008 @07:14PM (#24108011) Homepage

    You still have all conversion routines built into language support, so all non-Unicode charsets still carry their support code into software. And it would be very easy to switch between charsets -- this happens anyway when you deal with character ranges that are not present in the fonts you use for your output. It all happens behind the scenes anyway.

    The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.

    This came at a pretty high price -- "simplified" processing allowed text to be handled as if languages don't matter, so whenever I write a document in both Russian and English I actually get either a document in English with sequences of Cyrillic characters that look like Russian words, or a document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).

    Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.

    So this is what Unicode produced -- it allowed people to write software that looks like it draws all those pretty characters, but in fact does not handle languages in a way that a native speaker would recognize as such. Good job, indeed.

  • by refactored ( 260886 ) <cyent.xnet@co@nz> on Tuesday July 08, 2008 @08:12PM (#24108645) Homepage Journal
    http://www.w3.org/XML/EXI/ [w3.org]

    The development of the Efficient XML Interchange (EXI) format was guided by five design principles, namely, the format had to be general, minimal, efficient, flexible, and interoperable. The format satisfies these prerequisites, achieving generality, flexibility, and performance while at the same time keeping complexity in check.

    Many of the concepts employed by the EXI format are applicable to the encoding of arbitrary languages that can be described by a grammar. Even though EXI utilizes schema information to improve compactness and processing efficiency, it does not depend on accurate, complete or current schemas to work.

  • Re:Likely story! (Score:5, Informative)

    by cnettel ( 836611 ) on Tuesday July 08, 2008 @08:13PM (#24108675)
    The problem is that, in my experience, it is easy to write a 99% XML-compliant parser that is 10 times faster. That last percent, though...
  • by Onan ( 25162 ) on Tuesday July 08, 2008 @08:15PM (#24108695)

    Uh, no. Google officially deems Perl unmaintainable, and its internal use is completely verboten.

    You're quite welcome to write your own if you want it, but it's not something we'd ever use ourselves.

  • by Temporal ( 96070 ) on Tuesday July 08, 2008 @08:47PM (#24108991) Journal

    Note that protocol buffers give you the equivalent of a DOM -- an object representing the parsed message. This is usually much more convenient to use than SAX parsing (depending on your use case, of course). So I'm not sure that comparing against SAX is entirely fair. Though I think protocol buffers would still win, just because there is less to parse, and parsing length-delimited chunks is faster than parsing character-delimited ones.

  • by Ant P. ( 974313 ) on Tuesday July 08, 2008 @08:56PM (#24109067)

    XML was created to look like SGML but with more strict parsing rules. The rest of those TLAs you list were created out of sadism.

  • by cryptoluddite ( 658517 ) on Tuesday July 08, 2008 @09:19PM (#24109323)

    The big difference is that a protocol buffer cannot be understood without the message format (.proto file). Now let's actually take a look at a real list, say the developers for Apache [sourceforge.net] (as a list of {name:, email:} objects):

    protobuf: ~1654 bytes

    json: 1915 bytes

    protobuf.lzop: ~744 bytes

    json.lzop: 809 bytes

    What you see is precious little difference in the size of the data, even though the JSON is self-describing. The LZO-compressed versions are essentially identical in size, and compressing and decompressing with LZO is wicked fast. So size is not a reason to use protocol buffers.

    Maybe speed is? Instead of using LZO compression, just create a binary JSON format. This is trivial, and provides essentially the same size and speed benefits as protocol buffers while still being JSON in nature.

    The only advantage to protocol buffers, then, is that they generate accessor and validation classes for you in your favorite language (if that language is C++, Java, or Python). Big deal; again, this is absolutely trivial.

    To me what this demonstrates is premature optimization. Instead, first use a simple text format like JSON; then, if that is too large, compress it (see the sketch below). Then, if that is still too slow, send it in binary.

    Note: I approximated the size of the proto buffers based on the descriptions of the binary format since I haven't downloaded the code (it actually compresses less well since I did not vary the 'length' bytes in my test file).
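
    For reference, the "compress the JSON" route is only a couple of lines in Python (a sketch; gzip stands in for LZO, and the developers list is placeholder data, not the real Apache list):

    import gzip, json

    developers = [{"name": "Jane Example", "email": "jane@example.org"}]  # placeholder data

    payload = json.dumps(developers).encode("utf-8")   # self-describing text
    compressed = gzip.compress(payload)                # comparable in size to the binary format, per the numbers above
    restored = json.loads(gzip.decompress(compressed).decode("utf-8"))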

  • by vrmlguy ( 120854 ) <samwyse AT gmail DOT com> on Tuesday July 08, 2008 @11:25PM (#24111095) Homepage Journal

    The only obvious thing I see missing is a canonical way to encode the .proto file as a Protocol Buffer, to make a stream self-describing.

    A-ha! I found it! [google.com] "Thus, the classes in this file allow protocol type definitions to be communicated efficiently between processes."

    Why do you need this? Well, you may not. "Most users will not care about descriptors, because they will write code specific to certain protocol types and will simply use the classes generated by the protocol compiler directly. Advanced users who want to operate on arbitrary types (not known at compile time) may want to read descriptors in order to learn about the contents of a message."
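
    In Python terms, a rough sketch of that (again assuming a hypothetical generated person_pb2 module) would be something like:

    from google.protobuf import descriptor_pb2
    import person_pb2

    # Copy the schema of the Person message's .proto file into a
    # FileDescriptorProto -- itself just another protocol buffer --
    # which can then be serialized and shipped alongside the data.
    fdp = descriptor_pb2.FileDescriptorProto()
    person_pb2.Person.DESCRIPTOR.file.CopyToProto(fdp)
    schema_bytes = fdp.SerializeToString()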

  • by pikine ( 771084 ) on Wednesday July 09, 2008 @02:23AM (#24112695) Journal

    Not to mention the minor detail that XML and compilers are orthogonal: you can use XML (or many other data interchange formats) with non-compiled languages, and most compilers know nothing about XML (or many other data interchange formats).

    If you had taken a compiler class, you'd have learned about "compiler compilers", which are parser generators. He's just talking about the concept of parsing in general, and arguing that XML is for people who don't understand how to write parsers.

    I don't agree with everything he says, but I think you need to know some context to understand that it's not off-the-chart nonsense.

  • by Nynaeve ( 163450 ) on Wednesday July 09, 2008 @12:01PM (#24118875)

    I got a 404 on your link. Try this one [google.com].

  • by shutdown -p now ( 807394 ) on Wednesday July 09, 2008 @01:09PM (#24120031) Journal

    The problem is, XML developers' Unicode fanaticism threw all this flexibility out of the window on the level of document data and metadata processing, keeping all complexity and sabotaging functionality, just to leave implementors and users no choice but to convert everything to Unicode.

    You might not have noticed, but it's not just XML. Almost everyone has moved to Unicode now, and those who haven't yet (Ruby, PHP) are being mocked for just that, and have the move at the top of their TODO lists. Learn to live with it already.

    This came at a pretty high price -- "simplified" processing allowed text to be handled as if languages don't matter, so whenever I write a document in both Russian and English I actually get either a document in English with sequences of Cyrillic characters that look like Russian words, or a document in Russian with sequences of Roman characters that look like English words (as opposed to, say, Roman characters in Russian-only text that contains formulas).

    Everything that is capable of editing or processing XML will make no attempt to let me choose the languages, so support for language-dependent processing is just as missing as support for charset-dependent processing, it's all Unicode, right? Except, of course, Unicode will do nothing to tell the application to choose right capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting, etc. -- "simplification" only made sense if it turned language handling into a superficial imitation of itself. If support for multiple charsets was included, it would automatically have to properly process all metadata including languages, and it would be able to use non-Unicode-oriented language-specific routines that existed for decades before Unicode.

    You are extremely confused here. None of these -- "capitalization procedure, spellchecker, hyphenation, phonetic match, acronym expansion, index sorting" -- has anything to do with charset or encoding; none whatsoever. That is because encoding has nothing to do with language. UTF-8 is an encoding which can handle hundreds of languages. Latin-1 is another that can handle perhaps several dozen. Windows-1251, IIRC, can handle both Russian and Belarusian. It really does not matter. What matters is the language of the text and the associated culture (aka "locale") -- trying to infer it from the encoding or charset is silly and, in the end, futile.

    And, surprise surprise, XML has a standard mechanism to associate the content of an element with a specific language -- the xml:lang attribute. So, whenever you write, for example, an XHTML document that contains both English and Russian, all you need to do is surround the parts of text in the other language with <span> or <div> and mark them with xml:lang. It's even specifically mentioned in the XHTML spec.
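
    Something like this (a made-up fragment for illustration, not taken from the spec):

    <p xml:lang="en">This paragraph is in English,
       but <span xml:lang="ru">этот фрагмент написан по-русски</span>.</p>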
