Google Open Sources Its Data Interchange Format

Posted by kdawson on Tuesday July 08, @04:07PM
from the it's-fast-that's-why dept.
A number of readers have noted Google's open sourcing of their internal data interchange format, called Protocol Buffers (here's the code and the doc). Google's elevator statement for Protocol Buffers is "a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more." It's the way data is formatted to move around inside of Google. Betanews spotlights some of Protocol Buffers' contrasts with XML and IDL, with which it is most comparable. Google's blogger claims, "And, yes, it is very fast -- at least an order of magnitude faster than XML."

  • by Anonymous Coward on Tuesday July 08, @04:10PM (#24105175)

    So is, well, just about anything.

    • by dedazo (737510) on Tuesday July 08, @04:34PM (#24105539) Journal

      Looks like Google just invented the IIOP [wikipedia.org] wire protocol, which is also platform agnostic and an open standard.

      I guess the main difference here is that their "compiler" can generate the actual language-domain classes off of the descriptor files, which is a definite advantage over "classic" IDL.

      "Google protocol Buffers" is cooler than the OMG terminology, but this kind of thing has been around for 20 years.

      • Technically, you are correct - platform-agnostic data transfer has been possible since Sun's earliest RPC implementations. However, this seems to be considerably lighter-weight (although so is Mount Everest) and because order is specified, it's going to be much simpler to pluck specific data out of a data stream. You don't need to have an order-agnostic structure and then an ordering layer in each language-specific library.

        There have been all kinds of attempts to produce this sort of stuff. RPC, DCE, CORBA, DCOM, etc, are programmatic interfaces and handle function calls, synchronization, etc. OPeNDAP is probably the closest to Google's architecture in that it is ONLY data. It's more sophisticated, as it handles much more complex data types than mere structures, but it has its own overhead issues. It isn't designed to scale to terabyte databases, although it DOES scale extremely well and is definitely the preferred method of delivering high-volume structured scientific data - at least when compared to the RPC family of methods, or indeed the XML family. I wouldn't use it for the kind of volume of data Google handles, though; you'd kill the servers.

    • An order of magnitude over XML? So is, well, just about anything.

      Well, let's also not forget that the meaning of the expression "an order of magnitude" depends strongly on the numeric base you're using.

  • by TheRealMindChild (743925) on Tuesday July 08, @04:13PM (#24105217) Homepage Journal
    Google's blogger claims, "And, yes, it is very fast -- at least an order of magnitude faster than XML."

    That is just because they aren't using enough XML!
      • by jandrese (485) <kensama@vt.edu> on Tuesday July 08, @04:27PM (#24105409) Homepage Journal
        Yeah, I mean XML didn't earn its reputation for being lightning fast and byte efficient for nothing...
      • Re:Likely story! (Score:5, Insightful)

        by cduffy (652) <charles+slashdot ... is.net minus bsd> on Tuesday July 08, @04:32PM (#24105497)

        Being 10x faster than XML to work with is entirely believable: If you're serializing directly to binary structures, those structures can be directly manipulated without any parsing at all... and if you need to do some byte-swapping and alignment adjustments to get them into and out of native form for your current processor, those are still operations which can be performed in a matter of a few CPU instructions, rather than through a few hundred KB of libraries.

        I drink the XML kool-aid plenty -- but there are things it's good for, and things it's not. Serializing and parsing truly massive amounts of data is part of the latter set.
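To make the point above concrete, here is a minimal sketch (not Google's code) that writes the same record once as a fixed binary layout and once as XML. The record layout, field names, and helper functions are all made up for illustration; only the standard `struct` and `xml.etree.ElementTree` modules are used.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical fixed layout: little-endian uint32 id, uint16 age, float64 score.
# "<" selects little-endian byte order with no alignment padding.
RECORD = struct.Struct("<IHd")


def pack_record(rec_id, age, score):
    """Serialize straight to bytes; no text formatting involved."""
    return RECORD.pack(rec_id, age, score)


def unpack_record(data):
    """Read the fields back with one fixed-layout unpack."""
    return RECORD.unpack(data)


def to_xml(rec_id, age, score):
    """The same record as an XML document (element names are made up)."""
    root = ET.Element("record")
    ET.SubElement(root, "id").text = str(rec_id)
    ET.SubElement(root, "age").text = str(age)
    ET.SubElement(root, "score").text = repr(score)
    return ET.tostring(root)


def from_xml(data):
    """Parse the document, then convert every field from text."""
    root = ET.fromstring(data)
    return (
        int(root.findtext("id")),
        int(root.findtext("age")),
        float(root.findtext("score")),
    )


binary = pack_record(42, 30, 0.5)
text = to_xml(42, 30, 0.5)

# Byte-swapping is just a matter of naming the other byte order.
swapped = struct.unpack(">I", struct.pack("<I", 1))[0]
```

Both paths round-trip the same values, but the binary form is a fixed 14 bytes here and needs no tokenizing or text-to-number conversion. Real formats add versioning and optional fields on top, so treat this as the best case for binary rather than a benchmark.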

        • by Temporal (96070) on Tuesday July 08, @06:13PM (#24107101) Journal

          The example they give is for a small set of data, and percentages vary more dramatically as sample sizes decrease.

          We wanted to give an idea of the speed without trying to boast too much or look like we were directly challenging anyone. Of course every news outlet has chosen to highlight the speed comment -- including the numbers which were intended to be ballpark figures -- more than was intended, but I guess that isn't surprising.

          I agree that the tiny "person" example is not a good benchmark case. It was intended as a usage example, not a speed example, but I stuck the speed numbers in there just meaning to give people a vague idea of the difference. The "20-100 times faster" comment is based on testing a variety of formats -- both unrealistic ones and real-life formats used in our search pipeline -- against programmatically generated XML equivalents (which may or may not themselves be realistic, though they contain the same data with the same structure). libxml2 was used for parsing XML. I don't really know how libxml2's speed compares to other XML parsers, but I didn't have a lot of time to investigate. The 20x faster number comes from the largest data set (~100k-ish) while the 100x number comes from a very small message. The most realistic case was about 50x. Sorry that I cannot provide exact details of the benchmark setup since many of the test cases were proprietary internal formats.

          In any case, I'm hoping that some independent source conducts some tests because I think anything we produced would probably have unintentional biases in it. Of course, I'll update the numbers in the docs if they turn out to be wildly off-base.

  • I bet ... (Score:5, Funny)

    by Anonymous Coward on Tuesday July 08, @04:15PM (#24105239)

    ... it requires piping data through google's servers for data mining and ad injection purposes.

  • by Anonymous Coward on Tuesday July 08, @04:28PM (#24105431)

    It looks like Google has taken some of the good elements of CORBA and IIOP into its own interchange format.
    While CORBA certainly is bloated in a lot of ways, the IIOP wire protocol it uses is vastly faster and more efficient than any XML out there... and yes, it is just as "open" (publicly documented and Freely available for use in any open source application) as any XML schema out there. J2EE uses IIOP as well and it is technically possible to interoperate (although the problem with CORBA is that different implementations never really interoperated as they were supposed to).
        As a side note, I'd rather write IDL code than an XML schema any day of the week too, but that's another rant.

  • by Anonymous Coward on Tuesday July 08, @04:29PM (#24105439)

    Both are really from the same design sheet, but Thrift has been open-sourced for over a year and has many more language bindings. It's been in use in several open-source projects (thrudb comes to mind), and has many more extant articles and documentation.

    http://developers.facebook.com/thrift/

  • Fast (Score:5, Interesting)

    by JamesP (688957) on Tuesday July 08, @04:30PM (#24105457)

    "And, yes, it is very fast -- at least an order of magnitude faster than XML."

    Just wait for the XML zealots to come crashing and not believing that XML is not the fastest, best, solution to all the world's problems (including cancer) and of course people at Google are amateurs and id10ts and WHY DO YOU HATE XML kind of stuff.

    Or, as Joel Spolsky once said: http://www.joelonsoftware.com/articles/fog0000000296.html [joelonsoftware.com]

    No, there is nothing wrong with XML per se, except for the fans...

    • Ok, I'll bite... (Score:5, Interesting)

      by Dutch Gun (899105) on Tuesday July 08, @05:03PM (#24105961)

      Obviously, those at Google felt XML didn't work well for them. They have the resources to invent a protocol and libraries to support it. And, they are big enough to be their own ecosystem, which means as long as everyone at Google is using their formats, interop is no biggie. Good for them, I don't begrudge that decision.

      I'm actually a game developer, not a web developer, so I'll speak to XML's use as a file format in general. Here's a few points regarding our use of XML:

      * We only use it as a source format for our tools. XML is far too inefficient and verbose to use in the final game - all our XML data is packed into our own proprietary binary data format.
      * We also only use it as a meta-data format, not a primary container type. For instance, we store gameplay scripts, audio script, and cinematic meta-data in XML format. We're not foolish enough to store images, sounds, or maps in a highly-verbose, text-based format. XML's value to us is in how well it can glue large pieces of our game together.
      * All our latest tools are written in C# and using the .NET platform (Windows is our development platform, of course). It's astoundingly easy to serialize data structures to XML using .NET libraries - just a few lines of code.
      * Because it's a text-based format and human readable, if a file breaks in any way, we can just do a diff in source control to see what changed, and why it's breaking.

      I'll make a concession that I've heard of some pretty awful uses of XML. But those who dismiss XML as a valuable tool in the toolchest are just as foolish as those who believe it's the end-all and be-all of programming (I'm not saying that's true of you, just pointing out foolishness on both sides). Like any tool, it's most valuable when used in its optimal role, not when shoehorned into projects as a solution to everything.

  • Smart move (Score:5, Insightful)

    by ruin20 (1242396) on Tuesday July 08, @04:32PM (#24105491)
    Since they're Google, people will clamor over this (as we're doing here), and the result will be that at least a handful of folks will learn and use it. Google's key to success has always been finding fresh talent and removing barriers to their contribution and advancement, so what I see they've done is A) help train potential employees on how their tech and thought process work, and B) give themselves a filter by which to gauge a potential employee's ability to understand their system.

    And as a bonus, they help undermine opponents who use competing technologies by helping train the workforce away from their practices. Overall I think it's a very intelligent and well-done strategic move.

  • The point of this isn't so much that it's faster than XML (so is everything else), it's that Google took everything that a real person needs in an IDL and cut out everything else. Most IDLs have a serious case of second system effect, where features are added that nobody uses but seriously complicate the API. Even XML suffers from that (have you ever seen the kind of data structure you need to store a DOM, or what that does to library APIs for manipulating XML?).

    I'd use it because 95% of the time all I need is something simple like this, and the other 5% of the time I should go back and rethink my design anyway.

    That said, there is still a case for XML, especially the self documenting and human readable nature of the document, but there are a lot of cases where it is used today where it only adds unnecessary complexity and actually makes your code more difficult to maintain instead of simpler.
  • by Alex Belits (437) * on Tuesday July 08, @04:42PM (#24105649) Homepage

    I always told people that -- it's optimized for:

    1. Easy parsing by parsers written by people who slept through their compiler classes.

    2. Verification in situations when it's impossible to devise a meaningful reaction to a failure (other than either "everything failed, turn off the computers and go home" or "assume the data to be valid anyway because ALL of it will have the same formatting error because the same program generates it")

    3. Dealing with data that arrives in neatly packaged "documents" and "requests", as opposed to being constantly produced and consumed.

    4. Either communicating between programs that have the same knowledge of message semantics, or preparation of pretty human-readable documents.

    None of the above even remotely applies to anything practical except UI/display formats -- this is why XHTML and ODF (and because of that, to some extent, XSL) are usable, SOAP is a load of crap, and for all other purposes XML is used as a glorified CSL with angle brackets. XML is widespread because a monumentally stupid standard is still better than no standard.

    So here is your example of how superior ANY format that is not based on this stupid idea can be.

  • JSON (Score:5, Interesting)

    by hey (83763) on Tuesday July 08, @04:49PM (#24105729) Journal

    Looks kinda like JSON to me.

    • Re:JSON (Score:5, Informative)

      by Temporal (96070) on Tuesday July 08, @05:20PM (#24106247) Journal

      Structurally Protocol Buffers are similar to JSON, yes. In fact, you could use the classes generated by the Protocol Buffer compiler together with some code that encodes and decodes them in JSON. This is something some Google projects do internally since it's useful for communicating with AJAX apps. Writing a custom encoding that operates on arbitrary protocol buffer classes is actually pretty easy since all protocol message objects have a reflection interface (even in C++).

      The advantage of using the protocol buffer format instead of JSON is that it's smaller and faster, but you sacrifice human-readability.
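As a rough illustration of the "smaller" part, here is a hand-rolled sketch of the two building blocks of the Protocol Buffers wire format: base-128 varints and field keys of the form `(field_number << 3) | wire_type`. It does not use the protobuf library, and the message (a name in field 1, an id in field 2) and the helper names are assumptions made up for the example.

```python
import json


def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Key with wire type 0 (varint), followed by the value itself."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    """Key with wire type 2 (length-delimited), then length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


# Hypothetical message: field 1 = name, field 2 = id.
binary = encode_string_field(1, "Bob") + encode_int_field(2, 1234)
as_json = json.dumps({"name": "Bob", "id": 1234})
```

The binary form carries small field numbers instead of field names, which is where most of the size difference against `as_json` comes from, and also why it is unreadable without the message definition.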

    • Re:JSON (Score:5, Interesting)

      by 0xABADC0DA (867955) on Tuesday July 08, @06:35PM (#24107449)

      Modify JSON so unquoted attributes are 'type labels' and define the type of an attribute by giving a label or a default value. For instance:

      phoneType: { MOBILE: 0, HOME: 1, WORK: 2 }

      phoneNumber: { "number": "", "type": phoneType }

      person: {
        "name": "",
        "id": 0,
        "email": "",
        "phone": [ phoneNumber ]
      }

      ... now you have pretty much exactly the same message definition as protocol buffers, but in pure JSON. It could also use some convention like "@WORK" for labels/classes so that a normal JSON parser can parse the message definitions. You can write a code generator to make access classes for messages just by walking the json and looking at the types. I don't see that 'required' and 'optional' keywords help much... imo defaults are generally better (even if they are nil). But this could easily be expressed in a json message definition.

      It's easy to make a binary JSON format that is fast and also small, so there is little advantage to protocol buffers there. It's also easy and ridiculously fast to compress JSON text using say character-based lzo (Oberhumer).

      Maybe somebody can explain, but it doesn't seem like protocol buffers really have many advantages over JSON. It sounds like it is effectively just a binary format for JSON-like data (name-value pairs they say) along with a code generator to access it. The code generator is nice, but this is like a day's work max. Maybe I'm not understanding Google's problems, but I'll stick with JSON since it actually is a cross-platform, language neutral data format... and you can always optimize it if actually needed.
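The "walk the JSON and generate access classes" idea can be sketched in a few dozen lines. This is only a guess at what the poster has in mind: it uses the suggested "@label" convention so the definitions are plain JSON, builds the classes at runtime instead of emitting source code, and stores the phone type as a bare integer rather than a named enum. `make_class`, `CLASSES`, and the `to_dict` method are hypothetical names.

```python
import json

# Message definitions in plain JSON; "@phoneNumber" marks a reference to
# another definition (the "@label" convention is an assumption).
DEFINITIONS = json.loads("""
{
  "phoneNumber": {"number": "", "type": 0},
  "person": {"name": "", "id": 0, "email": "", "phone": ["@phoneNumber"]}
}
""")


def make_class(name, definition):
    """Build a class whose fields and default values come from the definition."""

    def __init__(self, **values):
        unknown = set(values) - set(definition)
        if unknown:
            raise TypeError(f"unknown fields for {name}: {sorted(unknown)}")
        for field, default in definition.items():
            if field in values:
                value = values[field]
            elif isinstance(default, list):
                value = []  # repeated field: start empty
            else:
                value = default
            # Scalars must match the type of their default value.
            if not isinstance(default, list) and type(value) is not type(default):
                raise TypeError(f"{name}.{field} expects {type(default).__name__}")
            setattr(self, field, value)

    def to_dict(self):
        """Convert back to plain JSON-compatible data."""
        out = {}
        for field in definition:
            value = getattr(self, field)
            if isinstance(value, list):
                out[field] = [v.to_dict() if hasattr(v, "to_dict") else v for v in value]
            else:
                out[field] = value
        return out

    return type(name, (), {"__init__": __init__, "to_dict": to_dict})


CLASSES = {name: make_class(name, d) for name, d in DEFINITIONS.items()}
Person = CLASSES["person"]
PhoneNumber = CLASSES["phoneNumber"]

bob = Person(name="Bob", id=1234, phone=[PhoneNumber(number="555-4321", type=1)])
encoded = json.dumps(bob.to_dict())
```

What this leaves out is most of the real work: stable field numbers for compatibility between versions, a compact encoding, and generated code for several languages. The sketch shows that the basic mechanism is small, not that the day's-work estimate holds.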

    • Re:WTF am I missing (Score:5, Informative)

      by jandrese (485) <kensama@vt.edu> on Tuesday July 08, @04:47PM (#24105701) Homepage Journal
      They open sourced the compiler (for C++, Java, and Python) that lets you actually use the data interchange format. If you follow the link you can download the code and start using it today. The code is open source.
    • by Chyeld (713439) <chyeld@news g u y .com> on Tuesday July 08, @04:51PM (#24105753)

      Seems like you are missing the code they released that allows you to implement this in a number of languages from the 'get-go'.

      You've also missed that they've just told the world how the majority of their systems talk, something most people would find interesting given how much Google does and the fact that one of Google's strong points is mangling huge amounts of data in a relatively quick manner.

      PS. Your format stinks and is horribly slow and unscalable when it comes to adding to the library. Genres are so unbelievably vaguely defined that you might as well just sort them by the dominant color of the cover. Google would have done better.

      • by Abcd1234 (188840) on Tuesday July 08, @05:33PM (#24106447) Homepage

        You think? Take BigTable. Wikipedia describes it as: '"a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases'. Sounds, to me, like a specialized solution to a very specialized problem, a problem that, I presume, didn't fit with any existing solution. Same goes with GFS. After all, do you really think they didn't evaluate existing solutions before embarking on building an entirely new distributed filesystem? Do you really think they're that stupid?

        As for Protocol Buffers, given the existing solutions out there (such as ASN.1 and CORBA) are generally ugly and/or over-engineered, it sounds to me like they're simply addressing a gap in the industry... after all, XML and SOAP aren't the end-all and be-all of generic object-passing protocols.