
How Google Routes Around Outages

Posted by timothy
from the retrains-pigeons dept.
1sockchuck writes "Making changes to Google's search infrastructure is akin to 'changing the tires on a car while you're going at 60 down the freeway,' according to Urs Holzle, who oversees the company's massive data center operations. In a Q-and-A with Data Center Knowledge, Holzle discusses Google's infrastructure, how it has engineered its system to route around hardware failures, and how it responds when something goes awry. These updates usually go unnoticed, but during system maintenance last month a software bug triggered an outage for Gmail."
This discussion has been archived. No new comments can be posted.

  • Just me? (Score:5, Funny)

    by Anonymous Coward on Wednesday March 25, 2009 @04:22PM (#27334633)

    Was it just me or did anyone else spend a few minutes contemplating how you actually could make a car that did allow you to change a flat while moving?

    • Re:Just me? (Score:5, Funny)

      by esocid (946821) on Wednesday March 25, 2009 @04:24PM (#27334657) Journal
      Just you. I kept thinking about how I could use a car metaphor to describe how google...oh wait.
      • Re:Just me? (Score:5, Funny)

        by Slumdog (1460213) on Wednesday March 25, 2009 @04:38PM (#27334843)

        Just you. I kept thinking about how I could use a car metaphor to describe how google...oh wait.

        I kept thinking about derailing a car, before I realized I was on the wrong track.

        • Re:Just me? (Score:5, Funny)

          by tux0r (604835) <<moc.liamg> <ta> <todhsals+sregnifcigam>> on Wednesday March 25, 2009 @06:03PM (#27335777) Homepage

          I kept thinking about derailing a car, before I realized I was on the wrong track.

          I was going to reply about mixing metaphors, but then I lost my train of thought.

          • by Dekker3D (989692)

            oh come on guys, i know posting here is like watching a train wreck, but.. let's stop hopping onto the bandwagon, shall we?

            • oh come on guys, i know posting here is like watching a train wreck, but.. let's stop hopping onto the bandwagon, shall we?

              Better not keep going down this way. I think the light at the end of the tunnel is a train.

              • TSCHOOO TSCHOOOOOOOOOOOOO *SPLAT* OOOOOOOOOOOOO...........


        • I kept thinking about derailing a car, before I realized I was on the wrong track.

          It's easy. (With a little help from Google Images...)

          Car [westminstercollege.edu]

          Derailer [mcn.org]

        • All that aside, having tires is not absolutely critical to driving a car. One time I saw a Honda Civic going just fine down the street with two rims sans tires on the driver's side. Many people have no respect for their cars, but in spite of this being an older car, it was quite disturbing. The guy probably did something to shred his (bald?) tires, didn't have a spare or enough money/patience for a tow/new tires. As usual, no police were in sight.

          I think the Internet is better maintained than poor people's

        • by ivicente (1373953)
          Since there are some countries where cars drive on the left side of the road (England, for one) how can they get Google?
    • Re:Just me? (Score:5, Insightful)

      by Anonymous Coward on Wednesday March 25, 2009 @04:31PM (#27334757)

      I thought about it for approximately 30 seconds. Then I realized that it is a bad analogy. A Google car would have hundreds of redundant wheels, changing one is easy.

      • Re:Just me? (Score:5, Insightful)

        by Saerko (1174897) on Wednesday March 25, 2009 @04:41PM (#27334881)
        That's what I was thinking too; it would probably just function like an 18-wheeler, where a tire can blow out and there's so much support that the load is still distributed adequately.

        Basically, all this means is Google designs like Mack while everyone else designs like Chrysler...

      • by sik0fewl (561285)

        That took you thirty seconds?

      • Exactly. When a tire went out, its axle would lift up so it could be changed out safely. Actually, there would be several spare tire units ready to move down when another blew out, and there would be like 8 tires on the road all the time, so losing one wouldn't be an immediate problem. It is an interesting notion, a vehicle which wouldn't ever have to stop.
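The spare-axle idea in this subthread is essentially hot-spare redundancy. As a toy sketch only (the class and unit names below are made up for illustration and say nothing about how Google actually manages hardware), a pool can keep a required number of units in service and lower a spare the moment one fails:

```python
from dataclasses import dataclass, field


@dataclass
class WheelPool:
    """Toy model: `required` units must be in service; spares wait idle."""
    required: int
    in_service: list = field(default_factory=list)
    spares: list = field(default_factory=list)
    repair_bay: list = field(default_factory=list)

    def fail(self, unit):
        """A unit in service fails: lift it out and lower a spare in its place."""
        self.in_service.remove(unit)
        self.repair_bay.append(unit)
        if self.spares:
            self.in_service.append(self.spares.pop(0))

    def repair(self, unit):
        """A repaired unit rejoins as a spare rather than going straight back."""
        self.repair_bay.remove(unit)
        self.spares.append(unit)

    def healthy(self):
        """True while enough units are on the road."""
        return len(self.in_service) >= self.required


# Hypothetical example: eight wheels on the road, two spares ready.
pool = WheelPool(
    required=8,
    in_service=[f"wheel-{i}" for i in range(8)],
    spares=["spare-0", "spare-1"],
)
pool.fail("wheel-3")      # blowout: spare-0 takes over
print(pool.healthy())
pool.repair("wheel-3")    # fixed off the road, now waiting as a spare
print(pool.spares)
```

The point of the design is that the repair happens off the road: the failed unit is never worked on while it carries load, which is the same reason the posters above want the axle lifted before anyone touches the lug nuts.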
    • Re:Just me? (Score:5, Interesting)

      by Yetihehe (971185) on Wednesday March 25, 2009 @04:37PM (#27334829)
      A car is a bad analogy; building an airplane in mid-air [youtube.com] is better.
    • I'd build a car with 6 wheels. If one of them popped you could have the wheel raise into some extra tall wheel well. Then change it and have it lower back to the ground after spinning up to speed. On a side note... Say 'Ronald rightly raised the right rear wheel well' three times fast.
      • Three fixed wheels on each side, so it could steer like a tank? I'd go one more axle, to eight wheels, with six on the ground at a time, and then deploy the fourth axle in the event of a flat, fix the tire while that axle is not spinning, and then return it to service when required. I wouldn't want to have to change a tire on a rotating axle. Or you could have just five wheels, no fixed axles, and deploy the fifth when one tire broke.
    • duh silly, just put the car into hover mode. Sheesh you'd think Back to the Future 2 would have taught folks a thing or two about life close to 2015!
    • by rickb928 (945187)

      It's not how well the bear dances. It's the fact that it dances at all.

    • Re:Just me? (Score:4, Informative)

      by zonky (1153039) on Wednesday March 25, 2009 @05:24PM (#27335389)
      That problem has been solved for some time, at least in rallying. http://www.inforally.sibiul.ro/wrc-rally-news-10661-runflat_mousse_tyres_detail.html [sibiul.ro] At least, that was until they banned it.
      • wow brilliant!

        Why was it banned? Is the mousse carcinogenic or something??

        • Re: (Score:3, Interesting)

          by zonky (1153039)
          Trying to lower costs of competing. Also, it could be argued that mousse meant that off-line mistakes were not 'punished'.
    • Guilty as charged.
    • Was it just me or did anyone else spend a few minutes contemplating how you actually could make a car that did allow you to change a flat while moving?

      Retractable axles, like they use on dump trucks. With the wheels off the pavement, changing it becomes trivial. You need at least two sets of drive wheels though for that to work. But that's a closed-system solution. An adaptive solution would be a semi-truck that would deploy a ramp out the back. Drive the car onto the ramp and then effect the fix on the semi-truck. Then return the car to the flow of traffic. It would be dangerous though if it was one of the front tires; at 60 MPH, a normal car tire that h

    • Yes, I can do it in just 4 steps:

      1. Stop car
      2. Jack car up
      3. Remove flat
      4. Install spare

      Can I get a job at Google now?

    • Re: (Score:3, Funny)

      by moxley (895517)

      What are you talking about? I'm in America, and we need to find someone to blame for the flat first....

      Then, maybe we can fix it..Got any nails and a hammer?

    • by rossdee (243626)

      I remember something from a Popular Science|Mechanics magazine from the 1950s. This guy had heavily modified a car to keep going; it could be refueled and have its tires changed while moving. There were small wheels at each corner that could be jacked down to lift that wheel up, and an extended running board. Of course it didn't happen at highway speeds, just a few miles per hour. I think the goal was to cross the country without stopping for some record. Tires and other stuff were stored in a trailer towed behind t

    • You need to contemplate no more, since the answer is in easily digested film format: http://en.wikipedia.org/wiki/The_Big_Bus [wikipedia.org]

    • I figured you'd need a car with redundant wheels that can be lowered and retracted while the car was moving. Probably also analogous to how google manages hardware downtime: redundancy.

      While technically possible to make such a car, I don't see any practical use for such a system when it's just safer and more efficient to stop the car and replace the wheel.

      At first I thought it might be a useful system for an armored car like the presidential limo, but the added weight of another motorized system would
    • by mgblst (80109)

      Wow, you guys really haven't seen The Big Bus?? It does all that, is powered by a nuclear reactor, and comes with a pool and piano bar.

    • by Jack9 (11421)

      Took me about 15 seconds to think about it. Knight-Rider style, load the car (into/onto/via grapple) a specialized truck/platform that exists to fix flats in transit or simply en-route. The car analogy is only bad because it doesn't really define the constraints based on the nature of the data, imo.

    • by cdxta (1170917)
      You can do it on a motorcycle, so why not a car? http://www.liveleak.com/view?i=b9b_1235994320 [liveleak.com]
    • by loic_2003 (707722)
      You can do it with a motorbike.... [youtube.com]
  • by fuzzyfuzzyfungus (1223518) on Wednesday March 25, 2009 @04:29PM (#27334729) Journal
    It just treats the damage as censorship and routes around it, right?
  • by Anonymous Coward on Wednesday March 25, 2009 @04:30PM (#27334741)

    To those looking for a more in-depth description, check out the technical paper on the google file system:

    http://labs.google.com/papers/gfs.html

    Had to read it for a search engines course in college, it's pretty darn spiffy.

  • Has the same issue with rolling out updates and even though Google is (I suppose 10^100) times larger than any other company it does not mean that the same principles can't be applied. I don't see why Google should have any more problem than any other large company especially as they clearly have lots of resources and expertise to bring to bear.
    • I suspect that that is exactly why they get a (slightly puffy) profile piece written about them.

      People with exotic problems and lots of resources make interesting fantasy material(hence the interest in reading about experimental fusion widgets and submarines and satellites and stuff).

      People with common problems and few resources are mostly human interest/commiseration fodder("and that is how Mr. Bitter Q. Sysadmin keeps the department running for under $500/quarter, using nothing more than terrible coff
      • by TheEcho (1459195)

        "a pile of paperclips running Gentoo Linux"
        Can I get one of those? Does it move around when you are compiling something? At least my wife would think that would be worth staring at instead of a bunch of lines of text when doing my daily 'emerge --sync && emerge -uDNav world'.

    • by SeePage87 (923251)
      Probably because they handle more traffic than anyone else in the world, and their servers going down is a much more noticeable event than, e.g., GE's. And because the nature of their services probably complicates things as well.
  • Easy (Score:2, Flamebait)

    by Daimanta (1140543)

    Google treats outages like damage and routes around them.

  • by Anonymous Coward on Wednesday March 25, 2009 @04:44PM (#27334923)

    Excellent use of the car analogy, especially since it is possible to change a tire while driving a car. Youtube video [youtube.com] at 1:48.

    Slightly..ahem... OT so posting anon.

    • by relguj9 (1313593)

      Excellent use of the car analogy, especially since it is possible to change a tire while driving a car. Youtube video [youtube.com] at 1:48.

      Slightly..ahem... OT so posting anon.

      lol, all I can think of those drivers saying after they read that quote is... "ahahhaha... too easy! too easy!"

  • by girlintraining (1395911) on Wednesday March 25, 2009 @05:01PM (#27335119)

    You know, the article read like a press release. Hasn't slashdot whored itself out enough lately on these kinds of things? Google is so ultra-reliable, blah blah, 24x7, blah blah, commitment, blah blah, premier service partner, blah blah... I get that kind of talk enough in staff meetings. Where's the meat already!?

    Why not write an article with some nice graphics saying what happens to my request from the time I hit "Search" to the time I click a result. List off all the servers it goes through, their roles, how they're monitored, etc. Give examples of failure and show the mode decisions the software makes (and where this software is running) -- show the latencies and other performance impacts as my request bounces over failure after failure. That's what I expect when I pull up an article entitled "How Google Routes Around Outages". Something useful, professionally enriching, intellectually stimulating, etc. In short, tell me why I (should) never see a "500 Internal Server Error" from Google, but I do from just about every other major website I've used.
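The article gives none of this detail, so Google's real serving path stays unknown here. Purely as a generic illustration of a request "bouncing over failure after failure", here is a minimal sketch of client-side failover across replicas; the replica names, the `ReplicaDown` exception, and the handler functions are all hypothetical:

```python
class ReplicaDown(Exception):
    """Raised by a (hypothetical) replica handler that cannot serve a request."""


def query_with_failover(replicas, request):
    """Try each replica in order and return the first successful answer.

    Returns (answer, failures), where failures lists the replicas that
    were skipped and why, so the detour is visible to monitoring.
    """
    failures = []
    for name, handler in replicas:
        try:
            return handler(request), failures
        except ReplicaDown as exc:
            failures.append((name, str(exc)))
    raise RuntimeError(f"all replicas failed: {failures}")


def broken(request):
    # Stand-in for a replica with failed hardware.
    raise ReplicaDown("disk failure")


def working(request):
    # Stand-in for a healthy replica.
    return f"results for {request!r}"


replicas = [("replica-a", broken), ("replica-b", broken), ("replica-c", working)]
answer, skipped = query_with_failover(replicas, "slashdot")
print(answer)
print(skipped)
```

The user only ever sees the answer from `replica-c`; the two failures show up in `skipped`, which is where the latency cost and the alerting would come from in a real system.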

    • Re: (Score:2, Insightful)

      by Anonymous Coward
      I would wager to say you would learn all this if you were hired on as part of google's site reliability team. Probably most of that info. you're curious about is something they're not willing to talk about in great detail for competitive reasons.
    • Re: (Score:3, Interesting)

      by Red Flayer (890720)

      You know, the article read like a press release. Hasn't slashdot whored itself out enough lately on these kinds of things?

      YMBNH.

      This has been happening since as long as I've been lurking slashdot (2000?), and didn't go away once I set up an account (2002? maybe 2003). And from the YMBNH posts I saw when I began lurking, this has apparently been an issue since the beginning (or shortly thereafter).

      At any rate, complaining about it won't do much good. There's a saying maybe it might help you to repeat:

      Giv

    • by Qzukk (229616)

      I have to admit I was disappointed too. We recently had our colocation facility fall down on the job (turns out they have no alternate way to contact everyone should their internet fail) and I was hoping to get some insight into setting up hot sites, how a site should determine whether it can see the internet or not (clearly I can't just ping google anymore), and other things that would be useful from a technical perspective.

      Instead I get "white box" and "black box" monitoring, and I have yet to figure out
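On the question of how a site should decide whether it can see the internet: the article does not answer it, but one common approach is to probe several independent, well-known endpoints and require a majority to respond, so that a single target being down is not mistaken for your own outage. A minimal sketch, assuming plain TCP connects are an acceptable probe; the host names and the quorum rule are illustrative assumptions:

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_see_internet(targets, probe=tcp_probe, quorum=None):
    """Report connectivity only if at least `quorum` targets answer.

    By default the quorum is a simple majority of the targets.
    """
    if quorum is None:
        quorum = len(targets) // 2 + 1
    reachable = sum(1 for host, port in targets if probe(host, port))
    return reachable >= quorum


# Offline demonstration with a fake probe; swap in tcp_probe (the default)
# and real, independently operated hosts for actual use.
def fake_probe(host, port):
    return host != "down.example"


targets = [("a.example", 443), ("down.example", 443), ("b.example", 443)]
print(can_see_internet(targets, probe=fake_probe))
```

Picking targets on different networks matters more than the code: three probes that all cross the same upstream link tell you no more than one.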

    • by oldhack (1037484)
      Yeah, I think we should, like, get our money back. Or something.
    • by oldr4ver (1192469)
      I could tell you, but I'd have to kill you.
  • Simple, really... (Score:5, Informative)

    by neokushan (932374) on Wednesday March 25, 2009 @05:07PM (#27335185)

    The key point:

    When they get an outage, they check how it was caught and if it wasn't caught automatically, they figure out how to next time. Simple rule: They learn from their mistakes and don't put all their eggs in one basket.

    • by SeePage87 (923251) on Wednesday March 25, 2009 @05:13PM (#27335255)
      Bollocks! I tried learning from my mistakes once, and boy did that ever turn out bad. Now I know better than to try that again.
      • Re: (Score:1, Flamebait)

        by neokushan (932374)

        Only because your mistake was NOT drugging the bitch.

      • by mgblst (80109)

        I agree, if you keep making the same mistake over and over again, you learn to deal with it better, it is less of a surprise, and you are more prepared for it. You can become a real expert at that mistake, which keeps on happening. If you stop making this mistake, you might start making new mistakes....well, who knows how to fix those?

    • But what we have here is many baskets with only one egg, and the problem is to make sure the egg goes into the right basket "AND" gets back out of that basket while the number of baskets grows!

      What it sounds like to me is that Google's intelligence lies not in redundancy but in granular task assignment, designed so that each task's level of fault tolerance stays greater than its risk of failure as the system grows/scales.

      For example (simplistic), raid 10 vs. raid 6 comes to mind. Both tolerant to

    • by oldr4ver (1192469)

      Simple rule: They learn from their mistakes and don't put all their eggs in one basket.

      Yes they do. It's called the Earth.

  • Changing tires (Score:3, Interesting)

    by Spazmania (174582) on Wednesday March 25, 2009 @05:20PM (#27335339) Homepage

    akin to 'changing the tires on a car while you're going at 60 down the freeway,'

    This is not so hard. Just design the car with 4 axles instead of 2 and lift one off the road at a time. Helps if it can swivel for easy access to the lugnuts.

    • akin to 'changing the tires on a car while you're going at 60 down the freeway,'

      This is not so hard. Just design the car with 4 axles instead of 2 and lift one off the road at a time. Helps if it can swivel for easy access to the lugnuts.

      I tried that once, but before I had even produced isometric drawings my tire had shredded, and my rim was trashed...

  • by Thantik (1207112) on Wednesday March 25, 2009 @05:23PM (#27335383)
    Isn't this how the *internet* is (at least in theory) supposed to work anyhow? Instead we have 90% of the cables that route the middle-east/europe running through the same canal. And I know of VERY few ISPs who actually make their systems redundant anymore. /sadface
  • Ok, granted they are not travelling 60mph, this is still pretty impressive.. I consider this on-topic, because maybe it is possible to do what the summary suggests (replace wheel in moving car). :)

    Watch from 1:55 to 2:35:
    Youtube video of guys replacing a wheel on a car while it is moving.. [youtube.com]
    • That only works on the non-driving wheels of two-wheel drive vehicles though. To offer a quick analogy, if Google were a car, then Google would only be able to replace those servers that are not responsible for helping drive the search engine forward on the Internet superhighway.

  • I'm sure they just do exactly what I do when I'm at work and have a problem: they google for an answer to the problem at hand.

    Oh, wait.

    The above sort of leads into explaining my fear of asking google "is google alive" and the ensuing apocalypse.

  • Tell me I am not the only one who read that and wondered if Google also employ I P Nightly etc
  • has never really seemed appropriate to me. in an infrastructure you have quorum managers, active fencing, and weighted peers...what does road construction have to do with any of it?
