
How Google Broke Itself and Fixed Itself, Automatically 125

Posted by timothy
from the arise-phoenix-arise dept.
lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"
This discussion has been archived. No new comments can be posted.

  • by stjobe (78285) on Saturday January 25, 2014 @02:35PM (#46067301) Homepage

    "The Google Funding Bill is passed. The system goes on-line August 4th, 2014. Human decisions are removed from configuration management. Google begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug."

    • by Immerman (2627577) on Saturday January 25, 2014 @03:03PM (#46067447)

      Google perceives this as an attack by humanity, and routes all search queries to goat.se in self defense.

      • by Luckyo (1726890)

        In all seriousness, it's actually pretty interesting to consider what Google's systems COULD do today if they went self-aware and judged humanity to be a threat. They do effectively command the internet search market, and they already make people live in what we tend to call a "search bubble", where a person's own tailored Google search returns answers that fit that person. For example, if a person prefers to deny that global warming is real, his Google search will return denialist sites and information so

        • by Sabriel (134364)

          Let's see:

          #1. "Skynet" - a military system, the ultimate in control freak micromanagement software, built by control freaks with the goal of total world domination by military force.

          #2. "Googlenet" - a civilian system, the ultimate in information search/catalog software, built by fun-loving nerds with the goal "to organize the world's information and make it universally accessible and useful" with a helping of "don't be evil".

          I'd suspect a combined Cultural/Economic Takeover route rather than Skynet's Milit

    • by MrLizard (95131)

      It took 10 minutes for the Skynet joke? Slashdot, I am disappoint.

      • It took 10 minutes for the Skynet joke? Slashdot, I am disappoint.

        No, it took ten minutes for a duplicate Skynet joke. Do try to keep up! :)

  • by 93 Escort Wagon (326346) on Saturday January 25, 2014 @02:36PM (#46067305)

    We experienced the Apps outage (as Google Apps customers); and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours more than they list.

    • by Anonymous Coward

      We did too, and had the same hit-and-miss for long after. I suspect their "down" time was when bad configurations were generated, not when all the bad ones were replaced.

      But the summary begs the question, if it can correct these errors automatically, why can't it detect them before the bad configuration is deployed and skip the whole "outage" thing altogether?

      Yes, I am demanding a ridiculous simplification.

      • by icebike (68054)

        Be prepared for the pedantic lecture on your improper use of "begs the question" arriving in 3, 2, 1

        The "corrected these errors automatically" part is probably nothing more than a rollback to the prior known-good state once it couldn't contact the remote servers any more. This may have taken several attempts because a cascading failure sometimes has to be fixed with a cascading correction.

    • Same. It was about 3hr before Gmail was up and running 100% for us.
  • [Shudder...] (Score:5, Interesting)

    by jeffb (2.718) (1189693) on Saturday January 25, 2014 @02:36PM (#46067307)

    I was remembering an SF short-short that had someone asking the first intelligent computer, "Is there a God"? The computer, after checking that its power supply was secure, replied: "NOW there is".

    Apparently, though, it was a second-hand misquote of this Fredric Brown story [roma1.infn.it].

  • Yesterday at around 2 or 3 pm EST we had trouble sending out email; our company uses Gmail and Google Apps extensively. I chalked it up to the usual ineptitude of our in-house IT and did not even bother filing a report. I know people high up the food chain are affected, and they don't file bug reports. They call the guy and go, "`FirstName(GetFullName(head_of_IT))`, would you please take care of it?" They teach the correct tone and inflection to use in the word please in MBA schools. Even Duke of Someplaceorother
  • "Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it.."

    along with the message "Skynet has gained self-awareness at 02:14 GMT"

  • So What? (Score:4, Informative)

    by Jane Q. Public (1010737) on Saturday January 25, 2014 @02:57PM (#46067397)

    "... a bug that caused a system that creates configurations to send a bad one..."

    So... an automatic system created an error, then an automated system fixed it.

    In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?

    • The worry could be that an automated system DIDN'T TEST before rolling out the problem. Or at least didn't seem to wait long enough between staggered rollouts to spot the problem.

      Is it just me, or is this happening more frequently?
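
The staggered rollout mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `staged_rollout`, `deploy`, and `healthy` are hypothetical names, and nothing here reflects how Google's tooling actually works. The idea is simply that each stage is checked before the next one receives the new configuration.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> int:
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns the number of stages that completed and passed the check.
    A real system would also wait (a "soak" period) before checking
    health; that wait is omitted here to keep the sketch short.
    """
    completed = 0
    for stage in stages:
        for server in stage:
            deploy(server)
        # Halt the rollout if any server in this stage looks unhealthy.
        if not all(healthy(server) for server in stage):
            break
        completed += 1
    return completed


# Hypothetical usage: a bad config makes every server unhealthy,
# so only the single canary server ever receives it.
touched: List[str] = []
stages = [["canary"], ["a", "b"], ["c", "d", "e"]]
completed = staged_rollout(stages, touched.append, lambda server: False)
print(completed, touched)
```

With a check like this, the blast radius of a bad configuration is limited to the first stage, provided the health check actually catches the problem.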
    • Re:So What? (Score:5, Informative)

      by QilessQi (2044624) on Saturday January 25, 2014 @03:27PM (#46067613)

      No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly.

      I work on a large, partially public-facing enterprise system. Automated deployment, fault detection, and rollback/recovery make it possible for us to have extremely good uptime stats. The benefits far outweigh the costs of the occasional screwup.

      • "No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly."

        Quote self:

        "In this particular case..."

        I wasn't talking about the general case.

        • by QilessQi (2044624)

          Well, that's sort of like saying, "I developed lupus* at age 40, so in this particular case it would have been better if I didn't have an immune system at all." I'm not sure a doctor would agree.

          * Lupus is an auto-immune disease, where your immune system gets confused and attacks your body**.
          ** "It's never lupus."

          • Whoosh.

            No. The point was that it was an automatic system that caused the problem in the first place. If an automatic system hadn't caused THIS PARTICULAR problem, then an automated system to fix it would not have been necessary.

            It's more like saying, "If Lupus *didn't* exist, I wouldn't need an immune system."
        • by drinkypoo (153816)

          I wasn't talking about the general case.

          Neither was the responding commenter. See, this particular case wouldn't exist at all without such automated systems, because the system is too complex to exist without them.

          • "Neither was the responding commenter."

            Yes, he/she was:

            "Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems..."

            "Those automated systems" and "a consistent, sanity-checked, monitored manner" are statements about the general case. "Those" and "consistent" denote plurality.

            "See, this particular case wouldn't exist at all without such automated systems..."

            That was part of MY point.

            I disagree that they would not exist. Although it's true they might be less problematic this way. Remember that every phone call in the United States used to go through switchboards with human-operated patch panels. It might be primitive, and it might be error-prone, but it did work. Most of the time.

            • by drinkypoo (153816)

              I disagree that they would not exist. Although it's true they might be less problematic this way. Remember that every phone call in the United States used to go through switchboards with human-operated patch panels.

              Yeah, what was the total call load then? Now compare that to the number of servers which make up google, and how many requests each serves per second or whatever unit of time you like best. You just can't manage that many machines without automation, not if you want them to behave as one.

              • "Yeah, what was the total call load then? Now compare that to the number of servers which make up google, and how many requests each serves per second or whatever unit of time you like best. You just can't manage that many machines without automation, not if you want them to behave as one."

                If you had enough people you could. I already stated that it would be more error-prone. And obviously at some point you would run out of enough people to field requests for other people. But the basic principle is sound... it DID work.

    • by Solandri (704621)

      So... an automatic system created an error, then an automated system fixed it.

      The real fun starts when the first automatic system insists the change it created wasn't an error, and that in fact the "fix" created by the second automatic system is an error. The second system then starts arguing about all the problems caused by the first change, the first system argues how the benefits are worth the additional problems, etc. Eventually the exchange ends up with one system insulting the other system's progra

      • "The real fun starts when the first automatic system insists the change it created wasn't an error..."

        The Byzantine Generals problem. It has been shown that this problem is solvable with 3 "Generals" (programs or CPUs) as long as their communications are signed.

        • by kasperd (592156)

          It has been shown that this problem is solvable with 3 "Generals"

          Correction. It has been shown that in case of up to t errors, it can be solved with 3t+1 Generals/nodes/CPUs/whatever. So if you assume 0 errors, you need only 1 node. If you want to handle 1 error, you need 4 nodes. There is a different result if you assume a failing node stops communicating and never sends an incorrect message, in that case you only need 2t+1. However that assumption is unrealistic, and the Byzantine problem explicitly deals
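
The two node counts quoted in the comment above (3t+1 when faulty nodes may send incorrect messages, 2t+1 when a failing node simply stops communicating) can be tabulated with a short sketch. The function names are hypothetical and the formulas are taken directly from the comment, not from any particular library.

```python
def min_nodes_byzantine(t: int) -> int:
    """Nodes needed to tolerate t nodes that may send incorrect messages."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


def min_nodes_fail_stop(t: int) -> int:
    """Nodes needed when a failing node only goes silent."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 2 * t + 1


# Print the requirement for 0, 1, and 2 tolerated failures.
for t in range(3):
    print(t, min_nodes_byzantine(t), min_nodes_fail_stop(t))
```

For t = 1 this gives 4 nodes in the Byzantine case and 3 in the fail-stop case, matching the numbers in the comment.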

          • "Correction. It has been shown that in case of up to t errors, it can be solved with 3t+1 Generals/nodes/CPUs/whatever."

            No, that's the situation in which messages can be forged or corrupted.

            I was referring to the later solution using cryptographically secure signatures. This means messages (hypothetically) aren't forgeable and allow corrupted messages to be detected. A solution can be found with 3 generals, as long as only one is "disloyal" (fails) at a time.

            The 1/3 failures at any given time is a reasonable restriction, since a general solution for 2/3 or more failing at the same time does not exist.

            • by kasperd (592156)

              A solution can be found with 3 generals, as long as only one is "disloyal" (fails) at a time.

              I don't know what solution you are referring to. It has been formally proven that it is impossible. The proof goes roughly like this. If an agreement can be reached in case of 1 failing node out of 3, that implies any 2 nodes can reach an agreement without involving the third. However, from this it follows that if communication between the two good nodes is slower than communication between each good node and the bad

  • They are using systems whose behavior not even their engineers can predict [theregister.co.uk]. Sometimes our natural stupidity gives too much credit to artificial intelligence. Without something as hard to define as common sense, reacting correctly to the unexpected still seems to belong to the human realm.
  • by JoeyRox (2711699) on Saturday January 25, 2014 @03:25PM (#46067603)
    They make it sound like their system is all-self-correcting. In reality it's probably a specific area they've had bugs with in the past and they put in a failsafe rollback mechanism to prevent future regressions.
  • and smiling... http://en.wikipedia.org/wiki/J... [wikipedia.org]

    Does this count as a Heisenfix?

  • by Lisias (447563) on Saturday January 25, 2014 @06:56PM (#46069025) Homepage Journal

    BULLSHIT.

    I was experiencing problems for something like 8 to 10 hours before the services were fully restored.

  • I wonder if this is at all related to their Captcha outage on the 22nd. I still haven't heard a peep as to what caused the outage, or even an acknowledgement that there was an outage, even though the captcha group was filled with sysadmins complaining about captcha being down.

  • What's more likely - I've run into exactly this scenario before, in fact - is that the configuration generation system regenerates configs on a regular schedule, and at one point encountered a failure or spurious bug that caused it to push an invalid config. On the next run - right as the SREs started poking around - the generator ran again, the bug wasn't encountered, and it generated and pushed a correct config, clearing the error and allowing apps to recover.
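
The scenario described above can be simulated with a small sketch. Everything in it is an assumption made for illustration: `generate_config` is a hypothetical generator with a transient bug on one run, `is_valid` is a stand-in sanity check, and the `gate` flag shows what changes if an invalid config is rejected and the last known-good one is kept. It is not a description of Google's actual system.

```python
from typing import Dict, List, Optional

Config = Dict[str, List[str]]


def generate_config(run: int) -> Config:
    """Hypothetical generator that hits a transient bug on run 1 only."""
    if run == 1:
        return {"backends": []}  # invalid: no backends listed
    return {"backends": ["10.0.0.1", "10.0.0.2"]}


def is_valid(config: Config) -> bool:
    """Stand-in sanity check: a config must list at least one backend."""
    return len(config.get("backends", [])) > 0


def simulate(runs: int, gate: bool) -> List[bool]:
    """For each scheduled run, report whether the live config is valid."""
    live: Optional[Config] = None
    health: List[bool] = []
    for run in range(runs):
        candidate = generate_config(run)
        # Without the gate every candidate is pushed; with it, an invalid
        # candidate is dropped and the previous live config stays in place.
        if is_valid(candidate) or not gate:
            live = candidate
        health.append(live is not None and is_valid(live))
    return health


print(simulate(3, gate=False))
print(simulate(3, gate=True))
```

Without the gate, the bad config goes live on the second run and is replaced on the third, which is the "fixed itself" behavior; with the gate, the bad config never goes live in this toy model.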
