How Google Broke Itself and Fixed Itself, Automatically

Posted by timothy on Saturday January 25, 2014 @02:25PM from the arise-phoenix-arise dept.

lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post it appears that the outage was caused by a bug that caused a system that creates configurations to send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later and caused a new configuration to be spread around the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"

Having had to deal with this... (Score:5, Informative)

by 93 Escort Wagon ( 326346 ) writes: on Saturday January 25, 2014 @02:36PM (#46067305)

We experienced the Apps outage as Google Apps customers, and I think the short outage and recovery timeline they list is a tad, shall we say, optimistic. There were significant on-and-off issues for several hours beyond what they list.

So What? (Score:4, Informative)

by Jane Q. Public ( 1010737 ) writes: on Saturday January 25, 2014 @02:57PM (#46067397)

"... a bug that caused a system that creates configurations to send a bad one..."
So... an automatic system created an error, then an automated system fixed it.

In this particular case, then, it would have been better if those automated systems hadn't been running at all, yes?

Re:So What? (Score:5, Informative)

by QilessQi ( 2044624 ) writes: on Saturday January 25, 2014 @03:27PM (#46067613)

No. Those automated systems enable a small number of human beings to administer a large number of servers in a consistent, sanity-checked, and monitored manner. If Google didn't have those automated systems, every configuration change would probably involve a minor army of technicians performing manual processes: slowly, independently, inconsistently and frequently incorrectly.
I work on a large, partially public-facing enterprise system. Automated deployment, fault detection, and rollback/recovery make it possible for us to have extremely good uptime stats. The benefits far outweigh the costs of the occasional screwup.
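As a rough illustration of the deploy, fault-detection, and rollback loop described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Fleet` class, the `deploy_with_rollback` function, and the health check are hypothetical names, not Google's system or the one I work on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Config = Dict[str, str]


@dataclass
class Fleet:
    """Hypothetical stand-in for a set of servers sharing one configuration."""
    servers: List[str]
    live: Dict[str, Config] = field(default_factory=dict)

    def push(self, config: Config) -> None:
        # Apply the same configuration to every server in the fleet.
        for name in self.servers:
            self.live[name] = dict(config)


def deploy_with_rollback(
    fleet: Fleet,
    new_config: Config,
    last_good: Config,
    is_healthy: Callable[[Config], bool],
) -> bool:
    """Push new_config, then restore last_good if any server fails its check.

    Returns True if the new configuration stayed live, False if it was
    rolled back.
    """
    fleet.push(new_config)
    if all(is_healthy(fleet.live[name]) for name in fleet.servers):
        return True
    # Fault detected: put the last known-good configuration back everywhere.
    fleet.push(last_good)
    return False


if __name__ == "__main__":
    fleet = Fleet(servers=["web1", "web2"])
    good = {"backend": "pool-a"}
    bad = {"backend": ""}  # assumed failure mode: an empty backend setting
    check = lambda cfg: bool(cfg.get("backend"))
    print(deploy_with_rollback(fleet, bad, good, check))
    print(fleet.live["web1"])
```

The point of the sketch is only that the rollback path is the same automated push as the deploy path, which is why a bad change can be undone quickly once monitoring flags it.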

Re:Well congratulations (Score:5, Informative)

by Anonymous Coward writes: on Saturday January 25, 2014 @03:40PM (#46067693)

That "hell of a lot of infrastructure" just takes CFEngine/Puppet, a version control system (git, svn, whatever), Nagios, and a fairly simple shell script.
Haha. Hahaha. HAHAHAHAHAHA. Oh God, please tell me you don't actually believe that?

You need reliable monitoring.
Reliable monitoring is fucking difficult.
Show me a Nagios installation and I'll likely show you one with hundreds of spurious alerts, masses of long-lived Criticals and lots of "Oh we don't know why it keeps doing that, it just does, don't worry about it."

You also need full coverage (damn near 100%) configuration management.
Full coverage configuration management is fucking difficult.
Show me a configuration management deployment and I'll show you the snowflakes and edge cases and old applications and "Oh yeah well we only have like three of those so it's not worth the effort".

I've come close to that level of coverage (both configuration management and monitoring) but it was only ~400 machines (a mix of physical and virtual instances). Doing it at 60k servers is an inordinate task, and I'd suggest you've never actually tried anything like it if you honestly think that all it takes is "a fairly simple shell script".
