How Google Broke Itself and Fixed Itself, Automatically

lemur3 writes "On January 24th Google had some problems with a few of its services. Gmail users and people who used various other Google services were impacted just as the Google Reliability Team was to take part in an Ask Me Anything on Reddit. Everything seemed to be resolved and back up within an hour. The Official Google Blog had a short note about what happened from Ben Treynor, a VP of Engineering. According to the blog post, the outage was caused by a bug in a system that generates configurations, which made it send a bad one to various 'live services.' An internal monitoring system noticed the problem a short time later, and a new configuration was then distributed to the services. Ben had this to say of it on the Google Blog, 'Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored.'"
  • by Anonymous Coward on Saturday January 25, 2014 @02:41PM (#46067327)
    The clever part is that it automatically recovered; that means that their monitoring, performance metrics and configuration management systems are very tightly integrated. Most importantly, it means they are trusted. I've worked at three different places now on things like configuration management and monitoring, and I've never once seen anything that approached that level of reliability. It's something to aim for.
  • by Anonymous Coward on Saturday January 25, 2014 @02:59PM (#46067415)

    If you haven't met a system that takes less than tens of minutes to recover from a configuration error, you have worked in some shitty places.

    Once again: automatically recover. Any human can notice a problem and revert a config; it takes a hell of a lot of infrastructure, and clever infrastructure at that, to have the system do it itself. I'm not surprised Google have solved it; it is, at its core, a data problem.
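
To make "have the system do it itself" concrete, here is a minimal Python sketch of a config pusher that reverts to the last known good configuration when a health check fails. It is an illustration only, not how Google's system works. The names `ConfigPusher`, `is_healthy`, and `last_known_good` are hypothetical, and the health check is a stand-in for a real monitoring signal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Config = Dict[str, str]


@dataclass
class ConfigPusher:
    """Hypothetical config distributor that reverts itself on failed health checks."""

    is_healthy: Callable[[Config], bool]  # assumed stand-in for real monitoring
    live: Optional[Config] = None
    last_known_good: Optional[Config] = None
    log: List[str] = field(default_factory=list)

    def push(self, candidate: Config) -> bool:
        """Apply candidate; keep it only if the health check passes afterwards."""
        self.live = candidate
        self.log.append("pushed")
        if self.is_healthy(candidate):
            self.last_known_good = candidate
            return True
        # No human in the loop: restore the last config observed to be healthy.
        self.live = self.last_known_good
        self.log.append("rolled back")
        return False


# Usage: a config counts as "healthy" here only if it names a backend.
pusher = ConfigPusher(is_healthy=lambda cfg: bool(cfg.get("backend")))
pusher.push({"backend": "pool-a"})  # accepted, becomes last known good
pusher.push({})                     # rejected, pool-a config is restored
```

The hard part the comment points at is not this loop but the health signal itself: the rollback is only as trustworthy as whatever `is_healthy` measures.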

  • by JoeMerchant (803320) on Saturday January 25, 2014 @04:50PM (#46068153)

    What's really clever here is that they trust the automatons to make the corrections without human intervention, and the automatons haven't caused a horrible feedback loop meltdown of the system.

    It's not quite rocket science, but those kinds of self-correcting systems have just as much potential to screw themselves up as they do to fix themselves.
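
One common way to keep a self-correcting system from spiralling into a feedback loop is to cap how many automated corrections it may make in a given time window, then stop and escalate to a human. The Python sketch below shows that idea under stated assumptions. `CorrectionBudget` and its parameters are hypothetical names, and nothing in the story says Google uses this particular guard.

```python
import time
from collections import deque
from typing import Callable, Deque


class CorrectionBudget:
    """Allow at most max_corrections automated fixes per window_seconds."""

    def __init__(
        self,
        max_corrections: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.max_corrections = max_corrections
        self.window_seconds = window_seconds
        self.clock = clock
        self._stamps: Deque[float] = deque()

    def try_correct(self, apply_fix: Callable[[], None]) -> bool:
        """Run apply_fix if the budget allows it; return whether it ran."""
        now = self.clock()
        # Forget corrections that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_corrections:
            return False  # budget spent: stop correcting and escalate
        self._stamps.append(now)
        apply_fix()
        return True
```

With `CorrectionBudget(2, 60.0)`, a third automated fix within the same minute is refused, so a corrector that keeps "fixing" its own fixes runs out of budget instead of oscillating indefinitely.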

