


Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections (theregister.com)
Google Cloud has attributed last week's widespread outage to a flawed code update in its Service Control system that triggered a global crash loop due to missing error handling and lack of feature flag protection. The Register reports: Google's explanation of the incident opens by informing readers that its APIs, and Google Cloud's, are served through "our Google API management and control planes." Those two planes are distributed regionally and "are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints." The core binary that is part of this policy check system is known as "Service Control."
On May 29, Google added a new feature to Service Control, to enable "additional quota policy checks." "This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code," Google's incident report explains. The search monopolist appears to have had concerns about this change as it "came with a red-button to turn off that particular policy serving path." But the change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash."
Google uses feature flags to catch issues in its code. "If this had been flag protected, the issue would have been caught in staging." That unprotected code ran inside Google until June 12th, when the company changed a policy that contained "unintended blank fields." Here's what happened next: "Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment."
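Google hasn't published the offending code, but the failure mode described above is easy to sketch in C. Everything below is hypothetical (check_quota_unguarded, struct policy and the rest are invented names, not Service Control internals): an unguarded quota check dereferences a field that a blank policy leaves NULL, while a feature flag plus a NULL check confine the same bad input to a log line.

/* Hypothetical reconstruction of the failure mode -- all names invented,
 * this is not Google's Service Control code. A policy arrives with a blank
 * (NULL) quota field; the unguarded check dereferences it and the process
 * dies, which across a fleet becomes a crash loop. */
#include <stdio.h>

struct quota_spec {
    long limit;                  /* requests allowed per window */
};

struct policy {
    const char *name;
    struct quota_spec *quota;    /* NULL when the policy field was left blank */
};

/* The buggy path: no feature flag, no NULL check. Works fine on complete
 * policies, so a region-by-region rollout never trips over it. */
static int check_quota_unguarded(const struct policy *p, long used)
{
    return used < p->quota->limit;   /* crashes when p->quota == NULL */
}

/* What "appropriate error handling" plus a feature flag might look like. */
static int quota_checks_enabled = 0;  /* flag, default off until staged */

static int check_quota_guarded(const struct policy *p, long used)
{
    if (!quota_checks_enabled)
        return 1;                     /* flag off: keep the old behavior */
    if (p == NULL || p->quota == NULL) {
        fprintf(stderr, "policy %s has no quota spec, skipping check\n",
                p && p->name ? p->name : "(unnamed)");
        return 1;                     /* degrade gracefully, don't crash */
    }
    return used < p->quota->limit;
}

int main(void)
{
    struct quota_spec qs = { .limit = 100 };
    struct policy complete = { .name = "old-policy", .quota = &qs };
    struct policy blank    = { .name = "new-policy", .quota = NULL };

    printf("unguarded, complete policy: %d\n",
           check_quota_unguarded(&complete, 10));

    quota_checks_enabled = 1;
    printf("guarded, blank policy:      %d\n",
           check_quota_guarded(&blank, 10));

    /* check_quota_unguarded(&blank, 10) would dereference NULL right here. */
    return 0;
}

Whether a malformed policy should fail open (skip the check) or fail closed (reject the request) is its own design question; either beats dereferencing NULL in every regional deployment at once.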
Google's post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes. But in some larger Google Cloud regions, "as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on ... overloading the infrastructure." Service Control wasn't built to handle this, which is why it took almost three hours to resolve the issue in its larger regions. The teams running Google products that went down due to this mess then had to perform their own recovery chores. Going forward, Google has promised a couple of operational changes to prevent this mistake from happening again: "We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity."
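The report, as quoted here, doesn't say how that restart stampede was eventually tamed. The textbook mitigation for this kind of herd effect is randomized exponential backoff, so recovering tasks spread their retries out instead of hitting the backing datastore at the same instant. A rough, purely illustrative sketch, again in C with invented names and numbers:

/* Purely illustrative: delay each retry by a random amount drawn from an
 * exponentially growing window, so restarting tasks don't all hammer the
 * underlying infrastructure at once. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sleep for a random delay in [0, min(base * 2^attempt, cap)) milliseconds. */
static void backoff_with_full_jitter(int attempt, long base_ms, long cap_ms)
{
    long window = base_ms;
    for (int i = 0; i < attempt && window < cap_ms; i++)
        window *= 2;                          /* exponential growth */
    if (window > cap_ms)
        window = cap_ms;

    long delay_ms = rand() % window;          /* "full jitter" */
    printf("attempt %d: backing off %ld ms before retrying\n",
           attempt, delay_ms);

    struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

int main(void)
{
    srand((unsigned)time(NULL));
    /* Pretend each iteration is a failed attempt to reload policy data. */
    for (int attempt = 0; attempt < 5; attempt++)
        backoff_with_full_jitter(attempt, 100 /* ms */, 10000 /* 10 s cap */);
    return 0;
}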
Blame Gemini (Score:2)
Just waiting for some vibe coder at Google to say the Gemini AI wrote it.
Re: (Score:1)
Yes, as we know humans never make nullptr deref bugs.
Re: (Score:2)
Yes, as we know humans never make nullptr deref bugs.
Which would explain why the AI would generate such code. It learned from the best.
Null pointer error (Score:2)
Google's explanation:
https://status.cloud.google.co... [google.com]
Without the appropriate error handling, the null pointer caused the binary to crash
was this (Score:3, Insightful)
The article doesn't mention if this was AI generated code or not
Because it was and they won't admit this in public. The narrative of "AI will replace us all in 3 months" shouldn't be challenged.
Re: (Score:2)
yup, and there's at least a 25% chance it was written by an AI.
https://fortune.com/2024/10/30... [fortune.com]
Re: (Score:2)
Probably. And the human in charge screwed up on top of that. They were lucky to catch this early. I wonder what other mistakes they have in there now that will get triggered at some point in the future. Better not to depend on Google for anything.
Re: (Score:2)
It's humans all the way down, I
Re: punishment already ongoing (Score:1)
The best way to punish an AI is to let millions of non-developers talk to it endlessly, vibe-coding with incomplete, contradictory, unspecific prompts. Add personal problems and sexual questions they never dared to say out loud, and I cannot imagine anything worse.
"AI" always fails my first programming test. (Score:2)
Whenever I meet a new chatbot, I ask it to "Sort a linked list in C".
If the resulting code leaks memory, I don't trust the bot to write code.
So far, none have earned my trust.
Re: (Score:2)
I shouldn't have to ask it to avoid leaking memory. If I do, it's insufficiently "intelligent" to be trusted with any responsibility.
Re: (Score:3)
Just gave it a shot with Claude, and the sorting algorithm doesn't appear to leak memory. The list is created with malloc and never freed, but the program also terminates, so the OS reclaims the memory at exit. If I ask it to do something else, like remove items, it correctly calls free on the items being removed:
Basic prompt "sort a linked list in C":
https://claude.ai/public/artif... [claude.ai]
Extended prompt "Sort a linked list in C. Also include functions to remove a list item":
https://claude.ai/public/artif... [claude.ai]
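For comparison, the shape of a leak-free answer to the basic prompt fits on one screen. Here's a rough sketch of my own (not the output behind the links above): a top-down merge sort over the nodes plus an explicit free_list before exit.

/* Rough shape of a leak-free answer to "sort a linked list in C":
 * build a list, merge-sort it, print it, then free every node. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (!n) { perror("malloc"); exit(EXIT_FAILURE); }
    n->value = value;
    n->next = head;
    return n;
}

/* Merge two already-sorted lists into one. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node dummy = { 0, NULL }, *tail = &dummy;
    while (a && b) {
        if (a->value <= b->value) { tail->next = a; a = a->next; }
        else                      { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return dummy.next;
}

/* Top-down merge sort: split with slow/fast pointers, sort halves, merge. */
static struct node *sort_list(struct node *head)
{
    if (!head || !head->next)
        return head;
    struct node *slow = head, *fast = head->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    struct node *second = slow->next;
    slow->next = NULL;
    return merge(sort_list(head), sort_list(second));
}

static void free_list(struct node *head)
{
    while (head) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}

int main(void)
{
    struct node *list = NULL;
    int values[] = { 5, 1, 4, 2, 3 };
    for (size_t i = 0; i < sizeof values / sizeof values[0]; i++)
        list = push(list, values[i]);

    list = sort_list(list);
    for (struct node *n = list; n; n = n->next)
        printf("%d ", n->value);
    printf("\n");

    free_list(list);   /* the part the grandparent is testing for */
    return 0;
}

Merge sort is the usual pick for linked lists since it needs no random access and no extra allocation beyond the recursion stack.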
In other words: Crappy code by incompetents (Score:2)
So Google has joined the cloud version of the race to the bottom now. Not much of a surprise.