


Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections (theregister.com)
Google Cloud has attributed last week's widespread outage to a flawed code update in its Service Control system that triggered a global crash loop due to missing error handling and lack of feature flag protection. The Register reports: Google's explanation of the incident opens by informing readers that its APIs, and Google Cloud's, are served through "our Google API management and control planes." Those two planes are distributed regionally and "are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints." The core binary that is part of this policy check system is known as "Service Control."
On May 29, Google added a new feature to Service Control, to enable "additional quota policy checks." "This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code," Google's incident report explains. The search monopolist appears to have had concerns about this change as it "came with a red-button to turn off that particular policy serving path." But the change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash."
Google uses feature flags to catch issues in its code. "If this had been flag protected, the issue would have been caught in staging." That unprotected code ran inside Google until June 12th, when the company changed a policy that contained "unintended blank fields." Here's what happened next: "Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment."
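The two protections the report says were missing can be sketched together. This is a minimal illustration, not Google's code: `QuotaPolicy`, `PolicyError`, `check_quota`, `authorize`, and the `extra_quota_checks` flag name are all hypothetical, and failing open on a bad record is an assumed design choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; a blank field arrives as None.
    name: str
    limit: Optional[int] = None


class PolicyError(Exception):
    """Raised when a policy record is missing a required field."""


def check_quota(policy: QuotaPolicy, used: int) -> bool:
    # Validate explicitly instead of letting a blank field fail deeper in the call.
    if policy.limit is None:
        raise PolicyError(f"policy {policy.name!r} has a blank limit")
    return used < policy.limit


def authorize(policy: QuotaPolicy, used: int, flags: dict) -> bool:
    # Flag off (the default): the new code path is never reached.
    if not flags.get("extra_quota_checks", False):
        return True
    try:
        return check_quota(policy, used)
    except PolicyError:
        # Assumed choice: fail open so one bad record cannot take the service down.
        return True
```

With the flag off, a policy with blank fields never reaches the new check. With the flag on, the blank field produces a handled error rather than a crash. Whether to fail open or closed in that case is a policy decision this sketch does not settle.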
Google's post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes. But in some larger Google Cloud regions, "as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on ... overloading the infrastructure." Service Control wasn't built to handle this, which is why it took almost three hours to resolve the issue in its larger regions. The teams running Google products that went down due to this mess then had to perform their own recovery chores. Going forward, Google has promised a couple of operational changes to prevent this mistake from happening again: "We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity."
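A common mitigation for the "herd effect" described above is randomized exponential backoff on restart. The summary does not say what Service Control does or will do, so the sketch below is only an illustration of the general technique; `backoff_delay` and its default values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Return a randomized delay in seconds for a 0-based restart attempt."""
    # Exponential growth, capped so delays do not grow without bound.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick uniformly in [0, ceiling] so tasks do not restart in lockstep.
    return rng.uniform(0.0, ceiling)
```

Each restarting task would sleep for `backoff_delay(attempt)` before reconnecting to its dependencies. This spreads the load out over time instead of having every task reconnect at once.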
Blame Gemini (Score:2)
Just waiting for some vibe coder at Google to say the Gemini AI wrote it.
Re: (Score:1)
Re: (Score:2)
Yes, as we know humans never make nullptr deref bugs.
Re: (Score:3)
Yes, as we know humans never make nullptr deref bugs.
Which would explain why the AI would generate such code. It learned from the best.
Re: (Score:2)
Null pointer error (Score:2)
Google's explanation:
https://status.cloud.google.co... [google.com]
Without the appropriate error handling, the null pointer caused the binary to crash
was this (Score:3, Insightful)
The article doesn't mention if this was AI generated code or not
Because it was and they won't admit this in public. The narrative of "AI will replace us all in 3 months" shouldn't be challenged.
Re: (Score:2)
yup, and there's at least a 25% chance it was written by an AI.
https://fortune.com/2024/10/30... [fortune.com]
Re: (Score:2)
Probably. And the human in charge screwed up on top of that. They were lucky to catch this early. I wonder what other mistakes they have in there now that will get triggered at some point in the future. Better not depend on Google for anything.
Re: was this (Score:4, Funny)
Re: (Score:2)
It's humans all the way down, I
Re: was this (Score:4, Insightful)
Re: (Score:2)
If you need help generating boilerplate code, something is wrong with the programming language and framework: your code should only contain logic for the task at hand. The "framework" should provide everything else without the need to repeat boilerplate from an "AI" or other code-generating tool. Accepting that you should write 10 lines of boilerplate to get 1 line of "business logic" is bad to begin with, but is in general how things are :-(
I'm afraid LLMs might make that situation worse. Ease of use, all over, has slowly been getting worse for a long time. Lots of innocent little decisions to do something cheap and quick that all add up to a big mess. It's kind of like the old argument around languages/frameworks being designed with IDEs in mind, that it makes some coding easier but also enables a design that is harder to use without one. I can already see horrible patterns in the everything as YAML movement, and that won't get better. Bigger
Re: punishment already ongoing (Score:1)
The best way to punish an AI is to let millions of non-developers talk to it endlessly, vibe-coding with incomplete, contradictory, unspecific prompts. Add personal problems and sexual questions you never dared to say out loud, and I cannot imagine worse.
Re: (Score:3)
They also talk about "This policy data contained unintended blank fields" but leave it at that. Why did it contain unintended blank fields? How did they get in there? It doesn't say. Is it a problem? Are they hand-writing this po
Re: (Score:2)
Because it was and they won't admit this in public. The narrative of "AI will replace us all in 3 months" shouldn't be challenged.
Dude, no it won't, and the only rational people saying things like that are trying to scare governments into pulling the ladder up for them to block competition. It's the whole "it's so dangerous there should be rules, after I have a foothold" routine. The rest are irrational singularity theory weirdos that definitely are not trying to hide anything, they're just really confused about reality.
You don't have to use THEIR straw man or dumbassery to hate on AI, it looks absolutely ridiculous. Hate it for the visua
"AI" always fails my first programming test. (Score:5, Insightful)
Whenever I meet a new chatbot, I ask it to "Sort a linked list in C".
If the resulting code leaks memory, I don't trust the bot to write code.
So far, none have earned my trust.
Re: (Score:2)
I shouldn't have to ask it to avoid leaking memory. If I do, it's insufficiently "intelligent" to be trusted with any responsibility.
Re: (Score:2)
Nirvana fallacy, and lack of understanding of how LLMs work and thus how to use them.
1. You should instruct it to self-review, as that puts it into an entirely different role than "generate an answer". So ask it to double-check its work, whether it can see any mistakes in it, etc. You can even use multiple AIs/models and have them 'pair program' this way.
2. Write your prompt in the most professional way possible. Research shows that prompting it as a novice gives you more novice-like answers.
Note that Claud
Re:"AI" always fails my first programming test. (Score:4, Informative)
Just gave it a shot with Claude and the sorting algorithm doesn't appear to leak memory. The list is created with malloc and never freed, but the program also terminates, so the memory is freed automatically. If I ask it to do something else with it, like remove items, it correctly calls free on the items being removed:
Basic prompt "sort a linked list in C":
https://claude.ai/public/artif... [claude.ai]
Extended prompt "Sort a linked list in C. Also include functions to remove a list item":
https://claude.ai/public/artif... [claude.ai]
Re: (Score:2)
You could ask it for something a lot simpler. Before reading this thread I happened to ask Claude to add a bit where a map of userid to record counts was outputted sorted by count in descending order. I was just kind of tired and ordinary Comparison didn't work in the dev console. Guess what Claude did? It made an array of counts, sorted THAT, then USED THAT AS A KEY to get the user ids. After I asked what on earth caused you to do such a silly thing it gave me a bubble sort. Okay fine, I don't actually dep
In other words: Crappy code by incompetents (Score:3)
So Google has joined the cloud version of the race to the bottom now. Not much of a surprise.
Slack off in mission-critical ... (Score:2)
... devops and eventually pay the price.
You'd think the l33t coders at Google would have this little basic truth deeply engrained in their genes. Humans doing human stuff I guess.
At least they did make a decent rollback happen.
Someone gonna get fired (Score:2)
"the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code"
"the company changed a policy that contained 'unintended blank fields.'"
It's very likely that Google has a test suite for code like this, with mock data used as input that should cover all code paths. My guess is that someone added code and didn't create a unit test for it, and/or the mock data didn't have empty fields. There's going to be a penalty for that.
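As a rough sketch of the kind of test being guessed at here: mock policies that deliberately include blank and missing fields, run against the function that reads them. `quota_limit` and the policy shape are hypothetical and only illustrate the idea; nothing here reflects Google's actual test suite.

```python
import unittest
from typing import Optional


def quota_limit(policy: dict) -> Optional[int]:
    """Hypothetical function under test: return the limit, or None if the field is blank."""
    value = policy.get("limit")
    if value in (None, ""):
        return None
    return int(value)


class QuotaLimitTest(unittest.TestCase):
    # Mock policies deliberately include blank and missing fields.
    CASES = [
        ({"limit": "100"}, 100),
        ({"limit": ""}, None),
        ({"limit": None}, None),
        ({}, None),
    ]

    def test_blank_fields_do_not_crash(self):
        for policy, expected in self.CASES:
            with self.subTest(policy=policy):
                self.assertEqual(quota_limit(policy), expected)
```

Saved to a file, this runs with `python -m unittest`. The blank-field cases are the ones that would have exercised the failing path before any real policy change did.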