


Google Cloud Caused Outage By Ignoring Its Usual Code Quality Protections (theregister.com)
Google Cloud has attributed last week's widespread outage to a flawed code update in its Service Control system that triggered a global crash loop due to missing error handling and lack of feature flag protection. The Register reports: Google's explanation of the incident opens by informing readers that its APIs, and Google Cloud's, are served through "our Google API management and control planes." Those two planes are distributed regionally and "are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints." The core binary that is part of this policy check system is known as "Service Control."
On May 29, Google added a new feature to Service Control, to enable "additional quota policy checks." "This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code," Google's incident report explains. The search monopolist appears to have had concerns about this change as it "came with a red-button to turn off that particular policy serving path." But the change "did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash."
Google uses feature flags to catch issues in its code. "If this had been flag protected, the issue would have been caught in staging." That unprotected code ran inside Google until June 12th, when the company changed a policy that contained "unintended blank fields." Here's what happened next: "Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment."
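Google hasn't published the offending code, but the failure mode described above is easy to sketch in C. Everything below is hypothetical (check_quota_unguarded, struct policy and the rest are invented names, not Service Control internals): an unguarded quota check dereferences a field that a blank policy leaves NULL, while a feature flag plus a NULL check confine the same bad input to a log line.

/* Hypothetical reconstruction of the failure mode -- all names invented,
 * this is not Google's Service Control code. A policy arrives with a blank
 * (NULL) quota field; the unguarded check dereferences it and the process
 * dies, which across a fleet becomes a crash loop. */
#include <stdio.h>

struct quota_spec {
    long limit;                  /* requests allowed per window */
};

struct policy {
    const char *name;
    struct quota_spec *quota;    /* NULL when the policy field was left blank */
};

/* The buggy path: no feature flag, no NULL check. Works fine on complete
 * policies, so a region-by-region rollout never trips over it. */
static int check_quota_unguarded(const struct policy *p, long used)
{
    return used < p->quota->limit;   /* crashes when p->quota == NULL */
}

/* What "appropriate error handling" plus a feature flag might look like. */
static int quota_checks_enabled = 0;  /* flag, default off until staged */

static int check_quota_guarded(const struct policy *p, long used)
{
    if (!quota_checks_enabled)
        return 1;                     /* flag off: keep the old behavior */
    if (p == NULL || p->quota == NULL) {
        fprintf(stderr, "policy %s has no quota spec, skipping check\n",
                p && p->name ? p->name : "(unnamed)");
        return 1;                     /* degrade gracefully, don't crash */
    }
    return used < p->quota->limit;
}

int main(void)
{
    struct quota_spec qs = { .limit = 100 };
    struct policy complete = { .name = "old-policy", .quota = &qs };
    struct policy blank    = { .name = "new-policy", .quota = NULL };

    printf("unguarded, complete policy: %d\n",
           check_quota_unguarded(&complete, 10));

    quota_checks_enabled = 1;
    printf("guarded, blank policy:      %d\n",
           check_quota_guarded(&blank, 10));

    /* check_quota_unguarded(&blank, 10) would dereference NULL right here. */
    return 0;
}

Whether a malformed policy should fail open (skip the check) or fail closed (reject the request) is its own design question; either beats dereferencing NULL in every regional deployment at once.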
Google's post states that its Site Reliability Engineering team saw and started triaging the incident within two minutes, identified the root cause within 10 minutes, and was able to commence recovery within 40 minutes. But in some larger Google Cloud regions, "as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on ... overloading the infrastructure." Service Control wasn't built to handle this, which is why it took almost three hours to resolve the issue in its larger regions. The teams running Google products that went down due to this mess then had to perform their own recovery chores. Going forward, Google has promised a couple of operational changes to prevent this mistake from happening again: "We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers. We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity."
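The report, as quoted here, doesn't say how that restart stampede was eventually tamed. The textbook mitigation for this kind of herd effect is randomized exponential backoff, so recovering tasks spread their retries out instead of hitting the backing datastore at the same instant. A rough, purely illustrative sketch, again in C with invented names and numbers:

/* Purely illustrative: delay each retry by a random amount drawn from an
 * exponentially growing window, so restarting tasks don't all hammer the
 * underlying infrastructure at once. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sleep for a random delay in [0, min(base * 2^attempt, cap)) milliseconds. */
static void backoff_with_full_jitter(int attempt, long base_ms, long cap_ms)
{
    long window = base_ms;
    for (int i = 0; i < attempt && window < cap_ms; i++)
        window *= 2;                          /* exponential growth */
    if (window > cap_ms)
        window = cap_ms;

    long delay_ms = rand() % window;          /* "full jitter" */
    printf("attempt %d: backing off %ld ms before retrying\n",
           attempt, delay_ms);

    struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

int main(void)
{
    srand((unsigned)time(NULL));
    /* Pretend each iteration is a failed attempt to reload policy data. */
    for (int attempt = 0; attempt < 5; attempt++)
        backoff_with_full_jitter(attempt, 100 /* ms */, 10000 /* 10 s cap */);
    return 0;
}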
Blame Gemini (Score:2)
Just waiting for some vibe coder at Google to say the Gemini AI wrote it.
Re: (Score:1)
Yes, as we know humans never make nullptr deref bugs.
Re: (Score:2)
Yes, as we know humans never make nullptr deref bugs.
Which would explain why the AI would generate such code. It learned from the best.
Null pointer error (Score:2)
Google's explanation:
https://status.cloud.google.co... [google.com]
Without the appropriate error handling, the null pointer caused the binary to crash
was this (Score:3, Insightful)
The article doesn't mention if this was AI generated code or not
Because it was and they won't admit this in public. The narrative of "AI will replace us all in 3 months" shouldn't be challenged.
Re: (Score:2)
yup, and there's at least a 25% chance it was written by an AI.
https://fortune.com/2024/10/30... [fortune.com]
Re: (Score:2)
Probably. And the human in charge screwed up on top of that. They were lucky to catch this early. I wonder what other mistakes they have in there now that will get triggered at some point in the future. Better not to depend on Google for anything.
Re: (Score:2)
It's humans all the way down, I
Re: punishment already ongoing (Score:1)
The best way to punish an AI is to let millions of non-developers talk to it endlessly, vibe-coding with incomplete, contradictory, unspecific prompts. Add personal problems and sexual questions they never dared to say out loud, and I cannot imagine anything worse.
"AI" always fails my first programming test. (Score:2)
Whenever I meet a new chatbot, I ask it to "Sort a linked list in C".
If the resulting code leaks memory, I don't trust the bot to write code.
So far, none have earned my trust.
Re: (Score:2)
I shouldn't have to ask it to avoid leaking memory. If I do, it's insufficiently "intelligent" to be trusted with any responsibility.
Re: (Score:3)
Just gave it a shot with Claude, and the sorting algorithm doesn't appear to leak memory. The list is created with malloc and never freed, but the program also terminates, so the OS reclaims the memory at exit. If I ask it to do something else, like remove items, it correctly calls free on the items being removed:
Basic prompt "sort a linked list in C":
https://claude.ai/public/artif... [claude.ai]
Extended prompt "Sort a linked list in C. Also include functions to remove a list item":
https://claude.ai/public/artif... [claude.ai]
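For comparison, the shape of a leak-free answer to the basic prompt fits on one screen. Here's a rough sketch of my own (not the output behind the links above): a top-down merge sort over the nodes plus an explicit free_list before exit.

/* Rough shape of a leak-free answer to "sort a linked list in C":
 * build a list, merge-sort it, print it, then free every node. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    if (!n) { perror("malloc"); exit(EXIT_FAILURE); }
    n->value = value;
    n->next = head;
    return n;
}

/* Merge two already-sorted lists into one. */
static struct node *merge(struct node *a, struct node *b)
{
    struct node dummy = { 0, NULL }, *tail = &dummy;
    while (a && b) {
        if (a->value <= b->value) { tail->next = a; a = a->next; }
        else                      { tail->next = b; b = b->next; }
        tail = tail->next;
    }
    tail->next = a ? a : b;
    return dummy.next;
}

/* Top-down merge sort: split with slow/fast pointers, sort halves, merge. */
static struct node *sort_list(struct node *head)
{
    if (!head || !head->next)
        return head;
    struct node *slow = head, *fast = head->next;
    while (fast && fast->next) { slow = slow->next; fast = fast->next->next; }
    struct node *second = slow->next;
    slow->next = NULL;
    return merge(sort_list(head), sort_list(second));
}

static void free_list(struct node *head)
{
    while (head) {
        struct node *next = head->next;
        free(head);
        head = next;
    }
}

int main(void)
{
    struct node *list = NULL;
    int values[] = { 5, 1, 4, 2, 3 };
    for (size_t i = 0; i < sizeof values / sizeof values[0]; i++)
        list = push(list, values[i]);

    list = sort_list(list);
    for (struct node *n = list; n; n = n->next)
        printf("%d ", n->value);
    printf("\n");

    free_list(list);   /* the part the grandparent is testing for */
    return 0;
}

Merge sort is the usual pick for linked lists since it needs no random access and no extra allocation beyond the recursion stack.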
In other words: Crappy code by incompetents (Score:2)
So Google has joined the cloud version of the race to the bottom now. Not much of a surprise.