Cloudflare Explains Its Worst Outage Since 2019
Cloudflare suffered its worst network outage in six years on Tuesday, beginning at 11:20 UTC. The disruption prevented the content delivery network from routing traffic for roughly three hours. The failure, writes Cloudflare in a blog post, originated from a database permissions change deployed at 11:05 UTC. The modification altered how a database query returned information about bot detection features. The query began returning duplicate entries. A configuration file used to identify automated traffic doubled in size and spread across the network's machines. Cloudflare's traffic routing software reads this file to distinguish bots from legitimate users. The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
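As a rough illustration of the mechanism described above — a minimal sketch only, not Cloudflare's actual code; the names MAX_FEATURES and load_features and the 260-entry config are made up — here is how a hard cap on a preallocated feature list can turn an oversized, duplicated config file into a crash when the overflow path isn't handled:

// Illustrative Rust sketch of the failure mode in the summary (hypothetical names).
const MAX_FEATURES: usize = 200; // the hard cap described in the post-mortem

#[derive(Debug)]
struct CapacityExceeded(usize);

fn load_features(lines: &[&str]) -> Result<Vec<String>, CapacityExceeded> {
    // Preallocate a fixed-size buffer for performance, as the blog post describes.
    let mut features = Vec::with_capacity(MAX_FEATURES);
    for line in lines {
        if features.len() == MAX_FEATURES {
            // A duplicated config pushes the count past the cap.
            return Err(CapacityExceeded(lines.len()));
        }
        features.push(line.to_string());
    }
    Ok(features)
}

fn main() {
    // Normally ~60 features; a duplicated file can exceed the 200-entry cap.
    let config: Vec<&str> = (0..260).map(|_| "feature").collect();
    // Unwrapping the error mirrors the reported behaviour: the process dies
    // instead of continuing to serve traffic with the features it already has.
    let features = load_features(&config).unwrap();
    println!("loaded {} features", features.len());
}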
Users attempting to access websites behind Cloudflare's network received error messages. The outage affected multiple services. Turnstile security checks failed to load. The Workers KV storage service returned elevated error rates. Users could not log into Cloudflare's dashboard. Access authentication failed for most customers.
Engineers initially suspected a coordinated attack. The configuration file was automatically regenerated every five minutes. Database servers produced either correct or corrupted files during a gradual system update. Services repeatedly recovered and failed as different versions of the file circulated. Teams stopped generating new files at 14:24 UTC and manually restored a working version. Most traffic resumed by 14:30 UTC. All systems returned to normal at 17:06 UTC.
n/a (Score:4, Informative)
maybe centralization isn't such a good thing after all?
Re:n/a (Score:5, Interesting)
In this case centralization isn't a bad idea. Okay, occasionally there is a problem, but when there is, a massive amount of resources gets thrown at it and it gets fixed quickly. Meanwhile their software is updated and constantly tested, so it's more secure and stable than most in-house efforts. It's their full-time job, whereas it's usually just the IT guy's background task when the company manages it themselves.
What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
Re:n/a (Score:5, Insightful)
Nope, not seeing it.
No centralization: One site goes down, inconveniences a few people, problem gets fixed a bit more slowly.
Centralization: A quarter of the internet becomes nonfunctional.
Centralization still seems like a really, really bad idea to me. It makes it MUCH harder for the internet to just route around damage.
Re: (Score:3)
I don't care if 1/4 of the internet goes down, I care about my site.
Was the CF downtime bigger or smaller than a downtime on my side would have been?
Could I even replicate their features to get a similar service?
So while this downtime is always bad, all sites have some downtime... maybe I was lucky, but I was barely affected by this.
Either way, no, I cannot really replicate the Cloudflare solution locally; the costs would be huge: more people, more servers, more knowledge, and it still would not reach the same level. Just the
Re: (Score:2)
>>I don't care if 1/4 of the internet goes down, I care about my site.
>To the 1/4th of the Internet that couldn't reach your site, as far as they care, your site is down
I actually mean that my site being down 100% is what I care about; I don't care about other sites being down (unless they are a requirement for my site).
>>Was the CF downtime bigger or smaller than a downtime on my side would have been?
>Bigger. One is inclusive of the other.
Again, you didn't understand what I meant; I'm saying that a downtime in
Re: n/a (Score:1)
this is basically the political debate of communism vs federal republics haha
Re: (Score:3)
It's occasional mass outages for a short time, vs more frequent small outages and security issues.
Don't forget that Cloudflare handles a lot of the security for sites that use it. Not just DDOS protection, but things like user authentication and HTTPS.
Re: (Score:2)
> What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
What world are you living in?
Re: (Score:2)
I meant that it matters, not that there actually is good competition. Cloudflare is rather dominant.
Re: (Score:2)
20% of the world's traffic goes through Cloudflare. The other 80% would like a word with you.
Re: (Score:2)
and that such services are properly regulated.
Shame Ted Stevens isn't around to oversee that effort.
Re: (Score:2)
...when there is, a massive amount of resources gets thrown at it and it gets fixed quickly.
This sounds like a variation of the concept that if a company makes gargantuan profits, they have the means to provide top-quality service because they can afford it.
Yeah, I'm not buying it. I've had huge, HUGE problems with Cloudflare randomly not working with my web browser of choice, and I get really upset when I hear about another of my favorite sites adopting it.
Re: n/a (Score:2)
Dunno.
What strikes me as odd is that 20% of ALL sites use Cloudflare. Why?
This, in my book, makes them just as big and potentially just as evil as google/amazon/meta.
Businesses don't make decisions based on evilness (Score:3)
Re: (Score:3)
- Caching all around the world
- DDOS protection
- Gatekeeping bots
I have no affiliation with Cloudflare and I use almost none of their services, but I can understand why 20% of the internet does.
Automation (Score:2)
Re: (Score:2)
It's not the centralization that did this, but the automation.
Coupled with arbitrary program limits and poor error handling and reporting. From TFS:
The software had a built-in limit of 200 bot detection features.
The enlarged file contained more than 200 entries.
The software crashed when it encountered the unexpected file size.
Re: (Score:2)
maybe centralization isn't such a good thing after all?
More like arbitrary program limits and poor error handling and reporting. From TFS:
The software had a built-in limit of 200 bot detection features.
The enlarged file contained more than 200 entries.
The software crashed when it encountered the unexpected file size.
Re: (Score:2)
There's no other option. Very few providers can withstand a multi-Tbps DDoS attack without huge expense. In this case it's an umbrella we all need to huddle under, for better or worse. Any business of sufficient size not using Cloudflare or some other cloud provider's DDoS protection offering is vulnerable.
Re: (Score:2)
This incident appears to me to be similar to the CrowdStrike debacle: because they updated every customer all at once, they had a major outage before they could take measures to correct it.
If they had used blue-green deployment, neither the CrowdStrike nor the Cloudflare outage would have been so bad.
It's not that centralization is bad per se, but if a large system is centralized, it needs to be segmented to make it resilient.
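For readers unfamiliar with the idea the parent raises, here is a hypothetical sketch of a staged (canary) rollout gate — none of these names, stages, or thresholds come from Cloudflare or CrowdStrike: push the new config to a small slice of the fleet first, check the observed error rate, and abort before the change reaches everyone.

// Hypothetical Rust sketch of a staged rollout gate (made-up names and thresholds).
struct RolloutStage {
    name: &'static str,
    traffic_share: f64, // fraction of the fleet receiving the new config
}

fn observed_error_rate(stage: &RolloutStage) -> f64 {
    // Placeholder: a real system would query monitoring here. We pretend the
    // new config is bad, so even the canary slice shows a huge error rate.
    let _ = stage;
    0.85
}

fn staged_rollout(stages: &[RolloutStage]) -> Result<(), String> {
    for stage in stages {
        // push_config_to(stage) would go here (hypothetical deployment step)
        let errors = observed_error_rate(stage);
        if errors > 0.01 {
            // Halt and roll back before the bad config reaches the whole fleet.
            return Err(format!(
                "aborting at stage '{}' ({:.0}% of fleet): error rate {:.0}%",
                stage.name,
                stage.traffic_share * 100.0,
                errors * 100.0
            ));
        }
    }
    Ok(())
}

fn main() {
    let stages = [
        RolloutStage { name: "canary", traffic_share: 0.01 },
        RolloutStage { name: "quarter", traffic_share: 0.25 },
        RolloutStage { name: "full", traffic_share: 1.00 },
    ];
    match staged_rollout(&stages) {
        Ok(()) => println!("rollout complete"),
        Err(e) => println!("{e}"),
    }
}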
Re: (Score:2)
Who said that flow (dev->qa->staging->prod) didn't happen?
How many people have had bugs in prod that didn't show up in the previous steps? How many have had problems only hours after deploying to prod?
You can't always test all conditions; dev and QA may not have the amount of users/access/info to really replicate a problem, staging may read the prod DB but not trigger the conditions, maybe they are intermittent, require a special corner case, etc.
In this case, a prod DB change is ALWAYS harde
Wait wait wait.... (Score:5, Funny)
Is this real?!?
Like, really really for real?
For once, IT WASN'T DNS!!!!!!!!!!!!!!!!!!
NOT DNS! (Score:2)
The World's most interesting engineer (Score:2)
"I don't always test,
but when I do, I test in production^W The Internet "
Built In Limit? (Score:2)
> The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
A built in limit is:
for rule in rules {
    if rule_count > 200 {
        log_urgent("rule count exceeded");
        break;
    }
    rule_count += 1;
    process_rule(rule);
}
This sounds like it did not have a built-in limit but rather walked off the end of an array or something when the count went over 200.
Re: (Score:1)
Indeed. Space-limit, hard-placed default-deny or something. In any case something placed incompetently and then not tested for. Amateurs.
Re: (Score:2)
Can I see your code, to check what assumptions you made in the early stages that you never bothered to go back and fix?! :D
Re:Built In Limit? (Score:4, Informative)
They explain it and you can see their code toward the end of the linked blog post.
> Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.
Re: (Score:2)
That's an explanation, not an excuse. There should have been a limit check, end of. Probably written by some clueless kid just out of college, because any semi-competent dev would have put that check in.
Re: (Score:2)
Probably written by some clueless kid
CompSci 101, Lesson 1, Day 1: Never check for an error condition you don't know how to handle.
Re: (Score:2)
Re: (Score:2)
/// Fetch edge features based on `input` struct into `Features` buffer.
pub fn fetch_features(
    &mut self,
    input: &dyn BotsInput,
    features: &mut Features,
) -> Result<(), (ErrorFlags, i32)> {
    // update features checksum (lower 32 bits) and copy edge feature names
    features.checksum &= 0xFFFF_FFFF_0000_0000;
    features.checksum |= u64::from(self.config.checksum);
    let (feature_values, _) = features
Re: (Score:2)
Sure, but if there's a limit, the code should handle/ignore excesses gracefully *and* report it so the data and/or code can be reviewed and updated.
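A minimal sketch of the "handle gracefully and report it" approach the parent suggests, assuming a hypothetical loader (load_features_graceful and MAX_FEATURES are made-up names, not Cloudflare's code): keep the first 200 entries, log the excess loudly, and keep serving.

// Illustrative Rust sketch: clamp to the limit and report, rather than crash.
const MAX_FEATURES: usize = 200;

fn load_features_graceful(lines: &[&str]) -> Vec<String> {
    if lines.len() > MAX_FEATURES {
        // Report the anomaly so the config pipeline gets reviewed...
        eprintln!(
            "WARNING: feature file has {} entries, truncating to {}",
            lines.len(),
            MAX_FEATURES
        );
    }
    // ...but degrade gracefully instead of taking the proxy down.
    lines.iter().take(MAX_FEATURES).map(|s| s.to_string()).collect()
}

fn main() {
    let oversized: Vec<&str> = (0..260).map(|_| "feature").collect();
    let features = load_features_graceful(&oversized);
    println!("serving with {} features", features.len());
}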
Re: (Score:2)
Re: (Score:3)
Re: Cough ... AI ... cough (Score:1)
AI DS?
Engineers initially suspected a coordinated attack (Score:1)
When something goes wrong your mind always jumps to hackers, but most of the time it's your own fault.
Re: Engineers initially suspected a coordinated at (Score:2)
Re: (Score:2)
Never attribute to malice that which can be explained by a segmentation fault.
Re: (Score:2)
When something goes wrong your mind always jumps to hackers, but most of the time it's your own fault.
Hanlon's razor [wikipedia.org]:
Never attribute to malice that which is adequately explained by stupidity / incompetence.
Rust (Score:4, Funny)
Here they are two months ago saying "Cloudflare just got faster and more secure, powered by Rust"
https://blog.cloudflare.com/20... [cloudflare.com]
Will the Rust fanatics be sure to keep mentioning that Rust rewrites can cause 30 million errors per second? LOL.
History repeats (Score:3)
Apparently nobody learned anything from the CrowdStrike Falcon crash.
Re: (Score:2)
Precisely.
Blue-green deployments would have caught this before it blew up worldwide.
Damned if you do, damned if you don't. (Score:2)
A fairly quiet site I help run got taken out by this. But we're using Cloudflare because, without it, we get overrun with AI scraper bots and the database server falls over.
No internal testing before deployment? (Score:5, Insightful)
And if you have a hard limit of 200 entries, then you ignore the remainder; but instead, by crashing, you have proved there was no actual limit, just one giant ticking time bomb. Engineering failure #201.
Ah, for the old days... (Score:2)
...when /. readers included a lot more programmers...
Why was there a permission change? The permissions on that should *not* be changing.
Then there's the software issue - when it hits 200, it should have either rolled the list or cut it; it should *not* have crashed.
Otter said it well long ago ... (Score:1)
You fucked up. You trusted us!
Ref: YouTube [youtube.com]
Oops (Score:2)
"The query began returning duplicate entries."
My mind immediately went to, "somehow f'ed up a JOIN".
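For the curious, here is a toy illustration of how a query bug of that general shape produces the duplicate entries described in the summary. The schema names and the ColumnRow struct are hypothetical, not Cloudflare's actual schema: a metadata lookup that filters on table name but not on database returns every column twice once a second schema becomes visible.

// Illustrative Rust sketch only (hypothetical names): duplicate rows from a
// metadata query that is missing a schema filter.
#[derive(Debug)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    column: &'static str,
}

fn main() {
    // Metadata rows now visible to the account: the same table shows up in a
    // second (hypothetical) schema after a permissions change.
    let system_columns = vec![
        ColumnRow { database: "default", table: "features", column: "feat_a" },
        ColumnRow { database: "default", table: "features", column: "feat_b" },
        ColumnRow { database: "underlying", table: "features", column: "feat_a" },
        ColumnRow { database: "underlying", table: "features", column: "feat_b" },
    ];

    // Buggy query shape: filter on table name only -> every column appears twice.
    let duplicated: Vec<&ColumnRow> = system_columns
        .iter()
        .filter(|r| r.table == "features")
        .collect();

    // Fixed query shape: also pin the schema -> one row per column.
    let deduped: Vec<&ColumnRow> = system_columns
        .iter()
        .filter(|r| r.table == "features" && r.database == "default")
        .collect();

    println!("without schema filter: {} rows", duplicated.len()); // 4
    println!("with schema filter:    {} rows", deduped.len());    // 2
}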