Cloudflare Explains Its Worst Outage Since 2019
Cloudflare suffered its worst network outage in six years on Tuesday, beginning at 11:20 UTC. The disruption prevented the content delivery network from routing traffic for roughly three hours. The failure, writes Cloudflare in a blog post, originated from a database permissions change deployed at 11:05 UTC. The modification altered how a database query returned information about bot detection features. The query began returning duplicate entries. A configuration file used to identify automated traffic doubled in size and spread across the network's machines. Cloudflare's traffic routing software reads this file to distinguish bots from legitimate users. The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
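As a rough illustration of the mechanism described above — a minimal sketch only, not Cloudflare's actual code; the names MAX_FEATURES and load_features and the 260-entry config are made up — here is how a hard cap on a preallocated feature list can turn an oversized, duplicated config file into a crash when the overflow path isn't handled:

// Illustrative Rust sketch of the failure mode in the summary (hypothetical names).
const MAX_FEATURES: usize = 200; // the hard cap described in the post-mortem

#[derive(Debug)]
struct CapacityExceeded(usize);

fn load_features(lines: &[&str]) -> Result<Vec<String>, CapacityExceeded> {
    // Preallocate a fixed-size buffer for performance, as the blog post describes.
    let mut features = Vec::with_capacity(MAX_FEATURES);
    for line in lines {
        if features.len() == MAX_FEATURES {
            // A duplicated config pushes the count past the cap.
            return Err(CapacityExceeded(lines.len()));
        }
        features.push(line.to_string());
    }
    Ok(features)
}

fn main() {
    // Normally ~60 features; a duplicated file can exceed the 200-entry cap.
    let config: Vec<&str> = (0..260).map(|_| "feature").collect();
    // Unwrapping the error mirrors the reported behaviour: the process dies
    // instead of continuing to serve traffic with the features it already has.
    let features = load_features(&config).unwrap();
    println!("loaded {} features", features.len());
}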
Users attempting to access websites behind Cloudflare's network received error messages. The outage affected multiple services. Turnstile security checks failed to load. The Workers KV storage service returned elevated error rates. Users could not log into Cloudflare's dashboard. Access authentication failed for most customers.
Engineers initially suspected a coordinated attack. The configuration file was automatically regenerated every five minutes. Database servers produced either correct or corrupted files during a gradual system update. Services repeatedly recovered and failed as different versions of the file circulated. Teams stopped generating new files at 14:24 UTC and manually restored a working version. Most traffic resumed by 14:30 UTC. All systems returned to normal at 17:06 UTC.
n/a (Score:4, Informative)
maybe centralization isn't such a good thing after all?
Re:n/a (Score:5, Interesting)
In this case centralization isn't a bad idea. Okay, occasionally there is a problem, but when there is, a massive amount of resources gets thrown at it and it gets fixed quickly. Meanwhile their software is updated and constantly tested, so it's more secure and stable than most in-house efforts. It's their full-time job, whereas it's usually just the IT guy's background task when the company manages it themselves.
What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
Re:n/a (Score:5, Insightful)
Nope, not seeing it.
No centralization: One site goes down, inconveniences a few people, problem gets fixed a bit more slowly.
Centralization: A quarter of the internet becomes nonfunctional.
Centralization still seems like a really, really bad idea to me. It makes it MUCH harder for the internet to just route around damage.
Re: (Score:3)
I don't care if 1/4 of the internet goes down, I care about my site.
Was the CF downtime bigger or smaller than a downtime on my side would have been?
Could I even replicate their features to get a similar service?
So while this downtime is always bad, all sites have some downtime... maybe I was lucky, but I was barely affected by this.
Either way, no, I cannot really replicate the Cloudflare solution locally; the costs would be huge: more people, more servers, more knowledge, and it still would not reach the same level. Just the
Re: (Score:2)
>>I don't care if 1/4 of the internet goes down, I care about my site.
>To the 1/4th of the Internet that couldn't reach your site, as far as they care, your site is down
I actually mean that my site being down 100% is what I care about; I don't care about other sites being down (unless they are a requirement for my site).
>>Was the CF downtime bigger or smaller than a downtime on my side would have been?
>Bigger. One is inclusive of the other.
Again, you didn't understand what I meant; I'm saying that a downtime in
Re: n/a (Score:1)
this is basically the political debate of communism vs federal republics haha
Re: (Score:3)
It's occasional mass outages for a short time, vs more frequent small outages and security issues.
Don't forget that Cloudflare handles a lot of the security for sites that use it. Not just DDOS protection, but things like user authentication and HTTPS.
Re: (Score:2)
> What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
What world are you living in?
Re: (Score:2)
I meant that it matters, not that there actually is good competition. Cloudflare is rather dominant.
Re: (Score:2)
20% of the world's traffic goes through Cloudflare. The other 80% would like a word with you.
Re: (Score:2)
and that such services are properly regulated.
Shame Ted Stevens isn't around to oversee that effort.
Re: (Score:2)
...when there is, a massive amount of resources gets thrown at it and it gets fixed quickly.
This sounds like a variation of the concept that if a company makes gargantuan profits, they have the means to provide top-quality service because they can afford it.
Yeah, I'm not buying it. I've had huge, HUGE problems with Cloudflare randomly not working with my web browser of choice, and I get really upset when I hear about another of my favorite sites adopting it.
Re: n/a (Score:2)
Dunno.
What strikes me as odd is that 20% of ALL sites use Cloudflare. Why?
This, in my book, makes them just as big and potentially just as evil as google/amazon/meta.
Businesses don't make decisions based on evilness (Score:3)
Re: (Score:3)
- Caching all around the world
- DDOS protection
- Gatekeeping bots
I have no affiliation with Cloudflare and I use almost none of their services, but I can understand why 20% of the internet does.
Automation (Score:2)
Re: (Score:2)
It's not the centralization that did this, but the automation.
Coupled with arbitrary program limits and poor error handling and reporting. From TFS:
The software had a built-in limit of 200 bot detection features.
The enlarged file contained more than 200 entries.
The software crashed when it encountered the unexpected file size.
Re: (Score:2)
maybe centralization isn't such a good thing after all?
More like arbitrary program limits and poor error handling and reporting. From TFS:
The software had a built-in limit of 200 bot detection features.
The enlarged file contained more than 200 entries.
The software crashed when it encountered the unexpected file size.
Re: (Score:2)
There's no other option. Very few providers can withstand a multi-Tbps DDoS attack without huge expense. In this case it's an umbrella we all need to huddle under, for better or worse. Any business of sufficient size not using Cloudflare or some other cloud provider's DDoS protection offering is vulnerable.
Re: (Score:2)
This incident appears to me to be similar to the CrowdStrike debacle: because they updated every customer all at once, they had a major outage before they could take measures to correct it.
If they had used blue-green deployment, neither the CrowdStrike nor the Cloudflare outage would have been so bad.
It's not that centralization is bad per se, but if a large system is centralized, it needs to be segmented to make it resilient.
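For readers unfamiliar with the idea the parent raises, here is a hypothetical sketch of a staged (canary) rollout gate — none of these names, stages, or thresholds come from Cloudflare or CrowdStrike: push the new config to a small slice of the fleet first, check the observed error rate, and abort before the change reaches everyone.

// Hypothetical Rust sketch of a staged rollout gate (made-up names and thresholds).
struct RolloutStage {
    name: &'static str,
    traffic_share: f64, // fraction of the fleet receiving the new config
}

fn observed_error_rate(stage: &RolloutStage) -> f64 {
    // Placeholder: a real system would query monitoring here. We pretend the
    // new config is bad, so even the canary slice shows a huge error rate.
    let _ = stage;
    0.85
}

fn staged_rollout(stages: &[RolloutStage]) -> Result<(), String> {
    for stage in stages {
        // push_config_to(stage) would go here (hypothetical deployment step)
        let errors = observed_error_rate(stage);
        if errors > 0.01 {
            // Halt and roll back before the bad config reaches the whole fleet.
            return Err(format!(
                "aborting at stage '{}' ({:.0}% of fleet): error rate {:.0}%",
                stage.name,
                stage.traffic_share * 100.0,
                errors * 100.0
            ));
        }
    }
    Ok(())
}

fn main() {
    let stages = [
        RolloutStage { name: "canary", traffic_share: 0.01 },
        RolloutStage { name: "quarter", traffic_share: 0.25 },
        RolloutStage { name: "full", traffic_share: 1.00 },
    ];
    match staged_rollout(&stages) {
        Ok(()) => println!("rollout complete"),
        Err(e) => println!("{e}"),
    }
}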
Re: (Score:2)
Who said that flow (dev->qa->staging->prod) didn't happen?
How many people have had bugs in prod that didn't show up in the previous steps? How many have had problems only hours after deploying to prod?
You can't always test all conditions; dev and QA may not have the amount of users/access/info to really replicate a problem, staging may read the prod DB but not trigger the conditions, maybe they are intermittent, require a special corner case, etc.
In this case, a prod DB change is ALWAYS harde
Wait wait wait.... (Score:5, Funny)
Is this real?!?
Like, really really for real?
For once, IT WASN'T DNS!!!!!!!!!!!!!!!!!!
NOT DNS! (Score:2)
The World's most interesting engineer (Score:2)
"I don't always test,
but when I do, I test in production^W The Internet "
Built In Limit? (Score:2)
> The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
A built in limit is:
for rule in rules {
    if rule_count > 200 {
        log_urgent("rule count exceeded");
        break;
    }
    rule_count += 1;
    process_rule(rule);
}
This sounds like it did not have a built-in limit but rather walked off the end of an array or something when the count went over 200.
Re: (Score:1)
Indeed. Space-limit, hard-placed default-deny or something. In any case something placed incompetently and then not tested for. Amateurs.
Re: (Score:2)
Can I see your code, to check what assumptions you made in the early stages that you never bothered to go back and fix?! :D
Re:Built In Limit? (Score:4, Informative)
They explain it and you can see their code toward the end of the linked blog post.
> Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.
Re: (Score:2)
That's an explanation, not an excuse. There should have been a limit check, end of. Probably written by some clueless kid just out of college, because any semi-competent dev would have put that check in.
Re: (Score:2)
Probably written by some clueless kid
CompSci 101, Lesson 1, Day 1: Never check for an error condition you don't know how to handle.
Re: (Score:2)
Re: (Score:2)
/// Fetch edge features based on `input` struct into `Features` buffer.
pub fn fetch_features(
    &mut self,
    input: &dyn BotsInput,
    features: &mut Features,
) -> Result<(), (ErrorFlags, i32)> {
    // update features checksum (lower 32 bits) and copy edge feature names
    features.checksum &= 0xFFFF_FFFF_0000_0000;
    features.checksum |= u64::from(self.config.checksum);
    let (feature_values, _) = features
Re: (Score:2)
Sure, but if there's a limit, the code should handle/ignore excesses gracefully *and* report it so the data and/or code can be reviewed and updated.
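A minimal sketch of the "handle gracefully and report it" approach the parent suggests, assuming a hypothetical loader (load_features_graceful and MAX_FEATURES are made-up names, not Cloudflare's code): keep the first 200 entries, log the excess loudly, and keep serving.

// Illustrative Rust sketch: clamp to the limit and report, rather than crash.
const MAX_FEATURES: usize = 200;

fn load_features_graceful(lines: &[&str]) -> Vec<String> {
    if lines.len() > MAX_FEATURES {
        // Report the anomaly so the config pipeline gets reviewed...
        eprintln!(
            "WARNING: feature file has {} entries, truncating to {}",
            lines.len(),
            MAX_FEATURES
        );
    }
    // ...but degrade gracefully instead of taking the proxy down.
    lines.iter().take(MAX_FEATURES).map(|s| s.to_string()).collect()
}

fn main() {
    let oversized: Vec<&str> = (0..260).map(|_| "feature").collect();
    let features = load_features_graceful(&oversized);
    println!("serving with {} features", features.len());
}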
Re: (Score:2)
Re: (Score:3)
Re: Cough ... AI ... cough (Score:1)
AI DS?
Engineers initially suspected a coordinated attack (Score:1)
When something goes wrong your mind always jumps to hackers, but most of the time it's your own fault.
Re: Engineers initially suspected a coordinated at (Score:2)
Re: (Score:2)
Never attribute to malice that which can be explained by a segmentation fault.
Re: (Score:2)
When something goes wrong your mind always jumps to hackers, but most of the time it's your own fault.
Hanlon's razor [wikipedia.org]:
Never attribute to malice that which is adequately explained by stupidity / incompetence.
Rust (Score:4, Funny)
Here they are two months ago saying "Cloudflare just got faster and more secure, powered by Rust"
https://blog.cloudflare.com/20... [cloudflare.com]
Will the Rust fanatics be sure to keep mentioning that Rust rewrites can cause 30 million errors per second? LOL.
History repeats (Score:3)
Apparently nobody learned anything from the CrowdStrike Falcon crash.
Re: (Score:2)
Precisely.
Blue-green deployments would have caught this before it blew up worldwide.
Damned if you do, damned if you don't. (Score:2)
A fairly quiet site I help run got taken out by this. But we're using Cloudflare because, without it, we get overrun with AI scraper bots and the database server falls over.
No internal testing before deployment? (Score:5, Insightful)
And if you have a hard limit of 200 entries, then you ignore the remainder; but instead, by crashing, you have proved there was no actual limit, just one giant ticking time bomb. Engineering failure #201.
Ah, for the old days... (Score:2)
...when /. readers included a lot more programmers...
Why was there a permission change? The permissions on that should *not* be changing.
Then there's the software issue - when it hits 200, it should have either rolled the list or cut it; it should *not* have crashed.
Otter said it well long ago ... (Score:1)
You fucked up. You trusted us!
Ref: YouTube [youtube.com]
Oops (Score:2)
"The query began returning duplicate entries."
My mind immediately went to, "somehow f'ed up a JOIN".
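For the curious, here is a toy illustration of how a query bug of that general shape produces the duplicate entries described in the summary. The schema names and the ColumnRow struct are hypothetical, not Cloudflare's actual schema: a metadata lookup that filters on table name but not on database returns every column twice once a second schema becomes visible.

// Illustrative Rust sketch only (hypothetical names): duplicate rows from a
// metadata query that is missing a schema filter.
#[derive(Debug)]
struct ColumnRow {
    database: &'static str,
    table: &'static str,
    column: &'static str,
}

fn main() {
    // Metadata rows now visible to the account: the same table shows up in a
    // second (hypothetical) schema after a permissions change.
    let system_columns = vec![
        ColumnRow { database: "default", table: "features", column: "feat_a" },
        ColumnRow { database: "default", table: "features", column: "feat_b" },
        ColumnRow { database: "underlying", table: "features", column: "feat_a" },
        ColumnRow { database: "underlying", table: "features", column: "feat_b" },
    ];

    // Buggy query shape: filter on table name only -> every column appears twice.
    let duplicated: Vec<&ColumnRow> = system_columns
        .iter()
        .filter(|r| r.table == "features")
        .collect();

    // Fixed query shape: also pin the schema -> one row per column.
    let deduped: Vec<&ColumnRow> = system_columns
        .iter()
        .filter(|r| r.table == "features" && r.database == "default")
        .collect();

    println!("without schema filter: {} rows", duplicated.len()); // 4
    println!("with schema filter:    {} rows", deduped.len());    // 2
}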