Cloudflare Explains Its Worst Outage Since 2019

Cloudflare suffered its worst network outage in six years on Tuesday, beginning at 11:20 UTC. The disruption prevented the content delivery network from routing traffic for roughly three hours. The failure, writes Cloudflare in a blog post, originated from a database permissions change deployed at 11:05 UTC. The modification altered how a database query returned information about bot detection features. The query began returning duplicate entries. A configuration file used to identify automated traffic doubled in size and spread across the network's machines. Cloudflare's traffic routing software reads this file to distinguish bots from legitimate users. The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
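
In rough terms, a loader like the one described could have deduplicated the rows and rejected an oversized file instead of crashing. A minimal sketch in Rust (hypothetical, not Cloudflare's actual code; the constant and names are made up):

    use std::collections::BTreeSet;

    // Hypothetical sketch: load bot-detection feature names into a buffer
    // preallocated for at most MAX_FEATURES entries. Duplicate rows are dropped,
    // and an oversized file is rejected with an error rather than a crash.
    const MAX_FEATURES: usize = 200;

    fn load_features(lines: &[&str]) -> Result<Vec<String>, String> {
        let unique: BTreeSet<&str> = lines.iter().copied().collect(); // drop duplicates
        if unique.len() > MAX_FEATURES {
            return Err(format!("{} features exceeds the limit of {}", unique.len(), MAX_FEATURES));
        }
        let mut features = Vec::with_capacity(MAX_FEATURES); // preallocated cap
        features.extend(unique.into_iter().map(String::from));
        Ok(features)
    }

    fn main() {
        // A "doubled" file: every feature name appears twice, as in the outage.
        let doubled: Vec<String> = (0..150)
            .flat_map(|i| { let name = format!("feature_{i}"); [name.clone(), name] })
            .collect();
        let refs: Vec<&str> = doubled.iter().map(String::as_str).collect();
        println!("{:?}", load_features(&refs).map(|f| f.len())); // Ok(150)
    }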

Users attempting to access websites behind Cloudflare's network received error messages. The outage affected multiple services. Turnstile security checks failed to load. The Workers KV storage service returned elevated error rates. Users could not log into Cloudflare's dashboard. Access authentication failed for most customers.

Engineers initially suspected a coordinated attack. The configuration file was automatically regenerated every five minutes. Database servers produced either correct or corrupted files during a gradual system update. Services repeatedly recovered and failed as different versions of the file circulated. Teams stopped generating new files at 14:24 UTC and manually restored a working version. Most traffic resumed by 14:30 UTC. All systems returned to normal at 17:06 UTC.
  • n/a (Score:4, Informative)

    by rogoshen1 ( 2922505 ) on Wednesday November 19, 2025 @09:53AM (#65804507)

    maybe centralization isn't such a good thing after all?

    • Re:n/a (Score:5, Interesting)

      by AmiMoJo ( 196126 ) on Wednesday November 19, 2025 @10:15AM (#65804553) Homepage Journal

      In this case centralization isn't a bad idea. Okay, occasionally there is a problem, but when there is, a massive amount of resources gets thrown at it and it gets fixed quickly. Meanwhile their software is updated and constantly tested, so it's more secure and stable than most in-house efforts. It's their full-time job, whereas it's usually just the IT guy's background task when the company manages it themselves.

      What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.

      • Re:n/a (Score:5, Insightful)

        by DarkVader ( 121278 ) on Wednesday November 19, 2025 @10:37AM (#65804593)

        Nope, not seeing it.

        No centralization: One site goes down, inconveniences a few people, problem gets fixed a bit more slowly.

        Centralization: A quarter of the internet becomes nonfunctional.

        Centralization still seems like a really, really bad idea to me. It makes it MUCH harder for the internet to just route around damage.

        • by higuita ( 129722 )

          I don't care if 1/4 of the internet goes down, I care about my site.

          Was the CF downtime bigger or smaller than a downtime on my side would have been?
          Can I even replicate their features, to get a similar service?

          So while this downtime is always bad, all sites have some downtime... maybe I was lucky, but I was little affected by this.
          Either way, no, I cannot really replicate the Cloudflare solution locally; the costs would be huge: more people, more servers, more knowledge, and it still would not reach the same level. Just the

        • this is basically the political debate of communism vs federal republics haha

        • by AmiMoJo ( 196126 )

          It's occasional mass outages for a short time, vs more frequent small outages and security issues.

          Don't forget that Cloudflare handles a lot of the security for sites that use it. Not just DDOS protection, but things like user authentication and HTTPS.

      • > What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.

        What world are you living in?

      • by PPH ( 736903 )

        and that such services are properly regulated.

        Shame Ted Stevens isn't around to oversee that effort.

      • ...when there is a massive amount of resources are thrown at it, and it gets fixed quickly.

        This sounds like a variation of the concept that if a company makes gargantuan profits, they have the means to provide top-quality service because they can afford it.

        Yeah, I'm not buying it. I've had huge, HUGE problems with Cloudflare randomly not working with my web browser of choice, and I get really upset when I hear about another of my favorite sites adopting it.

    • Dunno.

      What strikes me as odd is that 20% of ALL sites use cloudflare. Why?
      This, in my book, makes them just as big and potentially evil as google/amazon/meta.

    • It's not the centralization that did this, but the automation. They had a control file being managed by a central control system, and pushed out to edge devices. The alternative to this sort of automation is manual distribution, which takes a lot more human time, and tends to make the edge devices bespoke snowflakes. Clearly the answer is to let AI handle this automation, so you can be sure it is consistently making inscrutable changes to the systems, and preparing for the deployment of Skynet.
      • It's not the centralization that did this, but the automation.

        Coupled with arbitrary program limits and poor error handling and reporting. From TFS:

        The software had a built-in limit of 200 bot detection features.
        The enlarged file contained more than 200 entries.
        The software crashed when it encountered the unexpected file size.

    • maybe centralization isn't such a good thing after all?

      More like arbitrary program limits and poor error handling and reporting. From TFS:

      The software had a built-in limit of 200 bot detection features.
      The enlarged file contained more than 200 entries.
      The software crashed when it encountered the unexpected file size.

    • There's no other option. Very few providers can withstand a multi-Tbps DDoS attack without huge expense. In this case it's an umbrella we all need to huddle under, for better or worse. Any business of sufficient size not using Cloudflare or some other cloud provider's DDoS protection offering is vulnerable.

    • This incident appears to me to be similar to the CrowdStrike debacle: because they updated every customer all at once, they had a major outage before they could take measures to correct it.

      If they had used blue-green deployment, neither the CrowdStrike nor the Cloudflare outage would have been so bad.

      It's not that centralization is bad per se, but if a large system is centralized, it needs to be segmented to make it resilient.
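
      A minimal sketch of the kind of staged rollout being described (hypothetical; the cohort names and health check are made up, not Cloudflare's or CrowdStrike's actual tooling):

          use std::{thread, time::Duration};

          // Hypothetical staged rollout: push a new config to a small canary cohort
          // first, check health, and only then continue outward. A bad config is
          // caught while it can only hurt a fraction of the fleet.
          fn rollout(config: &str, cohorts: &[&str], healthy: impl Fn(&str) -> bool) -> Result<(), String> {
              for cohort in cohorts {
                  println!("deploying {config} to {cohort}");
                  thread::sleep(Duration::from_millis(100)); // soak time, illustrative only
                  if !healthy(cohort) {
                      return Err(format!("rolling back: {cohort} unhealthy after {config}"));
                  }
              }
              Ok(())
          }

          fn main() {
              let cohorts = ["canary-1%", "region-a", "region-b", "global"];
              // Pretend health check: the new config breaks everything it touches.
              let result = rollout("bot-features-v2", &cohorts, |_| false);
              println!("{result:?}"); // Err after the canary, before any global impact
          }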

  • by andyring ( 100627 ) on Wednesday November 19, 2025 @09:55AM (#65804513) Homepage

    Is this real?!?

    Like, really really for real?

    For once, IT WASN'T DNS!!!!!!!!!!!!!!!!!!

  • > The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.

    A built-in limit is something like:

        if (rule_count > 200)
            log_urgent("rule count exceeded")
            break
        else
            rule_count++
            process_rule()

    This sounds like it did not have a built-in limit but rather walked off the end of an array or something when the count went over 200.
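
    For what it's worth, a limit can exist and still bring the process down if the error it raises just gets unwrapped. A hypothetical Rust sketch (not the actual proxy code, which is quoted further down in the thread):

        fn append_features(buf: &mut Vec<String>, incoming: &[String], max: usize) -> Result<(), String> {
            // The limit check is real and returns an error...
            if buf.len() + incoming.len() > max {
                return Err(format!("feature count {} exceeds limit {}", buf.len() + incoming.len(), max));
            }
            buf.extend_from_slice(incoming);
            Ok(())
        }

        fn main() {
            let mut buf: Vec<String> = Vec::with_capacity(200);
            let oversized: Vec<String> = (0..300).map(|i| format!("feature_{i}")).collect();
            // ...but unwrap() turns that Err into a panic, so the process still crashes.
            append_features(&mut buf, &oversized, 200).unwrap();
        }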

    • by gweihir ( 88907 )

        Indeed. A space limit, a hard-coded default-deny, or something. In any case, something placed incompetently and then not tested for. Amateurs.

      • by higuita ( 129722 )

          Can I see your code, to see what assumptions you made in the early stages that you never bothered to go back and fix?! :D

    • Re:Built In Limit? (Score:4, Informative)

      by RJFerret ( 1279530 ) on Wednesday November 19, 2025 @10:13AM (#65804543)

      They explain it and you can see their code toward the end of the linked blog post.

      > Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.

      • by Viol8 ( 599362 )

          That's an explanation, not an excuse. There should have been a limit check, end of. Probably written by some clueless kid just out of college, because any semi-competent dev would have put that check in.

        • by PPH ( 736903 )

          Probably written by some clueless kid

          CompSci 101, Lesson 1, Day 1: Never check for an error condition you don't know how to handle.

        • If there is a limit on the size of the config file used to identify automated traffic, then the process for updating it needs to save the updated file in temporary storage and sanity-check it; if it fails the sanity check, it should throw an error and reject the update. If it passes the sanity check, the old file should be backed up, and only then should the updated file be moved to replace the old one. That way, if there's an identifiably out-of-band update, it doesn't stomp on the current good data, and e
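
          Something like the following, as a hypothetical sketch of that validate-then-swap flow (the paths and the sanity check are made up):

              use std::fs;
              use std::path::Path;

              // Hypothetical sketch: write the regenerated file to a temporary path,
              // sanity-check it, back up the current file, and only then rename the
              // new one into place. A corrupted update never replaces known-good data.
              fn update_config(new_contents: &str, live: &Path, max_entries: usize) -> std::io::Result<bool> {
                  let entries = new_contents.lines().filter(|l| !l.trim().is_empty()).count();
                  if entries == 0 || entries > max_entries {
                      return Ok(false); // fails the sanity check: keep the current good file
                  }
                  let tmp = live.with_extension("tmp");
                  let backup = live.with_extension("bak");
                  fs::write(&tmp, new_contents)?;
                  if live.exists() {
                      fs::copy(live, &backup)?; // keep a known-good copy for rollback
                  }
                  fs::rename(&tmp, live)?; // atomic swap on the same filesystem
                  Ok(true)
              }

              fn main() -> std::io::Result<()> {
                  let applied = update_config("feature_a\nfeature_b\n", Path::new("features.conf"), 200)?;
                  println!("applied: {applied}");
                  Ok(())
              }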
      • This is the code they show:

        /// Fetch edge features based on `input` struct into [`Features`] buffer.
        pub fn fetch_features(
            &mut self,
            input: &dyn BotsInput,
            features: &mut Features,
        ) -> Result<(), (ErrorFlags, i32)> {
            // update features checksum (lower 32 bits) and copy edge feature names
            features.checksum &= 0xFFFF_FFFF_0000_0000;
            features.checksum |= u64::from(self.config.checksum);
            let (feature_values, _) = features
      • Sure, but if there's a limit, the code should handle/ignore excesses gracefully *and* report it so the data and/or code can be reviewed and updated.

    • by JcMorin ( 930466 )
      I had a sneak peek at their code; it was more like: if (rule_count > 200) crash_whole_system() else process_rule()
    • Sshhh, you will awaken the Rust zealots.
  • When something goes wrong, your mind always jumps to hackers, but most of the time it's your own fault.

  • Rust (Score:4, Funny)

    by dotslashw ( 7488844 ) on Wednesday November 19, 2025 @10:20AM (#65804569)
    Amusingly, it was a bug in some FL code they recently rewrote in Rust.

    Here they are two months ago saying "Cloudflare just got faster and more secure, powered by Rust"

    https://blog.cloudflare.com/20... [cloudflare.com]

    Will the Rust fanatics be sure to keep mentioning that Rust rewrites can cause 30 million errors per second? LOL.
  • by sconeu ( 64226 ) on Wednesday November 19, 2025 @11:41AM (#65804721) Homepage Journal

    Apparently nobody learned anything from the FalconStrike crash.

  • A fairly quiet site I help run got taken out by this. But we're using Cloudflare because, without it, we get overrun with AI scraper bots and the database server falls over.

  • by Fly Swatter ( 30498 ) on Wednesday November 19, 2025 @12:20PM (#65804843) Homepage
    Oh wait, that's the modern developer maintenance schedule: release something that might work, then fix it after all hell breaks loose. Engineering failure #1.

    And if you have a hard limit of 200 entries, then you ignore the remainder; instead, by crashing, you have proved there was no actual limit, just one giant ticking time bomb. Engineering failure #201.
  • ...when /. readers included a lot more programmers...

    Why was there a permission change? The permissions on that should *not* be changing.

    Then there's the software issue: when it hits 200, it should have either rolled the list or cut it; it should *not* have crashed.
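
    The "cut it" option is only a few lines, as a hypothetical sketch:

        // Hypothetical sketch: keep at most `max` entries and log what was dropped,
        // rather than treating an oversized list as fatal.
        fn cap_features(mut features: Vec<String>, max: usize) -> Vec<String> {
            if features.len() > max {
                eprintln!("warning: dropping {} features over the limit of {max}", features.len() - max);
                features.truncate(max);
            }
            features
        }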

  • You fucked up. You trusted us!

    Ref: You Tube [youtube.com]

  • "The query began returning duplicate entries."

    My mind immediately went to, "somehow f'ed up a JOIN".

"I'm a mean green mother from outer space" -- Audrey II, The Little Shop of Horrors

Working...