
Cloudflare Comes Clean On Crashing a Chunk of the Web Earlier This Month

Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big chunk of the internet with it. The Register reports: We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global outage had been caused by an error in a single line of code in a system the company uses to push rapid software changes. [...] First up the error itself -- it was in this bit of code: .*(?:.*=.*). We won't go into the full workings as to why because the post does so extensively (a Friday treat for coding nerds) but very broadly the code caused a lot of what's called "backtracking," basically repetitive looping. This backtracking got worse -- exponentially worse -- the more complex the request and very, very quickly maxed out the company's CPUs.
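A rough illustration of the backtracking described above: the toy matcher below handles only `.*` and literal characters, and counts how many match attempts a naive backtracking strategy makes for `.*.*=.*` (the quoted fragment with its non-capturing group dropped, which does not change what it matches). It is a hypothetical sketch, not Cloudflare's engine or code, and `count_steps` and the token list are names made up for the example. In this toy the count grows roughly quadratically with input length rather than literally exponentially.

```python
def count_steps(tokens, text):
    """Naive backtracking match of tokens against a prefix of text.

    tokens is a list where ".*" means "any run of characters" and any
    other entry is a single literal character. Returns (matched, steps),
    where steps counts every match attempt, including failed ones.
    """
    steps = 0

    def match(ti, pos):
        nonlocal steps
        steps += 1
        if ti == len(tokens):
            return True  # every token consumed: success
        tok = tokens[ti]
        if tok == ".*":
            # Greedy: try the longest run first, then give characters back.
            for end in range(len(text), pos - 1, -1):
                if match(ti + 1, end):
                    return True
            return False
        # Literal token: must match exactly one character.
        if pos < len(text) and text[pos] == tok:
            return match(ti + 1, pos + 1)
        return False

    return match(0, 0), steps


PATTERN = [".*", ".*", "=", ".*"]  # stands in for .*.*=.*

if __name__ == "__main__":
    for n in (100, 200, 400):
        matched, steps = count_steps(PATTERN, "x" * n)  # no "=" anywhere
        print(n, matched, steps)
```

Doubling the input length roughly quadruples the step count here, because every split point chosen by the first `.*` forces the second `.*` to retry all of its own split points.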

The impact wasn't noticed for the simple reason that the test suite didn't measure CPU usage. It soon will -- Cloudflare has an internal deadline of a week from now. The second problem was that a software protection system that would have prevented excessive CPU consumption had been removed "by mistake" just weeks earlier. That protection is now back in although it clearly needs to be locked down. The software used to run the code -- the expression engine -- also doesn't have the ability to check for the sort of backtracking that occurred. Cloudflare says it will shift to one that does.
The post goes on to talk about the speed with which it impacted everyone, why it took them so long to fix it, and why it didn't just do a rollback within minutes and solve the issue while it figured out what was going on.
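As a sketch of the kind of CPU check the summary says the test suite lacked, the snippet below measures the CPU time a single regex search consumes and compares it with a budget. This is an assumption about what such a check could look like, not Cloudflare's actual test harness; `cpu_seconds`, `check_rule_cpu`, the sample pattern and the one-second budget are all hypothetical.

```python
import re
import time


def cpu_seconds(fn, *args):
    """Return the CPU time (not wall-clock time) spent running fn(*args)."""
    start = time.process_time()
    fn(*args)
    return time.process_time() - start


def check_rule_cpu(pattern, sample, budget_seconds):
    """Run one search and report whether it stayed within the CPU budget.

    Returns (within_budget, seconds_used). Python's re module is itself a
    backtracking engine, so a pathological pattern shows up as a large
    seconds_used value here.
    """
    compiled = re.compile(pattern)
    used = cpu_seconds(compiled.search, sample)
    return used <= budget_seconds, used


if __name__ == "__main__":
    ok, used = check_rule_cpu(r"a+b", "aaab", budget_seconds=1.0)
    print(ok, used)
```

A check like this only catches the inputs it is given, so the sample strings would need to include long, adversarial cases rather than only typical requests.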

You can read the full postmortem here.

  • Why link to Register (Score:5, Informative)

    by ChoGGi ( 522069 ) <slashdot @ c h o g g i.org> on Friday July 12, 2019 @09:57PM (#58917606) Homepage

    and not the actual report?
    https://blog.cloudflare.com/de... [cloudflare.com]

    • by Anonymous Coward

      Because it's slashfuckingdot. Coerced the liveleak moderators to migrate, and went down in quality.

      Thanks for the link.

    • by Z00L00K ( 682162 )

      Because the person submitting the story didn't do correct research, the editors just pick the bones thrown at them and look at the votes for submissions.

  • Why is this story both "news for nerds" and "stuff that matters"? Why haven't you shoehorned in some awkward reference to global warming?
    • Why is this story both "news for nerds" and "stuff that matters"? Why haven't you shoehorned in some awkward reference to global warming?

      Well, if all the CPUs in all of CloudFlare’s data centers were at 100% utilization, as described, they were plowing through electricity like nobody’s business - and actually did have a significant carbon footprint during that time period.

      But back to the actual story... now I feel a little better about some of the stupid regex mistakes I’ve made.

  • Trashed once again by shitty software, welcome to the future.

  • "Don't worry, we're still the #1 MiTM service in the world and growing. Government-funded agencies still love us."

  • by Waffle Iron ( 339739 ) on Friday July 12, 2019 @11:37PM (#58917864)

    The risk of this kind of issue is why, for example, Rust and Golang's regular expressions don't support backreferences, which can make them such a PITA to use.

    Although I reluctantly admit that not crashing the internet and stuff could be more important.

    • If the pattern is actually a regular expression (as this one appears to be) then it can be matched in O(1) space and O(n) time with respect to the length of the input. True regular expressions never require backtracking. The problem is that many languages implement Perl-inspired "regular expressions", which are not actually regular due to misfeatures like backreferences that can't be represented as deterministic finite automata. Rather than detecting the common cases involving proper regular expressions and

  • by Anonymous Coward

    The real stunner is how there are even regex libraries out there that do NOT run in linear time. The blog post itself links to a classic article describing how to compile a regex to a nondeterministic finite state automaton that can match in linear time (the size of the automaton might be another issue, but that is discovered at compile time).
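A minimal sketch of the linear-time approach the comments above refer to: instead of backtracking, track the set of automaton states that are reachable after each input character. The toy below supports only `.*` and literal characters, and `nfa_match` and the token list are hypothetical names for illustration; it is not the implementation from the linked article or from any particular library.

```python
def nfa_match(tokens, text):
    """Match tokens against a prefix of text by simulating an NFA.

    State i means "tokens[:i] have been matched"; state len(tokens) accepts.
    Returns (matched, steps), where steps counts state visits. Each input
    character visits at most len(tokens) + 1 states, so the work is linear
    in len(text).
    """
    accept = len(tokens)

    def closure(states):
        # A ".*" token may match nothing, so it can always be skipped.
        seen = set()
        stack = list(states)
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            if s < accept and tokens[s] == ".*":
                stack.append(s + 1)
        return seen

    current = closure({0})
    steps = 0
    if accept in current:
        return True, steps
    for ch in text:
        nxt = set()
        for s in current:
            steps += 1
            if s == accept:
                continue
            if tokens[s] == ".*":
                nxt.add(s)        # ".*" absorbs this character
            elif tokens[s] == ch:
                nxt.add(s + 1)    # literal matched, advance
        current = closure(nxt)
        if accept in current:
            return True, steps
        if not current:
            return False, steps
    return False, steps


PATTERN = [".*", ".*", "=", ".*"]  # stands in for .*.*=.*

if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, *nfa_match(PATTERN, "x" * n))
```

Because the state set never holds more than one entry per token, doubling the input only doubles the work, whichever input is supplied.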

  • Weinberg's second law (dating back to at least 1989 - probably much earlier):
    If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization

    And it will probably remain a valid observation. Right up until civilisation is destroyed.

    • And when future archaeologists dig down to that layer ... "Yup, just as we thought -- a regular expression with misconfigured backtracking. Those poor people -- they never had a chance against Skynet."

  • Metaphysical HTTP error code 600: CPU exhausted, not enough Ken Thompson.

  • ...now you have 2 problems.

  • The software used to run the code -- the expression engine...

    Surely they don't mean this thing [expressionengine.com]?!

  • Goddamn fucking updates again! Get it right or go home!
