A Single Point of Failure Triggered the Amazon Outage Affecting Millions (arstechnica.com)

An anonymous reader quotes a report from Ars Technica: The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon's sprawling network, according to a post-mortem from company engineers. [...] Amazon said the root cause of the outage was a race condition in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. A race condition is an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers' control. The result can be unexpected behavior and potentially harmful failures.

In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it "experienced unusually high delays needing to retry its update on several of the DNS endpoints." While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB service. [...] The failure caused systems that relied on DynamoDB in Amazon's US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.

The damage resulting from the DynamoDB failure then put a strain on Amazon's EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a "significant backlog of network state propagations" that needed to be processed. The engineers went on to say: "While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation." In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included creating and modifying Redshift clusters, Lambda invocations, Fargate task launches (including those used by Managed Workflows for Apache Airflow), Outposts lifecycle operations, and the AWS Support Center.
Amazon has temporarily disabled its DynamoDB DNS Planner and DNS Enactor automation globally while it fixes the race condition and adds safeguards against incorrect DNS plans. Engineers are also updating EC2 and its network load balancer.
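
To make the failure mode concrete, here is a minimal, hypothetical sketch of the same class of bug (toy names and structure, not Amazon's actual code): a planner keeps producing numbered plans while two enactors apply whichever plan they last read, and neither checks whether a newer plan has already been applied, so a delayed enactor can overwrite fresh state with a stale plan.

import random
import threading
import time

# Toy model of the race described above; illustrative only, not AWS code.
latest_plan = {"id": 0}          # most recent plan produced by the planner
applied_plan = {"id": 0}         # plan currently "applied" to the DNS table
stats = {"stale_overwrites": 0}  # times an older plan clobbered a newer one
lock = threading.Lock()

def planner():
    for i in range(1, 500):
        with lock:
            latest_plan["id"] = i
        time.sleep(0.001)

def enactor(max_delay):
    for _ in range(200):
        with lock:
            picked = latest_plan["id"]            # read a plan...
        time.sleep(random.uniform(0, max_delay))  # ...then stall (retries, slow endpoints)
        with lock:
            if picked < applied_plan["id"]:
                stats["stale_overwrites"] += 1    # an old plan is about to clobber a newer one
            applied_plan["id"] = picked           # BUG: no guard like `if picked > applied_plan["id"]`

threads = [
    threading.Thread(target=planner),
    threading.Thread(target=enactor, args=(0.01,)),   # slow enactor (delayed, retrying)
    threading.Thread(target=enactor, args=(0.001,)),  # fast enactor
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("latest plan:", latest_plan["id"], "stale overwrites:", stats["stale_overwrites"])

A guard that refuses to apply a plan older than the one already in place, together with the safeguards against incorrect DNS plans mentioned above, is the sort of change that removes this failure mode.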

Further reading: Amazon's AWS Shows Signs of Weakness as Competitors Charge Ahead

Comments Filter:
  • It is always DNS (Score:5, Insightful)

    by codemachine ( 245871 ) on Friday October 24, 2025 @05:29PM (#65748710)

    Subject says it all

  • Here [xkcd.com] is the xkcd you were thinking of.
  • ...sounds like the Exercise #1 in the first tutorial of my Operating Systems class in 1982.

    I think I still have the notes in a 3-ring if Amazon needs them.

    • Re: (Score:2, Funny)

      by thegarbz ( 1787294 )

      It's great that you can remember every example you've read in your life in a way that precludes you from ever repeating that mistake elsewhere. Unfortunately the rest of us are human beings.

    • by Anonymous Coward

      As I've told all junior devs for decades, race conditions at scale are never "if"s but "when"s... they must be solved correctly /before/ deploying something at scale.
      Polling with retries is a recipe for a cascade of events. I've seen logging systems DoS the entire system because one system failed and the retry code path spammed the logger, and when that code is present at scale it's a huge multiplier.
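
      As an illustration of the parent's point (a generic sketch, nothing from the article), the usual way to keep failing callers from hammering a struggling dependency in lockstep is a retry loop with capped exponential backoff and jitter:

import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=0.1, max_delay=10.0):
    """Retry `operation` with capped exponential backoff and full jitter.

    The jitter spreads retries out so thousands of failing clients don't all
    retry at the same instant and turn one outage into a self-inflicted DoS.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of retrying forever
            delay = random.uniform(0.0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)

# Usage: a flaky call that fails a few times before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 4:
        raise RuntimeError("dependency unavailable")
    return "ok"

print(call_with_backoff(flaky))  # prints "ok" after a few jittered retries

      At scale this is usually paired with a circuit breaker or load shedding so callers stop retrying entirely once the dependency is clearly down.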

  • Reading through the summary, my thought is that the kind of stuff Amazon (and others) do is some really complicated shit. I can't begin to imagine how the people involved learn, design, operate and manage all that shit. I mean....damn! No wonder they make the big bucks.

    • It's not that complicated. When you run a system like this, at every single point, you ask yourself, "What if this goes wrong? What will happen? How can I make it fail safe?"

      • I bet that you couldn't design, build, deploy and manage it.

        • It's a pretty safe bet that no one person could do all the things when the system is this big and complicated, so that doesn't prove anything.

          Also, Amazon apparently can't either.

        • Indeed. I do not have the resources of one of the largest companies in the world. I still know they fucked up badly in ways that any serious engineer should be able to handle, let alone teams of thousands of well-paid engineers.

          I have failsafes at my office. Power goes out? We have emergency lights and lanterns. Forgot my computer at home? I have a spare there, with needed data in the cloud. UPSs for extra runtime. Google goes down? I use Thunderbird, so I can get emails not sent last minute. Int

  • by Tony Isaac ( 1301187 ) on Friday October 24, 2025 @06:26PM (#65748796) Homepage

    If builders built buildings like programmers write programs, the first woodpecker to come along would destroy civilization.

    -- Gerald Weinberg

  • Despite the title (Score:5, Insightful)

    by Tony Isaac ( 1301187 ) on Friday October 24, 2025 @06:33PM (#65748806) Homepage

    This was *not* a single point of failure. This was a failure *cascade*. Once the first failure was remedied, the other downstream failures were still far from solved.

    Those downstream failures were separate failures all their own. Sure, their failure was triggered by the original DNS issue, but if those downstream services had been written in a more resilient way, the DNS issue wouldn't have resulted in long, drawn-out failures of those systems (like ECS for example).

  • by rabun_bike ( 905430 ) on Friday October 24, 2025 @07:29PM (#65748878)
    The classic race condition occurs within a single process on a single machine, with multiple threads accessing the same data, usually stored in memory. The DNS Enactor and DNS Planner for DynamoDB are likely part of a distributed system, which means they can be deployed on separate machines to handle different tasks efficiently. And the article describes each as having its own cache of the same data. This just sounds like a really bad design with poor timing control more than anything: the components work on their own copies of stale data, resulting in erroneous decisions and runaway processing from subsequent components that are not coordinating their efforts and just assume the first process will complete before the second one gets started. Bad design.
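
    To illustrate the kind of coordination that seems to be missing, here is a minimal, hypothetical sketch of a version-checked write: a component that acted on a stale copy gets its write rejected and must re-read, instead of silently overwriting newer state. The names are made up; this is not a claim about Amazon's actual design.

import threading

class VersionedStore:
    """Tiny compare-and-set store: every write must name the version it read."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value, self._version

    def write_if_unchanged(self, new_value, expected_version):
        with self._lock:
            if self._version != expected_version:
                return False  # someone else wrote first: caller must re-read, not clobber
            self._value = new_value
            self._version += 1
            return True

store = VersionedStore({"plan": 1})
value, version = store.read()
# ... slow work on what may by now be a stale copy ...
if not store.write_if_unchanged({"plan": 2}, version):
    # Our copy was stale: re-read and recompute before trying again.
    value, version = store.read()

    DynamoDB itself offers conditional writes that express the same idea; the point is that each component's update carries enough information for the system to detect that it was computed from stale data.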
    • And these types of people run the world.

    • And people blame DNS, but it has nothing to do with DNS; as you say, it's their mechanism that is broken. I'm a fan of text-based zone files, but I guess they wouldn't scale fast enough for their needs. Nevertheless they need to redesign it.
    • by Mr. Barky ( 152560 ) on Saturday October 25, 2025 @01:05AM (#65749314)

      I am sure there is some formal definition of a "race condition", but I don't think it requires a single multi-threaded machine. Perhaps it is "classic" in the sense of what is often used as an example of a race condition. My informal definition of a race condition: you have data that two different parts of the system need to modify and if it is done simultaneously it will result in an incorrect computation.

      A race condition can happen at any level - your scenario within a single multi-threaded process, with multiple single-threaded processes on the same machine, or across multiple machines. The key is that the computation depends upon shared data that one instance modifies, resulting in an incorrect computation in the other instance.

      Just yesterday, I was looking into exactly this sort of bug (fortunately, early design phase... not production). Two processes (potentially on separate machines) were doing a SQL Get, doing a computation and updating and/or inserting data. The computation is based upon a shared state and if one process writes data, it results in an invalid computation. It wouldn't matter if there were two threads of the same process or if it were two machines. In this case the "cache" was a short-lived in-memory representation of the data which "should be" invalidated by the other calculation. The "classic" race condition usually also has this sort of cache (often called local variables). Caches exist at all levels.

      This sort of bug exists just about everywhere - it really is difficult to make sure all computations are done atomically, especially in distributed systems. It often isn't even obvious that the two systems are coupled. Shared state is just about everywhere - and there will be bugs related to the use of that shared state just about everywhere (but often the circumstances to trigger those bugs are sufficiently rare that they are never seen - or even if they are seen, the consequences are small enough that nobody looks into it).

      There are likely daily bugs that you experience where a program or service does odd things related to race conditions (I know there are also plenty of odd things that happen related to other sorts of bugs too).
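
      For the SQL get/compute/update scenario described above, a common fix is optimistic locking (sketched here with made-up table names, using SQLite for brevity): the UPDATE only succeeds if the row still carries the version that was read, so a concurrent writer forces a re-read instead of a silent lost update.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()

def add_interest(conn, account_id, rate):
    """Read-compute-update guarded by a version column (optimistic locking)."""
    while True:
        balance, version = conn.execute(
            "SELECT balance, version FROM account WHERE id = ?", (account_id,)
        ).fetchone()
        new_balance = round(balance * (1 + rate))  # computation on a local, possibly stale copy
        cur = conn.execute(
            "UPDATE account SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",        # matches nothing if someone wrote in between
            (new_balance, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return new_balance                     # our update was based on fresh data
        # rowcount == 0: the row changed underneath us; loop, re-read, and recompute

print(add_interest(conn, 1, 0.05))  # 105

      Serializable transactions or row locks (SELECT ... FOR UPDATE, where the database supports it) get you the same guarantee; either way, the read and the write have to be treated as one atomic step.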

      • I have one further comment. Modern software stacks attempt to abstract away the datastore as if this sort of detail doesn't matter (Hibernate, Entity Framework, ...). By default you don't see what transactions are going on and not seeing them means not thinking about the implications. This applies to the particular problem I was looking at - but is much more general. It inevitably leads to many data concurrency problems. You can be more explicit with these frameworks, but it isn't the default - and the defa

  • Wow, I'm on AWS and had zero downtime, apparently because I'm on a plain-vanilla Lightsail VPS. It's a tiny prod installation, but apparently too small to fail.

    Maybe something positive to be said about "old ways" if you don't have to have instant scale...

  • Not that this is a surprise. Was "AI" involved in the coding?

  • They can't foresee everything. And at the scale AWS operates, the things they miss will have massive impacts.

    This is why when we originally looked into Route 53, and found that it only did localised routing, we went with another DNS provider that could do geographic DNS resolution, failover between regions, and automatically rerouted traffic to different AWS regions when something went wrong. We tried to get AWS to see the value in having a similar DNS setup, but they seemed to believe that there would neve

  • Weren't they boasting about AI writing 75% of their code shortly before this? They don't seem to be as quick to assign it blame (or even state that it wasn't involved).
  • ...is that when people say this, the subtext usually is, "...but if we ever throw up our dog food, we expect someone else to grab the mop". AWS customers have long borne the brunt of this for AWS services that rely on other AWS services. Did service X fail because service Y that it uses suffered a rate limit or some other outage? "Sorry, customer, to hear about YOUR problem. Here are some suggestions to fix YOUR problem." Blah blah blah... Now AWS has done it to themselves. There needs to be paradigm
