A Single Point of Failure Triggered the Amazon Outage Affecting Millions (arstechnica.com)
An anonymous reader quotes a report from Ars Technica: The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon's sprawling network, according to a post-mortem from company engineers. [...] Amazon said the root cause of the outage was a bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. The bug was a race condition, an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers' control. The result can be unexpected behavior and potentially harmful failures.
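To make that definition concrete, here is a toy sketch in Python (not Amazon's code, just the general concept): two threads perform an unsynchronized read-modify-write on shared state, so the final result depends on how the scheduler happens to interleave them.

# Toy illustration of a race condition: two threads do a read-modify-write
# on shared state, and the outcome depends on thread interleaving.
import threading

counter = 0

def bump(times: int) -> None:
    global counter
    for _ in range(times):
        current = counter      # read
        current += 1           # modify
        counter = current      # write -- another thread may have written meanwhile

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000, but lost updates often make the result smaller.
print(counter)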
In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it "experienced unusually high delays needing to retry its update on several of the DNS endpoints." While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB service. [...] The failure caused systems that relied on DynamoDB in Amazon's US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.
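A much-simplified sketch of that failure mode (class and method names here are invented for illustration, not Amazon's actual design): two enactors race to apply DNS plans, and a generation check is one way to keep a delayed enactor from overwriting a newer plan with a stale one.

# Simplified sketch of the stale-plan race described above, not Amazon's code.
from dataclasses import dataclass

@dataclass
class DnsPlan:
    generation: int          # monotonically increasing plan number
    records: dict            # endpoint name -> IP addresses

class Endpoint:
    def __init__(self) -> None:
        self.applied_generation = -1
        self.records: dict = {}

    def apply_plan(self, plan: DnsPlan) -> bool:
        # Guard against the race: refuse to apply a plan older than
        # the one already in effect.
        if plan.generation <= self.applied_generation:
            return False     # stale plan from a slow enactor, ignore it
        self.applied_generation = plan.generation
        self.records = plan.records
        return True

endpoint = Endpoint()
new_plan = DnsPlan(generation=42, records={"dynamodb.us-east-1": ["10.0.0.7"]})
stale_plan = DnsPlan(generation=41, records={"dynamodb.us-east-1": ["10.0.0.3"]})

assert endpoint.apply_plan(new_plan)          # fast enactor applies the latest plan
assert not endpoint.apply_plan(stale_plan)    # slow enactor's stale plan is rejected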
The damage resulting from the DynamoDB failure then put a strain on Amazon's EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through what engineers described as a "significant backlog of network state propagations" that needed to be processed. The engineers went on to say: "While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation." In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS functions affected included creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches (which in turn affected services such as Managed Workflows for Apache Airflow), as well as Outposts lifecycle operations and the AWS Support Center. Amazon has temporarily disabled its DynamoDB DNS Planner and DNS Enactor automation globally while it fixes the race condition and adds safeguards against incorrect DNS plans. Engineers are also updating EC2 and its network load balancer.
Further reading: Amazon's AWS Shows Signs of Weakness as Competitors Charge Ahead
It is always DNS (Score:5, Insightful)
Subject says it all
Re:It is always DNS (Score:4, Funny)
Anger: "Why doesn't DNS work?!"
Bargaining: "Maybe if I reload the network interface things will start working again."
Depression: "Where do I even begin troubleshooting? I can't believe this is happening to me."
Acceptance: "It was a faulty DNS record."
Re: (Score:2)
It's the lupus of the tech world.
xkcd (Score:1)
Race condition at a single point of failure... (Score:2)
...sounds like the Exercise #1 in the first tutorial of my Operating Systems class in 1982.
I think I still have the notes in a 3-ring if Amazon needs them.
Re: (Score:2, Funny)
It's great that you can remember every example you've read in your life in a way that precludes you from ever repeating that mistake elsewhere. Unfortunately the rest of us are human beings.
Re: (Score:1)
As I've told all junior devs for decades, race conditions at scale are never "if"s but "when"s... they must be solved correctly /before/ deploying something at scale.
Polling with retries is a recipe for a cascade of events. I've seen logging systems DOS the entire system because one system failed and the retry code path spammed the logger, and when that code is present at scale it's a huge multiplier.
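A sketch of the retry point (a hypothetical helper, not any particular AWS client): naive immediate retries multiply load on an already-struggling dependency, while capped exponential backoff with jitter spreads the retries out so thousands of clients don't hammer it in lockstep.

# Capped exponential backoff with full jitter, as a contrast to naive retries.
import random
import time

def call_with_backoff(operation, max_attempts: int = 5,
                      base_delay: float = 0.1, max_delay: float = 5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random time up to the capped exponential bound,
            # so retries from many clients do not arrive in a burst.
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))

# Usage (hypothetical flaky dependency):
# result = call_with_backoff(lambda: flaky_logger.send("event"))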
Complicated Shit (Score:1)
Reading through the summary, my thought is that the kind of stuff Amazon (and others) do is some really complicated shit. I can't begin to imagine how the people involved learn, design, operate and manage all that shit. I mean....damn! No wonder they make the big bucks.
Re: (Score:2)
It's not that complicated. When you run a system like this, at every single point, you ask yourself, "What if this goes wrong? What will happen? How can I make it fail safe?"
Re: (Score:1)
I bet that you couldn't design, build, deploy and manage it.
Re: (Score:2)
It's a pretty safe bet that no one person could do all the things when the system is this big and complicated, so that doesn't prove anything.
Also, Amazon apparently can't either.
Re: (Score:2)
Indeed. I do not have the resources of one of the largest companies in the world. I still know they fucked up badly in ways that any serious engineer should be able to handle, let alone teams of thousands of well-paid engineers.
I have failsafes at my office. Power goes out? We have emergency lights and lanterns. Forgot my computer at home? I have a spare there, with needed data in the cloud. UPSs for extra runtime. Google goes down? I use Thunderbird, so I can get emails not sent last minute. Int
If builders built buildings like programmers (Score:4, Informative)
If builders built buildings like programmers write programs, the first woodpecker to come along would destroy civilization.
-- Gerald Weinberg
Despite the title (Score:5, Insightful)
This was *not* a single point of failure. This was a failure *cascade*. Once the first failure was remedied, the other downstream failures were still far from solved.
Those downstream failures were separate failures all their own. Sure, their failure was triggered by the original DNS issue, but if those downstream services had been written in a more resilient way, the DNS issue wouldn't have resulted in long, drawn-out failures of those systems (like ECS for example).
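One common pattern for the kind of resilience the parent describes is a circuit breaker: after repeated failures, stop hammering the broken dependency and fail fast until a cool-down expires, so the outage doesn't drag out. A minimal sketch with illustrative names only, not how any particular AWS service is built:

# Minimal circuit breaker: fail fast while a dependency is known to be down.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (calls allowed)

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cool-down elapsed, probe the dependency again
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result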
Re: (Score:2)
Yes, it was bad engineering every step of the way.
Not a classic race condition (Score:5, Informative)
Re: (Score:2)
And these types of people run the world.
Re: Not a classic race condition (Score:2)
Re:Not a classic race condition (Score:4, Insightful)
I am sure there is some formal definition of a "race condition", but I don't think it requires a single multi-threaded machine. Perhaps it is "classic" in the sense of what is often used as an example of a race condition. My informal definition of a race condition: you have data that two different parts of the system need to modify and if it is done simultaneously it will result in an incorrect computation.
A race condition can happen at any level - your scenario is within a single multi-threaded process, but one can also happen with multiple single-threaded processes on the same machine or on multiple machines. The key is that the computation depends upon shared data that one instance modifies, resulting in an incorrect computation on the other instance.
Just yesterday, I was looking into exactly this sort of bug (fortunately, early design phase... not production). Two processes (potentially on separate machines) were doing a SQL Get, doing a computation and updating and/or inserting data. The computation is based upon a shared state and if one process writes data, it results in an invalid computation. It wouldn't matter if there were two threads of the same process or if it were two machines. In this case the "cache" was a short-lived in-memory representation of the data which "should be" invalidated by the other calculation. The "classic" race condition usually also has this sort of cache (often called local variables). Caches exist at all levels.
This sort of bug exists just about everywhere - it really is difficult to make sure all computations are done atomically, especially in distributed systems. It often isn't even obvious that the two systems are coupled. Shared state is just about everywhere - and there will be bugs related to the use of that shared state just about everywhere (but often the circumstances to trigger those bugs are sufficiently rare that they are never seen - or even if they are seen, the consequences are small enough that nobody looks into it).
There are likely daily bugs that you experience where a program or service does odd things related to race conditions (I know there are also plenty of odd things that happen related to other sorts of bugs too).
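A minimal sketch of the read-modify-write hazard described a few paragraphs up, using sqlite3 for brevity and an invented table: an optimistic version check turns a silent lost update into a detectable conflict the caller can retry.

# "SQL Get" + compute + conditional write, guarded by a version column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (id INTEGER PRIMARY KEY, value INTEGER, version INTEGER)")
conn.execute("INSERT INTO counters VALUES (1, 0, 0)")
conn.commit()

def add_with_version_check(conn, delta: int) -> bool:
    # Read the current value and its version.
    value, version = conn.execute(
        "SELECT value, version FROM counters WHERE id = 1").fetchone()
    new_value = value + delta           # the computation based on shared state
    # Conditional write: only succeeds if nobody else wrote in between.
    cur = conn.execute(
        "UPDATE counters SET value = ?, version = version + 1 "
        "WHERE id = 1 AND version = ?", (new_value, version))
    conn.commit()
    return cur.rowcount == 1            # False => conflict, caller should retry

print(add_with_version_check(conn, 5))  # True when no concurrent writer interferes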
Re: (Score:2)
I have one further comment. Modern software stacks attempt to abstract away the datastore as if this sort of detail doesn't matter (Hibernate, Entity Framework, ...). By default you don't see what transactions are going on and not seeing them means not thinking about the implications. This applies to the particular problem I was looking at - but is much more general. It inevitably leads to many data concurrency problems. You can be more explicit with these frameworks, but it isn't the default - and the defa
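Continuing the sqlite3 sketch above (same invented counters table), one way to be explicit rather than relying on a framework's default transaction behavior is to open a write transaction before the read, so the whole read-modify-write is atomic. A sketch under those assumptions, not any particular ORM's API:

# Explicit transaction: take the write lock up front, then read and update.
import sqlite3

def add_in_explicit_transaction(conn: sqlite3.Connection, delta: int) -> None:
    conn.isolation_level = None                 # manage transactions by hand
    conn.execute("BEGIN IMMEDIATE")             # grab the write lock before reading
    try:
        (value,) = conn.execute(
            "SELECT value FROM counters WHERE id = 1").fetchone()
        conn.execute("UPDATE counters SET value = ? WHERE id = 1",
                     (value + delta,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise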
AWS customer that stayed up (Score:2)
Wow, I'm on AWS and had zero downtime, apparently because I'm on a plain-vanilla Lightsail VPS. It's a tiny prod installation, but apparently too small to fail.
Maybe something positive to be said about "old ways" if you don't have to have instant scale...
Re: (Score:2)
Old ways of using /etc/hosts rather than DNS for infra may have helped too, perhaps.
So incompetence (Score:2)
Not that this is a surprise. Was "AI" involved in the coding?
Small issues have big impacts at AWS scale (Score:2)
They can't foresee everything. And at the scale AWS operates, the things they miss will have massive impacts.
This is why when we originally looked into Route 53, and found that it only did localised routing, we went with another DNS provider that could do geographic DNS resolution, failover between regions, and automatically rerouted traffic to different AWS regions when something went wrong. We tried to get AWS to see the value in having a similar DNS setup, but they seemed to believe that there would neve
AI involvement (Score:2)
the danger of eating your own dog-food... (Score:2)
Soooo no load testing was done then (Score:1)
Has Amazon recently hired Microsoft engineers? (Score:1)