Amazon EBS Failure Brings Down Reddit, Imgur, Others
Several readers have sent word of a significant Amazon EBS outage. Quoting:
"Amazon Web Services has confirmed that its Elastic Block Storage (EBS) service is experiencing degraded service, leading sites across the Internet to experience downtime, including Reddit, Imgur and many others. AWS confirmed on its status page at 2:11 p.m. ET that it is experiencing 'degraded performance for a small number of EBS volumes.' It says the issue is restricted to a single Availability Zone within the US-East-1 Region, which is in Northern Virginia. AWS later reported that its Relational Database Service (Amazon RDS) and its Elastic Beanstalk application platform also experienced failures on Monday afternoon."
Other Victims (Score:5, Informative)
Single AZ my butt (Score:3, Informative)
We are seeing EBS problems across multiple AZs with our services, as are many others. Amazon is downplaying the issue.
See HN for ongoing discussion as well: http://news.ycombinator.com/
Same region as the storm in June (Score:5, Informative)
Bad luck if you're hosted in the US-East-1 Region [amazon.com], I guess.
Heh, I should really start advertising the LVS clusters I tend to as 'private clouds with better uptime than Amazon'.
Re:Low Availability? (Score:4, Informative)
>Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage
They do. It's a multi-AZ outage, despite what Amazon is saying.
Re:Interestingly enough... (Score:4, Informative)
All of those things were done here before they were done at reddit. You might want to get a new prescription for your rose colored glasses.
Re:Low Availability? (Score:2, Informative)
Re:Same region as the storm in June (Score:4, Informative)
Desk phones and SIP clients out for 2.5 hours for me. Calls rolled over at the provider level like they were supposed to though. Didn't think I'd have to put that to the test so soon.
The server qualifies for the free tier, and that's probably why it just went straight unresponsive for two hours. Maybe I should upgrade to a slightly larger paid/reserved instance and..... Wait, I smell conspiracy.
Re:Low Availability? (Score:4, Informative)
Multi AZ IS "completely geographically separate zones" and yes, you can specifically define which ones.
Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East)
I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.
Availability Zones are not geographically separate - regions are:
http://aws.amazon.com/ec2/#features [amazon.com]
Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries
Re:multi AZ? (Score:4, Informative)
If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...
Here's your answer: cascading failures [wikipedia.org].
In short, cascading failures don't happen because one local failure causes the capacity of the entire network to be exceeded. It is not a case of every node being connected to every other node (O(N^2) connections), so a failure only needs to overload the capacity of the nodes connected to the failing one...
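To make that concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the ring topology, the loads, the capacities, the names `ring` and `simulate_cascade`), and it has nothing to do with how EBS actually works internally. It just assumes that a failed node dumps its load evenly onto whichever of its neighbors are still alive, and that a node dies as soon as its load goes over its capacity.

```python
from collections import deque


def ring(n):
    """Hypothetical topology: n nodes in a ring, each linked to 2 neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_cascade(neighbors, load, capacity, first_failure):
    """Return the set of failed nodes once the cascade settles.

    Assumed model: a failed node's load is split evenly among its
    still-alive neighbors; a node fails when its load exceeds its capacity.
    """
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = {first_failure}
    queue = deque([first_failure])
    while queue:
        node = queue.popleft()
        alive = [n for n in neighbors[node] if n not in failed]
        if not alive:
            continue  # nobody left to take over this node's load
        share = load[node] / len(alive)
        for n in alive:
            load[n] += share
            if load[n] > capacity[n]:
                failed.add(n)
                queue.append(n)
    return failed


nodes = ring(6)
capacity = {i: 100 for i in nodes}

# Lightly loaded: the two neighbors absorb the failure and it stops there.
light = simulate_cascade(nodes, {i: 60 for i in nodes}, capacity, 0)

# Heavier load: total load (480) is still below total capacity (600),
# yet the neighbors get overloaded one after another.
heavy = simulate_cascade(nodes, {i: 80 for i in nodes}, capacity, 0)

print(sorted(light))
print(sorted(heavy))
```

In the second run the network as a whole has spare capacity on paper, but the load can only move to adjacent nodes, so each failure overloads the next neighbor in line.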