Amazon EC2 Failure Post-Mortem
CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting:
"At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
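The failure mode in the quote can be sketched with some back-of-the-envelope arithmetic. This is only an illustration: the link names and every capacity and traffic figure below are hypothetical, since the post-mortem excerpt gives no numbers.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    capacity_gbps: float  # hypothetical figure, not taken from the post-mortem


def utilization(link: Link, offered_gbps: float) -> float:
    """Offered load as a fraction of capacity; anything above 1.0 is overload."""
    return offered_gbps / link.capacity_gbps


# Hypothetical topology: the other redundant router on the primary network,
# and the lower-capacity secondary EBS network.
primary_router_b = Link("primary-router-b", 100.0)
secondary_network = Link("secondary-ebs-network", 10.0)

# Hypothetical amount of traffic shifted off the router being upgraded.
offered = 60.0

intended = utilization(primary_router_b, offered)  # where the traffic should have gone
actual = utilization(secondary_network, offered)   # where it was actually sent

print(f"intended target: {intended:.0%} utilized")
print(f"actual target:   {actual:.0%} utilized")
```

With these made-up numbers the intended target stays well under capacity while the secondary network is offered several times what it can carry, which matches the quote's description of nodes ending up with no functioning primary or secondary network.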
Re: (Score:3)
Unlike some other company... *cough* Sony *cough*
Re: (Score:3)
Re: (Score:1)
Re: (Score:2)
It was good that they were forthcoming, as competitors are both breathing down their necks, and also looking at their own infrastructure for possible race conditions that would crater post-failure storage isolation(s).
They also admitted but don't seem to get the message that their focus has been on developing novel customer solutions-- NOT keeping the core infrastructure bulletproof. Loose-and-fast rather than unrelenting QA will cause Amazon a lot of pain; it'll be hard to trust them until they can prove t
Re: (Score:2)
Re: (Score:2)
Or Google...
I realise this is "News for Nerds"... (Score:4, Funny)
But can I get an understandable car analogy here?
Re:I realise this is "News for Nerds"... (Score:5, Informative)
Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.
Re:I realise this is "News for Nerds"... (Score:5, Funny)
Re: (Score:1)
Re: (Score:2)
There's so much wrong with this post, I don't even know where to begin, but what I really want to know is... ...there's a non-US version of baseball?!?
Re: (Score:2)
Re: (Score:1)
Re: (Score:2)
I suspect the comment was referring to simple classical mechanical systems. I do find it fascinating that on a windy day at the beach, I can throw a baseball well over a hundred feet and have the other person catch it without needing to move (and I'm not particularly coordinated). Granted, with such lax accuracy, the relevant calculations aren't too tricky, but I still find it neat that humans (and other animals) have
Re: (Score:2)
Yes, and I learned so much about city planning from Sim City. I should put that on my resume and apply for a civil engineering job.
Come off it. You can't just look at a piece of metal and know what its tensile strength is, how much load it can withstand, or how load over time will affect it. You have to know the composition of the metal, and all the varying factors that affect it, to calculate those things.
Re: (Score:2)
Damn that's close. It's freaky how almost anything can be expressed as a car analogy.
Re: (Score:2)
Re: (Score:2)
Stickers make it go faster and gets girls.
Red paint makes it go faster and gets girls.
Neon lights and spinner hubcaps make it go faster and gets girls.
Blue lenses to make your headlights into fakey HID lights make it go faster and gets girls.
Slicked-back hair and sunglasses at night make it go faster and gets girls.
Chopping off two coils from the factory springs makes it go faster and gets girls.
Wings and spoilers and air dams and side skirts make it go faster and gets girls.
Replacing a hood latch wi
Re: (Score:2)
Re: (Score:2)
You mean they shut down the tubes and shit got clogged?
Re:I realise this is "News for Nerds"... (Score:4, Funny)
Instead of the usual commuter rail line, we've had to do some maintenance causing us to provide a single Yugo as transport for the NY morning rush.
After packing 25 angry commuters into the Yugo we left a few hundred thousand stranded on the platform, ping-ponging between the parking lot and home, completely confused how they would get to work.
In addition to that, unfortunately the Yugo couldn't handle the added weight of the passengers and the leaf springs shattered all over the ground. So the 25 passengers we initially planned for were left trapped, to die, inside of the disabled Yugo. They all starved in the days it took us to realize the Yugo never left the station parking lot.
We are sorry for any inconvenience this may have caused and have upgraded to AAA Gold status to prevent any further future disruptions. This will ensure that at least 25 people will actually reach their destinations should this occur again, though they may need to ride on a flat-bed to get there.
Re: (Score:2)
That analogy is just like Irwin Allen's movie "The Towering Inferno."
Re: (Score:1)
Re: (Score:3)
A classic Dilbert might be useful here:
http://dilbert.com/strips/comic/1995-02-26/ [dilbert.com]
Voltron (Score:2)
But can I get an understandable car analogy here?
15 cars tried to transform into Voltron [wikia.com] but instead turned into Snarf [cracked.com].
Re: (Score:2)
Because the buses were running late, everyone decided to take their own car to work. This further increased the amount of traffic on the tiny road.
The cops figured out that everyone was on the wrong road, and diverted traffic onto another freeway. However, by this point everyone was already taking their cars, so diverting to the other freeway didn't completely fix the problem.
All this traffic indirectly caused
Re: (Score:1)
That doesn't explain anything (Score:1)
That only explains the loss in availability of the AWS service. It in no way explains why the data is destroyed and unrecoverable.
Re: (Score:2)
Re: (Score:3)
"At 12:30 PM PDT on April 24, we had finished the volumes that we could recover in this way and had recovered all but 1.04% of the affected volumes. At this point, the team began forensics on the remaining volumes which had suffered machine failure and for which we had not been able to take a snapshot. At 3:00 PM PDT, the team began restoring these. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state."
AOL's 19-hour outage (Score:1)
Who else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?
Re: (Score:2)
No one. No one else remembers AOL.
Re: (Score:2)
Re: (Score:2)
I was a quite happy Compuserve user until AOL took it over and destroyed it. (OK, CIS was in slow decline at the time too. But that decline became a nose dive when AOL took over.)
There - what was difficult or embarrassing about that?
Isn't the point of a secondary network... (Score:1)
Re: (Score:1)
I think the secondary network is used to deal with the little overage you get at peak times.
For example, most of the time a T1 may be fine and can handle the bandwidth, but sometimes your backup ISDN comes up when the bandwidth is a little more than a T1 alone can hold.
Re: (Score:2)
... to be able to handle loads if the primary fails?
No it's so marketing can make redundancy claims for 1/100 the cost of true redundancy.
Re: (Score:2, Informative)
No. That's the point of the redundant elements and backup of the primary network.
The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.
Re: (Score:2)
No. That's the point of the redundant elements and backup of the primary network.
The secondary network they routed traffic to was designed for a different purpose,
and never meant to receive traffic from the primary network.
For example, management, monitoring, and logging traffic.
Amazon issues 10-day service credit (Score:5, Interesting)
Dear AWS Customer,
Starting at 12:47AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
http://aws.amazon.com/message/65648 [amazon.com]
We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.
Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.
Sincerely,
The Amazon Web Services Team
This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue
North, Seattle, Washington 98109-5210
Re: (Score:2)
Nice. You know what Sony offered me for disclosing all of my information?
They sent me an e-mail which pointed out that, by law, I can get one free credit report per year, and they encouraged me to take advantage of that to look for any fraudulent activity.
Re: (Score:2)
I don't know what bothers me more: the outage itself or the alternative codes Amazon used for punctuation in that email that made my post look messed up only after I posted it.
Re: (Score:1)
But yeah, Sony is lame. I had a DVD drive for just over a year when it went out. Wouldn't do anything about it. I won't buy Sony again if there's any other alternative out there.
Amazon EC2 outage analysis (Score:2)
"Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history .. I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there" link [rightscale.com]
The Cloud (Score:2)
So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet barons because they know what is best?
Re: (Score:2)
So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet barons because they know what is best?
How does this event lead to the conclusion that 'the promise of the cloud is a lie'? Be specific.
Re: (Score:2)
The marketing promise, that is. We know it because according to the hype, the cloud means you are NEVER down and your data is ALWAYS safe.
I'm sure there will be a few "no true cloud" marketing fallacies running about though.
Re: (Score:1)
"The Cloud" has always been nothing more than marketing buzz. All "The Cloud" amounts to is physical servers running a hypervisor, with your machine instances running as VMs.
There are still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."
This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.
Re: (Score:2)
This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.
But profitable....
Re: (Score:2)
Actually, that's the whole point of doing things "in the cloud" versus just us
Re: (Score:2)
Out of curiosity, have you used Amazon EC2 services? You manage your hosts exactly like you would in a dedicated hosting environment, you even remote desktop into Windows hosts or SSH into Linux hosts. Unless you can use your existing server as a template and clone away you're going to need to do a lot of fine tuning and that's assuming your webapp can handle that as load balancing and others are all addons which can easily end up costing quite a bit.
I tried out EC2 for my last large web event knowing I ne
At least they admit it (Score:5, Insightful)
Re:At least they admit it (Score:5, Insightful)
We all benefit from these kinds of disclosures, I remember Google posting post-mortem analyses of some of their failures. Even Microsoft provided information on their Sidekick meltdown. This does seem to be the 'typical' melange of a human error and cascading consequences.
Someone first said, "You learn much more from failure than you do from success." If nothing else, it's the thesis of the classic Petroski book, "To Engineer is Human: The Role of Failure in Successful Design" http://www.amazon.com/Engineer-Human-Failure-Successful-Design/dp/0679734163 [amazon.com] (If you haven't read this, you should!!)
And I'm also reminded of a core principle from safety critical system design, that you cannot provide 100% safety. The best you can do is a combination of probabilistic analysis against known hazards. As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 10^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky." That kind of reasoning also applies to the current Japanese nuke plant failure...
Re: (Score:2)
Re: (Score:2)
So how intense an earthquake, at what distance from the plant, and how high a tsunami should we plan for next time???
Re: (Score:2)
p.s. Wikipedia says there are 923 B777's out there, about 2x the number of nuke plants. http://en.wikipedia.org/wiki/Boeing_777 [wikipedia.org]
Re: (Score:2, Informative)
You have not taken a statistics course, have you? You can have one airplane and have it fall out of the sky, or you can have 1,000,000 and never have one crash, and both systems could have 9 9's of safety. This is the risk of failure; it isn't destiny.
So to combat the FUD.
1. So far the death toll from the event is 18000. Death toll from radiation so far 0.
2. The nuclear plant didn't cause the disaster the earthquake and tsunami that followed it did.
3. People died in cars, buildings, on the street, and in a dam that also coll
Re: (Score:2)
As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. chance of failure 10^-9, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky."
This doesn't even make any sense -- what am I missing? A 50-50 chance of falling out of the sky? I'm assuming that's hyperbole, but I'm not grasping the concept here.
For what it's worth, the wiki article (linked in another post) indicates the 777 has been involved in 7 "incidents", although the only fatality was a ground crew worker who suffered fatal burns during a refueling fire.
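One way to read the engineer's "50-50" remark is as a statement about the whole fleet over its whole service life, not about any one flight. The sketch below assumes that "9 9's" means a constant rate of 10^-9 catastrophic failures per flight hour and that failures are independent (a Poisson model); both are assumptions, and the fleet figures are hypothetical.

```python
import math

# Assumed reading of "9 9's": 10^-9 catastrophic failures per flight hour.
FAILURE_RATE_PER_HOUR = 1e-9


def p_at_least_one(fleet_hours: float, rate: float = FAILURE_RATE_PER_HOUR) -> float:
    """Probability of at least one failure over the given cumulative flight hours."""
    return 1.0 - math.exp(-rate * fleet_hours)


def hours_for_probability(p: float, rate: float = FAILURE_RATE_PER_HOUR) -> float:
    """Cumulative flight hours at which the chance of at least one failure reaches p."""
    return -math.log(1.0 - p) / rate


# Cumulative hours needed before the odds of at least one failure reach 50-50.
break_even = hours_for_probability(0.5)

# Hypothetical fleet: 1,000 aircraft x 4,000 hours/year x 25 years of service.
hypothetical_fleet_hours = 1_000 * 4_000 * 25
p_fleet = p_at_least_one(hypothetical_fleet_hours)

print(f"break-even fleet hours: {break_even:.3g}")
print(f"hypothetical fleet risk: {p_fleet:.1%}")
```

Under this model the 50-50 point arrives at ln(2) / 10^-9, roughly 6.9 x 10^8 cumulative flight hours, so the quote only holds if the fleet's expected lifetime hours are in that range. The per-flight risk stays tiny either way; it is the sheer number of hours that makes a fleet-wide event plausible.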
Safe, unless something unexpected happens (Score:2)
Re: (Score:2)
I have an account on Playstation Network and they have already sent me a long and thorough e-mail explaining what happened and what the implications are. And since the problem is ongoing, that makes them MORE proactive than Amazon in getting the word out.
Not to mention that Playstation Network is free and any uptime over 0% ought to be considered a bonus. Whereas you are paying for a certain level of service with cloud computing.
Re: (Score:2)
I can't STAND when people say 'it's free, so it's ok if it goes down.' When I purchased a PS3, the PSN was a FEATURE that I considered when I bought it. As such, it's not really "free", it's more like it was wrapped into the MSRP. By your logic, they should be able to take away the entire network for GOOD and everyone should be completely happy. Is this true? Heck, let's start pulling out other features that you go
Re: (Score:2)
Re: (Score:3)
The issue is not uptime. It is the loss of sensitive data. If Sony is holding personal data they have an obligation to protect that data.
Can we get this in non-Amazon speak (Score:4, Interesting)
What is an EBS? Is it really just a Xen or VMWare disk image? Which data center corresponds with each availability zone? What are they using for storage: iSCSI targets on a SAN?
Re: (Score:2)
While that would be nice to know I don't think it's relevant to a postmortem: they described the architectural elements which encountered the failures.
FYI, though, based on what they've said today and in the past: it seems that they are using regular servers with disks rather than a SAN & I believe they use GNBD to connect the EBS server disk images and EC2 instances (rather than iSCSI)
Re: (Score:2)
Amazon EC2 is Xen. The back-end storage for that is ephemeral, in that it goes away when you shut the instance down. So they introduced EBS, which is persistent storage. You can have an EC2 server and mount EBS volumes on it, and those EBS volumes will exist even when the EC2 (Xen) server goes away. You can even mount them on different EC2 (Xen) servers (though not at the same time). Also, today you can have the EC2 server itself run on top of EBS if you want the data on it to stay around after a reboot.
Then there's al
Re: (Score:2)
Re: (Score:2)
They use more than just Xen but they don't really publicize it. With paravirtualization they can use anything they want, but Xen seems to be the most prevalent. Some of my instances say "xen" and others say "paravirtual." Just because the kernel says "xen" or "paravirtual" does not necessarily mean that the hypervisor is Xen or something else.
Also, speaking towards migrating instances between "availability zones," I found out that I cannot use Windows Server EBS boot volumes on anything but the instance
Re: (Score:2)
They used Xen because that's what was mature at the time. In many ways Xen is still more mature than KVM, though it won't be that way for long.
Amazon supplies you a bunch of tools for dealing with the images. They call the "stored" images AMI's. There's a huge list of public AMI's you can choose from and anyone can create their own AMI off pretty much any Linux distribution.
You can snapshot a running instance using the ec2-ami-tools which are installed in your running instances. Using those you can easily c
Re: (Score:2)
Re: (Score:2)
And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.
There are a couple of interesting tools to help you with this. One that's been around for a while, but is not maintained by Amazon, is ec2-consistent-snapshot. It is a tool that automates the process of quickly quiescing your database and filesystem, initiating a snapshot of your volume (or volumes, if you have a RAID array), and then restoring read/write access. It all happens very quickly, so the disruption to your application is short (although the I/O performance of your volume will suffer while the sna
Re: (Score:2)
Re: (Score:2)
EBS which is your high IO filesystem
Damnit, now you owe me a new monitor.
Re: (Score:1)
EBS is basically like iSCSI, but far more complex. There's a lot of proprietary stuff they're doing with it.
Anyone know how it compares in speed to iSCSI or a SAN? From reading the report it sounds like there is A LOT more going on, and I have even heard of people using multiple EBS volumes in a "RAID" like array for faster IO speed. Sounds like way too complex of a system.
Windows Azure Drives are like EBS but they are simply VHD files stored in Page Blobs (Azure's version of cloud storage, similar to Amazon S3) with local caching on each VM instance. I assume they have slower read/write speeds than EBS but se
Data loss? (Score:2)
And HOW THE HELL does such a procedure cause data loss?!
Are those geniuses using the service transfer procedures that do not perform clean transaction handling and instead just send stuff to be copied expecting that it will sync soon enough?
Re: (Score:2)
This should never be possible if transaction model is properly implemented -- data in memory would have to be stored and confirmed to be stored, or transaction should be cleanly reverted before anything is moved.
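The "stored and confirmed to be stored, or cleanly reverted" rule can be sketched as a toy two-phase write. This is a minimal illustration of the idea the comment describes, not a description of how EBS replication actually works; the `Replica` class and `replicated_write` function are hypothetical names.

```python
class Replica:
    """Toy storage node; `healthy=False` simulates a node that cannot accept writes."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = {}     # writes received but not yet confirmed
        self.committed = {}  # writes confirmed on every replica

    def prepare(self, key, value):
        if not self.healthy:
            return False
        self.staged[key] = value
        return True

    def commit(self, key):
        self.committed[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)


def replicated_write(replicas, key, value):
    """Acknowledge a write only if every replica staged it; otherwise revert cleanly."""
    prepared = []
    for replica in replicas:
        if replica.prepare(key, value):
            prepared.append(replica)
        else:
            # One replica refused: undo the staged copies and report failure.
            for done in prepared:
                done.abort(key)
            return False
    for replica in replicas:
        replica.commit(key)
    return True


a, b = Replica("a"), Replica("b")
ok = replicated_write([a, b], "block-1", b"data")

c, d = Replica("c"), Replica("d", healthy=False)
failed = replicated_write([c, d], "block-2", b"data")
```

Here `ok` is true and both `a` and `b` hold `block-1`, while the second write is refused and leaves nothing behind on `c`. A real system also has to handle a node dying between the prepare and commit phases, which this sketch ignores.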
Other outage (Score:2)
I don't buy it (Score:2)
During the whole issue they never posted a cause, and it took them forever to even say 'still investigating'. Even if they have a bare bones monitoring system up, it should have been readily apparent that traffic was flowing over the wrong network.
So they're basically saying if the primary network has issues there's not really a point in the backup because the back
pure and utter BS (Score:1)
Re: (Score:3)
Go and read the entire notice, not just the pathetic snippet a bad submitter used. Makes more sense.
Also, this is a storage network, not an access network. Effectively it's like pulling the SAS cable out of the RAID card while the machines are running.
Re: (Score:2)
Re: (Score:2)
This was a cascade failure that affected multiple systems on multiple layers, with ramping race conditions that worsened over time. The engineer didn't hit the "Enter" key and suddenly the little green light turned red to tell him 1/3 of the grid was down.
Tsunami waves are not higher than 19 feet, (Score:2)
It's nice to see that everyone has the same problem:
There is no approach to identify wrong assumptions.
But what's the conclusion?
Should we stay away from huge systems, because the damage due to a wrong assumption in a huge system is huge?
1965 (Score:2)
Huh. Sounds like a 21st century version of the routing failure that caused the 1965 Northeast blackout, just with data instead of electricity.
http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965 [wikipedia.org]
Circadian rhythms (Score:1)
Re: (Score:1)
If their network guys work like the ones I know, 00:47 is just right before lunch time.
There are human errors, sure, but the worst ones I've seen come from management trying to rush things, so the network guys "just stay until it works", instead of leaving it in a known good state and going to get some rest.
Movie ... (Score:1)
Re: (Score:2)
Cloud computing is a form of shared hosting, just with more encapsulation; Clouds fall over the same way a server can fall over. It's hard to blame "The Cloud" when the reality is the people that were suckered in by obtuse, non-specific marketing are the ones at fault. The argument can even be made that Clouds are worse because instead of many discrete isolated servers you start sharing more single points of failure, which lead
Re: (Score:2)
Have you ever worked in a real environment?
There is ALWAYS a difference between test and production. No matter how many test cases and iterations of changes that you go through, there is always a non-zero percent chance that the change in production will behave differently.
This is why most companies require fall-back procedures for any production change in addition to testing.
It sounds like it may have taken them longer than some might be comfortable to reach the point where they did roll back changes...but