
Amazon EC2 Failure Post-Mortem

CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting: "At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
  • by Haedrian ( 1676506 ) on Friday April 29, 2011 @07:56AM (#35973576)

    But can I get an understandable car analogy here?

    • by MagicM ( 85041 ) on Friday April 29, 2011 @07:59AM (#35973604)

      Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.

      • by RealGene ( 1025017 ) on Friday April 29, 2011 @08:02AM (#35973626)
        ..and according to http://it.slashdot.org/story/11/04/29/0254215/Amazon-EC2-Crash-Caused-Data-Loss [slashdot.org], the DPW mistakenly pushed some of the cars into the old abandoned quarry.
      • Damn that's close. It's freaky how almost anything can be expressed as a car analogy.

        • Yeah. It's kind of like how you can bolt just about any part onto just about any car if you think it through enough.
          • Stickers make it go faster and gets girls.
            Red paint makes it go faster and gets girls.
            Neon lights and spinner hubcaps make it go faster and gets girls.
            Blue lenses to make your headlights into fakey HID lights make it go faster and gets girls.
            Slicked-back hair and sunglasses at night make it go faster and gets girls.
            Chopping off two coils from the factory springs makes it go faster and gets girls.
            Wings and spoilers and air dams and side skirts make it go faster and gets girls.
            Replacing a hood latch wi

          • by BranMan ( 29917 )
            Like a JATO rocket engine. Hey! That's the ticket!
      • You mean they shut down the tubes and shit got clogged?

    • by kingsqueak ( 18917 ) on Friday April 29, 2011 @08:21AM (#35973792)

      Instead of the usual commuter rail line, we've had to do some maintenance causing us to provide a single Yugo as transport for the NY morning rush.

      After packing 25 angry commuters into the Yugo we left a few hundred thousand stranded on the platform, ping-ponging between the parking lot and home, completely confused how they would get to work.

      In addition to that, unfortunately the Yugo couldn't handle the added weight of the passengers and the leaf springs shattered all over the ground. So the 25 passengers we initially planned for were left trapped, to die, inside of the disabled Yugo. They all starved in the days it took us to realize the Yugo never left the station parking lot.

      We are sorry for any inconvenience this may have caused and have upgraded to AAA Gold status to prevent any further future disruptions. This will ensure that at least 25 people will actually reach their destinations should this occur again, though they may need to ride on a flat-bed to get there.

      • by kriston ( 7886 )

        That analogy is just like Irwin Allen's movie "The Towering Inferno."

      • by Xserv ( 909355 )
        I think I just peed in my pants. My cubemates (whom we lovingly refer to as cellmates) poked their heads around the walls wondering what I thought was so funny. Bravo!
    • A classic Dilbert might be useful here:

      http://dilbert.com/strips/comic/1995-02-26/ [dilbert.com]

    • But can I get an understandable car analogy here?

      15 cars tried to transform into Voltron [wikia.com] but instead turned into Snarf [cracked.com].

    • Traffic was diverted from a major highway onto a 2-lane road. This caused the buses to run late.

      Because the buses were running late, everyone decided to take their own car to work. This further increased the amount of traffic on the tiny road.

      The cops figured out that everyone was on the wrong road, and diverted traffic onto another freeway. However, by this point everyone was already taking their cars, so diverting to the other freeway didn't completely fix the problem.

      All this traffic indirectly caused
  • by Anonymous Coward

    That only explains the loss of availability of the AWS service. It in no way explains why the data was destroyed and unrecoverable.

    • It could be that in the process of isolating the problem, they rebooted servers that (due to network problems) may not have been able to fully replicate their local changes.
    • "At 12:30 PM PDT on April 24, we had finished the volumes that we could recover in this way and had recovered all but 1.04% of the affected volumes. At this point, the team began forensics on the remaining volumes which had suffered machine failure and for which we had not been able to take a snapshot. At 3:00 PM PDT, the team began restoring these. Ultimately, 0.07% of the volumes in the affected Availability Zone could not be restored for customers in a consistent state."

  • Who else is reminded of AOL's 19-hour outage in 1996? Routers misconfigured to send data to the wrong place, cascading into failure?

    • No one. No one else remembers AOL.

      • Not that would admit it in public. ;-)
      • I remember AOL. I've still got some of the CDs propping up a broken cupboard somewhere. I remember them with a combination of mild loathing and some contempt.

        I was a quite happy Compuserve user until AOL took it over and destroyed it. (OK, CIS was in slow decline at the time too. But that decline became a nose dive when AOL took over.)

        There - what was difficult or embarrassing about that?

  • ... to be able to handle loads if the primary fails?
    • by Anonymous Coward

      I think the secondary network is used to deal with the small overage you get at peak times.

      Most of the time a T1 may be fine and can handle the bandwidth, but sometimes your backup ISDN line comes up when demand is a little more than the T1 alone can hold.

    • by ae1294 ( 1547521 )

      ... to be able to handle loads if the primary fails?

      No it's so marketing can make redundancy claims for 1/100 the cost of true redundancy.

    • Re: (Score:2, Informative)

      by mysidia ( 191772 ) *

      ... to be able to handle loads if the primary fails?

      No. That's the point of the redundant elements and backup of the primary network.

      The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.

      • by natet ( 158905 )

        ... to be able to handle loads if the primary fails?

        No. That's the point of the redundant elements and backup of the primary network.

        The secondary network they routed traffic to was designed for a different purpose,
        and never meant to receive traffic from the primary network.

        For example, management, monitoring, and logging traffic.

  • by kriston ( 7886 ) on Friday April 29, 2011 @08:22AM (#35973806) Homepage Journal

    Dear AWS Customer,

    Starting at 12:47 AM PDT on April 21st, there was a service disruption (for a period of a few hours up to a few days) for Amazon EC2 and Amazon RDS that primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within our US East Region. You can read our detailed summary of the event here:
    http://aws.amazon.com/message/65648 [amazon.com]

    We've identified that you had an attached EBS volume or a running RDS database instance in the affected Availability Zone at the time of the disruption. Regardless of whether your resources and application were impacted, we are going to provide a 10 day credit (for the
    period 4/18-4/27) equal to 100% of your usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. This credit will be automatically applied to your April bill, and you don't need to do anything to receive it.
    You can see your service credit by logging into your AWS Account Activity page after you receive your upcoming billing statement.

    Last, but certainly not least, we want to apologize. We know how critical the services we provide are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services.

    Sincerely,
    The Amazon Web Services Team

    This message was produced and distributed by Amazon Web Services, LLC, 410 Terry Avenue
    North, Seattle, Washington 98109-5210

    • Nice. You know what Sony offered me for disclosing all of my information?

      They sent me an e-mail which pointed out that, by law, I can get one free credit report per year, and they encouraged me to take advantage of that to look for any fraudulent activity.

      • by kriston ( 7886 )

        I don't know what bothers me more: the outage itself or the alternative codes Amazon used for punctuation in that email that made my post look messed up only after I posted it.

      • Then they're wrong. You can get one free credit report each year from each credit reporting bureau. You don't have to get them all at the same time. Do Experian one time, then a few months later do TransUnion, then a few months later do EquiFax. Rinse and repeat.

        But yeah, Sony is lame. I had a DVD drive for just over a year when it went out. Wouldn't do anything about it. I won't buy Sony again if there's any other alternative out there.

  • "Last Thursday’s Amazon EC2 outage was the worst in cloud computing’s history .. I will try to summarize what happened, what worked and didn’t work, and what to learn from it. I’ll do my best to add signal to all the noise out there" link [rightscale.com]

  • So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet Barons because they know what is best?

    • So we now know that the promise of the cloud is a lie. How long before we get a new buzzword for turning over all of our data to the new Internet Barons because they know what is best?

      How does this event lead to the conclusion that 'the promise of the cloud is a lie'? Be specific.

      • by sjames ( 1099 )

        The marketing promise, that is. We know it because according to the hype, the cloud means you are NEVER down and your data is ALWAYS safe.

        I'm sure there will be a few "no true cloud" marketing fallacies running about though.

    • "The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is is physical servers running a hypervisor and running your machine instances as VMs.
      There are still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."

      This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

      • by ae1294 ( 1547521 )

        This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

        But profitable....

      • by tlhIngan ( 30335 )

        "The Cloud" has always been nothing more than marketing buzz. All "The Cloud" is is physical servers running a hypervisor and running your machine instances as VMs.
        There are still people, switches, routers, firewalls, servers, and storage that are used to build "The Cloud."

        This belief that doing things in "The Cloud" makes them impervious to hardware failure, power outage, network connection drops, etc. has always been misinformed.

        Actually, that's the whole point of doing things "in the cloud" versus just us

        • Out of curiosity, have you used Amazon EC2 services? You manage your hosts exactly like you would in a dedicated hosting environment; you even remote desktop into Windows hosts or SSH into Linux hosts. Unless you can use your existing server as a template and clone away, you're going to need to do a lot of fine tuning, and that's assuming your webapp can handle it; load balancing and other features are all add-ons which can easily end up costing quite a bit.

          I tried out EC2 for my last large web event knowing I ne

  • by jesseck ( 942036 ) on Friday April 29, 2011 @08:27AM (#35973838)
    I commend Amazon for providing us with this information. Yes, bad things happened, and data is gone forever. Amazon knows what happened and why, and I'm sure they will implement controls to prevent this from happening again. I doubt we'll hear as much from Sony, though.
    • by david.emery ( 127135 ) on Friday April 29, 2011 @08:46AM (#35974030)

      We all benefit from these kinds of disclosures. I remember Google posting post-mortem analyses of some of their failures. Even Microsoft provided information on their Sidekick meltdown. This does seem to be the 'typical' melange of a human error and cascading consequences.

      Someone once said, "You learn much more from failure than you do from success." If nothing else, it's the thesis of Petroski's classic book, "To Engineer is Human: The Role of Failure in Successful Design" http://www.amazon.com/Engineer-Human-Failure-Successful-Design/dp/0679734163 [amazon.com] (If you haven't read this, you should!)

      And I'm also reminded of a core principle from safety-critical system design: you cannot provide 100% safety. The best you can do is a combination of probabilistic analyses against known hazards. As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. a chance of failure of 10^-9 per flight hour, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky." That kind of reasoning also applies to the current Japanese nuke plant failure...
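The engineer's 50-50 claim is easy to sanity-check. Assuming a 10^-9 per-flight-hour failure probability and a fleet accumulating on the order of 7×10^8 flight hours (a round illustrative number, not Boeing's actual projection), the cumulative odds work out like this:

```python
p = 1e-9      # assumed per-flight-hour probability of catastrophic failure
hours = 7e8   # assumed cumulative fleet flight hours (illustrative round number)

# Probability of at least one failure somewhere across all fleet hours:
# the complement of every single hour passing without incident.
p_any = 1 - (1 - p) ** hours
print(p_any)  # roughly 0.5
```

With these assumptions the chance of at least one loss over the fleet's lifetime is indeed about even, which is the point: "nine 9's" per hour does not mean "never" at fleet scale.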

      • by ccady ( 569355 )
        No, that kind of reasoning does not apply to nuke plant failures. There are not millions [straightdope.com] of nuke plants running each day. There are only 442 [euronuclear.org] nuke plants. If we cannot secure 442 plants from having disasters, then we need to do something else that does not cause disasters.
        • So how intense an earthquake, at what distance from the plant, and how high a tsunami should we plan for next time???

        • Re: (Score:2, Informative)

          by LWATCDR ( 28044 )

          You have not taken a statistics course, have you? You can have one airplane and have it fall out of the sky, or you can have 1,000,000 and never have one crash, and both systems could have 9 9's safety. This is the risk of failure; it isn't destiny.
          So to combat the FUD:
          1. So far the death toll from the event is 18,000. Death toll from radiation so far: 0.
          2. The nuclear plant didn't cause the disaster; the earthquake and the tsunami that followed it did.
          3. People died in cars, buildings, on the street, and in a dam that also coll

      • by Draknor ( 745036 )

        As a Boeing 777 safety engineer told me, "9 9's of safety, i.e. a chance of failure of 10^-9 per flight hour, applied over the expected flying hours of the 777 fleet, still means a 50-50 chance of an aircraft falling out of the sky."

        This doesn't even make any sense -- what am I missing? A 50-50 chance of falling out of the sky? I'm assuming that's hyperbole, but I'm not grasping the concept here.

        For what it's worth, the wiki article (linked in another post) indicates the 777 has been involved in 7 "incidents," although the only fatality was a ground crew worker who suffered fatal burns during a refueling fire.

      • The nuclear industry claims a chance of major accidents around 1 in 10^7 reactor years, based on this kind of probabilistic analysis. But then we've seen 2 major incidents at western-style nuclear plants (Three Mile Island and Fukushima Daiichi) over a period of about 15,000 reactor-years. The problem is, these studies only account for the risk of simultaneous failures of pre-identified critical components within the engineered system. They don't account for acts of nature or people doing something dumb.
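The parent's numbers make the gap concrete. Taking the figures as stated (2 incidents, roughly 15,000 reactor-years, a claimed 1-in-10^7 rate), the observed rate can be compared to the claim in a few lines:

```python
claimed = 1e-7          # claimed major-accident probability per reactor-year
incidents = 2           # Three Mile Island, Fukushima Daiichi
reactor_years = 15000   # approximate Western operating experience cited above

observed = incidents / reactor_years
print(observed)             # about 1.3e-4 per reactor-year
print(observed / claimed)   # the observed rate exceeds the claim more than a thousandfold
```

Even granting large uncertainty in the inputs, the observed rate and the claimed rate differ by three orders of magnitude, which supports the point that the probabilistic models omit important failure modes.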
    • I doubt we'll hear as much from Sony, though.
      I have an account on Playstation Network and they have already sent me a long and thorough e-mail explaining what happened and what the implications are. And since the problem is ongoing, that makes them MORE proactive than Amazon in getting the word out.
      Not to mention that Playstation Network is free and any uptime over 0% ought to be considered a bonus. Whereas you are paying for a certain level of service with cloud computing.
      • by afex ( 693734 )
        this has gone mildly offtopic, but as a PSN user I just wanted to chime in and say the following...

        I can't STAND when people say "it's free, so it's OK if it goes down." When I purchased a PS3, the PSN was a FEATURE that I considered when I bought it. As such, it's not really "free"; it's more like it was wrapped into the MSRP. By your logic, they should be able to take away the entire network for GOOD and everyone should be completely happy. Is this true? Heck, let's start pulling out other features that you go
        • +1 Insightful. The Playstation would never have acquired the market share it has without PSN. People would have bought something else. It's a major part of the promotion of the product.
      • The issue is not uptime. It is the loss of sensitive data. If Sony is holding personal data they have an obligation to protect that data.

  • by gad_zuki! ( 70830 ) on Friday April 29, 2011 @08:29AM (#35973858)

    What is an EBS? Is it really just a Xen or VMware disk image? Which data center corresponds to each availability zone? What are they using for storage? iSCSI targets on a SAN?

    • by pdbaby ( 609052 )

      While that would be nice to know I don't think it's relevant to a postmortem: they described the architectural elements which encountered the failures.

      FYI, though, based on what they've said today and in the past: it seems that they are using regular servers with disks rather than a SAN & I believe they use GNBD to connect the EBS server disk images and EC2 instances (rather than iSCSI)

    • by Synn ( 6288 )

      Amazon EC2 is Xen. The back-end storage for that is ephemeral, in that it goes away when you shut it down. So they introduced EBS, which is persistent storage. So you could have an EC2 server and mount EBS volumes on it, and those EBS volumes will exist even when the EC2 (Xen) server goes away. You can even mount them on different EC2 (Xen) servers (though not at the same time). Also, today you can have the EC2 server itself run on top of EBS if you want the data on it to stay around after a reboot.

      Then there's al

      • Forgive my ignorance, but why not KVM instead of Xen? And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.
        • by kriston ( 7886 )

          They use more than just Xen but they don't really publicize it. With paravirtualization they can use anything they want, but Xen seems to be the most prevalent. Some of my instances say "xen" and others say "paravirtual," and just because the kernel says "xen" or "paravirtual" does not necessarily mean the hypervisor actually is Xen.

          Also, speaking towards migrating instances between "availability zones," I found out that I cannot use Windows Server EBS boot volumes on anything but the instance

        • by Synn ( 6288 )

          They used Xen because that's what was mature at the time. In many ways Xen is still more mature than KVM, though it won't be that way for long.

          Amazon supplies you a bunch of tools for dealing with the images. They call the "stored" images AMIs. There's a huge list of public AMIs you can choose from, and anyone can create their own AMI off pretty much any Linux distribution.

          You can snapshot a running instance using the ec2-ami-tools which are installed in your running instances. Using those you can easily c

        • And assuming you have a complicated setup with bunches of scripts, mounts, etc, how do you image the entire thing? We have to schedule off-hour downtime to do a snapshot (everything except data) for our internal servers since a new install / config from scratch would take too long for recovery -- but that involves a lot of control that you may or may not have in a "cloud" situation.

          There are a couple of interesting tools to help you with this. One that's been around for a while, but is not maintained by Amazon, is ec2-consistent-snapshot. It is a tool that automates the process of quickly quiescing your database and filesystem, initiating a snapshot of your volume (or volumes, if you have a RAID array), and then restoring read/write access. It all happens very quickly, so the disruption to your application is short (although the I/O performance of your volume will suffer while the sna
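The ordering the parent describes (quiesce, initiate snapshot, resume) can be sketched in a few lines. This is a rough illustration of what a tool like ec2-consistent-snapshot automates, not its real interface: the `take_snapshot` callable and the `run` hook below are placeholders I've introduced.

```python
import subprocess

def consistent_snapshot(mountpoint, take_snapshot, run=subprocess.check_call):
    """Freeze the filesystem, start the snapshot, then thaw.

    `take_snapshot` stands in for the actual EBS snapshot API call.
    Snapshot *initiation* is quick, so the filesystem stays frozen only
    briefly, even though the block copy continues asynchronously after
    the thaw (which is why I/O performance suffers for a while)."""
    run(["fsfreeze", "--freeze", mountpoint])
    try:
        snapshot_id = take_snapshot()
    finally:
        # Always thaw, even if starting the snapshot failed
        run(["fsfreeze", "--unfreeze", mountpoint])
    return snapshot_id
```

A database adds one more layer: you would flush and lock tables before the freeze and unlock after the thaw, which is exactly the choreography these tools exist to get right under time pressure.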

      • EBS which is your high IO filesystem

        Damnit, now you owe me a new monitor.

      • by gbrayut ( 715117 )

        EBS is basically like iSCSI, but far more complex. There's a lot of proprietary stuff they're doing with it.

        Anyone know how it compares in speed to iSCSI or a SAN? From reading the report it sounds like there is A LOT more going on, and I have even heard of people using multiple EBS volumes in a "RAID" like array for faster IO speed. Sounds like way too complex of a system.

        Windows Azure Drives are like EBS but they are simply VHD files stored in Page Blobs (Azure's version of cloud storage, similar to Amazon S3) with local caching on each VM instance. I assume they have slower read/write speeds than EBS but se

  • And HOW THE HELL does such a procedure cause data loss?!

    Are those geniuses using the service transfer procedures that do not perform clean transaction handling and instead just send stuff to be copied expecting that it will sync soon enough?

  • I'm trying to remember what the other outage was recently where the web service failed because they forgot to implement exponential backoff. Anyone remember?
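For reference, the exponential backoff being asked about is only a few lines. This is a generic sketch (the function name and defaults are mine, not any particular service's):

```python
import random

def backoff_delays(base=0.5, cap=60.0, attempts=8):
    """Exponential backoff with full jitter.

    Each retry waits a random time in [0, min(cap, base * 2**n)], so a
    crowd of failing clients spreads out instead of retrying in lockstep
    and hammering a service that is trying to recover."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

In a real retry loop you would sleep for each delay between attempts and stop on the first success; without the jitter and the cap, every client retries at the same instants and the retry storm itself keeps the service down.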
  • During the whole issue they never posted a cause, and it took them forever to even say 'still investigating'. Even with a bare-bones monitoring system up, it should have been readily apparent that traffic was flowing over the wrong network.

    [..] because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving.

    So they're basically saying if the primary network has issues there's not really a point in the backup because the back

  • If this was the cause, why wasn't the change corrected immediately and the traffic routed to where it was originally intended? Three days of downtime just doesn't happen when you fuck up a line in a config. If this was actually the case, the downtime would have been minimal.
    • Go and read the entire notice, not just the pathetic snippet a bad submitter used. Makes more sense.

      Also, this is a storage network, not an access network. Effectively it's like pulling the SAS cable out of the RAID card while the machines are running.

    • by Zondar ( 32904 )

      This was a cascade failure that affected multiple systems on multiple layers, with ramping race conditions that worsened over time. The engineer didn't hit the "Enter" key and suddenly the little green light turned red to tell him 1/3 of the grid was down.

  • and the primary and secondary network will not fail simultaneously for a large number of nodes.

    It's nice to see that everyone has the same problem:
    There is no way to identify wrong assumptions.

    But what's the conclusion?
    Should we stay away from huge systems, because the damage due to a wrong assumption in a huge system is huge?

  • by Triv ( 181010 )

    Huh. Sounds like a 21st century version of the routing failure that caused the 1965 Northeast blackout, just with data instead of electricity.

    http://en.wikipedia.org/wiki/Northeast_Blackout_of_1965 [wikipedia.org]

  • In thinking about why this happened, don't lose sight of the fact that the time they chose to make the configuration change was 00:47 local. Human performance on 3rd shift isn't what it is on day shift, and I would think it very likely the people managing this change had been up and working for a significant number of hours at that time. Would they have noticed something or done something differently at 10:00 local? Certainly making an upgrade at a time of lowest use sounds right, but it's not always as simple as that.
    • If their network guys work like the ones I know, 00:47 is just right before lunch time.

      There are human errors, sure, but the worst ones I've seen come from management trying to rush things, so the network guys "just stay until it works" instead of leaving things in a known good state and going to get some rest.

  • Hollywood has got to turn this into a movie ... I'd be first in line to buy a ticket
