Cloud Networking The Internet

Amazon EC2 Failure Post-Mortem (117 comments)

Posted by Soulskill
from the ted-tripped-over-a-power-cord dept.
CPE1704TKS tips news that Amazon has provided a post-mortem on why EC2 failed. Quoting: "At 12:47 AM PDT on April 21st, a network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."

  • by MagicM (85041) on Friday April 29, 2011 @07:59AM (#35973604)

    Instead of closing off one lane of highway for construction, they closed off all lanes and forced highway traffic to go through town. The roads in town weren't able to handle all the cars. Massive back-ups ensued.

  • by mysidia (191772) * on Friday April 29, 2011 @08:27AM (#35973844)

    ... to be able to handle loads if the primary fails?

    No. That's the point of the redundant elements and backup of the primary network.

    The secondary network they routed traffic to was designed for a different purpose, and never meant to receive traffic from the primary network.

  • by LWATCDR (28044) on Friday April 29, 2011 @11:56AM (#35976318) Homepage Journal

    You have not taken a statistics course, have you? You can have one airplane and have it fall out of the sky, or you can have 1,000,000 and never have one crash, and both systems could have nine 9's of safety. This is the risk of failure; it isn't destiny.
    So to combat the FUD.
    1. So far the death toll from the event is 18000. Death toll from radiation so far: 0.
    2. The nuclear plant didn't cause the disaster; the earthquake and the tsunami that followed it did.
    3. People died in cars, buildings, on the street, and in a dam that also collapsed.
    4. A lot of the lives were lost because of the failure of a sea wall.

    So by your logic we really need to replace cars, buildings, streets, dams, and sea walls first since they all have caused so many deaths. Might I suggest a cave? Oh and no fire because that is also too risky. And keep away from those sharp stones as well.
