
Amazon EBS Failure Brings Down Reddit, Imgur, Others

Posted by Soulskill
from the karma-for-the-kindle-wiping dept.
Several readers have sent word of a significant Amazon EBS outage. Quoting: "Amazon Web Services has confirmed that its Elastic Block Storage (EBS) service is experiencing degraded service, leading sites across the Internet to experience downtime, including Reddit, Imgur and many others. AWS confirmed on its status page at 2:11 p.m. ET that it is experiencing 'degraded performance for a small number of EBS volumes.' It says the issue is restricted to a single Availability Zone within the US-East-1 Region, which is in Northern Virginia. AWS later reported that its Relational Database Service (Amazon RDS) and its Elastic Beanstalk application platform also experienced failures on Monday afternoon."
  • by Anonymous Coward
    Or else my afternoon is going to totally suck.
    • by IonOtter (629215)

      Nope, FB is alive and well.

      To be fair, I find a lot more entertainment in Reddit and Imgur than FB...

      • by sortius_nod (1080919) on Monday October 22, 2012 @04:24PM (#41733033) Homepage

        I'm just glad I moved my hosting away from AWS. It seems they've had a few problems lately in their datacentres. Local Aussie hosting seems to have better bandwidth anyway.

        • Local Aussie hosting is easily double the cost though. I have servers with Crucial Paradigm Australia and Crucial Paradigm USA. The websites appear just as fast to the average user but the USA hosting is 1/3rd the price.

        • If you primarily serve Australia then a local host is fine. If you serve an international audience that host is going to have poor latency for a majority of your visitors.

          AWS is an option. You could also use an edge caching service like Akamai. Akamai is likely much more expensive than AWS.

  • by Phisbut (761268) on Monday October 22, 2012 @04:05PM (#41732707)
    Productivity reached a record high this afternoon.
  • But But But (Score:5, Insightful)

    by Anonymous Coward on Monday October 22, 2012 @04:05PM (#41732713)

    It's the cloud! It's like never like down, and webscale!

  • by Anonymous Coward on Monday October 22, 2012 @04:06PM (#41732717)

    Since no one can go on reddit, they will come back to /. only to find out why reddit is down!

  • Other Victims (Score:5, Informative)

    by Revotron (1115029) on Monday October 22, 2012 @04:06PM (#41732719)
    Coursera is also down as a result.
  • by magarity (164372) on Monday October 22, 2012 @04:06PM (#41732721)

    /. is working just fine.

    Are those karma points in the mail?

  • Oblig (Score:5, Funny)

    by sortius_nod (1080919) on Monday October 22, 2012 @04:06PM (#41732723) Homepage

    It's as if millions of geek voices cried out in terror & were suddenly silenced.

    • by Bigby (659157)

      If a geek cries out in terror and there's no site to read it on, do they really cry out in terror?

  • Single AZ my butt (Score:3, Informative)

    by Anonymous Coward on Monday October 22, 2012 @04:07PM (#41732737)

    We are seeing EBS problems across multiple AZs with our services, as are many others. Amazon is downplaying the issue.

    See HN for ongoing discussion as well: http://news.ycombinator.com/

    • Yeah, I'm in southern Indiana and Reddit has been down all day.

  • by bill_mcgonigle (4333) * on Monday October 22, 2012 @04:07PM (#41732739) Homepage Journal

    Bad luck if you're hosted in the US-East-1 Region [amazon.com], I guess.

    Heh, I should really start advertising the LVS clusters I tend to as 'private clouds with better uptime than Amazon'.

    • by malakai (136531)

      According to amazon, it's not an outage, it's a "performance disruption". My guess is, this will negate costly concessions based on SLA's.

      • by Patch86 (1465427)

        Unless they have performance SLAs as well as uptime SLAs. Which they really should. Who the hell would move their system/site to a server hosting business without a performance SLA? I mean, you wanted 23 second page load times on your site, right?

    • by RulerOf (975607) on Monday October 22, 2012 @04:58PM (#41733469)
      Real bad luck.

      Desk phones and SIP clients out for 2.5 hours for me. Calls rolled over at the provider level like they were supposed to though. Didn't think I'd have to put that to the test so soon.

      The server qualifies for the free tier, and that's probably why it just went straight unresponsive for two hours. Maybe I should upgrade to a slightly larger paid/reserved instance and..... Wait, I smell conspiracy.
      • by mayko (1630637)
        Well don't worry about the conspiracy. Here we spend hundreds of thousands of dollars a month with AWS, much of that is reserved instances and nearly all of our US-East instances were affected. Mostly in the "east-1b" AZ, but it was not isolated to that. Anything using RDS in that region is still down.
      • The server qualifies for the free tier, and that's probably why it just went straight unresponsive for two hours. Maybe I should upgrade to a slightly larger paid/reserved instance and..... Wait, I smell conspiracy.

        I'm right now hacking away at an EC2 instance with an EBS volume in the affected region, with no disruptions. The EC2 is an "Extra Large Instance" (need it for the IOPS more than the CPU or memory), though I don't think that matters so far as EBS is concerned.

  • Low Availability? (Score:5, Interesting)

    by mkosmo (768069) <mkosmo@gmail.com> on Monday October 22, 2012 @04:08PM (#41732755) Homepage
    I have to admit, due to this outage I just logged in to Slashdot for the first time in a year. We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages. This makes me wonder why Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage.
    • Re:Low Availability? (Score:4, Informative)

      by Anonymous Coward on Monday October 22, 2012 @04:09PM (#41732769)

      >Reddit, Imgur, etc., don't have presences in multiple availability zones to prevent this kind of outage

      They do. It's a multi-AZ outage, despite what Amazon is saying.

      • yeah my ec2 instance that is hosted in east-1a is up and the management console tells me it's just east-1d that is down...but i have a hard time believing that

      • Re:Low Availability? (Score:5, Interesting)

        by segedunum (883035) on Monday October 22, 2012 @04:30PM (#41733093)

        They do. It's a multi-AZ outage, despite what Amazon is saying.

        Amazon's multiple availability zones stuff is total bullshit. It has become painfully apparent during every single one of these outages that the so-called availability zones are not separate because an EBS problem propagates everywhere. No one can actually work the availability zones out either because what Amazon cunningly does is call zones by different letters for different customers, so availability zone 'a' for one might be availability zone 'c' for another so no one can actually compare. That fact alone sent my bullshit meter off the scale. It just seems excessively evasive and sneaky for my taste.

        If you want redundancy you are going to have to go to completely geographically separate zones. Keeping those zones in sync is prohibitively expensive for the vast majority. Either that or you have a backup cloud provider, but again you have to be so paranoid and trust Amazon so little that you have to be able to have your data out and off Amazon's infrastructure at least nightly at a moment's notice. Sorry, but that just doesn't work.
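The point above about zone letters differing per customer can be illustrated with a small sketch. Everything here is hypothetical: the physical zone names, the account IDs, and the hash-based shuffle are invented for illustration and are not how Amazon actually assigns names. The sketch only shows why two customers comparing "us-east-1a" may be talking about different facilities.

```python
import hashlib

# Hypothetical physical zones; the real identifiers are not public in this thread.
PHYSICAL_ZONES = ["zone-1", "zone-2", "zone-3", "zone-4"]
LETTERS = "abcd"


def az_mapping(account_id: str) -> dict:
    """Return a per-account mapping from zone name to physical zone.

    Assumption for illustration: the order is a deterministic shuffle
    keyed on the account ID.
    """
    ranked = sorted(
        PHYSICAL_ZONES,
        key=lambda zone: hashlib.sha256(f"{account_id}:{zone}".encode()).hexdigest(),
    )
    return {f"us-east-1{letter}": zone for letter, zone in zip(LETTERS, ranked)}


def same_physical_zone(acct_a: str, name_a: str, acct_b: str, name_b: str) -> bool:
    """Check whether two customers' zone names refer to the same physical zone."""
    return az_mapping(acct_a)[name_a] == az_mapping(acct_b)[name_b]
```

Under a scheme like this, customers could only compare outage reports if they had access to the underlying physical identifier, which is the complaint being made.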

        • Re: (Score:2, Informative)

          by i_hate_robots (922668)
          Multi AZ IS "completely geographically separate zones" and yes, you can specifically define which ones. Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East) I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.
          • Re:Low Availability? (Score:4, Informative)

            by hawguy (1600213) on Monday October 22, 2012 @05:10PM (#41733627)

            Multi AZ IS "completely geographically separate zones" and yes, you can specifically define which ones.

            Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East)

            I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.

            Availability Zones are not geographically separate - regions are:

            http://aws.amazon.com/ec2/#features [amazon.com]

            Availability Zones are distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location. Regions consist of one or more Availability Zones, are geographically dispersed, and will be in separate geographic areas or countries

          • Re:Low Availability? (Score:5, Interesting)

            by segedunum (883035) on Monday October 22, 2012 @05:17PM (#41733713)

            Multi AZ IS "completely geographically separate zones" and yes...

            Availability zones are not geographically separate nor is there any evidence that they are geographically or even logically separate from the nature of every major EBS outage there has been.

            Amazon is very clear that US East 1a,b,c,d are all the same physical data center. However, West is not. It's in Oregon (as opposed to VA for East)

            a, b, c and d are availability zones. US East, West etc. are different regions. I'm afraid you're not understanding just what is meant by availability zones or just muddying the waters.

            I've seen no evidence that true Multi AZ instances (as described by Amazon) are down. If you've got some though, I would be interested to see it because I would be pretty concerned.

            As I've said above, Amazon makes it as difficult as possible to verify availability zone failures because AZ 'a' for one customer might be 'c' for another and 'b' for another, so you can't verify anything with others. However, it becomes very clear when you get on Amazon's forums and look at major sites that have implemented in multiple zones from their perspective that they are down and have EBS problems in different zones they have. You don't get much more evidence than that.

            If you're not concerned when looking at that then I smell some apologism I'm afraid.

      • It's a multi-AZ outage, despite what Amazon is saying.

        And/or AZ's aren't quite as physically isolated as Amazon makes out, which I've suspected for a while.

    • Re:Low Availability? (Score:5, Interesting)

      by segedunum (883035) on Monday October 22, 2012 @04:20PM (#41732951)

      We're experiencing our own outages at work, unrelated to AWS, but I'd hate to be an AWS admin during one of these major outages.

      I used to be an admin working on AWS through some of these outages, and it's not pleasant let me tell you. The amount of redundancy you need to get through this makes putting stuff in the cloud prohibitively expensive and things are basically out of your hands. When you run your own servers you know how long it will take to replace a piece of hardware or take emergency measures to keep things running. At least you know you have control over the process. Amazon? They recover what they can of your EBS disks in a few days without telling you anything and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

      Using AWS for throwaway computing where you just want some computing power for a few weeks of the year? Yes, fine. Permanently running stuff in it? Nope.

      • Re:Low Availability? (Score:5, Interesting)

        by segedunum (883035) on Monday October 22, 2012 @04:52PM (#41733397)

        ....and in the case of the European outage they actually screwed the EBS snapshots with a recovery job they ran. Thankfully I ran backups every night that took all data off Amazon's system. All I didn't know was when I could be back up and running.

        I felt this was worth emphasising. These are EBS snapshots, not just the EBS disks - the ones supposedly stored in S3 and immune to corruption. Your backups, in other words. If you use RDS you rely on these completely for backup.

        AWS is OK to get yourself up and running without paying huge amounts up front for hardware, but be aware that you just simply cannot trust this infrastructure.

        • by eWarz (610883)
          I'm afraid to say, you guys are doing it wrong. Currently building an eCommerce platform that scales across any server, even if said servers are across multiple providers. Oh and it'll only cost us about a hundred bucks a month. The cloud isn't about throwaway computing, the cloud is about scalable applications. If you use EC2 for static hosting you are doing it wrong.
  • by IonOtter (629215) on Monday October 22, 2012 @04:09PM (#41732771) Homepage

    Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?

    • Re: (Score:3, Insightful)

      by Anonymous Coward

      Because physical servers don't ever fail?

      • by rrohbeck (944847)

        No but you can make them reliable if needed.
        In the cloud you're at the mercy of the beancounters at Amazon & co.

        • by Smauler (915644)

          As much as I, like most people here, grin with a certain kind of glee* when something this big goes down, the fact is that doing it yourself is nearly always less reliable.

          Also, there's nothing necessarily exclusive about the cloud - you can back up your data too, right?

          *Yes, it's evil - but it's because I've had the adrenaline in the past and know what it is like - despite it being one of the worst times, it can also be one of the best for loads of people.

        • Trusting a single "cloud" provider for all your hosting is silly. Something like a Xen vm image backed up nightly can be hosted on a large number of cloud provider systems, and you can use round-robin DNS to help limit the impact of downtime at any one provider. And if a provider will be down for a significant period (longer than it will take for DNS caching to expire), just remove that IP from the DNS records.

          Of course you can also have a local server/datacenter that can run your same VM image, and use it
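The round-robin DNS idea above could be sketched roughly as follows. This is a minimal sketch, not a complete monitoring setup: the function names are made up for illustration, the IPs are documentation-range placeholders, and actually publishing the filtered record set depends on whatever API your DNS host offers, which is not shown.

```python
import socket


def is_reachable(ip: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_records(provider_ips, check=is_reachable):
    """Return the IPs that should stay in the round-robin A record set.

    If every check fails, the monitor itself may be cut off, so keep all
    records rather than publish an empty set.
    """
    alive = [ip for ip in provider_ips if check(ip)]
    return alive or list(provider_ips)


# Usage with placeholder addresses and an injected check function:
# healthy_records(["192.0.2.10", "198.51.100.20"], check=my_check)
```

As the comment notes, removing a record only helps once cached answers expire, so the record TTL bounds how quickly clients stop hitting the dead provider.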
    • by gstoddart (321705) on Monday October 22, 2012 @04:16PM (#41732887) Homepage

      Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?

      My first thoughts as well.

      A friend was recently telling me about an issue they were having at work ... they host stuff for other people, and have very high-availability SLAs. Unfortunately, the support they have from some of their own internal people is "weekdays 9-5". So when an outage happened, they were dead in the water, because their own people basically said "sorry, we don't do after hours support".

      Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.

      For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

      • by Anonymous Coward on Monday October 22, 2012 @04:23PM (#41733015)

        For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

        That's because you don't have an MBA.

      • So when an outage happened, they were dead in the water, because their own people basically said "sorry, we don't do after hours support".

        This is not a system failure, it's a Human Resources failure.

        • by gstoddart (321705)

          No, it is more of a salesman failure, or a side effect of working in a large company.

          People sell outsourcing services, and they use products that other divisions of the company make and support.

          If something goes wrong, one division is on the hook for a high service level, and the other division provides the same level of crappy support they provide their customers.

          The group on the hook for the service has no clout over the group that makes the product in use.

          I have seen software sales where the salesman bun

      • by TubeSteak (669689)

        Your SLA is only as good as your weakest link.

        It seems like Amazon's weakest link is Virginia.
        I recall from the last Amazon outage thread on /. that Virginia seems to be the epicenter for epic fail.

      • by hawguy (1600213) on Monday October 22, 2012 @04:51PM (#41733381)

        Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.

        For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.

        Because many companies are not willing to spend what it takes to get availability greater than what they can get at Amazon - especially if they take advantage of multi-AZ or multi-region redundancy.

        Sure, having a physical server at the office that you know you can fix by buying parts at the local computer store sounds attractive. Until the day you find that your building has burnt to the ground. Or a truck knocked over the utility pole providing network and electricity to your building. Or you discover that when you looked at the flood maps to make sure you weren't in a flood zone, the maps didn't account for a water main breaking and flooding the basement where your telecom equipment is... or the clogged roof drains that let 20,000 gallons of water build up on the roof during a rainstorm until the roof collapsed and flooded your datacenter. Or the earthquake (or hurricane or tornado or flood or whatever) that takes down your site for days or weeks or even months, and your employees are more concerned with surviving than trying to get your critical systems back online.

        Meeting an SLA for your own facility only works when that facility is running, and often the company that rents office space has little control over the facility.

        My company has a number of critical services running in one Amazon region with replication to a second region for failover. The second region costs very little, just a single instance to hold data replicated from the primary instance, then if we need to spin up the servers in the secondary region, it takes about 10 minutes to push the data from the local copy to the other servers once we start them up.

        We could automate the whole process, but Amazon problems are rare enough that it hasn't been worth it.

        We do have a couple servers in us-east-1a but so far those servers appear to be fine, although the AWS management interface has not been working for managing servers in that region/AZ. If we ran servers out of our local office instead of Amazon, we would have had at least 2 instances of complete downtime in the past year - one 3 hour internet outage, and a 48 hour power failure on a weekend when a transformer blew and the power company didn't have an available spare and had to truck it in from out of area.

        • by CastrTroy (595695)
          You're talking like hosting your own servers on premises or being in the cloud are your only choices. You could also rent space in a high quality data center and replicate your data out to another high quality datacenter where you also rent space in a different geographic location. Then, when your primary data center goes down, you switch over to the other one. Or run off both at the same time if your architecture allows you to do that. That basically covers you in most instances. If both your rented data
          • by hawguy (1600213)

            You're talking like hosting your own servers on premises or being in the cloud are your only choices. You could also rent space in a high quality data center and replicate your data out to another high quality datacenter where you also rent space in a different geographic location. Then, when your primary data center goes down, you switch over to the other one. Or run off both at the same time if your architecture allows you to do that. That basically covers you in most instances. If both your rented datacenters go out at the same time, and they are in different locations, there's probably much bigger things to worry about. Or you didn't pick very good datacenters in the first place.

            Isn't that the same as putting your servers into multiple Amazon regions? You're still putting your destiny in the hands of the datacenter.

            • by aaarrrgggh (9205)

              Fundamentally it is different because modes of common failure are much less severe. If Amazon takes a hit to one facility, it is going to load up other facilities.

      • by MoNsTeR (4403)
        If you think the risks of running in the cloud are less than the risks of running in a traditional data center, you're very much mistaken.

        If one AWS AZ goes down I can bring up servers in a second one. If one AWS region goes down I can bring up servers in a second one. In fact to hedge against these risks I *already have* servers in multiple zones and regions.

        Sure you can do that with traditional data centers. Just host your stuff across more than one, right? Do you have any concept of what that COSTS? Espe
        • by makomk (752139)

          If the management API and web interface go down throughout the entirety of AWS, as they did in this outage according to some users, good luck bringing up another server. Besides, everyone else had the same idea - if one region goes down, we can just bring up a server in another region, so we don't have to pay for servers in multiple data centers - so it turned out there weren't actually nearly enough servers available for them to do this.

    • My life and business don't rely on ANY internet based social service things and I make sure my customers are not dependent on social media to know what's going on with my business. Hell even if the internet would go down I still have a phone book and a land line.

      • Hell even if the internet would go down I still have a phone book and a land line.

        Hey, me too! It's always good to have things lying around to club people with when civilization ends.

  • multi AZ? (Score:3, Interesting)

    by i_hate_robots (922668) on Monday October 22, 2012 @04:10PM (#41732781)
    An honest question, why don't these large, big-name sites utilize the Multi Availability Zone failover that Amazon offers? It seems these AWS outages make for good headlines, but shouldn't any large site be co-located in multiple physical locations to ensure uptime? If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...
    • by segedunum (883035)

      An honest question, why don't these large, big-name sites utilize the Multi Availability Zone failover that Amazon offers?

      They do. Plenty of people do. The problem is that these EBS failures always propagate across availability zones no matter what Amazon says.

      If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...

      Because you have no hard experience of what multiple availability zones practically means in Amazon's infrastructure.

      • by Pinhedd (1661735)

        Multi-AZ is only available for certain services. It's slower and costs twice as much. There's also replication delay issues between multi-AZ instances.

    • by hawguy (1600213)

      An honest question, why don't these large, big-name sites utilize the Multi Availability Zone failover that Amazon offers?

      It seems these AWS outages make for good headlines, but shouldn't any large site be co-located in multiple physical locations to ensure uptime?

      If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...

      There are rumors floating around that this affects more than one AZ - I'd never host critical infrastructure entirely in a single region, even across multiple AZ's - much better to have it spread across multiple regions, which would eliminate most failure modes that could affect one region (like an East Coast Hurricane).

    • Re:multi AZ? (Score:4, Informative)

      by c0lo (1497653) on Monday October 22, 2012 @05:58PM (#41734227)

      If they WERE using Multi AZ, or there is some other technical reason why it wouldn't help, I'm really curious to know why...

      Here's your answer: cascading failures [wikipedia.org].

      In short, cascading failures don't happen because one local failure causes the entire capacity of the network to be exceeded... you see, it is not a case of every node being connected to every node (O(N^2) connections), so a failure only needs to overload the capacity of the nodes connected to the failing one...
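A toy simulation can make that mechanism concrete. This is a sketch under simplifying assumptions that are mine, not the commenter's: nodes sit in a ring, a failed node's load is split equally among its surviving neighbours, and a node fails as soon as its load exceeds its capacity. Real storage networks are far messier.

```python
from collections import deque


def ring(n: int) -> dict:
    """Build a ring topology: each node is connected only to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_cascade(neighbors, load, capacity, first_failure):
    """Return the set of nodes that end up failed after one initial failure."""
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = set()
    queue = deque([first_failure])
    while queue:
        node = queue.popleft()
        if node in failed:
            continue
        failed.add(node)
        live = [n for n in neighbors[node] if n not in failed]
        if not live:
            continue  # nobody left to take over; the load is simply dropped
        share = load[node] / len(live)
        load[node] = 0.0
        for n in live:
            load[n] += share
            if load[n] > capacity[n]:
                queue.append(n)  # overloaded neighbour fails next
    return failed


nodes = ring(6)
capacity = {i: 100.0 for i in nodes}
busy = {i: 70.0 for i in nodes}   # total load 420, total capacity 600
quiet = {i: 40.0 for i in nodes}

print(simulate_cascade(nodes, busy, capacity, 0))   # every node fails
print(simulate_cascade(nodes, quiet, capacity, 0))  # only node 0 fails
```

In the busy case the network as a whole has spare capacity, yet one failure takes out every node because each failure overloads only its immediate neighbours, which is the behaviour described above.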

  • But the cloud is so much better to use!
  • As Usual... (Score:3, Funny)

    by broginator (1955750) on Monday October 22, 2012 @04:13PM (#41732819)
    There's an oblig xkcd: http://xkcd.com/908/ [xkcd.com] Guess someone tripped over the wire.
  • turntable.fm is also down -- I guess the NYC tech startup community is going crazy right now. Time to diversify!
  • This too shall pass
  • by NinjaTekNeeks (817385) on Monday October 22, 2012 @04:18PM (#41732923)
    Looks like there won't be any fancy reports about the "cloud" having spectacular uptimes; with over an hour passed, they can no longer claim more than 3 nines uptime.
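For reference, the arithmetic behind the "nines" claim can be checked with a few lines. This is a sketch assuming availability is measured over a 365-day year; an actual SLA may use a monthly window and its own definition of downtime, which would change the numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_minutes(nines: int, period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime in minutes for an availability with the given number of nines."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001, i.e. 99.9% available
    return period_hours * 60 * unavailability


def meets_sla(downtime_minutes: float, nines: int) -> bool:
    """True if the observed downtime still fits inside the budget."""
    return downtime_minutes <= downtime_budget_minutes(nines)


for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes per year")
```

On a yearly basis, three nines allows roughly 8.8 hours of downtime and four nines roughly 53 minutes, so an outage of over an hour rules out four nines but not three.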
    • by Revotron (1115029) on Monday October 22, 2012 @04:32PM (#41733133)
      Well... as it's currently referred to, the "cloud" is a singular entity. So, as long as there's one single server running as part of that infrastructure, you could weasel your way around any downtime and reassure the ignorant masses that "the cloud" is still up, even if the only remaining piece is a Raspberry Pi running over a cable modem in some guy's basement.

      Hey, look everybody, the cloud is still up! You can't do near as much as you usually can, but it's up! 100% uptime! Woo!
  • Netcraft confirms it.

  • by Dan667 (564390) on Monday October 22, 2012 @04:30PM (#41733101)
    If only there were some lessons learned over decades and decades of mainframe use that could be applied to the cloud.
    • If only there were some lessons learned over decades and decades of mainframe use that could be applied to the cloud.

      Fans of, "I don't want to have to do my job" will never learn the lessons of the past.

    • by Smauler (915644)

      Yeah - it's like availability and uptime is getting worse, rather than better.

      What do you mean, it's not...

  • This zapping people's data is getting to be habit forming for Amazon I think.

    I guess we're just waiting to hear if it was a mistake or on purpose.

  • Minecraft login is down too!
  • In a virtual world, you put on your roller blades, and administer a failing data center. Level 1 is your home LAN. Level 2 is a law office and all the attorneys want the morning's court briefs immediately because court starts in 45 minutes and the file server screen says "RAID array offline". Level 3 is a small ISP. Level 4 is AWS. Level 5 is Google. Good luck!

