Amazon EBS Failure Brings Down Reddit, Imgur, Others
Several readers have sent word of a significant Amazon EBS outage. Quoting:
"Amazon Web Services has confirmed that its Elastic Block Store (EBS) service is experiencing degraded service, leading sites across the Internet to experience downtime, including Reddit, Imgur and many others. AWS confirmed on its status page at 2:11 p.m. ET that it is experiencing 'degraded performance for a small number of EBS volumes.' It says the issue is restricted to a single Availability Zone within the US-East-1 Region, which is in Northern Virginia. AWS later reported that its Relational Database Service (Amazon RDS) and its Elastic Beanstalk application platform also experienced failures on Monday afternoon."
But But But (Score:5, Insightful)
It's the cloud! It's like never like down, and webscale!
Bright and Sunny Skies Today! (Score:5, Insightful)
Do you still think that putting your digital life in the "cloud", without any ability to fall back on a physical hard drive or device, is a good idea?
Re:Bright and Sunny Skies Today! (Score:3, Insightful)
Because physical servers don't ever fail?
wow, mainframe problems in the cloud (Score:5, Insightful)
Re:Low Availability? (Score:5, Insightful)
Re:No Fancy Uptime Numbers for them (Score:4, Insightful)
Hey, look everybody, the cloud is still up! You can't do near as much as you usually can, but it's up! 100% uptime! Woo!
Re:Bright and Sunny Skies Today! (Score:5, Insightful)
Your SLA is only as good as your weakest link. Granted, some of these sites may not have SLAs, but if you have an external vendor providing some of this stuff, and their service levels suck, then your service level can't be any better.
For me, I can't see why companies would be willing to do this kind of thing. The risks are just too high.
Because many companies are not willing to spend what it takes to get availability greater than what they can get at Amazon - especially if they take advantage of multi-AZ or multi-region redundancy.
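To put rough numbers on both the "weakest link" point and the multi-AZ point, here is a minimal sketch. The availability figures are made up for illustration, and the redundancy formula assumes the copies fail independently, which real zones do not always do.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def serial_availability(parts):
    """Every part must be up for the service to be up, so availabilities multiply."""
    return prod(parts)


def redundant_availability(copy_availability, copies):
    """Service is up if at least one copy is up (assumes independent failures)."""
    return 1 - (1 - copy_availability) ** copies


def downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures, not anyone's published SLA.
own_app = 0.9995
vendor = 0.999

chained = serial_availability([own_app, vendor])
two_zone = redundant_availability(vendor, 2)

print(f"app on one vendor zone: {chained:.5f} ({downtime_hours(chained):.1f} h/year down)")
print(f"vendor across two zones: {two_zone:.6f} ({downtime_hours(two_zone):.2f} h/year down)")
```

The chained figure always comes out below the worst single component, which is the SLA complaint. The redundant figure is only as good as the independence assumption behind it.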
Sure, having a physical server at the office that you know you can fix by buying parts at the local computer store sounds attractive. Until the day you find that your building has burnt to the ground. Or a truck knocked over the utility pole providing network and electricity to your building. Or you discover that when you looked at the flood maps to make sure you weren't in a flood zone, the maps didn't account for a water main breaking and flooding the basement where your telecom equipment is... or the clogged roof drains that let 20,000 gallons of water build up on the roof during a rainstorm until the roof collapsed and flooded your datacenter. Or the earthquake (or hurricane or tornado or flood or whatever) that takes down your site for days or weeks or even months, while your employees are more concerned with surviving than with trying to get your critical systems back online.
Meeting an SLA for your own facility only works when that facility is running, and often the company that rents office space has little control over the facility.
My company has a number of critical services running in one Amazon region with replication to a second region for failover. The second region costs very little: just a single instance to hold data replicated from the primary instance. If we need to spin up the servers in the secondary region, it takes about 10 minutes to push the data from the local copy to the other servers once we start them up.
We could automate the whole process, but Amazon problems are rare enough that it hasn't been worth it.
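For anyone wondering what automating it might look like, here is a rough sketch of the decision logic only. The class name, the three-failure threshold, and the idea of feeding it periodic health-check results are all assumptions for illustration, not our actual setup. Starting the secondary servers and pushing the data would hang off the point where it returns True.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical helper: decides when to fail over to the secondary region."""
    failures_needed: int = 3        # assumed threshold of consecutive failed checks
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, primary_healthy: bool) -> bool:
        """Feed one health-check result; return True only when failover should start."""
        if self.failed_over:
            return False            # already failed over, never trigger twice
        if primary_healthy:
            self.consecutive_failures = 0   # a single good check resets the count
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_needed:
            self.failed_over = True
            return True
        return False


monitor = FailoverMonitor()
results = [True, False, True, False, False, False, False]
triggers = [monitor.record(r) for r in results]
print(triggers)
```

Requiring several consecutive failures keeps one dropped check from starting a failover, at the cost of reacting a few check intervals later.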
We do have a couple servers in us-east-1a but so far those servers appear to be fine, although the AWS management interface has not been working for managing servers in that region/AZ. If we ran servers out of our local office instead of Amazon, we would have had at least 2 instances of complete downtime in the past year - one 3 hour internet outage, and a 48 hour power failure on a weekend when a transformer blew and the power company didn't have an available spare and had to truck it in from out of area.
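As a back-of-the-envelope check on those two office outages, the arithmetic is sketched below. It counts only the 3-hour and 48-hour incidents mentioned above and assumes a 365-day year with nothing else going wrong.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def availability_from_outages(outage_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period the service was up, given a list of outage lengths."""
    return 1 - sum(outage_hours) / period_hours


# The two incidents from the comment: internet outage and power failure.
office_outages = [3, 48]
office = availability_from_outages(office_outages)

print(f"downtime: {sum(office_outages)} hours")
print(f"availability: {office:.2%}")  # 99.42%
```

So the local office would have landed a bit above 99.4% for the year, or 51 hours down.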
Re:Low Availability? (Score:5, Insightful)
Seems to me that the answer is just to host things yourself, instead of relying on another company's infrastructure.
How do you host anything without relying on another company's infrastructure? Do you purchase rights-of-way between your site and all of your customers and string your own fiber? Do you run your own power plant? Do you build your own UPS, right down to the batteries, so you don't need to trust a UPS vendor? Do you build and service your own CRACs?
It's impossible for any company to *not* rely on another company's infrastructure, even if just for internet connectivity. The only question is where to draw the line - do you really want to rack and stack your own servers? Do you trust a vendor to do periodic preventative maintenance on your generators, or do you use your own staff? Do you certify your own staff to service your fire suppression system, or do you contract out to a vendor? Do you want to own your own network equipment and do your own network admin? Do you want to swap out servers and disk drives when they fail? Do you keep staff electricians on-hand to take care of electrical issues? Do you want to run a 24x7 NOC to monitor and maintain your datacenter?
While a large company may be able to keep many of these tasks in-house, many small companies can't afford the staff it would take to control all of their infrastructure.