Car Hits Utility Pole, Takes Out EC2 Datacenter
1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."
Farmville updates on Facebook stop (Score:5, Insightful)
And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
Nothing of value was lost.
It's failure on multiple levels (Score:5, Insightful)
Re:Hurrr, durrr (Score:3, Insightful)
Stop building those things so fucking close to the roads, maybe?
What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
Oil's Well (Score:3, Insightful)
It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?
Re:It's failure on multiple levels (Score:5, Insightful)
You said it. They failed to test. I design/run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers...you could nuke one of them and everything would still be up.
I'm actually a little blown away that the all-powerful Amazon could possibly let this kind of thing happen. They are supposed to be the pro team; a power failure is high school ball.
Re:UPS's (Score:1, Insightful)
Strips of steel with holes in them? You're kidding, right?
Re:An untested DR plan is a worthless DR plan (Score:4, Insightful)
What on earth leads you to suggest they don't have working disaster recovery? They experienced some disparate power outages and say they're implementing changes to improve their power distribution.
I've hosted in data centers where the UPS was regularly tested, yet in a real live incident the switchover failed. Even though the UPS did come up, there was a brief outage that shut down all the racks. Each rack needs to be brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.
I've hosted in another data center where someone hit the BIG RED BUTTON underneath the plastic case, cutting off power to the floor.
I'm sure Amazon could have done things better and will learn lessons. That's life in a data center.
Nonetheless, Amazon allow you to keep your data at geographically diverse locations. As a customer you can pay the money and get geographic diversity that would have mitigated this outage. If you don't take advantage of that, you can hardly blame Amazon for your decision.
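To put rough numbers on the point above about bringing racks back one at a time, here is a minimal sketch. The rack names, the 30-second settle delay, and the count of 200 racks are all made-up assumptions for illustration, not figures from Amazon or any real facility.

```python
def staggered_schedule(racks, settle_seconds):
    """Return (rack, start_offset_seconds) pairs, powering on one rack at a time."""
    # Each rack starts only after the previous one has had settle_seconds to stabilize.
    return [(rack, index * settle_seconds) for index, rack in enumerate(racks)]


def total_restore_seconds(rack_count, settle_seconds):
    """Time until the last rack has started and finished its settle period."""
    return rack_count * settle_seconds


# Hypothetical rack names and delay.
racks = ["rack-01", "rack-02", "rack-03"]
print(staggered_schedule(racks, 30))
# [('rack-01', 0), ('rack-02', 30), ('rack-03', 60)]

# Hypothetical floor of 200 racks at 30 seconds each, in minutes.
print(total_restore_seconds(200, 30) / 60)
# 100.0
```

Even with a short delay per rack, a brief power blip can turn into well over an hour of recovery on a large floor under these assumptions.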
Re:Redundancies, Redundancies (Score:5, Insightful)
Redundancy costs money. If it costs more than downtime, you don't get it.
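As a back-of-the-envelope version of that trade-off, here is a minimal sketch. The function names and every dollar figure are hypothetical, and a real decision would also weigh things like reputation and contractual penalties that this ignores.

```python
def expected_downtime_cost(outage_hours_per_year, cost_per_outage_hour):
    """Expected yearly cost of downtime if no extra redundancy is bought."""
    return outage_hours_per_year * cost_per_outage_hour


def redundancy_is_worth_it(redundancy_cost_per_year, outage_hours_per_year, cost_per_outage_hour):
    """True when the expected downtime cost exceeds the yearly cost of redundancy."""
    downtime_cost = expected_downtime_cost(outage_hours_per_year, cost_per_outage_hour)
    return downtime_cost > redundancy_cost_per_year


# Hypothetical figures: redundancy costs 50,000 per year, 2 outage hours expected per year.
print(redundancy_is_worth_it(50_000, 2, 10_000))   # 20,000 < 50,000, prints False
print(redundancy_is_worth_it(50_000, 2, 100_000))  # 200,000 > 50,000, prints True
```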
Re:Obvious solution (Score:2, Insightful)
Re:It's failure on multiple levels (Score:5, Insightful)
Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.
Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
Or is it normal to just plug in mission critical hardware and not check that it is set up properly?
"We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.
I guess TFA answered that question.
If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.
Re:Murphy's law (Score:5, Insightful)
Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so if there's a site-wide outage it doesn't interfere with the operations.
Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means hosting provider with no DC power room, N+1 generators and regular testing to ensure the fallback systems actually work.
That kind of reminds me of a company (which will remain nameless) that did tape backups, but never verified their tapes. When the data was lost, a good percentage of the tapes didn't work.
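For what it's worth, the kind of check that company skipped can be sketched in a few lines: read the backup back and compare its checksum against the original. This is only an illustration using in-memory streams in place of real tapes or files; the function names are my own and not from any backup product.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, restored):
    """True when the data read back from the backup hashes the same as the original."""
    return sha256_of(original) == sha256_of(restored)


# Stand-ins for the source data and what was read back from the backup medium.
print(backup_matches(io.BytesIO(b"payroll data"), io.BytesIO(b"payroll data")))  # True
print(backup_matches(io.BytesIO(b"payroll data"), io.BytesIO(b"payroll dat")))   # False
```

A checksum match only proves the medium could be read back intact at test time, so it has to be repeated on a schedule, much like the weekly generator test.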
I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.
Re:It's failure on multiple levels (Score:3, Insightful)
And you know they don't test it how? Oh, right. Testing is a magic wand that solves everything - except it doesn't. I've seen stuff fail literally seconds after being successfully tested. Welcome to the real world.
Horseshit. This has nothing to do with the grid, and everything to do with local supplies - which rarely if ever have redundancy. (Mostly because it increases the difficulty and cost of maintenance and considerably increases the capital cost - while only providing a benefit in a one-in-a-million situation.)
Again: The IT Uptime Lightweights (Score:4, Insightful)
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
Re:Again: The IT Uptime Lightweights (Score:2, Insightful)
When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? The people who set the budget for IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to set the proper priorities to ensure -- really ensure -- their uptime.
There. Fixed that for you.
The reason you rarely see an ER go down for want of power is that, knowing that lives depend on it, the people responsible for providing for it are willing to spend what it takes, in capital investment and in manpower for ongoing maintenance and operation so that an acceptable level of availability is guaranteed. Amazon and (last year) Rackspace, not so much.
Re:Murphy's law (Score:3, Insightful)
Here's a wacky thing: the plural form of someone else is actually someone's else.
Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though. Does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...
Yes, I certainly meant "possessive," not "plural," and I don't claim any expertise at all with language. (I'm a math professor in part because I was always so bad at writing.)
Anyway, an English professor whom I asked about the puzzle explained to me that the correct, although archaic form is indeed someone's else. I pointed out to her that many on-line references use someone else's as the possessive form, and she explained that many on-line references are written by individuals who are catering toward the "business writer."
Evidently, the business audience is less concerned with what is grammatically correct than with what sounds correct because it is used most frequently. Hence, sites like dictionary.com will often list the most common usage even if it isn't technically correct.
For example, if you want to refer to the car belonging to the attorney general, it would be the attorney's general car not the attorney general's car. However, most readers would find the first form off-putting, so a business writer would prefer the second.
Of course, this leads to an endless digression as to grammar being a fixed set of rules to hold the language together as a standard or an amorphous description of common usage which must change with the times.
Well, I should probably stop commenting on this before I get too many more "offtopic" mods.