Car Hits Utility Pole, Takes Out EC2 Datacenter

Posted by timothy on Thursday May 13, 2010 @10:47PM from the nothing-can-go-wr dept.

1sockchuck writes "An Amazon cloud computing data center lost power Tuesday when a vehicle struck a nearby utility pole. When utility power was lost, a transfer switch in the data center failed to properly manage the shift to backup power. Amazon said a "small number" of EC2 customers lost service for about an hour, but the downtime followed three power outages last week at data centers supporting EC2 customers. Tuesday's incident is reminiscent of a 2007 outage at a Dallas data center when a truck crash took out a power transformer."

This discussion has been archived. No new comments can be posted.

Murphy's law (Score:2, Redundant)

by pwnies ( 1034518 ) writes:

Whatever can go wrong will rings pretty true here. Makes for an exciting day of work for them though I suppose; unlike yours truly.
*Goes back to reading /.*
- Re: (Score:3, Funny)
  
  by carp3_noct3m ( 1185697 ) writes:
  
  Karma and Murphy's law, a deadly combination.
- Again: The IT Uptime Lightweights (Score:4, Insightful)
  
  by RobotRunAmok ( 595286 ) writes: on Friday May 14, 2010 @07:20AM (#32205502)
  
  When was the last time anyone heard of a TV Network going dark for an hour? A Hospital Emergency Room? IT guys always run around like self-important Star Trek Blue Shirts, but they never seem to take the proper steps to ensure -- really ensure -- their uptime.
  I'm sure there are exceptions, but it just seems that they have a ways to go, compared to the real "critical systems" industries to which they are so fond of comparing themselves. Is it money, arrogance, or ignorance?
  
  - Re:Again: The IT Uptime Lightweights (Score:5, Informative)
    
    by Shimbo ( 100005 ) writes: on Friday May 14, 2010 @08:33AM (#32205862)
    
    When was the last time anyone heard of a TV Network going dark for an hour?
    Hmm, let me think. How about yesterday [bbc.co.uk]?
    
  - Re: (Score:3, Informative)
    
    by TooMuchToDo ( 882796 ) writes:
    
    Usually, TV stations (that get fined for being off the air for not using their spectrum) and hospitals (which, you know, you can die at if the power goes out depending on your circumstances) have an easier time getting money for redundancy because the bad results are more expensive than if LOLcats is down.
- - Re:Murphy's law (Score:5, Funny)
    
    by turing_m ( 1030530 ) writes: on Thursday May 13, 2010 @11:19PM (#32203468)
    
    Nice try, but you still fail to grammar.
    This is why I long ago resolved to never, ever, ever correct someone else's grammar on slashdot. The risk in inadvertently failing to grammar is unacceptable.
    
    - - Re: (Score:2, Offtopic)
        
        by DarkTempes ( 822722 ) writes:
        
        I think automated spell checking is a poor way to learn grammar and that such tools are frequently wrong.
        A quick review makes me suspect that the correct possessive form is still someone else's. (Sources: a dictionary [reference.com], a writing guide [essayinfo.com], and a google test)
      - Re: (Score:3, Insightful)
        
        by onionman ( 975962 ) writes:
        
        Here's a wacky thing: the plural form of someone else is actually someone's else .
        Ah, I can see the reason for your disclaimer about not having good grammar. "Someone else's" isn't plural, it's possessive! Still an interesting fact though.. does it mean the possessive form of someone else is someone's else? Looks pretty wrong to me...
        Yes, I certainly meant "possessive," not "plural," and I don't claim any expertise at all with language. (I'm a math professor in part because I was always so bad at writing.)
        Anyway, an English professor whom I asked about the puzzle explained to me that the correct, although archaic form is indeed someone's else. I pointed out to her that many on-line references use someone else's as the possessive form, and she explained that many on-line references are written by individuals who are catering toward the
    - - Re: (Score:3, Funny)
        
        by Dishevel ( 1105119 ) * writes:
        
        TLDR ; lol?
  - Re: (Score:2)
    
    by nmg196 ( 184961 ) writes:
    
    > but you still fail to grammar.
    We *all* fail to grammar occasionally. Especially you.
    - Re: (Score:2)
      
      by imakemusic ( 1164993 ) writes:
      
      Yeah. My spelling is perfect but I still get my grammer wrong sometimes.
- - Re: (Score:2)
    
    by fuzzyfuzzyfungus ( 1223518 ) writes:
    
    There are both economies and diseconomies to centralization. The real issue (in many "cloud" cases) is that some of the things that could be economies of centralization are being skipped in the mad rush for low costs, and since everything is hidden under a shiny layer of web APIs, people don't notice in time.
    
    In this case, for instance, the cost per server, or per unit work done, to have Real Serious Redundant Power(batteries, generators, multiple utility links, etc.) plummets as the number of servers in
  - Re:Murphy's law (Score:5, Insightful)
    
    by JWSmythe ( 446288 ) writes: <jwsmythe@nospam.jwsmythe.com> on Friday May 14, 2010 @02:30AM (#32204382) Homepage Journal
    
    Funny thing, I thought "cloud" computing meant that you're placed into an automatically redundant network of machines, so that a site-wide outage doesn't interfere with operations.
    Now I see that Amazon's definition of "cloud" simply means "hosting provider". I guess in this case it means hosting provider with no DC power room, N+1 generators and regular testing to ensure the fallback systems actually work.
    That kind of reminds me of a company (who will remain nameless) who did tape backups, but never verified their tapes. When the data was lost, a good percentage of the tapes didn't work.
    I worked near a good datacenter. Out on smoke breaks late at night, you could hear them test fire their generators once a week. I was in there helping someone one night during a thunderstorm that sounded like it would rip the roof off, when I heard the generators spin up. The inside of the datacenter didn't miss a beat. When I left an hour later, I saw that there was no power (street lights, traffic lights, and normally illuminated buildings) for about 1/2 mile around it. The power company had it fixed by morning though. When I came back in the morning, everything was fine. Well, except my workstation in the office that didn't have redundant power.
    
Farmville updates on Facebook stop (Score:5, Insightful)

by kriston ( 7886 ) writes: on Thursday May 13, 2010 @10:52PM (#32203340) Homepage Journal

And, as a result, Farmville/Mafiawars updates on Facebook temporarily stop.
Nothing of value was lost.

Where's your cloud now? (Score:4, Funny)

by TooMuchToDo ( 882796 ) writes: on Thursday May 13, 2010 @10:54PM (#32203352)

"The cloud" doesn't solve everything. Film at 11.

- Re: (Score:2)
  
  by GaryOlson ( 737642 ) writes:
  
  But a thick cloud with high density can cover up a lot of ugly infrastructure no one wants to see. Just ask the people who live in San Francisco.
  - Re: (Score:3, Informative)
    
    by tylerni7 ( 944579 ) writes:
    
    Sure, but a thick cloud with high density can also cover up a lot of important things, like roadways and utility poles.
- Re: (Score:2)
  
  by Richard_at_work ( 517087 ) writes:
  
  Nice generalisation - *this* cloud doesn't solve anything (but I never considered EC2 to be properly cloud anyway). It's the equivalent of how one vendor's badly designed RAID card doesn't invalidate the entire concept of RAID.
- - - Re:Where's your cloud now? (Score:4, Funny)
      
      by Sarten-X ( 1102295 ) writes: on Friday May 14, 2010 @12:32AM (#32203832) Homepage
      
      The definition is a very nebulous concept.
      
      - Re:Where's your cloud now? (Score:5, Funny)
        
        by plover ( 150551 ) * writes: on Friday May 14, 2010 @12:47AM (#32203920) Homepage Journal
        
        I'm kind of foggy on the details myself.
        
        
        Re: (Score:2)
        
        by dakameleon ( 1126377 ) writes:
        
        Not much chance it'll be reigned in any time soon, though.
        
        Re: (Score:2)
        
        by SeaFox ( 739806 ) writes:
        
        Read the article, and enlightenment will hit you like a bolt out of the blue.
        
        Re: (Score:2)
        
        by roman_mir ( 125474 ) writes:
        
        well the details are a bit obscure but I hear that what it is, is billions of tiny droplets of water that are suspended in the air above ground, not sure how that helps with computing though.
        
        Re:Where's your cloud now? (Score:5, Funny)
        
        by L4t3r4lu5 ( 1216702 ) writes: on Friday May 14, 2010 @05:07AM (#32204968)
        
        I'm sorry, I don't get the joke. I must have mist something.
        
        
        Re: (Score:3, Funny)
        
        by StikyPad ( 445176 ) writes:
        
        Don't worry, it's over your head.
It's failure on multiple levels (Score:5, Insightful)

by GilliamOS ( 1313019 ) writes: on Thursday May 13, 2010 @10:57PM (#32203366)

Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection to the power grid, and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.

- Re:It's failure on multiple levels (Score:5, Insightful)
  
  by OnlineAlias ( 828288 ) writes: on Thursday May 13, 2010 @11:35PM (#32203552)
  
  You said it. They failed to test. I design/run datacenters, and have had exactly this kind of thing happen recently. No outage, hardly anyone even noticed. My most critical stuff runs active/active out of multiple data centers...you could nuke one of them and everything would still be up.
  I'm actually a little blown away that the all-powerful Amazon could possibly let this kind of thing happen. They are supposed to be a pro team; a power failure is high school ball.
  
  - Re: (Score:2)
    
    by Itninja ( 937614 ) writes:
    
    No outage, hardly anyone even noticed.
    So, how is this different? A teeny, tiny percentage of users even noticed this and no data was lost. It's foolish to think one's data center is immune to outages (power or otherwise) from time to time, no matter how well it's designed. But apparently this is the latest in several outages over the past few weeks which is kind of like amateur hour.
  - Failure is often not a boolean (Score:5, Interesting)
    
    by mcrbids ( 148650 ) writes: on Friday May 14, 2010 @01:26AM (#32204070) Journal
    
    For years, I co-located at the top-rated 365 Main data center in San Francisco, CA [365main.com] until they had a power failure a few years ago. Despite having 5x redundant power that was regularly tested, it apparently wasn't tested against a *brown out*. So when Pacific Gas and Electric had a brownout, it failed to trigger 2 of the 5 redundant generators. Unfortunately, the system was designed so that any *one* of the redundant generators could fail and there wouldn't be any problem.
    So power was in a brownout condition, the voltage dropped from the usual 120 volts or so down to 90. Many power supplies have brownout detectors and will shut off. Many did, until the total system load dropped to the point where normal power was restored. All of this happened within a few seconds, and the brownout was fixed in just a few minutes. But at the end of it all, there was perhaps 20% of all the systems in the building shut down. The "24x7 hot hands" were beyond swamped. Techies all around the San Francisco area were pulled from whatever they were doing to converge on downtown SF. And me, 4 hours drive away, managed to restore our public-facing services on the one server (of four) I had that survived the voltage spikes before driving in. (Alas, my servers had the "higher end" power supplies with brownout detection)
    And so it was a long chain of almost success of well-tested, high-quality equipment that failed all in sequence because real life didn't happen to behave like the frequently performed tests did.
    When I did finally arrive, the normally quiet, meticulously clean facility was a shambles. Bits of network cable, boxes of freshly purchased computer equipment, pizza boxes, and other refuse were to be found in every corner. The aisles were crowded with techies performing disk checks and chattering tersely on cell phones. It was other-worldly.
    All of my systems came up normally; simply pushing the power switch and letting the fsck run did the trick, we were fully back up and all tests performed (and the system configuration returned to normal) in about an hour.
    Upon reflection, I realized that even though I had some down time, I was really in a pretty good position:
    1) I had backup hosting elsewhere, with a backup from the previous night. I could have switched over, but decided not to because we had current data on one system and we figured it was better not to have anybody lose any data than to have everybody lose the morning's work.
    2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.
    3) At no point did I have any less than two backups off site in two different locations, so I had multiple, recent data snapshots off site. As long as the daisy chains of failure can be, it would be freakishly rare to have all of these points go down at once.
    4) Even with 75% of my hosting capacity taken offline, we were able to maintain uptime throughout all this because our configuration has full redundancy within our cluster - everything is stored in at least 2 places onsite.
    Moral of the story? Never, EVER have all your eggs in one basket.
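The arithmetic behind the 365 Main story above can be sketched in a few lines. This is only an illustration, not the facility's actual design: the count of five generators and the tolerance for a single failure come from the comment, while the per-unit rating (`UNIT_KW`) and the building load (`LOAD_KW`) are made-up numbers chosen so that four units are just enough.

```python
# Hypothetical figures: only the "5 units, 1 may fail" part is from the story.
GENERATORS = 5
UNIT_KW = 500      # assumed rating of each generator
LOAD_KW = 2000     # assumed load, sized so that four units can carry it


def surviving_capacity(total_units, failed_units, unit_kw):
    """Capacity left after some units fail to start."""
    return (total_units - failed_units) * unit_kw


def can_carry_load(total_units, failed_units, unit_kw, load_kw):
    """True if the units that did start can still carry the whole load."""
    return surviving_capacity(total_units, failed_units, unit_kw) >= load_kw


for failed in range(3):
    ok = can_carry_load(GENERATORS, failed, UNIT_KW, LOAD_KW)
    print(f"{failed} generator(s) failed: load carried = {ok}")
```

Under these assumed numbers, an N+1 design shrugs off one failed start and comes up short on the second, which is the pattern the comment describes.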
    
    - Re: (Score:3, Interesting)
      
      by drinkypoo ( 153816 ) writes:
      
      2) I had good quality equipment; the fact that none of my equipment was damaged from the event may have been partly due to the brownout detection in the power supplies of my servers.
      Having had the spade connector that carries power from the jack at the back of a machine in a 1000W power supply fail and apparently (from the pattern of smoke in its case) actually emit flames I can say that brownout protection is definitely worth some money.
  - Re: (Score:2)
    
    by L4t3r4lu5 ( 1216702 ) writes:
    
    you could nuke one of them and everything would still be up.
    How dramatic. The upstream ISP could also disconnect one for lack of payment of bills, but that wouldn't be nearly as exciting as nuking it, would it!
    
    Bravo on designing in redundancy, though. Maybe Amazon is hiring...
- Re:It's failure on multiple levels (Score:5, Informative)
  
  by fractalVisionz ( 989785 ) writes: on Thursday May 13, 2010 @11:41PM (#32203596) Homepage
  
  It seems you didn't RTFM. Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.
  
  - Re:It's failure on multiple levels (Score:5, Insightful)
    
    by TubeSteak ( 669689 ) writes: on Friday May 14, 2010 @02:05AM (#32204252) Journal
    
    Only one switch out of many failed, due to it being set up from the factory incorrectly. The rest of the system switched over properly. I would say that is pretty good considering the data center size and number of switches needed for redundancy.
    Sounds like Amazon's tech monkeys didn't do their job when they received the hardware from the factory.
    Or is it normal to just plug in mission critical hardware and not check that it is set up properly?
    "We have already made configuration changes to the switch which will prevent it from misinterpreting any similar event in the future and have done a full audit to ensure no other switches in any of our data centers have this incorrect setting," Amazon reported.
    I guess TFA answered that question.
    If they're smart, they'll be creating policies for those types of audits to be done up front instead of after a failure.
    
- Re: (Score:2)
  
  by ToasterMonkey ( 467067 ) writes:
  
  From the summary I gathered the problem was with the mechanical switch that disconnects external power when the generators are brought online, not a lack of capacity. Still requires testing, but it isn't going to be done often because isn't this the result when the power doesn't transition smoothly?
  - Re: (Score:2)
    
    by profplump ( 309017 ) writes:
    
    It is, but you can test during pre-defined maintenance windows when downtime is expected, or you can migrate active services to other hosts and leave these running as backups during the test, so that a failure does not bring down the primary.
  - Re: (Score:2)
    
    by omglolbah ( 731566 ) writes:
    
    It is much better to have a scheduled test with people ready to take care of any issues that may or may not pop up than to have a piece of equipment fail at a random time with few prepared...
    I sure as hell would rather have a blip every now and then than knowing that the system might fail catastrophically when something unexpected happens..
- Cloud is a poor metaphor anyway (Score:2)
  
  by dbIII ( 701233 ) writes:
  
  It's also completely expected by those not sold on pure science fiction.
  All it can take is a backhoe in the wrong place at the wrong time or an anchor cable dragging to cut you off from the very real single or two bits of infrastructure that people fantasise is their bit of the "cloud".
- Re: (Score:2)
  
  by RoFLKOPTr ( 1294290 ) writes:
  
  Amazon for not load-testing their emergency backup power on a regular basis, not having more than one connection to the power grid, and the power grid for not having redundancies.
  It's not a matter of testing. These systems aren't things that you can just "test", because what if there is a problem? Then you have intentionally shut off power to your entire datacenter. Otherwise you could have scheduled downtime and just assume everything will fail, so have everybody shut off their servers in advance just in case, but then how often can you do that?
  No, it's a problem with the fundamental design of the power backup systems. I know somebody in charge of the electrical end of constructing
- Re: (Score:3, Insightful)
  
  by DerekLyons ( 302214 ) writes:
  
  Amazon for not load-testing their emergency backup power on a regular basis
  And you know they don't test it how? Oh, right. Testing is a magic wand that solves everything - except it doesn't. I've seen stuff fail literally seconds after being successfully tested. Welcome to the real world.
  and the power grid for not having redundancies. Our aging power grid is really beginning to show on so many levels that this is going to become a lot more common over the coming years.
  Horseshit. This has nothin
- - Re:It's failure on multiple levels (Score:5, Interesting)
    
    by GaryOlson ( 737642 ) writes: <slashdot AT garyolson DOT org> on Thursday May 13, 2010 @11:52PM (#32203656) Journal
    
    Most Americans these days are over-pampered self-absorbed malcontents. If the poles are not out in front where crews can service without going on property -- or even using predefined right of ways -- too many people complain or sue for negligible property damage.
    
    Where I grew up, the power poles ran on the property lines behind and between the houses. Once, lightning took out the transformer on the power pole [great light show and high speed spark ejection] ; and people were willing to take down the fence, put the dogs in a kennel, and remove landscaping which had encroached on the power pole so the crew could replace the transformer and other service. Today, I expect everyone shows up with a digital camera to document "property damage" to file for compensation for landscaping which has illegally encroached on the equipment.
    
    Many places various issues prevent burying the power cable: high water table, daytime temperatures which do not cool the ground -- and the power cables, or even fire ants.
    
  - Re: (Score:2)
    
    by jibjibjib ( 889679 ) writes:
    
    I'm in Melbourne, Australia and we have almost all our power cables above ground.
  - - Re: (Score:2)
      
      by afidel ( 530433 ) writes:
      
      Very true, especially since burying the kind of cables that a datacenter requires is very expensive and can lead to some interesting failure modes that don't happen with tower mounted high voltage lines =)
Obvious solution (Score:5, Funny)

by nebaz ( 453974 ) writes: on Thursday May 13, 2010 @10:59PM (#32203376)

Utility poles clearly need countermeasures. Hellfire missiles and such. That'll teach 'em to mess with a poor defenseless pole.

Share
twitter facebook
- Re: (Score:2, Informative)
  
  by binarylarry ( 1338699 ) writes:
  
  Think of the poor strippers man!
  - Re: (Score:2)
    
    by FooAtWFU ( 699187 ) writes:
    
    The strippers favor a pole tax, actually. [nypost.com]
  - - Re:stupid mods, trickz are for kidz (Score:4, Interesting)
      
      by Coopjust ( 872796 ) writes: on Friday May 14, 2010 @12:26AM (#32203810)
      
      Often, mods will give a funny post "insightful" instead of "funny" because it gives the user positive karma (whereas funny does not affect karma). Not a use intended by CmdrTaco, I'd imagine, but it's a common practice.
      
- Re: (Score:2, Insightful)
  
  by wronskyMan ( 676763 ) writes:
  
  You know who else messed with poor defenseless Poles?
What are you doing, Dave? (Score:2)

by Bob_Who ( 926234 ) writes:

Stop driving like a dork, Dave...I'm getting sleepy...
An untested DR plan is a worthless DR plan (Score:4, Interesting)

by realmolo ( 574068 ) writes: on Thursday May 13, 2010 @11:06PM (#32203418)

Seriously, Amazon screwed up in a fairly major way with this.
What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?
Answer: Nothing. You'd be surprised how many US small-to-medium-sized businesses are one fire/tornado/earthquake/hurricane away from bankruptcy. I'd bet it's over 80% of them.

- Re: (Score:2)
  
  by FictionPimp ( 712802 ) writes:
  
  The place I work just had the exact same problem. DC caps went bad and nobody noticed. Power went out and the backup didn't have enough juice to let the batteries kick in and move to the generator. At least I don't feel really bad now, just bad.
  - Re: (Score:2)
    
    by afidel ( 530433 ) writes:
    
    That's what maintenance contracts with at least biannual preventative maintenance are for =)
    - Re: (Score:2)
      
      by FictionPimp ( 712802 ) writes:
      
      Yea, our fault was that we let our maintenance department handle the power. They apparently let everything slide. It's back under IT's control now. The thing had been emailing them failed test messages for some time.
- Re:An untested DR plan is a worthless DR plan (Score:4, Insightful)
  
  by Albanach ( 527650 ) writes: on Friday May 14, 2010 @12:02AM (#32203712) Homepage
  
  Seriously, Amazon screwed up in a fairly major way with this.
  What's more upsetting is this: If Amazon doesn't have working disaster recovery, what do other websites/companies have?
  What on earth leads you to suggest they don't have working disaster recovery? They experienced some disparate power outages and say they're implementing changes to improve their power distribution.
  I've hosted in data centers where the UPS was regularly tested, yet on a real live incident switchover failed. Even though the UPS did come up there was a brief outage shutting down all the racks. Each rack needs brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.
  I've hosted in another data center where someone hit the BIG RED BUTTON underneath the plastic case, cutting off power to the floor.
  I'm sure Amazon could have done things better and will learn lessons. That's life in a data center.
  Nonetheless, Amazon allows you to keep your data at geographically diverse locations. As a customer you can pay the money and get geographic diversity that would have mitigated this. If you don't take advantage of that, you can hardly blame Amazon for your decision.
  
  - Re: (Score:2)
    
    by crazybit ( 918023 ) writes:
    
    What on earth leads you to suggest they don't have working disaster recovery?
    The fact that their service was partially cut due to a power failure. We know accidents DO happen and power failures DO happen, like the explosion in The Planet's power control room [1]. The guys at Amazon's cloud should be prepared for predictable problems like a power outage, especially when one of their selling arguments is service continuity.
    
    [1] [datacenterknowledge.com]
  - Re: (Score:2)
    
    by TubeSteak ( 669689 ) writes:
    
    I've hosted in data centers where the UPS was regularly tested, yet on a real live incident switchover failed. Even though the UPS did come up there was a brief outage shutting down all the racks. Each rack needs brought back online one at a time to prevent overloading. Immediately you're looking at significant downtime.
    Doesn't the "U" in "UPS" stand for "Uninterruptible"?
    Soooo.. Forgive my ignorance, but how does hardware hooked up to a UPS have "a brief outage"?
    - Re:An untested DR plan is a worthless DR plan (Score:5, Interesting)
      
      by thegarbz ( 1787294 ) writes: on Friday May 14, 2010 @04:14AM (#32204754)
      
      It is exactly that level of understanding that can cause most outages (and even failures of safety critical systems). There is one part of the UPS that is uninterruptible and that is the voltage at the battery. Between the voltage at the battery and the computer you have cables, electronics, control systems, charging circuits, and inverters. Beyond that if it's an industrial sized UPS there'll be circuit breakers, distribution boards, and other such equipment, each adding failure modes to the "uninterruptible" supply.
      I'll give you an example of what went wrong at my work (a large petro chemical plant in Australia). Like a lot of plants most pumps are redundant, and fed from two different sub stations, that doesn't prevent loss of power but the control circuits in those sub stations run from 24V. Those 24V come from two different cross linked UPS units (cross linked meaning that both redundant boards are fed from both redundant UPS). So in theory not only is there a backup at the plant, backup substations, and backup UPSs but in theory any component can fail and still keep upstream and downstream systems redundant.
      Anyway we had to take down one of the UPS for maintenance reasons following a procedure we'd used plenty of times before. The procedure is simple: 1. Check the current in the circuit breakers so that the redundant breakers can withstand the load, 2. close the circuit breakers upstream of the UPS that is being shut down, 3. Close main isolator to the UPS. So that's exactly what we did, and when we isolated one of the UPS, the upstream circuit breaker tripped from the OTHER UPS and control power was lost to half the plant as it was now effectively not only isolated from battery backup, but from the main 24V supply.
      So after lots of head scratching we did some thermal imagery of the installation. The circuit breaker which tripped in sympathy when we took down its counterpart was running significantly hotter than the main one. The cause was determined to be a loose wire. So even though the load through the circuit breaker was much less than 1/2 of the total load, when we took down the redundant supply and the circuit breaker got loaded, the temperature pushed it over the edge.
      A carefully designed dually redundant UPS system providing 4 sources of power failed when we took down 2 of them in a careful way due to a loose wire in a circuit breaker. A UPS is never truly uninterruptible, and even internal batteries in servers would be protected by a fuse of some kind to ensure the equipment goes down, but ultimately survives a fault
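One way to picture the breaker failure described above is as a pre-check that compares the load against the nameplate rating only. The sketch below is a rough model with hypothetical numbers throughout: the ratings, the load, and especially the `derating` factor (standing in for a breaker that is already running hot because of a loose connection) are assumptions for illustration, not values from the plant.

```python
def survives_failover(total_load_a, rated_a, derating=1.0):
    """True if one breaker can carry the full load after its twin is isolated.

    derating is a hypothetical factor below 1.0 for a breaker whose
    effective trip point has dropped, for example because it already runs hot.
    """
    effective_trip_a = rated_a * derating
    return total_load_a <= effective_trip_a


TOTAL_LOAD_A = 80   # assumed combined load normally shared by two breakers
RATED_A = 100       # assumed nameplate rating of each breaker

# The paper check from the procedure: full load against the nameplate rating.
print("nameplate check passes:", survives_failover(TOTAL_LOAD_A, RATED_A))

# The same check with an assumed 30% derating for the overheated breaker.
print("hot breaker holds:", survives_failover(TOTAL_LOAD_A, RATED_A, derating=0.7))
```

The point of the model is that step 1 of the procedure can pass on paper while the real trip point, which nobody measured until the thermal imaging, sits below the load the breaker is about to inherit.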
      
- Re: (Score:2)
  
  by AK Marc ( 707885 ) writes:
  
  Same with data backups. People just put in untested redundancy, a backup program that says "completed" and live happy. At least until the first time something fails.
  
  Testing costs time and money. It's easier to point to a job status that says "completed" or an invoice for the right pieces and say "it was the vendor's fault."
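The gap between a job that says "completed" and a backup that actually restores can be closed with a cheap check. The following is a minimal sketch, assuming a file-level backup: it copies a file, restores it, and compares SHA-256 digests. The file names and the use of `shutil.copyfile` as a stand-in for the real backup and restore jobs are illustrative assumptions.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_path, restored_path):
    """True only if the restored file is byte-for-byte identical."""
    return sha256_of(original_path) == sha256_of(restored_path)


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "records.db")
    backup = os.path.join(workdir, "records.db.bak")
    restored = os.path.join(workdir, "records.db.restored")
    with open(original, "wb") as handle:
        handle.write(b"example records\n")
    shutil.copyfile(original, backup)    # stand-in for the backup job
    shutil.copyfile(backup, restored)    # stand-in for a test restore
    print("restore verified:", verify_restore(original, restored))
```

A check like this only proves something if it runs against a real restore from the backup medium, not against the source data the backup job just read.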
- Re: (Score:2)
  
  by lena_10326 ( 1100441 ) writes:
  
  Amazon has an insane amount of redundancy with dozens of physical data centers spread over the world. They regularly perform game day disaster scenarios taking out entire data centers to test the recovery of the infrastructure and Amazon applications.
  In this instance, you'll note only a few clients were impacted because a switch had an incorrect configuration. There is not much you can do about some types of human errors, which can come from all sorts of unexpected angles. Regardless, a number of EC2 nodes were l
- Re: (Score:2)
  
  by afidel ( 530433 ) writes:
  
  Uh, this is why Amazon tells you up front that if you want true HA you have to have VMs in multiple zones to assure that they are served from different datacenters with no shared point of failure.
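The multi-zone advice above boils down to a simple placement rule: no single zone may hold so many instances that losing it drops you below what you need. The sketch below checks that rule for a list of placements. It does not call any Amazon API; the zone names and the `survives_zone_loss` helper are hypothetical.

```python
from collections import Counter


def survives_zone_loss(placements, min_instances):
    """True if losing any one zone still leaves at least min_instances running.

    placements is a list with one zone name per running instance.
    """
    if not placements:
        return min_instances <= 0
    per_zone = Counter(placements)
    total = len(placements)
    return all(total - count >= min_instances for count in per_zone.values())


# Hypothetical deployments that each need at least one instance alive.
single_zone = ["zone-a", "zone-a"]
two_zones = ["zone-a", "zone-b"]

print("single zone survives:", survives_zone_loss(single_zone, 1))
print("two zones survive:", survives_zone_loss(two_zones, 1))
```

Two instances in one zone look redundant on a diagram but fail this check, which is roughly the situation of the customers who lost service when one datacenter lost power.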
UPS's (Score:5, Interesting)

by MichaelSmith ( 789609 ) writes: on Thursday May 13, 2010 @11:14PM (#32203446) Homepage Journal

The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
It also blew the power supply on an alphaserver and put a nice burn mark in the breaker panel. So the UPS guy comes out and he doesn't have two of the right sort of fuse. Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc. So we got going in the end.

- Re:UPS's (Score:5, Funny)
  
  by seanvaandering ( 604658 ) writes: <(sean.vaandering) (at) (gmail.com)> on Thursday May 13, 2010 @11:18PM (#32203462)
  
  Get that guy out of your datacenter pronto... no one can be THAT bloody unlucky in one shot.
  
- Re: (Score:2)
  
  by mystik ( 38627 ) writes:
  
  Funny -- Something almost exactly like that happened @ my datacenter last year. Some 'licensed' electrician accidentally shorted something in the main Power junction, which took the whole damn thing offline, generators, batteries and all. We were down 1+ hrs while they had to have the techs come on site to ensure that things were safe and online. Meanwhile, a small group of admins just outside w/ pitchforks (myself included) were waiting for the all clear to swarm the datacenter to get our equipment back
  - Re: (Score:3, Informative)
    
    by omglolbah ( 731566 ) writes:
    
    Just be glad nobody got killed...
    Shorting out something in a main power junction could easily have created a fairly nasty fire...
- Re: (Score:2)
  
  by seanadams.com ( 463190 ) * writes:
  
  shorted the white phase to ground
  
  What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?
  - Re: (Score:3, Informative)
    
    by MichaelSmith ( 789609 ) writes:
    
    shorted the white phase to ground
    What the hell is the "white phase"? Unless I am missing some newfangled data-center lingo, you are talking about the neutral, which is not a "phase" at all, and could never produce such a fault current when "shorted" to ground since it is already tied to ground at the panel. Am I missing something?
    You have three actives (red, white, dark blue here in .AU), a neutral and an earth. The wikipedia page says different countries have different color codes so maybe that is the confusion.
    - Re:UPS's (Score:4, Informative)
      
      by seanadams.com ( 463190 ) * writes: on Friday May 14, 2010 @05:02AM (#32204948) Homepage
      
      The hots are black, red, and blue (in that order of prevalence) in the US.
      
- Re: (Score:2)
  
  by ShakaUVM ( 157947 ) writes:
  
  The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
  To be perfectly frank, I'm a bit scared of what sort of security system your datacenter has when the system can cause a blowout of that ma
  - Re: (Score:2)
    
    by MichaelSmith ( 789609 ) writes:
    
    The classic in my last job was when we had a security contractor in on the weekend hooking something up and he looped off a hot breaker in the computer room, slipped, and shorted the white phase to ground. This blew the 100A fuses both before and after the UPS and somehow caused the generator set to fault so that while we had power from the batteries, that was all we had.
    To be perfectly frank, I'm a bit scared of what sort of security system your datacenter has when the system can cause a blowout of that magnitude.
    I don't know what they were up to. Probably just installing a new proximity card reader or something and they wanted to use UPS power.
    The facilities department there had their own rules. Once they installed partitions in the computer room and they had a guy grinding aluminium so little particles sprayed on our monitors and fell into the ventilation slots.
    Parts of the security system would not let you out of the building or into the working areas without operator intervention. You could be stuck in the sta
  - Re: (Score:3, Informative)
    
    by MichaelSmith ( 789609 ) writes:
    
    Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.
    Strips of steel with holes in them? You're kidding, right?
    No. It would be 50*15*5 mm steel with a 10mm hole drilled in each end. A bolt goes through each hole into a threaded attachment point.
    Now that you mention it I recall that a four inch nail is good for 100A slow blow, but that's cylindrical so it conducts nicely. You'd think the rectangular cross section would not conduct quite as well (sharp corners, etc) but maybe it is also tuned for the desired current. A little saw cut half way between the holes would do that.
    - Re: (Score:2)
      
      by dbIII ( 701233 ) writes:
      
      Steel is an incredibly bad conductor as metals go - just think of a blacksmith safely holding a bit of steel that is red hot at the other end if you've never done that yourself. Electrical conductivity is related to thermal conductivity.
      - Re: (Score:2)
        
        by Black Gold Alchemist ( 1747136 ) writes:
        
        Only in metals [wikipedia.org]. In other systems, quantum mechanics takes over.
  - Not really (Score:5, Informative)
    
    by Sycraft-fu ( 314770 ) writes: on Friday May 14, 2010 @12:00AM (#32203702)
    
    All a fuse is is a piece of metal that will melt fairly quickly when a given amount of current is passed through it. Idea being that it heats up and melts before the wires can. So, the bigger the current, the more robust the metal connecting it. A 100A fuse is usually a fairly large strip of steel.
    Now I'll admit that just grabbing an approximate size of steel and placing it in as the GP did isn't going to yield a nice precise fuse. It may have been too high a current. However, it'd work for getting things running again and probably provide a modicum of protection in the event of a short.
    
  - Re: (Score:2)
    
    by grcumb ( 781340 ) writes:
    
    Fortunately 100A fuses are just strips of steel with two holes drilled in them and he had a file, and a drill, etc.
    Strips of steel with holes in them? You're kidding, right?
    Yeah, so what? I mean what could possibly go wro
Unreasonable expectations (Score:5, Interesting)

by KGBear ( 71109 ) writes: on Thursday May 13, 2010 @11:19PM (#32203470) Homepage

I expect this is just a scaled up version of the problems I deal with every day. And I'm sure I'm not the only one. Users have grown so dependent on system services and management has grown so apart from the trenches that completely unreasonable expectations are the norm. Where I work for instance it's almost impossible to even *test* backup power and failover mechanisms and procedures because users consider even minor outages in the middle of the night unacceptable and managers either don't have the clout or don't understand the problem well enough to put limits to such expectations. As a result often times the only tests such systems get happen during real emergencies, when they are actually needed. I don't know how, but I feel we should start educating our users and managers better, not to mention being realistic about risks and expectations.

- Re: (Score:2)
  
  by SheeEttin ( 899897 ) writes:
  
  users consider even minor [test] outages in the middle of the night unacceptable
  ...and this is why we have redundancy.
  Test the backup hardware. Works? Switch over to it, test the main hardware. Works? All good, no (or negligible) downtime.
  - Re: (Score:2)
    
    by omglolbah ( 731566 ) writes:
    
    Yes, but what the management probably worries about is "what if the redundant system fails while you are testing the primary?".
    So they won't let us lowly engineers do the test... opting instead for the chance of a disaster...
    I'm glad I work in the oil business... safety is ALWAYS the most important thing... since any failure will be horribly expensive :-p
Re: (Score:2, Informative)

by account_deleted ( 4530225 ) writes:

Comment removed based on user account deletion
- Re: (Score:3, Insightful)
  
  by MichaelSmith ( 789609 ) writes:
  
  Stop building those things so fucking close to the roads, maybe?
  What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
  - Re:Hurrr, durrr (Score:5, Funny)
    
    by plover ( 150551 ) * writes: on Friday May 14, 2010 @12:55AM (#32203952) Homepage Journal
    
    What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
    That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.
    
    - Re: (Score:3, Funny)
      
      by MichaelSmith ( 789609 ) writes:
      
      What about your power supply? Is that not allowed to go along a road? I am all for underground power BTW but I know that if you operate a digger and you want to find the owner of a cable the easiest way is to break it and wait for the complaints.
      That's also the fastest way to get rescued off a desert island or out in the woods, and why you should always carry a piece of fiber in your pocket. Should you get stranded, you simply bury the fiber, and some asshole with a backhoe will be along in about five minutes to cut it. Ask him to rescue you.
      In that same job we had a bunch of CCTV cameras on St Kilda road in Melbourne right outside the arts center. It's a mess of tram gear and traffic signals, and there is a lot of fibre under the road.
      Ever stuck your fork into a plate of spaghetti, then spun it around? This guy had to bore a hole straight down right in the middle of the road. There is a number where you can "dial before you dig" but he omitted that. He wound up with ~50 metres of fibre wrapped around his borer. Quite a mess.
Transfer switches suck? (Score:3, Interesting)

by pavera ( 320634 ) writes: on Thursday May 13, 2010 @11:30PM (#32203528) Homepage Journal

The DC that my company colos a few racks in had this same thing happen about a year ago (not a car crash, just a transformer blew out). But the transfer switch failed to switch to backup power, and the DC lost power for 3 hours.
What is up with these transfer switches? Do the DCs just not test them? Or is it the sudden loss of power that freaks them out vs a controlled "ok we're cutting to backup power now" that would occur during a test? Someone with more knowledge of DC power systems might enlighten me...

- Re: (Score:2)
  
  by jroysdon ( 201893 ) writes:
  
  I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), one on transfer switch A and the other on transfer switch B, each connecting to different backup power sources. Further, these should be tested on a regular basis (at least monthly). Test transfer switch A on the 1st, and transfer switch B on the 15th.
  Seems like a design flaw here and/or someone was just being cheap.
  - Re: (Score:2)
    
    by aaarrrgggh ( 9205 ) writes:
    
    The problem usually isn't the transfer switch itself, but how it works with everything else. Transfer switches usually only really fail with contact damage.
    Cascading failures are a bigger problem for most co-lo's, as they try and maximize infrastructure utilization to a fault.
    Restoring power can be quite difficult.
  - Re: (Score:3, Interesting)
    
    by Technonotice_Dom ( 686940 ) writes:
    
    I don't get why you wouldn't have dual-redundant power supplies on all devices (routers, switches, servers), .... [snip]
    Seems like a design flaw here and/or someone was just being cheap.
    It would be the latter. The AWS EC2 instances aren't marketed or intended to be high availability individually. They're designed to be cheap and Amazon do say instances will fail. They provide a good number of data centres and specifically say that systems within the same zone may fail - different data centres are entirely independent. They provide a number of extra services that can also tolerate the loss of one data centre.
    Anybody who believes they're getting highly available instances hasn't done a b
- Re: (Score:3, Interesting)
  
  by Renraku ( 518261 ) writes:
  
  It's not really the DC power system that's the issue.
  The people are the issue.
  Example: You're the lead technician for a new data center. You request backup power systems be written into the budget, and are granted your wish. You install the backup power systems and then ask to test them. Like a good manager, your boss asks you what that will involve. You say that it'll involve testing the components one by one, which he nods in agreement with. However, when you get to the 'throw the main breaker and s
- Re: (Score:3, Interesting)
  
  by Jeffrey Baker ( 6191 ) writes:
  
  The answer is "yes". Transfer switches often fail and are rarely tested. This is also true of other power equipment. If it's rarely used, the probability of it working in an emergency is somewhat low.
  However, in this case the transfer switch worked fine, but it had been misconfigured by Amazon technicians. According to their status email from yesterday (posted in their AWS status RSS feed) the outage was a result of the fact that one transfer switch had not been loaded with the same configuration as the
Oil's Well (Score:3, Insightful)

by Aeonite ( 263338 ) writes: on Thursday May 13, 2010 @11:30PM (#32203530) Homepage

It's a good thing that oil rigs are better managed than data centers. Who knows what might happen if one of them ever had a problem like this?

- Re: (Score:2)
  
  by omglolbah ( 731566 ) writes:
  
  Yep, but what is important to keep in mind is that if an oil rig has to shut down for a day due to a power issue the oil will still be in the ground.
  The company might lose money due to having promised a certain supply (especially with gas!) but the resource is not lost.
  In a datacenter the uptime is all there is. Value is lost.
  The oil rigs in Norwegian waters are fairly secure when it comes to power faults. If the system cannot guarantee power it goes into a shutdown sequence to set everything in a "safe" po
    - Re: (Score:2)
      
      by DNS-and-BIND ( 461968 ) writes:
      
      Who knows what might happen if one of them ever had a problem like this?
      *WHOOSH* He's not talking about the BP oil spill.
- Re: (Score:3, Informative)
  
  by mjwx ( 966435 ) writes:
  
  It's a good thing that oil rigs are better managed than data centres. Who knows what might happen if one of them ever had a problem like this?
  I have a friend who is an engineer on one of the projects in the North West Shelf (of Western Australia). A few weeks back he asked, "How can they build a rig in the Gulf of Mexico for one third of our costs?" Two days later one blew up and he got his answer.
I'm confused (Score:4, Funny)

by OverlordQ ( 264228 ) writes: on Thursday May 13, 2010 @11:55PM (#32203676) Journal

Why couldn't they just get power from the cloud?

- Re: (Score:2)
  
  by MichaelSmith ( 789609 ) writes:
  
  The cloud should have a faxing service so I can get free paper from my fax machine.
- Re: (Score:2)
  
  by martin-boundary ( 547041 ) writes:
  
  Why couldn't they just get power from the cloud?
  
  The Indian who usually does the Rain and Lightning Dance was on vacation.
- Re: (Score:2)
  
  by roman_mir ( 125474 ) writes:
  
  They needed 1.21 Gigawatts and the cloud had it, but the Delorean didn't start this time.
Totally Unexpected (Score:2)

by NicknamesAreStupid ( 1040118 ) writes:

Who, while driving through a cloud, would ever expect to hit a utility pole? Clouds do not have utility poles. Now, tule fog has utility poles. That is not why they call it 'tule' (not a nickname for utility, but for a grass), but many a utility pole has been unduly undone because someone drove through the tule fog and into the utility pole.

If Amazon is going to put utility poles in its 'cloud', then they are really in a fog. Call it fog computing.
Who Cares? (Score:2, Interesting)

by Aerosiecki ( 147637 ) writes:

Doesn't EC2 let you request hosts in any of several particular datacentres (which they call "availability zones") just so you can plan around such location-specific catastrophes? No matter how good the redundant systems, some day a meteor will hit one datacentre and you'll be S.O.L. no matter what if you put all your proverbial eggs in that basket.
Only a fool cares about a single-datacentre outage. This is why it's called "*distributed*-systems engineering", folks.
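
For anyone who wants the eggs-and-baskets point spelled out, here is a minimal sketch in Python. It is not Amazon's API or anything from the status report: the zone names, instance names, and both helper functions are made up for illustration. It just spreads hosts round-robin over several zones and shows what is left when one zone goes dark.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign each instance to a zone in round-robin order (hypothetical helper)."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


def survivors_after_zone_loss(placement, failed_zone):
    """Return the instances still running if one whole zone loses power."""
    return [inst for inst, zone in placement.items() if zone != failed_zone]


# Zone and instance names below are invented for the example.
zones = ["zone-a", "zone-b", "zone-c"]
instances = [f"web-{n}" for n in range(6)]

placement = spread_across_zones(instances, zones)
print(Counter(placement.values()))                     # instances per zone
print(survivors_after_zone_loss(placement, "zone-a"))  # what is left after an outage
```

With everything in one zone the survivor list is empty; with three zones, losing one still leaves roughly two thirds of the hosts, provided the application can actually run on what remains.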
- Re:Redundancies, Redundancies (Score:5, Insightful)
  
  by mirix ( 1649853 ) writes: on Friday May 14, 2010 @01:06AM (#32204008)
  
  Redundancy costs money. If it costs more than downtime, you don't get it.
  
- Re: (Score:2)
  
  by Jeian ( 409916 ) writes:
  
  It has nothing to do with "the cloud", other than that the datacenter affected happened to host one. It could've been a dedicated server and it would've had the same problem.
- Re: (Score:2)
  
  by jamesh ( 87723 ) writes:
  
  Is it actually worse than maintaining your own server room though? I test our UPS every 6 months and it works flawlessly every time, but the last two power failures caused it to drop the load instantly. Nothing is perfect, the difference with the cloud is when it goes wrong it can go wrong on a large scale and you are more likely to read about it in the news.
  - Re: (Score:2)
    
    by Richard_at_work ( 517087 ) writes:
    
    At my last job, our UPS worked flawlessly - until a fan needed replacing, and the switch back from maintenance bypass to protected flow caused a massive overvolt condition on two of the phases, killing a large chunk of our switches, PCs and a lot of redundant power supplies in the servers.
    
    When safety equipment goes bad, there's not a lot you can really do.
