
Explosion At ThePlanet Datacenter Drops 9,000 Servers

An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.

  • by QuietLagoon ( 813062 ) on Sunday June 01, 2008 @02:12PM (#23618761)
    ... for posting frequent updates to the status of the outage.
  • by PPH ( 736903 ) on Sunday June 01, 2008 @02:26PM (#23618855)

    Being in the power systems engineering biz, I'd be interested in some more information on the type of building (age, original occupancy type, etc.) involved.

    To date, I've seen a number of data center power problems, from fires to "isolated, dual-source" systems that turned out not to be. It raises the question of how well the engineering was done for the original facility, or for the refit of an existing one, and whether proper maintenance was carried out.

    From TFA:

    electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding their electrical equipment room.
    Properly designed systems should never allow a fault to become uncontained in this manner.
  • by ChowRiit ( 939581 ) on Sunday June 01, 2008 @02:27PM (#23618867)
    Only a few people need to have a lot of servers for there to be 18 servers for every 15 customers. To be honest, I'm surprised the ratio is so low; I would have guessed most hosting in a similar environment would be by people who'd want at least 2 servers for redundancy/backup/speed reasons...
  • Explosion? (Score:4, Insightful)

    by mrcdeckard ( 810717 ) on Sunday June 01, 2008 @02:30PM (#23618891) Homepage

    The only thing that I can imagine that could've caused an explosion in a datacenter is a battery bank (the data centers I've been in didn't have any large A/C transformers inside). And even then, I thought that the NEC had some fairly strict codes about firewalls, explosion-proof vaults and the like.

    I just find it curious, since it's not unthinkable that rechargeable batteries might explode.

    mr c
  • Re:Recovery costs (Score:5, Insightful)

    by macx666 ( 194150 ) * on Sunday June 01, 2008 @02:30PM (#23618893) Homepage

    Not to mention the cost of pulling all those consultants in, overnight, on a weekend...

    Also, only the electrical equipment (and structural stuff) was damaged - networking and customer servers are intact (but without power, obviously).
    I read that they pulled in vendors. Those types would be more than happy to show up at the drop of a hat for some un-negotiated products that insurance will pay for anyway, and they'll even throw in their time for "free" so long as you don't dent their commission.
  • by Hijacked Public ( 999535 ) on Sunday June 01, 2008 @02:30PM (#23618895)
    Probably less traditional explosion and more Arc Flash [electricityforum.com].
  • by 42forty-two42 ( 532340 ) <bdonlan@@@gmail...com> on Sunday June 01, 2008 @02:38PM (#23618971) Homepage Journal
    Wouldn't people who want such redundancy consider putting the other server in another DC?
  • by Anonymous Coward on Sunday June 01, 2008 @02:42PM (#23619009)
    I have 5 servers. Each of them is in a different city, on a different provider. I had a server at The Planet in 2005.

    I feel bad for their techs, but I have no sympathy for anyone who's single-sourced; they should have propagated to their offsite secondary.

    Which they'll be buying tomorrow, I'm sure.

  • by p0tat03 ( 985078 ) on Sunday June 01, 2008 @02:42PM (#23619011)
    ThePlanet is a popular host for hosting resellers. Many of the no-name shared hosting providers out there host at ThePlanet, amongst other places. So... Many of these customers would be individuals (or very small companies), who in turn dole out space/bandwidth to their own clients. The total number of customers affected can be 10-20x the number reported because of this.
  • by wirelessbuzzers ( 552513 ) on Sunday June 01, 2008 @02:45PM (#23619053)
    I'm guessing that most of the customers are virtual-hosted, and therefore have only a fraction of a server, but some customers have many servers.
  • by larien ( 5608 ) on Sunday June 01, 2008 @02:47PM (#23619065) Homepage Journal
    It's probably less effort to spend a few minutes updating a forum than it would be to man the phones against irate customers demanding their servers be brought back online.
  • a bit wrong (Score:3, Insightful)

    by unity100 ( 970058 ) on Sunday June 01, 2008 @02:57PM (#23619155) Homepage Journal
    It's not the 'no name' hosting resellers who host at ThePlanet. No-name resellers don't run an entire server; they just use a WHM reseller panel handed out by a company that does host servers there.
  • Re:Explosion? (Score:4, Insightful)

    by RGRistroph ( 86936 ) <rgristroph@gmail.com> on Sunday June 01, 2008 @03:07PM (#23619221) Homepage
    Haven't you ever seen one of those gray, garbage-can-sized transformers on a pole explode? I used to live in a neighborhood right across the tracks from some sort of electrical switching station; they had rows of those things in a lot covered with white gravel. Explosions violent enough to feel like a grenade going off a hundred yards away were not uncommon. I think most of them were simply the arcing of high voltage vaporizing everything and producing a shock wave, but sometimes the can-type transformers that are filled with cooling oil exploded and the burning oil sprayed everywhere.

    At one place I worked, every lightning storm my boss would rush to move his shitty old truck underneath the can on the power pole, hoping the thing would blow and burn it so he could get insurance to replace it.
  • by ottawanker ( 597020 ) on Sunday June 01, 2008 @03:14PM (#23619289) Homepage

    so you're agreeing with me. The servers getting blown up was a huge mistake, one that certainly could have been avoided with a little proper planning.
    you are a fucking moron

  • by QuietLagoon ( 813062 ) on Sunday June 01, 2008 @03:22PM (#23619347)
    man the phones against irate customers

    It does not sound like the type of company that thinks of its customers as an enemy, as your message implies.

  • by billcopc ( 196330 ) <vrillco@yahoo.com> on Sunday June 01, 2008 @03:37PM (#23619449) Homepage
    What? I run 4 servers myself. At the small firm I work for, we run maybe 70-80 boxes in our cage.

    In fact I find it odd that this facility has so many individual customers. Seems like a lot of administrative overhead... If I were running that DC, I'd much rather lease out full or half racks than individual units, and let those people sublet to the small fry.

    That's how most of the big hosting companies operate. They don't own their own datacenters; they just lease a cage or two, cram it full of gear, and sell you that godawful oversold web space you love to hate. That's also why colocating a single server can be so goddamned expensive: datacenters set per-unit pricing high to scare away the Joe Blows, and the resellers make a lot more money selling crap hosting than subletting their precious space. This is especially true in the USA/Canada.

  • by cowscows ( 103644 ) on Sunday June 01, 2008 @03:45PM (#23619517) Journal
    I think it depends on just how mission critical things are. If your business completely ceases to function if your website goes down, then remote redundancy certainly makes a lot of sense. If you can deal with a couple of days with no website, then maybe it's not worth the extra trouble. I'd imagine that a hardware failure confined to a single server is more common than explosions bringing entire data-centers offline, so maybe a backup server sitting right next to it isn't such a useless idea.

  • _The_ Power Room? (Score:3, Insightful)

    by John Hasler ( 414242 ) on Sunday June 01, 2008 @04:13PM (#23619691) Homepage
    > ...they claim redundant power...

    How the hell could they claim redundant power with only one power room?
  • by njcoder ( 657816 ) on Sunday June 01, 2008 @04:27PM (#23619793)

    I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
    But not cheaper than losing 7,500 accounts to another DC that can handle this type of event gracefully. The fact that it's complex doesn't mean you shouldn't expect it in a data center that claims to be "World Class".

    In related news, I was wondering why I wasn't getting much spam today and my sites didn't have strange spiders hitting them.
  • by aaarrrgggh ( 9205 ) on Sunday June 01, 2008 @04:34PM (#23619837)
    This isn't that uncommon with a 200 kAIC board with air power breakers set up with instantaneous delays, if there is a bolted fault. Newer insulated-case breakers all have an instantaneous override which will limit fault energy.

    The other possibility was that a tie was closed and the breakers over-dutied and could not clear the fault.

    Odd that nobody was hurt though; spontaneous shorts are very rare-- most involve either switching or work in live boards, either of which would kill someone.
  • by Anonymous Coward on Sunday June 01, 2008 @05:30PM (#23620317)
    If you host all DNS servers for your customers' domains in the same data center, you better have excellent support staff to make up for this rookie mistake.
  • by aronschatz ( 570456 ) on Sunday June 01, 2008 @05:42PM (#23620411) Homepage
    ThePlanet dropped the ball on redundant DNS. They had all the EV1 nameservers at that DC which is completely ridiculous...
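    A quick sanity check anyone can run against their own setup: do all of a domain's nameservers resolve into the same network block? Below is a minimal stdlib-Python sketch; the hostnames are placeholders, and "same /24" is only a rough proxy for "probably the same facility", not proof.

        # Rough single-point-of-failure check: do all the listed nameservers
        # resolve into the same /24? Hostnames below are placeholders.
        import socket

        nameservers = ["ns1.example-host.com", "ns2.example-host.com"]

        prefixes = set()
        for host in nameservers:
            try:
                ip = socket.gethostbyname(host)
            except socket.gaierror:
                print("could not resolve " + host)
                continue
            print(host + " -> " + ip)
            prefixes.add(".".join(ip.split(".")[:3]))  # first three octets = the /24

        if len(prefixes) <= 1:
            print("WARNING: every nameserver sits in one /24 -- likely one facility")
        else:
            print("nameservers span %d distinct /24s" % len(prefixes))

    A shared /24 doesn't prove shared power or cooling, but it's a cheap first test before paying for secondary DNS in another city.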
  • My servers dropped off the net yesterday afternoon, and if all goes well they'll be up and running late tonight. At 1700PST they're supposed to do a power test, then start bringing up the environmentals, the switching gear, and blocks of servers.

    My thoughts as a customer of theirs:

    1. Good updates. Not as frequent or clear as I'd like, but mostly they didn't have much to add.

    2. Anyone bitching about the thousands of dollars per hour they're losing has no credibility with me. If your junk is that important, your hot standby server should be in another data center.

    3. This is a very rare event, and I will not be pulling out of what has been an excellent relationship so far with them.

    4. I am adding a failover server in another data center (their Dallas facility). I'd planned this already but got caught being too slow this time.

    5. Because of the incident, I will probably make the new Dallas server the primary and the existing Houston one the backup. This is because I think there will be long term stability issues in this Houston data center for months to come. I know what concrete, drywall, and fire extinguisher dust does to servers. I also know they'll have a lot of work in reconstruction ahead, and that can lead to other issues.

    For now, I'll wait it out. I've heard of this cool place called "outside". Maybe I'll check it out.
  • by fm6 ( 162816 ) on Sunday June 01, 2008 @05:53PM (#23620489) Homepage Journal

    This goes to show that no matter how much planning you do, Murphy's Law still applies.
    I am so tired of hearing that copout. Does the submitter know for a fact that ThePlanet did everything it could to keep its power system from exploding? I don't have any evidence one way or the other, but if they're anything like other independent data center operators, it's pretty unlikely.

    The lesson you should be taking from Murphy's Law is not "Shit Happens". The lesson you should be taking is that you can't assume that an unlikely problem (or one you can con yourself into thinking unlikely) is one you can ignore. It's only after you've prepared for every reasonable contingency that you're allowed to say "Shit Happens".
  • by kyoorius ( 16808 ) on Sunday June 01, 2008 @06:20PM (#23620695) Homepage
    There's no reason to use the forum software when they've locked the thread and are only using it to disseminate information. A Pentium 1 running lighttpd and serving a static HTML page would be sufficient to handle the flood of requests.
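    To put a number on how little that takes: a locked announcement thread is effectively one small static file. As a purely illustrative sketch (not what ThePlanet actually runs, and not a production recommendation; lighttpd or nginx pointed at the same directory would do the real job), even Python's bundled HTTP server shows how little machinery is needed:

        # Serve a static status page out of ./status -- no forum software,
        # no database, nothing dynamic to fall over under load.
        from functools import partial
        from http.server import HTTPServer, SimpleHTTPRequestHandler

        handler = partial(SimpleHTTPRequestHandler, directory="status")
        HTTPServer(("0.0.0.0", 8080), handler).serve_forever()

    In practice you'd put a proper static server or a CDN (as the Coral link in the summary does) in front, but the point stands: a status page has no business depending on forum software and a database.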

  • by clare-ents ( 153285 ) on Sunday June 01, 2008 @06:38PM (#23620829) Homepage
    SLA is not a substitute for business insurance.

    If your business loses $1000/minute while it's offline, get a quote for insurance that pays out $1000/minute while you're offline. Alternatively, if you're happy self-insuring, take the loss when it happens.

    It's almost as if people believe that SLAs are a form of service guarantee instead of a free, very bad insurance deal.
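    To make the gap concrete, here's the arithmetic for a 24-hour outage using the parent's hypothetical $1000/minute loss and a made-up $300/month hosting bill; neither figure comes from ThePlanet's actual terms.

        # SLA credit vs. real loss for a 24-hour outage. Both dollar figures
        # are illustrative assumptions, not anyone's real contract numbers.
        outage_minutes = 24 * 60
        business_loss = 1000 * outage_minutes   # the parent's $1000/minute figure
        monthly_fee = 300                       # assumed hosting bill
        sla_credit = monthly_fee / 30           # a typical one-day pro-rated credit

        print(f"business loss: ${business_loss:,}")   # $1,440,000
        print(f"SLA credit:    ${sla_credit:.2f}")    # $10.00

    A ten-dollar credit against a seven-figure loss is exactly why an SLA is not insurance.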
  • by Anonymous Coward on Sunday June 01, 2008 @07:12PM (#23621089)
    Yes, the updates are better now. However, in the hours immediately after the servers went down, communication was terrible.

    Their website had no indication that there was a problem, and no one was responding when I called their 1-866 customer service number. After I'd waited for half an hour, the number was disconnected and couldn't be reached at all.

    They're better now, but the ETA they gave for having things working by "mid-afternoon" Sunday looks unlikely now, and in the meantime, my business is hemorrhaging users...

    They've had serious problems since I joined a few months ago, and always with absolutely no communication (I have to call their customer service to learn that all packets to their data center are being lost, etc). I am definitely backing up my stuff and switching providers the moment they come back online.
  • by sciencewhiz ( 448595 ) on Sunday June 01, 2008 @07:17PM (#23621141)
    They are not running backup power because the fire department told them not to, not because it doesn't exist.
  • First, that time was an estimate -- a target. Second, even if the initial power test passes, it will take hours to bring up the a/c systems, the switches, and the routers.

    The initial draw from each new bank of gear to be given power will be very high, so they will need to go slowly.

    The battery systems (be they on each rack or in large banks serving whole blocks) will try to charge all at once. If they're not careful, that'll heat those new power lines up like the filaments in a toaster. Remember, the battery plan they have was built with the idea that they'd be used very briefly during transition to generator power -- not drained down all at once.

    Only once all the switches and routing gear is back up can they start updating the network paths (do they use BGP for this -- that's not my area of expertise) so that peering data starts flowing.

    Only once the network is all up and stable (no small task on a site with dozens of high end peering points) can they even start doing banks of servers.

    It's also probable that each bank of servers will need its own new power lines (and eventually replaced conduit) in the distribution center that was destroyed.

    Bank by bank they'll have to bring up all these servers, each of which will draw its maximum load during boot as disks are scanned and checked.

    Most of these servers probably haven't been shut down in months or years. Some drives may not spin up because of tired motors that run fine once spinning, but for which starting from cold is just too much now. Other servers may have boot configuration problems that went undiscovered because the machines have been running without a reboot for a long time -- Linux ones, anyway :-)

    This isn't something out of Young Frankenstein where they'll yell across the room "throw za main svitch!" and watch the lights dim briefly while 9,000 servers boot up with the deafening sound of system beeps. If they did try such a thing -- as if such a thing were possible -- it would immediately blow at least another transformer, if not more.

    Think about it. 9,000 servers at an average of, what, 300 watts, plus the networking gear, plus the air conditioning, plus charging all those batteries... you're talking megawatts (rough numbers are sketched below).

    Without a Mr. Fusion, or Harry Mudd stumbling in with some chicks wearing dilithium crystal jewelry, this is going to take a while.
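    Spelling out that back-of-envelope estimate (the 300 W per server is the parent's guess, and the cooling/distribution overhead factor is a generic rule-of-thumb assumption, not a measured figure for this site):

        # Rough load estimate built from the parent's guessed numbers.
        servers = 9000
        watts_per_server = 300                   # guessed average draw per box
        it_load_w = servers * watts_per_server   # server load alone, in watts

        pue = 1.7   # assumed overhead multiplier for cooling and distribution
        total_w = it_load_w * pue

        print(f"server load: {it_load_w / 1e6:.1f} MW")                          # 2.7 MW
        print(f"with cooling/distribution (PUE {pue}): {total_w / 1e6:.1f} MW")  # ~4.6 MW

    Several megawatts coming back online bank by bank, through distribution gear that has just been rebuilt, is why "throw the main switch" is not an option.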
  • by aronschatz ( 570456 ) on Sunday June 01, 2008 @08:28PM (#23621645) Homepage
    Yeah, because everyone can afford redundancy like you can.

    Most people own a single server that they back up in case it crashes, OR have two servers in the same datacenter in case one fails.

    I don't know how you can easily do offsite switch over without a huge infrastructure to support it which most people don't have the time and money to do.

    Get off your high horse.
    Look, when I go into a building in gear, carrying an axe and an extinguisher, breathing bottled air, and wading through toxic smoke, I couldn't give crap number one about your 100 sites being down.

    I have a crew to protect. In this case, I'm going into an extremely hazardous environment. There has already been one explosion. I don't know what I'm going to see when I get there, but I do know that this place is wall-to-wall danger. Wires everywhere to get tangled in when it's dark and I'm crawling through the smoke. Huge amounts of current. Toxic batteries everywhere that may or may not be stable. Wiring that may or may not be exposed.

    If it's me in charge, and it's my crew making entry, the power is going off. It's getting a lock-out tag on it. If you won't turn it off, I will. If I do it, you won't be turning it on so easily. If need be, I will have the police haul you away in cuffs if you try to stop me.

    My job, as a firefighter -- as a fire officer -- is to ensure the safety of the general public, of my crew, and then if possible of the property.

    NOW -- as a network guy and software developer -- I can say that if you're too short-sighted or cheap to spring for a secondary DNS server at another facility, or if your servers are so critical to your livelihood that losing them for a couple of days will kill you but you haven't bothered to go with hot spares at another data center, then you, sir, are an idiot.

    At any data center - anywhere - anything can happen at any time. The f'ing ground could open up and swallow your data center. Terrorists could target it because the guy in the rack next to yours is posting cartoon photos of their most sacred religious icons. Monkeys could fly out of the site admin's [nose] and shut down all the servers. Whatever. If it's critical, you have off-site failover. If not, you're not very good at what you do.

    End of rant.
  • by CFD339 ( 795926 ) <andrewp@thenortI ... inus threevowels> on Sunday June 01, 2008 @10:10PM (#23622277) Homepage Journal
    Redundant power they have. Redundant power distribution grids they do not. This is common. The certification level for fully redundant distribution is (I think) called 2N, where they only claim N+1 -- which I understand means failover power. It's more than enough 99.9% of the time. To have FULLY redundant power plus distribution from the main grid all the way into the building, through the walls, and to every rack is ridiculously more expensive. At that point, it is more sensible to buy another server at another facility for failover than to spend what it would cost to host a server with that kind of power redundancy -- on top of which, the server itself could still blow up, and then where are you?
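    A crude way to see the trade-off described above: compare the yearly outage risk under N+1 versus 2N power when the server itself can also fail. The probabilities below are invented for illustration, not reliability data for any real facility.

        # Illustrative outage-risk comparison with made-up annual probabilities.
        p_path = 0.01      # assumed chance a single power-distribution path fails
        p_server = 0.05    # assumed chance the server hardware itself fails

        p_power_n1 = p_path        # N+1: still one distribution path to the rack
        p_power_2n = p_path ** 2   # 2N: both independent paths must fail

        risk_n1 = 1 - (1 - p_power_n1) * (1 - p_server)
        risk_2n = 1 - (1 - p_power_2n) * (1 - p_server)

        print(f"N+1 power: ~{risk_n1:.1%} yearly outage risk")
        print(f"2N power:  ~{risk_2n:.1%} yearly outage risk")

    With numbers like these, the hardware itself dominates the risk either way, which is the parent's point: a second server in another facility buys more than gold-plated power distribution.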

    You, sir, don't know what you're talking about. Reaching for ridiculous examples of someone doing their job wrong doesn't change that.

    Our S.O.G. (standard operating guidelines) are actually very specific about risk.

    We will risk our lives to save a human life.
    We will take reasonable risk to save the lives of pets and livestock.
    We will take minimal risks to save property.

    Sorry, but your building isn't worth the risk of my crew. That's reality.

    Don't you DARE tell me what is and isn't bravery or cowardly until you put 50 pounds of gear on and crawl into a pitch black house that's burning over your head.

    Don't you DARE tell me that you think you understand the difference between saving the blonde girl and saving your computer server.

    This isn't TV World. This is the real world. Fire on TV doesn't look like real fire. You know why? Because a real house on fire doesn't look like anything but pitch black and that makes for lousy TV.

    Get over yourself and go volunteer at your local fire department. 86% of the men and women in this country who will risk their lives for yours are volunteers. We could use your help if you have the guts for it. We'll teach you what you need to know -- and we'll keep you as safe as we can so you can go home to your family when it's done.

    Your examples are stupid and insulting to the 800,000 brave men and women who volunteer to risk death in the most painful way possible to save your sorry butt.
  • Yep (Score:4, Insightful)

    by Sycraft-fu ( 314770 ) on Monday June 02, 2008 @02:50AM (#23623877)
    For example, someone like Newegg.com probably has a redundant data centre, the reason being that if their site is down, their income drops to zero. Even if they had the phone techs to take the orders, nobody knows their phone number, and since the site is down, you can't look it up. However, someone like Rotel.com probably doesn't. If their site is down it's inconvenient, and it might cost them some sales from people who can't research their products online, but ultimately it isn't a big deal even if it's gone for a couple of days. Thus it isn't so likely they'd spend the money on being in different data centres.

    You are also right on in terms of the type of failure. I've been in the whole computer support business for quite a while now, and I have a lot of friends who do the same thing. I don't know that I could count the number of servers that I've seen die. I wouldn't call it a common occurrence, but it happens often enough that it is a real concern, and thus important servers tend to have backups. However, I've never heard of a data centre being taken out (I mean from someone I know personally; I've seen it on the news). Even when a UPS blew up in the university's main data centre, it didn't end up having to go down.

    I'm willing to bet that if you were able to get statistics on the whole of the US, you'd find my little sample is quite true. There'd be a lot of cases of servers dying, but very, very few of whole data centres going down, and then usually only because of things like hurricanes or the 9/11 attacks. Thus, a backup server makes sense, however unless it is really important a backup data centre may not.

"Life begins when you can spend your spare time programming instead of watching television." -- Cal Keegan

Working...