Explosion At ThePlanet Datacenter Drops 9,000 Servers 431

An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • by gardyloo ( 512791 ) on Sunday June 01, 2008 @02:11PM (#23618755)
    9000::7500?

    So I guess a "customer" in this case is a company or business, not an individual? Unless many of the individuals have several servers each.
  • Murphy's Law (Score:1, Interesting)

    by Anonymous Coward on Sunday June 01, 2008 @02:20PM (#23618813)

    While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies."

    And then they put it on the front page of Slashdot.

Xeon, my children, just don't belong in some places [crash.com].

    (About the only thing missing from this real-world version of the story is a YouTube video of a halon fire suppression system going off. Damn ozone-protection regs :)

  • by Zymergy ( 803632 ) * on Sunday June 01, 2008 @02:30PM (#23618901)
    I am wondering what UPS/Generator Hardware was in use?
    Where would the "failure" (Short/Electrical Explosion) have to be to cause everything to go dark?
    Sounds like the power distribution circuits downstream of the UPS/Generator were damaged.

Whatever vendor provided the now-vaporized components is likely praying that the specifics are not mentioned here.

    I recall something about Lithium Batteries exploding in Telecom DSLAMs... I wonder if their UPS system used Lithium Ion cells?
    http://www.lightreading.com/document.asp?doc_id=109923 [lightreading.com]
    http://tech.slashdot.org/article.pl?sid=07/08/25/1145216 [slashdot.org]
    http://hardware.slashdot.org/article.pl?sid=07/09/06/0431237 [slashdot.org]
  • by imipak ( 254310 ) on Sunday June 01, 2008 @02:45PM (#23619041) Journal
    Little-known fact: The Planet were the first ever retail ISP offering Internet access to the general public - from 1989. Hmmm, so the longest-established ISP in the world is not only working hard to get that DC back online, they're also posting pretty open summaries of the state of play... coincidence? I don't think so.
  • Re:Recovery costs (Score:3, Interesting)

    by Yetihehe ( 971185 ) on Sunday June 01, 2008 @03:19PM (#23619327)
    So maybe it would make more sense to just skip their insurance?
  • by JoeShmoe ( 90109 ) <askjoeshmoe@hotmail.com> on Sunday June 01, 2008 @03:45PM (#23619515)

    Everyone loves firemen, right? Not me. While the guys you see in the movies running into burning buildings might be heroes, the real-world firemen (or more specifically fire chiefs) are capricious, arbitrary, ignorant little rulers of their own personal fiefdoms. Did you know that if you are getting an inspection from your local fire chief and he commands something, there is no appeal? His word is law, no matter how STUPID or IGNORANT. I'll give you some examples later.

    I'm one of the affected customers. I have about 100 domains down right now because both my nameservers were hosted at the facility, as is the control panel that I would use to change the nameserver IPs. Whoops. So I learned why I obviously need to have NS3 and NS4 and spread them around, because even though the servers are spread everywhere, without my nameservers none of them currently resolve.

    It sounds like the facility was ordered to cut ALL power because of some fire chief's misguided fear that power flows backwards from a low-voltage source to a high-voltage one. I admit I don't know much about the engineering of this data center, but I'm pretty sure the "Y" junction where AC and generator power come together is going to be as close to the rack power as possible to avoid lossy transformation. It makes no sense for them to have 220 or 400 VAC generators running through the same high-voltage transformer when it would be far more efficient to feed 120 V or even 12 V DC (if only servers would accept that). But I admit I could be wrong, and if it is a legit safety issue... then it's apparently a single point of failure for every data center out there, because ThePlanet charged enough that they don't need to cut corners.

    Here's a couple of times that I've had my hackles raised by some fireman with no knowledge of technology. The first was when we switched alarm companies and required a fire inspector to come and sign off on the newly installed system. The inspector said we needed to shut down power for 24 hours to verify that the fire alarm would still work after that period of time (a code requirement). No problem, we said, reaching for the breaker for that circuit.

    No no, he said. ALL POWER. That meant the entire office complex, some 20-30 businesses, would need to be without power for an entire day so that this fing idiot could be sure that we weren't cheating by sneaking supplementary power from another source.

    WHAT THE FRACK

    We ended up having to rent generators and park them outside to keep our racks and critical systems running, and then renting a conference room to relocate employees. We went all the way to the county commissioners pointing out how absolutely stupid this was (not to mention, who the HELL is still going to be in a burning building 24 hours after the alarm's gone off), but we were told that there was no override possible.

    The second time was at a different place when we installed a CO alarm as required for commercial property. Well, the inspector came and said we need to test it. OK, we said, pressing the test button. No no, he said, we need to spray it with carbon monoxide.

    Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem, but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier, we finally called the manufacturer of the device, who pointed out that the device was void the second it was exposed to CO because the sensor was not reusable. In other words, when the sensor was tripped, it was time to buy a new monitor. You can see the recursive loop that would have developed if we actually had tested the device and then promptly had to replace it and get the new one retested by this idiot.

    So finally we got a letter from the manufacturer that pointed out the device was UL certified and that pressing the test button WAS the way you tested the device. It took four weeks of arguing before he finally found an excuse that let him save face.
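[Editor's note: the nameserver lesson in the comment above (don't let all your NS records resolve to one facility) can be sketched as a quick diversity check. This is a hypothetical standalone script, not anything ThePlanet or the commenter used; the IP addresses and the /16 heuristic are illustrative assumptions.]

```python
# Sketch: warn when all nameserver IPs share a network prefix,
# which often means they sit in the same facility or provider.
# Hypothetical data; in practice you'd first resolve your NS records.
from ipaddress import ip_network

def shared_prefix(ips, prefixlen=16):
    """Return True if every IP falls inside the same /prefixlen network."""
    nets = {ip_network(f"{ip}/{prefixlen}", strict=False) for ip in ips}
    return len(nets) == 1

# Both nameservers in one provider's range: single point of failure.
print(shared_prefix(["70.85.0.1", "70.85.1.2"]))       # True
# Spread across providers: survives one datacenter going dark.
print(shared_prefix(["70.85.0.1", "208.67.222.222"]))  # False
```

A /16 match is only a rough proxy for "same building"; a real check would compare ASNs or provider metadata, but the principle is the one the commenter learned the hard way.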
  • by Zebra_X ( 13249 ) on Sunday June 01, 2008 @03:49PM (#23619539)
    "The power is off due to the explosion but the servers themselves are A-OK."

    Physically OK, maybe... let's see how many of them come back up when the power is restored ^ ^
  • by Animats ( 122034 ) on Sunday June 01, 2008 @04:09PM (#23619667) Homepage

    YouTube's home page is returning "Service unavailable". Is this related? (Google Video is up.)

  • by xaxa ( 988988 ) on Sunday June 01, 2008 @04:10PM (#23619669)
    I was very impressed that a new bridge that was being extended [blogspot.com] over a busy railway line [blogspot.com] didn't cause any damage when they dropped it [blogspot.com]. (They were lucky no trains were going under the bridge at the time; it's a very busy railway line, with about 40 trains [livedepart...ards.co.uk] in the next hour on a Sunday night, so you can imagine what it's like on a weekday. It did cause massive disruption, as they closed the line. And I don't know why they didn't have backup jacks, if the failure of one left the bridge unsupported.)

    I know it's not really relevant, but I didn't realise I was so interested in construction/engineering before reading about the past year's worth of posts on that blog (well, the construction ones. Not the "I was first on the new train!" ones. Though I admire the guy's dedication, to be awake at 4.00 to get the first ever train from the new Heathrow Airport station or whatever).
  • by jd ( 1658 ) <imipak@yahoGINSBERGo.com minus poet> on Sunday June 01, 2008 @04:31PM (#23619825) Homepage Journal
    *wonders how many remember the live incident at the BBC, many years ago, when the Grandstand teleprinter stopped displaying match results and started printing updates on a fire running through the building.
  • by njcoder ( 657816 ) on Sunday June 01, 2008 @06:05PM (#23620571)

    Part of my point that you apparently missed was that even a full 2N power system end-to-end doesn't guarantee uptime. There are very few - and I'd even go so far as to say "if any" - datacenters in the world that could handle an explosion / fire without going down.
    The DC didn't explode, just the power room. It seems there was just one power room. I've been to data centers around here, even small ones, that have two power rooms.

    While it may be the fire dept that is erroneously preventing them from bringing up their back-up power, it's part of a poor disaster recovery plan to not engage with the fire dept, electric co, etc. before a disaster happens, so that everyone is on-board with your disaster recovery plans and that you have the ability to implement that plan.

    The explosion was isolated to the power room. The servers are fine, the backup generators and batteries are fine. The servers should have been back online if they had a good disaster recovery plan. The whole point of disaster recovery is being able to handle a disaster. You can't say "oh there was a disaster, you can't help that". This is exactly what their plan should have been able to handle. The power room goes offline. It shouldn't matter if it was because of an explosion, a fire, equipment failure or being beamed into outer space.

    It also shouldn't matter who is telling them to keep the power off. Part of the disaster recovery plan should have been making sure local authorities allowed them to carry it out. Fine, they have to shut off all power when firemen are in there with hoses. I understand that. But once the fire is out your plan should allow you to bring up backup power. It didn't. So I don't see how they can call themselves a "World Class Data Center". Part of what they sell and what customers expect is disaster recovery. And there are data centers that can provide this.

    ThePlanet is pretty cheap compared to datacenters like NAC that have more redundancy and security. But ThePlanet wants to advertise that they are just as good. Now they were caught with their pants down when there was actually a disaster and their disaster recovery plan failed.
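[Editor's note: the 2N-power argument above can be made concrete with a toy availability calculation. The failure probabilities below are made up for illustration, and the model assumes the rooms fail independently, which a shared explosion obviously violates.]

```python
# Toy availability math: if one power room has probability p of a
# catastrophic failure in a given year, then n fully independent
# rooms all fail together with probability p**n.
def combined_failure(p, rooms=2):
    """Probability that all rooms fail, assuming independence."""
    return p ** rooms

p = 0.01  # hypothetical 1%-per-year chance of losing a power room
print(combined_failure(p, rooms=1))  # single room: 1% per year
print(combined_failure(p, rooms=2))  # 2N power: about 0.01% per year
```

The caveat is the whole story here: one transformer explosion that takes out the only power room is a common-mode failure, and no amount of downstream redundancy recovers the independence the formula assumes.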
  • by CptNerd ( 455084 ) <adiseker@lexonia.net> on Sunday June 01, 2008 @08:37PM (#23621693) Homepage
    From what they were saying (I'm a customer, with both servers in that datacenter) it was a high-voltage transformer, so it might very well have been one that size. They did say it was much larger than the kind on power poles, but gave no indication of exactly how much load it was handling. This is probably one of those times when architecture and esthetics took priority over safety when the building was built. I would have thought a transformer as large as what blew up would be outside the building proper. At any rate, it's a major fustercluck that's going to take time to fix.

    Maybe in the post-mortem, someone will figure out it's time to start looking at ways to use less power, maybe switching to servers that use the lower-power CPUs that are coming out, so that the very high power infrastructure isn't as necessary. I have a feeling there'll be a "fire sale" on server subscriptions once a lot of customers leave (I'm not one of them, but I will likely swap one of mine for another at another location, much much later).
  • by LostCluster ( 625375 ) * on Sunday June 01, 2008 @09:44PM (#23622083)
    RackShack was also the company with the "screwdriver incident", where a tech working in the power room dropped the tool into a UPS and shorted out the facility. No customer data was lost, but the power outage caused them to be offline for more than a day.
