Explosion At ThePlanet Datacenter Drops 9,000 Servers 431
An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
Server/customer ratio? (Score:2, Interesting)
So I guess a "customer" in this case is a company or business, not an individual? Unless many of the individuals have several servers each.
Murphy's Law (Score:1, Interesting)
And then they put it on the front page of Slashdot.
It was Sunday, June 1, 2008. Xeon, my children, just don't belong in some places [crash.com].
(About the only thing missing from this real-world version of the story is a YouTube video of a halon fire suppression system going off. Damn ozone-protection regs :)
Lithium Batteries in their UPS setup?? (Score:3, Interesting)
Where would the "failure" (Short/Electrical Explosion) have to be to cause everything to go dark?
Sounds like the power distribution circuits downstream of the UPS/Generator were damaged.
Whatever vendor provided the now-vaporized components is likely praying that the specifics are not mentioned here.
I recall something about Lithium Batteries exploding in Telecom DSLAMs... I wonder if their UPS system used Lithium Ion cells?
http://www.lightreading.com/document.asp?doc_id=109923 [lightreading.com]
http://tech.slashdot.org/article.pl?sid=07/08/25/1145216 [slashdot.org]
http://hardware.slashdot.org/article.pl?sid=07/09/06/0431237 [slashdot.org]
Re:Kudo to their support team (Score:3, Interesting)
Re:Recovery costs (Score:3, Interesting)
Ignorant firemen = single point-of-failure (Score:4, Interesting)
Everyone loves firemen, right? Not me. While the guys you see in the movies running into burning buildings might be heroes, real-world firemen (or more specifically fire chiefs) are capricious, arbitrary, ignorant little rulers of their own personal fiefdoms. Did you know that if you are getting an inspection from your local fire chief and he commands something, there is no appeal? His word is law, no matter how STUPID or IGNORANT. I'll give you some examples later.
I'm one of the affected customers. I have about 100 domains down right now because both my nameservers were hosted at the facility, as is the control panel that I would use to change the nameserver IPs. Whoops. So I learned why I obviously need to have NS3 and NS4 and spread them around: even though the servers are spread everywhere, without my nameservers none of them currently resolve.
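One way to catch this setup mistake before an outage does: check whether all your nameserver IPs sit in the same address block, which usually means the same provider and facility. A minimal sketch (the function name and IPs are hypothetical, not anything ThePlanet provides):

```python
# Rough nameserver-diversity check: if every NS IP shares the same
# leading prefix, they probably live in the same facility.
from ipaddress import ip_address

def same_prefix(ips, prefix_bits=16):
    """Return True if all IPv4 addresses share the same leading
    prefix_bits -- a crude proxy for 'same provider/facility'."""
    nets = {int(ip_address(ip)) >> (32 - prefix_bits) for ip in ips}
    return len(nets) == 1

# Both nameservers in one /16: a single-facility outage kills resolution.
print(same_prefix(["70.85.1.10", "70.85.2.20"]))      # True -> bad
# Spread across unrelated blocks: at least one NS should survive.
print(same_prefix(["70.85.1.10", "198.51.100.5"]))    # False -> good
```

This won't catch two providers that happen to share an upstream, but it would have flagged the "both nameservers in one building" mistake above.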
It sounds like the facility was ordered to cut ALL power because of some fire chief's misguided fear that power flows backwards from a low-voltage source to a high-voltage one. I admit I don't know much about the engineering of this data center, but I'm pretty sure the "Y" junction where utility AC and generator power come together is going to be as close to the rack power as possible to avoid lossy transformation. It makes no sense why they would run 220 or 400 VAC generators through the same high-voltage transformer when it would be far more efficient to feed in at 120 V, or even 12 VDC (if only servers would accept that). But I admit I could be wrong, and if it is a legit safety issue... then it's apparently a single point of failure for every data center out there, because ThePlanet charged enough that they don't need to cut corners.
Here's a couple of times that I've had my hackles raised by some fireman with no knowledge of technology. The first was when we switched alarm companies and required a fire inspector to come and sign off on the newly installed system. The inspector said we needed to shut down power for 24 hours to verify that the fire alarm would still work after that period of time (a code requirement). No problem, we said, reaching for the breaker for that circuit.
No no, he said. ALL POWER. That meant the entire office complex, some 20-30 businesses, would need to be without power for an entire day so that this f'ing idiot could be sure that we weren't cheating by sneaking supplementary power from another source.
WHAT THE FRACK
We ended up having to rent generators and park them outside to keep our racks and critical systems running, and then renting a conference room to relocate employees. We went all the way to the county commissioners pointing out how absolutely stupid this was (not to mention, who the HELL is still going to be in a burning building 24 hours after the alarm's gone off), but we were told that there was no override possible.
The second time was at a different place when we installed a CO alarm as required for commercial property. Well, the inspector came and said we need to test it. OK, we said, pressing the test button. No no, he said, we need to spray it with carbon monoxide.
Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem, but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier, we finally called the manufacturer of the device, who pointed out that the device was ruined the second it was exposed to CO because the sensor was not reusable. In other words, once the sensor was tripped, it was time to buy a new monitor. You can see the recursive loop that would have developed if we had actually tested the device and then promptly had to replace it and get the new one retested by this idiot.
So finally we got a letter from the manufacturer that pointed out the device was UL certified and that pressing the test button WAS the way you tested the device. It took four weeks of arguing before he finally found an excuse that let him save face.
Re:More planning could have prevented this (Score:4, Interesting)
Physically OK maybe... let's see how many of them come back up when the power is restored ^ ^
Is this why YouTube is down? (Score:3, Interesting)
YouTube's home page is returning "Service unavailable". Is this related? (Google Video is up.)
Re:Photos or informaton on building? (Score:3, Interesting)
I know it's not really relevant, but I didn't realise I was so interested in construction/engineering before reading about the past year's worth of posts on that blog (well, the construction ones. Not the "I was first on the new train!" ones. Though I admire the guy's dedication, to be awake at 4.00 to get the first ever train from the new Heathrow Airport station or whatever).
Re:Explosion? (Score:5, Interesting)
Re:Printer ignition source (Score:5, Interesting)
Re:More planning could have prevented this (Score:4, Interesting)
While it may be the fire dept that is erroneously preventing them from bringing up their backup power, it's part of a poor disaster recovery plan not to engage with the fire dept, electric co., etc. before a disaster happens, so that everyone is on board with your recovery plans and you actually have the ability to implement them.
The explosion was isolated to the power room. The servers are fine, the backup generators and batteries are fine. The servers should have been back online if they had a good disaster recovery plan. The whole point of disaster recovery is being able to handle a disaster. You can't say "oh there was a disaster, you can't help that". This is exactly what their plan should have been able to handle. The power room goes offline. It shouldn't matter if it was because of an explosion, a fire, equipment failure or being beamed into outer space.
It also shouldn't matter who is telling them to keep the power off. Part of the disaster recovery plan should have been making sure local authorities allowed them to carry it out. Fine, they have to shut off all power when firemen are in there with hoses. I understand that. But once the fire is out your plan should allow you to bring up backup power. It didn't. So I don't see how they can call themselves a "World Class Data Center". Part of what they sell and what customers expect is disaster recovery. And there are data centers that can provide this.
ThePlanet is pretty cheap compared to datacenters like NAC that have more redundancy and security. But ThePlanet wants to advertise that they are just as good. Now they were caught with their pants down when there was actually a disaster and their disaster recovery plan failed.
Re:What does a server room (Score:3, Interesting)
Maybe in the post-mortem, someone will figure out it's time to start looking at ways to use less power, maybe switching to servers that use the lower-power CPUs that are coming out, so that the very high power infrastructure isn't as necessary. I have a feeling there'll be a "fire sale" on server subscriptions once a lot of customers leave (I'm not one of them, but I will likely swap one of mine for another at another location, much much later).
Re:Who's hosted on ThePlanet? (Score:3, Interesting)