Explosion At ThePlanet Datacenter Drops 9,000 Servers
An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
Server/customer ratio? (Score:2, Interesting)
So I guess a "customer" in this case is a company or business, not an individual? Unless many of the individuals have several servers each.
Re: (Score:3, Insightful)
Re:Server/customer ratio? (Score:5, Insightful)
Re: (Score:3, Informative)
Re: (Score:3, Insightful)
Yep (Score:4, Insightful)
You are also right on in terms of the type of failure. I've been in the computer support business for quite a while now, and I have a lot of friends who do the same thing. I don't know that I could count the number of servers I've seen die. I wouldn't call it a common occurrence, but it happens often enough that it is a real concern, and thus important servers tend to have backups. However, I've never heard of a data centre being taken out from someone I know personally (I've only seen it on the news). Even when a UPS blew up in the university's main data centre, the centre didn't end up having to go down.
I'm willing to bet that if you were able to get statistics for the whole of the US, you'd find my little sample is quite representative. There'd be a lot of cases of servers dying, but very, very few of whole data centres going down, and then usually only because of things like hurricanes or the 9/11 attacks. Thus a backup server makes sense; a backup data centre, however, may not unless the system is really important.
Re: (Score:3, Informative)
http://lists.debian.org/debian-devel/2002/11/msg01926.html [debian.org]
Re: (Score:3, Insightful)
Re:Server/customer ratio? (Score:5, Insightful)
a bit wrong (Score:3, Insightful)
9 Volts of Love (Score:5, Funny)
Re: (Score:3, Funny)
It must have been HACKERS (Score:5, Funny)
Kudo to their support team (Score:5, Insightful)
Re: (Score:3, Interesting)
Re:Kudo to their support team (Score:5, Informative)
Re:Kudo to their support team (Score:4, Funny)
Re:Kudo to their support team (Score:5, Insightful)
Re:Kudo to their support team (Score:4, Insightful)
It does not sound like the type of company that thinks of its customers as an enemy, as your message implies.
Re: (Score:3, Funny)
"Hello, this is the Planet, our servers are down for the moment but we're working on it, thank you for your comprehension... Oh no, Smith is on fire ! Someone get him !!! *click*"
Update 11:14 PM CST (Score:4, Informative)
explosion? (Score:5, Funny)
Re:explosion? (Score:5, Funny)
Re:explosion? (Score:4, Funny)
Printer ignition source (Score:5, Funny)
lp0 printer on fire!
Re:Printer ignition source (Score:5, Interesting)
Re: (Score:3, Informative)
In short, most unix printing systems understand a very small number of printer status codes, usually consisting of "READY, ONLINE, OFFLINE, and PRINTER ON FIRE"
The latter status message was actually semi-serious, and was thrown whenever the printer was encountering a serious error, but for some reason was continuing to print anyway. In the case of a high-speed mainframe printer, if the printer jammed but con
trying to read it (Score:5, Funny)
Re:trying to read it (Score:5, Funny)
UMM.. USE STATIC PAGE?? (Score:5, Insightful)
Recovery costs (Score:5, Funny)
Re: (Score:2)
Re:Recovery costs (Score:5, Insightful)
Also, only the electrical equipment (and structural stuff) was damaged - networking and customer servers are intact (but without power, obviously).
Re: (Score:2)
Re: (Score:3, Interesting)
No, not necessarily (Score:5, Informative)
Well, with building insurance that's not the case. You aren't really a significant risk factor. Risk is instead calculated from things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for), etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.
Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout over time and come up with a rate. If something changes the risk, the rate will change as well; if not, it doesn't. It isn't as though your one single payout is of any significance to their overall operation.
Also, the idea of "Just pay for it yourself" is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example, I have insurance on my house. The reason is that if I lost it, I couldn't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is betting, of course, that it isn't likely to happen and they get to keep the money.
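To put the pricing idea above in rough numbers, here's a toy sketch in Python. Every figure below is invented for illustration, not an actual quote:

    # Premium is roughly expected annual payout plus the insurer's overhead.
    p_total_loss_per_year = 0.002   # assumed 1-in-500 chance of losing the house
    replacement_cost = 250_000      # assumed rebuild cost in dollars
    overhead_and_margin = 1.4       # assumed loading for expenses and profit

    expected_payout = p_total_loss_per_year * replacement_cost   # $500
    annual_premium = expected_payout * overhead_and_margin       # $700
    print(f"expected payout/yr: ${expected_payout:,.0f}")
    print(f"rough premium/yr:   ${annual_premium:,.0f}")

The point being: you pay a little more than the expected loss every year precisely because you can't absorb the full loss in the one year it actually happens.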
Re: (Score:3, Funny)
Re: (Score:3, Funny)
Dollies - $500, Truck rentals - $5000, Labour - $10000, Sending internets on trucks - Priceless
This is BAD KARMA!! (Score:5, Funny)
What does a server room (Score:2, Funny)
Re:What does a server room (Score:4, Insightful)
Re: (Score:3, Interesting)
Kevin Hazard? (Score:3, Funny)
Helpful Slashdot! (Score:5, Funny)
and as of this posting, make that 152,476.
Re: (Score:2)
Photos or informaton on building? (Score:3, Insightful)
Being in the power systems engineering biz, I'd be interested in some more information on the type of building (age, original occupancy type, etc.) involved.
To date, I've seen a number of data center power problems, from fires to isolated, dual-source systems that turned out not to be. It raises the question of how well the engineering was done for the original facility, or for the refit of an existing one. Or whether proper maintenance was carried out.
From TFA:
Re: (Score:2)
I'm sure at one point it was well designed... but that was, I'm guessing, a few years ago, at a lot lower current, and more than a few modifications ago.
That's of course not counting the possibility of contractor stupidity.
I don't know what makes people so freaking stupid when it comes to electricity. But then I'm still annoyed by a roofing contractor having two employees' lives
Re: (Score:2)
Re: (Score:3, Insightful)
The other possibility was that a tie was closed and the breakers over-dutied and could not clear the fault.
Odd that nobody was hurt though; spontaneous shorts are very rare-- most involve either switching or work in live boards, either of which would kill someone.
Re:Photos or informaton on building? (Score:5, Informative)
I'm a mechanical/electrical engineer by training, and what you're saying makes no sense to me. Mistakes are made in the laboratory, where things are allowed to blow up and start fires. Once you hit the real world the considerations are *very different*. While it's possible that this fire was caused by something entirely unforeseeable (unlikely given our experience in this field), it's also possible that it was due to improperly designed systems.
I don't suppose you'd be singing the same tune if this were a bridge collapse that killed hundreds. There's a reason why engineering costs a lot, and it's directly correlated with how little failure we can tolerate.
Re: (Score:3, Interesting)
I know it's not really rel
Explosion? (Score:4, Insightful)
The only thing that I can imagine that could've caused an explosion in a datacenter is a battery bank (the data centers I've been in didn't have any large A/C transformers inside). And even then, I thought that the NEC had some fairly strict codes about firewalls, explosion-proof vaults and the like.
I just find it curious, since it's not unthinkable that rechargeable batteries might explode.
mr c
Re: (Score:2)
Re:Explosion? (Score:5, Informative)
Re:Explosion? (Score:4, Insightful)
At one place I worked, every lightning storm my boss would rush to move his shitty old truck underneath the can on the power pole, hoping the thing would blow and burn it so he could get insurance to replace it.
Re:Explosion? (Score:5, Interesting)
Coral cached LOFI status page (Score:5, Informative)
Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lo-fi version of their status page:
Lithium Batteries in their UPS setup?? (Score:3, Interesting)
Where would the "failure" (Short/Electrical Explosion) have to be to cause everything to go dark?
Sounds like the power distribution circuits downstream of the UPS/Generator were damaged.
Whatever vendor provided the now-vaporized components is likely praying that the specifics are not mentioned here.
I recall something about lithium batteries exploding in telecom DSLAMs... I wonder if their UPS system used lithium-ion cells?
http://www.lightreading.com/document.asp?doc_id=109923 [lightreading.com]
http://tech.slashdot.org/article.pl?sid=07/08/25/1145216 [slashdot.org]
http://hardware.slashdot.org/article.pl?sid=07/09/06/0431237 [slashdot.org]
Re: (Score:2, Informative)
kaboom (Score:2, Funny)
5 servers, 5 cities, 5 providers (Score:2, Insightful)
I feel bad for their techs, but I have no sympathy for someone who's single-sourced; they should have propagated to their offsite secondary.
Which they'll be buying tomorrow, I'm sure.
Re:5 servers, 5 cities, 5 providers (Score:4, Insightful)
Most people own a single server that they make backups of in case it crashes, OR have two servers in the same datacenter in case one fails.
I don't know how you can easily do offsite switchover without a huge infrastructure to support it, which most people don't have the time or money for.
Get off your high horse.
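To be fair to both sides: a real hot standby does take serious infrastructure, but a bare-minimum offsite copy doesn't. Here's a sketch in Python wrapping rsync; the hostname and paths are made up, and this is only a nightly mirror, not actual failover:

    # Nightly sync of the web root and database dumps to a box at another
    # provider. Restoring from it still takes manual work, but at least the
    # data isn't sitting in one exploding power room.
    import subprocess
    import sys

    OFFSITE = "backup@standby.other-dc.example:/srv/mirror/"   # hypothetical host
    SOURCES = ["/var/www/", "/var/backups/mysql/"]             # hypothetical paths

    for src in SOURCES:
        # -a preserves permissions/times, -z compresses, --delete mirrors removals
        result = subprocess.run(
            ["rsync", "-az", "--delete", src, OFFSITE],
            capture_output=True, text=True)
        if result.returncode != 0:
            print(f"rsync of {src} failed:\n{result.stderr}", file=sys.stderr)
            sys.exit(1)

    print("offsite copy refreshed")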
More details on the outage (Score:3, Informative)
Re: (Score:3, Informative)
DNS is *THE* *MOST* critical part of infrastructure. If the HTTP server fails, ok. If mail fails, ok. If the data center explodes, you still have DNS, so anyone sending you email will just have it queue up for a few days. But if DNS is offline, then email is offline. You are off the internet.
I've had a server motherboard die and it took a few days to get a new one installed and
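Along those lines, here's a quick Python sanity check that a zone's nameservers don't all sit on the same network. The NS hostnames below are placeholders; in practice you'd pull the real set from your registrar or a dig NS query:

    # Crude check: resolve each nameserver and group by /24. If they all land
    # in one prefix, a single facility outage takes every hosted domain offline.
    import socket

    nameservers = ["ns1.example-host.com", "ns2.example-host.com",
                   "ns3.other-provider.net"]   # hypothetical NS set

    prefixes = set()
    for ns in nameservers:
        try:
            ip = socket.gethostbyname(ns)
        except socket.gaierror:
            print(f"warning: {ns} does not resolve at all")
            continue
        prefixes.add(".".join(ip.split(".")[:3]))   # crude /24 grouping
        print(f"{ns} -> {ip}")

    if len(prefixes) <= 1:
        print("All nameservers appear to live on one network -- one outage "
              "(or one exploding power room) takes every zone offline.")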
No servers were damaged (Score:4, Funny)
Monty Python (Score:5, Funny)
Ignorant firemen = single point-of-failure (Score:4, Interesting)
Everyone loves firemen, right? Not me. While the guys you see in the movies running into burning buildings might be heroes, real-world firemen (or more specifically fire chiefs) are capricious, arbitrary, ignorant little rulers of their own personal fiefdoms. Did you know that if you are getting an inspection from your local fire chief and he commands something, there is no appeal? His word is law, no matter how STUPID or IGNORANT. I'll give you some examples later.
I'm one of the affected customers. I have about 100 domains down right now because both my nameservers were hosted at the facility, as is the control panel that I would use to change the nameserver IPs. Whoops. So I learned why I obviously need to have NS3 and NS4 and spread them around, because even though the servers are spread everywhere, without my nameservers none of them currently resolve.
It sounds like the facility was ordered to cut ALL power because of some fire chief's misguided fear that power flows backwards from a low-voltage source to a high-voltage one. I admit I don't know much about the engineering of this data center, but I'm pretty sure the "Y" junction where utility and generator power come together is going to be as close to the rack power as possible to avoid lossy transformation. It makes no sense why they would have 220 or 400 VAC generators running through the same high-voltage transformer when it would be far more efficient to feed in at 120 V or even 12 VDC (if only servers would accept that). But I admit I could be wrong, and if it is a legit safety issue... then it's apparently a single point of failure for every data center out there, because ThePlanet charged enough that they don't need to cut corners.
Here's a couple of times that I've had my hackles raised by some fireman with no knowledge of technology. The first was when we switched alarm companies and required a fire inspector to come and sign off on the newly installed system. The inspector said we needed to shut down power for 24 hours to verify that the fire alarm would still work after that period of time (a code requirement). No problem, we said, reaching for the breaker for that circuit.
No no, he said. ALL POWER. That meant the entire office complex, some 20-30 businesses, would need to be without power for an entire day so that this f'ing idiot could be sure that we weren't cheating by sneaking supplementary power from another source.
WHAT THE FRACK
We ended up having to rent generators and park them outside to keep our racks and critical systems running, and then rent a conference room to relocate employees. We went all the way to the county commissioners pointing out how absolutely stupid this was (not to mention, who the HELL is still going to be in a burning building 24 hours after the alarm's gone off), but we were told that there was no override possible.
The second time was at a different place when we installed a CO alarm as required for commercial property. Well, the inspector came and said we need to test it. OK, we said, pressing the test button. No no, he said, we need to spray it with carbon monoxide.
Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem, but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier, we finally called the manufacturer of the device, who pointed out that the device was void the second it was exposed to CO because the sensor was not reusable. In other words, once the sensor was tripped, it was time to buy a new monitor. You can see the recursive loop that would have developed if we actually had tested the device and then promptly had to replace it and get the new one retested by this idiot.
So finally we got a letter from the manufacturer that pointed out the device was UL certified and that pressing the test button WAS the way you tested the device. It took four weeks of arguing before he finally found an excuse that let him save face and
I'm a firefighter AND a geek. You, not so much. (Score:5, Insightful)
I have a crew to protect. In this case, I'm going into an extremely hazardous environment. There has already been one explosion. I don't know what I'm going to see when I get there, but I do know that this place is wall-to-wall danger. Wires everywhere to get tangled in when it's dark and I'm crawling through the smoke. Huge amounts of current. Toxic batteries everywhere that may or may not be stable. Wiring that may or may not be exposed.
If it's me in charge, and it's my crew making entry, the power is going off. It's getting a lock-out tag on it. If you won't turn it off, I will. If I do it, you won't be turning it back on so easily. If need be, I will have the police haul you away in cuffs if you try to stop me.
My job, as a firefighter -- as a fire officer -- is to ensure the safety of the general public, of my crew, and then if possible of the property.
NOW -- as a network guy and software developer -- I can say that if you're too short-sighted or cheap to spring for a secondary DNS server at another facility, or if your servers are so critical to your livelihood that losing them for a couple of days will kill you but you haven't bothered to go with hot spares at another data center, then you, sir, are an idiot.
At any data center - anywhere - anything can happen at any time. The f'ing ground could open up and swallow your data center. Terrorists could target it because the guy in the rack next to yours is posting cartoon photos of their most sacred religious icons. Monkeys could fly out of the site admin's [nose] and shut down all the servers. Whatever. If it's critical, you have off-site failover. If not, you're not very good at what you do.
End of rant.
Re: (Score:3, Insightful)
Our S.O.G. (standard operating guidelines) are actually very specific about risk.
We will risk our lives to save a human life.
We will take reasonable risk to save the lives of pets and livestock.
We will take minimal risks to save property.
Sorry, but your building isn't worth risking my crew. That's reality.
Don't you DARE tell me what is and isn't brave or cowardly until
Not so simple. (Score:5, Informative)
1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.
2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work" we're talking about something that blew the walls off the room it was in.
3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.
So, as a fire officer I arrive on scene and have a smoke-filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, whether anything is still burning or in immediate danger, and whether anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).
The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.
We go into a place like this not knowing the equipment, not knowing its condition.
My final proof point --
If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet, I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first-floor servers they're having to rig up a line from the generators to that floor, and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first-floor servers are destroyed. They're going to be running 3,000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.
What does this prove?
1. It proves the fire marshal was right in not allowing them to feed power in there.
2. It proves that that big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world's) may in fact have a good reason for the way he does things on scene.
Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder, every time I walk into a building, how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it; it's what you're trained to do.
"short in a high-volume wire conduit."? (Score:3, Informative)
They supposedly had a "short in a high-volume wire conduit." That leads to questions as to whether they exceeded the NEC limits [generalcable.com] on how much wire and how much current you can put through a conduit of a given size. Wires dissipate heat, and the basic rule is that conduits must be no more than 40% filled with wire. The rest of the space is needed for air cooling. The NEC rules are conservative, and if followed, overheating should not be a problem.
This data center is in a hot climate, and a data center is often a continuous maximum load on the wiring, so if they did exceed the packing limits for conduit, a wiring failure through overheating is a very real possibility.
Some fire inspector will pull charred wires out of damaged conduit and compare them against the NEC rules. We should know in a few days.
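For anyone curious about the 40% rule mentioned above, here's a back-of-the-envelope check in Python. Every dimension below is made up for illustration; real work uses the actual conductor and conduit areas from the NEC Chapter 9 tables.

    # Conduit fill = total conductor cross-sectional area / conduit internal area.
    import math

    conduit_inner_diameter_in = 2.067   # assumed inside diameter of a 2" conduit
    wire_outer_diameter_in = 0.32       # assumed OD of a medium-gauge conductor
    wire_count = 20                     # assumed number of conductors in the run

    conduit_area = math.pi * (conduit_inner_diameter_in / 2) ** 2
    wire_area = math.pi * (wire_outer_diameter_in / 2) ** 2
    fill = wire_count * wire_area / conduit_area

    print(f"fill: {fill:.1%} (NEC limit for more than 2 conductors is 40%)")
    if fill > 0.40:
        print("over-filled: expect exactly the kind of heat problem described above")

With these made-up numbers the run comes out around 48% filled, i.e. well past the limit, which is the sort of thing an inspector would spot immediately in the charred remains.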
Is this why YouTube is down? (Score:3, Interesting)
YouTube's home page is returning "Service unavailable". Is this related? (Google Video is up.)
_The_ Power Room? (Score:3, Insightful)
How the hell could they claim redundant power with only one power room?
Re: (Score:3, Insightful)
I'm a customer in that DC, and I'm a firefighter (Score:5, Insightful)
My thoughts as a customer of theirs:
1. Good updates. Not as frequent or clear as I'd like, but mostly they didn't have much to add.
2. Anyone bitching about the thousands of dollars per hour they're losing has no credibility with me. If your junk is that important, your hot standby server should be in another data center.
3. This is a very rare event, and I will not be pulling out of what has been an excellent relationship so far with them.
4. I am adding a fail over server in another data center (their Dallas facility). I'd planned this already but got caught being too slow this time.
5. Because of the incident, I will probably make the new Dallas server the primary and the existing Houston one the backup. This is because I think there will be long term stability issues in this Houston data center for months to come. I know what concrete, drywall, and fire extinguisher dust does to servers. I also know they'll have a lot of work in reconstruction ahead, and that can lead to other issues.
For now, I'll wait it out. I've heard of this cool place called "outside". Maybe I'll check it out.
1700 test not necessarily a failure (Score:3, Insightful)
The initial draw from each new bank of gear to be given power will be very high, so they will need to go slowly.
The battery systems (be they on each rack or in large banks serving whole blocks) will try to charge all at once. If they're not careful, that'll heat those new power lines up like the filaments in a toaster. Remember, the battery
"Murphy's Law" != "Shit Happens" (Score:3, Insightful)
The lesson you should be taking from Murphy's Law is not "Shit Happens". The lesson you should be taking is that you can't assume that an unlikely problem (or one you can con yourself into thinking unlikely) is one you can ignore. It's only after you've prepared for every reasonable contingency that you're allowed to say "Shit Happens".
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Blank Label Comics, Schlock Mercenary (Score:2, Informative)
Re: (Score:3, Informative)
Re: (Score:2)
Houston affected, Dallas data center unaffected (Score:2)
Re: (Score:2)
Oh, and thousands of dull corporate brochureware sites.
Re: (Score:3, Interesting)
Re:More planning could have prevented this (Score:5, Informative)
Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.
To put it another way, only the power supply was damaged by the explosion; the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.
So, in summary, the 9,000 servers were not blown up. Only the power.
The power is off due to the explosion, but the servers themselves are A-OK.
Re:More planning could have prevented this (Score:4, Interesting)
Physically OK, maybe... let's see how many of them come back up when the power is restored ^ ^
Re:More planning could have prevented this (Score:5, Insightful)
Re:More planning could have prevented this (Score:5, Funny)
Re: (Score:3, Informative)
Correction (Score:5, Informative)
Re:More planning could have prevented this (Score:5, Informative)
Another issue is that the complexity of a full-blown 2N power system is likely to cause more outages, through human error during routine maintenance, than an N+1 system would. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
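A toy comparison of the two schemes, assuming independent unit failures (which, as noted above, is exactly what you don't get with common-mode events like an exploding power room, or with maintenance-induced human error). All numbers are invented:

    p = 0.02   # assumed chance any single power unit is down at a given moment
    N = 4      # assumed number of units needed to carry the full load

    def n_plus_1(p, N):
        # Survives if at most one of the N+1 installed units is down.
        up = 1 - p
        return up ** (N + 1) + (N + 1) * p * up ** N

    def two_n(p, N):
        # Two independent full chains of N units; survives if either chain is whole.
        chain_ok = (1 - p) ** N
        return 1 - (1 - chain_ok) ** 2

    print(f"N+1 availability: {n_plus_1(p, N):.5f}")   # ~0.99616
    print(f"2N  availability: {two_n(p, N):.5f}")      # ~0.99397

Under this simplistic model the two schemes land within a fraction of a percent of each other, which supports the point above: full 2N isn't automatically the only sane choice once you account for its cost and complexity.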
Re: (Score:3, Insightful)
In related news, I was wondering why I wasn't getting much spam today and my sites didn't have strange spiders hitting them.
Re: (Score:3, Informative)
Re:More planning could have prevented this (Score:4, Interesting)
While it may be the fire dept that is erroneously preventing them from bringing up their backup power, it's part of a poor disaster recovery plan not to engage with the fire dept, electric co, etc. before a disaster happens, so that everyone is on board with your disaster recovery plan and you actually have the ability to implement it.
The explosion was isolated to the power room. The servers are fine, the backup generators and batteries are fine. The servers should have been back online if they had a good disaster recovery plan. The whole point of disaster recovery is being able to handle a disaster. You can't say "oh there was a disaster, you can't help that". This is exactly what their plan should have been able to handle. The power room goes offline. It shouldn't matter if it was because of an explosion, a fire, equipment failure or being beamed into outer space.
It also shouldn't matter who is telling them to keep the power off. Part of the disaster recovery plan should have been making sure local authorities allowed them to carry it out. Fine, they have to shut off all power when firemen are in there with hoses. I understand that. But once the fire is out your plan should allow you to bring up backup power. It didn't. So I don't see how they can call themselves a "World Class Data Center". Part of what they sell and what customers expect is disaster recovery. And there are data centers that can provide this.
ThePlanet is pretty cheap compared to datacenters like NAC that have more redundancy and security. But ThePlanet wants to advertise that they are just as good. Now they were caught with their pants down when there was actually a disaster and their disaster recovery plan failed.
Re:More planning could have prevented this (Score:4, Informative)
Re: (Score:3, Informative)
Re: (Score:3, Funny)
Re: (Score:3, Insightful)
If your business loses $1000/minute while it's offline, get a quote for insurance that pays out $1000/minute while you're offline. Alternatively, if you're happy self-insuring, take the loss when it happens.
It's almost as if people believe that SLAs are a form of service guarantee instead of a free, very bad insurance deal.
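That point in made-up numbers (a sketch; the hosting fee and credit cap below are assumptions, not ThePlanet's actual terms):

    loss_per_minute = 1_000       # assumed business loss while offline
    outage_minutes = 24 * 60      # a day-long outage like this one
    monthly_hosting_fee = 300     # assumed dedicated-server price
    # Many SLAs cap the credit at (some fraction of) one month's fee; assume 100%.
    max_sla_credit = monthly_hosting_fee

    business_loss = loss_per_minute * outage_minutes
    print(f"business loss:  ${business_loss:,}")                  # $1,440,000
    print(f"max SLA credit: ${max_sla_credit:,}")                 # $300
    print(f"uncovered:      ${business_loss - max_sla_credit:,}") # $1,439,700

If your downtime really costs that much, the SLA credit is noise; either buy real business-interruption insurance or build the redundancy yourself.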