Explosion At ThePlanet Datacenter Drops 9,000 Servers
An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
Blank Label Comics, Schlock Mercenary (Score:2, Informative)
Coral cached LOFI status page (Score:5, Informative)
Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lofiversion of their status page:
Re:More planning could have prevented this (Score:5, Informative)
Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.
To put it another way, only the power supply was damaged by the explosion, the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.
So, in summary, the 9,000 servers were not blown up. Only the power.
The power is off due to the explosion but the servers themselves are A-OK.
Re:Blank Label Comics, Schlock Mercenary (Score:3, Informative)
Re:Lithium Batteries in their UPS setup?? (Score:2, Informative)
Re:Server/customer ratio? (Score:3, Informative)
Re:Kudo to their support team (Score:1, Informative)
Shit happens. The question then becomes how you deal with it.
As above, see below. Will follow with interest.
Re:More planning could have prevented this (Score:3, Informative)
More details on the outage (Score:3, Informative)
Re:Photos or informaton on building? (Score:5, Informative)
I'm a mechanical/electrical engineer by training, and what you're saying makes no sense to us. Mistakes are made in the laboratory, where things are allowed to blow up and start fires. Once you hit the real world the considerations are *very different*. While it's possible that this fire could be caused by something entirely unforeseeable (unlikely given our experience in this field), it's also possible that this was due to improperly designed systems.
I don't suppose you'd be singing the same tune if this was a bridge collapse that killed hundreds. There's a reason why engineering costs a lot, and that's directly correlated to how little failure we can tolerate.
Huh??? No amount of planning? (Score:2, Informative)
At least with colocation, if the building gets blown up by terrorists, the servers are still running somewhere else.
Re:Explosion? (Score:5, Informative)
Correction (Score:5, Informative)
Re:More details on the outage (Score:2, Informative)
Many customers also use their DNS service (the EV1 DNS), so while there are 9000 servers physically 'off' there are many more effectively 'black' as the canonical names no longer resolve.
I'm one of those customers. We're a very small business, as are many of the other customers of The Planet (formerly Everyones Internet -- EV1.net).
I can still access our server via the IP address, but not via the canonical name.
While we host our site on a private server, many of the other customers are resellers, and with the DNS service down I could easily see how tens of thousands of actual sites are down beyond the 9000 physical servers.
Re:Kudo to their support team (Score:5, Informative)
First ISP (Score:2, Informative)
Re:More planning could have prevented this (Score:5, Informative)
Another issue is that the complexity of a full-blown 2N power system is likely to cause more outages, due to human error during routine maintenance, than an N+1 system would. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
"short in a high-volume wire conduit."? (Score:3, Informative)
They supposedly had a "short in a high-volume wire conduit." That leads to questions as to whether they exceeded the NEC limits [generalcable.com] on how much wire and how much current you can put through a conduit of a given size. Wires dissipate heat, and the basic rule is that conduits must be no more than 40% filled with wire. The rest of the space is needed for air cooling. The NEC rules are conservative, and if followed, overheating should not be a problem.
This data center is in a hot climate, and a data center is often a continuous maximum load on the wiring, so if they do exceed the packing limits for conduit, a wiring failure through overheat is a very real possibility.
Some fire inspector will pull charred wires out of damaged conduit and compare them against the NEC rules. We should know in a few days.
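For anyone who wants to see what the 40% fill rule looks like as arithmetic, here is a rough sketch. It only compares cross-sectional areas against the 40% figure quoted above. The diameters are made-up illustration values, the function names are hypothetical, and a real check would use the NEC tables for the specific conduit and conductor types (and would also have to consider ampacity derating, which this ignores).

```python
import math

# The 40% figure is the one quoted in the comment above; treat it as an
# assumption for this sketch, not as a substitute for the NEC tables.
MAX_FILL = 0.40


def circle_area(diameter):
    """Cross-sectional area of a circle with the given diameter."""
    return math.pi * (diameter / 2) ** 2


def conduit_fill(conduit_inner_diameter, wire_diameters):
    """Fraction of the conduit cross-section occupied by wire.

    wire_diameters are overall diameters, insulation included, in the
    same unit as conduit_inner_diameter.
    """
    wire_area = sum(circle_area(d) for d in wire_diameters)
    return wire_area / circle_area(conduit_inner_diameter)


def within_fill_limit(conduit_inner_diameter, wire_diameters, max_fill=MAX_FILL):
    """True if the wires occupy no more than max_fill of the conduit."""
    return conduit_fill(conduit_inner_diameter, wire_diameters) <= max_fill


if __name__ == "__main__":
    # Hypothetical numbers: a 2.0 inch conduit with 0.5 inch wires.
    for count in (4, 8):
        wires = [0.5] * count
        fill = conduit_fill(2.0, wires)
        print(count, "wires:", round(fill * 100, 1), "% fill,",
              "ok" if within_fill_limit(2.0, wires) else "over limit")
```

With those made-up numbers, four wires come out at a quarter of the conduit area and pass, while eight wires fill half of it and fail.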
Re:Lithium Batteries in their UPS setup?? (Score:1, Informative)
Re:Ignorant firemen = single point-of-failure (Score:1, Informative)
Not that I don't sympathize with your predicament, but carbon monoxide is routinely used in chemistry labs around the country (I used it in graduate school). Call up Aldrich [sigmaaldrich.com] and they will happily ship you just about any chemical.
That being said, carbon monoxide has the potential to be extremely toxic and should not be used without proper training and safety equipment. That's why you have carbon monoxide detectors!
Re:More planning could have prevented this (Score:3, Informative)
Re:More planning could have prevented this (Score:4, Informative)
Re:More planning could have prevented this (Score:3, Informative)
Update 11:14 PM CST (Score:4, Informative)
Re:Printer ignition source (Score:3, Informative)
In short, most unix printing systems understand a very small number of printer status codes, usually consisting of "READY", "ONLINE", "OFFLINE", and "PRINTER ON FIRE".
The latter status message was actually semi-serious, and was thrown whenever the printer was encountering a serious error, but for some reason was continuing to print anyway. In the case of a high-speed mainframe printer, if the printer jammed but continued attempting to print, a fire could easily start due to the amount of friction created by the high-speed motors.
Re:Server/customer ratio? (Score:3, Informative)
http://lists.debian.org/debian-devel/2002/11/msg01926.html [debian.org]
Re:More details on the outage (Score:3, Informative)
DNS is *THE* *MOST* critical part of infrastructure. If the HTTP server fails, ok. If mail fails, ok. If the data center explodes, you still have DNS, so anyone sending email will just be stuck for a few days. But if DNS is offline, then email is offline. You are off the internet.
I've had a server motherboard die and it took a few days to get a new one installed and running. But my DNS was running because backups were on different IPs and places.
I have to say, this is a BIG no-no for them not to provide proper DNS services.
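A crude way to sanity-check the "different IPs and places" part is to see whether your nameserver addresses at least sit in different networks. The sketch below is only that: it assumes IPv4 addresses, treats a different /24 as a rough stand-in for a different location (which it may not be), and uses hypothetical function names and documentation-range example addresses rather than anyone's real nameservers.

```python
import ipaddress


def distinct_networks(nameserver_ips, prefix_len=24):
    """Return the set of networks (default /24) the given IPs fall into."""
    networks = set()
    for ip in nameserver_ips:
        # strict=False lets us pass a host address and get its enclosing network
        networks.add(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
    return networks


def has_network_diversity(nameserver_ips, prefix_len=24):
    """True if the nameservers span at least two distinct networks."""
    return len(distinct_networks(nameserver_ips, prefix_len)) >= 2


if __name__ == "__main__":
    # Example addresses from the reserved documentation ranges.
    same_rack = ["192.0.2.10", "192.0.2.20"]
    spread_out = ["192.0.2.10", "198.51.100.10"]
    print(has_network_diversity(same_rack))    # both in 192.0.2.0/24
    print(has_network_diversity(spread_out))   # two different /24 networks
```

Passing this check does not prove the servers are in different buildings, but failing it is a strong hint that one power room can take out all of your DNS at once.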
Not so simple. (Score:5, Informative)
1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.
2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work"; we're talking about something that blew the walls off the room it was in.
3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.
So, as a fire officer I arrive on scene and have a smoke-filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, if anything is still burning or in immediate danger, and if anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).
The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.
We go into a place like this not knowing the equipment, not knowing its condition.
My final proof point --
If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first floor servers they're having to rig up a line from the generators to that floor and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first floor servers are destroyed. They're going to be running 3000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.
What does this prove?
1. It proves the fire marshal was right in not allowing them to feed power in there.
2. It proves that the big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world) may in fact have a good reason for the way he does things on scene.
Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder every time I walk into a building how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it; it's what you're trained to do.
No, not necessarily (Score:5, Informative)
Well, with building insurance, that's not the case. You aren't really a significant risk factor. Risk is instead calculated off of things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for), etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.
Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout vs time and come up with a rate. If something changes the risk, the rate will change as well, but if not then it doesn't change. It isn't as though your one single payout is of any significance to their overall operation.
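To make the "probability times payout" idea concrete, here is a back-of-the-envelope sketch. The probability and the loading factor are invented for illustration, the function names are hypothetical, and real insurers use far more elaborate models; the point is only that the rate follows from the risk inputs, not from whether you personally have claimed before.

```python
def expected_annual_loss(annual_claim_probability, payout):
    """Average yearly cost to the insurer of covering this risk."""
    return annual_claim_probability * payout


def annual_premium(annual_claim_probability, payout, loading=0.25):
    """Expected loss plus a markup for overhead and profit.

    loading=0.25 is an arbitrary illustrative markup, not an industry figure.
    """
    return expected_annual_loss(annual_claim_probability, payout) * (1 + loading)


if __name__ == "__main__":
    # Hypothetical: a 0.2% yearly chance of a 200,000 payout.
    p, payout = 0.002, 200_000
    print("expected loss per year:", round(expected_annual_loss(p, payout), 2))
    print("premium per year:", round(annual_premium(p, payout), 2))
    # Same inputs after a claim give the same premium: nothing about the
    # risk changed, so in this simple model the rate does not change either.
    print("premium after a claim:", round(annual_premium(p, payout), 2))
```

The premium only moves when the probability or the payout moves, which matches the argument above.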
Also, the idea of "Just pay for it yourself," is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example, I have insurance on my house. The reason is that if I lost it, I couldn't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is then, of course, betting that it isn't likely to happen and they get to keep the money.