The Internet Businesses

Explosion At ThePlanet Datacenter Drops 9,000 Servers 431

An anonymous reader writes "Customers hosting with ThePlanet, a major Texas hosting provider, are going through some tough times. Yesterday evening at 5:45 pm local time an electrical short caused a fire and explosion in the power room, knocking out walls and taking the entire facility offline. No one was hurt and no servers were damaged. Estimates suggest 9,000 servers are offline, affecting 7,500 customers, with ETAs for repair of at least 24 hours from onset. While they claim redundant power, because of the nature of the problem they had to go completely dark. This goes to show that no matter how much planning you do, Murphy's Law still applies." Here's a Coral CDN link to ThePlanet's forum where staff are posting updates on the outage. At this writing almost 2,400 people are trying to read it.
This discussion has been archived. No new comments can be posted.

  • by strredwolf ( 532 ) on Sunday June 01, 2008 @02:29PM (#23618881) Homepage Journal
    Schlock Mercenary, the popular webcomic, as well as most of the Blank Label Comics collective is down. Schlockmercenary.com now points to a holder site, and Sunday's comic is on the Livejournal community at http://schlocktroups.livejournal.com./ [schlocktro...ournal.com]
  • by martyb ( 196687 ) on Sunday June 01, 2008 @02:30PM (#23618897)

    Kudos to them for their timely updates as to system status. Having their status page listed on /. doesn't help them much, but I was encouraged to see a Coral Cache link to their status page. In that light, here's a link to the Coral Cache lofiversion of their status page:

    • http://forums.theplanet.com.nyud.net:8080/lofiversion/index.php/t90185.html
  • by Hijacked Public ( 999535 ) on Sunday June 01, 2008 @02:37PM (#23618965)

    It is often the case that transformers are kept apart from all other components
    And that appears to have been the case here. Had you read the article, or even the unusually accurate headline, you would know that the 9,000 servers were 'dropped' rather than 'blown apart'. They are still physically with us, they are just dropped from service because they don't have any power because the power supply blew up.

    Further, the 9,000 servers were physically, geographically, isolated enough from the power supply (which is what exploded) to be protected. We know this to be the case because we read the article and headline and understood them and they indicate that the 9,000 servers were not blown up.

    To put it another way, only the power supply was damaged by the explosion, the servers were not. Probably there was no way to isolate the power from its own explosion. The servers, however, were protected.

    So, in summary, the 9,000 servers were not blown up. Only the power.

    The power is off due to the explosion but the servers themselves are A-OK.
  • by strredwolf ( 532 ) on Sunday June 01, 2008 @02:39PM (#23618981) Homepage Journal
    http://community.livejournal.com/schlocktroops/ [livejournal.com] for the right URL.
  • by Anonymous Coward on Sunday June 01, 2008 @02:43PM (#23619019)
    If you'd read the linked status report, you'd see that there was a short in a high voltage line. They are dark because the fire department told them not to power up their back-up generators.
  • by bipbop ( 1144919 ) on Sunday June 01, 2008 @02:43PM (#23619023)
    At my last job, BCP guidelines required both: a minimum of four servers for anything, two of which must be at a physically distant datacenter.
  • by Anonymous Coward on Sunday June 01, 2008 @02:44PM (#23619037)
    Yes.

    Shit happens. The question then becomes how you deal with it.

    As above, see below. Will follow with interest.
  • by Gazzonyx ( 982402 ) <scott.lovenberg@gm a i l.com> on Sunday June 01, 2008 @02:45PM (#23619047)
    No, the power was off because the fire department told them to shut it off (during an investigation, I assume). The explosion was in a high power conduit - I'm sure it severed all the lines inside the conduit itself. This is one of those things that couldn't easily be avoided at a single site. But, if your server is of any importance, you do have a colo, right?
  • by 1sockchuck ( 826398 ) on Sunday June 01, 2008 @02:46PM (#23619059) Homepage
    Data Center Knowledge has a story on the downtime at The Planet [datacenterknowledge.com], summarizing the information from the now Slashdotted forums. Only one of the company's six data centers was affected. The Planet has more than 50,000 servers in its network, meaning that roughly one in five customers are offline.
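A quick check of that ratio, using only the two figures quoted in the story (about 9,000 servers down out of roughly 50,000). This is a sketch of the arithmetic only; `offline_share` is a hypothetical helper name, and the result is a share of servers, which only approximates the share of customers.

```python
def offline_share(offline: int, total: int) -> float:
    """Return the fraction of the fleet that is offline."""
    if total <= 0:
        raise ValueError("total must be positive")
    return offline / total


if __name__ == "__main__":
    # Figures as quoted in the story and the comment above.
    share = offline_share(9_000, 50_000)
    print(f"{share:.0%} of servers offline")
```

At 18%, "one in five" is a slight round-up.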
  • by p0tat03 ( 985078 ) on Sunday June 01, 2008 @02:49PM (#23619091)

    I'm a mechanical/electrical engineer by training, and what you're saying makes no sense to us. Mistakes are made in the laboratory, where things are allowed to blow up and start fires. Once you hit the real world the considerations are *very different*. While it's possible that this fire could be caused by something entirely unforeseeable (unlikely given our experience in this field), it's also possible that this was due to improperly designed systems.

    I don't suppose you'd be singing the same tune if this was a bridge collapse that killed hundreds. There's a reason why engineering costs a lot, and that's directly correlated to how little failure we can tolerate.

  • by www.sorehands.com ( 142825 ) on Sunday June 01, 2008 @02:54PM (#23619135) Homepage
    Really? What about a little known thing called colocation?

    At least with colocation, if the building gets blown up by terrorists, the servers are still running somewhere else.

  • Re:Explosion? (Score:5, Informative)

    by Gazzonyx ( 982402 ) <scott.lovenberg@gm a i l.com> on Sunday June 01, 2008 @02:54PM (#23619137)
    Actually, modern batteries should be sealed valve or Absorbed Glass Mat (AGM) that don't vent (too much) hydrogen. During a thermal runaway, they vent a tiny bit before killing themselves, but hydrogen doesn't become explosive until the concentration in an enclosed environment is ~4%. 4% of a data center is a fairly large area. I've heard of this happening in one data center where the primary and fail over (IIRC) HVAC units failed and no one had been on site for well over a month. IOW, every battery in the place started venting and it took over a month without any air circulation for it to get to 4%.
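To put the ~4% figure in perspective, here is a rough sketch of how much hydrogen a room would have to hold to reach that concentration. The room dimensions are made up for illustration, `hydrogen_needed_m3` is a hypothetical helper, and the calculation assumes the gas is evenly mixed, which is unlikely in practice because hydrogen is lighter than air and collects near the ceiling.

```python
H2_LOWER_EXPLOSIVE_LIMIT = 0.04  # ~4% by volume, the figure quoted in the comment


def hydrogen_needed_m3(room_volume_m3: float,
                       limit: float = H2_LOWER_EXPLOSIVE_LIMIT) -> float:
    """Volume of hydrogen (m^3) needed to reach `limit` in an evenly mixed room."""
    if room_volume_m3 < 0:
        raise ValueError("room volume cannot be negative")
    return room_volume_m3 * limit


if __name__ == "__main__":
    # Hypothetical battery room: 30 m x 20 m floor, 4 m ceiling.
    volume = 30 * 20 * 4
    print(f"{hydrogen_needed_m3(volume):.0f} m^3 of hydrogen in a {volume} m^3 room")
```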
  • Correction (Score:5, Informative)

    by Gazzonyx ( 982402 ) <scott.lovenberg@gm a i l.com> on Sunday June 01, 2008 @03:00PM (#23619171)
    Sorry for replying to myself, I don't think I made my post clear; the backup power is not on (the mains was blown to bits), because the fire department told them to shut it off.
  • by filmotheklown ( 740735 ) on Sunday June 01, 2008 @03:21PM (#23619341)
    Not Totally True.

    Many customers also use their DNS service, (the EV1 DNS), so while there are 9000 servers physically 'off' there are many more effectively 'black' as the canonical names no longer resolve.

    I'm one of those customers. We're a very small business as are many of the other customers of The Planet (formerly Everyones Internet -- EV1.net)

    I can still access our server via the IP address, but not via the canonical name.

    While we host our site on a private server, many of the servers of other customers are resellers and with the DNS service, I could easily see how tens of thousands of actual sites are down beyond the 9000 physical servers.
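The symptom described above (server reachable by IP, name not resolving) can be told apart from a dead server with two independent checks. A minimal sketch, assuming Python's standard `socket` module; `diagnose` and `tcp_reachable` are hypothetical helper names, and the hostname and address in the usage are placeholders, not real customer hosts.

```python
import socket
from typing import Callable


def tcp_reachable(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(
    hostname: str,
    ip: str,
    port: int = 80,
    resolve: Callable[[str], str] = socket.gethostbyname,
    reachable: Callable[[str, int], bool] = tcp_reachable,
) -> str:
    """Classify an outage as DNS-only, server-down, or healthy."""
    try:
        resolve(hostname)
        name_ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        name_ok = False
    server_ok = reachable(ip, port)
    if not server_ok:
        return "server unreachable"
    if not name_ok:
        return "dns-only outage: server is up but the name does not resolve"
    return "healthy"


if __name__ == "__main__":
    # Placeholder values; substitute your own hostname and server address.
    print(diagnose("www.example.com", "192.0.2.1"))
```

The resolver and the connection check are passed in as parameters so the classification logic can be exercised without touching the network.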

  • by SSpade ( 549608 ) on Sunday June 01, 2008 @03:28PM (#23619391) Homepage
    It's little known mostly because it's not actually true. I think you're confusing theplanet with the world, aka world.std.com.
  • First ISP (Score:2, Informative)

    by Anonymous Coward on Sunday June 01, 2008 @03:37PM (#23619445)
    You're thinking of The World. See http://www.theworld.com/about/internet.shtml [theworld.com].

  • by cecil_turtle ( 820519 ) on Sunday June 01, 2008 @03:38PM (#23619461)
    ThePlanet has 5 or more datacenters. The cost and complexity of doing a full blown physically separated 2N power system at every datacenter is far more expensive than taking the chance of having to issue a credit against an SLA. Not to mention that when a fire is involved, the fire department has full authority and may instruct you to cut all power anyway - they are coming in to an unknown situation and won't risk their own people just because you say the other power system is isolated.

    Another issue is that the complexity of a full blown 2N power system is likely to cause more outages due to human error during routine maintenance than an N+1 system would. Complete 2N power systems from grid and backup sources all the way to the servers with no single point of failure (transformers, wiring, switching, PDUs, UPSs, etc.) are enormously complex and expensive, so it's not "the only thing that makes sense". I assure you issuing a one-day pro-rated credit to all your customers is cheaper.
  • by Animats ( 122034 ) on Sunday June 01, 2008 @03:58PM (#23619595) Homepage

    They supposedly had a "short in a high-volume wire conduit." That leads to questions as to whether they exceeded the NEC limits [generalcable.com] on how much wire and how much current you can put through a conduit of a given size. Wires dissipate heat, and the basic rule is that conduits must be no more than 40% filled with wire. The rest of the space is needed for air cooling. The NEC rules are conservative, and if followed, overheating should not be a problem.
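The 40% rule of thumb is a ratio of cross-sectional areas, which is easy to sketch. The example below is an illustration only, not a code-compliance tool: `conduit_fill` and `within_limit` are hypothetical helpers, the diameters are invented, and the actual NEC has further rules (for example, different limits for one or two conductors, and ampacity derating) that are not modeled here.

```python
import math
from typing import Iterable

MAX_FILL = 0.40  # the 40% figure quoted in the comment above


def conduit_fill(conduit_inner_diameter: float,
                 wire_outer_diameters: Iterable[float]) -> float:
    """Fraction of the conduit's cross-section occupied by wire (same units)."""
    if conduit_inner_diameter <= 0:
        raise ValueError("conduit diameter must be positive")
    conduit_area = math.pi * (conduit_inner_diameter / 2) ** 2
    wire_area = sum(math.pi * (d / 2) ** 2 for d in wire_outer_diameters)
    return wire_area / conduit_area


def within_limit(conduit_inner_diameter: float,
                 wire_outer_diameters: Iterable[float],
                 max_fill: float = MAX_FILL) -> bool:
    """True if the wires occupy no more than `max_fill` of the conduit."""
    return conduit_fill(conduit_inner_diameter, wire_outer_diameters) <= max_fill


if __name__ == "__main__":
    # Invented example: a 2.0 in conduit carrying 0.5 in (insulated) wires.
    for count in (4, 8):
        wires = [0.5] * count
        print(count, "wires:", f"{conduit_fill(2.0, wires):.0%}",
              "ok" if within_limit(2.0, wires) else "over limit")
```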

    This data center is in a hot climate, and a data center is often a continuous maximum load on the wiring, so if they do exceed the packing limits for conduit, a wiring failure through overheat is a very real possibility.

    Some fire inspector will pull charred wires out of damaged conduit and compare them against the NEC rules. We should know in a few days.

  • by Anonymous Coward on Sunday June 01, 2008 @04:06PM (#23619649)
    Having worked there previously, I can tell you their battery systems use (literally) tons of deep cycle lead acid batteries. Once a year they get this badass huge shipment of Sears Craftsman deep cycle car batteries. Each bank of batteries was.... eh... roughly the size of a Cooper Mini. The process of replacing them was pretty amusing to watch, if only for the fact that the UPSes were so incredibly heavy that they need their own reinforced concrete flooring because of the weight.
  • by Anonymous Coward on Sunday June 01, 2008 @04:15PM (#23619703)
    Where the HELL can you buy a toxic substance like carbon monoxide, we asked. Not his problem but he wouldn't sign off until we did. After finding out that it was illegal to ship the stuff, and that there was no local supplier,

    Not that I don't sympathize with your predicament, but carbon monoxide is routinely used in chemistry labs around the country (I did in graduate school). Call up Aldrich [sigmaaldrich.com] and they will happily ship you just about any chemical.

    That being said, carbon monoxide has the potential to be extremely toxic and should not be used without proper training and safety equipment. That's why you have carbon monoxide detectors!
  • by jacquesm ( 154384 ) <j AT ww DOT com> on Sunday June 01, 2008 @05:24PM (#23620291) Homepage
    I'm one of their customers, and it takes more than a single instance in 5 years of hosting to make me switch. That said we'll see how long it takes to get things back up. Unfortunately *both* my dns servers are in that DC, I thought they were in physically distant locations... so much for ass-um-ing things...
  • by cecil_turtle ( 820519 ) on Sunday June 01, 2008 @07:03PM (#23620999)
    You may also be interested in a pretty positive write-up from SANS [sans.org] about ThePlanet's response and handling of the situation thus far.
  • by Viflux ( 1173577 ) on Sunday June 01, 2008 @08:10PM (#23621515)
    From the status update thread... "Today at approximately 5:45 p.m., a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down." I read this as the fire department ordering them to kill *all* the power for safety reasons, rather than the explosion knocking the whole thing out.
  • Update 11:14 PM CST (Score:4, Informative)

    by Solokron ( 198043 ) on Monday June 02, 2008 @12:33AM (#23623161) Homepage
    As previously committed, I would like to provide an update on where we stand following yesterday's explosion in our H1 data center. First, I would like to extend my sincere thanks for your patience during the past 28 hours. We are acutely aware that uptime is critical to your business, and you have my personal commitment that The Planet team will continue to work around the clock to restore your service.

    As you have read, we have begun receiving some of the equipment required to start repairs. While no customer servers have been damaged or lost, we have new information that damage to our H1 data center is worse than initially expected. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed.

    There is some good news, however. We have found a way to get power to Phase 2 (upstairs, second floor) of the data center and to restore network connectivity. We will be powering up the air conditioning system and other necessary equipment within the next few hours. Once these systems are tested, we will begin bringing the 6,000 servers online. It will take four to five hours to get them all running. We have brought in additional support from Dallas to have more hands and eyes on site to help with any servers that may experience problems. The call center has also brought in double staff to handle the increase in tickets we're expecting. Hopefully by sunrise tomorrow Phase 2 will be well on its way to full production.

    Let me next address Phase 1 (first floor) of the data center and the affected 3,000 servers. The news is not as good, and we were not as lucky. The damage there was far more extensive, and we have a bigger challenge that will require a two-step process. For the first step, we have designed a temporary method that we believe will bring power back to those servers sometime tomorrow evening, but the solution will be temporary. We will use a generator to supply power through next weekend when the necessary gear will be delivered to permanently restore normal utility power and our battery backup system. During the upcoming week, we will be working with those customers to resolve issues.

    We know this may not be a satisfactory solution for you and your business but at this time, it is the best we can do. We understand that you will be due service credits based on our Service Level Agreement. We will proactively begin providing those following the restoration of service, which is our number one priority, so please bear with us until this has been completed.

    I recognize that this is not all good news. I can only assure you we will continue to utilize every means possible to fully restore service. I plan to have an audio update tomorrow evening.

    Until then,
    Douglas J. Erwin
    Chairman & Chief Executive Officer
  • by moosesocks ( 264553 ) on Monday June 02, 2008 @12:38AM (#23623183) Homepage
    For those of you who don't get the joke, there's actually an entire wikipedia article devoted to it.

    In short, most unix printing systems understand a very small number of printer status codes, usually consisting of "READY, ONLINE, OFFLINE, and PRINTER ON FIRE"

    The latter status message was actually semi-serious, and was thrown whenever the printer was encountering a serious error, but for some reason was continuing to print anyway. In the case of a high-speed mainframe printer, if the printer jammed but continued attempting to print, a fire could easily start due to the amount of friction created by the high-speed motors.
  • by gnuman99 ( 746007 ) on Monday June 02, 2008 @12:40AM (#23623193)
    How about catching on fire and burning down??

    http://lists.debian.org/debian-devel/2002/11/msg01926.html [debian.org]
  • by gnuman99 ( 746007 ) on Monday June 02, 2008 @12:58AM (#23623281)
    Shouldn't they provide, you know, primary AND secondary DNS? And in that case, wouldn't the primary AND secondary be hosted in *different* data centers?

    DNS is *THE* *MOST* critical part of infrastructure. If the HTTP server fails, ok. If mail fails, ok. If data center explodes, you still have DNS so anyone sending email will just be stuck for a few days. But if DNS is offline, then email is offline. You are off the internet.

    I've had a server motherboard die and it took a few days to get a new one installed and running. But my DNS was running because backups were on different IPs and places.

    I have to say, this is a BIG no-no for them not to provide proper DNS services.
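One cheap sanity check on the "different IPs and places" point is to see whether a zone's nameserver addresses all fall in the same network block. A minimal sketch using Python's standard `ipaddress` module; `group_by_subnet` and `looks_redundant` are hypothetical helpers, the addresses are documentation placeholders, and sharing or not sharing a /24 is only a heuristic. As another commenter in this thread found out, servers you assume are far apart can still sit in the same data center.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_subnet(nameserver_ips: Iterable[str],
                    prefix: int = 24) -> Dict[str, List[str]]:
    """Group nameserver addresses by the network block that contains them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ip in nameserver_ips:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


def looks_redundant(nameserver_ips: Iterable[str], prefix: int = 24) -> bool:
    """Heuristic: True if the nameservers span more than one network block."""
    return len(group_by_subnet(nameserver_ips, prefix)) > 1


if __name__ == "__main__":
    # Placeholder addresses from the documentation ranges.
    same_block = ["192.0.2.10", "192.0.2.20"]
    spread_out = ["192.0.2.10", "198.51.100.5"]
    print(group_by_subnet(same_block), looks_redundant(same_block))
    print(group_by_subnet(spread_out), looks_redundant(spread_out))
```

A `False` result is a strong hint of a single point of failure; a `True` result still needs confirming with the provider.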
  • Not so simple. (Score:5, Informative)

    by CFD339 ( 795926 ) <{andrewp} {at} {thenorth.com}> on Monday June 02, 2008 @01:48AM (#23623561) Homepage Journal
    While it sounds like a reasonable approach at first, it makes assumptions that I can't make as an officer on scene.

    1. It assumes that the only problem is with the original transformer. When I arrive on scene I don't know what the problem was -- even if you tell me you do know, I can't believe it. I also don't know what the secondary problems are.

    2. Feeding power into a building that has been physically damaged is very very dangerous. We're not talking about a transformer "failing to work" we're talking about something that blew the walls off the room it was in.

    3. We already know that things didn't go the way they were supposed to. Something failed. Some safety plan didn't work. We have to assume that we're dealing with chaos until proved otherwise.

    So, as a fire officer I arrive on scene and have a smoke filled building with reports of an explosion and MAYBE a report that everyone is out. I need to go in and find out what happened, if anything is still burning or in immediate danger, and if anyone is still inside. To do that safely, the first thing I want to do is secure the power to the building (shut it off) as well as any other utility feeds (oil, steam, liquefied petroleum or natural gas).

    The gear I carry -- even the radio -- is designed to never create even the tiniest spark in its operation. We call it "intrinsically safe". It's one of a great many precautions we take.

    We go in to a place like this not knowing the equipment, not knowing its condition.

    My final proof point --

    If in fact The Planet had powered up their generators, they'd have fried a lot more stuff and caused more fire. They may have destroyed their chances of salvaging the grid within 48 hours at all. Why? It turns out (we now know) that the force of the initial explosion moved three walls in the power distribution center more than a foot (I heard 3 feet I think) off their base. This tore out electrical connections, cables, conduits and power switches. Just now, after 28 hours, they've figured out how to get power to the servers on the second floor, but for the first floor servers they're having to rig up a line from the generators to that floor and it will take until tomorrow to do that. Why? Because the electrical connections from that distribution room to the first floor servers are destroyed. They're going to be running 3000 servers on the first floor off those generators for a week while they get the equipment to rebuild the connectivity to the main distribution room.

    What does this prove?

    1. It proves the fire marshal was right in not allowing them to feed power in there.

    2. It proves that the big dumb fireman you see (who may be a volunteer who's also a network guy and software developer with an IQ above 95% of the world) may in fact have a good reason for the way they do things on scene.

    Look, as a firefighter I don't set out to ruin someone's day. I set out to keep them safe. If that sounds paternalistic, well, it is paternalistic. It very much feels that way. In my small town, it's how I feel. I wonder every time I walk into a building how I would protect MY PEOPLE in this building if a fire broke out or a hazmat incident started or whatever. You can't help it, it's what you're trained to do.

  • No, not necessarily (Score:5, Informative)

    by Sycraft-fu ( 314770 ) on Monday June 02, 2008 @02:28AM (#23623757)
    You are probably thinking of auto insurance. Yes, it usually goes up when used. The reason is because when you use it, it is usually because you did something that changed your risk level. If you get in an accident, that makes you a higher risk. Continue to get in accidents, you are a higher risk still. Thus the companies want more money. It's all based on risk calculation. That's also why they want more money when you are under 25. Statistically speaking, young people are a much higher risk of accidents.

    Well with building insurance, that's not the case. You aren't really a significant risk factor. Risk is instead calculated off of things like what kind of structure it is, how far it is from the fire department, what it's used for, what it contains (that determines what they are on the hook for) etc. So when something happens, unless it was because of a previously unknown risk factor, your rates don't necessarily change. Nothing changed with regards to risk.

    Insurance is really all just risk based. They take the probability of having to make a payout and the amount of said payout vs time and come up with a rate. If something changes the risk, the rate will change as well, but if not then it doesn't change. It isn't as though your one single payout is of any significance to their overall operation.
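The "probability times payout" idea can be sketched in a few lines. This is a deliberately simplified illustration, not how any real insurer prices a policy: `annual_premium`, the loading factor, and the example numbers are all assumptions made up for the sketch.

```python
def annual_premium(annual_claim_probability: float,
                   payout: float,
                   loading: float = 0.25) -> float:
    """Expected yearly loss plus a margin (hypothetical `loading`) for costs and profit."""
    if not 0.0 <= annual_claim_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    expected_loss = annual_claim_probability * payout
    return expected_loss * (1.0 + loading)


if __name__ == "__main__":
    # Invented numbers: a 0.2% yearly chance of a $200,000 total loss.
    print(f"${annual_premium(0.002, 200_000):,.2f} per year")
```

Note that a past claim does not appear anywhere in the formula; the rate only moves if the probability or the payout estimate changes, which is the commenter's point.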

    Also, the idea of "Just pay for it yourself," is extremely silly. It smacks of someone who's never owned something of any significant value. The reason behind insurance is that you CAN'T just pay for it yourself. For example I have insurance on my house. The reason is that if I lost it, I can't afford to replace it. I don't have a couple hundred grand just lying around in the bank. That's the point of insurance. You are insuring that if something happens that you can't afford, someone will pay for it. The insurance company is then, of course, betting that it isn't likely to happen and they get to keep the money.
