Cloud Networking The Internet

VMware Causes Second Outage While Recovering From First

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."

  • by FunkyRider ( 1128099 ) on Monday May 02, 2011 @07:58PM (#36006070)
    [[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?
    • by drosboro ( 1046516 ) on Monday May 02, 2011 @08:02PM (#36006108)

      I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

      This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.

      My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

      But who knows, I could be wrong... I'm sure hoping I'm not!

      • by nurb432 ( 527695 ) on Monday May 02, 2011 @08:45PM (#36006404) Homepage Journal

        I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center. ( aside from that big red button on the wall over there.. )

        • 'Enter' should do it, in most cases...

          (Assuming, of course, that the (in)correct command has been typed at the command line already.)

          • by X0563511 ( 793323 ) on Monday May 02, 2011 @09:19PM (#36006576) Homepage Journal

            ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

            • by NFN_NLN ( 633283 )

              ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

              So I should stop typing this into random terminals and then leaving?

              > nohup "history -c; passwd -l root; rm -rf /" &

              • It seems remarkably unlikely that there would be an executable named "history -c; passwd -l root; rm -rf /"; in fact, I suspect that trailing / makes it impossible on unix-like systems.

                nohup sh -c "..." &

                on the other hand...
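
                Something like this illustrates the difference (a minimal sketch, assuming GNU nohup and a POSIX shell; the echo commands are just harmless stand-ins):

                # the whole quoted string is looked up as a single command name, which won't exist
                nohup "echo one; echo two" &

                # sh -c hands the string to a shell, so the semicolons (and anything nastier) actually run
                nohup sh -c "echo one; echo two" &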

            • On a serial link, just use the right arrow key. Or possibly ESC (although you'll have to deal with clearing the ESC chord afterward if it happened to be in vi or something)
              • Not a bad idea. I think cleaning up the vi example is a good compromise - you wanted a prompt after all, not necessarily someone's leavings.

            • I tend to use the control key. My brain claims that the shift key doesn't always seem to work, but offers no particular examples.

              Postfactum Explanation Possum also says "it's at the corner of the keyboard, so less adjacent keys to accidentally press".

            • If it's a serial link I hit ^L

              However, Windows and Linux both swallow my first keypress when asleep so it doesn't matter if I hit control, space, enter, or super.

            • Toggle scroll lock. Nothing in the known universe uses that key for anything.

              • Scroll Lock switches the function keys between going to the application and being handled by the window manager on my FVWM 1.24 desktop.

                PC-Kermit(?) also uses scroll lock to make the screen scroll via the cursor keys.

              • by ais523 ( 1172701 )
                Notably, Excel uses it, for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that quickly turning Scroll
            • by smash ( 1351 )
              ctrl+U then enter is reasonably safe on cisco stuff.
          • "Updates are available for your computer; would you like to reboot it to install them?" ~

        • by interiot ( 50685 )
          If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.
          • If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

            Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".

        • The enter key being pressed after doing something silly, like typing up an example command line for a half-written script that will automate some large process, when it was only meant to be copied and pasted into another document.

          The reality of it is that the reason they said 'hands off' was to avoid just such an accident: an engineer executing the test plan, by accident, before it was actually ready to do its job. And it happened.

          It's really one of those moments where the poor guy is just the most perfect example of why

        • by md65536 ( 670240 )

          They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.

          Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.

          They could try explaining wh

          • Seriously? You can read that and come away with that interpretation? Rather than what it obviously means: "they were supposed to be planning out what to do without actually executing any commands, but someone misunderstood and actually did the actions"?

        • Are you really sure that big red button would indeed take it fully down ?

          It could be a fake button. Or the servers could be more resistant than you think. There could be backup power....

          You'll never know...

          Unless ...

        • by sjames ( 1099 )

          I know a case! "Are you sure? [Y/N]"

        • Some routers have extremely unsafe defaults and ignore syntax errors in commands. If a single stray letter turns the command that corrects the default into a syntax error (perhaps added while the configuration file is open in an editor), the error is silently ignored, the unsafe default stays in effect, and the result can be far-reaching outages. Taking down a data center is not even the worst thing that can happen. For example, if an ISP accidentally redistributes the global BGP table into OSPF, they can produce a world-wide outage affecting thousands of routers and almost all c
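
          For the curious, the classic foot-gun looks roughly like this (a hypothetical sketch in Cisco IOS syntax; the AS number, process ID and names are invented):

          ! dumps the entire BGP table (hundreds of thousands of routes in 2011) into OSPF
          router ospf 1
           redistribute bgp 64512 subnets

          ! the less suicidal version filters what is allowed to leak
          ip prefix-list LOCAL-ONLY seq 10 permit 10.0.0.0/8 le 24
          route-map BGP2OSPF permit 10
           match ip address prefix-list LOCAL-ONLY
          router ospf 1
           redistribute bgp 64512 subnets route-map BGP2OSPF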

        • Well, if you have rm -Rf / in the terminal, and the key you hit is Enter ...

        • It's called the "windows key". It has a little windows flag on it. It was placed on keyboards for the purpose of slowing down, crashing, mutilating, and annihilating data centers, desktops, laptops, and phones.

      • Re: (Score:3, Funny)

        Sounds like they could benefit from a virtual environment to test things out in.

      • by haruchai ( 17472 )

        ?? The Playbook touched the keyboard and took out the cloud? Boy, RIM just can't catch a break these days!!

    • by verbatim ( 18390 )

      Finally, MovieOS being used in a production environment. Pretty soon, the cops will be using Visual Basic to hunt down suspects.

    • I remember from almost 20 years ago (DOS / floppy era) overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and he had not delivered. The excuse was "you know, I had it ready and everything, but I hit the "delete" key by accident and I lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief at the idiocy of the exchange I had just heard - and I was just 13 years old.

      Vmware's explanati

  • Game Over (Score:4, Insightful)

    by ae1294 ( 1547521 ) on Monday May 02, 2011 @08:00PM (#36006096) Journal

    The cloud is a lie. Would the next marketing buzzword please come on down!

    • Completely disagree. The solution is clear: eliminate all potential sources of human error.
      • Re: (Score:2, Funny)

        by Anonymous Coward

        Has anyone mentioned Skynet yet?

    • by jd ( 1658 )

      How is this [wolfram.com] a cloud?

    • I conclude that the cloud is really cake, and now want some.

  • by rsborg ( 111459 ) on Monday May 02, 2011 @08:03PM (#36006128) Homepage

    Amazingly the Cloudfoundry blog itself had a much more dramatic telling:

    "... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

    Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

    (emphasis mine).

    I'd hate to be that ops guy.

    • by shuz ( 706678 ) on Monday May 02, 2011 @08:18PM (#36006220) Homepage Journal
      VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X. This also outlines a major issue with "cloud" technologies. They are only as redundant and stable as the individuals managing them. Also, there is always the opportunity for a single point of failure in any system; you just need to go up the support tree high enough. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter, or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.
      • Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.

        How did one engineer touching a keyboard when he shouldn't, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that amazon put out.

        • How did one engineer touching a keyboard when he shouldn't, take everything down?

          He touched the keyboard in its Special Place.

          Not to worry though, they called in Chris Hanson to help with network ops in the future, we'll not be seeing a repeat.

      • A better PR response

        In what sense? I know that I appreciate frank disclosures of problems from our providers rather than obfuscating the issue (if nothing else it might highlight a similar problem in our procedures).

      • by ToasterMonkey ( 467067 ) on Monday May 02, 2011 @10:36PM (#36007002) Homepage

        VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

        "Transparency is bad" +4 Insightful

        What the... ?

        • by rsborg ( 111459 ) on Monday May 02, 2011 @11:49PM (#36007230) Homepage

          VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

          "Transparency is bad" +4 Insightful

          What the... ?

          You know, I'd prefer my vendor/partner (ie, VMWare) doesn't throw their employees under the bus when bad stuff happens. If this happened at Apple or Google the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".

          Transparency is only useful as a way to diagnose and improve. This "explanation" from VMWare hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.

          • by SuperQ ( 431 ) *

            Yup, here's a good example of what you're talking about:

            http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html [blogspot.com]

            So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version.

      • by drooling-dog ( 189103 ) on Monday May 02, 2011 @10:41PM (#36007026)

        To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.

      • by dbIII ( 701233 )

        They are only as redundant and stable as the individuals managing them

        Given that it is 2011, a very large chunk of the IT workforce has been made redundant.

      • The company as a whole is responsible for any of its failures.

        I completely disagree. "The company" does not actually exist - it is actually just a group of individuals. If an individual can mess up the whole infrastructure, then I'd sure like to know that.

        A better PR response

        Yup, a better BS response that leaves them just as opaque as all the other companies out there.

        An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

        That's exactly right - but you wouldn't know that if they had said "we made an unscheduled change". I prefer their transparency.

    • "And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."

    • Keyboards, how do they work?
      This does not bode well for VMware.
      As much as I love their production,
      I did chuckle at this major failure.

    • Stopping engineers from touching keyboards is an important part of maintaining one's cloud infrastructure. From experience.
    • The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault-tolerant platform is also questionable. While VMWare is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. By reading that this was a "paper only" all hands on d

  • by celest ( 100606 ) <mekki@mekki.ca> on Monday May 02, 2011 @08:05PM (#36006142) Homepage

    You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

    In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/ [xkcd.com]

    • by Xtravar ( 725372 )

      I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
      Like, did they touch it and press a key?
      Did they touch it for an extended period, typing "killall cloud"?
      Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

      • I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context. Like, did they touch it and press a key? Did they touch it for an extended period, typing "killall cloud"? Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

        The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.

      • by LoRdTAW ( 99712 )

        It was probably inappropriately touched in a no-no place.

      • by Jeremi ( 14640 ) on Monday May 02, 2011 @10:07PM (#36006848) Homepage

        I would like more elaboration on what "touched the keyboard" means.

        It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

      • Re: (Score:3, Informative)

        Remember how your uncle used to touch you in your naughty place? It was like that.
  • by geekmux ( 1040042 ) on Monday May 02, 2011 @08:08PM (#36006160)

    "...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

    OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

    I've heard of macros for the lazy, but damn...

  • Engineering Errors (Score:5, Interesting)

    by Bruha ( 412869 ) on Monday May 02, 2011 @08:11PM (#36006188) Homepage Journal

    You can not really stop stupid people. However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long distance providers, cellular companies, and VOIP communications providers.

    I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network which is the only key thing that could take it all out, or the idiot had the command queued up.

    More likely some idiot introduced a cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment. Simply fixed by requiring a VTP password (that way you can really nail an idiot that does it), and secondly by biting the admin bullet and running VTP transparent mode.

    There's no one command that's going to bring it all down; it's going to be a series of actions that result from a lack of proper network management and a lack of properly tested redundancy. Redundancy does not exist in the same physical facility; redundancy exists in a separate facility, nowhere associated with anything that runs the backed-up facility. Pull the plug on data center A and your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

    I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper-layer changes that allow client replication to work as well. So if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content where you can accept replies from two different sources, but the app can use the data as it comes in from each, much like bittorrent, but on a real-time level. It requires twice the resources to handle an app, but if redundancy is king, this type of system would prevent some of the large outages we have seen in the past.
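
    A crude version of that already works at the client layer for simple request/response traffic; a minimal sketch, assuming a shell with curl and made-up hostnames:

    # ask the primary, and if it fails or stalls, ask the secondary
    curl -fsS --max-time 2 https://a.api.example.com/status \
      || curl -fsS --max-time 2 https://b.api.example.com/status

    Genuine "take whichever reply arrives, from either source" semantics, the bittorrent-style idea above, would need application-level support that plain TCP doesn't give you.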

    • by dissy ( 172727 )

      Perhaps most of their infrastructure is virtual, and the button he pressed was the hosts power key, shutting down all the guests at once.

    • More likely some idiot introduced a cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment.

      How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen......I mean...really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and......without external inputs (dropping links to the real VTP master) pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

      • by zbaron ( 649094 )
        Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.
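        For what it's worth, the belt-and-braces version of that advice looks something like this (a sketch, assuming Cisco IOS; the domain name and password are invented):

        ! on every switch
        vtp mode transparent
        ! or, where supported, vtp version 3 with a single designated primary server
        vtp domain DC-CORE
        vtp password pick-something-long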
        • Just so you know, even a VTP *client* with a higher revision number and a different table used to be able to / can wipe out a VTP domain by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scroll back buffer.

          See my previous post about "crusty old IOS/CatOS".

          Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.

          Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale that have any duration of uptime that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn

    • However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices.

      The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.

      The answer is to run training exercis

  • I, for one, would like to suggest that the Cloud Foundry is really foundering...
  • And that is why we need skynet.
  • by stumblingblock ( 409645 ) on Monday May 02, 2011 @08:35PM (#36006334)
    They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.
  • When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".
    • by Jeremi ( 14640 )

      When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.

      You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.

      I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.
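
      In the meantime you can fake it yourself whenever the service exposes anything rsync can reach (most don't, which is rather the point); a minimal sketch, assuming a Unix box with rsync and cron, and made-up host/paths:

      # crontab entry: pull the hosted copy down to a local disk every night at 3am
      0 3 * * * rsync -a --delete me@cloud-gateway.example.com:/export/mydata/ /home/me/mydata-mirror/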

      • by jimicus ( 737525 )

        Not necessarily, as has already been demonstrated.

        I forget exactly where I first read it, but it bears repeating: Unless you can put your finger on a damn good reason why your business cannot deal with any downtime, you don't need high availability and probably shouldn't bother with it.

        It invariably introduces a lot more complication, a lot more to go wrong. Few businesses truly need it, usually all they need is a clear plan to recover from system failure which accounts for the length of time such recovery wi

      • I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

        Dropbox.

    • by jd ( 1658 )

      Hard drives are easy to beat. Core memory has an estimated lifespan of 20-30x that of a hard drive, is impervious to EMP, and won't crash if bumped.

  • If I think I can trust a cloud to support my data.

  • I can't see why it is so hard to realize that if you end up tying everything into one major big structure and put everything in it, then regardless of how much redundancy you designed, it will eventually flop grandly.

    If not downtime, it will be security. If not that, it's something else. The idea is, you are creating one HUGE environment which contains everything. It's inevitable that some issue eventually affects all the participants in that environment - those being the clients.

    Let's admit it - huge monolithic clou
  • No problem. SkyNet will remedy that.

  • Is the power grid run by some old pc terminal where hitting Esc can crash the full system?

  • by HockeyPuck ( 141947 ) on Tuesday May 03, 2011 @02:33AM (#36007720)

    OK... so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low-end Clariion has 2x power supplies. And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed. So that if you lose one PDU, the cabinet still has 100% of its power, just with no redundancy left.

    I also find it odd that they'd have an application configuration where, if access was lost to ONE LUN on ONE array, it would cripple the entire application. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array. That way the array could blow up and you'd be fine, and being VMware (a part of EMC), disk is cheap, unlike the brutal prices the rest of us pay.

    Either that or the power failure caused a loss of a single path from host to disk and they forgot to configure PowerPath on the server... or verify that VMware's native multipathing was working correctly...

    Irony. A storage company having a storage problem.

    • And anybody that racks up equipment knows to connect each rack's powerstrip/PDU to a separate feed

      If you don't know whether the other circuit is on another phase or not and you have a power supply fault, that can be a truly shocking suggestion that can destroy the equipment you intended to save, since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very, very stupid idea to randomly plug the power into random sock

      • Um, no.

        Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply. The AC is contained within the PSU and the equipment is powered by DC.

        And besides which, all modern data centers keep their redundant power distribution in phase. For starters, they know that their grounds will be tied together through customer equipment.

  • by chill ( 34294 )

    141 comments and no one mentions the old Sun equipment that had the !@#^ power button on the keyboard! Must be the young crowd posting.

    Been there, done that. Reached over, bumped the keyboard and the SparcStation went "blink!" and off.

    I've been to a couple lab environments where the upper-right key on every keyboard had been physically removed because this was such a stupid design.

    • Just so you know, you can turn that off. /etc/power.conf IIRC. That said, I also tend to rip the key off.

      Wanna know what's ironic, though? The Sun E150 server (mini E450 chassis, Ultra-1 guts) can't be turned *on* without the keyboard.

      True story, one DC where I worked about 12 years ago called Sun support because a machine wouldn't power up after a simulated power failure. Stupid Sun SE wound up replacing the motherboard before he would listen to me and plug in a damn keyboard.

      • by chill ( 34294 )

        Yeah, but it is so much more satisfying to rip off that damned key with a pair of pliers. :-)

        I have no trouble believing the story of the tech. I remember using that trick on MCSEs who thought they knew computers and Sun servers were just like the WinTel ones...

        "How do you turn this damn thing on?" :-)

  • "Clearly, human error is still a major factor in cloud networks."

    That is a huge leap. You cannot take one incident and use it as a broad brush with which to paint all of the players in cloud computing.

    This should read: "Clearly, human error was a major player in these two specific incidents at VMWare."

    Can Slashdot mods PLEASE dispense with the sensationalism?

  • by WD ( 96061 )
    You know... there is a fix for that [failblog.org].
