
VMware Causes Second Outage While Recovering From First

jbrodkin writes "VMware's new Cloud Foundry service was online for just two weeks when it suffered its first outage, caused by a power failure. Things got really interesting the next day, when a VMware employee accidentally caused a second, more serious outage while a VMware team was writing up a plan of action to recover from future power loss incidents. An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry.' Clearly, human error is still a major factor in cloud networks."
Comments:
  • by FunkyRider ( 1128099 ) on Monday May 02, 2011 @06:58PM (#36006070)
    [[An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry]] Really? Pressing a single key and bam! All gone? Is that the best they can do?
    • by drosboro ( 1046516 ) on Monday May 02, 2011 @07:02PM (#36006108)

      I didn't get the sense from reading the linked analysis that it was necessarily a single key-press. It reads like this:

      This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed. Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry.

      My sense is that "touched the keyboard" doesn't literally mean "touched a single key on the keyboard", but actually means "ignored the hands-off-the-keyboard part of the exercise, and executed some commands".

      But who knows, I could be wrong... I'm sure hoping I'm not!

      • by nurb432 ( 527695 ) on Monday May 02, 2011 @07:45PM (#36006404) Homepage Journal

        I am sure that is what happened. I don't know of any single keystroke that would take down an entire data center (aside from that big red button on the wall over there...).

        • 'Enter' should do it, in most cases...

          (Assuming, of course, that the (in)correct command has been typed at the command line already.)

          • by X0563511 ( 793323 ) on Monday May 02, 2011 @08:19PM (#36006576) Homepage Journal

            ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

            • by NFN_NLN ( 633283 )

              ... which is why you should always use the shift key to wake a display, and never enter. Unless it's a serial link, in which case you have to hit enter and pray the guy before you isn't a sadist.

              So I should stop typing this into random terminals and then leaving?

              > nohup "history -c; passwd -l root; rm -rf /" &

              • It seems remarkably unlikely that there would be an executable named "history -c; passwd -l root; rm -rf /"; in fact, I suspect the trailing / makes such a filename impossible on unix-like systems.

                nohup sh -c "..." &

                on the other hand...
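
                An editorial aside: a minimal, harmless sketch of the quoting point above. The echo commands are stand-ins for the destructive ones; only the nohup / sh -c behaviour is the point.

                # Quoted as a single word, the whole string is treated as ONE program
                # name, so nohup looks for an executable literally called
                # "echo first; echo second" and fails to run anything:
                nohup "echo first; echo second" &

                # Wrapping it in an explicit shell makes the semicolon-separated list run:
                nohup sh -c "echo first; echo second" &
                cat nohup.out   # should now contain both lines of output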

            • On a serial link, just use the right arrow key. Or possibly ESC (although you'll have to deal with clearing the ESC chord afterward if it happened to be in vi or something)
              • Not a bad idea. I think cleaning up the vi example is a good compromise - you wanted a prompt after all, not necessarily someone's leavings.

            • I tend to use the control key. My brain claims that the shift key doesn't always seem to work, but offers no particular examples.

              Postfactum Explanation Possum also says "it's at the corner of the keyboard, so less adjacent keys to accidentally press".

            • If it's a serial link I hit ^L

              However, Windows and Linux both swallow my first keypress when asleep so it doesn't matter if I hit control, space, enter, or super.

            • Toggle scroll lock. Nothing in the known universe uses that key for anything.

              • Scroll Lock toggles the function keys between application-handled and window-manager-handled on my FVWM 1.24 desktop.

                PC-Kermit(?) also uses scroll lock to make the screen scroll via the cursor keys.

              • by ais523 ( 1172701 )
                Notably, Excel uses it, for its intended function (making the arrow keys scroll rather than moving the cursor). And Linux, when the kernel's busy handling the screen itself (say during the boot process), uses Scroll Lock to temporarily pause quickly scrolling output to the screen so that you can see what it says. Apparently KVM switches often use a double-tap of Scroll Lock in order to send signals to the switch itself rather than the computers connected to it (on the basis that that quickly turning Scroll
            • by smash ( 1351 )
              Ctrl+U then Enter is reasonably safe on Cisco stuff.
          • "Updates are available for your computer; would you like to reboot it to install them?" ~

        • by interiot ( 50685 )
          If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.
          • If I saw a Madagascar button in a datacenter with a sign on it that said "DO NOT PRESS THIS! It will SHUT DOWN EVERYTHING!", I would probably remove that key from the keyboard.

            Screw that. I'd remove the sign. And replace it with one that says "FREE MOUNTAIN-DEW!".

        • The Enter key gets pressed after doing something silly, like typing up an example command line for a half-written script that will automate some large process, meant simply to be copied and pasted into another document.

          The reality is that the reason they said 'hands off' was to avoid exactly this kind of accident: an engineer executing the test plan, by accident, before it was actually ready to do its job. And it happened.

          It's really one of those moments where the poor guy is just the most perfect example of why

        • by md65536 ( 670240 )

          They didn't even say that a key was pressed. Perhaps someone accidentally brushed a hand against the keyboard. Perhaps the "very bad design" of the data center involves the electrical wiring.

          Seriously, this does indicate bad design, and it does NOT inspire confidence. If cloud services go down and the official explanation that is given is "Someone accidentally touched some equipment, and everything go boom," then I don't want to rely on this cloud service. That's not good enough.

          They could try explaining wh

          • Seriously? You can read that and come away with that interpretation, rather than what it obviously means: "they were supposed to be planning out what to do without actually executing any commands, but someone misunderstood and actually performed the actions"?

        • Are you really sure that big red button would indeed take it fully down?

          It could be a fake button. Or the servers could be more resistant than you think. There could be backup power....

          You'll never know...

          Unless ...

        • by sjames ( 1099 )

          I know a case! "Are you sure? [Y/N]"

        • Some routers have extremely unsafe defaults and ignore syntax errors in commands. If a single-letter typo slips into the command that is supposed to correct such a default (perhaps while the configuration file is open in an editor), the resulting syntax error is silently ignored, the unsafe default stays in effect, and this can trigger far-reaching outages. Taking down a data center is not even the worst thing that can happen. For example, if an ISP accidentally redistributes the global BGP table into OSPF, they can produce a world-wide outage affecting thousands of routers and almost all c
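
          An editorial sketch of the redistribution accident described above, in Cisco-IOS-style syntax; the AS number, process ID, and prefix are made up for illustration.

          router ospf 1
           ! Redistributing BGP into OSPF with no filter pulls the entire global BGP
           ! table (hundreds of thousands of routes) into the IGP -- the accident:
           redistribute bgp 65000 subnets
          !
          ! A safer pattern: only redistribute what a route-map explicitly permits.
          router ospf 1
           redistribute bgp 65000 subnets route-map BGP-TO-OSPF
          !
          route-map BGP-TO-OSPF permit 10
           match ip address prefix-list LOCAL-PREFIXES
          !
          ip prefix-list LOCAL-PREFIXES seq 5 permit 192.0.2.0/24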

        • Well, if you have rm -Rf / in the terminal, and the key you hit is Enter ...

        • It's called the "windows key". It has a little windows flag on it. It was placed on keyboards for the purpose of slowing down, crashing, mutilating, and annihilating data centers, desktops, laptops, and phones.

      • Re: (Score:3, Funny)

        Sounds like they could benefit from a virtual environment to test things out in.

      • by haruchai ( 17472 )

        ?? The Playbook touched the keyboard and took out the cloud? Boy, RIM just can't catch a break these days!!

    • by verbatim ( 18390 )

      Finally, MovieOS being used in a production environment. Pretty soon, the cops will be using Visual Basic to hunt down suspects.

    • I remember, from almost 20 years ago (the DOS/floppy era), overhearing a couple of kids in my school yard. Apparently one of them had promised the other a floppy with a game and had not delivered. The excuse was "you know, I had it ready and everything, but I hit the 'delete' key by accident and lost it - sorry". The other party agreed it was an unfortunate accident and did not make a fuss. I was in disbelief at the idiocy of the exchange I had just heard - and I was only 13 years old.

      Vmware's explanati

  • Game Over (Score:4, Insightful)

    by ae1294 ( 1547521 ) on Monday May 02, 2011 @07:00PM (#36006096) Journal

    The cloud is a lie. Would the next marketing buzzword please come on down!

    • Completely disagree. The solution is clear: eliminate all potential sources of human error.
      • Re: (Score:2, Funny)

        by Anonymous Coward

        Has anyone mentioned Skynet yet?

    • by jd ( 1658 )

      How is this [wolfram.com] a cloud?

    • I conclude that the cloud is really cake, and now want some.

  • by rsborg ( 111459 ) on Monday May 02, 2011 @07:03PM (#36006128) Homepage

    Amazingly, the Cloud Foundry blog itself had a much more dramatic telling:

    "... At 8am this effort was kicked off with explicit instructions to develop the playbook with a formal review by our operations and engineering team scheduled for noon. This was to be a paper only, hands off the keyboards exercise until the playbook was reviewed.

    Unfortunately, at 10:15am PDT, one of the operations engineers developing the playbook touched the keyboard. This resulted in a full outage of the network infrastructure sitting in front of Cloud Foundry. This took out all load balancers, routers, and firewalls; caused a partial outage of portions of our internal DNS infrastructure; and resulted in a complete external loss of connectivity to Cloud Foundry."

    (emphasis mine).

    I'd hate to be that ops guy.

    • by shuz ( 706678 ) on Monday May 02, 2011 @07:18PM (#36006220) Homepage Journal
      VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

      This also outlines a major issue with "cloud" technologies: they are only as redundant and stable as the individuals managing them. There is always the opportunity for a single point of failure in any system; you just need to go far enough up the support tree. For most companies this is the data center itself, as offsite DR can get expensive quickly. For VMware it can be the Virtual Center, a misconfigured vRouter, or even a vSwitch. Finally, putting all your eggs into one basket can increase efficiency and save money. It can also raise your risk profile. An engineer may have caused this outage, but I would find it hard to believe that replacing the engineer would make the "risk" go away.
      • Agreed. They seem to treat it as some magical instance where touching the keyboard breaks things, as though this was written by someone's grandmother.

        How did one engineer touching a keyboard when he shouldn't, take everything down? I don't think I could do this at work unless I was really trying hard. This is a really shitty response, especially compared to the writeup that Amazon put out.

        • How did one engineer touching a keyboard when he shouldn't, take everything down?

          He touched the keyboard in its Special Place.

          Not to worry though: they've called in Chris Hansen to help with network ops, so we'll not be seeing a repeat.

      • A better PR response

        In what sense? I know that I appreciate frank disclosures of problems from our providers rather than obfuscating the issue (if nothing else it might highlight a similar problem in our procedures).

      • by ToasterMonkey ( 467067 ) on Monday May 02, 2011 @09:36PM (#36007002) Homepage

        VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

        "Transparency is bad" +4 Insightful

        What the... ?

        • by rsborg ( 111459 ) on Monday May 02, 2011 @10:49PM (#36007230) Homepage

          VMware's explanation of events is troubling to me. The company as a whole is responsible for any of its failures. Internally the company could blame an individual but to shareholders and other vested entities an individual employee's failure is not something they care about. A better PR response would be to say that "we" made an unscheduled change or simply an unscheduled change was made to our infrastructure that caused X.

          "Transparency is bad" +4 Insightful

          What the... ?

          You know, I'd prefer that my vendor/partner (i.e., VMware) not throw their employees under the bus when bad stuff happens. If this happened at Apple or Google, the group (leadership taking responsibility) would announce they messed up... not "one of the peons pushed a magic button".

          Transparency is only useful as a way to diagnose and improve. This "explanation" from VMware hides all explanation (...touched the keyboard. This resulted in a full outage of the network infrastructure...) while torching a single employee.

          • by SuperQ ( 431 ) *

            Yup, here's a good example of what you're talking about:

            http://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html [blogspot.com]

            So what caused this problem? We released a storage software update that introduced the unexpected bug, which caused 0.02% of Gmail users to temporarily lose access to their email. When we discovered the problem, we immediately stopped the deployment of the new software and reverted to the old version.

      • by drooling-dog ( 189103 ) on Monday May 02, 2011 @09:41PM (#36007026)

        To me it sounds like someone (non-technical) high up in the chain wanted to focus blame on an inadvertent act by one of the engineers. Inadvertent, of course, so no one needs to get fired and file a lawsuit, and an engineer so that no one in upper management appears culpable. The downside is that they dramatically underscore the fragility of their cloud, thereby undermining its acceptance in the market. Not a good tradeoff, if that's the case.

      • by dbIII ( 701233 )

        They are only as redundant and stable as the individuals managing them

        Given that it is 2011, a very large chunk of the IT workforce has been made redundant.

      • The company as a whole is responsible for any of its failures.

        I completely disagree. "The company" does not actually exist - it is actually just a group of individuals. If an individual can mess up the whole infrastructure, then I'd sure like to know that.

        A better PR response

        Yup, a better BS response that leaves them just as opaque as all the other companies out there.

        An engineer may have caused this outage but I would find it hard to believe that replacing the engineer would make the "risk" go away.

        That's exactly right - but you wouldn't know that if they had said "we made an unscheduled change". I prefer their transparency.

    • "And that is the story we tell the new hires. If they ask why the employee health plan covers cyanide..."

    • Keyboards, how do they work?
      This does not bode well for VMware.
      As much as I love their production,
      I did chuckle at this major failure.

    • Stopping engineers from touching keyboards is an important part of maintaining one's cloud infrastructure. From experience.
    • The infrastructure design is not resilient, and it seems late in the game to "develop a playbook" after you've gone live. Their credibility in building a fault-tolerant platform is also questionable. While VMware is at the core of a lot of data centers, there are other players that bring things to the table to build out the other pieces that make high availability and reliability a reality; I don't think they understand how all of this fits together. By reading that this was a "paper only" all hands on d

  • by celest ( 100606 ) <mekkiNO@SPAMmekki.ca> on Monday May 02, 2011 @07:05PM (#36006142) Homepage

    You would think someone as big as VMware would have figured out, by now, that if "An inadvertent press of a key on a keyboard" can lead to "a full outage of the network infrastructure [including] all load balancers, routers, and firewalls [resulting] in a complete external loss of connectivity to [their Cloud service]" that they are DOING IT WRONG!

    In other news, VMware announces they're releasing a new voting machine: http://xkcd.com/463/ [xkcd.com]

    • by Xtravar ( 725372 )

      I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context.
      Like, did they touch it and press a key?
      Did they touch it for an extended period, typing "killall cloud"?
      Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

      • I would like more elaboration on what "touched the keyboard" means. It has more than one dictionary meaning, and it's very vague in this context. Like, did they touch it and press a key? Did they touch it for an extended period, typing "killall cloud"? Was it an accidental touch, or was the person an idiot who's not supposed to touch important things?

        The keyboard they touched wasn't a keyboard in the conventional sense. It was a small 3"x3" yellow/black striped board with one large circular red key on it. Somebody touched that key even though the sign said "DON'T PUSH THIS." A harmless prank.

      • by LoRdTAW ( 99712 )

        It was probably inappropriately touched in a no-no place.

      • by Jeremi ( 14640 ) on Monday May 02, 2011 @09:07PM (#36006848) Homepage

        I would like more elaboration on what "touched the keyboard" means.

        It was an extreme case of static discharge. The engineer is lucky to be alive -- when doing cloud computing, thunderstorms are a huge hazard.

      • Re: (Score:3, Informative)

        Remember how your uncle used to touch you in your naughty place? It was like that.
  • by geekmux ( 1040042 ) on Monday May 02, 2011 @07:08PM (#36006160)

    "...An inadvertent press of a key on a keyboard led to 'a full outage of the network infrastructure [that] took out all load balancers, routers, and firewalls... and resulted in a complete external loss of connectivity to Cloud Foundry."

    OK, seriously, who the hell has that much shit tied to a single key on a keyboard?

    I've heard of macros for the lazy, but damn...

  • Engineering Errors (Score:5, Interesting)

    by Bruha ( 412869 ) on Monday May 02, 2011 @07:11PM (#36006188) Homepage Journal

    You cannot really stop stupid people. However, many companies cripple their networks through so-called "security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors ("serial access") to get back into these devices. Now you're losing customers on top of losing money, especially when it comes to compute clouds where you're literally billing by the hour. Even more so for long-distance providers, cellular companies, and VOIP communications providers.

    I am curious how the press of one key managed to wipe out the cloud, the load balancers, and the routers at the same time. Either they're using some program to manage their switching network, which is the only thing that could take it all out, or the idiot had the command queued up.

    More likely, some idiot introduced a Cisco switch with a higher VTP revision number into their VTP domain and it overwrote their entire VLAN database. This is simply fixed by requiring a VTP password, so you can really nail the idiot who does it, and, secondly, by biting the admin bullet and running VTP transparent mode (a minimal config sketch follows below).

    There's no one command that's going to bring it all down; it's going to be a series of actions that result from a lack of proper network management and a lack of properly tested redundancy. Redundancy does not exist in the same physical facility; redundancy exists in a separate facility that shares nothing with the facility it backs up. Pull the plug on data center A and your customers should not notice a thing is amiss. If you can do that, then you have proper redundancy.

    I believe the other problem is that we're working on a 30+ year old protocol stack, and it's starting to show its limitations. TCP/IP is great, but there need to be some better upper-layer changes that allow client replication to work as well, so that if the app loses its connection to server A, it seamlessly uses server B without so much as a hiccup. Something like keyed content, where you can accept replies from two different sources and the app can use the data as it comes in from each, much like BitTorrent but in real time. It requires twice the resources to handle an app, but if redundancy is king, this type of system would prevent some of the large outages we have seen in the past.
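
    An editorial sketch of the VTP hardening suggested in the comment above, in Cisco-IOS-style syntax; the domain name and password are made up for illustration.

    ! Option 1: transparent mode -- the switch forwards VTP advertisements but never
    ! applies them to its own VLAN database, so a rogue high-revision switch cannot
    ! overwrite anything.
    vtp mode transparent
    !
    ! Option 2: if VTP stays enabled, at least set a domain password so a switch with
    ! a mismatched password is ignored and the culprit is easier to identify.
    vtp domain EXAMPLE-DOMAIN
    vtp password s3cr3t-example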

    • by dissy ( 172727 )

      Perhaps most of their infrastructure is virtual, and the button he pressed was the host's power key, shutting down all the guests at once.

    • More likely some idiot introduced a cisco switch into their VTP domain and it had a higher revision number queued up and it overwrote their entire LAN environment.

      How does that even happen in a properly managed environment? In fact, even in an improperly managed one? I'd have to try hard to make that happen......I mean...really. Bring up an identically configured VTP master, change it enough times to get a higher rev number, put it on the same LAN and......without external inputs (dropping links to the real VTP master) pretty much nothing ought to happen (other than syslog screaming) unless you're using some really crusty old IOS/CatOS.

      • by zbaron ( 649094 )
        Just so you know, even a VTP *client* with a higher revision number and a different table could (and on older gear still can) wipe out a VTP domain just by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things, though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scrollback buffer.
          • Just so you know, even a VTP *client* with a higher revision number and a different table could (and on older gear still can) wipe out a VTP domain just by being introduced. Being a VTP server just allows you to add and remove VLANs from the database. VTPv3 is supposed to fix these kinds of things, though. The last time this happened to me, thankfully, I still had the output from a "show vlan" in my scrollback buffer.

          See my previous post about "crusty old IOS/CatOS".

          Also, who the hell runs the same VTP name and auth key in production and the lab? That is BEGGING for problems.

          Maybe I've just been doing this the right way for too long. I find it difficult to believe that there are networks of any scale, with any duration of uptime, that aren't following very, very simple procedures to ensure uptime and/or are operating with such a complete lack of knowledge of the basic plumbing that makes them work. Also, who doesn

    • However many companies cripple their networks through so called "Security" measures. What do you do when you lock down everything to be accessed through a few servers and you experience a major network outage? Your time to resolution is crippled by having to use ancient back doors "Serial Access" to get back into these devices.

      The problem with such "security" is that the easier you make it for your admins to connect ... the easier you make it for the bad guys to connect.

      The answer is to run training exercis

  • I, for one, would like to suggest that the Cloud Foundry is really foundering...
  • And that is why we need skynet.
  • by stumblingblock ( 409645 ) on Monday May 02, 2011 @07:35PM (#36006334)
    They just have to remove that key from the keyboard. You know, the one that massively crashes the entire system. Poor judgement to have that key there.
  • When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that. The Cloud is great for sharing photos or game saves, but I don't see a future where we all do our computing "in the cloud".
    • by Jeremi ( 14640 )

      When it comes to valuable data, nothing beats a local hard drive, and nothing will ever beat that.

      You know what beats a local hard drive? Two local hard drives, so that if one of them dies, you can still retrieve your data on the other one. And you know what beats two local hard drives? N hard drives in different locations, so that even after Evil Otto nukes your office and your branch office, you can still retrieve a backup copy of your data from another zip code.

      I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.
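
      An editorial sketch of the "copies in multiple locations" idea above, using rsync; the paths and host name are made up for illustration.

      # Keep a second copy on a local external drive...
      rsync -a /home/user/data/ /mnt/external/data-mirror/

      # ...and a third copy on a machine in another building (or another zip code).
      rsync -a /home/user/data/ backup@offsite.example.com:/backups/data/

      # Run both from cron so the copies stay current without manual effort, e.g.:
      # 0 3 * * * rsync -a /home/user/data/ backup@offsite.example.com:/backups/data/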

      • by jimicus ( 737525 )

        Not necessarily, as has already been demonstrated.

        I forget exactly where I first read it, but it bears repeating: Unless you can put your finger on a damn good reason why your business cannot deal with any downtime, you don't need high availability and probably shouldn't bother with it.

        It invariably introduces a lot more complication, a lot more to go wrong. Few businesses truly need it; usually all they need is a clear plan to recover from system failure which accounts for the length of time such recovery will take.

      • I wonder if/when any cloud services will offer the option of letting you automatically keep a copy of your cloud data on your home computer's local drive? That seems like it would be a good feature to have.

        Dropbox.

    • by jd ( 1658 )

      Hard drives are easy to beat. Core memory has an estimated lifespan of 20-30x that of a hard drive, is impervious to EMP, and won't crash if bumped.

  • If I think I can trust a cloud to support my data.

  • I can't see why it is so hard to realize that if you tie everything into one big structure and put everything in it, then regardless of how much redundancy you designed in, it will eventually flop grandly.

    If not downtime, it will be security. If not that, it's something else. The idea is, you are creating one HUGE environment which contains everything. It's inevitable that some issue eventually affects all the participants in that environment - those being the clients.

    Let's admit it - huge monolithic clou
  • No problem. SkyNet will remedy that.

  • Is the power grid run by some old PC terminal where hitting Esc can crash the whole system?

  • by HockeyPuck ( 141947 ) on Tuesday May 03, 2011 @01:33AM (#36007720)

    OK... so VMware is owned by EMC, a dominant storage player. They lost a power supply in a cabinet. So? EMC arrays have had multiple power feeds for years (decades). Even the low-end CLARiiON has two power supplies. And anybody who racks up equipment knows to connect each rack's power strip/PDU to a separate feed, so that if you lose one PDU, the cabinet still has 100% of its power, just with no redundancy.

    I also find it odd that they'd have an application configuration where, if access is lost to ONE LUN on ONE array, the entire application is crippled. Umm... this is bad application design if you ask me. All it would take would be for the host to mirror the LUN to another disk array (a minimal host-mirroring sketch follows below). That way the array could blow up and you'd be fine, and being VMware (a part of EMC), disk is cheap, unlike the brutal prices the rest of us pay.

    Either that, or the power failure caused the loss of a single path from host to disk and they forgot to configure PowerPath on the server... or to verify that VMware's native multipathing was working correctly...

    Irony. A storage company having a storage problem.
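
    An editorial sketch of the host-based mirroring idea above, using Linux mdadm; the device names are made up for illustration, and a real setup would use the multipath devices presented by each array.

    # /dev/sdb is a LUN from array A, /dev/sdc is a same-sized LUN from array B.
    # RAID-1 across them means either array can fail without losing the data.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 /dev/md0
    mount /dev/md0 /srv/app-data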

    • And anybody who racks up equipment knows to connect each rack's power strip/PDU to a separate feed

      If you don't know whether the other circuit is on another phase, and you have a power supply fault, that can be a truly shocking suggestion that can destroy the equipment you intended to save, since you may be dealing with 480V now instead of 240V. If you DO know they are on the same phase it is a good idea - but in some circumstances it can be a very, very stupid idea to randomly plug the power into random sockets

      • Um, no.

        Modern servers (not 1950s radio gear) do not feed AC on the equipment side of the power supply. The AC is contained within the PSU and the equipment is powered by DC.

        And besides which, all modern data centers keep their redundant power distribution in phase. For starters, they know that their grounds will be tied together through customer equipment.

  • by chill ( 34294 )

    141 comments and no one mentions the old Sun equipment that had the !@#^ power button on the keyboard! Must be the young crowd posting.

    Been there, done that. Reached over, bumped the keyboard, and the SPARCstation went "blink!" and off.

    I've been to a couple lab environments where the upper-right key on every keyboard had been physically removed because this was such a stupid design.

    • Just so you know, you can turn that off. /etc/power.conf IIRC. That said, I also tend to rip the key off.

      Wanna know what's ironic, though? The Sun E150 server (mini E450 chassis, Ultra-1 guts) can't be turned *on* without the keyboard.

      True story: one DC where I worked about 12 years ago called Sun support because a machine wouldn't power up after a simulated power failure. The stupid Sun SE wound up replacing the motherboard before he would listen to me and plug in a damn keyboard.

      • by chill ( 34294 )

        Yeah, but it is so much more satisfying to rip off that damned key with a pair of pliers. :-)

        I have no trouble believing the story of the tech. I remember using that trick on MCSEs who thought they knew computers and Sun servers were just like the WinTel ones...

        "How do you turn this damn thing on?" :-)

  • "Clearly, human error is still a major factor in cloud networks."

    That is a huge leap. You cannot take one incident and use it as a broad brush with which to paint all of the players in cloud computing.

    This should read: "Clearly, human error was a major player in these two specific incidents at VMWare."

    Can Slashdot mods PLEASE dispense with the sensationalism?

  • by WD ( 96061 )
    You know... there is a fix for that [failblog.org].
