The Internet IT

An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com) 169

Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week. The company also outlined what caused the issue -- the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds: The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

  • Cheap help is so easy to overwork.
  • by xxxJonBoyxxx ( 565205 ) on Thursday March 02, 2017 @02:55PM (#53964199)
    >> wrong command

    Sure, blame the intern actually typing the command.

    More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
    • Re:Fucking interns (Score:5, Insightful)

      by DickBreath ( 207180 ) on Thursday March 02, 2017 @03:33PM (#53964613) Homepage
      Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      Here are two other ideas.

      1. Confirmation. Are you sure you want to delete 3,207 servers?
      (oh, drat, that's not what I meant!)

      2. Require more typing. If you really want to delete 3,207 servers, then type "DELETE SERVERS" in all caps and press enter. (or something like that. Similar to how Ripley had to go through a lot of motions to activate the self destruct.)
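
The two ideas above (state the blast radius, then make the operator retype a phrase) can be sketched roughly as follows. This is only an illustration: the function name `confirm_destructive` and its parameters are made up for this example and are not taken from any real AWS tool.

```python
def confirm_destructive(count, phrase="DELETE SERVERS", read=input):
    """Return True only if the operator retypes the exact phrase.

    `read` is injectable so the prompt can be exercised without a terminal;
    by default it is the built-in input().
    """
    # Show the number of affected servers so a wrong scope is visible
    # before anything happens.
    prompt = f"About to delete {count:,} servers. Type {phrase!r} to proceed: "
    answer = read(prompt)
    # A plain "y" or "yes" is not enough; the full phrase must match.
    return answer.strip() == phrase
```

The caller would run the removal only when this returns True. Anything else, including a reflexive "yes", aborts.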
      • Re:Fucking interns (Score:5, Insightful)

        by dgatwood ( 11270 ) on Thursday March 02, 2017 @04:50PM (#53965295) Homepage Journal

        Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

        The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.
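
A staged rollout with automatic rollback, as described above, might look something like this minimal sketch. The callbacks `apply`, `healthy`, and `rollback` are hypothetical stand-ins for whatever a real configuration management system provides.

```python
def staged_rollout(clusters, apply, healthy, rollback):
    """Apply a change one cluster at a time; undo everything on the first failure.

    Returns True if every cluster took the change and stayed healthy,
    False if the change was rolled back.
    """
    done = []
    for cluster in clusters:
        apply(cluster)
        done.append(cluster)
        if not healthy(cluster):
            # Undo in reverse order, including the cluster that just failed.
            for changed in reversed(done):
                rollback(changed)
            return False
    return True
```

Putting a single low-risk cluster first in `clusters` gives the "single cluster first" behaviour: a bad change stops there instead of reaching the rest of the fleet.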

        • ^^^ and it should get rolled back if it breaks things

          This! A thousand times: this!

        • by lgw ( 121541 )

          All of that was done, more-or-less, is the problem.

          Some poor schmuck took the command line from the (presumably reviewed) change to something billing-related (TFA is short on details there), and typed in that approved command to do the approved thing in a controlled way. But the command line had a typo, and, total WTF, the command line with a typo was able to wreak wholesale destruction.

          Whatever configuration management system acted on that command was garbage. Any sane system would have said "hell no!", o

        • by Bongo ( 13261 )

          The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.

          Yes, and I gather that airline safety is similarly about the systems (culture/procedures/tech) that allowed the individual to make the mistake. Even the way a checklist is written is critical (what to leave out is as important as what to include).

          And I gather like with a new drug, you don't give all your subjects the same injection all together. Wait and see if the first survives, then go on.

          And it is always kinda fascinating how, the system one builds to cope with one set of scenarios, in turn creates a new

    • by squiggleslash ( 241428 ) on Thursday March 02, 2017 @04:06PM (#53964923) Homepage Journal
      "alexa take down s3 servers a b and c"
      "OK, taking down s3"
    • by Tablizer ( 95088 )

      Sure, blame the intern actually typing the command.

      A (now retired) colleague of mine, I'll call Bob, once was a mainframe operator. An incompetent programmer used to blame his accounting application adding errors on Bob for "entering the command wrong".

      It was something simple like "RUN ACCT7", but Bob was accused of doing it wrong without specifics, and formally written up for that. HR didn't know anything about computers, so it was easy to bullshit them per creating reprimands.

      We'd always joke when somethi

    • I can see how this happened:

      - "Okay, I've typed the shutdown command, now what was the name of the PBX server?"

      - "Asterisk"
    • https://aws.amazon.com/message/41926/ [amazon.com]

      We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level

      Yeah, they have apparently made this screw up much harder to repeat.
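
The kind of safeguard quoted above (remove slowly, and refuse anything that would drop a subsystem below its minimum capacity) could be sketched like this. It is a guess at the general shape, not Amazon's actual tool; `plan_removal`, `min_required`, and `max_batch` are names invented for the example.

```python
def plan_removal(active, requested, min_required, max_batch):
    """Return the servers to remove in this batch, or raise if the request is unsafe.

    active       -- names of servers currently in service
    requested    -- names the operator asked to remove
    min_required -- minimum servers the subsystem needs to keep running
    max_batch    -- most servers that may be removed in one step
    """
    active = set(active)
    # Ignore names that are not in service anyway.
    targets = [s for s in requested if s in active]
    remaining = len(active) - len(targets)
    if remaining < min_required:
        # Refuse the whole request instead of doing part of it.
        raise ValueError(
            f"refusing: {remaining} servers would remain, minimum is {min_required}"
        )
    # Rate limit: hand back only the first batch; the caller asks again for more.
    return targets[:max_batch]
```

With a check like this, a mistyped scope turns into an error message instead of an outage.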

  • You pushed the "trigger disruptions to S3 storage service" button!
  • by EmagGeek ( 574360 ) <(gterich) (at) (aol.com)> on Thursday March 02, 2017 @02:55PM (#53964205) Journal

    The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"

  • by Joe_Dragon ( 2206452 ) on Thursday March 02, 2017 @02:55PM (#53964211)

    playbook?? This is my data not a football!

    • This is my data not a football!

      If your data is so important, keep it on your own server. Preferably on a separate network not directly connected to the Internet.

      • by Altrag ( 195300 )

        Yeah that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.

        And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty muc

        • If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.

          A dedicated network between my workstations and the file server is adequate for my needs.

          But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services.

          Very little of my data need to live 24/7 on the Internet. Since I'm converting my dynamic websites to static generated websites, the data behind my websites stays off the Internet as well.

          And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?

          My file server is a custom built PC that runs FreeNAS (BSD) in a Z2 (RAID-6) hard drive configuration. Current uptime is three months since an extended power outage drained the UPS battery and prompted the server to safely power down. It

          • by Altrag ( 195300 )

            is adequate for my needs

            That's kind of the tricky part there. Not everyone's needs will match yours.

            the data behind my websites stays off the Internet as well

            So basically a backup that you can use to regenerate your internet-facing site. Always good to have more backups no matter what their form but again, other people might have different needs. In particular some dynamic sites actually need to be dynamic (frequently for business reasons more than technical ones, but that's beside the point. Needs are needs no matter where they're generated.)

            And of course, I'm assuming you're transf

            • And motherboard, CPU, PSU, etc failures?

              Never happened. Probably because I replace everything when the hard drives start to have problems after running 24/7 for five years. I had to replace the nine-year-old motherboard in my gaming PC so it would have better specs than the file server.

              [...] worry about nosy Amazon employees [...]

              Uh, no. Script kiddies from China and Russia banging down my virtual doors. I got tired of playing whack-a-mole with trying to keep everything up to date for Joomla and WordPress. When I replaced a dynamic website with a static website, hacking attempts dropped f

    • Relax, they were just deflating it.
  • Transcript (Score:5, Funny)

    by 93 Escort Wagon ( 326346 ) on Thursday March 02, 2017 @03:12PM (#53964365)

    Enter command: DELETE ALL SERVERS
    Confirm that you wish to delete all servers: YES
    Are you sure? YES
    You really wish to delete all servers? YES
    I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
    Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO

    • by Altrag ( 195300 )

      I would guess it was something more along the lines of "TAKEDOWN server_subset *" when they meant "TAKEDOWN server_subset/*". The same kind of thing that you can accidentally do with say rm. Obviously I don't work in the bowels of AWS and don't know the real command syntax, but chances are it wasn't as obvious as pushing a big red button ie: not a "predefined" scenario but a simple typo that happened to pass syntax parsing and got run anyway even though it wasn't what the user wanted.

      We've all done someth

    • Re: (Score:3, Funny)

      by sky_khan72 ( 994064 )
      I once worked at a software firm. They said they got rid of one feature after getting support calls about an irreversible destructive operation, one where you had to enter something like "YES I UNDERSTAND. DELETE ANYWAY" to proceed.
    • Enter command: DELETE ALL SERVERS

      Funny enough, I started this one job and was given an account on the VAX system that handled all main applications plus internal email. When I went into the email list to read and write my personal email, it looked something like this:

      Read Email
      Send Email
      Read Sent Email
      Read Deleted Email
      Delete All Email

      Well, after a week or two, I had a bunch of email and no need to keep it, so I hit Delete All Email. What nobody told me was since I was an admin, that command meant ALL EMAIL, for EVERYBODY on the system

  • by TheSync ( 5291 ) on Thursday March 02, 2017 @03:27PM (#53964543) Journal

    The big problem is not the US-EAST-1 S3 outage.

    The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.

    Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

    So Echo/Alexa was down because it depends on Lambda, new subscriptions to AMI software, Simple Email Service, etc.

    • by phantomfive ( 622387 ) on Thursday March 02, 2017 @03:50PM (#53964775) Journal
      Yeah, actually I was surprised how much Amazon stuff actually went down. I had a coworker who had previously worked at Amazon, and he profusely assured me that AWS was not stable. Clearly he was right. (I don't know why this is big news all of a sudden, AWS has a big outage every year on average).
    • My boy couldn't turn his desk light on because an intern in Seattle made a typo? The 'S' in IoT also stands for 'sanity'.

      • Is there an H in there somewhere too? And let's just drop the lower-case 'o' entirely.

        (I hope someone understands this really convoluted joke)

    • by guruevi ( 827432 )

      HA is hard. "Cloud" makes it even harder because in most instances you don't have much control over the lower levels anymore.

      • by swb ( 14022 )

        Wasn't it Netflix that released their internal scripts for testing reliability? Randomly blowing away cloud instances and other core components so they could more realistically test the ability of the HA to deliver?

        I generally agree that HA is hard, and it's made harder still by PHBs who ask for HA and then cherry pick the cheapest element (out of several necessary), blab to management that you are fault tolerant and then never allow actually testing it.

        I also blame vendors for waaayyyy overpromising what

  • An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.
    • An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.

      Why would it have taken a week to recover? You can un-delete objects in AD. Below is an example... Should have taken, at most, a few hours to recover, not a week.

      https://technet.microsoft.com/... [microsoft.com]

      • That is a VERY good question. I learned long ago that those questions seldom get answered here. They recreated each ID manually. Which generated a new SID meaning when they logged back in they couldn't access their old profile data. Yeah, like I said...fiasco.
  • by QuietLagoon ( 813062 ) on Thursday March 02, 2017 @03:31PM (#53964581)
    ... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.
    • The double-check script way works until it doesn't.

      The command wasn't mis-typed, the scope was wrong.

      • The command wasn't mis-typed, the scope was wrong.

        And how did the wrong scope appear? By magic? Or did someone enter it?

        .
        Having the wrong scope is precisely one of the types of errors a script and a second set of eyes will help to prevent.

    • Procedures are just the archaeology of mistakes..

      • Procedures are just the archaeology of mistakes..

        Fortunately, that procedure was put into place because of a small mistake, and the procedure prevented much larger mistakes down the road.

        .
        Experience is something you don't get until just after you need it.

        Experience is the worst teacher; it gives the test before the lesson. ---Vernon Law

    • by lgw ( 121541 )

      Doesn't really matter though: command, or script, or checked-in config file. There just shouldn't be a way to destroy the world with any one action, regardless of how that action is expressed.

  • by Dorianny ( 1847922 ) on Thursday March 02, 2017 @03:31PM (#53964587) Journal

    Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.

    This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
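
A command that defaults to a dry run and only acts with "--force" could be wired up roughly like this with Python's standard argparse module. The program name `remove-servers` and the `remove` callback are hypothetical, chosen only for the sketch.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="remove-servers")
    parser.add_argument("servers", nargs="+", help="names of servers to remove")
    # Without --force the tool only reports what it would do.
    parser.add_argument("--force", action="store_true",
                        help="actually perform the removal")
    return parser


def main(argv, remove=print):
    """Parse argv and return one report line per server."""
    args = build_parser().parse_args(argv)
    if not args.force:
        return [f"DRY RUN: would remove {name}" for name in args.servers]
    for name in args.servers:
        remove(name)  # the hypothetical destructive call
    return [f"removed {name}" for name in args.servers]
```

Called as `main(["web1", "web2"])` it only lists what would be removed. The destructive path needs an explicit `--force`, so a slip of the fingers produces a report and nothing else.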

  • by known_coward_69 ( 4151743 ) on Thursday March 02, 2017 @03:38PM (#53964655)

    i've seen someone once change the usable memory of SQL server down to 1/10 the physical RAM by accident cause he thought he was so awesome and only used sql for changing configuration options

    why i like the almost dumb proof GUI where you can double and triple check visually before you do something that can take a dozen applications offline

  • "I must have put a decimal point in the wrong place or something. I always do that! I always mess up some mundane detail!"
  • at the very least they should have made the VMs unavailable instead of removing them, so that if something happened all they had to do was bring them back up again
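
The "mark unavailable instead of removing" idea amounts to a soft delete. A minimal sketch, assuming a simple in-memory registry (the `Fleet` class and its method names are invented for illustration):

```python
class Fleet:
    """Tracks servers as 'active' or 'offline'; nothing is ever deleted here."""

    def __init__(self, names):
        self.state = {name: "active" for name in names}

    def take_offline(self, names):
        """Mark servers offline and return the ones that actually changed."""
        changed = [n for n in names if self.state.get(n) == "active"]
        for n in changed:
            self.state[n] = "offline"
        return changed

    def restore(self, names):
        """Undo take_offline for the given servers."""
        for n in names:
            if self.state.get(n) == "offline":
                self.state[n] = "active"

    def serving(self):
        return sorted(n for n, s in self.state.items() if s == "active")
```

Because `take_offline` returns exactly what it changed, undoing a mistake is a single `restore(changed)` call, not a rebuild.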

  • by 140Mandak262Jamuna ( 970587 ) on Thursday March 02, 2017 @03:55PM (#53964825) Journal
    Looks like Amazon has very strict sign off requirement. After entering the command, the system asked for his name to be logged for the audit trail.

    His name was Robert `); DROP TABLE S3-subsystem; --

  • by StormReaver ( 59959 ) on Thursday March 02, 2017 @03:55PM (#53964833)

    Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.

    But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.

    Anyone relinquishing control over their infrastructure to unaccountable third parties needs to be fired ASAP, and be replaced with someone who isn't a complete and utter moron. The mine is littered with dead canaries, and too many responses are along the lines of, "that won't happen to our canaries. Let's forge ahead."

    • by guruevi ( 827432 )

      You found out too late that Microsoft doesn't have backups of its service. I actually had the same issue, employer decided that $25/mailbox/month was 'normal' and then mailboxes corrupted on the O365 servers (because, it's still Exchange after all, the worst e-mail system in the world and it has inherited the same Exchange problems: corrupt data stores). Now we're scrambling to find a 'backup' solution. TCO calculations that were already dodgy suddenly went up 50%.

    • Greetings Sir,

      I just wanted to point out how wrong you are.
      We have finished migrating to the cloud and now we have redundancy, speed and guaranteed uptime
      This also allowed us to get rid of our surly ops guys, so it's a win win really.

      We are now in the process of outsourcing our programming department to India, another win, win for us in management.
      We are excited about the future, what could possibly go wrong?

      Regards,

      Bob Stiff - PHB
  • So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.

    • by Tablizer ( 95088 )

      So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.

      I'm hoping T's fingers are too short to reach the Red Button. I'm hoping O had a collar soldered to it that only fit his long fingers.

    • Now Amazon can sell AWS prime. If you are a subscriber of AWS prime we will check "Twice" before removing your servers. That should boost the profit somewhat

  • by s1d3track3D ( 1504503 ) on Thursday March 02, 2017 @04:09PM (#53964953)
    I'm glad they located the issue and put safeguards in place to make sure it doesn't happen again.

    Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended

    oh, nevermind.

  • Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button. Then you could put a red dot or a green dot in the run book.
  • Of course, you have all your business critical systems spread amongst several AWS data centers, right? And anything business critical is replicated several times? Not.

  • Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week.

    What am I missing here? Is there a way to "knock down" a website without affecting it?

    • As an example, if your website was hosted on servers in a different region than the outage, but tried to send email using US-EAST-1, your website would have still been up, but would have been affected because it couldn't send email.

  • Something extremely similar happened to my company last month. An EMC tech was onsite to work on the Isilon system. He was supposed to issue a command to put one of the nodes into maintenance mode. Instead he put the entire cluster into maintenance mode.

    Needless to say, he is not welcome back. Ever.

    Not to put myself above anyone else, I made a similar mistake a couple of years ago. I wrote a script that checked a csv input against a list of computers in Active Directory. It was supposed to delete all

  • The admin has a very powerful tool. It has almost no constraints on what it can do because 99% of the time we want that power. We are dealing with an uncommon, unexpected situation and need to be able to have the power to do something different. The exact correct command might be something that no one anticipated before. It would be very time consuming to come up with rules preventing such a command.

    Also I don't think more warning messages or safety logic is always the answer. Maybe practicing more wi
    • by sjames ( 1099 )

      Warnings are nice, but it's hard to anticipate all of the conditions where a warning might be in order. It's also hard to make people pay attention to warnings when each and every action produces one or more warnings.

      Reversibility is a key. Let the admin see the consequences of the last given command and undo it if necessary. Warnings are for actions that are intrinsically irreversible. Build it into the commands if possible. If not, build it into the procedures. Don't delete an instance, just take it offli

  • ..since errors of this nature could be worse than a missile launch. https://en.wikipedia.org/wiki/... [wikipedia.org]
  • To all those guys who are bragging about how they would never put anything in the cloud (AWS or otherwise) because their data centers are so reliable, so redundant, fault-tolerant and insulated from human error that they can be held to the highest possible standards of up-time and accountability, are you hiring?

    Or, would you be interested in a bridge I have for sale?

  • Does anyone know exactly what the typo was?
    I can't find the fumbled command line anywhere on the internet.
