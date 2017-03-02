

 


An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com) 61

Posted by msmash from the how-it-all-happened dept.
Amazon is apologizing for the disruptions to its S3 storage service that knocked dozens of websites offline, or in some cases impaired them, earlier this week. The company also outlined what caused the issue: the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command that was intended to remove a small number of servers for one of the S3 subsystems used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds:

"The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."
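
The statement describes a removal that took shared capacity away from the index and placement subsystems. One possible safeguard, sketched here under assumptions, is a pre-flight check that refuses any removal leaving a subsystem below a minimum server count. The `Server` type, the host names, and the floor values are all hypothetical; nothing here reflects Amazon's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    subsystems: frozenset  # subsystems this host serves (hypothetical labels)


class CapacityFloorError(RuntimeError):
    """Raised when a removal would leave a subsystem under its floor."""


def check_removal(fleet, to_remove, min_capacity):
    """Return the servers that would survive, or raise if a floor is violated.

    fleet: list of Server, to_remove: set of names,
    min_capacity: dict mapping subsystem name to minimum server count.
    """
    remaining = [s for s in fleet if s.name not in to_remove]
    for subsystem, floor in min_capacity.items():
        left = sum(1 for s in remaining if subsystem in s.subsystems)
        if left < floor:
            raise CapacityFloorError(
                f"{subsystem}: {left} servers would remain, floor is {floor}"
            )
    return remaining


# Made-up fleet: some hosts serve more than one subsystem.
fleet = [
    Server("host-1", frozenset({"billing"})),
    Server("host-2", frozenset({"billing", "index"})),
    Server("host-3", frozenset({"index", "placement"})),
    Server("host-4", frozenset({"index", "placement"})),
]
floors = {"billing": 1, "index": 2, "placement": 1}

# Intended change: retire one billing-only host.
survivors = check_removal(fleet, {"host-1"}, floors)
print(len(survivors), "servers remain")

# Mistyped input that also sweeps up shared hosts.
try:
    check_removal(fleet, {"host-1", "host-2", "host-3"}, floors)
except CapacityFloorError as err:
    print("refused:", err)
```

The check runs before anything is touched, so a bad input costs an error message rather than a restart.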

  • Cheap help is so easy to overwork.

  • Fucking interns (Score:5, Funny)

    by xxxJonBoyxxx ( 565205 ) on Thursday March 02, 2017 @01:55PM (#53964199)
    >> wrong command

    Sure, blame the intern actually typing the command.

    More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
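
A rough sketch of the "Type YES to proceed" idea, assuming a hypothetical helper named `confirm_removal` and an arbitrary threshold of five servers. The prompt function is passed in so the logic can be exercised without a terminal.

```python
def confirm_removal(targets, threshold=5, ask=input):
    """Return True if the removal may proceed.

    Batches at or below the threshold pass silently; larger ones require
    the operator to type YES exactly. `ask` defaults to input() but can be
    swapped out for testing.
    """
    count = len(targets)
    if count <= threshold:
        return True
    reply = ask(f"I see you're about to remove {count} servers. Type YES to proceed: ")
    return reply.strip() == "YES"


# Simulated operators (no real prompt is shown here).
small_batch = ["host-1", "host-2"]
huge_batch = [f"host-{i}" for i in range(1000)]

print(confirm_removal(small_batch, ask=lambda prompt: ""))     # under threshold
print(confirm_removal(huge_batch, ask=lambda prompt: "y"))     # sloppy answer
print(confirm_removal(huge_batch, ask=lambda prompt: "YES"))   # explicit answer
```

Requiring the exact string, rather than accepting "y" or a bare Enter, is the point: muscle memory should not be enough to confirm a large change.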
    • Maybe when ANY servers are deleted, even just one, there should be two or more people who look at the command before it is entered. Just to have more than one pair of eyes on it. Just to greatly reduce the chances of doing something you don't want to do. Sort of like, if you did the rm -rf / thing. Make sure another person looks at it first. Seems like a simple rule for certain powerful commands where the user's powers include enough scope to accidentally do a lot of damage.

      Here are two other ideas.
    • "alexa take down s3 servers a b and c"
      "OK, taking down s3"
  • You pushed the "trigger disruptions to S3 storage service" button!

  • AWS Internal Help Desk (Score:5, Funny)

    by EmagGeek ( 574360 ) <gterichNO@SPAMaol.com> on Thursday March 02, 2017 @01:55PM (#53964205) Journal

    The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"

  • playbook?? This is my data not a football!

    • Re: (Score:2)

      by creimer ( 824291 )

      This is my data not a football!

      If your data is so important, keep it on your own server. Preferably on a separate network not directly connected to the Internet.

      • Re: (Score:2)

        by Altrag ( 195300 )

        Yeah that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.

        And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty muc

  • Transcript (Score:5, Funny)

    by 93 Escort Wagon ( 326346 ) on Thursday March 02, 2017 @02:12PM (#53964365)

    Enter command: DELETE ALL SERVERS
    Confirm that you wish to delete all servers: YES
    Are you sure? YES
    You really wish to delete all servers? YES
    I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
    Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO

    • Re: (Score:2)

      by Altrag ( 195300 )

      I would guess it was something more along the lines of "TAKEDOWN server_subset *" when they meant "TAKEDOWN server_subset/*". The same kind of thing that you can accidentally do with, say, rm. Obviously I don't work in the bowels of AWS and don't know the real command syntax, but chances are it wasn't as obvious as pushing a big red button, i.e., not a "predefined" scenario but a simple typo that happened to pass syntax parsing and got run anyway even though it wasn't what the user wanted.

      We've all done someth
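
To illustrate the kind of typo guessed at above, here is a small sketch that expands a wildcard against an inventory before acting and refuses if it matches too much. The inventory names, the `resolve_targets` helper, and the 50% cap are made up for illustration; the real command syntax is unknown. Note that `fnmatch` does not treat `/` specially, so a bare `*` matches every name.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory, named "<subsystem>/<host>".
INVENTORY = [
    "billing/host-1", "billing/host-2",
    "index/host-1", "index/host-2", "index/host-3",
    "placement/host-1",
]


def expand(pattern, inventory):
    """Return every inventory name the shell-style pattern matches."""
    return [name for name in inventory if fnmatchcase(name, pattern)]


def resolve_targets(pattern, inventory, max_fraction=0.5):
    """Expand the pattern, refusing if it selects too much of the fleet."""
    matched = expand(pattern, inventory)
    if len(matched) > max_fraction * len(inventory):
        raise ValueError(
            f"pattern {pattern!r} matches {len(matched)} of {len(inventory)} "
            "servers; refusing"
        )
    return matched


print(resolve_targets("billing/*", INVENTORY))  # the intended subset

try:
    resolve_targets("*", INVENTORY)  # the typo: matches everything
except ValueError as err:
    print("refused:", err)
```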

  • The big problem is not the US-EAST-1 S3 outage.

    The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.

    Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda wer

    • Yeah, actually I was surprised how much Amazon stuff actually went down. I had a coworker who had previously worked at Amazon, and he profusely assured me that AWS was not stable. Clearly he was right. (I don't know why this is big news all of a sudden; AWS has a big outage every year on average.)

    • My boy couldn't turn his desk light on because an intern in Seattle made a typo? The 'S' in IoT also stands for 'sanity'.

    • Re: (Score:2)

      by guruevi ( 827432 )

      HA is hard. "Cloud" makes it even harder because in most instances you don't have much control over the lower levels anymore.

  • An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.

    • An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.

      Why would it have taken a week to recover? You can un-delete objects in AD. Below is an example... Should have taken, at most, a few hours to recover, not a week.

      https://technet.microsoft.com/... [microsoft.com]

  • ... we never, NEVER typed such critical commands. They were always entered into a script, and the script double-checked by a second set of eyes. While we did have some minor inconsequential errors, we never had a major error because of mis-typed commands.
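
One way the "script plus second set of eyes" practice could be enforced mechanically, as a sketch only: approval is tied to a hash of the exact script text, so any later edit invalidates the review. The `ReviewLog` class and the user names are hypothetical.

```python
import hashlib


def digest(script_text):
    """Fingerprint the exact bytes that were reviewed."""
    return hashlib.sha256(script_text.encode("utf-8")).hexdigest()


class ReviewLog:
    """In-memory record of which script versions a second person approved."""

    def __init__(self):
        self._approvals = {}  # digest -> (author, reviewer)

    def approve(self, script_text, author, reviewer):
        if author == reviewer:
            raise PermissionError("author cannot review their own script")
        self._approvals[digest(script_text)] = (author, reviewer)

    def is_approved(self, script_text):
        return digest(script_text) in self._approvals


log = ReviewLog()
script = "remove-servers billing/host-1\n"
log.approve(script, author="alice", reviewer="bob")

print(log.is_approved(script))                                  # reviewed version
print(log.is_approved("remove-servers billing/host-1 index/*\n"))  # edited afterwards
```

A runner would call `is_approved` on the text it is about to execute and stop if the answer is no.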

  • Been there, done that (Score:5, Insightful)

    by Dorianny ( 1847922 ) on Thursday March 02, 2017 @02:31PM (#53964587) Journal

    Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.

    This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
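
A minimal sketch of that default, using Python's `argparse`. The tool name `remove-servers` and the `run` helper are hypothetical; the only point is that nothing destructive happens unless `--force` is given.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="remove-servers")
    parser.add_argument("servers", nargs="+", help="server names to remove")
    parser.add_argument(
        "--force",
        action="store_true",
        help="actually remove servers (without this flag the run is a dry run)",
    )
    return parser


def run(argv, remove):
    """Parse argv and either describe or perform each removal.

    `remove` is the callable that does the real work; it is only invoked
    when --force is present.
    """
    args = build_parser().parse_args(argv)
    actions = []
    for name in args.servers:
        if args.force:
            remove(name)
            actions.append(f"removed {name}")
        else:
            actions.append(f"[dry-run] would remove {name}")
    return actions


removed = []
print(run(["host-1", "host-2"], remove=removed.append))             # dry run
print(run(["--force", "host-1"], remove=removed.append))            # real run
print(removed)
```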

  • i've seen someone once change the usable memory of SQL server down to 1/10 the physical RAM by accident cause he thought he was so awesome and only used sql for changing configuration options

    why i like the almost dumb proof GUI where you can double and triple check visually before you do something that can take a dozen applications offline

  • "I must have put a decimal point in the wrong place or something. I always do that! I always mess up some mundane detail!"

  • at the very least they should have made the VMs unavailable instead of removing them so that if something happened all they had to do was bring them back up again
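
The "make unavailable instead of removing" idea can be sketched as a two-state model where taking a server out of service is reversible. The `Fleet` class and state names below are hypothetical and say nothing about how S3 hosts are actually managed.

```python
from enum import Enum


class State(Enum):
    IN_SERVICE = "in_service"
    DRAINED = "drained"  # out of rotation, but still exists


class Fleet:
    def __init__(self, names):
        self._state = {name: State.IN_SERVICE for name in names}

    def drain(self, names):
        """Take servers out of service without destroying them."""
        for name in names:
            self._state[name] = State.DRAINED

    def restore(self, names):
        """Undo a drain: put the servers back in rotation."""
        for name in names:
            self._state[name] = State.IN_SERVICE

    def in_service(self):
        return sorted(n for n, s in self._state.items() if s is State.IN_SERVICE)


fleet = Fleet(["host-1", "host-2", "host-3"])
fleet.drain(["host-1", "host-2"])   # oops, one too many
print(fleet.in_service())
fleet.restore(["host-2"])           # cheap rollback
print(fleet.in_service())
```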

  • Looks like Amazon has very strict sign off requirement. After entering the command, the system asked for his name to be logged for the audit trail.

    His name was Robert `); DROP TABLE S3-subsystem; --
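
For anyone who hasn't seen the joke before: the defence is to pass the name as a bound parameter rather than splicing it into the SQL string. A small sketch with Python's built-in `sqlite3` module follows; the `audit_log` table and `record` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (operator TEXT, command TEXT)")


def record(conn, operator, command):
    """Log who ran what, using ? placeholders so input is never parsed as SQL."""
    conn.execute(
        "INSERT INTO audit_log (operator, command) VALUES (?, ?)",
        (operator, command),
    )


name = "Robert `); DROP TABLE S3-subsystem; --"
record(conn, name, "remove-servers")

# The hostile string is stored as plain data and the table is still there.
rows = conn.execute("SELECT operator FROM audit_log").fetchall()
print(rows)
```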

  • Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.

    But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.

    Anyone relinqui

    • Re: (Score:2)

      by guruevi ( 827432 )

      You found out too late that Microsoft doesn't have backups of its service. I actually had the same issue, employer decided that $25/mailbox/month was 'normal' and then mailboxes corrupted on the O365 servers (because, it's still Exchange after all, the worst e-mail system in the world and it has inherited the same Exchange problems: corrupt data stores). Now we're scrambling to find a 'backup' solution. TCO calculations that were already dodgy suddenly went up 50%.

  • I'm glad they located the issue and put safeguards in place to make sure it doesn't happen again.

    Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended

    oh, nevermind.

