An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com) 52
Amazon is apologizing for the disruptions to its S3 storage service that knocked down, or in some cases otherwise affected, dozens of websites earlier this week. The company also outlined what caused the issue: the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday.
It adds: "The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."
Really going to claim redundant sites for static data is hard? Eventually consistent databases are a thing, and have been for a long time. Outside of some very specific niches, how much stuff really needs ACID transactions?
And yes, I've built these many times, well before the cloud was a "thing". Using a single cloud provider for anything is a risk, for the same reasons we have used multiple data centers in different parts of the country/world since before the internet allowed commercial traffic, and probably before that.
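For what it's worth, the read side of "redundant sites for static data" can be sketched in a few lines. This is only an illustration under assumptions: `read_with_fallback`, the replica names, and the fetch callables are all hypothetical, and each fetcher stands in for whatever client a given site or provider actually uses.

```python
def read_with_fallback(key, replicas):
    """Try each (name, fetch) replica in order; return (name, data) from the first that answers.

    `replicas` is a list of (name, callable) pairs. Both the names and the
    callables are hypothetical stand-ins for real storage clients.
    """
    errors = []
    for name, fetch in replicas:
        try:
            return name, fetch(key)
        except OSError as exc:  # ConnectionError and TimeoutError are OSError subclasses
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all replicas failed: " + "; ".join(errors))
```

Static objects tolerate this well because a slightly stale copy from the second site is usually still a correct answer; writes are where the eventual-consistency trade-offs start to matter.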
More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
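A rough sketch of what that "Clippy" check might look like, assuming the tool knows the fleet size up front. The function name, the 5% threshold, and the prompt wording are made up for illustration; nothing here reflects Amazon's actual tooling.

```python
def confirm_removal(requested, fleet_size, ask=input, threshold=0.05):
    """Return True if a removal of `requested` servers may proceed.

    Removals at or below `threshold` of the fleet pass without a prompt.
    Anything larger requires the operator to type YES exactly.
    The threshold value is an assumption, not a known AWS setting.
    """
    if fleet_size <= 0 or requested < 0:
        raise ValueError("fleet_size must be positive and requested non-negative")
    if requested <= fleet_size * threshold:
        return True
    prompt = (f"I see you're about to remove {requested} of {fleet_size} servers. "
              "Type YES to proceed: ")
    return ask(prompt).strip() == "YES"
```

Passing `ask` as a parameter keeps the check testable without a terminal. Requiring the exact string YES, rather than accepting "y" or a bare Enter, is the part that makes a reflexive keypress less likely to go through.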
Here are two other ideas.
The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"
Obviously you try to automate as much as possible
For efficiency, sure.
For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.
Similarly for security. Does your shit come back up after a reboot? Or does someone have to key in passwords to get the drives unlocked/decrypted, then get the OS running, and then get the various service accounts to do their shit?
No matter where you draw the line, documentation for regular procedures, disaster recovery, and initial configuration
If your data is so important, keep it on your own server. Preferably on a separate network not directly connected to the Internet.
Yeah, that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.
And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty much
Except anyone can make a mistake. Therefore inevitably everyone does make mistakes. Usually you're lucky and it wasn't important, but very rarely things such as this happen to someone. All you have demonstrated is your own prejudice. I could equally say it was an overpaid lazy American worker, which is why companies are climbing over themselves to get cheaper skilled H-1B workers, but that also would be prejudiced. Unless you have the full details, which neither of us does, then it's just meaningless noise.
Ok, anyone can make a mistake, but if H1Bs built the server management system to rely on manually typed commands and no one saw the obvious risk of doing that, where does the blame really lie?
The stereotypical H1B is culturally preconditioned to serve, not analyze; they'll (attempt to) do exactly what they're told... no more, no less, with little questioning.
PHB: Hey, could I interrupt you for a second while you're typing that command? I've got something more important. We're thinking of changing the locks on the data center doors to another brand that offers locks in a variety of different colors. And the vendor has assured me that these locks can use the same code as my luggage.
Transcript (Score:5, Funny)
Enter command: DELETE ALL SERVERS
Confirm that you wish to delete all servers: YES
Are you sure? YES
You really wish to delete all servers? YES
I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO
I would guess it was something more along the lines of "TAKEDOWN server_subset *" when they meant "TAKEDOWN server_subset/*". The same kind of thing that you can accidentally do with, say, rm. Obviously I don't work in the bowels of AWS and don't know the real command syntax, but chances are it wasn't as obvious as pushing a big red button, i.e. not a "predefined" scenario but a simple typo that happened to pass syntax parsing and got run anyway, even though it wasn't what the user wanted.
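To make the guess above concrete, here is a small sketch of how one stray space can change what a pattern-based command matches. Everything in it is hypothetical: the TAKEDOWN verb comes from the comment, while the fleet names and both functions are invented for illustration and say nothing about the real AWS command.

```python
import fnmatch
import shlex

# Hypothetical fleet; the names are invented for this example.
FLEET = ["billing/a1", "billing/a2", "index/b1", "index/b2", "placement/c1"]

def targets(command_line, fleet=FLEET):
    """Return every server matched by any pattern argument after the verb."""
    _verb, *patterns = shlex.split(command_line)
    return [name for name in fleet
            if any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)]

def checked_targets(command_line, fleet=FLEET):
    """Same as targets(), but refuse arguments made up only of wildcards."""
    _verb, *patterns = shlex.split(command_line)
    for pattern in patterns:
        if not pattern.strip("*?"):
            raise ValueError(f"pattern {pattern!r} is only wildcards; refusing to run")
    return targets(command_line, fleet)
```

With these definitions, `targets("TAKEDOWN billing/*")` selects the two billing servers, while `targets("TAKEDOWN billing *")` parses cleanly as two arguments and selects the whole fleet, because the bare `*` matches every name. `checked_targets` rejects the second form instead of running it.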
We've all done something
S3 outage not the big problem (Score:2)
The big problem is not the US-EAST-1 S3 outage.
The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.
Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were
My boy couldn't turn his desk light on because an intern in Seattle made a typo? The 'S' in IoT also stands for 'sanity'.
Simple errors with big effect (Score:2)
An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.
Why would it have taken a week to recover? You can un-delete objects in AD. Below is an example... Should have taken, at most, a few hours to recover, not a week.
https://technet.microsoft.com/... [microsoft.com]
When I was in ops... (Score:2)
Been there, done that (Score:3)
Speaking as a sysadmin who has been there, nothing compares to the horror of realizing that the split second between hitting [Enter] and aborting with [Ctrl-C] was enough to blow up half the production environment.
This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
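A minimal sketch of that default, using Python's argparse. The `remove-servers` program name, the `run` helper, and the `remove` callback are hypothetical; the point is only that the destructive path is unreachable unless `--force` is given.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="remove-servers")  # hypothetical tool name
    parser.add_argument("servers", nargs="+", help="servers to remove")
    parser.add_argument("--force", action="store_true",
                        help="actually remove; without it the command is a dry run")
    return parser

def run(argv, remove=lambda name: None):
    """Report what would happen by default; call remove() only when --force is given."""
    args = build_parser().parse_args(argv)
    actions = []
    for name in args.servers:
        if args.force:
            remove(name)
            actions.append(f"removed {name}")
        else:
            actions.append(f"would remove {name}")
    return actions
```

Calling `run(["a1", "a2"])` only returns the "would remove" lines, so the operator sees the full blast radius before anything changes; rerunning with `--force` appended is the deliberate second step.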
the magic of the command line (Score:2)
I once saw someone accidentally cut SQL Server's usable memory down to 1/10 of the physical RAM, because he thought he was so awesome that he only ever used SQL to change configuration options.
That's why I like the almost idiot-proof GUI, where you can double- and triple-check visually before you do something that can take a dozen applications offline.
why remove in the middle of the day? (Score:2)
At the very least, they should have made the VMs unavailable instead of removing them, so that if something went wrong all they had to do was bring them back up again.
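The "make unavailable instead of removing" idea could be sketched as a reversible state change. This is an assumption-heavy toy: the `Pool` class and its state names are invented, and the summary does not say whether draining would actually have avoided the full subsystem restarts Amazon describes.

```python
class Pool:
    """Toy capacity pool where taking servers out of service is reversible."""

    def __init__(self, servers):
        self.state = {name: "in_service" for name in servers}

    def drain(self, names):
        """Mark servers unavailable without deleting them."""
        unknown = [name for name in names if name not in self.state]
        if unknown:
            raise KeyError(f"unknown servers: {unknown}")
        for name in names:
            self.state[name] = "drained"

    def restore(self):
        """Bring every drained server back; return the names restored."""
        restored = sorted(n for n, s in self.state.items() if s == "drained")
        for name in restored:
            self.state[name] = "in_service"
        return restored

    def in_service(self):
        return sorted(n for n, s in self.state.items() if s == "in_service")
```

Validating all names before changing any state means a typo aborts the whole operation rather than draining half the list, and `restore()` is the one-step undo the comment is asking for.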
Cloud Services Are Inherently Unreliable (Score:2)
Before my company moved from internally managed email to Office 365, the email service was highly reliable. Since the move, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.
But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.
Anyone relinquishing