An Incorrect Command Entered By Employee Triggered Disruptions To S3 Storage Service, Knocking Down Dozens of Websites, Amazon Says (amazon.com)
Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week. The company also outlined what caused the issue -- the event was triggered by human error. The company said an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the company said in a press statement Thursday. It adds: The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
CLI ? (Score:1)
Fucking interns (Score:5, Funny)
Sure, blame the intern actually typing the command.
More seriously, perhaps it's time that utility Clippy-ed up. As in "I see you're about to kill thousands of servers. Type YES to proceed."
Re:Fucking interns (Score:5, Insightful)
Here are two other ideas.
1. Confirmation. Are you sure you want to delete 3,207 servers?
(oh, drat, that's not what I meant!)
2. Require more typing. If you really want to delete 3,207 servers, then type "DELETE SERVERS" in all caps and press enter. (or something like that. Similar to how Ripley had to go through a lot of motions to activate the self destruct.)
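A minimal sketch of the two ideas above combined, in Python: show the operator the actual count, then refuse unless the exact phrase is typed. The function name `confirm_destructive` and its parameters are made up for illustration; this is not how any AWS tool works.

```python
def confirm_destructive(count, noun, expected_phrase, read=input):
    """Return True only if the operator types the exact confirmation phrase."""
    prompt = (
        f"You are about to delete {count:,} {noun}.\n"
        f'Type "{expected_phrase}" to proceed: '
    )
    answer = read(prompt)
    # Exact, case-sensitive match: "yes", "y", or a stray Enter all refuse.
    return answer == expected_phrase


# Simulated operator input, so the example does not block on a terminal.
print(confirm_destructive(3207, "servers", "DELETE SERVERS", read=lambda _: "y"))
print(confirm_destructive(3207, "servers", "DELETE SERVERS",
                          read=lambda _: "DELETE SERVERS"))
```

Printing the real number in the prompt is the important part: the phrase slows you down, but the count is what lets you notice that the scope is wrong.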
Re:Fucking interns (Score:5, Insightful)
The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.
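A rough Python sketch of the staged rollout with rollback described above. The function `staged_rollout` and its callbacks (`apply_change`, `is_healthy`, `rollback`) are hypothetical names for illustration, and the cluster names are invented; a real system would also need review and change tracking around it.

```python
def staged_rollout(clusters, apply_change, is_healthy, rollback):
    """Apply a change one cluster at a time; undo everything on first failure."""
    done = []
    for cluster in clusters:
        apply_change(cluster)
        done.append(cluster)
        if not is_healthy(cluster):
            # Roll back in reverse order, including the cluster that just failed.
            for touched in reversed(done):
                rollback(touched)
            return False, done
    return True, done


# Toy usage: a dict stands in for real cluster state.
state = {}
ok, touched = staged_rollout(
    ["canary", "cluster-a", "cluster-b"],
    apply_change=lambda c: state.__setitem__(c, "new"),
    is_healthy=lambda c: c != "cluster-a",  # pretend cluster-a breaks
    rollback=lambda c: state.__setitem__(c, "old"),
)
print(ok, touched, state)
```

In the toy run, cluster-b is never touched because the rollout stops at the first unhealthy cluster.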
Re: (Score:2)
This! A thousand times: this!
Re: (Score:3)
All of that was done, more-or-less, is the problem.
Some poor schmuck took the command line from the (presumably reviewed) change to something billing-related (TFA is short on details there), and typed in that approved command to do the approved thing in a controlled way. But the command line had a typo, and, total WTF, the command line with a typo was able to wreak wholesale destruction.
Whatever configuration management system acted on that command was garbage. Any sane system would have said "hell no!", o
Re: (Score:2)
The problem started way before the admin entered the command. The root cause is that you can do this by entering a command in the first place. This sort of thing should be part of a change-controlled configuration management system, and the change should be reviewed before it gets rolled out, it should be rolled out on a staged basis to a single cluster, and it should get rolled back if it breaks things.
Yes, and I gather, similarly, airline safety is about the systems (culture/procedures/tech) which allowed the individual to make the mistake. Even the way a checklist is written, is critical (what to leave out is as important as what to include).
And I gather like with a new drug, you don't give all your subjects the same injection all together. Wait and see if the first survives, then go on.
And it is always kinda fascinating how, the system one builds to cope with one set of scenarios, in turn creates a new
Re: (Score:2)
Did you know that this is the heart of any "quality" initiative since World War II? Make things idiot-proof, and your employees will save their mental energy for harder things and make fewer mistakes overall.
Re: (Score:2)
If you think anybody that makes mistakes is dumb then you are an absolute idiot. I am all for making "life" harder for idiots.
Re: (Score:2)
When I do anything on a production server, I am extremely careful. Paranoid even. I double check everything. And automation helps avoid mistakes. I only configure a few parameters of a script. But I can double check that before I run it. And I leave the previous configurations commented as examples. That way I just clone the current one, change the version numbers, etc.
Since these Amazon servers that were deleted have the potential to do a HUGE amount of damage, I don't have an
Comment removed (Score:5, Funny)
Re: (Score:3)
Re: (Score:1)
A (now retired) colleague of mine, I'll call Bob, once was a mainframe operator. An incompetent programmer used to blame his accounting application adding errors on Bob for "entering the command wrong".
It was something simple like "RUN ACCT7", but Bob was accused of doing it wrong without specifics, and formally written up for that. HR didn't know anything about computers, so it was easy to bullshit them per creating reprimands.
We'd always joke when somethi
Re: (Score:2)
- "Okay, I've typed the shutdown command, now what was the name of the PBX server?"
- "Asterisk"
Re: (Score:3)
We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level
Yeah, they have apparently made this screw up much harder to repeat.
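For what it's worth, the two safeguards in that quote (remove slowly, never go below a minimum) can be sketched in a few lines of Python. Everything here is an assumption for illustration: `plan_removal`, `CapacityError`, the server names, and the numbers are invented and say nothing about Amazon's actual tool.

```python
class CapacityError(Exception):
    """Raised when a removal would leave a subsystem under its floor."""


def plan_removal(active, requested, minimum, max_per_batch):
    """Return small batches of servers to remove, or refuse outright."""
    active_set = set(active)
    # De-duplicate the request and ignore names that are not in service.
    to_remove = [s for s in dict.fromkeys(requested) if s in active_set]
    remaining = len(active) - len(to_remove)
    if remaining < minimum:
        raise CapacityError(
            f"refusing: {remaining} servers would remain, minimum is {minimum}"
        )
    # Slow removal: hand back batches instead of one big list.
    return [to_remove[i:i + max_per_batch]
            for i in range(0, len(to_remove), max_per_batch)]


active = [f"idx-{i}" for i in range(10)]
print(plan_removal(active, active[:4], minimum=6, max_per_batch=2))
```

Asking for five of the ten servers with the same floor of six raises `CapacityError` before anything is touched, which is the point: the check runs on the whole request, not per batch.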
You Crazy Fool! (Score:2)
AWS Internal Help Desk (Score:5, Funny)
The worker called the AWS internal helpdesk and the BOFH on the other end said "Okay, log in as root and type this... rm -rf slash... that'll fix it"
Re: (Score:2)
Re: (Score:2)
Re:AWS Internal Help Desk (Score:4, Funny)
No. He is splitting the command in more than one line. The next input would trigger the killing frenzy.
Not enough UNIX, as it appears.
playbook?? This is my data not a football! (Score:3, Funny)
playbook?? This is my data not a football!
Re: (Score:2)
This is my data not a football!
If your data is so important, keep it on your own server. Preferably on a separate network not directly connected to the Internet.
Re: (Score:3)
Yeah that sounds good. Because it's absolutely certain that any random individual or small company would have strong or even reasonably decent data protection and retention policies, not to mention security policies.
And of course that doesn't get into the other benefits of AWS and other similar services -- namely the abundant access to large systems that scale under various load scenarios. The scaling in particular is something not even large companies could provide for themselves since doing so pretty muc
Re: (Score:3)
If privacy is your utmost concern, then sure keep your data encrypted on a computer that never has and never will see a network connection and put it in a Faraday cage and whatnot.
A dedicated network between my workstations and the file server is adequate for my needs.
But if you need it to be on the internet (say, you provide a website service,) then you may as well consider AWS and other services.
Very little of my data need to live 24/7 on the Internet. Since I'm converting my dynamic websites to static generated websites, the data behind my websites stays off the Internet as well.
And even with this recent crash, I can bet that they've also got better uptime than your off-the-shelf box that you wiped Windows off and installed Linux on because Linux is completely secure right?
My file server is a custom built PC that runs FreeNAS (BSD) in a Z2 (RAID-6) hard drive configuration. Current uptime is three months since an extended power outage drained the UPS battery and prompted the server to safely power down. It
Re: (Score:2)
is adequate for my needs
That's kind of the tricky part there. Not everyone's needs will match yours.
the data behind my websites stays off the Internet as well
So basically a backup that you can use to regenerate your internet-facing site. Always good to have more backups no matter what their form but again, other people might have different needs. In particular some dynamic sites actually need to be dynamic (frequently for business reasons more than technical ones, but that's beside the point. Needs are needs no matter where they're generated.)
And of course, I'm assuming you're transf
Re: (Score:2)
And motherboard, CPU, PSU, etc failures?
Never happened. Probably because I replace everything when the hard drives start to have problems after running 24/7 for five years. I had to replace the nine-year-old motherboard in my gaming PC so it would have better specs than the file server.
[...] worry about nosy Amazon employees [...]
Uh, no. Script kiddies from China and Russia banging down my virtual doors. I got tired of playing whack the mole with trying to keep everything up to date for Joomla and WordPress. When I replaced a dynamic website with a static website, hacking attempts dropped f
Re: (Score:2)
Disclaimer: I keep all my data in a dufflebag under my kid sisters' bed.
Not necessarily the best place to store your Playboy magazine collection. ;)
Re: (Score:2)
Re: (Score:2)
Obviously you try to automate as much as possible
For efficiency, sure.
For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.
Similarly for security. Does your shit come back up after a reboot? Or does someone have to key in passwords to get the drives unlocked/decrypted, then get the OS running, and then get the various service accounts to do their shit.
No matter where you draw the line, documentation for regular procedures, disaster recovery, and initial configurat
My experience suggests the opposite (Score:5, Insightful)
> For reliability/safety, you automate only that which is guaranteed to be safe. The more reliability/safety you want, the less you can automate.
My experience is the exact opposite. When I write software to automate something, that automated procedure is planned and reviewed, then undergoes unit testing, integration testing, and acceptance testing. When I do something by hand - well you better hope the phone doesn't ring while I'm in the middle of it because if I lose concentration for a moment mistakes are quite possible. My boss agrees; the other day I mentioned I was doing something manually and he cocked his head and asked "manually? Isn't that subject to typos and other errors?"
Re: (Score:2)
I'm all for having things scripted or menuized, or otherwise made foolproof over manually keying things in that aren't security-sensitive (such as passwurdz).
But triggering those scripts and running the "do this complex thing" job should have a human at the trigger, watching things as they burst into flames (or don't).
Another option for complex procedures is to have 2 people to serve as a check against each other. These kinds of checks are commonplace in the military and various regulated industries (minin
Re: (Score:2)
Scripted and rehearsed system maintenance is SOP for good shops. You don't have your highly paid senior folks doing grunt work, and you don't put a grunt in front of a terminal and let them work by the seat of their pants. Obviously you try to automate as much as possible, but there will always remain things that cannot be automated and must be done by a human.
Amazon famously has their highly paid senior engineers doing grunt work. Clearly, given yesterday, it's not their only mistake.
Transcript (Score:5, Funny)
Enter command: DELETE ALL SERVERS
Confirm that you wish to delete all servers: YES
Are you sure? YES
You really wish to delete all servers? YES
I cannot find a predefined scenario under which all servers are removed. Do you wish to abort? NO
Please enter administrator command override to begin deletion of all servers: ZERO ZERO ZERO DESTRUCT ZERO
Re: (Score:2)
I would guess it was something more along the lines of "TAKEDOWN server_subset *" when they meant "TAKEDOWN server_subset/*". The same kind of thing that you can accidentally do with say rm. Obviously I don't work in the bowels of AWS and don't know the real command syntax, but chances are it wasn't as obvious as pushing a big red button ie: not a "predefined" scenario but a simple typo that happened to pass syntax parsing and got run anyway even though it wasn't what the user wanted.
We've all done someth
Re: (Score:3, Funny)
Re: (Score:3)
Enter command: DELETE ALL SERVERS
Funny enough, I started this one job and was given an account on the VAX system that handled all main applications plus internal email. When I went into the email list to read and write my personal email, it looked something like this:
Read Email
Send Email
Read Sent Email
Read Deleted Email
Delete All Email
Well, after a week or two, I had a bunch of email and no need to keep it, so I hit Delete All Email. What nobody told me was since I was an admin, that command meant ALL EMAIL, for EVERYBODY on the system
Re: (Score:1)
Janeway's ship was "lost" out in nowhere-land. There was no Federation to inspect or enforce such rules. They probably hot-wired the ship to give her more control to be more nimble since there was no help around. It was the space equivalent of the Wild West.
Re: (Score:2)
There you go trying to analyze a TV show to show it's logical. It's fiction, folks, and not even science fiction. It's more like fantasy set in the future. It's not supposed to be logical.
S3 outage not the big problem (Score:5, Interesting)
The big problem is not the US-EAST-1 S3 outage.
The big problem is all the other Amazon "special sauce" that blew up when US-EAST-1 S3 went down, which means Amazon has not adequately made their own services reliable with multi-AZ/multi-region resiliency.
Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
So Echo/Alexa was down because it depends on Lambda, new subscriptions to AMI software, Simple Email Service, etc.
Re:S3 outage not the big problem (Score:4, Interesting)
Re: S3 outage not the big problem (Score:2)
My boy couldn't turn his desk light on because an intern in Seattle made a typo? The 'S' in IoT also stands for 'sanity'.
Re: (Score:2)
Is there an H in there somewhere too? And let's just drop the lower-case 'o' entirely.
(I hope someone understands this really convoluted joke)
Re: (Score:2)
HA is hard. "Cloud" makes it even harder because in most instances you don't have much control over the lower levels anymore.
Re: (Score:3)
Wasn't it Netflix that released their internal scripts for testing reliability? Randomly blowing away cloud instances and other core components so they could more realistically test the ability of the HA to deliver?
I generally agree that HA is hard, and it's made harder still by PHBs who ask for HA and then cherry pick the cheapest element (out of several necessary), blab to management that you are fault tolerant and then never allow actually testing it.
I also blame vendors for waaayyyy overpromising what
Simple errors with big effect (Score:2)
Re: (Score:2)
An AD administrator in charge of purging old user accounts was using a script to cull AD. He put an * someplace he shouldn't and deleted all the users in a sub-domain. That was a fun week. And I was still cleaning up after that fiasco months later.
Why would it have taken a week to recover? You can un-delete objects in AD. Below is an example... Should have taken, at most, a few hours to recover, not a week.
https://technet.microsoft.com/... [microsoft.com]
Re: (Score:2)
When I was in ops... (Score:5, Insightful)
Scripting won't work either, really (Score:2)
The double-check script way works until it doesn't.
The command wasn't mis-typed, the scope was wrong.
Re: (Score:2)
The command wasn't mis-typed, the scope was wrong.
And how did the wrong scope appear? By magic? Or did someone enter it?
.
Having the wrong scope is precisely one of the types of errors a script and a second set of eyes will help to prevent.
Re: (Score:3)
Procedures are just the archaeology of mistakes..
Re: (Score:2)
Procedures are just the archaeology of mistakes..
Fortunately, that procedure was put into place because of a small mistake, and the procedure prevented much larger mistakes down the road.
.
Experience is something you don't get until just after you need it.
Experience is the worst teacher; it gives the test before the lesson. ---Vernon Law
Re: (Score:2)
Doesn't really matter though: command, or script, or checked-in config file. There just shouldn't be a way to destroy the world with any one action, regardless of how that action is expressed.
Been there, done that (Score:5, Insightful)
Speaking as a Sysadmin that has been there, nothing compares to the horror of realizing that the split second it took between hitting [Enter] and aborting with [Ctrl-c] was enough to blow up half the production environment.
This is why all potentially very dangerous commands should default to "--dry-run" and only execute with a "--force" switch.
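A minimal sketch of that default using Python's `argparse`: without `--force` the tool only reports what it would do. The program name `remove-servers` and the `remove` callback are hypothetical placeholders, not a real CLI.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="remove-servers")
    parser.add_argument("servers", nargs="+", help="names of servers to remove")
    parser.add_argument("--force", action="store_true",
                        help="actually perform the removal (default is a dry run)")
    return parser


def run(argv, remove=lambda name: None):
    """Dry run unless --force is given; returns True if changes were made."""
    args = build_parser().parse_args(argv)
    for name in args.servers:
        if args.force:
            remove(name)
            print(f"removed {name}")
        else:
            print(f"[dry run] would remove {name}")
    return args.force


run(["idx-1", "idx-2"])      # prints what would happen, changes nothing
run(["idx-1", "--force"])    # only now is remove() called
```

The design choice is that the dangerous path requires an extra, explicit flag, so a half-finished or mistyped command line falls through to the harmless behavior.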
Re:Been there, done that (Score:4, Insightful)
Re: (Score:2)
the magic of the command line (Score:3)
i've seen someone once change the usable memory of SQL server down to 1/10 the physical RAM by accident cause he thought he was so awesome and only used sql for changing configuration options
why i like the almost dumb proof GUI where you can double and triple check visually before you do something that can take a dozen applications offline
Re: (Score:2)
Grownups and then you, right?
Re: (Score:2)
some "grown up" deleted half of S3.
This is not some mundane detail, Michael! (Score:2)
why remove in the middle of the day? (Score:2)
at the very least they should have made the VM's unavailable instead of removing them so that if something happened all they had to do was bring them back up again
The employee name was revealed. (Score:5, Funny)
His name was Robert `); DROP TABLE S3-subsystem; --
Re: (Score:2)
Every time they tried to add his resume to the blacklist, it just disappeared.
Re: (Score:2)
Every time they tried to add his resume to the blacklist, it just disappeared.
The blacklist disappeared. Not little Bobby's resume alone.
Cloud Services Are Inherently Unreliable (Score:3)
Before my company moved from internally managed email to office365 managed email, the email service was highly reliable. But now, Microsoft unilaterally deleted half of our entire corporate email history due to some internal mistake. It was able to restore most (if not all) of the deleted email, so we narrowly avoided a disaster.
But this kind of stuff is a disaster waiting to happen that too many management-level boneheads seem to either not understand or not care about until it's too late.
Anyone relinquishing control over their infrastructure to unaccountable third parties needs to be fired ASAP, and be replaced with someone who isn't a complete and utter moron. The mine is littered with dead canaries, and too many responses are along the lines of, "that won't happen to our canaries. Let's forge ahead."
Re: (Score:2)
You found out too late that Microsoft doesn't have backups of its service. I actually had the same issue, employer decided that $25/mailbox/month was 'normal' and then mailboxes corrupted on the O365 servers (because, it's still Exchange after all, the worst e-mail system in the world and it has inherited the same Exchange problems: corrupt data stores). Now we're scrambling to find a 'backup' solution. TCO calculations that were already dodgy suddenly went up 50%.
Re: (Score:2)
I just wanted to point out how wrong you are.
We have finished migrating to the cloud and now we have redundancy, speed and guaranteed uptime
This also allowed us to get rid of our surly ops guys, so it's a win win really.
We are now in the process of outsourcing our programming department to India, another win, win for us in management.
We are excited about the future, what could possibly go wrong?
Regards,
Bob Stiff - PHB
cloud services, rm -rf / has global significance (Score:1)
So the wonder of cloud services is that a single fat fingered error rather than just taking out one company, can take out the world.
Re: (Score:1)
I'm hoping T's fingers are too short to reach the Red Button. I'm hoping O had a collar soldered to it that only fit his long fingers.
Business Opportunity (Score:3)
Now Amazon can sell AWS prime. If you are a subscriber of AWS prime we will check "Twice" before removing your servers. That should boost the profit somewhat.
Well, at least that won't happen again (Score:4, Insightful)
Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended
oh, nevermind.
Admin by color (Score:2)
Re:Admin by color (Score:4, Insightful)
Re: (Score:1)
Press 'em both, just to be sure they got the right one!
--
You have the right to remain dead.
Re: (Score:3)
Maybe 'remove one server' should be a big green button, and 'remove many servers' should be a big red button.
What do color blind admins do?
Take down all the S3 services in the US-EAST-1 Region :P
Re: (Score:2)
Customize the colour scheme in the console?
AWS (Score:1)
Of course, you have all your business critical systems spread amongst several AWS data centers, right? And any business critical data is replicated several times? Not.
In Some Cases? (Score:2)
Amazon is apologizing for the disruptions to its S3 storage service that knocked down and -- in some cases affected -- dozens of websites earlier this week.
What am I missing here? Is there a way to "knock down" a website without affecting it?
Re: (Score:2)
As an example, if your website was hosted on servers in a different region than the outage, but tried to send email using US-EAST-1, your website would have still been up, but would have been affected because it couldn't send email.
Happens all the time (Score:2)
Something extremely similar happened to my company last month. An EMC tech was onsite to work on the Isilon system. He was supposed to issue a command to put one of the nodes into maintenance mode. Instead he put the entire cluster into maintenance mode.
Needless to say, he is not welcome back. Ever.
Not to put myself above anyone else, I made a similar mistake a couple of years ago. I wrote a script that checked a csv input against a list of computers in Active Directory. It was supposed to delete all
The more power the more rope to hang yourself with (Score:2)
Also I don't think more warning messages or safety logic is always the answer. Maybe practicing more wi
Re: (Score:2)
Warnings are nice, but it's hard to anticipate all of the conditions where a warning might be in order. It's also hard to make people pay attention to warnings when each and every action produces one or more warnings.
Reversibility is a key. Let the admin see the consequences of the last given command and undo it if necessary. Warnings are for actions that are intrinsically irreversible. Build it into the commands if possible. If not, build it into the procedures. Don't delete an instance, just take it offli
Time to implement the Two-Man Rule... (Score:2)
I Want to Work Where You Work (Score:2)
To all those guys who are bragging about how they would never put anything in the cloud (AWS or otherwise) because their data centers are so reliable, so redundant, fault-tolerant and insulated from human error that they can be held to the highest possible standards of up-time and accountability, are you hiring?
Or, would you be interested in a bridge I have for sale?
Where is the typo? (Score:2)
Does anyone know exactly what the typo was?
I can't find the fumbled command line anywhere on the internet.
Re: cloud (Score:3, Insightful)
Well that all sounds easy enough [wikipedia.org]
Re: (Score:3)
Really going to claim redundant sites for static data is hard? Eventually consistent databases are a thing, and have been for a long time. Outside of some very specific niches, how much stuff really needs ACID transactions?
And yes I've built these many times well before the cloud was a "thing". Using a single cloud provider for anything is a risk the same reasons we use multiple data centers in different parts of the country/world since before the internet allowed commercial traffic and probably before that (no di
Re: cloud (Score:5, Insightful)
I would say most stuff requires ACID or at least continuously consistent databases (you don't always need transactions or atomicity) and eventually consistent is a niche. Most 'eventually consistent' systems I've seen have an entire layer on top to make sure the data is consistent.
Anytime you do a financial transaction of any sorts (free or not), you need a consistent system or risk someone being able to manipulate the data. Obviously, some developers don't really care at first since eventually consistent updates are fast enough initially. But once they realize the mistake they made, an entire layer of patchwork gets written to make it behave like a rational database again.
Re: (Score:3)
Well that all sounds easy enough [wikipedia.org]
Well, computers are easy. A child can program one. That's why you should always hire the cheapest IT workers you can get.
Amazon should use the cloud (Score:3)
AWS should use the cloud, that way when one server goes down the load is picked up seamlessly by another one with no downtime ...... Oh Wait. Never Mind
Re: H-1B Visas (Score:1)
Except anyone can make a mistake. Therefore inevitably everyone does make mistakes. Usually you're lucky and it wasn't important, but very rarely things such as this happen to someone. All you have demonstrated is your own prejudice. I could equally say it was an overpaid lazy American worker which is why companies are climbing over themselves to get cheaper skilled H-1B workers, but that also would be prejudiced. Unless you have the full details, which neither of us do, then it's just meaningless noise.
Re: (Score:2)
Ok, anyone can make a mistake, but if H1Bs built the server management system to rely on manually typed commands and no one saw the obvious risk of doing that, where does the blame really lie?
The stereotypical H1B is culturally preconditioned to serve, not analyze; they'll (attempt to) do exactly what they're told... no more, no less, with little questioning.
Re: (Score:2)
Nothing but more meaningless noise. It's no wonder lazy overpaid Americans are getting replaced when they make bad assumptions and spout prejudice.
India, the predominant H-1B Visa source for American companies, is an authoritarian culture with arranged marriages and all that sort of stuff. That's not prejudice, it's an observation of fact. Can you dispute that with facts and evidence? Or you just want your fantasy version of reality in your head to be right even if it's complete fantasy. You know that's the definition of delusion right?
Re: (Score:3)
Re: (Score:2)
Yeh, there needs to be a rm -rf -list that simply displays the file names it would delete.
They gave it the name find(1).
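The same "list it before you delete it" idea, sketched in Python rather than shell. `preview_delete` is a made-up helper name; it only walks the tree with `pathlib` and returns the paths a recursive delete of `root` would touch, removing nothing.

```python
from pathlib import Path


def preview_delete(root):
    """List everything a recursive delete of `root` would touch, deleting nothing."""
    root = Path(root)
    # Children first (sorted for stable output), then the root itself,
    # roughly the order in which a recursive delete has to proceed.
    return sorted(str(p) for p in root.rglob("*")) + [str(root)]
```

Calling `preview_delete("some/dir")` and reading the result before running the real delete is the whole trick; a wrong scope shows up as an unexpectedly long list.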
Re: (Score:2)
Yeah I don't think you got the idea...
Re: (Score:2)
That's exactly what a tired/disgruntled operator at AOL did many years ago, I believe at their data center in Japan. Wherever it was, it affected a very important system that took them down in a pretty broad geographic area for something like 2 or 3 days. It was a big deal.
Re: (Score:2)
PHB: Hey, could I interrupt you for a second while you're typing that command? I've got something more important. We're thinking of changing the locks on the data center doors to another brand that offers locks in a variety of different colors. And the vendor has assured me that these locks can use the same code as my luggage.
Re: (Score:2)
Re: (Score:3)
Oh, be careful, friend! I made the grave mistake of suggesting on Reddit that we've kinda/sorta/maybe become too enamored of CLIs and that just MAYBE a GUI MIGHT have prevented this, and I got hammered mercilessly.
You don't want to say anything that doesn't equate to worship at the feet of the almighty, great and awesome CLI around the wrong people.