How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)
Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.
Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first, rather than increase the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
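For readers unfamiliar with the auto-scaling behavior described in the summary, here is a minimal sketch of a target-tracking rule with a hard capacity ceiling. Everything in it is an assumption made for illustration: the function name `plan_capacity`, its parameters, and the numbers are hypothetical, and nothing here reflects Amazon's actual internal systems (which, per commenters in this thread, do not run on public AWS). It only shows why a scaler that hits its ceiling leaves a shortfall that someone has to cover by hand.

```python
import math


def plan_capacity(requests_per_sec, target_per_host, min_hosts, max_hosts):
    """Return (hosts_to_run, shortfall) for a simple target-tracking rule.

    Hypothetical illustration only: scale so each host handles at most
    `target_per_host` requests per second, but never exceed `max_hosts`,
    the hardware actually available.
    """
    # Hosts needed to keep every host at or below its target load.
    needed = math.ceil(requests_per_sec / target_per_host)
    # Clamp to the fleet limits: the floor keeps a baseline running,
    # the ceiling models being "out of capacity for scaling".
    granted = max(min_hosts, min(needed, max_hosts))
    # Any hosts still missing must come from somewhere else
    # (manual additions, shedding traffic, and so on).
    shortfall = max(0, needed - granted)
    return granted, shortfall


if __name__ == "__main__":
    # Normal day: demand fits inside the fleet limit.
    print(plan_capacity(1_000, 100, 2, 50))
    # Traffic spike: demand exceeds the ceiling, leaving a shortfall.
    print(plan_capacity(10_000, 100, 2, 50))
```

In the second call the rule asks for 100 hosts but only 50 exist, so the function reports a shortfall of 50. That gap is the kind of situation in which operators would have to add hosts manually or cut traffic, as the documents describe.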
The whole point of "prime day" (Score:5, Interesting)
The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it? Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.
Re:The whole point of "prime day" (Score:4, Interesting)
That sounds good and that's probably what used to happen.
Last year (when I was still employed by them) Prime day was a VERY big deal; they count on that toward the value of their stocks.
Unfortunately, Amazon is a very reactive company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but they were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.
I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.
Re: (Score:2, Informative)
That is rarely the case.
I have never seen Amazon, or EC2 specifically, be an organization of blame. The reality is they ask people to go mind-numbingly fast a lot of the time and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and what could have been done better. Consistently fail as an organization and higher-ups start asking what you are doing as an org to solve your problems.
Re: (Score:2)
I would suggest that if you're averse to failure to the point of never really trying something difficult, you're already failing.
There are a lot of things that teach us, but failure is one of the greatest teachers of all.
Or as my dad used to say (probably stolen from elsewhere), "If you aren't failing, you're not trying hard enough"
Joke's on you (Score:1, Insightful)
Re: (Score:2)
...This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.
"Currently out of capacity for scaling,"
And I'm still scratching my head trying to figure out what the real problem was, because running out of hardware is kind of what the comment above implies, not that there was some glitch to fix in "new infrastructure code".
And scale things up by hand? On Prime Day? That's kind of like telling a NASCAR pit crew they're gonna have to change tires on the car while it's still racing around the track.
Re: (Score:1)
I recognize the quotes in the article - they are from the trouble ticket for the event (all Amazon was watching it, so no wonder it leaked). So let me just say, there was no one single cause of failure.
Re: (Score:2)
However, since Amazon has years of data and trending information, they should have been able to predict how much demand there would be, and made sure the infrastructure was prepped and ready for the event.
It is kinda a high-stakes game to play for load testing.
Re: (Score:2)
--This does not appear to be a failure of Prime Day's IT team - it's a Failure of Manglement. Stress the people under you too hard and don't give them what they need, and this is the kind of chit that happens.
--Amazon has been in business for a fairly long time - long enough for upper manglement to become stultified. I doubt they'll learn much from this debacle (but I'd be glad to be proven wrong.) They'll probably continue to treat their warehouse workers like chattel tho.
Re:The whole point of "prime day" (Score:5, Interesting)
Recent Amazon employee here, posting AC for obvious reasons,
The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.
No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.
They need to be able to test AWS with massive demand on unpredictable pages
Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.
The big difference this year is that something didn't work right
Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden on any team that wanted to scale up its fleet for prime day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.
How'd that work out for you, Amazon?
Re: (Score:2)
Interesting, and that makes sense. I was thinking about it the other day. I've never bought anything from Amazon on Prime Day, mainly because every time I look at the sale items, they seem to be a bunch of junk that I have no interest in or need for. I'd started to think of it as a typical "clearance" sale, that they were trying to make space in the warehouse (for upcoming Xmas) by ditching their leftover junk at low prices.
Tried buying something all day (Score:2)
Didn't feel like capacity was the issue (Score:2)
They should switch to a scalable Infrastructure (Score:2)
Like Amazon Web Services for example...
Re: They should switch to a scalable Infrastructru (Score:2)
Re:not enough servers? (Score:5, Informative)
This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.
Source: I work at Amazon.
Re: (Score:3, Insightful)
AWS wasn’t built for amazon.com use. It was built with excess amazon.com capacity.
That’s a significant difference.
Re: (Score:1)
Except that AWS was built for internal use with the view to make it trivial to open up later. It runs on the same system.
Nope. Pure propaganda from the early days of AWS. AWS was an external product from the beginning, with hopes that one day the retail side would be able to use it. Not so much.
What is true is that Amazon will have dedicated servers running AWS purely for their own use, and priority over other AWS users for shared resources.
Nope. All "reserved instances" have highest priority, and there's never really a problem with that capacity that lasted more than a couple minutes. EC2 "on-demand" instances come next, but they'll never terminate one to give it to anyone else. The only servers they'll take away from people are "Spot" instances, and they're up fro
Re: (Score:3)
This has nothing to do with AWS auto-scaling. The system that had issues doesn't run in public AWS. I can't say more than that unfortunately, but some random professor speculating based on leaked posts without any knowledge of the actual systems involved is a terrible source of information.
Source: I work at Amazon.
Why not? The private dog food tastes better?
Re: (Score:3)
Why not? The private dog food tastes better?
No, the non-AWS stuff is pure garbage. But it would take engineering effort to move to AWS, and management would have to fund that effort instead of their own pet projects.
Re: (Score:2)
I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that. But it makes lots of sense to keep separate pools of servers for AWS vs Amazon.com, for both security and reliability.
Re: (Score:2)
I doubt that. When I worked there in the mid 2000s, what amazon had internally was light years ahead (pun intended, since it was named Apollo) of anything I'd seen outside of it. I highly doubt they haven't continued to invest in that.
You should doubt more. It's all deep legacy stuff now, most of it the same systems that were innovative 15 years ago. Most of the rest of the world has moved on to implicitly auto-scaling container-based solutions, where dev teams never muck with "servers" in any way. Google (which is hardly leading edge these days) has been containerized for years. Even Azure offers self-scaling fleets with some abstraction away from explicit server types.
And the stuff you remember was never made to work on AWS, so if
The real problem with prime day (Score:2, Funny)
Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.
Thanks, Obama.
Re: (Score:2)
But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.
Re: (Score:2)
But if you were already planning on buying something, it was worth waiting for Prime Day to buy it.
Only if it was discounted on prime day; the only thing I was thinking of buying from Amazon on prime day turned out to be twenty bucks cheaper from walmart. And, of course, only if you actually managed to place your order, given that Amazon failed at reliability.
Re: (Score:2)
Well, good for Walmart then. I can't even imagine the number of sales Amazon lost over those glitches.
Re: (Score:2)
Here's all they need to know!
https://www.youtube.com/watch?... [youtube.com]
Classic rookie AWS provisioning mistake... (Score:2)
One of the first things you need to do when setting up an environment in AWS is to get them to increase your (artificially low) server limits for each instance type you're planning on using. Otherwise, you're going to run into those limits at the worst possible time when you need to rapidly scale your servers.
While I understand why they do this (probably to protect themselves from having someone spin up 1,000 cryptocoin mining instances with a hacked account), it's refreshing to see Amazon get bit by their
You had to be patient (Score:2)
I tried for hours to order the Amazon Fire 7" (8GB) for the low price of CAD$40, but the page kept changing. Sometimes it would be available, sometimes it would be disabled and only the 16GB was available, sometimes the 8GB option completely disappeared as if it didn't even exist, other times it was available from a third-party non-Amazon seller for nearly twice the price.
It kept doing that every single time the page loaded and I was reloading it roughly once per second.
What's also weird is that once every
Re: (Score:2)
It's still probably the best low-cost "brand-name" tablet out there for Netflix. The Fire HD 7" specifications are at least twice as good as the crap available around here at twice that price.
Who would have thought (Score:5, Funny)
Amazon got slashdotted.
Honestly (Score:1)