
How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com)

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve the site's performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar, one of the external experts who reviewed the documents, says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first rather than increasing the number of servers immediately, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
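
For readers unfamiliar with the mechanics, the sketch below contrasts what an auto-scaler is supposed to do with the kind of manual capacity push described above. It is purely illustrative: it uses the public AWS EC2 Auto Scaling API via boto3, the group and policy names are made up, and (as commenters note below) Amazon's retail site largely does not run on AWS, so its internal tooling will differ.

    # Illustrative sketch only: contrasts policy-driven scaling with a manual
    # capacity override, using the public EC2 Auto Scaling API (boto3).
    # The group/policy names are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Normal operation: a target-tracking policy adds or removes instances so
    # that average CPU stays near 60% -- the "detect fluctuations and adjust
    # capacity" behavior described in the article.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="retail-frontend-asg",  # hypothetical group name
        PolicyName="keep-cpu-near-60",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,
        },
    )

    # Incident operation: when the policy cannot keep up (or capacity runs
    # out), an operator forces the desired instance count directly -- roughly
    # what adding 50 to 150 hosts by hand amounts to.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="retail-frontend-asg",
        DesiredCapacity=150,
        HonorCooldown=False,  # skip the cooldown throttle during an incident
    )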



Comments:
  • by darkain ( 749283 ) on Tuesday July 24, 2018 @03:07AM (#56998942) Homepage

    The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to shove a few "sales" at a bunch of products, and then contact literally every media outlet in the country to promote it? Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, run annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.

    • by Anonymous Coward on Tuesday July 24, 2018 @03:35AM (#56998994)

      That sounds good and that's probably what used to happen.

      Last year (when I was still employed by them), Prime Day was a VERY big deal; they count on it toward the value of their stock.

      Unfortunately, Amazon is a very reactive company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.

      I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.

      • Re: (Score:2, Informative)

        by Anonymous Coward

        That is rarely the case.

        I have never seen Amazon, or EC2 specifically, be an organization of blame. The reality is they ask people to go mind-numbingly fast a lot of the time, and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and what could have been done better. Consistently fail as an organization and higher-ups start asking what you are doing as an org to solve your problems.

        • I would suggest that if you're averse to failure to the point of never really trying something difficult, you're already failing.

          There are a lot of things that teach us, but failure is one of the greatest teachers of all.

          Or as my dad used to say (probably stolen from elsewhere), "If you aren't failing, you're not trying hard enough"

    • Joke's on you (Score:1, Insightful)

      by Anonymous Coward
      Joke's on you, because most of Amazon retail doesn't actually run on AWS. It uses its own deployment system, server management, data storage, etc.
    • ...This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, run annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.

      "Currently out of capacity for scaling,"

      And I'm still scratching my head trying to figure out what the real problem was, because running out of hardware is kind of what that status update implies, not that there was some glitch to fix in "new infrastructure code".

      And scale things up by hand? On Prime Day? That's kind of like telling a NASCAR pit crew they're gonna have to change tires on the car while it's still racing around the track.

      • by Anonymous Coward
        People were not provisioning servers manually. They were triggering provisioning actions manually instead of relying on autoscaling. In particular, autoscaling actions are throttled to avoid rapid changes, and rapid changes were exactly what was needed there (a rough sketch of that trade-off follows below).

        I recognize the quotes in the article - they are from the trouble ticket for the event (all of Amazon was watching it, so no wonder it leaked). So let me just say, there was no one single cause of failure.
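
        A toy model of the throttling trade-off described above, just to make the point concrete. The class and the numbers are entirely hypothetical; whatever Amazon retail uses internally is far more involved.

            # Hypothetical sketch: a cooldown-throttled auto-scaler vs. a manual
            # override during a sustained traffic spike.
            from dataclasses import dataclass

            @dataclass
            class ThrottledScaler:
                hosts: int = 100
                step: int = 20             # hosts added per automatic action
                cooldown_s: float = 300.0  # minimum gap between automatic actions
                _last_action: float = float("-inf")

                def auto_scale_up(self, now: float) -> bool:
                    """Automatic path: refuses to act until the cooldown expires."""
                    if now - self._last_action < self.cooldown_s:
                        return False  # throttled: demand grows, capacity doesn't
                    self.hosts += self.step
                    self._last_action = now
                    return True

                def manual_scale_to(self, target: int) -> None:
                    """Operator override: jump straight to the capacity needed."""
                    self.hosts = max(self.hosts, target)

            scaler = ThrottledScaler()
            # Load checks fire every 30s during the spike, but only one automatic
            # action lands inside the 5-minute cooldown window.
            for t in range(0, 300, 30):
                scaler.auto_scale_up(now=float(t))
            print(scaler.hosts)           # 120: one step, despite sustained overload
            scaler.manual_scale_to(250)   # triggering provisioning manually
            print(scaler.hosts)           # 250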
      • Welcome to IT. As a work colleague once said, "we can do the impossible, but miracles take a bit longer".
    • However, since Amazon has years of data and trending information, they should have been able to predict how much demand there was going to be, and made sure the infrastructure was prepped and ready for the event.

      It is kind of a high-stakes game to play for load testing.

      • by Wolfrider ( 856 )

        --This does not appear to be a failure of Prime Day's IT team - it's a Failure of Manglement. Stress the people under you too hard and don't give them what they need, and this is the kind of chit that happens.

        --Amazon has been in business for a fairly long time - long enough for upper manglement to become stultified. I doubt they'll learn much from this debacle (but I'd be glad to be proven wrong.) They'll probably continue to treat their warehouse workers like chattel tho.

    • by Anonymous Coward on Tuesday July 24, 2018 @09:10AM (#56999932)

      Recent Amazon employee here, posting AC for obvious reasons.

      The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.

      No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.

      They need to be able to test AWS with massive demand on unpredictable pages

      Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.

      The big difference this year is that something didn't work right

      Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden on scaling up your fleet for Prime Day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.

      How'd that work out for you, Amazon?

    • by imidan ( 559239 )

      The entire point of "prime day", ... is to test out their infrastructure.

      Interesting, and that makes sense. I was thinking about it the other day. I've never bought anything from Amazon on Prime Day, mainly because every time I look at the sale items, they seem to be a bunch of junk that I have no interest in or need for. I'd started to think of it as a typical "clearance" sale: they were trying to make space in the warehouse (for the upcoming Xmas season) by ditching their leftover junk at low prices.

    • Their test ended up being an advertisement for Azure.
  • I don't know about anyone else but I couldn't buy anything for a solid 5 hours or more, checking intermittently, because it wouldn't let me checkout. I did see a lot of Amazon dogs though.. like a lot..
  • The whole site worked for me but I couldn't checkout. It didn't seem like a server capacity issue to me.
  • Like Amazon Web Services for example...

  • by Anonymous Coward

    Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.

    Thanks, Obama.

  • One of the first things you need to do when setting up an environment in AWS is to get them to increase your (artificially low) server limits for each instance type you're planning on using. Otherwise, you're going to run into those limits at the worst possible time, when you need to rapidly scale your servers (a rough sketch of such a request follows after this comment).

    While I understand why they do this (probably to protect themselves from having someone spin up 1,000 cryptocoin mining instances with a hacked account), it's refreshing to see Amazon get bit by their
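
    A minimal sketch of checking and requesting that kind of limit increase through the AWS Service Quotas API via boto3, for anyone who hasn't done it. The desired value is made up, pagination is ignored for brevity, and newer accounts express On-Demand limits in vCPUs rather than per instance type, but the idea is the same.

        # Sketch: look up the EC2 On-Demand quota and file an increase request
        # ahead of a planned traffic spike. The DesiredValue is hypothetical.
        import boto3

        quotas = boto3.client("service-quotas")

        # Ignoring pagination for brevity; the quota of interest may be on a
        # later page in some regions.
        for q in quotas.list_service_quotas(ServiceCode="ec2")["Quotas"]:
            if "On-Demand Standard" in q["QuotaName"]:
                print(q["QuotaName"], "current limit:", q["Value"])
                # Increase requests are reviewed by AWS and are not instant,
                # so file them well before the event.
                quotas.request_service_quota_increase(
                    ServiceCode="ec2",
                    QuotaCode=q["QuotaCode"],
                    DesiredValue=2048.0,  # hypothetical vCPU ceiling
                )
                break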

  • I tried for hours to order the Amazon Fire 7" (8GB) for the low price of CAD$40, but the page kept changing. Sometimes it would be available, sometimes it would be disabled and only the 16GB was available, sometimes the 8GB option completely disappeared as if it didn't even exist, other times it was available from a third-party non-Amazon seller for nearly twice the price.

    It kept doing that every single time the page loaded and I was reloading it roughly once per second.

    What's also weird is that once every

  • by nospam007 ( 722110 ) * on Tuesday July 24, 2018 @10:07AM (#57000224)

    Amazon got slashdotted.

    • Slashdotted! Ha, well played... I have to wonder if it was all bona fide customer demand that toppled it, though.
  • The Prime Day thing is pretty skeezy - tons of no-name brand items whose prices were inflated for the sale day so they could "slash prices" and offer you the low, low discounted price of what it normally sells at - but with the bigger price it never sold at crossed out. It was entirely an exercise in preying on people's gullibility; people saw these huge "discounts" and made impulse buys thinking this super-short special shopping day was saving them money. And of course, you had to buy the prime membership in
