
How Amazon Scrambled To Fix Prime Day Glitches (cnbc.com) 69

Amazon's Prime Day shopping event last week was riddled with glitches. Roughly 15 minutes into the sale, the landing page stopped working. Some users saw an error page featuring the "dogs of Amazon" and were never able to enter the site; others got caught in a loop of pages urging them to "Shop all deals." According to internal documents obtained by CNBC, it appears that Amazon failed to secure enough servers to handle the traffic surge, causing it to launch a scaled-down backup front page and temporarily kill off all international traffic. From the report: The e-commerce giant also had to add servers manually to meet the traffic demand, indicating its auto-scaling feature may have failed to work properly leading up to the crash, according to external experts who reviewed the documents. "Currently out of capacity for scaling," one of the updates said about the status of Amazon's servers, roughly an hour after Prime Day's launch. "Looking at scavenging hardware." A breakdown in an internal system called Sable, which Amazon uses to provide computation and storage services to its retail and digital businesses, caused a series of glitches across other services that depend on it, including Prime, authentication and video playback, the documents show.

Amazon chose not to shut off its site. Instead, it manually added servers so it could improve site performance gradually, according to the documents. One person wrote in a status update that he was adding 50 to 150 "hosts," or virtual servers, because of the extra traffic. Caesar, one of the external experts who reviewed the documents, says the root cause of the problem may have to do with a failure in Amazon's auto-scaling feature, which automatically detects traffic fluctuations and adjusts server capacity accordingly. The fact that Amazon cut off international traffic first rather than immediately increasing the number of servers, and added server power manually instead of automatically, is an indication of a breakdown in auto-scaling, a critical component when dealing with unexpected traffic spikes, he said.
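For readers unfamiliar with what "auto-scaling" means here, a minimal threshold-based sketch illustrates the idea of detecting a traffic level and adjusting host count to match. All names and numbers are hypothetical; Amazon's internal tooling is not public.

```python
# Minimal auto-scaler sketch (hypothetical; not Amazon's internal system).
# Picks a host count from observed traffic, clamped to available capacity.

def desired_hosts(current_hosts, requests_per_sec, target_rps_per_host=500,
                  min_hosts=2, max_hosts=1000):
    """Return how many hosts we'd want for the observed traffic."""
    needed = -(-requests_per_sec // target_rps_per_host)  # ceiling division
    return max(min_hosts, min(max_hosts, needed))

# Normal traffic: 10 hosts comfortably handle 5,000 req/s.
print(desired_hosts(10, 5_000))    # 10
# A spike to 20x traffic: the scaler asks for 200 hosts.
print(desired_hosts(10, 100_000))  # 200
# If available capacity (max_hosts) is too low, you are "out of capacity
# for scaling" no matter what the scaler computes.
print(desired_hosts(10, 100_000, max_hosts=120))  # 120
```

The last case mirrors the status update quoted above: the scaling logic can be working perfectly and still be stuck if there is no hardware left to scale onto.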

This discussion has been archived. No new comments can be posted.

Comments Filter:
  • by darkain ( 749283 ) on Tuesday July 24, 2018 @02:07AM (#56998942) Homepage

    The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure. They can test using simulated connections, but that only goes so far. They need to be able to test AWS with massive demand on unpredictable pages, and have the system scale appropriately. What better way to do this than to put a few "sales" on a bunch of products, and then contact literally every media outlet in the country to promote it. Seriously, name a local news channel NOT hyping the prime day event. This is simply Amazon creating quite possibly the world's largest single-day beta test of new infrastructure code, done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up by hand.

    • by Anonymous Coward on Tuesday July 24, 2018 @02:35AM (#56998994)

      That sounds good and that's probably what used to happen.

      Last year (when I was still employed by them) Prime day was a VERY big deal; they count on that toward the value of their stocks.

      Unfortunately, Amazon is a very reactionary company - they look at the job that is in front of them (RIGHT in front of them) and focus on that to the exclusion of all else. It's likely that multiple departments knew the failures were going to happen but they were unable to get upper management to take them seriously (because working on something that is supposed to already work slows us down from working on what does not yet work, you see). I bet management is taking them seriously now.

      I guarantee some poor underpaid schmuck will lose his job over this instead of the management staff that's responsible for an overwhelming workload.

      • Re: (Score:2, Informative)

        by Anonymous Coward

        That is rarely the case.

        I have never seen Amazon, or specifically EC2, to be an organization of blame. The reality is they ask people to go mind-numbingly fast a lot of the time and shit happens. There are a number of things people do to try to curtail failure, but it's pretty much by the seat of the pants most days. This is the reason there is a reflection on failure and what could have been done better. Consistently fail as an organization and higher-ups start asking what you are doing as an org to solve your problems.

        • I would suggest that if you're averse to failure to the point of never really trying something difficult, you're already failing.

          There are a lot of things that teach us, but failure is one of the greatest teachers of all.

          Or as my dad used to say (probably stolen from elsewhere), "If you aren't failing, you're not trying hard enough"

    • Joke's on you (Score:1, Insightful)

      by Anonymous Coward
      Joke's on you, because most of Amazon retail doesn't actually run on AWS. It uses its own deployment system, server management, data storage, etc.
    • ...This is simply Amazon creating quite possibly the worlds largest single day beta test of new infrastructure code, and done annually. The big difference this year is that something didn't work right, so engineers were right on the spot to scale things up manually by hand.

      "Currently out of capacity for scaling,"

      And I'm still scratching my head trying to figure out what the real problem was, because running out of hardware is kind of what the comment above implies, not that there was some glitch to fix in "new infrastructure code".

      And scale things up by hand? On Prime Day? That's kind of like telling a NASCAR pit crew they're gonna have to change tires on the car while it's still racing around the track.

      • by Anonymous Coward
        People were not provisioning servers manually. They were triggering provisioning actions manually instead of relying on autoscaling. In particular, autoscaling actions are throttled to avoid rapid changes, and rapid changes were exactly what was needed there.

        I recognize the quotes in the article - they are from the trouble ticket for the event (all of Amazon was watching it, so no wonder it leaked). So let me just say: there was no single cause of failure.
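The throttling this commenter describes can be sketched in a few lines: if each scaling action is capped in size and gated by a cooldown, a throttled scaler converges slowly after a sudden spike, which is why operators might trigger actions by hand instead. The step size and cooldown below are invented for illustration, not Amazon's actual settings.

```python
# Sketch of cooldown-throttled scale-up (illustrative numbers only).
# Each action adds at most max_step hosts, then waits cooldown_s seconds.

def scale_steps(current, target, max_step=50, cooldown_s=300):
    """Yield (elapsed_seconds, host_count) as a throttled scaler converges."""
    elapsed = 0
    while current < target:
        current = min(target, current + max_step)
        elapsed += cooldown_s
        yield elapsed, current

# Going from 100 to 500 hosts at 50 hosts per 5-minute cooldown
# takes 8 actions and 40 minutes of wall-clock time.
steps = list(scale_steps(100, 500))
print(steps[-1])  # (2400, 500)
```

Forty minutes is an eternity during a launch-hour spike, which fits the timeline in the summary: the landing page broke about 15 minutes in, and capacity updates were still circulating an hour after launch.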
      • Welcome to IT, a work colleague once said "we can do the impossible, but miracles take a bit longer".
    • However, since Amazon has years of data and trending information, they should have been able to predict how much demand was going to happen, and made sure the infrastructure was prepped and ready for the event.

      It is kind of a high-stakes game to play for load testing.

      • by Wolfrider ( 856 )

        --This does not appear to be a failure of Prime Day's IT team - it's a Failure of Manglement. Stress the people under you too hard and don't give them what they need, and this is the kind of chit that happens.

        --Amazon has been in business for a fairly long time - long enough for upper manglement to become stultified. I doubt they'll learn much from this debacle (but I'd be glad to be proven wrong.) They'll probably continue to treat their warehouse workers like chattel tho.

    • by Anonymous Coward on Tuesday July 24, 2018 @08:10AM (#56999932)

      Recent Amazon employee here, posting AC for obvious reasons,

      The entire point of "prime day", which actually started many years ago with a massive sale selling XBox consoles for $100/ea, is to test out their infrastructure.

      No, it's a straightforward copy of Singles Day. Singles Day is the biggest shopping day in the world, so Amazon figured they could invent their own shopping holiday and people would go for it.

      They need to be able to test AWS with massive demand on unpredictable pages

      Almost none of Amazon retail runs on AWS. There are islands here and there, but for the most part they're still unrelated after all these years.

      The big difference this year is that something didn't work right

      Yeah, here's what didn't work right: in a cost-cutting effort, upper management imposed a huge paperwork burden on scaling up your fleet for prime day. Some teams clearly decided to take risks with a smaller fleet instead of jumping through flaming hoops to justify the exact number of servers they'd need to scale up to.

      How'd that work out for you, Amazon?

    • by imidan ( 559239 )

      The entire point of "prime day", ... is to test out their infrastructure.

      Interesting, and that makes sense. I was thinking about it the other day. I've never bought anything from Amazon on Prime Day, mainly because every time I look at the sale items, they seem to be a bunch of junk that I have no interest in or need for. I'd started to think of it as a typical "clearance" sale, that they were trying to make space in the warehouse (for upcoming Xmas) by ditching their leftover junk at low prices.

    • Their test ended up being an advertisement for Azure.
  • I don't know about anyone else but I couldn't buy anything for a solid 5 hours or more, checking intermittently, because it wouldn't let me checkout. I did see a lot of Amazon dogs though.. like a lot..
  • The whole site worked for me but I couldn't checkout. It didn't seem like a server capacity issue to me.
  • Like Amazon Web Services for example...

  • by Anonymous Coward

    Is that I wasn't able to buy anything on prime day because I don't have any money. Epic fail. LULZ.

    Thanks, Obama.

  • One of the first things you need to do when setting up an environment in AWS is to get them to increase your (artificially low) server limits for each instance type you're planning on using. Otherwise, you're going to run into those limits at the worst possible time when you need to rapidly scale your servers.

    While I understand why they do this (probably to protect themselves from having someone spin up 1,000 cryptocoin mining instances with a hacked account), it's refreshing to see Amazon get bit by their
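The failure mode this commenter describes (hitting an account-level instance quota mid-scale-up) can be sketched as follows. The quota values and the `launch_instances` helper are invented for illustration; the real AWS behavior is that a launch request fails once it would exceed your per-instance-type limit, and raising the limit requires a support request filed in advance.

```python
# Illustrative sketch of a scale-up request hitting an account quota.
# Numbers and names are hypothetical, not real AWS limits.

class QuotaExceeded(Exception):
    pass

def launch_instances(running, requested, quota):
    """Return the new running count, or raise if the quota would be exceeded."""
    if running + requested > quota:
        raise QuotaExceeded(
            f"requested {requested}, but only {quota - running} "
            f"instances remain under the quota of {quota}")
    return running + requested

print(launch_instances(10, 5, quota=20))   # 15
try:
    launch_instances(18, 5, quota=20)      # the worst possible time to find out
except QuotaExceeded as e:
    print("scale-up failed:", e)
```

The point of raising limits up front is exactly the parent's: the quota check only bites when you suddenly need the headroom, which is the moment you can least afford the delay.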

  • I tried for hours to order the Amazon Fire 7" (8GB) for the low price of CAD$40, but the page kept changing. Sometimes it would be available, sometimes it would be disabled and only the 16GB was available, sometimes the 8GB option completely disappeared as if it didn't even exist, other times it was available from a third-party non-Amazon seller for nearly twice the price.

    It kept doing that every single time the page loaded and I was reloading it roughly once per second.

    What's also weird is that once every

  • by nospam007 ( 722110 ) * on Tuesday July 24, 2018 @09:07AM (#57000224)

    Amazon got slashdotted.

    • Slashdotted ! Ha, well played... I have to wonder if it was all bona fide customer demand that toppled it though.
  • The Prime Day thing is pretty skeezy - tons of no-name brand items whose prices were inflated for the sale day so they could "slash prices" and offer you the low low discounted price of what it normally sells at - but with the bigger price it never actually sold at crossed out. It was entirely an exercise in preying on people's gullibility: people who saw these huge "discounts" and made impulse buys thinking this super short special shopping day was saving them money. And of course, you had to buy the prime membership in
