AWS Outage Has Taken Down a Big Chunk of the Internet (theverge.com)
Amazon Web Services (AWS), Amazon's internet infrastructure service that is the backbone of many websites and apps, is experiencing a major outage affecting a large portion of the internet. From a report: "Kinesis has been experiencing increased error rates this morning in our US-East-1 Region that's impacted some other AWS services," Amazon said in a statement to The Verge. "We are working toward resolution." And, ironically, in a notice on the AWS Service Health Dashboard, Amazon said the issue has apparently "affected our ability to post updates" to that dashboard. "We continue to work towards recovery of the issue affecting the Kinesis Data Streams API in the US-EAST-1 Region," Amazon said in a 1:47PM ET update posted to the dashboard. "For Kinesis Data Streams, the issue is affecting the subsystem that is responsible for handling incoming requests. The team has identified the root cause and is working on resolving the issue affecting this subsystem."
Well now what? (Score:5, Insightful)
Re:Well now what? (Score:5, Insightful)
In future news: Microsoft Windows 10 has stopped working for millions of people because they couldn't access the subscription server.
That would be Azure, not AWS.
Nevertheless. This will be a wake-up call for all those hipster executives that think "the cloud" is cheaper than owning datacenters.
It might be, until they find out that they don't get to make decisions on what gets fixed first, since they don't own the infrastructure or manage the engineers fixing the network.
Re:Well now what? (Score:5, Insightful)
They're paying more for the engineers. It's just hidden in the hosting budget, so they can close their eyes and keep telling themselves it's not real.
Re: (Score:2, Insightful)
You still need engineers. It's not like AWS instances run themselves. All you're saving is managing server hardware and networking gear (which is honestly kind of nice to offload).
And managers are strangely more willing to deal with monthly expenses than capex.
Not if but when (Score:2)
It's not if it happens but when it happens that the system your company spent $5,000,000 building goes offline for months due to an 'unexpected' feature change, regulation change, OS patch, or hardware failure when you host it in the cloud.
The cowboy coding of using every possible cloud technology you can get at AWS or Azure is responsible for lots of the upcoming mess.
But the In-Flight magazine (Score:2)
says that it is cool.
And that is what managers read.
Re: (Score:3)
You still need some networking gear to keep the office on the net, particularly the people who are running the cloud instances.
Re: (Score:3)
You're also getting security staff far and away better than you're likely to be able to afford otherwise. Just the security department for AWS is larger than the entire IT department of most large corporations, and they're very, very good.
**Full Disclosure**
I work at Amazon and formerly AWS, but in physical security. Nothing to do with this (fortunately).
Re: (Score:3)
They DO have to pay people to admin their cloud services. They also pay for the engineers in the form of fees for the cloud service. They COULD just make sure one or two people are able to handle a physical server. When you add up the cloud fees, downtime costs, and the people you have to keep on hand anyway, the cloud might not be as attractive as the people drinking the Kool-Aid claim.
Re:Well now what? (Score:4, Insightful)
Risk not cost.
Pick any two: cheap, fast, correct.
Re: (Score:3)
This will be a wake-up call for all those hipster executives that think "the cloud" is cheaper than owning datacenters.
"The Cloud" is not only cheaper, but the total downtime is going to be a lot less than a bespoke datacenter run by people they hired off of Craigslist.
Before you say "hire better people", understand that a typical CEO doesn't have the expertise to judge who is "better" at IT. Amazon does have that expertise.
For most companies "owning datacenters" is neither cheap nor reliable.
Re: (Score:1)
Re: (Score:2)
The servers I run have far less downtime than the cloud.
Re: (Score:3, Insightful)
Cloud is neither better nor cheaper. It is at first glance more convenient. And that is what "engineers" nowadays value most.
Give it a bit of time and you'll get IT specialists that can only do one thing. By now, you (as a company) need to hire many more services externally, as you can't rely anymore on any in-house skill-set. 3rd party service providers will have you (as a company) over a barrel. And you'll pay through the nose for the privilege.
By now you (as a company) are so dependent on 3rd party servic
Re: (Score:2)
As if a PHB yelling at people ever got stuff fixed faster.
Re: (Score:3)
It might be, until they find out that they don't get to make decisions on what gets fixed first, since they don't own the infrastructure or manage the engineers fixing the network.
They don't have to make those decisions, because outages at AWS are indiscriminate; the same outage can affect both a Fortune 500 company and a 10-person startup. Because AWS has so many customers across a broad range of pricing tiers, it behooves them to fix problems right away. They don't have the luxury of saying "this is only affecting our bottom tier of customers so we can goof off all we want."
Re:Well now what? (Score:4, Interesting)
Or you could implement "cloud" in a way that the cloud providers themselves recommend: multi-region hosting.
All of today's problems are in the us-east-1 (Virginia) region. If you are in us-west-2 or us-east-2 you are largely unaffected.
Combine the following:
1. cross-region database backups
2. cross-region S3 bucket replication
3. privileged info (passwords, API keys, etc.) in a secrets management system (see: Vault) with its database backed up or mirrored cross-region
4. infrastructure as code, checked into git, mirrored cross-region.
Now let's say that us-east-1 gets hit by a meteor. Your data is already sitting across the continent in Oregon either up and running in the case of the S3 buckets, or restorable from your nightly database snapshots, and you can run 'terraform apply' during that database restore pointing at us-west-2, and all your stuff comes back up - just change your DNS to point at the new load balancer if your Terraform doesn't do that for you.
Major outage just turned into a disaster recovery exercise and you're back up in an hour or two. And you look like a god damn genius while the rest of the Internet whines about cloud providers when it's really their own fault for not planning to have a truly resilient architecture for literally no more money.
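The recovery sequence described above can be sketched as a small runbook generator. This is only an illustration under stated assumptions: the `Step` type and `failover_plan` function are hypothetical, nothing is actually executed, and the `region` Terraform variable is an assumption about how your own configuration is parameterized.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    command: str  # illustrative text only; this sketch never runs anything


def failover_plan(failed_region: str, standby_region: str) -> List[Step]:
    """Return the recovery steps, in the order they are started."""
    if failed_region == standby_region:
        raise ValueError("standby region must differ from the failed region")
    return [
        # The restore can run while Terraform rebuilds the rest of the stack.
        Step("restore database",
             f"restore the latest nightly snapshot in {standby_region}"),
        # Assumes the Terraform config exposes a variable named `region`.
        Step("rebuild infrastructure",
             f'terraform apply -var="region={standby_region}"'),
        Step("repoint DNS",
             f"point DNS at the new load balancer in {standby_region}"),
    ]


if __name__ == "__main__":
    for number, step in enumerate(failover_plan("us-east-1", "us-west-2"), start=1):
        print(f"{number}. {step.name}: {step.command}")
```

Writing the plan down as data, rather than keeping it in someone's head, makes it something you can review and rehearse before the meteor arrives.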
Re: (Score:2)
i was agreeing with everything you said until you mentioned "truly resilient architecture for literally no more money"
this is categorically false. if you have your resources in multiple az's/regions, you are going to pay for the additional resources. and this is not even involving peering costs if needed.
yes you can do it. but it takes a lot of work and it's expensive
Re: (Score:2)
Except you missed the point of my post - you don't have stuff up in those other regions other than database snapshots on either RDS or S3 until an event like this happens. Then you apply your Terraform with a different AWS provider specifying the different region.
I suppose technically you would spend a few dollars on the insanely cheap storage that is S3, but the difference is so negligible as to not matter. Obviously, the more of your backup environment you leave on hot standby, the more it will cost - b
Re: (Score:2)
Nevertheless. This will be a wake-up call for all those hipster executives that think "the cloud" is cheaper than owning datacenters.
Depends. If the same problem hit one's own datacenter, how many companies' support teams could fix the problem faster than AWS?
I would bet most would do worse.
Re: (Score:2)
Nevertheless. This will be a wake-up call for all those hipster executives that think "the cloud" is cheaper than owning datacenters. It might be, until they find out that they don't get to make decisions on what gets fixed first, since they don't own the infrastructure or manage the engineers fixing the network.
Sure, they can decide what gets fixed first much more often if they own the data center.
Re: Well now what? (Score:2)
How often is the cloud down, and for how long? Now compare that to your typical self-host in a data center. If the uptime is better, does it matter where the servers are located?
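To make that comparison concrete, downtime can be converted into an availability figure and back. A minimal sketch, assuming a 365-day year; the function names are made up for illustration and the numbers in the usage lines are placeholders, not measurements of AWS or of anyone's datacenter.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target such as 0.999."""
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1 - target) * period_hours


if __name__ == "__main__":
    # Placeholder inputs: one 8-hour outage in a year, and a "three nines" target.
    print(f"availability after 8 h down: {availability(8):.4%}")
    print(f"yearly budget at 99.9%: {allowed_downtime_hours(0.999):.2f} h")
```

Run the same two functions over your own outage log and the provider's, and the question of where the servers sit becomes a question about two numbers.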
Re: (Score:2)
Yeah, but they can blame any failure on someone else, which is the main selling point of the cloud.
Re: (Score:2)
Nevertheless. This will be a wake-up call for all those hipster executives that think "the cloud" is cheaper than owning datacenters.
Why would it be a wakeup call? You are assuming that this cloud outage is more severe than that caused by the infrastructure managed by said executives.
People love pointing to Cloud outages while continuously ignoring that the uptime despite those outages is typically far better than the uptime of a company's own IT systems.
Re: (Score:2)
SolidWorks does that to me all the time the bastards.
An obvious solution is to stop getting your software from bastards.
I don't have that problem with FreeCAD. No cost = No subscription server.
Plus, it uses Python as a scripting language, rather than VBA for Solidworks.
Re: (Score:3)
For the moment, your only realistic choice for solid modeling software is to get it from bastards.
1. Calculate how much you will pay for Solidworks over the next 10 years
2. Make a list of the features you rely on in Solidworks that are missing from FreeCAD.
3. Contact the FreeCAD developers and offer them the money from #1 to implement the features in #2.
Re: (Score:2)
But what if the cost of #1 is less than the cost of #3 ? It likely is for any individual company - development costs a lot, especially if you want it of high enough quality for daily use.
Re: (Score:2)
Plus, you don't have to worry about losing the ability to use your own hard work because it's locked up in a proprietary format and they decided to alter the deal further.
Re: (Score:2)
FreeCAD is no substitute for SolidWorks. Sorry. Not even close. I use FreeCAD a fair bit and find it to be fairly powerful for my simple needs, if extremely quirky and buggy. But I have no illusions that it can do more than a fraction of what people do with SolidWorks.
Re: (Score:2)
For power-users, what you say is likely true.
But many people who pay for Solidworks, perhaps the majority, actually have simple needs that can be met by FreeCAD.
FreeCAD works for me. Since I am a software nerd, I do much of my design work by writing scripts that are then executed to generate the design. If I need to change the design, I can often do so by changing one constant and re-running the script. A non-nerd may spend hours making the same change by hand.
For script-driven development, FreeCAD is sup
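As a rough illustration of the "change one constant and re-run" workflow described in this comment, here is a sketch in plain Python. It deliberately does not use FreeCAD's API; `bolt_circle`, `HOLE_COUNT`, and `RADIUS_MM` are hypothetical names, and inside FreeCAD a script would pass coordinates like these to its own modeling calls.

```python
import math

HOLE_COUNT = 6    # the one constant to change when the design changes
RADIUS_MM = 40.0  # bolt-circle radius, in millimetres


def bolt_circle(hole_count, radius_mm):
    """Return the (x, y) centres of holes evenly spaced around a circle."""
    if hole_count < 1:
        raise ValueError("need at least one hole")
    step = 2 * math.pi / hole_count  # angle between neighbouring holes
    return [
        (radius_mm * math.cos(i * step), radius_mm * math.sin(i * step))
        for i in range(hole_count)
    ]


if __name__ == "__main__":
    for x, y in bolt_circle(HOLE_COUNT, RADIUS_MM):
        print(f"hole at x={x:8.3f} mm, y={y:8.3f} mm")
```

Going from six holes to eight is a one-character edit here, whereas repositioning each hole by hand in a GUI is the hours-long version of the same change.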
Re: (Score:1)
Oracle cloud was down this am also (Score:2)
I do not think that OCI has dependencies on AWS... does it?
Ellison and Bezos crossed their streams? (Score:2)
Oh, the irony, if Oracle has a dependency on AWS. Who would have thought that Ellison and Bezos would ever cross their streams?
Re: Oracle cloud was down this am also (Score:1)
All My Favorite Sites Are Still Working (Score:2)
Sounds like this is a little overblown to me. Also looks like my favorite sites aren't stupid enough to use AWS :)
Re: (Score:3)
Well architected sites won't be affected by failures in a single availability zone.
Re: (Score:2)
Well architected sites won't be affected by failures in a single availability zone.
Very true. Although in this instance, it looks like some AWS services like Kinesis are impacted more broadly. Heck, Amazon is even having trouble updating the outage page because they use Kinesis to push the status updates to it. So you can do everything right but if the cloud provider PaaS services go bits up due to a site outage, you are still hosed. Seems like Amazon needs to look at their own services architecture.
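The point about a status page depending on the very service that is failing is really a question about transitive dependencies. A minimal sketch, assuming a made-up dependency map: only the dashboard-on-Kinesis edge comes from the comment above, and every other service name, along with the `impacted_by` helper, is hypothetical.

```python
from collections import deque

# Hypothetical map: service -> the services it depends on.
DEPENDS_ON = {
    "status-dashboard": ["kinesis"],
    "metrics-pipeline": ["kinesis"],
    "alerting": ["metrics-pipeline"],
    "static-site": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = set()
    queue = deque([failed])
    while queue:  # breadth-first walk outward from the failed service
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("kinesis", DEPENDS_ON)))
```

If the tool you use to announce outages shows up in the result for one of your core services, you have found the problem before the outage finds it for you.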
Re: (Score:2)
Well architected sites won't be affected by failures in a single availability zone.
I guess that we can now conclude that Amazon Prime music streaming website is not well-architected. (But I always kind of suspected that anyway, given its general level of wonkiness.)
Re: (Score:2)
Well architected sites won't be affected by failures in a single availability zone.
I guess that we can now conclude that Amazon Prime music streaming website is not well-architected. (But I always kind of suspected that anyway, given its general level of wonkiness.)
While Netflix was running fine (acceptably?) on AWS.
My understanding is that the first step to dealing with a major outage is to stop running Chaos Monkey on production.
Re: (Score:2)
Good, or bad, news then: this wasn't a problem with an "availability zone," but rather with 3rd party services.
So while what you said may be true, generally, in the context it is an incorrect diagnosis, and hilariously arrogant considering.
Well Architected Sites (Score:1)
Well architected sites won't be affected by failures in a single availability zone.
Such well architected sites, that span availability zones, usually fail to see the light of day due to budget constraints.
But, yea.
Re: (Score:2)
I was definitely getting errors on the Fidelity site this morning for the first two hours after the market opened. I could still buy and sell stocks and access all my account data, so the Fidelity servers were fine, but the stock research and analysis screens were all toast, and would vary between errors and no data.
Comment removed (Score:3, Informative)
Re: (Score:1)
Re: (Score:2)
Wonder if this is why Amazon Music is acting up. Told Alexa to shuffle "My Music" (there are around 1300 songs there). It cycles through the same six each time and says "That's all."
Re: (Score:2)
WTF are you on about? The Communist Manifesto is an interesting piece of philosophy, and Marx's economic observations were considerably closer to being on-target than Smith's, but I'm no more a 'Marxist' than I am a Democrat.
BTW idiot, I work at Amazon. I'm part of the team here maintaining the largest security system on the planet.
Newbie troll is lame, even Free Republic had better trolls.
Re: (Score:2)
The troll-fu is weak in this one.
Slashdot has a long and glorious history of some of the best trolls and flamers on the Internet, but you're not even qualified to kiss their royal rumps.
Bwhahaha (Score:5, Insightful)
:-D fuckers! "Cloud" means, "someone else's computer", not, "magical faerie dust that never fails"
Re:Bwhahaha (Score:5, Insightful)
:-D fuckers! "Cloud" means, "someone else's computer", not, "magical faerie dust that never fails"
Cloud also means "someone else has to cancel their thanksgiving plans to fix it, not me."
Re: (Score:3)
Or... They just tell you that they're looking into it, but because of [insert excuse here] they are currently anticipating a 72-96 hour disruption window, during which all of your systems may continue not to work properly, and there is absolutely nothing you can do about it. Happy thanksgiving. :-)
Re: (Score:2)
Not at AWS, which is the cash cow that forced Bezos to pay dividends (they couldn't spend the money fast enough). This is currently a SEV1 issue. Jeff would have personally gotten paged when the ticket was generated, and I'll guarantee that he's watching the progress and will be interested in the Cause Of Event analysis.
**Full Disclosure**
I work at Amazon, but fortunately nothing to do with this.
Re: (Score:2)
Not in my experience. When your product/service that relies on AWS on the back end goes down, you're getting that call regardless of whether it was your crappy code or AWS being flaky. At least if it's your code, you can actually do something about it. Try telling the C-levels that there's nothing you can do because it's AWS' fault.
Re: Bwhahaha (Score:2)
Re: (Score:2)
> Cloud also means "someone else has to cancel their thanksgiving plans to fix it, not me."
And the problem is...? If that someone else earns a decent salary cancelling Thanksgiving plans, more power to them.
(Usually those without plans would take these rotations and likely they would earn double/triple on-call hourly pay if something needs fixing)
Better than you (Score:3)
The AWS staff in general is much better than the average IT person, so AWS is down less than most infrastructure is down.
Re: (Score:2)
No one said never fails. All these cloud providers have uptime guarantees and these outages don't breach those. Now the question is, can *you* do better.
Re: (Score:2)
According to my uptime numbers on my own services, yes. OpenBSD is fantastically rock solid.
Work services? They use AWS. I don't pay the bills, so that's Above My Pay Grade.
Re: (Score:2)
OpenBSD is fantastically rock solid
OpenBSD is not the reason a service goes down and the fact that you restricted your reply to just the OS is damning.
Re: (Score:2)
No True Reason for Downtime argument. Try again.
There's probable chance it's Iran, China, Russia.. (Score:3)
...North Korea or Venezuela...
From an "Internally leaked memo..."
"We use these entities to cover up our own incompetencies from whenever we see fit..."
Re: (Score:1, Troll)
North Korea? The country with a total of a couple hundred mostly-antique servers nationwide, which is 100% dependent on The Great Firewall Of China for connectivity to the Internet, which doesn't have the ability to attract a single competent pen-test instructor for its handful of programming students? Good grief. Isn't the "North Korean super-hacker" trope getting a bit old now?
Re: (Score:2)
Wow, so how long did you spend in North Korea, anyway?
That's way better information than is available in public. Heck, your story even contained details about their staffing! The rest of the world can only dream about that level of access to whatever their system is.
Don't you worry your local contacts will be punished because you just engaged in industrial espionage by telling us those details?
Re: (Score:2)
Now what the frack are you babbling about? They buy second-hand servers from China, and their only connection to the greater world is through two fiber lines through China. Until around 2012 their only Internet connectivity was a single T-3 line to Taiwan, (reportedly frequently saturated by Kim's porn habit). That's all public knowledge, has been for years.
AHA (Score:3)
Re:AHA (Score:4, Funny)
I wish some of my coworkers had run into that same issue this morning...
Giphy is a blight on humanity. Why any supposedly "professional" tool includes it is beyond my comprehension (seriously - Slack, Teams, wtf?).
Re: (Score:3)
Re: AHA (Score:2)
You know how they say a picture says a thousand words. Instead of using so many words, how about you do some math. How many words can a gif with 10 frames say?
Re: (Score:2)
How many words can a gif with 10 frames say?
Since the words the gif is "saying" are invariably displayed in large block letters on the gif itself... I'd say between 1 and 4.
Re: (Score:2)
A picture of something specific can communicate a lot about that specific thing.
I could write an essay about the horrors of the Vietnam War, but the picture of the naked, burned girl running away from her napalmed village can communicate quite a bit about it as well, and quite viscerally.
Nevertheless, one should be able to communicate words and feelings like 'I agree' without needing to post a picture of Kermit the Frog flailing around.
us-east-1 - the worst! (Score:2)
us-east-1 is the region that always fails. Avoid it if possible!
Re: (Score:1)
Re:us-east-1 - the worst! (Score:5, Funny)
Citation needed.
Here you go. [slashdot.org]
Re: (Score:2)
Re: (Score:1)
what a pain (Score:5, Funny)
That explains the error message Alexa was giving. Had to manually operate my alexa smart oven because the cloud was down - first time since I got it. I had to actually learn how to use it if I wanted lunch.
Re: (Score:2)
That is the funniest thing I've read all day.
Re: (Score:1)
That explains the error message Alexa was giving. Had to manually operate my alexa smart oven because the cloud was down - first time since I got it. I had to actually learn how to use it if I wanted lunch.
Now that she's back she's mad at you? You didn't clean the oven, made a mess just like a man would?
IM SHOCKED (Score:2)
Our AWS service was available the entire time. (Score:1)
Internet is Fine. (Score:3)
It's like saying a bunch of highways are closed when it is really just a few malls and restaurants closed.
-Who still puts all their eggs in one basket?
HEY! YOU! Don't be stupid! (Score:1)
ANYTHING in the cloud can be considered vulnerable. ANYTHING.
I see a time when computing migrates back to decentralized, secure servers behind thick firewalls accessed via super-encrypted VPN.
Why would anyone trust any private thing, like Ring doorbells, or personal Facebook data, or Oracle databases,
or your home security camera system to ANY cloud based service?
Besides vulnerability to ever-improving hackers, there is