AWS Introduces DNS Failover Feature for Its Notoriously Unreliable US East Region (theregister.com)
Amazon Web Services has rolled out a DNS resilience feature that allows customers to make domain name system changes within 60 minutes of a service disruption in its US East region, a direct response to the long history of outages at the cloud giant's most troubled infrastructure.
AWS said customers in regulated industries like banking, fintech and SaaS had asked for additional capabilities to meet business continuity and compliance requirements, specifically the ability to provision standby resources or redirect traffic during unexpected regional disruptions. The 60-minute recovery time objective still leaves a substantial window for outages to cascade, and the timing of the announcement -- less than six weeks after an October 20th DynamoDB incident and a subsequent VM problem drew criticism -- underscores how persistent US East's reliability issues have been.
And in what region did they test this new feature? (Score:2)
I think this pretty much just underlines the fact that US-East-1 is basically their "test" version of prod...
Re:And in what region did they test this new featu (Score:5, Informative)
More generally, AWS has an incentive not to let people get the impression that deploying their stuff within one region provides sufficient availability, because two deployments in two regions add more than twice as much to the customer's bill as one deployment in one region. Even if customers are clever enough (and very few are) not to put their second deployment with the same cloud vendor, AWS can at least charge for the data traffic.
Re: (Score:2)
Which of course also conveniently earns them $$$ when there is significant data traffic from deployments in us-east-1 to deployments in other regions.
Except for us-east-2. Traffic between us-east-1 and us-east-2 costs the same as traffic within us-east-1.
Re: (Score:2)
Wow, I didn't know us-east-2 has free network traffic with us-east-1.
Re: (Score:1)
Doesn't help when:
1. us-east-1 is the default region
2. Customers blithely accept the default instead of taking 5s to consider if that is the most appropriate one based on their location. Chances are high that it isn't.
It's the oldest and most crudded up (Score:2)
US-East-1 is the original AWS region, and they just keep hacking stuff onto it and never clean it up properly. The networking is a mess that no-one understands. On top of that, it's the default region when creating an instance, so a lot of people don't change it and end up with their stuff running there for no good reason. Never deploy anything to US-East-1 if you want it to be reliable.
Re: (Score:2)
Some services are managed through us-east-1, especially when the service itself is global: things like Route 53 and CloudFront.
Re: (Score:2)
Yeah, there's also a feature for automatically restarting unresponsive instances that depends on US-East-1 no matter where you deploy. It basically works by regularly sending heartbeats between a service running in US-East-1 and your instance. If it doesn't see any heartbeats for a while, it assumes your instance is unresponsive and restarts it. When US-East-1 went down, every instance with this feature enabled was constantly being restarted, no matter where it was deployed. That contributed to a lot of the impact people saw outside US-East-1.
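A watchdog of that general shape looks roughly like the sketch below. It is purely illustrative (not AWS's actual implementation; the monitor URL, threshold and restart hook are made up) and mainly shows why a monitor that can't see heartbeats looks the same whether the instance is dead or the monitoring side is:

```python
# Illustrative heartbeat watchdog, NOT AWS's implementation: the monitor URL,
# threshold and restart hook are invented for this sketch. The point is that
# "no heartbeats seen" cannot distinguish a dead instance from a broken monitor.
import time
import requests

HEARTBEAT_URL = "https://monitor.example.com/heartbeat/i-0abc123"  # hypothetical
STALE_AFTER = 120   # seconds without a heartbeat before we assume the instance is dead
last_seen = time.monotonic()


def restart_instance():
    # Stand-in for whatever actually restarts the instance.
    print("no heartbeat seen, restarting instance")


while True:
    try:
        if requests.get(HEARTBEAT_URL, timeout=5).status_code == 200:
            last_seen = time.monotonic()
    except requests.RequestException:
        # A monitor that cannot reach anything looks exactly like an
        # instance that has stopped responding.
        pass
    if time.monotonic() - last_seen > STALE_AFTER:
        restart_instance()
        last_seen = time.monotonic()
    time.sleep(30)
```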
Re: (Score:2)
Ugh.
Well, thanks for clearing up the question I had about people insisting that their sites outside US-East-1 were impacted during the outage.
Re: (Score:2)
I hear these "it's the oldest" comments and think of both the New Jersey POTS cable plant and the horrors of maintaining wiring that Alex Bell installed. Or the joy of the Parisian telephone cable plant, legendary for being unmanageable.
Lest I forget, Jakarta and Manila vie for the worst overhead wiring in the world, so dense and tangled it blots out the sun in places. Ugh.
US-East-1/2 is just complex. And too important to rebuild.
Single-region deployments by regulated industries? (Score:3)
Re: Single-region deployments by regulated industr (Score:3)
They generally use a primary and standby system, just because it's a lot harder to avoid consistency problems with multiple primaries. This means you need to direct traffic to the current primary and redirect it to a standby when necessary, which is fine, except that the system you're switching away from and the configuration interface for your DNS provider are both in us-east-1, because everything normally is. That's why they're looking for the ability to make a different region primary specifically during a regional disruption.
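For concreteness, the usual way to express that primary/standby arrangement in Route 53 is a failover routing policy backed by a health check on the primary. A minimal sketch with boto3; the hosted zone ID, record name and IP addresses are placeholders, not details from the article:

```python
# Route 53 failover routing for a primary/standby pair: the primary record
# answers while its health check passes, otherwise the standby record is served.
# Zone ID, record name and IPs are placeholders.
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0000000000000000000"   # placeholder hosted zone ID

# Health check against the primary region's endpoint.
health_check = route53.create_health_check(
    CallerReference="primary-use1-health-v1",   # must be unique per check
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",            # primary endpoint (placeholder)
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)["HealthCheck"]


def failover_record(set_id, role, ip, health_check_id=None):
    rrset = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,              # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}


route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("use1-primary", "PRIMARY", "203.0.113.10", health_check["Id"]),
        failover_record("usw2-standby", "SECONDARY", "198.51.100.20"),
    ]},
)
```

The catch the comment is pointing at: the API calls above go through the Route 53 control plane, which is exactly the thing that tends to be unreachable when us-east-1 is having a bad day.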
I Don't Understand At All (Score:2)
I don't understand this slow solution. I suspect that there must be a lot more to it.
Amazon has long had a DNS service, called Route 53, that can make and propagate changes in seconds (timed in the sketch at the end of this comment). 60 minutes for a DNS issue doesn't make sense.
Cloudflare also has a DNS service that can propagate changes in single digit seconds.
Google has the capability as well, but I don't know if they offer it as a service outside of Google.
This story is missing something.
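To put a number on the "seconds" claim: Route 53 reports when a submitted change has reached all of its authoritative name servers, and you can time that. A small sketch with boto3; the change ID is a placeholder for the Id returned by an earlier change_resource_record_sets call:

```python
# Time how long a Route 53 record change takes to reach INSYNC status, i.e. to
# be visible on all of Route 53's authoritative name servers. CHANGE_ID is a
# placeholder for the Id returned by a previous change_resource_record_sets call.
import time
import boto3

route53 = boto3.client("route53")
CHANGE_ID = "/change/C0000000000000000000"   # placeholder

start = time.monotonic()
while route53.get_change(Id=CHANGE_ID)["ChangeInfo"]["Status"] != "INSYNC":
    time.sleep(2)
print(f"propagated in {time.monotonic() - start:.0f}s")
```

Note this measures propagation to Route 53's own name servers; resolvers elsewhere still cache the old answer for the record's TTL.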
Re: (Score:3)
The problem is that if you're running your primary DNS on Route53 in US-EAST and US-EAST goes down, you're currently fucked.
The real solution isn't this; it's decoupling DNS from AWS. You should not source your DNS and your core infrastructure from the same vendor. Makes zero sense.
Re: (Score:2)
The problem is that if you're running your primary DNS on Route53 in US-EAST and US-EAST goes down, you're currently fucked.
But, why does this problem exist? Amazon has always had availability zones that fail over in under 120 seconds. Route 53 should be in multiple zones.
I understand that typical consumers will choose the cheapest single zone option that they can. But, Amazon themselves should not be so budget constrained.
The solution doesn't make sense, so I think that we are not understanding the real problem.
Re: I Don't Understand At All (Score:2)
Because your DNS failing over doesn't fail over the lookups IN THE DNS to your other services in another zone. You have to actually UPDATE THE DNS. That's what this change does.
Re: (Score:3)
Amazon has long had a DNS service, called Route 53
Yes... that is ... the problem... Literally the last few major AWS outages have been the result of a Route 53 fuckup.
In other news when there's a breakdown on the highway and you reroute traffic over local roads it will be slower.
Just don't use AWS for DNS. Problem solved. (Score:2)
There is no reason your entire operation should be consolidated into one vendor's infrastructure in one region. It is just foolish to do this.
Use a different company like Cloudflare* for DNS, and use their native tooling to fail over automatically to another AWS region when your primary region dies (a rough sketch follows below).
Note that this requires cross-region replication to be set up, which is expensive, so it only really makes sense when you are a mega enterprise.
* Yes, I realize Cloudflare also went down recently. But that doesn't change the argument for keeping DNS and core infrastructure with different vendors.
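For illustration, the scripted version of that fail-over is a health check that repoints a Cloudflare-hosted record at the standby region. A rough sketch only: the zone ID, record ID, hostnames and IPs are placeholders, and Cloudflare's managed load balancing can do the same thing without custom code:

```python
# Repoint a Cloudflare-hosted DNS record at a standby region when the primary
# stops answering its health check. Zone/record IDs, hostnames and IPs are
# placeholders; treat this as a sketch, not production failover logic.
import requests

API = "https://api.cloudflare.com/client/v4"
HEADERS = {"Authorization": "Bearer CF_API_TOKEN", "Content-Type": "application/json"}
ZONE_ID = "your-zone-id"       # placeholder
RECORD_ID = "your-record-id"   # placeholder
STANDBY_IP = "198.51.100.20"   # standby region endpoint (placeholder)


def primary_is_healthy() -> bool:
    """Probe the primary region directly, bypassing public DNS."""
    try:
        r = requests.get("https://use1.origin.example.com/health", timeout=5)
        return r.status_code == 200
    except requests.RequestException:
        return False


def point_record_at_standby() -> None:
    """Overwrite the public A record so clients resolve to the standby region."""
    resp = requests.put(
        f"{API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
        headers=HEADERS,
        json={"type": "A", "name": "app.example.com",
              "content": STANDBY_IP, "ttl": 60, "proxied": False},
    )
    resp.raise_for_status()


if not primary_is_healthy():
    point_record_at_standby()
```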
Re: (Score:2)
Any DNS service can return any IP address for a lookup; this is not an AWS-specific feature.
Re: (Score:2)
* VPC+2 resolver address and link-local aliases: the resolver is automatically available at the base IPv4 address of the VPC CIDR plus 2 (for example, 10.0.0.2 in a 10.0.0.0/16 VPC), and also at fixed link-local IPv4/IPv6 addresses, without you having to run or configure your own DNS server in that address space.
* Resolution of Amazon-provided private hostnames: when enableDnsSupport and enableDnsHostnames are true, instances can resolve their own and other instances' private DNS hostnames.
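For reference, those two VPC attributes can be toggled through the EC2 API; a minimal sketch with boto3, where the VPC ID is a placeholder:

```python
# Enable the two VPC DNS attributes quoted above. The VPC ID is a placeholder,
# and the EC2 API only accepts one attribute per modify_vpc_attribute call.
import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"   # placeholder

ec2.modify_vpc_attribute(VpcId=VPC_ID, EnableDnsSupport={"Value": True})
ec2.modify_vpc_attribute(VpcId=VPC_ID, EnableDnsHostnames={"Value": True})

# Read them back to confirm.
for attr in ("enableDnsSupport", "enableDnsHostnames"):
    print(attr, ec2.describe_vpc_attribute(VpcId=VPC_ID, Attribute=attr))
```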
Use it but don't rely on the AWS Control Plane (Score:1)
The 60-minute recovery time is due to control plane contention. The better strategy is to use an independent WAF like Cloudflare for HA.
Route53 Data Plane almost never fails. Just point your records permanently to the external WAF. When AWS wobbles, you handle the failover/redirect logic at the WAF layer. This bypasses the need to push DNS updates through a congested AWS control plane when the region is already on fire.
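A sketch of that static-DNS, fail-over-at-the-edge pattern using Cloudflare's load balancing API. The endpoint paths and field names are written from memory of the v4 API, so treat them as assumptions to verify against current docs; the API token, account/zone IDs and origin hostnames are placeholders:

```python
# "Static DNS, failover at the WAF/LB layer": the public hostname points at
# Cloudflare permanently, and Cloudflare shifts traffic between region pools
# based on its own health monitors, so no Route 53 change (and no AWS control
# plane call) is needed during an incident. API paths/fields are from memory
# of Cloudflare's v4 API; IDs, token and hostnames are placeholders.
import requests

API = "https://api.cloudflare.com/client/v4"
HEADERS = {"Authorization": "Bearer CF_API_TOKEN", "Content-Type": "application/json"}
ACCOUNT_ID = "your-account-id"   # placeholder
ZONE_ID = "your-zone-id"         # placeholder

# 1. A monitor that health-checks each origin.
monitor = requests.post(
    f"{API}/accounts/{ACCOUNT_ID}/load_balancers/monitors",
    headers=HEADERS,
    json={"type": "https", "method": "GET", "path": "/health",
          "expected_codes": "200", "interval": 60, "retries": 2},
).json()["result"]


# 2. One origin pool per region, both probed by the monitor.
def make_pool(name, address):
    return requests.post(
        f"{API}/accounts/{ACCOUNT_ID}/load_balancers/pools",
        headers=HEADERS,
        json={"name": name, "monitor": monitor["id"],
              "origins": [{"name": name, "address": address, "enabled": True}]},
    ).json()["result"]


primary = make_pool("use1-primary", "use1.origin.example.com")
standby = make_pool("usw2-standby", "usw2.origin.example.com")

# 3. The load balancer fronts the public hostname and falls back to the
#    standby pool when the primary pool's health checks fail.
requests.post(
    f"{API}/zones/{ZONE_ID}/load_balancers",
    headers=HEADERS,
    json={"name": "app.example.com", "proxied": True,
          "default_pools": [primary["id"]], "fallback_pool": standby["id"]},
)
```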