AWS Introduces DNS Failover Feature for Its Notoriously Unreliable US East Region (theregister.com)

Amazon Web Services has rolled out a DNS resilience feature that allows customers to make domain name system changes within 60 minutes of a service disruption in its US East region, a direct response to the long history of outages at the cloud giant's most troubled infrastructure.

AWS said customers in regulated industries like banking, fintech and SaaS had asked for additional capabilities to meet business continuity and compliance requirements, specifically the ability to provision standby resources or redirect traffic during unexpected regional disruptions. The 60-minute recovery time objective still leaves a substantial window for outages to cascade, and the timing of the announcement -- less than six weeks after an October 20th DynamoDB incident and a subsequent VM problem drew criticism -- underscores how persistent US East's reliability issues have been.
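
For readers who want to picture what DNS-level failover looks like in practice, here is a minimal, hypothetical sketch using Route 53's existing failover routing policy via boto3. The hosted zone ID, domain, addresses and health check ID are all placeholders, and this illustrates the conventional mechanism rather than the newly announced capability.

```python
# Sketch: a conventional Route 53 primary/secondary failover pair via boto3.
# Hosted zone ID, domain, addresses and health check ID are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000EXAMPLE"   # placeholder
DOMAIN = "app.example.com."
PRIMARY_IP = "203.0.113.10"                # e.g. a us-east-1 endpoint
SECONDARY_IP = "198.51.100.20"             # e.g. a standby in another region
HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # placeholder

def failover_record(role, ip, health_check_id=None):
    """Build a PRIMARY or SECONDARY failover record set."""
    record = {
        "Name": DOMAIN,
        "Type": "A",
        "SetIdentifier": f"app-{role.lower()}",
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,                         # short TTL so a switch takes effect quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Primary/secondary failover pair",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("PRIMARY", PRIMARY_IP, HEALTH_CHECK_ID)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("SECONDARY", SECONDARY_IP)},
        ],
    },
)
```

With a short TTL, resolvers start returning the secondary address soon after the health check fails, which is the behavior several commenters below contrast with a 60-minute recovery objective.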


Comments Filter:
  • I think this pretty much just underlines the fact that US-East-1 is basically their "test" version of prod...

    • by ffkom ( 3519199 ) on Friday November 28, 2025 @06:34AM (#65822717)
      us-east-1 is more like their traditionally most overcrowded region, which experiences all sorts of scaling or migration-from-legacy-to-newer issues first. To the extent that AWS themselves have recommended that customers preferably deploy their stuff in other regions. Which of course also conveniently earns them $$$ when there is significant data traffic from deployments in us-east-1 to deployments in other regions.

      More generally, AWS has an incentive to not let people get the impression that deploying their stuff within one region provides sufficient availability. Because two deployments in two regions earn them more than twice what one deployment in one region puts on the customer's bill. Even if customers are clever enough (and very few are) to not have their second deployment with the same cloud vendor, they can at least charge for the data traffic.
      • by Cyberax ( 705495 )

        Which of course also conveniently earns them $$$ when there is significant data traffic from deployments in us-east-1 to deployments in other regions.

        Except for us-east-2. Traffic between us-east-1 and us-east-2 costs the same as traffic within us-east-1.

    • Doesn't help when:
      1. us-east-1 is the default region
      2. Customers blithely accept the default instead of taking 5s to consider if that is the most appropriate one based on their location. Chances are high that it isn't.

    • US-East-1 is the original AWS region, and they just keep hacking stuff onto it and never clean it up properly. The networking is a mess that no-one understands. On top of that, it's the default region when creating an instance, so a lot of people don't change it and end up with their stuff running there for no good reason. Never deploy anything to US-East-1 if you want it to be reliable.

      • by Bert64 ( 520050 )

        Some services are managed through us-east-1, especially when the service itself is global, like Route 53 and CloudFront.

        • by _merlin ( 160982 )

          Yeah, there's also a feature for automatically restarting unresponsive instances that depends on US-East-1 no matter where you deploy. It basically works by regularly sending heartbeats between a service running at US-East-1 and your instance. If it doesn't see any heartbeats for a while, it thinks your instance is unresponsive and restarts it. When US-East-1 went down, every instance with this feature enabled was constantly being restarted, no matter where it was deployed. That contributed to a lot of the impact on deployments outside US-East-1 during the outage. (A rough sketch of this heartbeat-timeout pattern follows at the end of this thread.)

          • Ugh.

            Well, thanks for clearing up the question I had about why people were insisting that their sites outside US-East-1 were impacted during the outage.

      • I hear these "it's the oldest" arguments and think of both the New Jersey POTS cable plant and the horrors of maintaining wiring that Alex Bell installed. Or the joy of the Parisian telephone cable plant, legendary for being unmanageable.

        Lest I forget, Jakarta and Manila vie for the worst overhead wiring in the world, so dense and tangled it blots out the sun in places. Ugh.

        US-East-1/2 is just complex. And too important to rebuild.
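
The heartbeat-and-restart behavior _merlin describes above is, stripped of AWS specifics, just a liveness timeout. A toy sketch of that generic pattern follows; it is not any actual AWS service or API, and restart_instance is a hypothetical callback.

```python
# Toy sketch of a heartbeat-timeout watchdog, purely to illustrate the
# pattern described in the comment above; not an actual AWS service or API.
import time

HEARTBEAT_TIMEOUT = 90.0   # seconds without a heartbeat before acting

class Watchdog:
    def __init__(self, restart_instance):
        self.restart_instance = restart_instance  # hypothetical callback
        self.last_heartbeat = {}                  # instance_id -> timestamp

    def record_heartbeat(self, instance_id):
        """Called whenever a heartbeat arrives from an instance."""
        self.last_heartbeat[instance_id] = time.monotonic()

    def sweep(self):
        """Restart any instance whose heartbeats have gone quiet.

        Note the failure mode from the comment above: if the monitor itself
        is unreachable (say, its home region is down), heartbeats stop
        arriving even though the instances are healthy, and this logic
        restarts them anyway.
        """
        now = time.monotonic()
        for instance_id, seen in list(self.last_heartbeat.items()):
            if now - seen > HEARTBEAT_TIMEOUT:
                self.restart_instance(instance_id)
                self.last_heartbeat[instance_id] = now  # avoid restart loops
```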

  • by ffkom ( 3519199 ) on Friday November 28, 2025 @06:25AM (#65822697)
    Should not "regulated industries" require more than one deployment in a single region of a single cloud vendor to begin with? I understand that "changing DNS entries" may be a convenient action to take for redirecting users elsewhere when there is an outage of a cheap single deployment, but when "high availability" is asked for, resolving service host names to more than one single address where the service can be reached (all the time, not just during an incident) seems like a necessity to me.
    • They generally use a primary and standby system, just because it's a lot harder to avoid consistency problems with multiple primaries. This means that you need to direct traffic to the current primary, and redirect it to a standby when necessary, which is fine except that the system you're switching away from and the configuration interface for your DNS provider are both in us-east-1, because everything normally is. That's why they're looking for the ability to make a different region primary specifically during a disruption.

      • by ffkom ( 3519199 )
        Even when there is a primary/standby setup, the IPs of both could still be part of the DNS response at all times, and if a client connects to a server that does not consider itself "the primary", that server could tell the client to use the other one instead. That way, an outage of the DNS service would not need to result in an outage of the service - at least as long as clients or DNS proxies are willing to deliver a cached resolution response.
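
A rough sketch of the suggestion above: both addresses stay in DNS all the time, and a server that is not currently primary bounces clients to the one that is. PRIMARY_URL and is_primary() are hypothetical stand-ins for whatever mechanism a real deployment uses to track the active side.

```python
# Rough sketch of the "standby redirects to the primary" idea above.
# PRIMARY_URL and is_primary() are hypothetical stand-ins.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRIMARY_URL = "https://primary.example.com"

def is_primary():
    # In a real setup this would consult replication state, a lock
    # service, a config flag, etc.
    return False

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_primary():
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"served by primary\n")
        else:
            # Standby: tell the client to use the primary instead.
            self.send_response(307)
            self.send_header("Location", PRIMARY_URL + self.path)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```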
  • I don't understand this slow solution. I suspect that there must be a lot more to it.

    Amazon has long had a DNS service, called Route 53, that can make and propagate changes in seconds. 60 minutes for a DNS issue doesn't make sense.

    Cloudflare also has a DNS service that can propagate changes in single digit seconds.

    Google has the capability as well, but I don't know if they offer it as a service outside of Google.

    This story is missing something.

    • by brunes69 ( 86786 )

      The problem is that if you're running your primary DNS on Route53 in US-EAST and US-EAST goes down, you're currently fucked.

      The real solution isn't this - it is decoupling DNS from AWS. You should not get your DNS and your core infrastructure from the same vendor. Makes zero sense.

      • The problem is that if you're running your primary DNS on Route53 in US-EAST and US-EAST goes down, you're currently fucked.

        But, why does this problem exist? Amazon has always had availability zones that fail over in under 120 seconds. Route 53 should be in multiple zones.

        I understand that typical consumers will choose the cheapest single zone option that they can. But, Amazon themselves should not be so budget constrained.

        The solution doesn't make sense, so I think that we are not understanding the real problem.

    • Amazon has long had a DNS service, called Route 53

      Yes... that is ... the problem... Literally the last few major AWS outages have been the result of a Route 53 fuckup.

      In other news when there's a breakdown on the highway and you reroute traffic over local roads it will be slower.

  • There is no reason your entire operation should be consolidated into one vendor's infrastructure in one region. It is just foolish to do this.

    Use a different company like Cloudflare* for DNS. Use their native tooling to fail over automatically to another AWS region when your primary region dies.

    Note that this requires cross-region replication to be set up, which is expensive, so it only makes sense when you are a mega enterprise.

    * Yes I realize Cloudflare also went down recently. But that doesn't change the point.

    • AmazonProvidedDNS provides certain additional functions not available from standard DNS. For instance, returning private IPs for public domain names.
      • by Bert64 ( 520050 )

        Any DNS service can return any IP address for a lookup; this is not an AWS-specific feature.

        • Key AmazonProvidedDNS-only behaviors:

          * VPC+2 resolver address and link-local aliases: The resolver is automatically available at the base IPv4 address of the VPC CIDR plus 2 (for example, 10.0.0.2 in a 10.0.0.0/16 VPC), and also at fixed link-local IPv4/IPv6 addresses, without you having to run or configure your own DNS server in that address space.

          * Resolution of Amazon-provided private hostnames: When enableDnsSupport and enableDnsHostnames are true, instances can resolve their own and other instances' Amazon-provided private DNS hostnames.
      • You can do that with a Cloudflare tunnel too. Not that I think it's a good idea.
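
For reference, the VPC+2 resolver address and the enableDnsSupport / enableDnsHostnames attributes mentioned in the sub-thread above can be inspected with standard EC2 API calls. A small sketch, with the VPC ID as a placeholder:

```python
# Sketch: compute the "VPC+2" AmazonProvidedDNS resolver address and check
# the two VPC DNS attributes mentioned above. The VPC ID is a placeholder.
import ipaddress

import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"  # placeholder

# The resolver lives at the VPC's base address plus 2 (e.g. 10.0.0.2 in 10.0.0.0/16).
vpc = ec2.describe_vpcs(VpcIds=[VPC_ID])["Vpcs"][0]
cidr = ipaddress.ip_network(vpc["CidrBlock"])
print("AmazonProvidedDNS resolver:", cidr.network_address + 2)

def dns_attribute_enabled(attribute):
    """Return True if the given VPC DNS attribute is enabled."""
    resp = ec2.describe_vpc_attribute(VpcId=VPC_ID, Attribute=attribute)
    key = "EnableDnsSupport" if attribute == "enableDnsSupport" else "EnableDnsHostnames"
    return resp[key]["Value"]

# Private hostname resolution needs both attributes turned on.
for attr, flag in (("enableDnsSupport", "EnableDnsSupport"),
                   ("enableDnsHostnames", "EnableDnsHostnames")):
    if not dns_attribute_enabled(attr):
        # Only one attribute can be modified per call.
        ec2.modify_vpc_attribute(VpcId=VPC_ID, **{flag: {"Value": True}})
```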
  • The 60-minute recovery time is due to control plane contention. The better strategy is to use an independent WAF like Cloudflare for HA.

    Route53 Data Plane almost never fails. Just point your records permanently to the external WAF. When AWS wobbles, you handle the failover/redirect logic at the WAF layer. This bypasses the need to push DNS updates through a congested AWS control plane when the region is already on fire.
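
A minimal sketch of that failover-at-the-proxy idea: DNS points permanently at the front end, and the front end decides which origin to use. The origin URLs and the /healthz path are placeholders, and a real deployment would use the proxy vendor's own origin pools and health checks rather than hand-rolled code like this.

```python
# Sketch of origin failover at the proxy/WAF layer rather than via DNS changes.
# Origin URLs and the /healthz path are placeholders.
import urllib.request

ORIGINS = [
    "https://app.us-east-1.example.com",   # primary region
    "https://app.us-west-2.example.com",   # standby region
]

def healthy(origin, timeout=2.0):
    """Best-effort health probe against a placeholder /healthz endpoint."""
    try:
        with urllib.request.urlopen(origin + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_origin():
    """Return the first healthy origin; fall back to the primary if none respond."""
    for origin in ORIGINS:
        if healthy(origin):
            return origin
    return ORIGINS[0]

if __name__ == "__main__":
    print("routing traffic to", pick_origin())
```

Because the public DNS records never change, this sidesteps the congested control plane the commenter mentions; the trade-off is that the proxy layer itself becomes the component you have to keep independent and healthy.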
