Cloudflare's 23-Minute Outage Today Also Took Down Major Web Sites (techcrunch.com) 57
"Many major websites and services were unreachable for a period Friday afternoon due to issues at Cloudflare's 1.1.1.1 DNS service," reports TechCrunch:
The outage seems to have started at about 2:15 Pacific time and lasted for about 25 minutes before connections began to be restored. Google DNS may also have been affected.
Cloudflare at 2:46 says "the issue has been identified and a fix is being implemented." CEO Matthew Prince explains that it all came down to a bad router in Atlanta... The company also issued a statement via email emphasizing that this was not an attack on the system...
Discord, Feedly, Politico, Shopify and League of Legends were all affected, giving an idea of the breadth of the issue. Not only were websites down but also some status pages meant to provide warnings and track outages. In at least one case, even the status page for the status page was down.
Other sites that went down briefly include Discord, Patreon, Deliveroo, GitLab, Zendesk, and Medium, according to the Verge. And FBI.gov was also down, reports PC Magazine, as well as their own site, Mashable.com, "and even Downdetector.com, a real-time outage monitoring website."
They note that Cloudflare serves 26 million sites, and Cloudflare's CEO acknowledged tonight that the outage affected about 50% of their private backbone.
No redundancy? (Score:3)
C'mon, they're not Slashdot. They can afford redundancy in hardware, routes, ports, cables, upstream links, and so on.
Ridiculous.
Re: (Score:1)
Not an accident, frailty by design, a very convenient "off" switch
Re: (Score:1)
They can afford redundancy in hardware, routes, ports, cables, upstream links, and so on.
Of course they do. They redirected traffic off a backbone link for maintenance, onto two other links, one of which they fucked up the router's configuration file on.
Three way redundancy, capable of degrading to two way redundancy for planned maintenance.
That's why only 50% of that link's traffic failed and not 100%.
Had the tech not screwed up the configuration, they would have disabled an entire backbone link, still had dual redundancy on it, and you wouldn't even have noticed.
Re: (Score:2)
Even that shouldn't have been able to happen, and I'm sure they are working very hard right now to fix the underlying issues. In a piece of infrastructure as critical as CloudFlare (which hosts about 20% of the entire web IIRC) no single tech should have been able to put a configuration file live, even as a temporary patch during maintenance, without thorough inspection and testing.
Re: No redundancy? (Score:2)
Re: (Score:2)
I see this happen with many high-tech companies. For example, EarthLink was out a few weeks ago due to water issues, and that outage lasted a couple of days!
Cloudflare is bad (Score:1)
I'm not surprised so many sites put their eggs in one basket, but it says something about responsibility when "legit" sites decide to use an unreliable service that also protects the *chan-clones and the piracy and illegal (as in actual crimes like stolen accounts) sites.
The best thing that should happen is for Cloudflare to fail and commercial services stop using it.
I believe in coincidences (Score:2)
And, odds are, the timing here (Cloudflare, the Twitter hack, whatever took down Slashdot) is just coincidental. Especially since the Cloudflare guy went out of his way to mention it was not the result of an attack (nor was it a hardware issue, despite using the term "bad router").
I'm just hoping there aren't more coincidences in early November.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
"bad router"
Sounds like another J.J. Abrams production company [wikipedia.org].
Re: (Score:2)
Several sites and services have been having disruption since Wednesday, including some work sites, an MMO I play, etc. Maybe coincidence. But it's been a frustrating few days (and for some reason, work conference calls stayed up, dammit).
Re: (Score:2)
Panicked "the network is broken" claims are a frequent escalation from the call center, from developers, and from junior admins. Teaching such personnel how to verify that it _is_ the network, or not, is often part of my mentorship work, because very often it is _not_ the network. And when it is the network, proving it to entrenched network admins who don't appreciate input from outside their department can be very difficult.
DNS update (Score:3)
Re: (Score:2)
What does that add to resiliency, beyond address resolution during an event? The last known good address could be stale due to load balancing or DDoS prevention.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
It is typically nothing but a checkmark on a high availability checklist. The tertiary is still pulling from upstream somewhere, and the data expires, much as a squid server is not a substitute for a website with active content.
Re: (Score:2, Troll)
The Internet was engineered to route around errors.
The World Wide Web has managed to "fix" that "bug" by putting critical infrastructure in the hands of a select few third parties.
Re: DNS update (Score:2)
Re: (Score:2)
It really doesn't. Every major OS, as well as the DHCP protocol itself, is set up with the ability to fall back to other DNS servers. This applies both at the OS level and to caching DNS servers looking upstream or to the root DNS servers directly.
I noticed the Cloudflare outage on my desktop PC* which presented like an internet outage, but curiously my laptop was fine. The difference was the laptop had a secondary DNS server configured and happily fell back to using Google while on my desktop I left that fiel
Re: (Score:2)
You can configure unbound to sort of do this. You can tell it to serve expired records (with a 0 TTL). Then it goes and fetches the current record in the background so that the next request will have a fresh record. I'm not sure though how it would handle being unable to refresh the record, and if it would keep serving 0 TTL records. Cloudflare DNS actually does this, and serves 0 TTL records and refreshes it for the next query.
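The serve-expired behaviour described above can be sketched roughly as follows. This is a toy illustration, not unbound's or Cloudflare's actual implementation: the `ServeStaleCache` class and the `fetch` callable are hypothetical names, the refresh is done inline rather than in the background, and the choice to keep serving TTL 0 answers when the refresh fails is an assumption of this sketch (it is exactly the case the comment says it is unsure about).

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ServeStaleCache:
    """Toy resolver cache that answers from expired entries with a TTL of 0."""

    def __init__(self,
                 fetch: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # fetch is a hypothetical upstream lookup: name -> (address, ttl_seconds).
        # It is expected to raise OSError when the upstream is unreachable.
        self._fetch = fetch
        self._clock = clock
        # name -> (address, absolute expiry time)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def _refresh(self, name: str) -> Optional[Tuple[str, float]]:
        """Try to fetch a fresh record; keep the old entry if that fails."""
        try:
            address, ttl = self._fetch(name)
        except OSError:
            return None
        self._entries[name] = (address, self._clock() + ttl)
        return address, ttl

    def resolve(self, name: str) -> Tuple[str, float]:
        """Return (address, remaining_ttl); expired entries come back with TTL 0."""
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None:
            address, expires_at = cached
            if now < expires_at:
                return address, expires_at - now
            # Expired: answer with the stale address and TTL 0, and refresh so
            # the next query gets a fresh record. A real resolver would do the
            # refresh in the background; here it is inline for simplicity.
            self._refresh(name)
            return address, 0.0
        fresh = self._refresh(name)
        if fresh is None:
            raise LookupError(f"no cached or fresh record for {name}")
        return fresh
```

With this design, a name that was resolved at least once keeps resolving during an upstream outage, at the cost of possibly handing out an address that is no longer correct.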
Concept of Internet...DE-centralization (Score:1)
Something to expect more of....and it is what you deserve when you go against the grain/principle/grind of the internet itself.
Capitalism...after a while money/power becomes centralized.
Two such opposing concepts should never have been allowed to meet, in hindsight.
Been on the internet since 1995. It was not much to look at back then, still it was much more fun than it is now.
Hmm...a yard? It is mine?
DNS? (Score:4, Interesting)
Re: (Score:2)
This is precisely what my laptop did. On my desktop I for some reason didn't have a secondary configured. I figured this out when I shouted "internet's down" through the house only to get a reply of "it's working fine here" from the living room.
Re: (Score:2)
Re: (Score:2)
For many people it was. The actual service issues were minor on the network on the whole, but 1.1.1.1 was in fact unreachable meaning if you didn't have a fallback configured (like I didn't) then the internet looked down, and if you did have a fallback configured (every other computer in the house did, don't know why I didn't set it up correctly here) there's a good chance that unless you lived in a very specific location you wouldn't have noticed the problem.
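As a rough sketch of the fallback behaviour described above: try each configured resolver in order and use the first one that answers. The function name `resolve_with_fallback` and the `query` callable are hypothetical; `query(server, name)` stands in for whatever actually sends the DNS request and is assumed to raise `OSError` (which includes timeouts) when the server cannot be reached.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def resolve_with_fallback(
    name: str,
    servers: Sequence[str],
    query: Callable[[str, str], List[str]],
) -> Tuple[str, List[str]]:
    """Return (server_that_answered, addresses), trying servers in order."""
    last_error: Optional[BaseException] = None
    for server in servers:
        try:
            return server, query(server, name)
        except OSError as exc:
            # Unreachable or timed-out resolver: remember why, try the next one.
            last_error = exc
    # Every resolver failed (or none was configured): this is the
    # "internet looks down" case from the desktop without a secondary.
    raise LookupError(
        f"no resolver in {list(servers)} answered for {name}"
    ) from last_error
```

With `servers = ["1.1.1.1"]` a single unreachable resolver makes every lookup fail, whereas `["1.1.1.1", "8.8.8.8"]` quietly moves on to the second entry.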
Really incorrect article summary (Score:4, Insightful)
As usual, this outage was entirely caused by humans and unforeseen combinations of conditions to make things go off the rails.
The router in Atlanta was not “bad”. Google DNS was also not affected.
They even have a blog post laying out exactly what went wrong. They have already addressed some of the underlying network design assumptions and issues that indirectly led to this outage.
https://blog.cloudflare.com/cl... [cloudflare.com]
Re: (Score:1)
Thanks for the link. It provided the sort of report that I could appreciate from someone close enough (and timely enough!) to the issue to be believable.
If only everything mentioned here on /. had such a link.
That's the internet in 2020 (Score:2)
ajax.googleapis.com goes down and half of the internet is broken
cdnjs.cloudflare.com goes down and half of the internet is broken
1.1.1.1 or 8.8.8.8 goes down and half of the internet is broken
The internet was designed to be resilient and to survive part of it going dark. What we ended up with instead is a global network at the mercy of the good will and competence of a handful of operators.
Re: (Score:1)
> Not only were websites down but also some status pages meant to provide warnings and track outages. In at least one case, even the status page for the status page was down.
Who has a status page using the same infrastructure? Who then goes one further and does the same for the status page for the status page?
If only there'd been a status page for the status page for the status page...
Re: (Score:2)
Except that's not true at all. The internet was never designed to route around a fault automatically. The internet was designed to allow operators to route around faults automatically. There's a key difference, and that is exemplified here: Cloudflare had a fault, and routed around the problem in minutes.
And no "half the internet" wasn't broken. In fact people who had proper DNS configuration (not myself, I didn't configure a secondary DNS on one of my machines it seems) didn't even notice that Cloudflare h
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
> The internet was never designed to route around a fault automatically.
I can attest that "The Internet" was designed to handle many classes of failure quite automatically. It's what DNS, TCP, routers and load balancers were designed to do, for example. It is not "foolproof" because fools can make quite dramatic errors like this one.
Re: (Score:2)
They aren't "the internet", they are redundancy of singular components. They all rely on pre-configured fall-backs of individual equipment. It does not in any way take into account a misunderstanding of a configuration. The internet won't route around this, just like it won't route around a faulty BGP announcement or won't route around an unavailable ASN route.
People route around those. Always have. All "the internet" does is try to prevent those points from failing through reliable and redundant equipment.
Re: (Score:1)
Never Rely On One Company (Score:2)
Should have used Akamai (Score:2)
Comment removed (Score:3)
Re: (Score:2)
CDNs sitting in front of websites are location dependent. The IP address 1.1.1.1 is not. Most of the world would not have noticed the Cloudflare outage unless they had DNS configured to use 1.1.1.1 and had no fallback configured. If you don't use Cloudflare for DNS or have a backup, then you would only have been affected by this bug if you were served content by the specific system which was overloaded.
That is why 1.1.1.1 is significant, it is the only component of this outage which was location independent
Re: (Score:2)
Re: (Score:2)
This is completely untrue. Most people use their ISP's DNS server, and it was most certainly noticed.
Please re-read the sentence you were replying to. Pay attention to the subject of the sentence: specifically, people who were using Cloudflare DNS.
It is extremely uncommon to use 1.1.1.1 for your DNS. Whereas about 40% of the websites you're likely to visit are served by Cloudflare.
I'm not sure what you think you're replying to but somehow you fundamentally missed the point I was making: so let me restate it clearly:
Cloudflare sits in front of websites but the bug was location dependent, so only a small portion of the world would have noticed those 40% of the websites have a problem.
The thing that makes 1.1.1.1 unique and
Welcome to the internet. (Score:3)
Re: (Score:2)
Re: (Score:2)
No sites got taken out. It appeared as though sites were unreachable from a specific location if you used their DNS; unfortunately, one of those was the DNS resolver itself. If you had a fallback configured then that kept working.
Re: (Score:2)
Re: (Score:2)
I still disagree with the sentiment. It's unstable but managed under watchful eyes. Hardware fails all the time, yet large patches of national internet service rarely ever go offline. Equipment is generally well managed, and redundant to the point of incredible reliability and stability. Every so often a human error takes something down, and this kind of human error is not something *anyone* can design around.
Even with the potential problems what is clear even from this incident is that these error
Redundancy (Score:2)
As the old cliche goes:
"Two is one, and one is none."
This is why my DNS points to 8.8.8.8, 1.1.1.1, and 8.8.4.4. Though I think I may have to secure another DNS provider if things keep going the way they are... :-/