The Internet

Cloudflare's 23-Minute Outage Today Also Took Down Major Web Sites (techcrunch.com) 57

"Many major websites and services were unreachable for a period Friday afternoon due to issues at Cloudflare's 1.1.1.1 DNS service," reports TechCrunch: The outage seems to have started at about 2:15 Pacific time and lasted for about 25 minutes before connections began to be restored. Google DNS may also have been affected.

At 2:46 Cloudflare said "the issue has been identified and a fix is being implemented." CEO Matthew Prince explains that it all came down to a bad router in Atlanta... The company also issued a statement via email emphasizing that this was not an attack on the system...

Discord, Feedly, Politico, Shopify and League of Legends were all affected, giving an idea of the breadth of the issue. Not only were websites down but also some status pages meant to provide warnings and track outages. In at least one case, even the status page for the status page was down.

Other sites that went down briefly include Discord, Patreon, Deliveroo, GitLab, Zendesk, and Medium, according to the Verge. And FBI.gov was also down, reports PC Magazine, as was Mashable's own site, Mashable.com, "and even Downdetector.com, a real-time outage monitoring website."

They note that Cloudflare serves 26 million sites, and Cloudflare's CEO acknowledged tonight that the outage affected about 50% of their private backbone.
This discussion has been archived. No new comments can be posted.

Cloudflare's 23-Minute Outage Today Also Took Down Major Web Sites

  • by Way Smarter Than You ( 6157664 ) on Saturday July 18, 2020 @01:17AM (#60302981)

    C'mon, they're not Slashdot. They can afford redundancy in hardware, routes, ports, cables, upstream links, and so on.

    Ridiculous.

    • Not an accident, frailty by design, a very convenient "off" switch

    • by Anonymous Coward

      They can afford redundancy in hardware, routes, ports, cables, upstream links, and so on.

      Of course they do. They redirected traffic off a backbone link for maintenance, onto two other links, one of which they fucked up the router's configuration file on.

      Three-way redundancy, capable of degrading to two-way redundancy for planned maintenance.
      That's why only 50% of that link's traffic failed and not 100%.

      Had the tech not screwed up the configuration, they would have disabled an entire backbone link, still had dual redundancy on it, and you wouldn't even have noticed.

      • Even that shouldn't have been able to happen, and I'm sure they are working very hard right now to fix the underlying issues. In a piece of infrastructure as critical as CloudFlare (which hosts about 20% of the entire web IIRC) no single tech should have been able to put a configuration file live, even as a temporary patch during maintenance, without thorough inspection and testing.

    • by antdude ( 79039 )

      I see this happen with many high-tech companies. EarthLink, for example, was out for a couple of days a few weeks ago due to water issues!

  • I'm not surprised so many sites put their eggs in one basket, but it says something about responsibility when "legit" sites decide to use an unreliable service that also protects the *chan-clones and the piracy and illegal (as in actual crimes like stolen accounts) sites.

    The best thing that could happen is for Cloudflare to fail and for commercial services to stop using it.

  • And, odds are, the timing here (Cloudflare, the Twitter hack, whatever took down Slashdot) is just coincidental. Especially since the Cloudflare guy went out of his way to mention it was not the result of an attack (nor was it a hardware issue, despite using the term "bad router").

    I'm just hoping there aren't more coincidences in early November.

    • by fermion ( 181285 )
      So what did take down /.?
    • https://slashdot.org/story/06/... [slashdot.org] Possibly a repeat of this, but with story IDs instead of comment IDs.
    • "bad router"

      Sounds like another J.J. Abrams production company [wikipedia.org].

    • Several sites and services have been having disruption since Wednesday, including some work sites, an MMO I play, etc. Maybe coincidence. But it's been a frustrating few days (and for some reason, work conference calls stayed up, dammit).

      • Panicked "the network is broken" claims are a frequent escalation from the call center, from developers, and from junior admins. Teaching such personnel how to verify that it _is_ the network, or not, is often part of my mentorship work, because very often it is _not_ the network. And when it is the network, proving it to entrenched network admins who don't appreciate input from outside their department can be very difficult.

  • by spaceman375 ( 780812 ) on Saturday July 18, 2020 @01:30AM (#60303007)
    This points out the need for a local tertiary caching DNS server that provides a 'Last Known Good' service. It seems the current DNS structure honours Time To Live too strictly.
    • What does that add to resiliency, beyond address resolution during an event? The last known good address could be stale due to load balancing or DDOS prevention.

      • That's why I said tertiary. If your first two servers have no answer, last known good is your best bet. It won't confuse load balancing or DDOS mitigation at all. It won't help if they even give a wrong answer. Can you propose something better?
        • That kind of fallback can be a security issue. If an attacker can force you to use known bad data with a denial-of-service attack, then the mechanism you have for marking information as bad becomes ineffective.
      • It's typically nothing but a checkmark on a high-availability checklist. The tertiary is still pulling from upstream somewhere, and the data expires, much as a squid server is not a substitute for a website with active content.

    • Re: (Score:2, Troll)

      by Dracos ( 107777 )

      The Internet was engineered to route around errors.

      The World Wide Web has managed to "fix" that "bug" by putting critical infrastructure in the hands of a select few third parties.

      • Seriously. When registering a domain name, there are usually options to set three or more name servers. God only knows why morons consistently use only two name servers, both run by the same company. When that one company running both of your name servers has issues, your site goes dead.
    • It really doesn't. Every major OS, as well as the DHCP protocol itself, is set up with the ability to fall back to other DNS servers. This applies both at the OS level and to caching DNS servers looking upstream or to the root DNS servers directly.

      I noticed the Cloudflare outage on my desktop PC, which presented like an internet outage, but curiously my laptop was fine. The difference was the laptop had a secondary DNS server configured and happily fell back to using Google, while on my desktop I had left that field blank.

    • You can configure unbound to sort of do this. You can tell it to serve expired records (with a 0 TTL); it then fetches the current record in the background so that the next request gets a fresh record. I'm not sure, though, how it would handle being unable to refresh the record, and whether it would keep serving 0 TTL records. Cloudflare DNS actually does this: it serves 0 TTL records and refreshes them for the next query.
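      For reference, a minimal unbound.conf sketch of that serve-stale behaviour (the option names are real unbound options; the numbers are just illustrative values to tune):

      server:
          prefetch: yes                       # refresh popular records shortly before they expire
          serve-expired: yes                  # answer from cache with TTL 0 when a record has expired
          serve-expired-ttl: 86400            # only serve records up to a day past expiry (0 = no limit)
          serve-expired-client-timeout: 1800  # wait up to 1.8s for a fresh answer before falling back to the stale one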

  • Something to expect more of... and it is what you deserve when you go against the grain/principles of the internet itself.

    Capitalism...after a while money/power becomes centralized.

    In hindsight, two such opposing concepts should never have been allowed to meet.

    Been on the internet since 1995. It was not much to look at back then, still it was much more fun than it is now.

    Hmm... a yard? Is it mine?

  • DNS? (Score:4, Interesting)

    by h33t l4x0r ( 4107715 ) on Saturday July 18, 2020 @02:04AM (#60303055)
    If the first DNS server fails, it should fail over to the second one. So 1.1.1.1 (CF) and 8.8.8.8 (Google) seem like they should cover your bases pretty well.
    • This is precisely what my laptop did. On my desktop I for some reason didn't have a secondary configured. I figured this out when I shouted "internet's down" through the house only to get a reply of "it's working fine here" from the living room.

    • Comment removed based on user account deletion
      • For many people it was. The actual service issues were minor for the network as a whole, but 1.1.1.1 was in fact unreachable, meaning that if you didn't have a fallback configured (like I didn't) then the internet looked down. If you did have a fallback configured (every other computer in the house did; I don't know why I didn't set it up correctly here), then unless you lived in a very specific location there's a good chance you wouldn't have noticed the problem.
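        For anyone wanting to avoid my mistake, a minimal /etc/resolv.conf sketch with a fallback resolver looks something like this (the glibc resolver tries servers in the order listed; the addresses are the public resolvers discussed above, and the timeout/attempts values are just examples):

        nameserver 1.1.1.1            # Cloudflare, tried first
        nameserver 8.8.8.8            # Google, used when the first server doesn't answer
        options timeout:2 attempts:2  # fail over after ~2 seconds instead of the default 5

        On systems running systemd-resolved or NetworkManager this file is usually generated for you, so the equivalent servers go in their configuration instead.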

  • by troutman ( 26963 ) on Saturday July 18, 2020 @03:23AM (#60303189) Homepage

    As usual, this outage was caused entirely by humans and an unforeseen combination of conditions that made things go off the rails.

    The router in Atlanta was not “bad”, and Google DNS was not affected either.

    They even have a blog post laying out exactly what went wrong. They have already addressed some of the underlying network design assumptions and issues that indirectly led to this outage.

    https://blog.cloudflare.com/cl... [cloudflare.com]

    • by gwgwgw ( 415150 )

      Thanks for the link. It provided the sort of report that I could appreciate from someone close enough (and timely enough!) to the issue to be believable.

      If only everything mentioned here on /. had such a link.

  • ajax.googleapis.com goes down and half of the internet is broken
    cdnjs.cloudflare.com goes down and half of the internet is broken
    1.1.1.1 or 8.8.8.8 goes down and half of the internet is broken

    The internet was designed to be resilient and to resist parts of it going dark. What we ended up with instead is a global network at the mercy of the goodwill and competence of a handful of operators.

    • > Not only were websites down but also some status pages meant to provide warnings and track outages. In at least one case, even the status page for the status page was down.

      Who has a status page using the same infrastructure? Who then goes one further and does the same for the status page for the status page?

      If only there'd been a status page for the status page for the status page...

    • Except that's not true at all. The internet was never designed to route around a fault automatically. The internet was designed to allow operators to route around faults. There's a key difference, and it is exemplified here: Cloudflare had a fault, and routed around the problem in minutes.

      And no "half the internet" wasn't broken. In fact people who had proper DNS configuration (not myself, I didn't configure a secondary DNS on one of my machines it seems) didn't even notice that Cloudflare h

      • by bn-7bc ( 909819 )
        Well, the problem is that DNS server selection is not implemented as primary/secondary in most OSes; they round-robin. So if you have two different providers with different response times, you end up with varying resolve times, i.e. not always as fast even when the primary is up. So people override the system, either by configuring only one DNS server or by putting the same address in both fields.
        • by bn-7bc ( 909819 )
          Sorry for replying to myself, but I forgot something and Slashdot lacks an edit function. Come on Dice, it's 2020; an edit function is not exactly hard. Anyway, another issue is that DNS is now being used for functions it was never designed for (content filtering, etc.), which makes people even more picky about which DNS server they want to serve them.
        • by Bengie ( 1121981 )
          Quite a few systems will send a DNS query to all registered servers, and whichever responds first is the one that is used.
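          A rough sketch of that "race the servers" behaviour in Python (assuming the third-party dnspython package; the resolver addresses are just example public resolvers and the helper names are only for illustration):

          import concurrent.futures
          import dns.resolver  # third-party "dnspython" package

          RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]  # example public resolvers

          def ask(server, name):
              r = dns.resolver.Resolver(configure=False)
              r.nameservers = [server]
              r.lifetime = 2.0  # give up on this server after 2 seconds
              answer = r.resolve(name, "A")
              return server, [rr.address for rr in answer]

          def race(name):
              # query every server at once and keep whichever answers first
              with concurrent.futures.ThreadPoolExecutor(len(RESOLVERS)) as pool:
                  futures = [pool.submit(ask, s, name) for s in RESOLVERS]
                  for fut in concurrent.futures.as_completed(futures):
                      try:
                          return fut.result()
                      except Exception:
                          continue  # that server failed or timed out; wait for another
              raise RuntimeError("no resolver answered")

          print(race("example.com"))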
      • > The internet was never designed to route around a fault automatically

        I can attest that "The Internet" was designed to handle many classes of failure quite automatically. It's what DNS, TCP, routers and load balancers were designed to do, for example. It is not "foolproof" because fools can make quite dramatic errors like this one.

        • They aren't "the internet", they are redundancy of singular components. They all rely on pre-configured fall-backs of individual equipment. It does not in any way take into account a misunderstanding of a configuration. The internet won't route around this, just like it won't route around a faulty BGP announcement or won't route around an unavailable ASN route.

          People route around those. Always have. All "the internet" does is try to prevent those points from failing through reliable and redundant equipment.

    • by bn-7bc ( 909819 )
      Well, here is a newsflash: the internet still technically works, i.e. packets that endpoints know how to address still get routed to their destinations, so the internet still works as intended. That most endpoints have chosen one of two DNS providers is neither here nor there. This issue just highlights that the ISPs that are supposed to provide stable and fast DNS (because any delay makes customers switch DNS) do a really poor job of it, or else people would not bother.
  • I mean, I get the reason behind Cloudflare; but if you want to be dumb enough to put your eggs in one basket... then good luck.
  • We never had an outage with them.
  • by account_deleted ( 4530225 ) on Saturday July 18, 2020 @10:16AM (#60303741)
    Comment removed based on user account deletion
    • CDNs sitting in front of websites are location dependent. The IP address 1.1.1.1 is not. Most of the world would not have noticed the Cloudflare outage unless they had DNS configured to use 1.1.1.1 and had no fallback configured. If you don't use Cloudflare for DNS or have a backup, then you would only have been affected by this bug if you were served content by the specific system which was overloaded.

      That is why 1.1.1.1 is significant: it is the only component of this outage which was location-independent.

      • Comment removed based on user account deletion
        • This is completely untrue. Most people use their ISP's DNS server, and it was most certainly noticed.

          Please re-read the sentence you were replying to. Pay attention to the subject of the sentence; specifically, the subject is people who were using Cloudflare DNS.

          It is extremely uncommon to use 1.1.1.1 for your DNS, whereas about 40% of the websites you're likely to visit are served by Cloudflare.

          I'm not sure what you think you're replying to, but somehow you fundamentally missed the point I was making, so let me restate it clearly:

          Cloudflare sits in front of websites, but the bug was location-dependent, so only a small portion of the world would have noticed that those 40% of websites had a problem.
          The thing that makes 1.1.1.1 unique is that it was the only location-independent part of the outage.

  • by Murdoch5 ( 1563847 ) on Saturday July 18, 2020 @10:20AM (#60303753) Homepage
    It's amazing how fragile the internet truly is: one bad router and MILLIONS of sites get taken out.
    • The internet had no problem. One company's service went down; it just so happens that a significant number of websites use that one company's service.
    • No sites got taken out. It appeared as though sites were unreachable from a specific location if you used their DNS; unfortunately, one of the things that became unreachable was the DNS resolver itself. If you had a fallback configured, then that kept working.

      • I understand what happened; the point is that the internet isn't very stable. If a few key routers go offline, LARGE patches of national internet service go down, because the entire system is built on an agreement to be careful, not screw up the configs, and pray that hardware doesn't fail (but don't actually pray, that doesn't help).
        • I still disagree with the sentiment. It's unstable but managed under watchful eyes. Hardware fails all the time, yet large patches of national internet service rarely ever go offline. Equipment is generally well managed, and redundant to the point of incredible reliability and stability. Every so often a human error takes something down, and this kind of human error is not something *anyone* can design around.

          Even with the potential problems, what is clear even from this incident is that these errors get caught and routed around quickly.

  • As the old cliche goes:

    "Two is one, and one is none."

    This is why my DNS points to 8.8.8.8, 1.1.1.1, and 8.8.4.4. Though I think I may have to secure another DNS provider if things keep going the way they are... :-/
