Bug The Internet Databases Social Networks

Why Browsers Blamed DNS For Facebook Outage 96

Posted by timothy
from the three-letters-bad dept.
Julie188 writes "That was probably the only time 'DNS' will ever be a trending term on Twitter. The cause was Facebook's 2.5 hour outage on Thursday, which incorrectly told users trying to access the site that a DNS error was to blame. In truth, experts who've read Facebook's explanation say the site went down because Facebook gave itself a distributed denial-of-service attack when a system admin misconfigured a database. So why was DNS blamed? The 27-year-old communications protocol has been known to cause other, somewhat similar outages."

  • by mfh (56)

    Is contagious, it seems.

    • Indeed. I don't do Facebook, but if I had got such a message, my first response would be to look at my own /etc/hosts file. From time to time I manage to bite myself on the ass with my block-list, but I can live with that...
    • People are always looking to blame someone else for their problems or someone else's. It's just human nature.
      • What percentage of slashdotters actually noticed the facebook outage when it happened? As opposed to merely participating in the post-hoc commentary after they read about it. It should have been posted to slashdot's idle category.
        • I disagree (Score:1, Informative)

          by Anonymous Coward

          It is the most used website in the world (more userhours/month spent on Facebook than any other site), the fastest growing internet community (when measured in new users/month), etc... And as such it is an engineering masterpiece (in software engineering and probably in several other areas, too). When it goes down for several hours, it is a newsworthy event.

          For us who work for advertising agencies, FB downtime is also a financially notable event.

          • Re: (Score:2, Interesting)

            by Kvasio (127200)

            Yet, you failed to notice that /. is a site for nerds.
            Many nerds do not strive to cultivate their social skills.
            Checking their friends' status on a social network might not be at the top of their agendas.
            So: event was notable, but not very important to many slashdotters.

        • by ryanov (193048)

          I noticed, and saw the DNS message when it was there. When I read this, I said to myself "umm, why did people blame DNS? That's what the message said!"

  • Because Chrome stopped at "Resolving host".
  • by Anonymous Coward

    It wasn't your browser having a DNS error, it was the user facing servers at Facebook reporting DNS problems talking to whoever they talk to. Maybe when they decided the way to fix the problem was to take down the site, they just removed the back end server cluster from their internal DNS.

  • Duh (Score:5, Insightful)

    by vlm (69642) on Sunday September 26, 2010 @12:00PM (#33703620)

    So why was DNS blamed?

    From http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf&_fb_noscript=1 [facebook.com]

    The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site.

    I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up... they could redirect to another site, Orkut or myspace would have been mildly humorous. I am mildly surprised they don't have a simple emergency box with a simple static "undergoing repair" page, but, whatever ...

    So, other than zapping the A records and waiting, what are they supposed to do? Bonus points if they were doing DNS based load balancing and simply unplugged their (dns based) load balancer.

    I have no dog in the fight, having deleted my facebook account months ago. It is kind of funny that a page of technobabble is described as "technical details" as if folks like us/me would find it to be a complete description rather than pretty vague. Then again we're dealing with farmville addicts and you can't reason with addicts.

    • by kasperd (592156)

      I'm, uh, taking a wild guess that simply shutting off port 80 is not going to allow for a controllable ramp up...

      Both approaches allow for a controllable ramp up given the right software on their servers. And I think with the typical off the shelf software neither of them allow for a controllable ramp up.

      But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end

      • by vlm (69642)

        But did they even need a controllable ramp up of user requests? It sounded like the overloaded system was overloaded by internal requests, that were unrelated to the number of requests they got from end users.

        When you hear hoofs, think horses not zebras.

        Seeing my servers spike to 100% CPU or 100% I/O and stay there, I'd look outside first before looking inside... So my first goal would be to arrange a controllable ramp up of user requests. If the systems are so overloaded I can't troubleshoot at 100% of users, maybe I COULD log in and troubleshoot at 50% or 90% load.

        Also, I've worked at places that won't upgrade until outages due to high utilization are some large multiple of the cost of upgrading, this would
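The "controllable ramp up" discussed in this sub-thread can be sketched as percentage-based admission. This is only an illustration of the idea, not anything Facebook is known to have done: the function name `admitted`, the user ID format, and the use of a SHA-256 bucket are all assumptions made for the example.

```python
import hashlib

def admitted(user_id: str, admit_percent: int) -> bool:
    """Return True if this user falls inside the admitted slice.

    A stable hash maps each user to a bucket in [0, 100). Users whose bucket
    is below admit_percent are let through, so raising the percentage only
    adds users and never evicts one who was already admitted.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < admit_percent

# Hypothetical user IDs, used only to show the resulting shares.
users = [f"user{i}" for i in range(10_000)]
for percent in (0, 50, 90, 100):
    share = sum(admitted(u, percent) for u in users) / len(users)
    print(f"{percent:3d}% target -> {share:.1%} admitted")
```

Because the bucket depends only on the user ID, an operator could step the percentage up while watching load, and the same users stay admitted from one step to the next.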

    • Re: (Score:3, Funny)

      by PaganRitual (551879)

      This whole situation does explain why my mother appeared to be sick on the couch at my parents' place on Thursday afternoon when I paid them a visit. With all the shaking and huddling under the covers and looking pale-faced I presumed she had come down with the flu or something.

      Then again we're dealing with farmville addicts and you can't reason with addicts.

      They aren't addicts, that's patently unfair. They can stop any time they want. What is most admirable about them is that they are simply so time-savvy that they coincide those times at which they wish to stop with the periods during

  • Ageism (Score:5, Informative)

    by Vahokif (1292866) on Sunday September 26, 2010 @12:05PM (#33703658)

    The 27-year-old communications protocol

    So? TCP/IP is 36 years old.

    • Re: (Score:1, Funny)

      by Anonymous Coward

      PILFS!

    • Re:Ageism (Score:5, Insightful)

      by kasperd (592156) on Sunday September 26, 2010 @01:22PM (#33704098) Homepage Journal
      Some people think technology should be replaced just because it is old. But really, it should be replaced if it doesn't suit our needs and there is a different technology that does suit it.

      It is better to replace a 1 year old technology that does not suit our needs than to replace a 50 year old one that does. Usually when replacing, you want to replace with something newer. But in some cases it may turn out to be better to replace a new and misdesigned technology with an older and proven one.

      That said, there are improvements to both IP and DNS which should be rolled out because they fix real problems. The rollouts are not happening as fast as they ought to, mainly because it is problematic to roll out a change to the entire Internet, especially when not everybody involved is cooperating.

      But I don't think that really has anything to do with this outage.
      • The people who think that are

        1) The people with patents on the new technology, or who are planning to sell stuff for it.
        2) The people who have been convinced by the marketing budgets made possible by 1)

      • I'll take the quality of design of IP or DNS over what passes for "The Web" these days. The browser as a concept is bending towards its breaking point as it tries to cope with the fact it's treated as a clown car.

        I guess it's historical legacy that we started with HTML and crap like that for browser interaction and everything sort of grew from there, but we're doing the whole "web as an applications platform" wrong.
    • Re: (Score:2, Funny)

      by oldspewey (1303305)

      So? TCP/IP is 36 years old.

      Yeah, but it still lives in its parents' basement.

    • Re: (Score:3, Insightful)

      by dlgeek (1065796)
      And is definitely showing its age. There's been a big cry for years from those working at the really high end of networking that we need to replace (really just extend) TCP because it doesn't work well with high bandwidth-delay-product links. This is because the max window size and ramp-up algorithm (slow start) don't allow you to saturate the pipe quickly enough or even at all. There are several proposed extensions floating around to fix the problem but none of them have widespread adoption.

      This actual
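The bandwidth-delay-product argument above comes down to simple arithmetic, sketched here. The link speed and round-trip time are illustrative assumptions, and the 65,535-byte figure is the largest window the 16-bit TCP header field can carry without the window scaling option, which raises that ceiling; the numbers describe the unscaled case only.

```python
# Worked arithmetic for the bandwidth-delay product (BDP) argument.
# LINK_BPS and RTT_S are assumed values chosen for illustration.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the link full."""
    return bandwidth_bps * rtt_s / 8

def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s

LINK_BPS = 10e9          # assumed 10 Gbit/s path
RTT_S = 0.1              # assumed 100 ms round trip
CLASSIC_WINDOW = 65_535  # unscaled 16-bit TCP window, in bytes

needed = bdp_bytes(LINK_BPS, RTT_S)
ceiling = window_limited_bps(CLASSIC_WINDOW, RTT_S)
print(f"bytes in flight needed to fill the link: {needed / 1e6:.1f} MB")
print(f"throughput ceiling with an unscaled window: {ceiling / 1e6:.2f} Mbit/s")
print(f"fraction of the link that ceiling represents: {ceiling / LINK_BPS:.4%}")
```

Under these assumptions the sender can use only a tiny fraction of the path, which is the sense in which a small window "doesn't allow you to saturate the pipe".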
      • by HBI (604924)

        There is a whole market devoted to handling high delay TCP connections. It works. It's what I do. Well, part of it.

        Replacing the protocol for this reason would be kind of lame.

      • How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

        There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free. Second, requiring strong digital ID makes it difficult for someone living under a government that suppresses spe

        • by jgrahn (181062)

          How much easier would fighting spam be if SMTP had a strong authentication system for sent messages?

          There is one, called OpenPGP. There is another one, called S/MIME. Implementation of these in real-world MUAs awaits a decision on best practices for how strong the authentication needs to be. Stronger authentication has two downsides. First, the cost of obtaining a digital ID goes up with strength; even with the OpenPGP web of trust, travel to a key signing party hundreds of km away is not free.

          You wouldn'

          • If everyone had as a personal policy "only read OpenPGP-signed mail, and distrust mail signed with a key I haven't personally downloaded from a key server"

            Then it would still fall under the "Requires immediate total cooperation from everybody at once" line of the well-known copypasta [craphound.com], and possibly "Mailing lists and other legitimate email uses would be affected" and "Many email users cannot afford to lose business or alienate potential employers" depending on how it is implemented.

    • So? TCP/IP is 36 years old.

      And can't even cope with lossy network connections (i.e. mobile).

      • by godefroi (52421)

        On the contrary; it copes very well with lossy network connections. The real problem is YOU and your insistence on receiving everything that was sent, and in the correct order even.

        If you were willing to see half-pages and miss images, then UDP would be a splendid protocol for you, and you wouldn't have to wait for timeouts and retransmissions.

        • On the contrary; it copes very well with lossy network connections

          I suspect this is going for Funny, but just in case: the basic problem is that TCP congestion control sees a lossy network as busy and backs off on transmission speed.

          It's an open research topic, and currently handled in L2 on mobile networks since TCP can't cope.
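The point that congestion control reads loss as congestion can be illustrated with a toy additive-increase/multiplicative-decrease (AIMD) sender. This is a deliberately simplified model, not real TCP: there is no slow start, no timeouts, and the loss rates, window cap, and function name are assumptions made for the sketch.

```python
import random

def aimd_average_window(loss_rate: float, rounds: int = 10_000,
                        max_window: float = 100.0, seed: int = 1) -> float:
    """Average congestion window of a toy AIMD sender.

    Each round trip the window grows by one segment (additive increase).
    If any segment in the window is lost, the sender assumes congestion
    and halves the window (multiplicative decrease), whatever the real cause.
    """
    rng = random.Random(seed)
    window, total = 1.0, 0.0
    for _ in range(rounds):
        total += window
        # Probability that at least one of `window` segments is lost.
        p_any_loss = 1.0 - (1.0 - loss_rate) ** window
        if rng.random() < p_any_loss:
            window = max(1.0, window / 2)
        else:
            window = min(max_window, window + 1.0)
    return total / rounds

for loss in (0.0, 0.001, 0.01, 0.05):
    avg = aimd_average_window(loss)
    print(f"loss {loss:.1%}: average window {avg:.1f} segments")
```

In this model the random loss is unrelated to how busy the path is, yet the average window still shrinks as the loss rate rises, which is the behaviour the comment above describes for lossy mobile links.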

  • Terrible article. What "DNS error"? Is Facebook running its own DNS servers that do something funny, or what?

    As for DNS "moving to the cloud", DNS is already far more distributed than any of the "cloud" systems. Which is a good thing.

  • Yeah...from reading that Facebook note, it's pretty clear that DNS had nothing to do with the outage. Do you guys think the outage would've been better or worse had it been one?
  • by j_col (1895476)
    I found the genuine panic from many Facebook users to this outage very amusing.
    • by Anonymous Coward

      I found the genuine panic from many Facebook users to this outage very amusing.

      I, too, laugh at the misfortune of others.

      • by Skylinux (942824)

        Des einen Leid ist des anderen Freud. -- One man's sorrow is another man's joy.

    • Re: (Score:2, Flamebait)

      by kiwimate (458274)

      I suppose if I were an angst-ridden bitter friendless teenager I may have found it amusing too. Luckily, I'm an adult. (How sad that this comment is currently marked insightful.)

      And - really? Genuine panic? I think that says more about the specific subset of Facebook users within your anecdote set than anything else. Or do you also extrapolate out from the frequent racist troll comments on Slashdot?

    • by definate (876684)

      LOL Yeah it was hilarious [publicradio.org] when people [dailymail.co.uk] were complaining about being unable to get on Facebook. So funny that people need services to keep in contact with others, it's like why don't you just talk to them in person? I mean like, HELLO, am I the only one getting this? Geeze. If it's so important to you then you should be more redundant with your services, like, everyone knows that!

  • It's an advertisement platform that rides solely on the ignorance of its users. So people had two and a half hours to take a break from their narcissism... this is something worthy of finger pointing?
    • by Anonymous Coward

      high and mighty non-trend followers are pretty trendy, just saying...

    • by _Shad0w_ (127912)
      There are adverts on there?
      • Re: (Score:3, Funny)

        by Sir_Lewk (967686)

        No. Facebook doesn't do data-mining, and they don't serve ads. They simply pull money out of their ass.

    • Re: (Score:3, Insightful)

      by kiwimate (458274)

      So is Slashdot.

      I don't know that finger pointing is necessarily healthy - that tends to suggest CYA and childish blame games. But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

      • by Skylinux (942824)

        But on a technical IT focused web site, one might suppose that a lessons learned exercise on the root cause of the failure of a massive website would be of interest and hopefully even an educational experience.

        I can't remember ever seeing an article about a major outage from some big website where they delivered enough information that one could learn from it.
        So what did you learn from this article? Don't fuck up when you are an admin or maybe to create a better error page.
        Yes very informative and of interest to nerds, indeed.

  • With all the adblocking software, no scripting, and other misc Firefox plugins I have no ads from facebook, let alone any other page. Firefox is set to delete cookies and BS on exit and I keep my machine clean with bleachbit.

    Facebook allows me to connect with lots of people I never would see, like buddies in the Army based in Japan for instance or friends in New York etc, etc.

    I hide all the annoying spam ads for people's stupid farms, I have convinced many of my Facebook friends and family to stop playing t

    • hell even my Linux distro has political agendas, ( damn Mint Linux!!! )

      I had never heard of them (I'm an old Slackware hand, and more recently Arch), but Mint's webpage is so incredibly slow to load, it's impossible to see what that agenda is. It doesn't inspire much confidence in them. :-|
      • Re: (Score:1, Informative)

        by Anonymous Coward

        You won't find it on the home page. It was a post by a developer on the dev blog. He later removed it and apparently moved it to his personal blog.

        Palestine Written by Clem on Sunday, May 3rd, 2009 @ 12:34 am | Main Topics

        This is not the place to talk about this but I am deeply touched by what is happening over there. I feel disgust and guilt with us passively witnessing it and our money and weapons supporting it. I don't want to use my name or this project to push my own ideas about this but I spend a lot of time working and giving away, sharing and receiving to and from a lot of people.

        I'm only going to ask for one thing here. If you do not agree I kindly ask you not to use Linux Mint and not to donate money to it.

        I hope for these people to be able to live decently in the future and for me not to have anything to do with the misery they're in at the moment.

        I promise not to talk about this anymore. I don't want any money or help coming from Israel or people who support the action of their current government.

        Thank you for your understanding. This is very important to me.

  • by ivan_w (1115485) on Sunday September 26, 2010 @01:06PM (#33704004) Homepage

    The confusion might have come from the fact that when I looked, there seemed to also be some DNS problem.

    Basically, when asking directly, the servers that are authoritative for the zone were giving me a CNAME for the 'ANY' query, but not the associated A records, which they should, since the CNAME was pointing to a host name within the same authority. At this point, any sensible resolver stops asking!

    This only lasted for a little while though - so it might have been a glitch or possibly a deliberate action related to how they were trying to fix the underlying issue itself - possibly averting traffic until they actually solved the actual problem.

    --Ivan

    • by Skapare (16644)

      This kind of thing can happen when records are being changed (say from A to CNAME) and the A record has not yet expired from your cache. Did you do a "dig +trace" around the cache to verify?
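Asking an authoritative server directly, as described in this thread, means sending it a non-recursive query and looking at what comes back. The sketch below only builds such a query in the RFC 1035 wire format and reads the answer count from a response header. The helper names, the query ID, and the choice of `www.facebook.com` are assumptions for illustration, and the network send is left out so the block runs offline.

```python
import struct

QTYPE_A, QTYPE_CNAME, QTYPE_ANY = 1, 5, 255
QCLASS_IN = 1

def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query (RFC 1035 wire format)."""
    # Header: ID, flags (all zero: standard query, recursion not desired,
    # as when asking an authoritative server directly), QDCOUNT=1, rest 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)

def answer_count(response: bytes) -> int:
    """Read ANCOUNT, the number of answer records, from a response header."""
    return struct.unpack("!H", response[6:8])[0]

packet = build_query("www.facebook.com", QTYPE_ANY)
print(len(packet), "bytes:", packet.hex())
```

To use it for real, the packet would be sent over UDP to port 53 of one of the zone's name servers, and the reply inspected to see whether it carries only the CNAME or the A records as well.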

  • Cause browsers is stoopid!

  • It wasn't DNS. (Score:1, Redundant)

    by meerling (1487879)
    That was obvious, it showed symptoms of a DDoS attack, not a DNS problem. I find it funny it was caused by their own error.
  • To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover.

    Even when the database has a valid value, if failures to get a value from the database can create a growing cascade of errors, then this design is still poised for a future failure for simple things like a partial outage of databases or network access to them. Ideally, once the data was valid, the number of clients not getting a valid value should gradually decrease as more and more get valid values and don't have to requery. But if the scale was such that none could get anything when all were trying (a
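The difference between a load that "gradually decreases" and a feedback loop that never recovers can be shown with a toy model. Everything here is an assumption made for illustration (key count, clients per key, database capacity, random service order); it is not a reconstruction of Facebook's system, only of the mechanism quoted above, where a failed query deletes the cache key.

```python
import random

def uncached_after(rounds: int, delete_on_error: bool, keys: int = 1000,
                   clients_per_key: int = 5, db_capacity: int = 1000,
                   seed: int = 7) -> list[int]:
    """Toy model: number of uncached keys after each round.

    Every uncached key is queried by `clients_per_key` clients per round.
    The database answers at most `db_capacity` queries, chosen at random;
    the rest fail. A success fills the cache. With `delete_on_error`, any
    failed query for a key also deletes it again, undoing the fill.
    """
    rng = random.Random(seed)
    uncached = set(range(keys))
    history = []
    for _ in range(rounds):
        queries = [k for k in sorted(uncached) for _ in range(clients_per_key)]
        rng.shuffle(queries)
        served, failed = queries[:db_capacity], queries[db_capacity:]
        filled = set(served)
        if delete_on_error:
            filled -= set(failed)  # a failing client wipes the fresh entry
        uncached -= filled
        history.append(len(uncached))
    return history

print("delete on error:", uncached_after(6, delete_on_error=True))
print("keep on error:  ", uncached_after(6, delete_on_error=False))
```

With the same overloaded database, keeping the cache entry on error lets the backlog drain within a few rounds, while deleting on error leaves almost every key uncached, so the query load never drops.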

  • ... and browsers will think their DNS is dead ... because, well, it is ... and is the first thing a browser needs to access.

  • The "error page" is clearly a Facebook server reporting a DNS failure within Facebook's own network. Facebook requests are processed by user-facing servers which make RPC calls (not HTTP) into Facebook's internal network. Machines in multiple locations may be involved in generating a single Facebook page. If their in-house DNS system for organizing their internal network failed, they might produce messages like that.

    • by rekoil (168689) on Sunday September 26, 2010 @02:21PM (#33704420)

      It didn't fail, they turned it off. This was the easiest way to "shut off the entire site" as their post-mortem describes. The DNS errors users saw were being generated by the front-end HTTP proxies, not by client browsers, which caused most of this confusion. Once the database issue cleared, they reactivated the DNS entries for the back-end servers one cluster at a time and the site came back.

      • by Skapare (16644)
        You seem informed. Maybe you can explain why it is that clients would not be picking up the corrected info and reducing their "attack" on the database servers (more so than everything being turned off and back on).
        • by rekoil (168689)

          This is explained in the post-mortem. Basically, the problem was that clients were reacting to corrupt data being served up by the origin DB cluster the same way that they reacted to bad data coming from the memcached cluster - by deleting the offending entry in memcached and re-sending the query to the origin DB. So a client queried the origin, got bad data, and then deleted the key from memcached - resulting in every other client (tens of thousands of them, most likely) then querying the cluster for the s
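One way to avoid the behaviour described above is for the client to treat trouble at the origin differently from a bad cached value. The following is a minimal sketch under stated assumptions: `read_config`, `OriginError`, the dict-based cache, and the validation predicate are all hypothetical names for illustration, not Facebook's code or the memcached API.

```python
class OriginError(Exception):
    """The origin database could not answer (overload, timeout, ...)."""

def read_config(key, cache, origin, is_valid):
    """Cache-aside read that never deletes a key because the origin misbehaved.

    `cache` is a dict-like store, `origin` a callable that may raise
    OriginError, and `is_valid` a validation predicate; all three are
    placeholders for illustration.
    """
    cached = cache.get(key)
    if cached is not None and is_valid(cached):
        return cached
    try:
        fresh = origin(key)
    except OriginError:
        # The origin is struggling: leave the cache alone and let the caller
        # retry later, instead of forcing every other client to miss too.
        return None
    if not is_valid(fresh):
        # Bad data from the origin is not a reason to touch the cache either.
        return None
    cache[key] = fresh
    return fresh

def overloaded_origin(key):
    raise OriginError("database overloaded")

cache = {"feature_flag": "corrupt"}
result = read_config("feature_flag", cache, overloaded_origin,
                     lambda value: value in ("on", "off"))
print(result, cache)
```

The caller gets no usable value while the origin is down, but the cache is left untouched, so one client's failure does not turn into extra origin queries from everyone else.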

  • Browsers are fucking software. They don't blame anything for anything.

    Facebook was down. That's the only thing that matters to most people.

  • It might be a buzz acronym on Twitter anyway, because it also means DNA in German.

To be a kind of moral Unix, he touched the hem of Nature's shift. -- Shelley
