Facebook Unveils Details of Downtime
An anonymous reader writes "Facebook officially gave out more technical details on the endless loop in a database control mechanism that forced a 2.5-hour shutdown of the social site, and the resulting combination of a productivity burst, increased fertility (check back on June 25, 2011) and mass hysteria all around the world."
not very technical (Score:5, Interesting)
Re:not very technical (Score:5, Informative)
Correct link to technical details:
http://www.facebook.com/note.php?note_id=431441338919&id=9445547199&ref=mf [facebook.com]
(anon because I'm not a karma whore)
Re: (Score:2)
Correct link to technical details:
Sounds like someone didn't do any testing, or their testing is inadequate.
Re:not very technical (Score:5, Funny)
Anoesj Sadraee It's great to hear and see that big companies like Facebook are so open with what they do. That's rare, very rare. Thanks!
Anne Uriarte ~facebook is stiLL sooo sLow for uz irr! >;'((
Phil McBride this site is becoming less secure lately... hackers are becoming more and more intelligent, i would know, cuz im a white hat lol
Re: (Score:2)
If you think that is funny, try reading the article summary this way:
Re: (Score:2)
Paul Diaz: Will What i Say is get a front page they say facebook down due to server's and we are working hard to fix it get free cash ?:)
Mouhssine Freedom Elmezyani It's very easy to rape facebook !! i know some friends can hack your compt through your electronic adress !& they hacked my compt several time in the pretext of kidding !!
Mauro Guberti I'd like to know what's the necessary qualifications to work like moderator.
And the one that prompted me to close the tab:
Joanne Bozik The following link is the problem.........these people have been sending my name and pic to many stating that I purchased this product and I also in return am receiving the following link in my friends names......Please get after these people
http://www.facebook.com/facebook?v=wall#!/note.php?note_id=431441338919&id=9445547199&ref=mf [facebook.com]
(note, the link is the link to the explanation for the outage)
Re: (Score:2)
"I would know, cuz im a white hat lol"
I'm going to see how many times I can use this line in conversation at work today. It's just brilliant.
Re: (Score:2)
Well that's daft because I can't see much difference in real value whether you have 5000 karma points or 5,000,000 'points' unless you know of a way to convert that to cash.
Re:not very technical (Score:5, Funny)
Re: (Score:2)
But all the Karma whores offer their services for free anyway.
Re: (Score:2)
I can't see much difference in real value whether you have 5000 karma points or 5,000,000 'points'
The difference is if you have 5,000 karma points...well, you have 5,000 virtual points on a site only geeks care about or even know exist.
If you have 5,000,000 points, on the other hand, THAT'S OVER NINE THOUSAAAND!!!
Re: (Score:2)
Points let you level up, upgrading your abilities and attributes, while karma points let you select a more powerful base class on your next playthrough.
Re: (Score:1)
But breaking a production system for that is either stupid or malicious.
I'd guess he had a nerd rage fit, targeted at management most likely, something I think most of the people here can somewhat relate to, but really, what he did is way over the top.
OH NOES (Score:5, Insightful)
Meh, it happens...just like a power company, no one says a word when the thing works fine for weeks or months at a time...but when it goes down for a couple hours, people act like it never works.
Re: (Score:3, Insightful)
Re: (Score:1)
But it's a helluva lot more important for a power company to stay up than FB; no power can cause serious problems. But FB down for two hours, man, the gods forbid you actually are productive or something . . .
This. FB needs perspective [bit.ly].
Paywall (Score:2)
FB needs perspective [New York Post].
That'd be a bit more convincing if it were more than just an ad for a $3.95 article.
Re: (Score:1)
Official Technical Details (Score:4, Funny)
Obviously, the error was caused by too many people not keeping it real.
Re: (Score:3, Insightful)
Link to Facebook Blog Post (Score:5, Informative)
Since the link in the summary is broken, this is the facebook blog post [facebook.com].
Post contents:
Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
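The feedback loop described above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not Facebook's code: the function names, the single `"config"` key, the `"good"`/`"bad"` values, and the client and capacity numbers are all made up for the example. Each tick models every client reading the cache at the same moment.

```python
# Toy model of the feedback loop; every name and number here is an assumption.

def is_valid(value):
    """Hypothetical validity check: only the string "good" passes."""
    return value == "good"


def run_tick(cache, stored_value, clients, capacity):
    """Simulate one time step and return the number of database queries sent."""
    # All clients look at the cache at the same moment (they run concurrently).
    if is_valid(cache.get("config")):
        return 0  # everyone gets a valid cached value, so nobody queries
    queries = clients  # every client tries to repair the value itself
    served = min(queries, capacity)
    failed = queries - served
    if served > 0:
        cache["config"] = stored_value  # successful queries refill the cache
    if failed > 0:
        # The flaw: a query error is treated like an invalid value,
        # so the cache key is deleted again.
        cache.pop("config", None)
    return queries


def simulate():
    """Replay the outage in six ticks and return the query count per tick."""
    cache = {"config": "good"}
    counts = []
    counts.append(run_tick(cache, "good", clients=1000, capacity=100))  # normal
    cache["config"] = "bad"  # the bad change reaches the cache
    counts.append(run_tick(cache, "bad", clients=1000, capacity=100))   # stampede
    counts.append(run_tick(cache, "good", clients=1000, capacity=100))  # store fixed
    counts.append(run_tick(cache, "good", clients=0, capacity=100))     # site off
    counts.append(run_tick(cache, "good", clients=50, capacity=100))    # slow ramp
    counts.append(run_tick(cache, "good", clients=1000, capacity=100))  # full load
    return counts


if __name__ == "__main__":
    print(simulate())
```

In this model the third tick still sends 1000 queries even though the stored value is already fixed, because failed queries keep deleting the key. Load only drops after traffic is cut and then readmitted below the database's capacity.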
The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
This got the site back up and running today, and for now we’ve turned off the system that attempts to correct configuration values. We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.
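The post does not say what the new design looks like. As a hedged sketch of one common pattern for handling feedback loops and transient spikes, a client can treat a store error differently from an invalid value, keep whatever it has cached, and wait an exponentially growing, jittered delay before retrying. `StoreUnavailable`, `read_config`, `backoff_delay`, and the `"good"` validity check are hypothetical names chosen for this example.

```python
import random


class StoreUnavailable(Exception):
    """Hypothetical error raised when the persistent store cannot answer."""


def is_valid(value):
    """Hypothetical validity check: only the string "good" passes."""
    return value == "good"


def backoff_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def read_config(cache, fetch, attempt=0):
    """Return (value, retry_after_seconds); retry_after is None on success.

    A failed query never deletes the cache key, and an invalid value coming
    from the store is not re-queried immediately.
    """
    value = cache.get("config")
    if is_valid(value):
        return value, None
    try:
        fresh = fetch()
    except StoreUnavailable:
        # Store error: keep the cached value and ask the caller to wait.
        return value, backoff_delay(attempt)
    if is_valid(fresh):
        cache["config"] = fresh
        return fresh, None
    # The store itself holds an invalid value, so re-querying cannot help yet.
    return value, backoff_delay(attempt)
```

The jitter spreads retries out so that clients that failed together do not all come back at the same instant, and the cap keeps the wait bounded.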
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.
Re:Link to Facebook Blog Post (Score:4, Interesting)
Re: (Score:1)
This shows the beginning of the end for Facebook. The summary they provided reveals many details, such as the fact that they don't have a QA environment or regional segments or anything. It's pretty dangerous to run a site that big like that. And I've read much more they've released that basically says they just hacked mysql replication to update their caches to get it real time across regions. What they should have done to horizontally scale is to implement regional shards and then some type of inter
Re: (Score:1)
We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously.
Just like their commitment to security, I must ask this:
SINCE WHEN?
Wifey was gutted... I wasn't (Score:5, Funny)
Re:Wifey was gutted... I wasn't (Score:4, Insightful)
"John Marshall: how do i get job workin with facebook i live in newcastle in uk can any one from facebook staff get or can some one give me a email address that i can use to contact facebook please"
:|
very doomed.
Does the woman have a Wii? (Score:2)
Re: (Score:1)
Not going to happen.
Re: (Score:2)
Re: (Score:2)
so you, being of independent mind, married a sheeple? you're a sheeple-fucker! and your children will be sheeple!
conclusion: the sheeple aren't doomed....
Re: (Score:1)
Right. . . . [bash.org]
"mass hysteria"? (Score:5, Insightful)
and mass hysteria all around the world.
[citation needed].
First I knew was when I read about it on another tech blog, hours after it'd happened...and I use Facebook. And I work with a ton of people who use it (grad students.)
There wasn't mass hysteria; there was mass ambivalence. I'm now reading all these blog/news postings about how "everyone" went crazy. Nobody was talking about it where I ate dinner. Nobody was talking about it where I had coffee that evening. It didn't make my city newspaper- no "Facebook down, residents in despair" stories to be found.
All this coverage claiming that everyone went nuts seems like a desperate attempt by Facebook PR to make something positive out of this...namely, trying to convince us that Facebook is so integral to the people who use it, it must, of course, be to us as well.
Re:"mass hysteria"? (Score:4, Funny)
To me, Facebook is about as integral to my life as the toilet is. It's there, it's gonna be used every once in a while and it involves a bit of dirty business that you just can't avoid.
Re:"mass hysteria"? (Score:4, Insightful)
It's a toilet all right, but a very annoying one: you hear every flush coming from your friends' toilet as well.
Re: (Score:3)
Really?
The first time I see notice of anyone's flushes, I block the flush notices.
Seems to me, if you're complaining about it, you don't really know what you're doing (and that's considering all you have to do is hover your mouse over their post).
Re: (Score:2)
You've never used a Bidet have you.
It makes the "dirty business" downright civilized.
That said, you do NOT have to use the toilet paper that is facebook. really.
Re: (Score:2)
Re:"mass hysteria"? (Score:4, Insightful)
[citation needed].
[citation] [wikipedia.org]
Re: (Score:3, Interesting)
Re: (Score:2)
They 'need to come up for air' every 2.5 hours? I don't know where to start.
Re: (Score:2)
Haven't you seen Waterworld?
Re: (Score:2)
OK- I guess I should have started...
I get the idiom; I think it's sad that taking a break after 2.5 hours is worthy of the phrase... Is 2.5 hours of concentrating outside the realm of normal?
Re: (Score:2)
Actually, all the coverage I've seen has been slanted like the summary above - that is, an attempt to denigrate and marginalize those that use Facebook. See the comments in Slashdot's previous coverage for some pretty clear examples of this.
Re: (Score:2)
you are so on the right track there. the desire to spread news, especially bad news, is screwing with 'the system' there were places built to be safely ignorant... and some to just be quiet and relaxing... oh well.
Re: (Score:2, Insightful)
No one is thinking about the big losses here... (Score:5, Funny)
Re: (Score:2)
please, what are these animals and why when I hear about FB I hear about farms?
Re: (Score:2)
They are a harbinger of the apocalypse.
http://g4tv.com/thefeed/blog/post/702668/DICE-2010-Video-Design-Outside-The-Box.html [g4tv.com]
Re: (Score:1)
Re: (Score:1)
The metaphysical reason was karmic payback (Score:2, Insightful)
If you suck so bad on a global scale long enough, eventually the universe tries to step in.
Farmville continued to run? (Score:3, Insightful)
Does the clock stop for Farmville if Facebook goes down?
Re: (Score:2)
Only if you care.
Re: (Score:2)
The data for these games are hosted on Zynga servers, not Facebook servers. So, yes, the clock keeps running.
Re: (Score:2)
nope, farmville.com -- the fb net protocol didn't go down, just part of its db, so farmville was still playable.
Twitter... (Score:4, Interesting)
Re: (Score:1)
Let's just say that #failbook was rather high in the Twitter trends during the outage ;-)
less productivity actually (Score:3, Funny)
instead of checking facebook every 5 minutes for the latest, very important updates as they always do, people were now constantly hitting reload for 2.5 hours
Re: (Score:1, Troll)
Facebook is just a waste of time.
... said the Slashdot commenter on a Saturday.
Huh? (Score:1)
What is this Facebook thing? Isn't that something kids do on computers?
Fertility was not affected (Score:4, Informative)
My Favorite Downtime (Score:5, Funny)
Summary FTW!! (Score:1)
Eerie sight (Score:5, Funny)
it's only facebook (Score:5, Funny)
Re: (Score:2)
Well, we care because as one of the largest sites, they are expected to have their shit together. So when they don't, it's interesting to see what happened.
It's professional interest, it's not that I'm worried that people couldn't plant their snow peas for two hours, or whatever.
Re: (Score:2)
Re: (Score:2)
No one said it is, actually I don't think I said anything about perfection. Facebook is one of the largest apps in the world, you don't think the problems they face can be informative to some of Slashdot's readership?
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
No.
Somebody made a configuration change, and Facebook's error checking code thought the new configuration was invalid. This code was designed to automatically correct for cache errors by querying the database to get the correct configuration - which it thought was invalid, so it would check again.
It doesn't matter what kind of database you have when you've got a system like this in place.
Obviously, their implementation of this error checking system is fatally flawed. That's why they've turned it off, unti
Re: (Score:1)
Re: (Score:1)