Facebook Engineers Crash Data Centers In Real-World Stress Test (ieee.org) 52
An anonymous reader writes: In a report via IEEE Spectrum, Facebook's VP of Engineering Jay Parikh described the company's "Project Storm" -- regular takedowns of Facebook's data centers intended to stress test the company's disaster recovery efforts. The first few didn't go so well, he reports. (Perhaps doing a test during a World Cup final was not such a good idea.) Months and months of planning went into the initial effort, though up until the actual moment, other Facebook leaders didn't think he'd actually take out an active data center. "In 2014, Parikh decided Project Storm was ready for a real-world test: The team would take down an actual data center during a normal working day and see if they could orchestrate the traffic shift smoothly," reports IEEE Spectrum. Parikh recalls: "I was having coffee with a colleague just before the first drill. He said, 'You're not going to go through with it; you've done all the prep work, so you're done, right?' I told him, 'There's only one way to find out'" if it works. (Parikh made the remarks at this week's @Scale conference in San Jose.) Parikh says there never seemed to be a good time to perform the live takedowns. "Something always ended up happening in the world or the company. One was during the World Cup final, another during a major product launch." The report adds, "The live takedowns continue today, with the Project Storm team members coming up with crazier and crazier ambitions for just what to take offline, Parikh says."
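For readers wondering what "orchestrating the traffic shift" might involve, here is a rough sketch. It is not Facebook's method: the `DataCenter` class, the `drain` function, the site names, and the capacity numbers are all hypothetical, and the proportional-to-headroom policy is just one plausible assumption. It takes one site offline, spreads its load over the survivors according to their spare capacity, and reports any traffic that has nowhere to go.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    capacity: float  # max load the site can serve (arbitrary units, hypothetical)
    load: float      # load the site is serving right now


def drain(sites, victim):
    """Take `victim` offline and spread its load over the surviving sites
    in proportion to their spare capacity.

    Returns (new_loads, unserved), where `unserved` is the traffic that
    does not fit anywhere once every survivor is full.
    """
    survivors = [s for s in sites if s.name != victim]
    displaced = sum(s.load for s in sites if s.name == victim)
    headroom = {s.name: max(s.capacity - s.load, 0.0) for s in survivors}
    total_headroom = sum(headroom.values())

    # Only as much traffic as the survivors can absorb is actually shifted.
    shifted = min(displaced, total_headroom)
    new_loads = {}
    for s in survivors:
        share = headroom[s.name] / total_headroom if total_headroom else 0.0
        new_loads[s.name] = s.load + shifted * share
    return new_loads, displaced - shifted


# Hypothetical three-site fleet; the drill takes "dc-a" offline.
sites = [
    DataCenter("dc-a", capacity=100.0, load=60.0),
    DataCenter("dc-b", capacity=100.0, load=50.0),
    DataCenter("dc-c", capacity=80.0, load=40.0),
]
loads, unserved = drain(sites, "dc-a")
print(loads, unserved)
```

A non-zero `unserved` value in a dry run like this would be a hint that the real drill is going to hurt, which is presumably the kind of thing months of planning are meant to catch.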
Worth it (Score:5, Funny)
Re: Worth it (Score:5, Insightful)
Re: Worth it (Score:3)
Re: (Score:2)
Considering that Facebook is arguably the world's biggest news service, it actually is sort of important.
News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.
Re: Worth it (Score:4, Insightful)
News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.
In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.
And kudos to their engineering team for not just paying lip service to reliability.
Re: (Score:1)
News DISTRIBUTION service. It's not like they provide any original content like AP, Reuters, etc.
In that AP and Reuters are just distribution services, Facebook is arguably a larger source of original news distribution than those two.
AP and Reuters both have reporters in their employ. FB does not, AFAIK. I'd guess FB also technically could probably be accused of copyright violations regarding reposting AP/Reuters stories. FB is nothing more than a massive "look at me" and gossip site.
And kudos to their engineering team for not just paying lip service to reliability.
I would agree with this. Running real world tests is the only way to be sure.
Re: (Score:1)
Re: (Score:2)
Re: (Score:2)
"Considering that Facebook is arguably the world's biggest marketing service, it actually is sort of important."
There you go. Fixed that for you.
Re: (Score:2)
This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.
After 10,000 likes the radiation poisoning gets better
Re: (Score:3)
This is totally worth it. What would happen if there was a REAL disaster (like a nuclear strike) and people couldn't check their facebook feed and post "thoughts and prayers" messages? Too terrible to think about.
Or maybe given Facebook's system of being able to announce on your feed to your friends and family that you are in fact okay thus reducing panic situations, it's much more important than your prejudices make it out to be.
Re: (Score:3)
Re: (Score:2)
People think it's easier to get an internet connection (and assume that others have an internet connection) and view/post about being ok than it is to simply pick up a phone and call your loved ones, or even stranger, leaving the house and actually checking on them.
Re: (Score:2)
In disasters where internet connections are a major issue, phone calls typically are too. But this isn't the scenario I was talking about. In every disaster, more people are unaffected than affected. Getting a message out casually on social media can help reduce the stress on the phone network, which is better reserved for the actual emergencies going on at the time.
Re: (Score:2)
Re: (Score:3)
Yes, that's exactly what I was saying, and not everyone is in a position in every case to help someone. The idea that an entire city of people will suddenly flock to another to "scramble to help" is simply absurd. The world will keep turning and no one can do 100% all the time, so criticizing people for being on Facebook is not really thinking ahead.
Now on the flip side Italy had an earthquake the other day. My sister was in Italy, I don't know where she was, just that she was travelling through. My first reactio
Re: (Score:2)
The idea that an entire city of people will suddenly flock to another to "scramble to help" is simply absurd.
...is simply absurd. Not sure where you live, but I'm very near to Mississippi (a place where we used to get multiple hurricanes every year). When Hurricane Katrina hit, it was not possible to reach anyone there without traveling to the area (southern-most parts of Louisiana to Mississippi). The only government help in place was positioned at the Walmart(s) to
Re: (Score:2)
Back in the 70s (Before I was in the field professionally, but knew 'enough') I was brought by my dad to Bunker Ramo, who did all the stock market data. (My dad worked on their HVAC)
At the one site, they had TWO mainframes (yes, this was pre IBM PC era) with DRUM memory (Yes, I've seen operational drum memory!! - and I have one word of memory from the computer - discrete transistors!!!)
Anyway, it was a cluster! One computer could take over for the other. Guy said "Oh, that's nothing, there are 2 more in Midt
Somebody Finally Gets It! (Score:5, Insightful)
Good for him! Most DR exercises I've seen are planned weeks, if not months in advance. They are more of a scheduled fail-over to a redundant site and not an actual disaster recovery test.
In the event of an actual disaster, there would be no recovery.
I'm heartened to see SOMEONE does it right.
Re: (Score:2)
Most large banks I have worked with do full DR exercises, and have since the 90's at least. Smaller banks will simulate typically, but one bank I know of actually shifted mainframes to their DR warehouse and brought things up from there.
Now with hot-hot sites, the activity is much more trivial, but it is obviously not a universal thing across all organizations.
Disaster won't wait until you're not busy to strike (Score:1)
Re: (Score:2)
And this, sir, is why IT is in its current lame shape: allowing incompetent people to make key decisions just because they happen to know the right people.
Now, for a real world working DR plan:
10 REM DR MASTER PLAN
20 LET POWER = 0 : REM 0 MEANS OFF
30 PRINT "AIIIIEEEEEEEE!"
40 GOTO 30
Now: *THAT'S* what professionalism looks like.
Re: (Score:2)
facebook and engineering aren't something I'd put into a sentence together.
And how do you think they got to handle the amount of data they did? The same comments rolled in when Walmart released its cloud service. Of course they have an engineering department.
Facebook engineers are working on a scale that most people will never see.
What's this World coming to when twitterbook has to be protected from natural disasters.
A ubiquitous service that most people have access to is one that speeds up disaster recovery. People have already used groups to organize disaster recovery efforts on small and large scales.
No one says you have to use it to upload food selfies.
Netflix Simian Army and Microservice Architectures (Score:4, Interesting)
Re: (Score:2)
Actually, this is one of the better story summaries I've seen here.
I knew what it was talking about (no unexplained mayfly buzzwords), I knew who the protagonists were, and I knew what was at stake. The only implied innovation was one of personal chutzpah, against the backdrop of an organization notorious for taking all things in collective stride (these being very, very short strides).
Working at Facebook Soun [slashdot.org]
And nothing of value was lost (Score:4, Insightful)
Pity.
Fault-tolerant privacy invading data centers (Score:2, Funny)
Testing ideas (Score:2)
It would be very useful for Facebook to stop announcing ASN 32934 for a few centuries as an experiment just to see what would happen.
Or permanently remove authority records for facebook.com.
You know just to see what would happen.
Who's on first? (Score:2)
Months and months of planning went into the initial effort, ...
Into the take down or recovery? 'Cause the former just requires pulling a cable of some sort. :-) TFS says the team would take down a site and try to migrate the traffic, but wouldn't it be better if the disaster group and the recovery group were different teams for a "real world" stress test?
Google has been doing this for many, many years (Score:2)
There it's called "DiRT" (stands for "Disaster Recovery Testing").
Re: (Score:2)
https://www.youtube.com/watch?... [youtube.com]
Netflix Simian Army (Score:2)
Reminds me of "Chaos Monkey" from the Netflix Simian Army.
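For anyone who hasn't run into it, the basic idea is to kill a randomly chosen instance on purpose and check that the service survives. A toy sketch of one such round follows. This is not Netflix's actual code or API: the `chaos_round` function, the instance names, and the `min_healthy` threshold are all made up for illustration.

```python
import random


def chaos_round(instances, min_healthy, rng):
    """Terminate one randomly chosen instance (in simulation only).

    Returns (victim, survivors, ok), where `ok` says whether the service
    still has at least `min_healthy` instances left afterwards.
    """
    if not instances:
        return None, set(), False
    # Sort first so the choice depends only on the RNG, not on set ordering.
    victim = rng.choice(sorted(instances))
    survivors = set(instances) - {victim}
    return victim, survivors, len(survivors) >= min_healthy


# Hypothetical fleet of three web servers that needs two to stay healthy.
rng = random.Random(42)
fleet = {"web-1", "web-2", "web-3"}
victim, survivors, ok = chaos_round(fleet, min_healthy=2, rng=rng)
print(f"terminated {victim}; survivors={sorted(survivors)}; healthy={ok}")
```

Passing in a seeded `random.Random` keeps a drill reproducible, which is handy when you want to replay the exact failure that broke things.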
Playhouse (Score:1)