Resilience Testing at Amazon, Etsy, and Google

CowboyRobot writes: "Kripa Krishnan and Tom Limoncelli at Google have a detailed look into Google's GameDay resilience exercise, which they call DiRT (Disaster Recovery Testing). In related pieces, Etsy's John Allspaw makes the case for resilience testing, and the three continue with a roundtable discussion with Amazon's Jesse Robbins on lessons learned from these kinds of exercises.

Among other insights and anecdotes: 'We simulated a long-term power outage at a data center. This test challenged the facility to run on backup generator power for an extended period, which in turn required the purchase of considerable amounts of diesel fuel without access to the usual chain of approvers at HQ. We expected someone in the facility to invoke our documented emergency spending process, but since they didn't know where that was, the test takers creatively found an employee who offered to put the entire six-digit charge on his personal credit card. Copious documentation on how something should work doesn't mean anyone will use it, or that it will work if they do. The only way to make sure is through testing.'"

