How an 'Unprecedented' Google Cloud Event Wiped Out a Major Customer's Account (arstechnica.com)
Ars Technica looks at what happened after Google's answer to Amazon's cloud service "accidentally deleted a giant customer account for no reason..."
"[A]ccording to UniSuper's incident log, downtime started May 2, and a full restoration of services didn't happen until May 15." UniSuper, an Australian pension fund that manages $135 billion worth of funds and has 647,000 members, had its entire account wiped out at Google Cloud, including all its backups that were stored on the service... UniSuper's website is now full of must-read admin nightmare fuel about how this all happened. First is a wild page posted on May 8 titled "A joint statement from UniSuper CEO Peter Chun, and Google Cloud CEO, Thomas Kurian...." Google Cloud is supposed to have safeguards that don't allow account deletion, but none of them worked apparently, and the only option was a restore from a separate cloud provider (shoutout to the hero at UniSuper who chose a multi-cloud solution)... The many stakeholders in the service meant service restoration wasn't just about restoring backups but also processing all the requests and payments that still needed to happen during the two weeks of downtime.
The second must-read document in this whole saga is the outage update page, which contains 12 statements as the cloud devs worked through this catastrophe. The first update is May 2 with the ominous statement, "You may be aware of a service disruption affecting UniSuper's systems...." Seven days after the outage, on May 9, we saw the first signs of life again for UniSuper. Logins started working for "online UniSuper accounts" (I think that only means the website), but the outage page noted that "account balances shown may not reflect transactions which have not yet been processed due to the outage...." May 13 is the first mention of the mobile app beginning to work again. This update noted that balances still weren't up to date and that "We are processing transactions as quickly as we can." The last update, on May 15, states, "UniSuper can confirm that all member-facing services have been fully restored, with our retirement calculators now available again."
The joint statement and the outage updates are still not a technical post-mortem of what happened, and it's unclear if we'll get one. Google PR confirmed in multiple places it signed off on the statement, but a great breakdown from software developer Daniel Compton points out that the statement is not just vague, it's also full of terminology that doesn't align with Google Cloud products. The imprecise language makes it seem like the statement was written entirely by UniSuper.
Thanks to long-time Slashdot reader swm for sharing the news.
Normal for Google (Score:5, Funny)
They're great at deleting products and services. No surprises here!
Nothing happens "for no reason". (Score:2, Insightful)
There ALWAYS is a reason things happen.
Not if cloud fails, it's when for your company (Score:3)
The chance of a business-destroying cloud failure is 100% given enough time for every business relying on the cloud.
All it takes is for Bank of America, Wells Fargo, Citigroup, Goldman Sachs, Travelers insurance, or any of a host of too-big-to-fail companies to fail due to the cloud.
If Bank of America cannot vouch that their counterparty hedging billions of low credit loans can pay, Bank of America will have to declare itself insolvent. And the 2008 financial crisis begins again.
The chain of cross-hedging bet
Did they leave a note? (Score:4, Funny)
Re: (Score:2)
This is true of all storage media. HDDs, SSDs, tapes, archival Bluray, cloud, SAN, everything.
It's the reason you have a backup and a recovery plan.
Re: (Score:1)
call me a conspiracy theorist (Score:2)
Re: (Score:2)
Yep, you're a conspiracy theorist.
If there were any suspicion on the part of UniSuper that it was intentional, they'd be suing Google for a lot more than actual damages. Those kinds of events *always* leave a trail.
Re:call me a conspiracy theorist (Score:4, Interesting)
I read elsewhere that because it was such a big and complex account there were some manual elements involved in the original account's creation. It was that manually set configuration that eventually led to this clusterfuck and why it was a "one off" event - such things aren't done or allowed anymore. So my guess is that some engineer years ago either bypassed or failed to set some flag(s) that the normal tools would configure.
Re:call me a conspiracy theorist (Score:5, Funny)
Forgot to set $dont_delete_big_customer = TRUE.
Re: (Score:2)
Re: (Score:2)
That explains it. A cosmic ray flipped the don't-be-evil bit.
Re: (Score:1)
Post restored from backup? (Score:3, Informative)
How an 'Unprecedented' Google Cloud Event Wiped Out a Major Customer's Account
Precedented [slashdot.org]
Re:Post restored from backup? (Score:5, Funny)
The Past is just another availability zone around here.
Temporal redundancy.
Re:Post restored from backup? (Score:4, Insightful)
This is not a dupe, but a follow-up. The post from May 11 obviously lacks info from May 15...
Re: (Score:2)
Posting several times is just a way to back up the flow. That way, if Slashdot were to lose a full day of stories, nothing of value would be lost.
But AI will fix it (Score:4, Funny)
Seriously though, I swear Google's corporate ADHD is getting... Oh look here's some AI!
Never (Score:1)
Never trust that your backup is secure on someone else's server!
Re: (Score:2)
There are no "backups on a server". What you have "on a server" is a local copy. A "backup" is a copy that is kept off-site, hopefully on a piece of media that is safe from accidental over-writing.
Re: (Score:1)
There are no "backups on a server". What you have "on a server" is a local copy. A "backup" is a copy that is kept off-site, hopefully on a piece of media that is safe from accidental over-writing.
There is one and only one single requirement that differentiates a backup from a copy.
That is a backup taken in the past cannot ever be affected by changes to the source in the present or future.
If you make separate copies every day, those copies are not copies but backups.
If you make one copy that is overwritten, for any reason, but especially by being replaced, that is a copy, not a backup.
The device copying the bits around does not matter. A server can make backups just the same as a desktop, laptop, or
Re: (Score:1)
better to happen to someone big (Score:5, Insightful)
Kinda glad this happened to someone with enough weight to throw around to make Google care.
Imagine if they did this to your small business. Good luck getting a joint statement from the CEO then.
Re: (Score:3)
Yep. And somebody that is not willing to lie for them. Although I notice this report comes _really_ late here. I read about it in European IT media a week ago or more.
Re: (Score:2)
Ah, just saw this is a really late dupe.
Re:better to happen to someone big (Score:5, Insightful)
Kinda glad this happened to someone with enough weight to throw around to make Google care.
Imagine if they did this to your small business. Good luck getting a joint statement from the CEO then.
It probably *does* happen to smaller customers, but we don't hear about those incidents.
Re: (Score:2)
Especially given Google's default support policy: "Talk to the hand".
Re: better to happen to someone big (Score:2)
Re: better to happen to someone big (Score:4, Insightful)
A backup is not a backup until you attempt (and succeed at) a restore
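For anyone who wants to turn that into a habit, here is a minimal Python sketch of a restore test: unpack the archive into a scratch directory and compare SHA-256 digests against the live data. The names `tree_digests`, `verify_restore`, and `demo` are made up for illustration, and the tiny `data` directory stands in for a real dataset. It is a sketch under those assumptions, not a description of what UniSuper or Google actually did.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to source."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return tree_digests(Path(scratch)) == tree_digests(source)


def demo() -> tuple[bool, bool]:
    """Build a toy dataset, back it up, and run the restore test twice."""
    with tempfile.TemporaryDirectory() as work_dir:
        work = Path(work_dir)
        source = work / "data"  # hypothetical stand-in for real data
        (source / "sub").mkdir(parents=True)
        (source / "a.txt").write_text("balance=100\n")
        (source / "sub" / "b.txt").write_text("balance=200\n")

        # Create backup.tar.gz containing the contents of `source`.
        archive = Path(
            shutil.make_archive(str(work / "backup"), "gztar", root_dir=source)
        )

        matches_fresh = verify_restore(source, archive)

        # Change the live data; the old backup should no longer match it.
        (source / "a.txt").write_text("balance=999\n")
        matches_after_change = verify_restore(source, archive)

        return matches_fresh, matches_after_change


if __name__ == "__main__":
    print(demo())
```

A real check would restore onto separate hardware (or a separate provider) and compare against digests recorded at backup time, since live data keeps changing after the backup is taken.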
Re: (Score:2)
"until you attempt (and succeed at) a restore"
Which is what happened here, thanks to cloud backup.
The Cloud = somebody else's computer (Score:4, Interesting)
And if they do crappy stuff and automate things that should never be automated, arbitrary things can happen to your data and infrastructure. All of it vanishing without warning, for example.
What happens when you connect a high trust system (Score:5, Insightful)
Their entire business model is based on an outdated and predatory view on data collection. They quite simply cannot be trusted. Their rap sheet includes 43 offenses and $2 billion in fines: https://violationtracker.goodj... [goodjobsfirst.org]
Industry-darling tech bloggers keep having tiny conversations about specific products failing at specific times instead of having the real conversation. Corporate structuring and liability shielding have removed any incentive for Google to run an honest and secure business. The security flaw isn't in the software or middle management. Google corporate leadership is a low trust system. Any resource, platform or product connected to Google's board of directors is going to be insecure. That's their job.
The world is broken (Score:1)
Since when do you have to get permission from a company to tell the world about them fucking you over?
Re: (Score:2)
Since you signed an NDA.
ALWAYS remember when speaking to clients; (Score:4, Interesting)
Replace the term "Cloud" with "Other People's Computers". This cures many of the worst impulses C-level mooks have about cloud storage magically solving all problems.
Re: (Score:3)
The cloud is what saved them. They had an off-site backup at a different cloud provider.
If they were working out of an on-premises data center and it failed, they still would have needed an off-site backup to restore from.
Re: (Score:2)
True, but on the other hand, they probably wouldn't have accidentally deleted the data center.
Re: (Score:2)
Replace the term "Cloud" with "Other People's Computers".
Heaven sounds like a weird place now.
Download all your monthly statements (Score:2)
I try to make it a habit to download all my bank, brokerage, etc. monthly statements. It's classic recordkeeping just like in the days of putting paper statements in file folders. Those are your records and evidence if and when something like this happens and you need to prove what's yours.
Re: (Score:2)
Re: (Score:2)
Completely agree with you Mean Variance !!
There's a particularly annoying [to me at least] advert running in the UK at present.
It features an office type staring at a PC alone in the office after hours - (a stereotype of a working all hours banker type) who hears a colleague using a shredder. He leaps up saying "Don't do it, that document's the only proof we've got!" to which his mate replies "We don't need this, everyone's gone paperless these days". The advert finishes with a voice over that TV Licences ne
US Government Copy (Score:2)
Doesn't the US keep a copy of all Google data?
The pension fund should ask the US for the missing data.
The wrong lesson (Score:2, Interesting)
Know what the difference is between this and an internally-run data center failing and taking its local backup with it?
Nothing.
There's a lot of smooth-brains that will squawk that "other people's computer" line, which is by far the dumbest thing jwz has ever said.
The reality is that having your working copy of your data in the cloud is no more or less dangerous than having your working copy on-premises. The danger is having all of your copies in one location.
Note that in this scenario there were backups at
Re:The wrong lesson (Score:4, Interesting)
There is one fundamental difference and that's the management of the server configuration and the data it controls. On premises it's your own staff; on cloud it's not Google staff, it's a few thousand lines of Python and YAML or whatever. Someone makes a change, tests it against their test data, gets all the green OKs and pushes it to production. And it promptly screws over a handful of customers whose systems need special attention and processes not properly documented.
I had this happen to me twice. We have very simple requirements, defaults are usually good enough, but we have the problem that once it's set up we rarely have to talk to the hosting company. It just keeps on trucking. And so nobody is looking at all that old configuration data. And it's all good until their IaaS platform needs updating and we find out that our defaults are not the current defaults and oh shit, your config is hosed.
This is the reason I want to get away from a managed service. Because their management is concerned about different things to our management and move-fast-and-break-things is at odds with our do-it-once-and-use-it-forever approach.
Oooppppsssiiieeee! (Score:2)
And then Google... (Score:3)
And then Google posted that "We take our customer's data seriously and value their patronage..." in a cue from the Microsoft playbook...
The corporate bafflegab for "Whoops! But it's all your fault..."
JoshK.
Re: (Score:2)
And then Google posted that "We take our customer's data seriously and value their patronage..." in a cue from the Microsoft playbook...
I think there's some punctuation missing:
"We take our customer's data, seriously - and value, their patronage..."
Maybe tack a Zuckerbergian "dumb fucks..." at the end...
Re: (Score:2)
Quite. Your point is well taken, indeed.
One might use a LISP (lith-p) and say "th-eriously"
"We take our customer's data, th-eriously - and value, their patronage..."
Yes, good addendum, the mother-zucker comment, or the Enron "asshole" while the mic is still on...
JoshK.
Safeguards that did nothing (Score:1)
How do you screw that up? Deleting everything permanently? I'm glad the company gave a shit about a real DR plan and used a secondary cloud provider.