Google Cloud Explains How It Accidentally Deleted a Customer Account (arstechnica.com)
Google Cloud faced a major setback earlier this month when it accidentally deleted the account of UniSuper, an Australian pension fund managing $135 billion in assets, causing a two-week outage for its 647,000 members. Google Cloud has since completed an internal review of the incident and published a blog post detailing the findings. ArsTechnica: Google has a "TL;DR" at the top of the post, and it sounds like a Google employee got an input wrong.
"During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer's GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again."
"During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer's GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period. The incident trigger and the downstream system behavior have both been corrected to ensure that this cannot happen again."
parameter blank = Automatic deletion period? (Score:3)
Also no popups saying you are near the end of a period and data will be lost.
Re: (Score:2)
Yeah pretty dumb, right?
This is an illustration of why the job of a Business Analyst or Product Owner is so important, and so underrated. In many ways, the design of the workflow is more difficult to get right than the code.
Re: (Score:2)
It really is both, and both need somebody competent with enough time to get it right. The main problem on the coder side is that they are often not very good, and here we have a nice example of a borked, unsafe default value. The main problem on the process person side is that this person may actually be missing. A coder designing a process is an accident waiting to happen. Oh, and look, it did.
Re: (Score:2)
Also no popups saying you are near the end of a period and data will be lost.
Or saving account data for a period of time after it is deleted, in case of ... oops. Not to mention someone who might want to pay to recover an account?
Re: (Score:2)
Considering that when you fill in a registration form online for whatever, missing a field won't let you proceed and flags that particular field, it's interesting that a behemoth company like Google doesn't/didn't have something similar in place when creating accounts. In hindsight, a standardized form the person creating the account/VM has to complete to get everything right would seem to be in order.
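As a minimal sketch of the kind of required-field check the parent describes, applied to a provisioning form rather than a signup page (the field names are made up for illustration):

```python
# Minimal sketch of required-field validation for an internal provisioning form.
# The field names are invented for illustration.
REQUIRED_FIELDS = ["customer_name", "region", "deletion_policy", "term_days"]

def missing_fields(form: dict) -> list[str]:
    """Return the required fields that are blank or absent."""
    return [f for f in REQUIRED_FIELDS if not str(form.get(f, "")).strip()]

form = {"customer_name": "Example Fund", "region": "australia-southeast1",
        "deletion_policy": "", "term_days": ""}
problems = missing_fields(form)
if problems:
    # Block submission and flag the offending fields, just like a web signup form.
    print("Cannot provision; please fill in:", ", ".join(problems))
```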
No data is safe (Score:2)
Re:No data is safe (Score:5, Interesting)
On-prem is likewise one "accident" away from deletion.
I once worked for a company that had all its servers on-prem. The IT staff had a habit of referencing every machine by its IP address, partly due to the way their infrastructure management tools worked, and partly because they didn't want to mess with DNS for everything. One of the IT guys accidentally formatted the hard drive of a production server, because he mis-typed the IP address.
With cloud, at least, the job of managing infrastructure is the CORE of what they do. It's a profit center instead of a cost center. This motivates them to put more energy into getting it right than a company with on-prem servers, which is motivated to go as cheap as possible on both equipment and staffing.
Personally, I'd trust the cloud more than the IT guy in the back room.
Re: No data is safe (Score:4, Insightful)
Re: (Score:1)
Do you want to start comparing the number of times an incident like this happened, versus the number of times an on-site system was borked?
Re: (Score:1)
Quite the contrary.
Cloud providers will have incidents, sure. But if they want to stay in business, they will respond by building additional safeguards to prevent such incidents from happening in the future. They are motivated by the threat of lost customers, to make these kinds of corrections.
By contrast, when an on-prem IT department makes a boneheaded mistake and tells the CFO they need to spend 6 weeks of programming time to make sure the mistake never happens again, the CFO is going to say "Sorry, that
Re: (Score:1)
Re: (Score:2)
There are always counter-examples. But when you look at probabilities, it's more likely that a cloud company will invest in good redundancy and backups than, say, a retail chain that has a 6-person IT team.
Re: (Score:3)
Another advantage of the cloud is that issues like this, where an unusual combination of factors can lead to disaster, are much more likely to happen to someone else and be corrected before they affect you. In-house IT systems are often quite brittle at the best of times.
These guys did the right thing. They had a cloud, they had a backup, and they had a plan. They didn't rely on any single one of them: no single points of failure.
Re: (Score:3)
Profit motive doesn't affect bottom-rung workers. Being "cloud" doesn't change that; they're still bottom-rung people. The ones making the profit may want better oversight or protections, but not if it affects profits. The profit motive will motivate cost cutting; it rarely motivates investments that seem unnecessary. Therefore the grunts get paid grunt salaries, are overworked, are kept in the dark, etc. If all it takes is one worker to screw things up, then it's not secure.
The slapdash local IT guy mig
Re: (Score:2)
but if there are full backups of everything
You have a lot of faith in that slapdash local IT guy.
The thing about backups is, unless you actually test a recovery, you don't know if your backups will work in an emergency. How many slapdash local IT guys bother to fully test a backup restore?
Re: (Score:2)
When it's one guy, you're right. But when it's a team, and you see the off-site data storage truck coming back twice a week, and you can get backups recovered when you ask for them, then I think they're doing it right.
These days though that may be obsolete. Modern IT where I am now is "just use OneDrive", which is not a suitable professional backup solution.
Re: (Score:2)
Your off-site data storage truck may be able to successfully recover the data that you have provided to its storage units. But the point is, you don't know whether your backup routines are comprehensive enough until you actually do a restore from nothing and run your systems from the backup. In real life, when you do such a restore test, you inevitably find something your process missed. Your on-prem guy--or team--likely doesn't do that kind of testing.
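A restore drill can be as simple as the following sketch: pull the latest backup into a scratch location, bring it up, and check the data is actually usable. The paths, archive layout, and sanity query are assumptions for illustration.

```python
# Hypothetical restore drill: prove the backup can actually be restored and used,
# not just that the backup job ran. Paths and the sanity check are illustrative.
import pathlib
import sqlite3
import subprocess
import tempfile

BACKUP = pathlib.Path("/backups/app-latest.tar.gz")

def restore_and_verify() -> bool:
    with tempfile.TemporaryDirectory() as scratch:
        # 1. Restore into a scratch directory, never over production.
        subprocess.run(["tar", "-xzf", str(BACKUP), "-C", scratch], check=True)
        # 2. Sanity-check the restored data (here, a SQLite file with recent rows;
        #    a real drill would start the application against the restored data).
        db_path = pathlib.Path(scratch) / "app.db"
        conn = sqlite3.connect(db_path)
        try:
            (count,) = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()
        finally:
            conn.close()
        return count > 0

if __name__ == "__main__":
    print("restore drill passed" if restore_and_verify() else "restore drill FAILED")
```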
Re: (Score:2)
That slapdash local IT guy is the one who came up with a backup for the incident in TFA, as part of a multi-cloud strategy.
Re: (Score:2)
The incident in TFA was a very large organization that clearly had more than just a local "IT Guy." This company was *not* your typical on-prem (or cloud) setup. Most companies have a relatively small IT department, like just a handful of people, and they are very reluctant to spend money to "do IT right."
Re: (Score:2)
And yet they come up with the money for a cloud provider that also won't do backups?
Re: (Score:2)
Two things.
Yes. They *will* come up with money for the cloud provider because it's a monthly recurring cost, which is treated very differently than a one-time capital expenditure. It is very common to be less sensitive to a monthly "rental" fee that stays mostly the same every month than to spending one person's time for a few weeks to set up a proper backup regimen.
And yes, cloud providers DO do backups. That is one of the benefits of cloud: automatic backups that you don't have to think about. The ar
Re: (Score:2)
Being willing to overspend on rental rather than spend much less on owning something is a description of the problem, not a justification. It makes just as much sense as people paying many times more on rent-to-own stuff (that's worn out by the time they own it) for their apartment. It's the epitome of bad financial decisions leading to poverty. I have done analyses where the cost crossover point is about 1 quarter.
This "one off" is hardly the first time I have heard of lost data with no/useless backups fr
Re: (Score:2)
I'd argue that the "willing to spend" on rentals, is a solution to the problem that companies are "not willing to spend" on on-prem. It's a psychological problem, and if we can solve the psychological problem by going to a monthly subscription, that's great, it's still a solution!
Speaking of renting and owning, common advice is to NOT pay down your mortgage faster than you have to, because you can invest that money instead, for higher returns. But psychologically, people don't actually invest that "extra" m
Re: (Score:2)
It's a "solution" in the same sense that renting a canoe is a solution to being nervous about a trans-Atlantic flight.
The advice about the mortgage is true enough if you locked in a low interest rate.
Re: (Score:2)
I don't follow your canoe analogy, I see no way in which it applies to cloud vs. on-prem. Cloud solutions are *way* more powerful and versatile than on-prem solutions. So I'd say it's more like going on a transatlantic cruise because you don't like to fly.
Re: (Score:2)
Perhaps not a canoe, but certainly a much more expensive (in time or money) option. Cloud isn't really more powerful or versatile provided you have adequately provisioned on-prem. You can often set things up to use some cloud resources as overflow capacity if you have a spiky load. It should be possible to compute the appropriate crossover point for best cost effectiveness. For internal use resources, you can greatly improve reliability by not depending on an ISP for connectivity to the internal servers.
Re: (Score:2)
So what you're saying is, cloud isn't more powerful, except for the things that cloud does better, like overflow capacity? Got it.
Some specific ways cloud is more powerful:
- Geo-redundancy. The typical cloud pattern for geo-redundancy is that your data is mirrored to two separate data centers in one region, and to a third data center in an entirely different region. This protects against a regional natural disaster or outage.
- On-demand scaling. If you need 100 servers most of the year, and 1,000 servers du
Re: (Score:2)
I am saying that there are specific niches where the cloud makes sense. I am also saying way too many shops put way more than that in the cloud to their detriment.
A 500 user site sounds like used Dell server off of ebay territory. If you live in a studio apartment and access the internet over cellular using your phone, cloud makes sense for that. But if you already need a business internet connection, get the used server and put it in a corner somewhere.
Backup options depend on need. As simple as rsyncing t
Re: (Score:2)
Yes, a 500 user site is small. In a corporation of any size, the vast majority of the servers it uses are even smaller. Yet because of the logistics of on-prem infrastructure, they have to provision an entire machine for each, or at least a virtual machine, either of which costs a lot more than a similar low-power virtual machine in the cloud.
No, I've checked, there is no comparable on-prem option that would pay for itself in less than 10 years, by which time I'd have to start all over again. You've got t
Re: (Score:2)
MariaDB or PostgreSQL don't have a license cost. Nor does KVM for the virtualization environment. If you have a business office, you already have to have internet connectivity. Depending on where your business office is, you may already have dual grid connectivity. If you made the arrangement with another business mentioned earlier, you just fail over if the power goes out.
Re: (Score:2)
Yes, I know there are open source databases that are "free." That's beside the point. You can use those both on-prem and in the cloud. That's not something that distinguishes one from the other. It's not a fair comparison to say "On-prem with open source is cheaper than cloud with Microsoft stack." That may be true, but it's not relevant. The point is, with cloud, you can slice and dice licenses that you would otherwise have to pay a lot more for.
And your business office doesn't have dual power utilities, t
Re: (Score:2)
You're the one that brought up licensing costs. I just pointed out that there's no good reason that cost can't be zero. The cloud provider won't build your website for you. They won't select which software will be best suited for internal use. They certainly won't set up the LAN in the office or the VPN gateway if you do WFH. One way or another, you're going to be bringing in IT people.
Isn't it a bit odd that in a discussion about the cost of servers vs. cloud that you would claim cost isn't relevant?
It rea
Re: (Score:2)
I never said cost isn't relevant. What I said was, cost doesn't have to be higher for cloud hosting, if managed correctly. My discussion of licenses was an example of how costs can be managed, not a comparison of open source vs. commercial licenses. Businesses make their choice of stack (LAMP vs. MS typically) separately from their decision of on-prem vs. cloud. If they are using MS technology, they aren't going to switch to MariaDB just to save money on SQL Server. The cost of the reprogr
Re: (Score:2)
Actually, I have done failover. I have also done analysis to determine actual failover requirements rather than just assuming that whatever whizz-bang, hyper-expensive solution advertised in Golf Digest is best for a particular purpose.
Also note that expenses and size of IT operation between a global corporation and a small business are VASTLY different. If the corporation is large enough, it's cheaper to build their own cloud than it is to rent someone else's.
Note that most cloud services will not fail over t
Re: (Score:2)
If your idea of a database failover is restoring a VM snapshot, then no, you haven't done a database failover. VM snapshots are not suitable for a database failover, for any but the most trivial of applications. https://www.idera.com/resource... [idera.com]
Re: (Score:2)
So now you're stuffing a strawman in my mouth and then cutting it down?
Actually, I was thinking replication/mirroring for the database and spinning up the application server's VM.
Re: (Score:2)
I was reacting to your quote:
As for fail over, that's just a matter of changing a few DNS records and activating the VM on another server somewhere
This over-simplifies the failover process, especially when databases are involved (which they almost certainly will be).
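As a rough sketch of the steps that sit behind "just change a few DNS records", here is a toy failover outline. The classes are stand-ins for whatever your database and DNS tooling actually provide, so treat it as an outline rather than a runbook.

```python
# Toy failover outline: the steps beyond "change a DNS record and activate a VM".
# The classes are stand-ins for real database/DNS tooling.

MAX_LAG_SECONDS = 5

class Primary:
    def __init__(self):
        self.read_only = False

    def fence(self):
        # Stop the old primary accepting writes (split-brain protection).
        self.read_only = True

class Replica:
    def __init__(self, address: str, lag_seconds: float):
        self.address = address
        self.lag_seconds = lag_seconds
        self.is_primary = False

    def promote(self):
        self.is_primary = True

def fail_over(primary: Primary, replica: Replica, dns: dict) -> None:
    # 1. Don't promote a replica that is far behind; that silently loses data.
    if replica.lag_seconds > MAX_LAG_SECONDS:
        raise RuntimeError(f"replica is {replica.lag_seconds}s behind the primary")
    # 2. Fence the old primary.
    primary.fence()
    # 3. Promote the replica to be the new primary.
    replica.promote()
    # 4. Only now repoint clients (and keep the DNS TTL low ahead of time).
    dns["db.example.internal"] = replica.address

primary, replica, dns = Primary(), Replica("10.0.0.7", lag_seconds=2), {}
fail_over(primary, replica, dns)
print("clients now resolve to", dns["db.example.internal"])
```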
Re: (Score:2)
If you want me to present you with a detailed failover plan, I'll need specifics, your signature on an engagement contract, and your up-front payment.
Re: (Score:2)
Good one!
Re: (Score:2)
If I'm hiring somebody to set up a database failover, I'll hire somebody who can describe to me how one works. The answer I'm looking for has nothing to do with VM snapshots, I assure you.
Re: (Score:2)
I presume the server was re-loaded from backup?
so, I have questions not addressed in this summary (Score:2, Interesting)
2) did the customer lose anything by having its account zeroed here?
3) If 1 or 2 or both above are true, has Google received a fine for this?
4) If either 1 or 2 or both are true, has Google given the company back everything the company or its clients have lost based on Google's actions in this?
5) has Google defined the changes in its processes to ensure something like this can never happen again, and published them so that customers can have some
Re:so, I have questions not addressed in this summ (Score:5, Informative)
They lost at least a couple days worth of data because they were at least smart enough to have a backup with another provider but the backup was a few days older than the outage start time. But more importantly, they lost 2 weeks of uptime.
The lesson to me is not just to have a backup, but to have a disaster recovery plan and an idea of how to continue operations during a restoration. Two weeks is too long to get back up and running.
But they're also the ideal case to not have to worry about losing customers over this. You can't just pick up your pension and move that easily. And by the time this is over and done, they're more prepared than the next company that has never had to deal with it.
Google already wastes so much money on YouTube storage; they can handle delaying actual deletes by several days, so large customers would have time to notice a mistake and undo it. They would have been up and running so quickly if it had just been a call to Google to correct the misconfiguration and re-link the storage.
The Full Timeline [unisuper.com.au]
Re: (Score:1)
Additional note: Google claims responsibility, but it's probably the pension company itself that put the wrong value in the parameters. Google isn't going to throw them under the bus and lose them as a client but they probably should have had a process in place to ensure that people are paying attention during provisioning.
Re: (Score:1)
Re: (Score:1)
> Additional note: Google claims responsibility, but it's probably the pension company itself that put the wrong value in the parameters
Maybe read the TLDR
"there was an inadvertent misconfiguration of the GCVE service by Google operators"
Easier to victim blame than read a few sentences I guess.
Re: (Score:2)
Re: (Score:3)
It was only data for the customer-facing web frontend that was lost. They still had up-to-date transaction data in their internal database.
Actually, you can in Australia. There's a straightforward process for transferring from one licensed fund to another.
Re: (Score:2)
Actually, you can in Australia. There's a straightforward process for transferring from one licensed fund to another.
I assume this wouldn't be possible until after the outage is over. Especially if there were a lot of requests all at once.
At that point, they would be the provider who has already made it through and learned a lesson. It wouldn't necessarily be smart to jump over to someone who is equally unprepared.
Re: (Score:3)
Flawed design (Score:3, Interesting)
Why would you automate a system to destroy data? Maybe "flag for deletion after taking it offline for X amount of time, to be reviewed by a real person before we actually whack the data"
Oh yeah, it's Google; they don't like to include real people in the loop.
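For what it's worth, the "flag it, take it offline, and make a human sign off before the data is destroyed" pattern is simple to express. A sketch, with the grace period and approval step as assumptions:

```python
# Sketch of "flag for deletion, take offline, review before destroying".
# The grace period and the human-approval step are assumptions for illustration.
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=30)

class Resource:
    def __init__(self, name: str):
        self.name = name
        self.online = True
        self.delete_requested_at = None
        self.reviewed_by = None

    def request_delete(self):
        # Take it offline immediately, but keep the data intact.
        self.online = False
        self.delete_requested_at = datetime.now(timezone.utc)

    def approve_purge(self, reviewer: str):
        # A real person signs off before anything is destroyed.
        self.reviewed_by = reviewer

    def can_purge(self) -> bool:
        if self.delete_requested_at is None or self.reviewed_by is None:
            return False  # no deletion request, or no human has reviewed it
        return datetime.now(timezone.utc) - self.delete_requested_at >= GRACE_PERIOD

r = Resource("gcve-private-cloud")
r.request_delete()
print(r.can_purge())  # False: nobody has reviewed it and the clock hasn't run out
```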
Computer algorithms vs. actuation (Score:2)
This incident is related to the discussion about whether AI might do something disastrous, like enslave or obliterate humanity. The problem is not the computer system but the humans that allow computer systems to automatically do things that might be disastrous. Why would a large account be automatically deleted without requiring a human to manually approve the deletion? Seems like there is no downside to requiring a few minutes from a human to approve this type of drastic action.
In a way, this reminds o
Hate on Google... but not on this (Score:5, Insightful)
There's a lot of Hate On Google because, well, they collect a lot of data, feed it back to you, and sometimes other people as well.
BUT
In this one case, their transparency in explaining EXACTLY what happened, WHY it happened, HOW it won't happen again, and how they've improved their process is awesome.
Hate on them all you want. (I personally don't hate on them but I try to give them as little data as I can to still get useful stuff back from them) but if OTHER COMPANIES (I'm talking to you, Verizon, AT&T, FAA Aeromedical Division, Hyundai, Chase, Spotify, etc.) would do the same that would be awesome.
In case you think I'm picking on these entities I've named, that's just what came to [my] mind first. Imagine if ALL COMPANIES WERE HONEST WITH THEIR CUSTOMERS and TRANSPARENT IN POST INCIDENT REPORTS then the outliers would be even more lit up.
Light em up.
E
Re:Hate on Google... but not on this (Score:4, Insightful)
Really?
In this one case, their transparency in explaining EXACTLY what happened, WHY it happened, HOW it won't happen again, and how they've improved their process is awesome.
No, it's not. This was a very public faceplant for Google. They had no choice but to be transparent, because doing otherwise would eventually cause every sizable business to migrate away.
Don't mistake corporate desperation for honorable intentions. This was a catastrophic failure of Google Cloud.
Re: (Score:3)
Also, "using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer's GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period"
This seems a bit clown world, and while I believe the exact same thing won't happen again, what about the other clown world bugs that must be in there too? Not only should it not have happened in t
Re: (Score:3)
BUT
In this one case, their transparency in explaining EXACTLY what happened, WHY it happened, HOW it won't happen again, and how they've improved their process is awesome.
You have never been through such post-incident reviews at any large company, have you? If you had, you would see through the blame-the-frontline-guy guise of Google's supposed "review".
From the summary:
During the initial deployment of a Google Cloud VMware Engine (GCVE) Private Cloud for the customer using an internal tool, there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting
Re: (Score:2)
there was an inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period.
This leaves a lot of questions. Mostly along the lines of, "How bad is their code actually?" and "Have they considered a full code audit?" Because all signs point to "their code sucks badly."
I would also like to know if they had unit tests, but that's just a personal preference.
Re: (Score:2)
What we have learned... (Score:4, Insightful)
1. Google has an inept backup system, since it took a week to recover. This should have been a few hours tops, a few minutes if properly done.
2. Google doesn't sanity-check any settings.
3. The customer had a backup system on another provider but was clearly unable to fail over to it over the entire period, which means this wasn't properly designed or tested.
4. Google doesn't provide warnings.
5. Google's delete policy is extremely aggressive.
6. The customer had failed to research Google's policies properly.
7. Signing off was done with a poor understanding of the system.
8. Neither side had comprehensive monitoring.
9. That users were concerned about a cyber attack shows the customer had failed to keep users properly informed, creating a panic unnecessarily.
10. Cloud providers don't necessarily have any more expertise than the customers and may choose to go cheap in the hope of few accidents.
Re: (Score:3)
The backup came from the pension fund storing a backup someplace else. As I understand it, Google had no backup.
Re:What we have learned... (Score:5, Informative)
Nope, UniSuper restored the data from their own backup (which happened to be on another cloud provider). Google didn't restore the data.
I'm not sure the backup was actually intended for hot failover, or if it was just intended to be used to restore from in the event of an outage like this.
I have a UniSuper account. I did get timely notifications. Later notifications contained more detail.
People were probably concerned about a cyber attack because of relatively recent cyber attacks against Optus, Latitude, etc. in which customer data was compromised. Latitude in particular were less than forthcoming about the extent of the breach. I can see why people would assume the worst.
Re: (Score:2)
I like the list, very true.
I guess there's something to be said for asking naive questions: how are the critical and dangerous operations like sharing, deleting, moving, etc. being safeguarded? But how would any company get to inspect that with an org the size of Google?
People talk about how cloud providers must be better than on-prem, at least in terms of skills, but maybe the biggest issue is simply that, at least with on-prem, you can inspect things thoroughly yourself (if you're willing to do so).
Re: (Score:3)
1. Google has an inept backup system, since it took a week to recover. This should have been a few hours tops, a few minutes if properly done.
Worse, Google didn't have ANY backup system. Google cannot recover the data, the customer recovered from their own backup.
10. Cloud providers don't necessarily have any more expertise than the customers and may choose to go cheap in the hope of few accidents.
This is the key takeaway. In-house IT teams have been saying that since the beginning of the Cloud fad; management just plugs their ears after calculating how much they can pocket from the huge "savings" of moving to the Cloud, betting they won't be there when the shit hits the fan.
Re: (Score:2)
1. Google has an inept backup system, since it took a week to recover. This should have been a few hours tops, a few minutes if properly done.
Worse, Google didn't have ANY backup system. Google cannot recover the data, the customer recovered from their own backup.
Google has extensive backups... but also has a way to be sure that when data is deleted it is permanently and completely gone, including from all backups.
Unfortunately, if you have a good system for ensuring that data is completely destroyed you need to be really, really sure you want the data gone before you hit the delete button. Probably this should never be done without positive confirmation from the client, and after a delay period to give the client an opportunity to change their mind.
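A sketch of that positive-confirmation idea, along the lines of "type the exact resource name and acknowledge this is irreversible" before a purge is allowed (the names are hypothetical):

```python
# Sketch of positive confirmation before permanent deletion: the customer must type
# the exact resource name and acknowledge irreversibility. Hypothetical names.
def confirm_permanent_delete(resource_name: str, typed_confirmation: str,
                             customer_acknowledged: bool) -> bool:
    """Allow the purge only with an explicit, matching confirmation."""
    if not customer_acknowledged:
        return False
    return typed_confirmation.strip() == resource_name

assert not confirm_permanent_delete("example-gcve", "", customer_acknowledged=False)
assert confirm_permanent_delete("example-gcve", "example-gcve", customer_acknowledged=True)
```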
We are safe - We got our data in the cloud (Score:1)
All I want to know is (Score:2)
I'm sure outsourcing a bunch (Score:2)
And if you're frantically typing "this was before that!"... this was before the *announcements*, if you think they weren't already doing it before you heard about it in the news then... oh you sweet summer child...
Fun fact, tech companies are rapidly outsourcing to the UK & Germany because American healthcare costs have gotten so high we're just not worth it.
And this isn't new, it's just accelerating. Years ago I knew a gal whose job was to tell Amer
Re: (Score:2)
First, I congratulate you on your ability to bring politics into absolutely any subject at all...
Second, I have to wonder at your great faith in Medicaid for All. Note that that was the original name for that proposed program.
About 10 years ago I had a medical issue. I won't go into the details. This issue was chronic and lasted for 5+ years. I had a few co-morbidities including the predictable ones for slashdotters.
I also had Silicon Valley [and later Boston] white-collar medical insurance.
I also had an ac
Re: (Score:2)
Yes, I know that Medicaid and Medicare are not the same things... but my intended statement was "they called it Medicaid for All, then rebranded it w/o changing anything else". Those proposals for decreasing the Medicare eligibility age would cost a lot more than the original [2018ish] Medicaid for All proposals.
And yes, I get that I was fortunate to have a good job. That doesn't mean that I'd be *just as well off* if M4A was the law of the land... if nothing else my deductible and out-of-pocket-maximum wou
Re: (Score:1)
It occurs to me that Medicaid having a deductible may make no sense whatsoever.
VMWare? (Score:2)
Any word on whether VMWare license extort^W changes were upstream of Google's maintenance work?
So... (Score:2)
...they just added field validation to a form field somewhere?
Moral of the story? (Score:1)
Run your own servers.
Putting things in the cloud just takes them out of your control.
Your data is safe with us. LOL! (Score:2)