Facebook Engineers: We Have No Idea Where We Keep All Your Personal Data (theintercept.com) 69
An anonymous reader quotes a report from The Intercept: In March, two veteran Facebook engineers found themselves grilled about the company's sprawling data collection operations in a hearing for the ongoing lawsuit over the mishandling of private user information stemming from the Cambridge Analytica scandal. The hearing, a transcript of which was recently unsealed (PDF), was aimed at resolving one crucial issue: What information, precisely, does Facebook store about us, and where is it? The engineers' response will come as little relief to those concerned with the company's stewardship of billions of digitized lives: They don't know.
The admissions occurred during a hearing with special master Daniel Garrie, a court-appointed subject-matter expert tasked with resolving a disclosure impasse. Garrie was attempting to get the company to provide an exhaustive, definitive accounting of where personal data might be stored in some 55 Facebook subsystems. Both veteran Facebook engineers, who according to LinkedIn have two decades of experience between them, struggled to even venture what may be stored in Facebook's subsystems. "I'm just trying to understand at the most basic level from this list what we're looking at," Garrie asked. "I don't believe there's a single person that exists who could answer that question," replied Eugene Zarashaw, a Facebook engineering director. "It would take a significant team effort to even be able to answer that question." When asked about how Facebook might track down every bit of data associated with a given user account, Zarashaw was stumped again: "It would take multiple teams on the ad side to track down exactly the -- where the data flows. I would be surprised if there's even a single person that can answer that narrow question conclusively." [...]
Facebook's stonewalling has been revealing on its own, providing variations on the same theme: It has amassed so much data on so many billions of people and organized it so confusingly that full transparency is impossible on a technical level. In the March 2022 hearing, Zarashaw and Steven Elia, a software engineering manager, described Facebook as a data-processing apparatus so complex that it defies understanding from within. The hearing amounted to two high-ranking engineers at one of the most powerful and resource-flush engineering outfits in history describing their product as an unknowable machine. The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. "Someone must have a diagram that says this is where this data is stored," he said, according to the transcript. Zarashaw responded: "We have a somewhat strange engineering culture compared to most where we don't generate a lot of artifacts during the engineering process. Effectively the code is its own design document often." He quickly added, "For what it's worth, this is terrifying to me when I first joined as well."
"the high code quality" (Score:2)
ugh.. no design documents and no documentation on the data flows.. ugh..
Now I understand Mark Zuckerberg's likening being Meta CEO to being punched in the stomach every morning.
I would feel like that too if I were in an organization that managed a large code base without documentation..
Re: (Score:3, Insightful)
Re: (Score:1)
They changed their name, but always remember: if it's Zuck's, it's fucked
Re: (Score:2, Informative)
He/She wasn't modded to zero, IT STARTED AT ZERO!
I wish people would stop complaining about the mod system if they DON'T UNDERSTAND HOW THE DAMN THING WORKS TO BEGIN WITH!!
All you have to do is click on the word "Score" to see the starting points and mod adjustments that result in the current score. If that option isn't available in the mobile view, then stop complaining about the score of a post whose mod history you can't see.
Re:"the high code quality" (Score:5, Interesting)
If I were the VPE of a company like Facebook, I don't think I would be particularly interested in focusing on this form of documentation. For the most part, for storage:
- Redundancy and resiliency, I'd want the data to always be available and to not get lost.
- Performance, among other things, I'd want the data to be available rapidly. User experience dictates that it should never take more than 3 seconds for a page to load.
Building a system like this would generally require
- Tiered storage
- Everything in an object... none of this idiotic POSIX file system nonsense that is the bane of all storage systems. I don't want filenames and stupid things like that. I want an object... I don't care where that object is... let the system deal with that. Copy it, replicate it, geodistribute it... just as long as the system can find a copy of the object, we're good.
- Indices
And that's where we have the crux of it... the ML side.
Probably 50% or more (maybe even 95% or more) of the data on a user should be stored as an ML model. The structure of those models is generally utterly incomprehensible to anyone and everyone. They're really not able to be mined either. Their structure is not meant to be an information source; rather, they are a model which provides a series of data that can simulate intuition. So, it would never store something like "list of all pictures with a teenaged boy with a pimple 18mm NE of their right pupil". Rather it would store a massive amount of data structured to intuitively identify such marks. This model may or may not contain user data; it very likely does contain some references to user data. Its structure, though, would be impossible to parse to link the user to the model.
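To make that concrete, here's a tiny Python sketch (made-up data, nothing to do with Facebook's actual pipeline) of why a trained model can't really be mined for individual records: after training, all that's left is a weight vector, not a list of the users whose data shaped it.

```python
# Toy logistic regression: the "stored" artifact is just a handful of floats.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-user feature rows (say, engagement counts) and labels.
X = rng.normal(size=(1000, 8))           # 1000 users, 8 features each
y = (X @ rng.normal(size=8) > 0).astype(float)

# Plain gradient descent on the logistic loss.
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)   # gradient step

# What persists is 8 numbers. There is no per-user entry in here to disclose
# or delete, even though every one of the 1000 users influenced these values.
print(w)
```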
For the sake of things like GDPR, it's possible I would flag objects with metadata such as country of origin so that objects would not be georeplicated to regions outside of their legal boundaries. But while this would be possible for the storage of the individual objects, it would be impossible for the ML model data unless the company maintains an entirely separate model for each region.
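Something like this toy sketch is the kind of flagging I mean; the store, the region table, and the method names are all invented for illustration, not anything Facebook (or any GDPR tooling) actually exposes:

```python
# Minimal in-memory object store that tags each object with its legal region
# at write time and refuses to replicate it outside that boundary.
import uuid
from dataclasses import dataclass, field

ALLOWED = {"EU": {"eu-west", "eu-central"}, "US": {"us-east", "us-west"}}

@dataclass
class StoredObject:
    blob: bytes
    origin: str                        # legal region of the data subject
    replicas: set = field(default_factory=set)

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, blob: bytes, origin: str) -> str:
        oid = str(uuid.uuid4())        # callers only ever see the opaque id
        self._objects[oid] = StoredObject(blob, origin)
        return oid

    def replicate(self, oid: str, datacenter: str) -> bool:
        obj = self._objects[oid]
        if datacenter not in ALLOWED[obj.origin]:
            return False               # would cross the legal boundary
        obj.replicas.add(datacenter)
        return True

store = ObjectStore()
oid = store.put(b"profile photo", origin="EU")
assert store.replicate(oid, "eu-west")         # allowed
assert not store.replicate(oid, "us-east")     # refused
```

As noted, this only works for the raw objects; nothing comparable exists for data that has already been folded into a shared model.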
So, while plausible deniability is a genuine benefit of this, I would say there are extensive technical reasons why this kind of documentation wouldn't make sense to do anyway.
Re:"the high code quality" (Score:5, Insightful)
Re: (Score:3)
Absolutely agree, and I'll add: they put all resources ($) into coding and overall development, and essentially none into documentation. We've (coders) all done it at some point on some scale. I've even done physical (hardware) stuff that I never documented, for many reasons, mainly that I wasn't sure it was going to stay that way (Ethernet fun, for example), and it would have taken longer to document than it took to actually do it.
So much of engineering, including software and systems development, is don
Re: (Score:2)
If I were the VPE of a company like Facebook, I don't think I would be particularly interested in focusing on this form of documentation.
Let's see how well this works:
FB VP of some sort: "We've got a large customer that wants to target data type X for ads, can we do it?"
FB Engineers: "Well, we're not really sure where that data is stored, so not sure, we'll need to send out a recon team to scour through the code, hopefully we can find it eventually"
FB VP of some sort: "Uuuhhhhh, seriously, the most valuable freaking part of our system and you designed it so we can't just easily locate it to sell access to it?"
Guilty by design. (Score:5, Insightful)
They're gonna mod me down for saying it, but they let it get this way on purpose with the specific goal of hiding responsibility for premeditated crimes.
Re: (Score:2)
Exactly! Zuck doesn't want to know. Much better to lie by stonewalling than by telling outright factual lies.
Re: (Score:3)
Surprised no other dev has stated the obvious, but this is the normal practice at every business: no documentation on code. Who has the time to write it, and worse, who will be responsible for maintaining it? It quickly adds up to a lot of wasted dev hours!
The only time a team would care about documenting anything is when it's being shipped publicly, so the end-user is either paying for it, or it benefits in its adoption, e.g. Facebook's React framework.
I've worked at several high-profile companies,
Re: (Score:2)
even testing is often overlooked or done as a last minute thing.
Not testing is not an agile thing; that's just a bad software development process thing. Which is not to say some organizations might not use agile as an excuse to cut back on testing.
Not terribly surprised (Score:2)
I'm not terribly surprised by this. When a company blows up as quickly as Facebook did, technical debt can be impressive and SW design methods can be terrible.
Re: (Score:3)
Re:Not terribly surprised (Score:5, Insightful)
Re: (Score:3)
As a comparison, Google has been 100% clear (on a macro level, as you put it) about how they store their data and there are even public re-implementations of much of what they have. They also have very clear ways of controlling where data is and what it does. If they are lying it would be very clear in a legal situation.
There is a reason why Google is competing with AWS in the cloud business and Facebook isn't, however.
Should affect valuation of facebook (Score:2)
Re: Not terribly surprised (Score:3)
Re: (Score:2)
Setting a price doesn't actually require much knowledge of the audience... being able to plausibly claim knowledge is enough. See also over-the-air TV advertising.
Re: (Score:2)
What is even worse about that... (Score:2)
At first glance that seems pretty bad, but if you think about it much it's even worse - because if they don't know where your data is, how could they know when any of it is accessed by someone who shouldn't be able to access it, and alert you?
GDPR (Score:5, Interesting)
Isn't it a violation of the European General Data Protection Regulation if they have no idea what kind of personal data they have and where it's stored?
Re:GDPR (Score:5, Informative)
It is. A rather bad one.
unstructured data (Score:2)
I suspect that FB uses a lot of unstructured data that is distributed and replicated, with information spread across different clusters. Your data is there, and there, and there, and there, and there. And they probably have to read through entire flat files to pick out your data. Gotta wonder what it means to handle GDPR/CCPA requests.
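In practice a subject-access request against that kind of layout degenerates into a brute-force scan. A rough Python sketch, where the directory layout and the user_id field are just assumptions for illustration:

```python
# Scan every newline-delimited JSON log on every cluster for one user's data.
import json
from pathlib import Path

def collect_user_records(roots: list[Path], user_id: str) -> list[dict]:
    hits = []
    for root in roots:
        for path in root.rglob("*.jsonl"):
            with path.open() as fh:
                for line in fh:
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        continue                 # unstructured junk, skip it
                    if record.get("user_id") == user_id:
                        hits.append(record)
    return hits

# e.g. collect_user_records([Path("/data/cluster-a"), Path("/data/cluster-b")], "u123")
```

That's fine for a demo and hopeless at Facebook scale, which is rather the point.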
Didn't see this one coming ... (Score:5, Funny)
A special master looking into the possession and improper handling/storage of information -- and it's Facebook? Wow. If this person can figure *this* all out, I think there might be a similar position open in Florida ... :-)
Another issue altogether (Score:1, Flamebait)
Because data isn't stored in some unified architecture, because there isn't one unifying "truth", there's no way to know whether collected data is accurate, current, and unique.
The truth is that data is stored all over the place, and between engineers with their "pet projects" (like trying to geolocate ex-girlfriends or people they're stalking) and whoever else accesses data, it's just strewn about. They need some sort of software system that can properly track access requests and also force them to put data into silos wi
Re: Another issue altogether (Score:2)
It's not that they don't know (Score:5, Insightful)
Re:It's not that they don't know (Score:4, Insightful)
Or they use NoSQL-type solutions with different teams and different clusters, such that no one person knows all of it. The special master needs to interview more than just a couple of engineers, probably a lead from every team, to gather all of the info and build the big picture for the court. I'm sure you don't want to know how the sausage is made.
Re:It's not that they don't know (Score:5, Interesting)
Actually most of the data is in MySQL. But not just any MySQL, a special version that they "improved". I wish I was making this up. They should be using Postgres or Oracle, as that does what they need. But rather than retraining their DBops folks to use Postgres (which is a really nice DB to admin), they decided to "fix" MySQL (completely misunderstanding what MySQL does or why it is written the way it is) to do more Postgres-like things. The problem with this is that when DBs are written, they are targeted at specific parts of the market (specific use cases). MySQL is a fine DB, but in no way is it appropriate for anything FB does. So its planner is simple and the DB kernel lacks certain XOs (capabilities) that other DBs have, so it can be better at different use cases (which FB doesn't have). For example, MySQL's prepared statements are "simulated", so it doesn't save the amount of resources that Postgres does when using prepared statements. There are plenty of other examples. MySQL isn't bad, it is just different from what Postgres is. But instead of just switching to Postgres at some point, they went on a multi-year effort to "fix" MySQL which ultimately and predictably failed.
It takes a special amount of ego to make this type of mistake. Knowing how to match DBs to specific usages isn't that hard and plenty of folks can do it. Writing DBs is insanely hard and requires special folks (who probably don't work for FB) and centuries of dev time. It is vastly easier to just use the right DB for the job. It is insanely difficult to make a DB do something it wasn't designed to do. They chose the insanely hard and risky approach for no real benefit. It takes a special and incredibly dysfunctional culture to make this type of mistake.
So yeah, I believe they honestly don't know where the data is. Implementing business logic is for the peons who can't write their own DB, after all. I suspect there are people who did know where the data is and how it is moved, but they probably moved on long ago. Likely just after they started rewriting MySQL.
Re: (Score:2)
Can you imagine the nightmare of Oracle licensing, compliance testing and audits?
Re: (Score:3)
https://engineering.fb.com/201... [fb.com]
Re: (Score:3)
https://dzone.com/articles/fac... [dzone.com]
Distributed data storage (Score:3)
When your storage is designed to distribute stuff nearer to the edge you don't necessarily "know" where the data is. You can find various copies of it, probably, but which piece of data is canonical? It's probably a meaningless question at this point.
That said, I'm sure there are uuids and guids. However, at some point I wouldn't be surprised if pieces of content got detached from the original user...like if they were quoted, or someone downloaded an image and re-shared it, etc.
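Here's a toy Python sketch of how that detachment happens; every name in it is hypothetical, it's just the shape of the problem:

```python
# A re-share copies the bytes under a fresh UUID; nothing forces the original
# owner's identity to come along for the ride.
import uuid

objects = {}  # uuid -> {"blob": ..., "owner": ...}

def upload(blob: bytes, owner: str) -> str:
    oid = str(uuid.uuid4())
    objects[oid] = {"blob": blob, "owner": owner}
    return oid

def reshare(oid: str, sharer: str) -> str:
    new_id = str(uuid.uuid4())
    objects[new_id] = {"blob": objects[oid]["blob"], "owner": sharer}
    return new_id

photo = upload(b"\x89PNG...", owner="alice")
copy = reshare(photo, sharer="bob")
print(objects[copy]["owner"])   # "bob" -- alice's link to her own photo is gone
```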
Re: Distributed data storage (Score:3)
Microservice Architecture (Score:4, Interesting)
This is a very common issue and not in just big companies, but with the microservice architecture in general. There's so many services, APIs, databases, loggers and all sorts of other processors bolted onto the application, that it's virtually impossible to keep track of the data flow.
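The bookkeeping that would answer "where does this field flow?" has to be built in deliberately; it doesn't fall out of the architecture. A minimal Python sketch of the idea (the registry, decorator, and service names are all invented):

```python
# Each handler declares which personal-data fields it reads, so the question
# "which services touch location?" has a queryable answer.
from collections import defaultdict
from functools import wraps

FLOWS = defaultdict(set)  # field name -> services that touch it

def touches(service: str, *fields: str):
    """Register which personal-data fields a handler reads."""
    def decorate(fn):
        for f in fields:
            FLOWS[f].add(service)
        @wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        return wrapper
    return decorate

@touches("ads-ranker", "email", "location")
def rank_ads(user):
    ...

@touches("friend-suggester", "location", "contacts")
def suggest_friends(user):
    ...

print(FLOWS["location"])   # {'ads-ranker', 'friend-suggester'} (set order may vary)
```

Without something like this (or its equivalent in a data catalog), the honest answer to "where does the data go?" really is "nobody knows."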
Re: (Score:3)
Precisely!
Facebook isn't extraordinary or wilfully hiding here; every developer knows this is the reality of most engineering departments or software houses where massive amounts of data are being gathered, especially in an ORM/NoSQL environment. There's a rough idea of which server(s) would host the data, but in terms of what's being stored or where, no one person would know.
Re: (Score:2)
NoSQL databases make me puke. They remind me of my chest of drawers - all the random crap, devices, cables and deity_of_choice knows what else is there, but every drawer holds a new mystery and a new world order scheme.
Standard SOP (Score:5, Insightful)
Facebook always pulls stunts like this. When the .au government was trying to make them pay for news content last year, Facebook had a little toddler tantrum and blocked a whole bunch of non-news content, like public information notices from the government, posts from community groups, etc. Facebook claimed this was a "mistake" with their "algorithm."
This alone would be fine. Businesses make mistakes. But seriously, Facebook is a global company that makes money by accurately targeting content at billions of individuals.... but all of a sudden, they don't know how to block just a few dozen news sites?
It's all bullshit. When it suits them, Facebook says they're a super awesome technology company with the best programmers and the best algorithms, and we should totally trust them with all our stuff. But whenever they face consequences for their shady behaviour, they immediately pull the "incompetence" excuse, "oh no, it's the algorithm's fault" or in this most recent case, "oh, sorry, we don't know where that data is!"
Facebook is a sewer. Even if you don't know the location of each turd, it's still full of shit.
Re: (Score:2)
But seriously, Facebook is a global company that makes money by accurately targeting content at billions of individuals....
What makes you think it's accurate? They sell a pile of horse-shit to their advertising customers. On the surface, they make it look good. But these customers don't have the clout to demand straight answers. Or the authority to impose penalties for falsehoods. Facebook knows it and odds are that their smarter customers know it as well. And factor in the fact that half of what they are seeing is probably bullshit and never going to make them money.
It's probably not much better than putting up a billboard al
Re: (Score:2)
It's probably not much better than putting up a billboard along the side of the road and hoping that a small fraction of the people that drive by it fit your target market.
More than one case study has shown that when companies pull back massively on digital ad spending, their sales and profits are unchanged in any statistically significant way. And that's what really matters - eyeballs don't earn money (well, not unless selling eyeballs is your business). Facebook (and other advertisers) may or may not be able to target content accurately - I don't have any inside information, but everything I read suggests that they make bold claims but any attempt to prove those claims resu
Incredibly Damning (Score:2)
If they don't know the data sources or the data flows there's no way they can reliably claim they comply with any privacy standards unless they don't collect the private data - and we all know that they do.
This is a Feature (Score:3)
This is on purpose, so they can sell your data to everyone and everything and then pretend it was an accident.
Not surprising. (Score:5, Insightful)
To be honest I'm not surprised at their answers. It's not just Facebook, most corporate systems grow in a pretty haphazard manner as requirements are added and storage is added for whatever data a particular new system needs without any coordination with the other systems in the company. Give it a decade to develop and the result is all but incomprehensible from the outside.
One of the reasons IBM sells so many mainframe systems even today is that the large corporate software assemblages (you can't call them systems anymore) that run on them have had so many changes made over such a long time, and so much documentation about what those changes were for has simply been lost in moves, that the complete requirements are unknown, and any attempt to reverse-engineer them from the existing code would take so much time and effort that by the time it was done it'd be obsolete again. In some cases it'd be literally impossible because the code (at least in source form) no longer exists. It was lost to media obsolescence, or discarded or forgotten in one move or another, or its location was simply never updated so it can't be located anymore. The only option is to keep running the existing programs, by whatever means necessary.
Re:Not surprising. (Score:4, Funny)
most corporate systems grow in a pretty haphazard manner
Yeah. But sometimes it's not haphazard. You get either a customer (DoD) or a regulator (FAA) that says a particular group or division is responsible for data/documents that they create. And the solution is to build a system that allows them to control access and updates and the subset of users allowed to perform each function.
I worked with a system that was a 'traffic cop' between several dozen such data warehouses. Allowing users to get overviews of engineering data from one place without having to log on to each source and pull down bits and pieces manually. During a reorganization, the company selected a new head of our division. He wanted to see what the data flows surrounding our system looked like. He asked for a data flow diagram on one sheet of paper. The diagram was 3 by 4 feet (and the fonts and symbols were pretty small). He took one look at it and quit. But we had no problems managing it. It was just a matter of not looking at every use case jammed together on one piece of paper.
Par for the course (Score:1)
Simple fix (Score:2)
Re: (Score:2, Insightful)
Re: (Score:3)
Simpler fix: throw Mark in jail until someone comes up with the info. I think they have it already; frankly, I think the advertisers demand that level of competency. So give them motivation to turn it over. I just find it incomprehensible that FB could be that inept.
I agree that Mark Z should be doing hard time in prison! As for the advertisers, I think they _do_ demand the info, but facebook can just lie to them too. "Yes, we showed your ad to 34-year-old males with 80k income in these zip codes, here's the bill." The advertisers have no choice but to accept facebook's word that these ad impressions really were displayed, and that they really were displayed to the demographics that facebook claims.
I would expect the majority of these claims to be false. How many times
This explains why React is what it is... (Score:1)
Of course! Do you know where your cloud server is? (Score:4, Interesting)
Microsoft and Amazon keep the locations of their data centers a closely guarded secret. The buildings are unmarked; you won't find them on Google Maps. In reality, unless your data is on-premise, you don't know where your data is either.
Re: (Score:2)
Yeah, but that argument really only works for end users. If you WORK at Microsoft or Amazon, you can find out where your data is stored. Those "closely guarded secrets" aren't secret at all to the people who work there. (Trust me... I knew more than one person who worked for one of those companies who drove around from data center to data center, just to amuse friends riding with them in their vehicle who always wondered where their buildings were at.)
And no, they're not *so* secure that doing such a thing
Re: (Score:2)
While I don't disagree with you, I'm sure there are a whole lot, maybe a majority, of Microsoft and Amazon employees, who have no idea where the data centers are, even if they do software development that involves those data centers.
They are just like Trump's lawyers (Score:2)
In a storeroom at Mar a Lago of course (Score:1)
hmm, GDPR (Score:3)
This sort of explanation makes it clear Facebook is completely incompatible with GDPR.
As the regulation is not slated to disappear, Facebook will have to either change or leave the EU market entirely.
Meh. easily solved. (Score:2)
Make a list of each kind of data it is possible to glean from each user (location, names, aliases, sites visited, email and other accounts etc etc etc) and tax Metabook 5 cents per data point per user per day for any that Metabook cannot prove is being collected or stored.
Re: Meh. easily solved. (Score:2)
Crappy no-good brain! That should have read: .. tax Metabook 5 cents per data point per user per day for any that Metabook cannot prove is _not_ being collected or stored.
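For scale, a back-of-the-envelope in Python with completely made-up inputs (the user count and per-user data-point count are guesses, not Meta figures):

```python
users = 3_000_000_000          # rough order of magnitude of Facebook accounts
data_points_per_user = 1_000   # hypothetical unprovable data points per user
rate = 0.05                    # dollars per data point per user per day

daily_tax = users * data_points_per_user * rate
print(f"${daily_tax:,.0f} per day")   # $150,000,000,000 per day
```

Even if the real numbers are off by a couple of orders of magnitude, the point stands: priced that way, "we don't know" gets very expensive very quickly.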
feature not a bug (Score:2)
Single Source of Truth (Score:2)
"I don't believe there's a single person that exists who could answer that question,"
But no company owning a complex system needs a single person who can explain it. That's not even a team's effort, but an organization's effort as a function of policy: to have (and fund) a multi-layered team, going from the CTO and enterprise architects down to the tech leads (and mediated by architects), to document what their systems are doing or are supposed to be doing.
Even if there's no 100% accurate description of a system, there must be a documented system architecture and set of procedures and
Same answer my kids would give me (Score:1)
If you can't ID the data… (Score:1)
…then how do I know the ads I am paying for are reaching the right people? You're either stonewalling the government or lying to customers. No wonder Schmucherberg says he wakes up sick every morning.
Does this surprise anyone? (Score:2)
They can exist because the system architecture is built in such a way that it can survive the continuous state of disrepair.
These systems require an awful lot of redundancy; systems are decoupled by writing to shared storage (Kafka logs, k/v stores, parquet/avro/whatever files, etc.). They're completely denormalized. Normalization is the
Re: (Score:1)