OpenAI Loses Fight To Keep ChatGPT Logs Secret In Copyright Case (reuters.com)
A federal judge has ordered OpenAI to hand over 20 million anonymized ChatGPT logs in its copyright battle with the New York Times and other outlets. Reuters reports: U.S. Magistrate Judge Ona Wang in a decision made public on Wednesday said that the 20 million logs were relevant to the outlets' claims and that handing them over would not risk violating users' privacy. The judge rejected OpenAI's privacy-related objections to an earlier order requiring the artificial intelligence startup to submit the records as evidence. "There are multiple layers of protection in this case precisely because of the highly sensitive and private nature of much of the discovery," Wang said.
An OpenAI spokesperson on Wednesday cited an earlier blog post from the company's Chief Information Security Officer Dane Stuckey, which said the Times' demand for the chat logs "disregards long-standing privacy protections" and "breaks with common-sense security practices." OpenAI has separately appealed Wang's order to the case's presiding judge, U.S. District Judge Sidney Stein.
A group of newspapers owned by Alden Global Capital's MediaNews Group is also involved in the lawsuit. MediaNews Group executive editor Frank Pine said in a statement on Wednesday that OpenAI's leadership was "hallucinating when they thought they could get away with withholding evidence about how their business model relies on stealing from hardworking journalists."
I haven't followed this case too much... (Score:1)
But what does OpenAI think is permissible in this case? They don't want the courts to see the outputs of their software, and I imagine they would fight any attempt to see how they built it. What's left? Just trust their word?
Re: (Score:2, Insightful)
Re:I haven't followed this case too much... (Score:5, Insightful)
In order to do it properly you'd need to have a process similar to declassification redactions, where a human can reason about real-world context. And you'd need a lot of bodies to do that to 20M chats in any reasonable amount of time.
"De-identification" automation can sometimes give you a dataset that by itself is anonymized. You really need structured input data for that, though, and the real problem is that there are frequently ways to "enrich" an anonymized dataset by finding other datasets you can join it to.
And here we're talking about freeform chats with multimodal inputs; those tools really can't cope with that sort of thing.
Further, the "enrichment" for this sort of thing could be weird. I could theoretically have described a situation to ChatGPT that didn't have identifying names/numbers in it, but that you could recognize, thus outing me. There's no way to redact that sort of thing.
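The "enrichment" risk described above can be sketched with a toy example (all names and records here are invented): even after direct identifiers are stripped from a dataset, joining it against a second, public dataset on shared quasi-identifiers can re-identify a record.

```python
# Toy linkage attack: the "anonymized" release drops names, but a public
# roster shares quasi-identifiers (ZIP code, birthdate), enabling a join.
# All data here is invented for illustration.

anonymized_chats = [
    {"zip": "10001", "birthdate": "1980-03-14", "topic": "rare illness"},
    {"zip": "94105", "birthdate": "1975-07-02", "topic": "tax question"},
]

public_roster = [
    {"name": "A. Example", "zip": "10001", "birthdate": "1980-03-14"},
    {"name": "B. Sample",  "zip": "60601", "birthdate": "1990-01-01"},
]

def reidentify(anon_rows, roster):
    """Join on quasi-identifiers; a unique match outs the person."""
    hits = []
    for row in anon_rows:
        matches = [p for p in roster
                   if p["zip"] == row["zip"]
                   and p["birthdate"] == row["birthdate"]]
        if len(matches) == 1:  # unique combination -> re-identified
            hits.append((matches[0]["name"], row["topic"]))
    return hits

print(reidentify(anonymized_chats, public_roster))
# -> [('A. Example', 'rare illness')]
```

No pattern-matching on the released data itself would have flagged a ZIP code and birthdate as identifying; the risk only appears once you consider what other datasets exist.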
Re: (Score:2)
There is no practical way to do that. Seriously.
I agree. Well, you can't get everything out. Specific things like, say, SSNs or the more common health problems can be blanked out with patterns. But misspell the name of the condition you have, or describe it instead of naming it, and you are already screwed in most cases. And names, quasi-identifiers of people, etc. are basically impossible to recognize reliably.
Hence what also needs to be done here is that anybody working on the data must be under oath not to leak any personal data, and all process
Re: (Score:2)
Sure there is. You just need 1,000 FBI agents working overtime [house.gov], just like they did with the Epstein files.
Re: (Score:3)
There is no practical way to do that. Seriously.
You could use AI to do it. ;)
Re: (Score:3)
There is no practical way to do that. Seriously.
Yet these AI companies would be the first to tell other people that their AI can do it easily.
Re: (Score:3)
1. Almost everyone.
2. Yes.
Re: (Score:2)
The only decent thing to do is to keep these anonymized. If they become public record, every bit of personal information entered into ChatGPT will be public knowledge. SSNs. ID card scans. Affairs. Mental problems. Health problems. There shouldn't even be a question here.
All of the logs would certainly be subject to a court protective order, and everyone involved takes them very seriously. Anyone that got caught publicizing protected information would be in very deep trouble.
Re: I haven't followed this case too much... (Score:2)
Or maybe people should be aware that providing such personal information to an LLM is not safe? The statement that it is unsafe is true regardless of whether people are aware of it. So we might as well make it clear, instead of pretending that user privacy is somehow respected.
Why are there logs? (Score:2)
I'm not very AI savvy, so this may be a dumb question.
Why do the logs exist to begin with? Do the ChatGPT algorithms use them to "learn?"
Re: (Score:3)
Training data, issue diagnosis, market research, targeting data for ads, probably to sell it to others at some time.
Re: (Score:3)
Because OpenAI doesn't actually care about privacy. They're just using that argument as a smokescreen.
Re: (Score:1)
Everything typed into AI is kept. When you use Copilot to look through things on your Windows PC, it's kept; when you dump your company's sales figures into WhateverAI, it's kept; when you have a transcription bot sit in on your product development meeting, it's kept; and with Apple, if they do add personal context, your entire screen's contents will be transmitted to them and kept.
Even if you delete the thread, it's kept. If you use the thumbs up button or thumbs down button it'
Re: (Score:2)
Criminals fail to hide the evidence? (Score:2)
Such a shame. I think we should be "tough on crime" on these people!
Re: (Score:2)
The "tough on crime" stance doesn't cover white-collar, billionaire crime. That one falls squarely into the "settlement" or "pardon" category, except in the rare cases where other billionaires were the target of said crime. Mostly.
If only there was some kind of "intelligent" tool (Score:2)
that you could feed the logs into and have it detect names, SSNs, phone numbers, and other PII and replace them with asterisks.
You know, a tool that's not a real person but has some ability to do seemingly intelligent things. There's a name for it, it's on the tip of my tongue.
Re: (Score:2)
It is called "a perl one-liner".
We used to cobble them together all the time in the few minutes between important tasks, back in the day before social networking and vibe-coding took over.
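A pattern-based scrubber of the kind described here is easy to sketch, and its limits are just as easy to see. Below is a minimal Python version (standing in for the Perl one-liner; the patterns are illustrative, not a complete PII list): it catches well-formatted SSNs and phone numbers, but a spelled-out or described version of the same information sails straight through.

```python
import re

# Minimal pattern-based redactor. Patterns are illustrative only:
# an SSN-shaped token (123-45-6789) and a US-phone-shaped token.
PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN like 123-45-6789
    re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # phone like 555-867-5309
]

def redact(text: str) -> str:
    """Replace every match of each pattern with asterisks."""
    for pat in PATTERNS:
        text = pat.sub("***", text)
    return text

print(redact("My SSN is 123-45-6789, call 555-867-5309."))
# -> My SSN is ***, call ***.

# The described/spelled-out version is untouched -- exactly the failure
# mode raised upthread about misspelled or paraphrased information.
print(redact("My social starts with one two three."))
# -> My social starts with one two three.
```

This is the gap between structured de-identification and freeform chat: the regexes only ever see surface form, never meaning.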
Re: (Score:1)
The text that people typed into a black box on the Internet is not their persons, houses, papers, nor effects.
The Fourth Amendment only makes sense when applied to material people actually try to keep secure. "Diary entries" is a laughable comparison.
The users sent this data to someone else's computer without even wondering what the company would do with it, let alone reading the terms they were agreeing to.
Fortunately, there's unlikely to be much criminal evidence in there that could be used against a spec
Re: (Score:2)
Short Answer: The Fourth Amendment generally does not protect queries you voluntarily enter into public websites that are logged, because courts often hold that once you share information with a third party, you lose a reasonable expectation of privacy. However, if law enforcement seeks access to those logs, constitutional protections may apply depending on the circumstances and evolving case law.
Re: (Score:2)
I want the New York Times to win... (Score:2)
I want the New York Times to win and set a precedent that feeding copyrighted works into an AI without permission is a copyright violation (regardless of what the AI does with it or what output it generates). Pop the generative AI (aka "slurp up a whole bunch of copyrighted content, mix it around and spit bits of it back out") bubble completely.