Major US Newspapers Sue OpenAI, Microsoft For Copyright Infringement (axios.com)
Eight prominent U.S. newspapers owned by investment giant Alden Global Capital are suing OpenAI and Microsoft for copyright infringement, in a complaint filed Tuesday in the Southern District of New York. From a report: Until now, the Times was the only major newspaper to take legal action against AI firms for copyright infringement. Many other news publishers, including the Financial Times, the Associated Press and Axel Springer, have instead opted to strike paid deals with AI companies for millions of dollars annually, undermining the Times' argument that it should be compensated billions of dollars in damages.
The lawsuit is being filed on behalf of some of the most prominent regional daily newspapers in the Alden portfolio, including the New York Daily News, Chicago Tribune, Orlando Sentinel, South Florida Sun Sentinel, San Jose Mercury News, Denver Post, Orange County Register and St. Paul Pioneer Press.
Makes sense (Score:5, Insightful)
Re: (Score:2)
Re: (Score:2)
You make the exact case for why citations are necessary. When you see bad data was used in the response, you can throw it out.
Re: (Score:2)
Bad Data meaning "anything I don't agree with or like"
That works until it is gamed by bad actors.
Re:Makes sense (Score:4, Insightful)
I guess I really don't see the problem with what data was "ingested", as long as it doesn't reconstitute it whole cloth.
What if I had a perfect memory and memorized every article and book I ever read and then was somehow able to make a living out of creating reviews or book reports on the subject matter that I read. Is that copyright infringement?
Re: (Score:3)
In your book review example, you would be specifying the book being reviewed, and all the content you create would be your original opinion about the book, maybe with a few quotes in your review, that you would indicate as quotes from the work.
Re: (Score:2)
Bullshit.
Cite YOUR fucking sources for your claims that you should cite sources for your own knowledge when used in every day scenarios. Guess what, I don't have to have footnotes when I'm talking to either another geologist, OR the moron that is trying to build a house on an unstable hillside. I just have to say "that's dangerous, permission to build there is DENIED". I don't have to cite sources of papers saying it's stupid to build a house on the edge of an unstable crumbling cliffside.
Re: (Score:2)
"Cite YOUR fucking sources for your claims that you should cite sources for your own knowledge when used in every day scenarios."
Every fucking customer I deal with demands to know how I know this or that to come to how I repaired or designed their PCB.
Any customer that isn't asking you to prove your shit is an idiot and is a major cause of why we get tons of hacks that claim they know shit but don't.
Re: Makes sense (Score:2)
They aren't making verbatim copies.
The only way to get what even looks like one is to use so many of the words from the original as input tokens that you have practically rewritten the output anyway.
Re: Makes sense (Score:2)
The publishers won't care what the output is if they can prove a million copies of their articles were made in the internal process of training.
Re: Makes sense (Score:2)
That's not how anything works. They might as well sue everyone on the planet for the temporary copy their browser made. How ignorant.
Re: Makes sense (Score:2)
You might as well comment on totally made up things rather than your modified version of what the rest of us are talking about. How ignorant.
Re: Makes sense (Score:2)
Hell no!
And what if, instead of developing perfect recall and sharing it with your family, you just put those stories on a computer and collected millions of dollars in venture capital so the public could enjoy the summaries and mash ups at a price that was practically giving them away. Would that be so wrong?
Re:Makes sense (Score:4, Insightful)
I mean, if those AI products use copyrighted data in coming up with their responses without citation, that is a copyright violation.
Data is not subject to copyright under US law. For example it is settled law anyone can OCR a copyrighted phone book into a database and there isn't anything the copyright holder can do about it. Copyright law in the US protects performances and the (re)production of works and derivatives. It does not limit access to or the use of information.
I've always pushed that ChatGPT and others need to produce a bibliography for each response indicating the sources used in coming up with it. Not only for copyright compliance, but also so the response can be verified.
LLMs are influenced by literally everything in their training set.
Re: (Score:2)
Re: (Score:2)
Uhhh, that's a perfect example?
The specific layout of the phone book is copyrightable. The data inside isn't. Just like the specific layout of the "reporting" of "news" is copyrightable, but the "news" information itself is NOT.
So unless your LLM is putting out exact replicas of the articles written without having to go through extreme measures to get it to put out anything even remotely close to the original sources ( which means something has gone VERY wrong in training I might add ), you aren't doing ANY
Re: (Score:2)
More relevant is that the US Supreme Court has ruled that copying the entire internet and storing it is not copyright infringement if the amount of data displayed at one time is de minimis. This is why search engines like Google, Bing, and Duck Duck Go are legal.
The question is going to be why is the use of LLMs not okay when search engines are okay. The basic reasoning that people don't want to admit is that search engines increase the value of copyrighted works while LLMs diminish the value of the works,
Re: Makes sense (Score:2)
And a major difference? A search engine result sends you to the source, whereas an LLM pretends (when it isn't happily supplying "actual text") there is no sou
Re:Makes sense (Score:4, Insightful)
Coming up with an answer using copyrighted materials WITH a citation is still copyright infringement unless you can make a fair use case. Citations are irrelevant for infringement. If it's fair use, you don't need a citation, either. The fair use discussion involves a number of considerations, but if the AI ingested the entire work and is spitting back something based on the entire thing, like, say "Give me the Cliff's Notes version of XXX work", it's going to be a hard sell.
Re: (Score:2)
So even if the unique generated text doesn't violate copyright, by not generating "copyright management information" pertinent to some facts they could still be infringing some aspect of copyright law? Interesting. Seems to me t
Re: (Score:2)
According to the article one of the claims really is about citation: "The newspapers also claim OpenAI and Microsoft removed copyright management information, like journalists' names and titles, from their work when the information they reported was cited in answers to queries. "
So even if the unique generated text doesn't violate copyright, by not generating "copyright management information" pertinent to some facts they could still be infringing some aspect of copyright law? Interesting. Seems to me this practice is ubiquitous within the news industry, otherwise every story would have a bibliography like a scientific paper.
The copyright management bits are a DMCA thing. They are just throwing it on the wall.
Re: (Score:2)
Coming up with an answer using copyrighted materials WITH a citation is still copyright infringement unless you can make a fair use case.
No. You make it sound like the non-copyright case is the exception to the norm. In reality it's the copyright infringement which is largely the exception. Virtually nothing exists in a vacuum, and most ideas are in some way derivative.
Re: (Score:2)
Re: (Score:2)
The news outlets should step cautiously here because a whole lot of what they do is repeat each other without attribution.
All significant news outlets are owned by the same small handful of large corporations, and they collude to present a unified front to the public.
Re: (Score:3)
I mean, if those AI products use copyrighted data in coming up with their responses without citation, that is a copyright violation.
Except that it is not. Using copyrighted data is not a copyright violation (with or without citation).
Republishing copyrighted materials (or a substantial portion thereof) without a license is a copyright violation.
(There can be a fair-use exception if it is "transformative" as opposed to "derivative" -but this would be subject to interpretation.)
I hope that this goes to trial -I want to hear the legal arguments:
According to the law as written, it is not technically a violation. [republishing is not occur
Re: (Score:2)
I've always pushed that ChatGPT and others need to produce a bibliography for each response indicating the sources used in coming up with it.
It's ironic that you said that as an idea rather than actually providing a citable legal precedent as to why or at least a bibliography of where you learnt that idea from.
Re: (Score:2)
So unfortunately it's unlikely to be copyright infringement because users who posted on these sites (/. included) are going to be bound by the TOS of the site they posted to. Which almost universally include clauses saying the site can use the posted information as desired.
My brain (Score:2, Redundant)
My squishy, human brain learns the same way: I read a newspaper and absorb the knowledge.
Then we both answer questions about that knowledge when asked. Nothing copied verbatim.
The learning method is quite different, but I don't see how it's plagiarism.
Re: (Score:1)
> My squishy, human brain learns the same way, I read a newspaper and absorbs the knowledge.
Our heads would be taxed and billed if someone found a way. With AI it's a bit more objective to prove borrowing, at least with the current state of the art. In the future, slimebags may find ways to disguise the training source.
Re: (Score:3)
> My squishy, human brain learns the same way, I read a newspaper and absorbs the knowledge.
Our heads would be taxed and billed if someone found a way.
College textbook companies would absolutely do this if they could find a way to get away with it.
Re: My brain (Score:2)
It depends on if the LLM is producing entire passages word for word, which evidence suggests it is.
In the human world, you might be able to claim (with evidence) that you never read the original work and the issue is parallel development. It's harder for an AI to make a similar claim unless the work was published after training.
But ultimately it's a question of judgment. If it looks like it's copied, the courts are more likely to determine it's infringement.
Re: (Score:2)
It depends on if the LLM is producing entire passages word for word, which evidence suggests it is.
I'm curious how the newspapers are getting them to do this. I suspect they are picking obscure articles so there won't be a lot of training data on it, then seeding the prompt with a few sentences and asking it to generate the next sentence, then re-prompting until they get the exact sentence they are looking for, then moving on to the next one. If so, that's a bit bogus.
Re: (Score:2)
They will be required to hand over this information during discovery.
IF this goes to trial, I expect this to be made public. If it is trivial to trigger, the complainants will say so -if it is not, the defense will bring it up on cross.
Re: (Score:2)
We humans sometimes reproduce phrases we read, too. Where's the evidence LLMs are doing anything more than that?
IOW, if a human had perfect memory, wouldn't that person sometimes reproduce some phrases unintentionally?
Re: My brain (Score:2)
That's not a defense for infringement for a human either. Courts might not assess a financial penalty depending on the circumstances, but they are probably going to issue an order to stop infringing activity.
Re: (Score:2)
I don't think they're even asserting that ChatGPT is plagiarizing like a human would.
They seem to be claiming that the sort of processing and understanding ChatGPT does/has isn't enough to allow it to answer questions at all without violating copyright. It's like they are prejudiced toward human understanding.
I think they're just looking for those millions.
Re: (Score:2)
My squishy, human brain learns the same way, I read a newspaper and absorbs the knowledge.
Then we both answer questions about that knowledge when asked. Nothing copied verbatim.
The learning method is quite different, but I don't see how it's plagiarism.
A good non-fiction book often has a long bibliography of references at the back. I don't see why computers should be exempt from the same sort of verification / bibliographic references.
Re: (Score:2)
Exactly. Apparently there is a generation of people who never had to do footnotes and bibliographies on their school papers.
Re: (Score:2)
Exactly. Apparently there is a generation of people who never had to do footnotes and bibliographies on their school papers.
I'm amazed that the Slashdot crowd doesn't have more science readers that see books stuffed full of references on a regular basis. That's been standard MO for as long as any of us have been alive in the western world. Use information from a verified source in your own work: Make direct reference to said source. It seems logical, which probably explains why the techbros running the bigger LLMs would avoid it at all costs. Logic seems to have fled that entire area of expertise as fast as it could.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
ChatGPT absolutely does provide extensive references to every one of its sources, annotated very thoroughly in its answers.
Everyone demanding this seems to have never used it before.
Re: (Score:2)
ChatGPT answers DO already contain extensive references to its sources.
Re: (Score:2)
ChatGPT does provide extensive references like that throughout its answers.
Re: (Score:2)
ChatGPT does provide extensive references like that throughout its answers.
Having dabbled with it a bit, I can give a firm "sometimes" agreement to this. It'd be nice if every answer came with a clickable "sources" link.
Re: (Score:2)
I use it a lot and I don't remember anything lately not providing extensive sources, usually several per sentence. But I am sure your experience is valid, too.
Re: (Score:2)
Re: (Score:2)
It's only a copyright violation if the material is reproduced exactly. IOW, plagiarized.
Good first step (Score:1, Redundant)
Re: (Score:2)
Re: (Score:2)
can just create big pile of training data that's free from copyright.
IMO hyper-focusing on copyright *status* this way doesn't make sense. If a work is created anywhere where copyright is automatic, it's copyrighted regardless of the licensing or lack thereof. If we focus on making "using copyrighted works" itself wrong OR illegal (OR both) you'd kill off the ability to use works where the creators either gave explicit permission, or implicit permission through use of the appropriate Creative Commons license for instance, to use for training, which doesn't make sense to m
Re: (Score:2)
Darn, now will have to get my free news elsewhere (Score:2)
You can read a copyrighted work... (Score:5, Insightful)
An AI can read it also.
The only issue is whether it creates a non-derivative output.
Re: (Score:2)
Re: (Score:2)
Derivative works are still protected under United States copyright law.
Re: (Score:1)
Did the AI have a license to access it in the first place to read it? Can it then pass a bunch of copies of what it read (not what it output) around to its progeny?
There are many issues here, and you've missed an obvious one in saying "only."
Dying Newspapers' Last Stand - Lashing Out (Score:1)
Re: (Score:1)
It's scary (Score:1)
Re: (Score:1)
this lawsuit is definitely going to shake things (Score:1)