



Lawsuit Accuses Meta Of Training AI On Torrented 82TB Dataset Of Pirated Books (hothardware.com) 47
"Meta is involved in a class action lawsuit alleging copyright infringement, a claim the company disputes..." writes the tech news site Hot Hardware.
But the site adds that newly unsealed court documents "reveal that Meta allegedly used a minimum of 81.7TB of illegally torrented data sourced from shadow libraries to train its AI models." Internal emails further show that Meta employees expressed concerns about this practice. Some employees voiced strong ethical objections, with one noting that using content from sites like LibGen, known for distributing copyrighted material, would be unethical. A research engineer with Meta, Nikolay Bashlykov, also noted that "torrenting from a corporate laptop doesn't feel right," highlighting his discomfort surrounding the practice.
Additionally, the documents suggest that these concerns, including discussions about using data from LibGen, reached CEO Mark Zuckerberg, who may have ultimately approved the activity. Furthermore, the documents showed that despite these misgivings, employees discussed using VPNs to mask Meta's IP address to create anonymity, enabling them to download and share torrented data without it being easily traced back to the company's network.
Nothing would surprise me. (Score:4, Informative)
Once a crook, always a crook.
Par for the LLM course (Score:5, Informative)
Most if not all the models are being trained on stolen data. This is not new. It's just that Meta is so incompetent as to leave a paper trail. Those China models people talk about so much? Trained on this set and more. Then people get "clever" and train one LLM removed. They take a set trained on pirated data and use it to train a non-pirated set. Then the indirectly-trained LLM is used. See! Clean!
It's a constant train robbery, and we're all getting fucked every which way.
Re: (Score:2)
Hey at least Meta didn't murder anyone to cover the trail :p
Re: Par for the LLM course (Score:5, Insightful)
You're equating stifling progress of AI to stifling progress of humanity itself? AI advancement will cause the human race to regress and even [slashdot.org] Microsoft [slashdot.org] is sounding the alarm.
Re: (Score:3)
LLMs are going to advance us into being even more retarded. I look forward to the day when my LLM will talk to your LLM for an hour while we stare at porn. Then our LLMs will deliver us summaries that we will be too illiterate to read. This is totally worth destroying all incentives for humans to be creative. You're a genius.
Re: (Score:2)
Joke's on you. I have an LLM stare at porn for me.
Re: (Score:2)
The AIs who stare at porn.
Re: Par for the LLM course (Score:2)
The AI that's being trained on p0rn. I might be willing to use its services.
Re: Par for the LLM course (Score:5, Insightful)
Re: (Score:2)
Rent seeking from corporations. They'll copyright their LLMs, don't you worry.
Re: (Score:2)
"What's the point of having copyrights at all, then?"
The AI didn't copy them, it just read them.
Re: (Score:2)
Re: (Score:2)
But we're also responsible. Everyone using AI for their own benefit is responsible. I take an AI-free approach and don't use AI for that reason. If you use AI or play with it, then you are part of the problem.
Re: (Score:2)
I have to write "cunt" at the end of every Google search, which is a pain to remember to do. Does that make me part of the problem?
Re: Par for the LLM course (Score:2)
By writing "cunt" you might be requesting search results and digests from Gemini's little sibling, Cunt.
Re:Par for the LLM course (Score:4, Insightful)
Most if not all the models are being trained on stolen data. This is not new. It's just that Meta is so incompetent as to leave a paper trail. Those China models people talk about so much? Trained on this set and more. Then people get "clever" and train one LLM removed. They take a set trained on pirated data and use it to train a non-pirated set. Then the indirectly-trained LLM is used. See! Clean!
It's a constant train robbery, and we're all getting fucked every which way.
I think that's called data laundering.
Re: Par for the LLM course (Score:1)
Re: (Score:3)
Perhaps AI will finally spark the debate we need to have about modernising copyright. Sad that it takes greedy corporations to get things moving.
Re: (Score:2)
??? The copyright laws as they exist were caused by greedy, rent-seeking, companies. Don't expect a battle over details by companies to improve things. Copyrights should not last for more than 15 years, with one allowed renewal...if you want to pay a substantial fee.
Re: (Score:2)
??? The copyright laws as they exist were caused by greedy, rent-seeking, companies. Don't expect a battle over details by companies to improve things. Copyrights should not last for more than 15 years, with one allowed renewal...if you want to pay a substantial fee.
And only applied to books, maps, and other things that were expensive to produce. Newspapers were originally not copyrightable.
I can see the argument that limiting copyright to anything that costs over $100k to produce would be in keeping with the spirit of the original copyright law.
Re: (Score:1)
Re: (Score:2)
Sorry but copying one database over to another database is not a "transformative work". It's simply infringing on copyright.
Re: (Score:2)
THIS IS NOT A DRILL. Backup Llama and other LLMs while you still can. THIS IS NOT A DRILL.
So, to de-sensationalize this: (Score:5, Informative)
The thread in question was titled "Legal Escalations", because the decision process about what they could train on heavily involved Meta's lawyers.
"torrenting from a corporate laptop doesn't feel right" wasn't expressed as "strong ethical objection" - it was followed by a laughing face emoji.
This is not a new case. Kadrey v. Meta Platforms, Inc [courtlistener.com] was filed nearly a year ago. Numerous documents, including these accusations, and responses from Meta (such as this [courtlistener.com], this [courtlistener.com], and many more) are not "news" in any way.
Re: So, to de-sensationalize this: (Score:1)
82 TB are rookie numbers. That number is nowhere even close.
Short-Lived Suit (Score:3, Interesting)
AI as a Plagiarism/Copyright Legal Firewall (Score:4, Informative)
So many questions! (Score:1)
Re: So many questions! (Score:2)
And 95-year copyright is the even bigger crime, that deprives us all of the benefits of the creativity copyright was meant to support.
Pirated books?! (Score:2)
The first time Meta does some good for a change and this is how you react?
Re: (Score:3)
What good have they done?
Re: (Score:1)
Apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh water system, and public health?
Re: Pirated books?! (Score:2)
But doesn't data want to be free? (Score:2)
It's amusing to see the outrage about AI being trained on pirated books, while that same outrage conveniently disappears when one discusses piracy in general.
Based on the comments I've seen on Slashdot over the years, copyright is evil and shouldn't even exist, or else intellectual property is a complete scam, or copyright should (at most) be granted for 5 years. So what's the big deal if companies train AI on that pirated content that everyone lauds?
AIs want data to be free, too.
Re: (Score:2)
It's one thing to make it publicly available, it's another to resell a concealed version. I know which I consider a worse crime.
P.S.: Copyright should last about 15 years with one allowed (expensive) renewal.
Re: (Score:3)
Yeah, for personal people to use. For personal use.
That's "piracy in general". Not screwing over artists by creating COMMERCIAL works based on their hard work and PROFITING off of it.
I hope you can understand the difference, but probably not.
Re: Why would the size be relevant? (Score:2)
Confused - torrents aren't used to violate copyrights, so how can using documents scraped off torrents be copyright protected, thus should not be used for something like this?
For example, the entire Project Gutenberg library is free to use to train your AI, right?
"Suggest" "May have"? (Score:2)
Additionally, the documents suggest that these concerns, including discussions about using data from LibGen, reached CEO Mark Zuckerberg, who may have ultimately approved the activity.
The document "suggests" that Zuckerberg "may have" approved of this? Really?
Are they redistributing the material? (Score:1)
Copyright is all about copying and reselling (at any price) material. I don't understand what the big deal is, as Meta is not reselling the material. It is digesting it and using it, which is the same as any person would do when reading a book. As far as I can see there is no copyright infringement as they aren't reselling it.
This is why China is winning (Score:1)
In China, there is one collection in their digital libraries of all the world's books (with Chinese translations.) Everyone has access to it.
In the West, you can be sued for reading a book.
If we had changed our copyright laws, we could have prevailed.
Re: This is why China is winning (Score:2)