'Copyright Traps' Could Tell Writers If an AI Has Scraped Their Work
An anonymous reader quotes a report from MIT Technology Review: Since the beginning of the generative AI boom, content creators have argued that their work has been scraped into AI models without their consent. But until now, it has been difficult to know whether specific text has actually been used in a training data set. Now they have a new way to prove it: "copyright traps" developed by a team at Imperial College London, pieces of hidden text that allow writers and publishers to subtly mark their work in order to later detect whether it has been used in AI models or not. The idea is similar to traps that have been used by copyright holders throughout history -- strategies like including fake locations on a map or fake words in a dictionary. [...] The code to generate and detect traps is currently available on GitHub, but the team also intends to build a tool that allows people to generate and insert copyright traps themselves. "There is a complete lack of transparency in terms of which content is used to train models, and we think this is preventing finding the right balance [between AI companies and content creators]," says Yves-Alexandre de Montjoye, an associate professor of applied mathematics and computer science at Imperial College London, who led the research.
The traps aren't foolproof and can be removed, but De Montjoye says that increasing the number of traps makes it significantly more challenging and resource-intensive to remove. "Whether they can remove all of them or not is an open question, and that's likely to be a bit of a cat-and-mouse game," he says.
Traps work for me. (Score:2)
Re:Traps work for me. (Score:4, Insightful)
That's a great way to ensure at least someone gets paid for that use of the work. And that someone of course being your lawyer.
Re: (Score:3, Insightful)
Nobody's stealing anything from you. You still have the original file; there's just a new copy that was created when it was downloaded.
Perhaps the AI checked it out from a library to read, just like I would've had I needed it to learn something from it. Are you going to sue me because I used a free copy of it to learn, and retained that knowledge?
Re: (Score:1)
I suppose it's about power. When the relatively powerless are pirating the content of the powerful corporations, one is pro piracy. When the powerful corporations are regurgitating the works of less powerful authors, then one is against piracy. Fine, but one should be clear about it instead of appearing hypocritical.
No hypocrisy here, only consistency (Score:3)
Re: (Score:2)
The arguments have not changed and everyone is as consistent as they always have been.
Stop. The FP called copying (or actually, training LLMs) "stealing". That is some RIAA-level bullshit that would have been ridiculed and modded into oblivion here years ago.
I'm waiting for the first Slashdotter to argue against LLM training with something as asinine as "You wouldn't download a car".
Re: (Score:3)
You wouldn't train a car.
Re: (Score:1)
Re: (Score:2)
I might suplex one.
Re: (Score:2)
No, that's Mr Thou over there!
Re: (Score:2)
I suppose a different thought experiment could clear up whether pay is part of the problem.
Let's say a company produces a freely available LLM: the weights can be downloaded by anybody but the program that comes with it only works on nVidia cards. This company gets paid by nVidia to do this, as a form of advertising. (Open source programmers later cook up a TF/Python script that does the same thing irrespective of card type.)
Let's furthermore say that the company's training data contained copyrighted material. Is that bad? If no, then the problem is that the corporations demand pay. If yes, then it's something else.
I don't see this as bad any more than I see an actual human reading a book and using that info in their line of work to better their position and get a raise in pay. If the LLM starts writing (or rather receives prompts to write) people's reports or dissertations using direct quotes without properly crediting the original work, then that is wrong, but a further question is who is wrong here? The LLM that produced the paper, or the "author" of said paper who uses the LLM in this manner without checking?
Re: Traps work for me. (Score:4, Informative)
Re: (Score:2)
Since LLMs don't do that unless you put the whole work in as a prompt, you're not going to be calling anyone.
Re: (Score:1)
An entire work doesn't need to be stolen to be a copyright violation.
Sort of like how Elon Musk doesn't cyber stalk you personally for any reason.
Is it even illegal to scrape work without consent? (Score:2)
Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.
The main difference I'd say is the fact that it will likely plagiarize if there are enough parameters and the work is unique enough. I'm not sure how these copyright traps work, but I imagine they capitalize on that, i.e., deliberately unique strings of tokens that can be searched for.
Re: (Score:2)
Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.
This exactly.
To me, if the model doesn't simply regurgitate the ingested material with normal non-pathological prompting it's not infringing anything more than a human writing a plot summary would be.
Re: (Score:2)
Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.
This could be made clear by a clear copyright notice on the web site landing page. Not bothering to read it is not an excuse, neither is an AI's inability to understand it. The web site preventing deep linking (eg checking referrer) and having a suitable robots.txt would help.
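For anyone wondering what "a suitable robots.txt" might look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the example.com URL are made up for illustration, and robots.txt is only advisory: it tells a well-behaved crawler what the site owner wants, it does not stop one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is refused everything,
# every other user agent is allowed. "ExampleAIBot" is an invented name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `may_fetch("ExampleAIBot", "https://example.com/essay.html")` is refused while an ordinary browser user agent is allowed. Whether a given scraper actually runs a check like this is entirely up to the scraper.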
Re: (Score:3)
This could be made clear by a clear copyright notice on the web site landing page. Not bothering to read it is not an excuse,
So it would be like all the copyright information and warnings on all the music, movies, and software people gleefully steal. Completely ignored.
If you can make up an excuse why you're stealing someone else's work, there is no excuse which can't be used for AI to do the same.
Re: (Score:2)
I think that is perfectly and exactly wrong.
That won't make anything even slightly clearer, because the default assumption is that it's under copyright. (If you find information saying it's PD, that could help, though, since it addresses an exceptional case.)
The defense for scrapers and other processes or people who read web pages, isn't that the work isn't copyrighted. It's that reading the page is considered Fair Use.
That i
Re: (Score:1)
I'm not sure that matters when it explicitly violates terms of service.
You do stuff with the content that the authors don't like, then legally the authors have recourse to tell you to stop and seek damages. A few precedents where a top tier artist wins in court against an AI company will undermine the whole industry, because it's an open secret that this work couldn't have been done if the image and tag databases weren't full of stolen content.
Meanwhile there are still people out there who think that 'promp
Re: (Score:2)
Re: (Score:1)
Copying someone else's creativity is not creative. By definition it is the opposite. It is copying.
Re: (Score:1)
Stop: humans and AI are not doing the same thing. They're about as far from doing the same thing as possible.
Every time someone says that, Jesus kills a kitten.
Think of the kittens!
Re: (Score:2)
Even if it's not considered plagiarism, it could still be copyright infringement. Copyright is literally and only about the right to copy (it's right there in the name). Plagiarism is a superset of copyright. If you copy someone else's work without authorization, you've committed copyright infringement. If you then also pass off that copy as your own work, you've committed plagiarism.
Re: (Score:2)
Copyright is literally and only about the right to copy (it's right there in the name).
Actually... No. Copyright is about the right to profit from the publication of creative works. This limited right is granted in order to encourage further creative works for the benefit of society at large.
Copyright laws were originally passed to protect authors from the predations of publishers who would produce and sell copies of the works without compensation to authors. Making a copy is only relevant to copyright where it can reasonably be seen to impact the creator's ability to profit from their work
Re: (Score:2)
Scraping data from a website is not illegal in the USA. It may be a violation of the Terms of Service agreement, but courts have ruled that ToS agreements are not binding unless affirmatively accepted.
What you do with the scraped data may constitute a copyright violation, or it may not. Copyright law is intentionally vague, with many exceptions.
Overheard on Slashdot (Score:3, Insightful)
How it started: "Yeah, fuck those writers. Scrape it all! I'll never buy your crap! Get a real job! Learn to code! Worthless degree!"
How it's going: "How come all TV and movies suck now?"
Re: (Score:1)
Re: Overheard on Slashdot (Score:3)
Re: (Score:1)
Re: Overheard on Slashdot (Score:2)
Re: Overheard on Slashdot (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:1, Flamebait)
Re: (Score:2)
Be sure to tell your friends you used the term "woke" on a public web site, you get extra points for that y'know.
Re: (Score:2)
Re: (Score:2)
How it started: "Yeah, fuck those writers. Scrape it all! I'll never buy your crap! Get a real job! Learn to code! Worthless degree!"
How it's going: "How come all TV and movies suck now?"
TV and movies have always sucked. It's just that as we get older we forget how much of it sucked because we only remember the good shows that stood the test of time. Sturgeon's Law (90% of everything is crap) was well understood back then.
Re: (Score:2)
Re: The terminology seems broken (Score:2)
Re: (Score:1)
If someone who didn't know Harry Potter were to write about a magic school getting attacked, they're extremely unlikely to come up with HP.
Whereas an LLM is going to feed you HP clones all day because it has no imagination or inner thoughts. It is only the product of the data fed into it.
You even say this but you think you're saying the opposite when you say, "The more you contribute, the more original the result". The only thing more data provides is a larger range of things to clone. The originality is
Re: (Score:2)
An LLM would only write you a Harry Potter clone if HP is part of the training data and you try really hard to steer it towards HP when you prompt it (at which point I'd argue you're the one violating copyright). If you don't feed it HP, then it'll never create a story that resembles HP.
Indeed, I challenge you to come up with a prompt and LLM combination that reproduces the first page of Harry Potter and the Sorcerer's Stone, where the prompt is shorter than one page, and that the prompt would not by itself
"Digital traps" otherwise known as (Score:5, Interesting)
Watermarks.
And they're old as dirt. It's hardly a new idea.
I use them all the time to figure out who sells my data to whom. Whenever I sign up to something - willingly or not - I give my name with different middle initials, like John T. H. Doe, then the next submission, I put John T. I. Doe, then John T. K. Doe and I keep track of who I gave which name to.
When a name comes back on a piece of junk mail or spam email, I know who sold my information. If it's a company, I put them on my list of companies never to buy anything from again.
It works as long as whoever sold my information doesn't strip the fake middle initials, just like the "digital traps" will work if AI doesn't mangle the original works so much that the watermarks get destroyed.
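The bookkeeping described above is easy to sketch. This is only an illustration of the idea, not anyone's actual tool: the class name `LeakTracker`, the fixed first initial "T" and the choice to hand out marker letters in alphabetical order are all assumptions.

```python
from string import ascii_uppercase

def tagged_name(first, last, index, fixed_initial="T"):
    """Build a name whose second middle initial encodes index (0 to 25)."""
    marker = ascii_uppercase[index]
    return f"{first} {fixed_initial}. {marker}. {last}"

class LeakTracker:
    """Remember which tagged name was handed to which recipient."""

    def __init__(self, first, last):
        self.first = first
        self.last = last
        self.issued = {}  # tagged name -> recipient it was given to

    def sign_up(self, recipient):
        """Issue the next unused tagged name and record who received it."""
        index = len(self.issued)
        if index >= len(ascii_uppercase):
            raise ValueError("ran out of single-letter markers")
        name = tagged_name(self.first, self.last, index)
        self.issued[name] = recipient
        return name

    def who_leaked(self, name_on_junk_mail):
        """Return the recipient that was given this exact name, or None."""
        return self.issued.get(name_on_junk_mail)
```

The first sign-up for John Doe would get "John T. A. Doe", the second "John T. B. Doe", and so on. The lookup is an exact match, so it returns `None` once somebody strips the fake initials, which is the same weakness the traps have.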
Re: (Score:2)
I used to provide a unique email address to every company but quickly realized they all sell you.
There's no point. Now I have my real email for people I care about and my one junk email for all the spammy crap.
Re: (Score:2)
Now I have my real email for people I care about
It only takes one person you care about to have a Gmail account and your real email for real people immediately becomes spam fodder and permanently attached to your real name.
There's no point trying to protect a valuable email addy anymore, and there hasn't been for decades. Big Data will always get you and you can't do a damn thing about it because someone you can't control is always careless and it only takes one slip-up.
Sounds completely impractical for any human writer (Score:5, Interesting)
From the paper:
- They need "sequences of 100 tokens repeated 1,000 times."
- These need to be seeded into a huge dataset to resist deduplication. Not one document. Not a book. Duplicate sequences can easily be detected and removed from those. According to the paper, only "large datasets containing terabytes of text" are impractical (for now) to deduplicate. But that's literally (ha) a dataset the size of a million Bibles (~4MB each).
So this won't protect Joe Writer. No writer is prolific enough to generate terabytes of text. Not even Stephen King. The only ones who will benefit from this are big corporations trying to protect their own datasets from each other.
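To show why a repeated trap is easy to spot in a small collection, here is a rough sketch of exact-match duplicate detection. It is not the paper's method or any real deduplication pipeline: it splits on whitespace instead of using a model tokenizer, uses a 5-token window instead of 100, and the function name `repeated_windows` is made up.

```python
from collections import Counter

def repeated_windows(documents, window=5, min_count=2):
    """Count every run of `window` whitespace-separated tokens across all
    documents and return the runs seen at least `min_count` times."""
    counts = Counter()
    for text in documents:
        tokens = text.split()
        for i in range(len(tokens) - window + 1):
            counts[tuple(tokens[i:i + window])] += 1
    return {" ".join(run): n for run, n in counts.items() if n >= min_count}

# Three short documents that all carry the same planted sequence.
docs = [
    "the cat sat on the mat purple walrus files taxes quietly today",
    "purple walrus files taxes quietly and then it rained all week",
    "nobody expected snow purple walrus files taxes quietly in june",
]
print(repeated_windows(docs))
```

Here the only 5-token run that shows up more than once is the planted one, so it stands out immediately and could be stripped. Keeping a counter of every window is cheap for a book or a blog, which is the commenter's point; the claim attributed to the paper is that it gets impractical at terabyte scale.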
Re: (Score:2)
Phone books and maps and such have been protected for ages by including a few mistakes. If someone has all your mistakes, then they copied from you. Other types of works haven't traditionally needed it since it was obvious they were copied. I don't think there's any practical way to prevent some entity from learning from your work though.
AI brings the worst of all IP worlds. (Score:4, Insightful)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
If people can't self-sustain by creating, then they won't even try.
Nonsense. Not everything has to be about money and not everything is. Many people create many things regularly for tons of reasons. One could even argue that the stuff that is created this way often contains the more interesting things when compared to the stuff that is churned out by profit-seeking producers.
Re: (Score:2)
Re: (Score:2)
Irrelevant. My argument did not hinge on profiting vs. self-sustaining. I clearly showed that this is false: "If people can't self-sustain by creating, then they won't even try."
Don't try to move the goalposts.
Re: (Score:2)
Re: (Score:2)
No. You're missing the point: Creating doesn't have to be a job and it shouldn't be. As I said: Not everything has to be about money and not everything is.
I can create art right now, even though it is not my job. I know many people who create shittons of stuff in their 'free' time.
You're saying that art is only created if somebody can make money off it. Think about that for a second.
Re: (Score:2)
Re: (Score:1)
Cartographers (Score:2)
Map makers have been doing this for centuries.
Re: (Score:2)
I got it to dob itself in (Score:2)
1. Out of curiosity, I asked AI to write a chapter in the style of my work (more on that below)
2. One of the feedback buttons was 'that's amazing, how did you do it?' so I clicked that one despite the fact the effort was laughably bad.
3. It said thanks, then proudly informed me it had read all my novels. Odd, since I didn't realise AI had a credit card or a PayPal account.
So how bad was the chapter? It used a variety of character names
Re: (Score:1)
There are audio only videos on YouTube of AI written sci-fi books (maybe others but I know of these). As an aside they're all AI voiced, too.
They're complete shit. There's a huge difference between the ones written by people, the ones where people cleaned up an AI mess and the AI mess. After trying to listen to a few of the AI ones before I realized they were AI I was like these are the worst stories ever, how can anyone think this is worth publishing even for free?
Of course people are excited by magical
Re: I got it to dob itself in (Score:2)
Re: (Score:1)
I think what we'll see in the short term is AI continuing to devalue the worth of authors and writers. Basically pushing the pay per word even lower than it is now, after the book market tanked and journalism jobs disappeared as Google etc. vacuumed up all the ad revenue.
So AI books and writing won't be any good, but editors and employers won't pay anybody to write better than the AI would.
And the same will probably happen in other areas. I really can't imagine a lot of simpler developer jobs won't be threatened
Re: (Score:1)
Anyway, I gave up writing a few years back when I saw the excitement everyone had for AI-written prose.
If AI written prose is so terrible, as you pointed out, then you shouldn't have any problem with continuing to find readers for your good quality work.
As an avid fan of fiction myself, I would strongly encourage you to keep writing, as frankly, there is a lack of good original work out there. So many stories I have read over the years in the Sci-fi/Fantasy genre are a copy paste of the hero journey that they might as well have been written by AI, and the few times I stumble across a new author who h
Re: (Score:2)
I guess you don't know how cryptography works.