AI The Internet

'Copyright Traps' Could Tell Writers If an AI Has Scraped Their Work

An anonymous reader quotes a report from MIT Technology Review: Since the beginning of the generative AI boom, content creators have argued that their work has been scraped into AI models without their consent. But until now, it has been difficult to know whether specific text has actually been used in a training data set. Now they have a new way to prove it: "copyright traps" developed by a team at Imperial College London, pieces of hidden text that allow writers and publishers to subtly mark their work in order to later detect whether it has been used in AI models or not. The idea is similar to traps that have been used by copyright holders throughout history -- strategies like including fake locations on a map or fake words in a dictionary. [...] The code to generate and detect traps is currently available on GitHub, but the team also intends to build a tool that allows people to generate and insert copyright traps themselves. "There is a complete lack of transparency in terms of which content is used to train models, and we think this is preventing finding the right balance [between AI companies and content creators]," says Yves-Alexandre de Montjoye, an associate professor of applied mathematics and computer science at Imperial College London, who led the research.

The traps aren't foolproof and can be removed, but De Montjoye says that increasing the number of traps makes it significantly more challenging and resource-intensive to remove. "Whether they can remove all of them or not is an open question, and that's likely to be a bit of a cat-and-mouse game," he says.

  • If they want to steal my writing that's fine. It'll show up in Google searches just like it does when ordinary humans steal my writing, at which point we get to call the lawyers.
    • by Njovich ( 553857 ) on Saturday July 27, 2024 @01:29AM (#64659232)

      That's a great way to ensure at least someone gets paid for that use of the work. And that someone of course being your lawyer.

    • Re: (Score:3, Insightful)

      by bjwest ( 14070 )

      Nobody's stealing anything from you; you still have the original file. There's just a new copy that was created when it was downloaded.

      Perhaps the AI checked it out from a library to read, just like I would've had I needed it to learn something from it. Are you going to sue me because I used a free copy of it to learn, and retained that knowledge?

      • by Anonymous Coward
        It's funny how all the arguments from the times of piracy are returning - only this time, people are against piracy.

        I suppose it's about power. When the relatively powerless are pirating the content of powerful corporations, one is pro-piracy. When powerful corporations are regurgitating the works of less powerful authors, one is against piracy. Fine, but then one should be clear about it instead of appearing too hypocritical :)
        • The arguments have not changed and everyone is as consistent as they always have been. When nobody is profiting from the unauthorised copying, it's generally considered fine irrespective of whether the creator of the original is rich or not. Pirates which sell unauthorised copies of works to make money have always been hated, and those who make works available widely to all for free (usually at cost to the people doing the copying) have generally been loved. If all the AI models trained on the world's conte
          • The arguments have not changed and everyone is as consistent as they always have been.

            Stop. The FP called copying (or actually, training LLMs) "stealing". That is some RIAA-level bullshit that would have been ridiculed and modded into oblivion here years ago.

            I'm waiting for the first Slashdotter to argue against LLM training with something as asinine as "You wouldn't download a car".

      • by EldoranDark ( 10182303 ) on Saturday July 27, 2024 @06:23AM (#64659528)
        You can view the library copy of my book to learn what you want, but I will certainly call the lawyer once you start publishing a suspiciously similar book with entire passages copy/pasted without my consent or any attribution.
        • Since LLMs don't do that unless you put the whole work in as a prompt, you're not going to be calling anyone.

  • Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.

    The main difference I'd say is the fact that it will likely plagiarize if there are enough parameters and the work is unique enough. I'm not sure how these copyright traps work, but I imagine they capitalize on that, i.e., deliberately unique strings of tokens that can be searched for.

    • Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.

      This exactly.

      To me, if the model doesn't simply regurgitate the ingested material with normal non-pathological prompting it's not infringing anything more than a human writing a plot summary would be.

    • Last I heard this was still under debate, if ingesting someone's work via AI was considered plagiarism or not, considering that a human does a similar thing when reading other people's material for inspiration.

      This could be made clear by a clear copyright notice on the web site landing page. Not bothering to read it is not an excuse, and neither is an AI's inability to understand it. Having the web site prevent deep linking (e.g., by checking the referrer) and serve a suitable robots.txt would help.

      • This could be made clear by a clear copyright notice on the web site landing page. Not bothering to read it is not an excuse,

        So it would be like all the copyright information and warnings on all the music, movies, and software people gleefully steal. Completely ignored.

        If you can make up an excuse why you're stealing someone else's work, there is no excuse which can't be used for AI to do the same.

      • by Sloppy ( 14984 )

        This could be made clear by a clear copyright notice on the web site landing page.

        I think that is perfectly and exactly wrong.

        That won't make anything even slightly clearer, because the default assumption is that it's under copyright. (If you find information saying it's PD, that could help, though, since it addresses an exceptional case.)

        The defense for scrapers and other processes or people who read web pages, isn't that the work isn't copyrighted. It's that reading the page is considered Fair Use.

        That i

    • by Anonymous Coward

      I'm not sure that matters when it explicitly violates terms of service.

      You do stuff with the content that the authors don't like, then legally the authors have recourse to tell you to stop and seek damages. A few precedents where a top tier artist wins in court against an AI company will undermine the whole industry, because it's an open secret that this work couldn't have been done if the image and tag databases weren't full of stolen content.

      Meanwhile there are still people out there who think that 'promp

      • Copyright is protecting creativity by placing restrictions on creativity. Makes sense, it's like "fucking for virginity". Did anyone find a different way to protect creativity, like for example opening up sources?
    • Stop: humans and AI are not doing the same thing. They're about as far from doing the same thing as possible.

      Every time someone says that, Jesus kills a kitten.

      Think of the kittens!

    • .. if ingesting someone's work via AI was considered plagiarism or not.

      Even if it's not considered plagiarism, it could still be copyright infringement. Copyright is literally and only about the right to copy (it's right there in the name). Plagiarism is a superset of copyright. If you copy someone else's work without authorization, you've committed copyright infringement. If you then also pass off that copy as your own work, you've committed plagiarism.

      .. a human does a similar thing when reading other pe

      • Copyright is literally and only about the right to copy (it's right there in the name).

        Actually... No. Copyright is about the right to profit from the publication of creative works. This limited right is granted in order to encourage further creative works for the benefit of society at large.

        Copyright laws were originally passed to protect authors from the predations of publishers who would produce and sell copies of the works without compensation to authors. Making a copy is only relevant to copyright where it can reasonably be seen to impact the creator's ability to profit from their work

    • Scraping data from a website is not illegal in the USA. It may be a violation of the Terms of Service agreement, but courts have ruled that ToS agreements are not binding unless affirmatively accepted.

      What you do with the scraped data may constitute a copyright violation, or it may not. Copyright law is intentionally vague, with many exceptions.

  • by The Cat ( 19816 ) on Saturday July 27, 2024 @12:43AM (#64659190)

    How it started: "Yeah, fuck those writers. Scrape it all! I'll never buy your crap! Get a real job! Learn to code! Worthless degree!"

    How it's going: "How come all TV and movies suck now?"

    • by Lehk228 ( 705449 )
      TV and movies have sucked for 15 years, hence why we don't care if worthless writers get replaced by AI. Good writers and artists have nothing to fear; it's the shitty Twitter artists selling custom furry porn and disposable filler art for magazines and shit that are going to have to get real jobs.
      • If you think TV and movies have sucked for 15 years, the problem is you, not movies. Recently Tenant was pretty great, and "Our Little Sister" was great and not typical Hollywood. "Us" was good in terms of smart horror, and I enjoyed "Arrival" although admittedly you might need to be a language nerd to enjoy it. "Spectre" was at least as good as any other Bond movie.
        • I think you mean Tenet? It was only a meh movie, even in the ratings. Our Little Sister was a rather boring chick flick. Us also got poor ratings, simply because its messaging got very confused. And Spectre was a rather boring Bond film, simply too safe with the character. Watch some old Bond to know what I mean.

    • When we figure out how to get LLMs to give attribution to their sources (an active area of research), that's when the copyright hammer is going to drop.
      • It's gonna drop equally hard on humans and AI. We can't be sure who's secretly using AI, so we need to apply it everywhere. There are few truly original ideas. Most can be attributed to someone else. Even unintentionally you could write a derivative text and get sued.
    • Can't wait to get unlimited anime generated within genre specifications by AI. If you look at it, anime is already LLM level, just recombining the same 100 ideas in new ways. There will be no degradation, but there is a chance for improved quality and diversity if you are doing it carefully. You can force LLMs to expand and add novel things, but can't force studios to take even the tiniest risk.
    • Re: (Score:1, Flamebait)

      by Growlley ( 6732614 )
      why are you so woke?
    • by mjwx ( 966435 )

      How it started: "Yeah, fuck those writers. Scrape it all! I'll never buy your crap! Get a real job! Learn to code! Worthless degree!"

      How it's going: "How come all TV and movies suck now?"

      TV and movies have always sucked. It's just that as we get older we forget how much of it sucked, because we only remember the good shows that stood the test of time. Sturgeon's Law (90% of everything is crap) was well understood back then.

  • by Rosco P. Coltrane ( 209368 ) on Saturday July 27, 2024 @02:09AM (#64659282)

    Watermarks.

    And they're old as dirt. It's hardly a new idea.

    I use them all the time to figure out who sells my data to whom. Whenever I sign up to something - willingly or not - I give my name with different middle initials, like John T. H. Doe, then the next submission, I put John T. I. Doe, then John T. K. Doe and I keep track of who I gave which name to.

    When a name comes back on a piece of junk mail or spam email, I know who sold my information. If it's a company, I put them on my list of companies never to buy anything from again.

    It works as long as whoever sold my information doesn't strip the fake middle initials, just like the "digital traps" will work if AI doesn't mangle the original works so much that the watermarks get destroyed.
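The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not anything the commenter posted: the `LeakTracker` class, its method names, and the recipient labels are all hypothetical, and the fixed first initial "T" is taken from the John T. H. Doe example.

```python
import string


class LeakTracker:
    """Hypothetical helper: hands out one distinct middle initial per recipient."""

    def __init__(self, first, last, fixed_initial="T"):
        self.first = first
        self.last = last
        self.fixed_initial = fixed_initial
        self._by_name = {}  # tagged name -> recipient it was given to
        self._letters = iter(string.ascii_uppercase)

    def name_for(self, recipient):
        # Assumption: at most 26 recipients; next() raises StopIteration after that.
        letter = next(self._letters)
        tagged = f"{self.first} {self.fixed_initial}. {letter}. {self.last}"
        self._by_name[tagged] = recipient
        return tagged

    def who_leaked(self, name_on_junk_mail):
        # Returns None if the tag was stripped or never issued.
        return self._by_name.get(name_on_junk_mail)
```

As with the traps in the story, the lookup only works while the tag survives: a name with the extra initial stripped maps to nobody.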

    • I used to provide a unique email address to every company but quickly realized they all sell you.

      There's no point. Now I have my real email for people I care about and my one junk email for all the spammy crap.

      • Now I have my real email for people I care about

        It only takes one person you care about to have a Gmail account and your real email for real people immediately becomes spam fodder and permanently attached to your real name.

        There's no point trying to protect a valuable email addy anymore, and there hasn't been for decades. Big Data will always get you and you can't do a damn thing about it because someone you can't control is always careless and it only takes one slip-up.

  • by SubmergedInTech ( 7710960 ) on Saturday July 27, 2024 @02:17AM (#64659298)

    From the paper:

    - They need "sequences of 100 tokens repeated 1,000 times."

    - These need to be seeded into a huge dataset to resist deduplication. Not one document. Not a book. Duplicate sequences can easily be detected and removed from those. According to the paper, only "large datasets containing terabytes of text" are impractical (for now) to deduplicate. But that's literally (ha) a dataset the size of a million Bibles (~4MB each).

    So this won't protect Joe Writer. No writer is prolific enough to generate terabytes of text. Not even Stephen King. The only ones who will benefit from this are big corporations trying to protect their own datasets from each other.
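For anyone who wants to check the scale, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes) and takes the ~4 MB Bible figure from the comment above; the function names are made up for illustration, and nothing here comes from the paper's code.

```python
BIBLE_BYTES = 4 * 10**6  # ~4 MB per Bible, the comment's figure
TERABYTE = 10**12        # decimal terabyte (assumption)


def bibles_per_dataset(terabytes):
    """How many ~4 MB Bibles fit in a dataset of the given size."""
    return terabytes * TERABYTE // BIBLE_BYTES


def trap_tokens(seq_len=100, repeats=1000):
    """Total tokens spent on one trap: sequence length times repetitions."""
    return seq_len * repeats


print(bibles_per_dataset(1))  # 250000
print(bibles_per_dataset(4))  # 1000000
print(trap_tokens())          # 100000
```

Under those assumptions, "a million Bibles" corresponds to about 4 TB, which is consistent with "terabytes of text", and a single trap at the quoted settings costs 100,000 tokens of repeated text.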

    • Phone books and maps and such have been protected for ages by including a few mistakes. If someone has all your mistakes, then they copied from you. Other types of works haven't traditionally needed it since it was obvious they were copied. I don't think there's any practical way to prevent some entity from learning from your work though.

  • by Eunomion ( 8640039 ) on Saturday July 27, 2024 @02:37AM (#64659322)
    Individuals are constantly and falsely bombarded with automated copyright claims that sabotage and silence them, unable to keep up with the mass-produced nature of such attacks. Meanwhile if their work does get through and gain traction, it will be massively cannibalized and counterfeited by other automation that's likewise too fast to take down. The catch-22 is the fundamental unit of dystopia.
    • You can't make money selling text, you can make money applying ideas in reality. That's what we need to accept. Like open source, you can't sell me Linux, but I can make money with it.
      • If people can't self-sustain by creating, then they won't even try. And if the AI is only based on cannibalizing what people make, its outputs will quickly lose value. Then what do you have? Just a toxic environment where people can neither create nor benefit from creation.
        • If people can't self-sustain by creating, then they won't even try.

          Nonsense. Not everything has to be about money and not everything is. Many people create many things regularly for tons of reasons. One could even argue that the stuff that is created this way often contains the more interesting things when compared to the stuff that is churned out by profit-seeking producers.

          • I didn't say profit, I said self-sustain. The pool of talent willing to create when it provides a living is larger, more diverse, and generally more motivated than that which exists when people can only do it as an act of altruism. Critical acclaim feels a lot better with a paycheck than without.
            • Irrelevant. My argument did not hinge on profiting vs. self-sustaining. I clearly showed that this is false: "If people can't self-sustain by creating, then they won't even try."
              Don't try to move the goalposts.

              • You're gonna quibble on the technicality that someone might still create as a literal starving artist? Come on. That's not how things generally work.
                • No. You're missing the point: Creating doesn't have to be a job and it shouldn't be. As I said: Not everything has to be about money and not everything is.
                  I can create art right now, even though it is not my job. I know many people who create shittons of stuff in their 'free' time.

                  You're saying that art is only created if somebody can make money off it. Think about that for a second.

          • The issue is pollution. Information pollution to be exact. People create things for the love of creating all the time, say a beautiful children's book. Now do this in a toxic AI environment where the book is copied and transmogrified endlessly, where Amazon sells a million variations on this one book, with a million endings, a million characters, a million different stories, nudity, violence, pedophilia, recipes for poisoned mushrooms, the antichrist, whatever the AIs push out in the blink of an eye. Some
        • We are actually living in an anomaly of time where we have a glut of artists and content producers capable of living solely on their art product.

          Just a hundred years or so ago, only the best artists were employed by the governments, religions and wealthy and had a commission that could sustain themselves. Now just a handful of people are rich enough to commission any niche artist they never really meet in real life and this has made art incredibly cheap as well.

          With the advent of AI being capable of derivat

  • Map makers have been doing this for centuries.

  • (Back story: I have almost 30 scifi and fantasy novels in print)
    1. Out of curiosity, I asked AI to write a chapter in the style of my work (more on that below)
    2. One of the feedback buttons was 'that's amazing, how did you do it?' so I clicked that one despite the fact the effort was laughably bad.
    3. It said thanks, then proudly informed me it had read all my novels. Odd, since I didn't realise AI had a credit card or a PayPal account.

    So how bad was the chapter? It used a variety of character names
    • There are audio only videos on YouTube of AI written sci-fi books (maybe others but I know of these). As an aside they're all AI voiced, too.

      They're complete shit. There's a huge difference between the ones written by people, the ones where people cleaned up an AI mess and the AI mess. After trying to listen to a few of the AI ones before I realized they were AI I was like these are the worst stories ever, how can anyone think this is worth publishing even for free?

      Of course people are excited by magical

    • You might enjoy this comic about AI authors: https://www.smbc-comics.com/co... [smbc-comics.com]
    • I think what we'll see in the short term is AI continuing to devalue the worth of authors and writers. Basically lowering the pay/word even lower than it is now after the book market tanked and journalism jobs disappeared as google etc vacuumed up all the ad revenue.

      So AI books and writing won't be any good, but editors and employers won't pay anybody to write better than the AI would.

      And the same will probably happen in other areas. I really can't imagine a lot of simpler developer jobs won't be threatened

    • by vivian ( 156520 )

      Anyway, I gave up writing a few years back when I saw the excitement everyone had for AI-written prose.

      If AI written prose is so terrible, as you pointed out, then you shouldn't have any problem with continuing to find readers for your good quality work.
      As an avid fan of fiction myself, I would strongly encourage you to keep writing, as frankly, there is a lack of good original work out there. So many stories I have read over the years in the Sci-fi/Fantasy genre are a copy-paste of the hero's journey that they might as well have been written by AI, and the few times I stumble across a new author who h
