
Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers' (msn.com) 42

For more than a decade, the nonprofit Common Crawl "has been scraping billions of webpages to build a massive archive of the internet," notes the Atlantic, making it freely available for research. "In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.

"In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..." Common Crawl's website states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.

I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, "Generative AI in its current form would probably not be possible without Common Crawl." In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.

Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said. Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles.

Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls.
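The client-side paywall behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical page whose server sends the full article plus a script that would hide it for non-subscribers; the markup, the `isSubscriber()` call, and the `extract_text` helper are all invented for illustration and say nothing about how Common Crawl's scraper is actually built. A parser that never runs the script still sees the article text:

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends the full article to every client,
# and the script (never executed here) would hide it for non-subscribers.
PAGE = """
<html><body>
<article id="story"><p>Full article text sent to every client.</p></article>
<script>
  if (!isSubscriber()) { document.getElementById("story").remove(); }
</script>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page text as a client that does not run JavaScript sees it."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(extract_text(PAGE))
```

Nothing here logs in or defeats any check; the text is simply present in the response before any paywall code would run.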

"In the past year, Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites," the article points out...

  • by Anonymous Coward

    >paywall

    asking the client to not look != wall

    articles were openly broadcast, news ignored

    • Just because my car door is unlocked doesn't mean you're not a thief if you enter and take stuff you like.
      • If a shop leaves free samples on a table and you take one, you're not stealing even if they hoped you wouldn't notice or take too many.
        • If you steal stuff and then lie about it "but y'honor it was just a free sample", then you should be sent to jail for it.

          Copyright limitations apply to all copyrightable works by default even when the works are not explicitly marked with a copyright. It's a simple rule, which makes life easy for everyone. All anyone has to ask is: Do I own the copyright? No. Then I can't copy it.

      • No, but if you put your stuff out at the curb, it is free for anyone to pick up, even if that someone is a business. These companies are putting their contents out there free for any crawler bot to take, no paywalls. They just don't like some of the bots that stop by to pick stuff up.

  • Sites either put on the paywall only later or let in certain User-Agents and IP-Ranges, because they want search engines to list articles you cannot access, so you may buy access. If the Crawler is fast enough, it may have caught the page before the paywall was activated, just like the Googlebot is supposed to do.

    So if you want a working paywall, use a working paywall, and don't leave it open to bots to spam search engines with inaccessible results.

  • Is this the same with other industries, or is AI alone in being entirely based on the theft of others' property?
    • by znrt ( 2424692 )

      it's not like someone hacked into their computers to wade through private folders full of their bs. if it is published and accessible on the net then it is free to read. afaik there is no law yet that defines bypassing a paywall as a crime, much less "theft". if you don't want that, simple: don't publish it. if you still do and can't find enough suckers willing to pay for it then cry me a river. btw, you also seem to use a very skewed interpretation of "property".

      • Copyright is property, no? If not then I am mistaken.
        • by znrt ( 2424692 )

          copyright is a law that attempts to conflate physical property with "intellectual" property, sadly with some success. it's still not the same. the intent is precisely to promote access to someone else's ideas or expressions to some equivalent of "theft". they haven't gone that far yet.

          • by allo ( 1728082 )

            The intent *was*, a few copyright reforms ago.
            Probably before many users here were born. Can't blame them for only knowing the copyright that was intended to help Disney keep its creations commercialized forever. Aren't they already lobbying for the next extension? I guess AI related copyright reforms may play into their hands with that.

            • by znrt ( 2424692 )

              indeed. think of the artists! they have been useful tools before, in exchange for the possibility of a breadcrumb ... ofc media conglomerates, corporations and well intentioned politicians valiantly lead the fight.

      • by allo ( 1728082 )

        Depends. If you signed up, you are bound by ToS. That's why you probably can't just shovel content behind a paywall into AI training with an account.

        If you didn't, only copyright law is applicable. Currently courts interpret copyright law such that AI training is transformative use, which means you don't need to get a license to use the data (but are restricted to use it only for the uses that count as transformative use. Training: yes, Training and reading: no).

        If you crawl too fast you might also get fram

        • by znrt ( 2424692 )

          yeah, i was mostly referring to the concept of theft, but indeed it's more nuanced.

    • It's not just AI. If I have a conversation, and later learn that the other person repeated parts of it, I don't punch him in the nose and accuse him of stealing training data. Because I acknowledge that all conversation depends on prior conversations.
    • No, this is not theft. This is more like somebody putting some old furniture by the curb, and being upset when an upholstery company comes along and takes it. After all, they wanted to give away that furniture only to individuals, not businesses! But that's not how it works. If you put your furniture at the curb, you are telling the world it's free. You don't get to pick what kind of person picks it up.

      These websites are making their entire content free to crawlers. There is no paywall, only humans see payw

  • by Anonymous Coward

    Summary missed key point from the original article: "In 2023, after 15 years of near-exclusive financial support from the Elbaz Family Foundation Trust, it received donations from OpenAI ($250,000), Anthropic ($250,000), and other organizations involved in AI development."
    https://www.msn.com/en-us/mone... [msn.com]

  • Bwhahahaha! (Score:1, Flamebait)

    by Bodhammer ( 559311 )
    "high-quality journalism" - an oxymoron if there ever was!
  • by gurps_npc ( 621217 ) on Saturday November 08, 2025 @08:36PM (#65783342) Homepage

    Translation1: Those companies are making a mistake by not giving him what he wants for free, in order for his company to become profitable.

    Translation2: Those companies should not have put stuff on the internet for sale with a paywall if they didn't want people like me to steal it without getting permission.

    • They actually *are* giving crawlers what they want for free. It's just now, after they start to realize what some crawlers do with the content, they're starting to want to get choosy about what kinds of crawlers can have the contents for free.

  • If a bot can get the content without logging in, then so can a human. You cannot call content behind a simple JavaScript screen "paywalled".
    • It does not matter how broken the DRM is, circumventing it is still illegal according to copyright laws.

      • That's not true. The law doesn't speak of "DRM" but instead of effective technical protection measures. The key word is effective - serving the content in full, unencumbered, and then relying on a client to run code to hide it again is not effective.
        • I don't put a fence around my house because it's super effective. I put it up to clearly designate the space as private and to show that you are not welcome here without permission. The fence is entirely ineffective against a bulldozer, but the bulldozer is not a clever legal loophole you can present in court.
          • by butlerm ( 3112 )

            What if you broadcast a concert, unencrypted, over the airwaves and then insert a message every so often, "you may not listen to this program"? That designates your intent that no one listen, but it is not an effective technical protection measure because you are broadcasting to everyone in the vicinity, in the clear, in a format they understand. It is like putting up a billboard on the side of a highway and expecting no one to look at it. Not effective at all.

          • by allo ( 1728082 )

            If you use a fence instead of a wall you cannot sue people for seeing you behind the fence, though.

    • It's worse than that. Common Crawl and other bots identify themselves as bots, and generally respect robots.txt. These websites with so-called paywalls intentionally do NOT block the bots with any kind of paywall, broken or otherwise. They want their full content indexed so they show up as search results, but the results are just teasers for humans to try to get them to subscribe. So these sites intentionally tell the crawlers there is no paywall, when there actually is.
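As a rough illustration of the robots.txt mechanism mentioned in the comment above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `news.example` URL, and the `allowed` helper are hypothetical; only the `CCBot` user-agent name comes from the story. It shows how a site could exclude one crawler by name while leaving the page open to every other bot:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude CCBot by name, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://news.example/story"  # hypothetical article URL
print("CCBot:", allowed("CCBot", url))
print("Googlebot:", allowed("Googlebot", url))
```

A rule like this only works if the crawler chooses to honor it; robots.txt is a request, not an access control.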
