Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers' (msn.com)
For more than a decade, the nonprofit Common Crawl "has been scraping billions of webpages to build a massive archive of the internet," notes the Atlantic, making it freely available for research.
"In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.
"In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..." Common Crawl's website states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.
I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, "Generative AI in its current form would probably not be possible without Common Crawl." In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.
Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said.

Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles.
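To make that mechanism concrete: a client-side paywall only works if the client actually executes the page's JavaScript. Here is a minimal sketch of a JavaScript-free fetch, in the spirit of what a crawler does; the URL is hypothetical and the user-agent string is illustrative:

```python
# A plain HTTP fetch with no JavaScript engine. The paywall <script> that
# would check for a subscription and hide the article text is delivered in
# the response but never runs, so the full article HTML arrives intact.
import urllib.request

req = urllib.request.Request(
    "https://news.example.com/2024/some-article",  # hypothetical paywalled page
    headers={"User-Agent": "CCBot/2.0"},           # illustrative bot user agent
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# The article body is present in the raw HTML; only a browser that executes
# the paywall script would hide it from a non-subscriber.
print(html[:500])
```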
Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls.
"In the past year, Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites," the article points out...
"In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..." Common Crawl's website states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.
I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, has written, "Generative AI in its current form would probably not be possible without Common Crawl." In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that steal readers from writers and publishers.
Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said. Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles.
Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls.
"In the past year, Common Crawl's CCBot has become the scraper most widely blocked by the top 1,000 websites," the article points out...
Comment Subject: (Score:1)
>paywall
asking the client not to look != a wall
articles were openly broadcast, news ignored
Re: (Score:2)
Copyright limitations apply to all copyrightable works by default, even when the works are not explicitly marked with a copyright notice. It's a simple rule, which makes life easy for everyone. All anyone has to ask is: Do I own the copyright? No? Then I can't copy it.
Re: (Score:2)
No, but if you put your stuff out at the curb, it is free for anyone to pick up, even if that someone is a business. These companies are putting their content out there free for any crawler bot to take, no paywalls. They just don't like some of the bots that stop by to pick stuff up.
Broken paywalls (Score:2)
Sites either apply the paywall only after the page loads, or let certain User-Agents and IP ranges through, because they want search engines to list articles you cannot access, so that you might buy access. If the crawler is fast enough, it may catch the page before the paywall activates, just as Googlebot is supposed to do.
So if you want a working paywall, use a working paywall, and don't leave it open to bots just so you can spam search engines with inaccessible results.
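A rough sketch of the leaky server-side logic described above; all names and thresholds are hypothetical. The point is that the User-Agent header is self-reported, so a paywall that waves bots through is an honor system:

```python
# Hypothetical paywall that serves full text to anything claiming to be a
# search-engine bot, so that articles still get indexed. Any client can send
# one of these strings, so the "wall" only stops polite visitors.
ALLOWED_BOTS = ("Googlebot", "Bingbot")

def render_article(user_agent: str, is_subscriber: bool, article: str) -> str:
    if is_subscriber or any(bot in user_agent for bot in ALLOWED_BOTS):
        return article  # full text for subscribers and self-declared bots
    return article[:300] + "... [subscribe to read more]"  # teaser otherwise

# A scraper simply claims to be Googlebot and receives the full text:
print(render_article("Mozilla/5.0 (compatible; Googlebot/2.1)", False,
                     "Full article text " * 50))
```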
Re: (Score:2)
It lets them send you the articles in whole, so that they are in your possession after the fact, which greatly aids their claims in court that you stole the content.
One problem I see with this hypothesis: didn't a company that makes porn movies get caught seeding its own torrents, which blew up in its face when it tried taking the people who torrented those movies to court?
I can't help but imagine that would make such maneuvers seem less likely to work; there are cases of people trying it only to have it blow up in their faces, I mean.
Re: (Score:2)
The point is that it's not just Googlebot that gets an exception; for example, anyone using a data-center IP may get one too, so other bots also get a free pass.
And if your other protection can be defeated by disabling JavaScript (most can't anymore), it is just as ineffective. You can't know, and can't control, what kind of client a visitor uses to access your site.
So if you want to secure your content, put a full login in front of it. Then there is no access without an account, and you can rate-limit access per account.
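A minimal sketch of that defense, with illustrative numbers: every read is tied to an account, and each account's reads are capped per time window, which makes bulk scraping slow and expensive:

```python
# Per-account sliding-window rate limiter (all thresholds illustrative).
# With a mandatory login in front of the content, an account that reads far
# more articles than a human plausibly could is easy to throttle or ban.
import time
from collections import defaultdict

WINDOW_SECONDS = 3600   # look-back window
MAX_READS = 50          # article reads allowed per account per window

_reads: dict[str, list[float]] = defaultdict(list)

def allow_read(account_id: str) -> bool:
    now = time.time()
    recent = [t for t in _reads[account_id] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_READS:
        _reads[account_id] = recent
        return False        # over the cap: likely automated scraping
    recent.append(now)
    _reads[account_id] = recent
    return True
```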
Why is everything built on theft (Score:1)
Re: (Score:2)
it's not like someone hacked into their computers to wade through private folders full of their bs. if it's published and accessible on the net then it's free to read. afaik there is no law yet that defines bypassing a paywall as a crime, much less "theft". if you don't want that, simple: don't publish it. if you still do and can't find enough suckers willing to pay for it, then cry me a river. btw, you also seem to use a very skewed interpretation of "property".
Re: (Score:2)
copyright is a law that attempts to conflate physical property with "intellectual" property, sadly with some success. it's still not the same. the intent is precisely to promote access to someone else's ideas or expressions, not to elevate copying them to some equivalent of "theft". they haven't gone that far yet.
Re: (Score:2)
The intent *was*, a few copyright reforms ago.
Probably before many users here were born. Can't blame them for only knowing the copyright regime that was designed to help Disney keep its creations commercialized forever. Aren't they already lobbying for the next extension? I guess AI-related copyright reforms may play into their hands on that.
Re: (Score:2)
indeed. think of the artists! they have been useful tools before, in exchange for the possibility of a breadcrumb ... ofc media conglomerates, corporations and well-intentioned politicians valiantly lead the fight.
Re: (Score:2)
Depends. If you signed up, you are bound by the ToS. That's why you probably can't just shovel content from behind a paywall into AI training with an account.
If you didn't, only copyright law applies. Currently, courts interpret copyright law such that AI training is transformative use, which means you don't need a license to use the data (but only for uses that count as transformative. Training: yes; training and reading: no).
If you crawl too fast you might also get fram
Re: (Score:2)
yeah, i was mostly referring to the concept of theft, but indeed it's more nuanced.
Re: (Score:2)
No, this is not theft. This is more like somebody putting some old furniture by the curb, and being upset when an upholstery company comes along and takes it. After all, they wanted to give away that furniture only to individuals, not businesses! But that's not how it works. If you put your furniture at the curb, you are telling the world it's free. You don't get to pick what kind of person picks it up.
These websites are making their entire content free to crawlers. There is no paywall for the bots; only humans see paywalls.
As usual, follow the money. (Score:2, Informative)
Summary missed key point from the original article: "In 2023, after 15 years of near-exclusive financial support from the Elbaz Family Foundation Trust, it received donations from OpenAI ($250,000), Anthropic ($250,000), and other organizations involved in AI development."
https://www.msn.com/en-us/mone... [msn.com]
And 'scientific' papers (Score:2)
Still wanting to know when the research community will begin downvoting the foundational social-science research papers and excluding them from new citations where that flawed research used what is now considered scientifically flawed and inadmissible methodology: self-reported surveys, tiny sample sizes, outcome-based survey questions, etc.
Can someone in the research area give insight into what happens when a paper is retracted, both to that paper and to the papers which cite it, and then to the second-generation citations?
Translation (Score:3)
Translation 1: Those companies are making a mistake by not giving him what he wants for free, in order for his company to become profitable.
Translation 2: Those companies should not have put stuff on the internet for sale behind a paywall if they didn't want people like me to steal it without getting permission.
Re: (Score:2)
They actually *are* giving crawlers what they want for free. It's just that now, after starting to realize what some crawlers do with the content, they're getting choosy about which kinds of crawlers can have it for free.
You can not call Javascript a "paywall" (Score:2)
Re: (Score:3)
It does not matter how broken the DRM is; circumventing it is still illegal according to copyright law.
Re: (Score:3)
What if you broadcast a concert, unencrypted, over the airwaves and then insert a message every so often, "you may not listen to this program"? That designates your intent that no one listen, but it is not an effective technical protection measure because you are broadcasting to everyone in the vicinity, in the clear, in a format they understand. It is like putting up a billboard on the side of a highway and expecting no one to look at it. Not effective at all.
Re: (Score:2)
If you use a fence instead of a wall, you cannot sue people for seeing you behind the fence, though.
Re: (Score:2)
It's worse than that. Common Crawl and other bots identify themselves as bots and generally respect robots.txt. These websites with so-called paywalls intentionally do NOT block the bots with any kind of paywall, broken or otherwise. They want their full content indexed so they show up in search results, but the results are just teasers meant to get humans to subscribe. So these sites intentionally tell the crawlers there is no paywall when there actually is.