Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm (yahoo.com)
Multiple AI companies are ignoring robots.txt files meant to block the scraping of web content for generative AI systems, reports Reuters, citing a warning sent to publishers by content licensing startup TollBit.
TollBit, an early-stage startup, is positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them. The company tracks AI traffic to the publishers' websites and uses analytics to help both sides settle on fees to be paid for the use of different types of content... It says it had 50 websites live as of May, though it has not named them. According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate "numerous" AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.
"What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites," TollBit wrote. "The more publisher logs we ingest, the more this pattern emerges."
The article includes this quote from the president of the News Media Alliance (a trade group representing over 2,200 U.S.-based publishers): "Without the ability to opt out of massive scraping, we cannot monetize our valuable content and pay journalists. This could seriously harm our industry."
Reuters also notes another threat facing news sites: Publishers have been raising the alarm about news summaries in particular since Google rolled out a product last year that uses AI to create summaries in response to some search queries. If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
Yah ... (Score:2, Troll)
If publishers want to prevent their content from being used by Google's AI to help generate those summaries, they must use the same tool that would also prevent them from appearing in Google search results, rendering them virtually invisible on the web.
I'm sure this is nothing more than an unfortunate coincidence that Google will fix .... eventually ... rest assured, we're working on it ... any day now ...
Re: (Score:1)
Indeed. This is just publishers whining that they aren't getting something for nothing.
Re:Yah ... (Score:4, Insightful)
If their content is worth nothing then why are they being scraped?
Re: Yah ... (Score:2)
Tulip bulbs.
Re: (Score:2)
Doesn't matter either way. Shit gets trained into the AI on Slashdot time.
In other words, in six months to a year the AI will know about it, and it will likely spit out just as many dupes!
Re: (Score:1)
Multiple AI Companies Ignore Robots.Txt Files
I would be shocked if there is anyone who DOESN'T ignore the Robots.Txt file
Re: (Score:1)
It gets hits on my webserver. It's clearly not being ignored by everyone.
Re: (Score:2)
There's an important difference between retrieving that file and complying with its contents.
Re: (Score:1)
True, but it's technically not being "ignored" if it's at least being fetched, even if it's not being actually obeyed...
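The fetch-versus-obey distinction is easy to see with Python's standard library, which ships a robots.txt parser. Obeying is a separate, entirely voluntary step in the crawler's own code; nothing enforces the answer. A minimal sketch (the rules and example.com URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, parsed directly from lines rather than
# fetched over the network.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A polite crawler calls can_fetch() before every request; a rude one
# simply skips this check.
print(rp.can_fetch("AnyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("AnyBot", "https://example.com/index.html"))    # True
```

So a crawler can download robots.txt on every visit, show up in the access logs, and still never consult the result.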
Re: (Score:2)
They might ask for robots.txt to find the interesting stuff you're trying to hide; it would be trivial for you to test that instead of just noting that they request it. Who knows? Maybe I could test it myself on my own web servers, write a clickbait article, then profit!
Re: (Score:2)
Why would they change this? Appearing in Google search results is a privilege, not a right.
And that's why Google needs to be regulated as a monopoly.
Re: (Score:2)
It cuts both ways. If Google didn't have their content the users would go elsewhere for it. The value is mutual.
Re: (Score:2)
Google doesn't have their content.
Re: (Score:2)
With corruption and classism, it's inevitable that the greedy will cross every boundary, because greed is insatiable.
Classism is the real problem; it creates the corruption that is destroying our society.
Just a contract (Score:5, Insightful)
robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact, all companies) are in it for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.
Re:Just a contract (Score:5, Informative)
A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page.
https://developers.google.com/... [google.com]
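As a sketch of the distinction Google is drawing (the path is made up), a Disallow line only asks crawlers not to fetch:

```
# robots.txt -- advisory crawl control only
User-agent: *
Disallow: /drafts/
```

Keeping an already-crawlable page out of the index takes a separate signal, either a `noindex` robots meta tag (`<meta name="robots" content="noindex">`) on the page or password protection in front of it.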
Re: (Score:2)
Yes, this was decided after the California DMV used their robots.txt to keep their site from being indexed.
Google's reasoning was that it would be a pretty bad search engine if it couldn't index the DMV site.
Re: (Score:2)
robots.txt is just a good-will contract between the client and a web server. Since AI companies (in fact, all companies) are in it for profit and hardly ever show any good will, why would you expect them to abide by the rules outlined in robots.txt? If you want access control over your content, implement actual access control.
robots.txt is just a passive request, really. And the AI companies are apparently declining that request.
Personally, I think they should respect it, but yeah, if you are relying on everybody respecting robots.txt, then you are smoking something ...
Re: (Score:2)
Accessing content made publicly available and learning from it is not a violation of anything. So you can multiply however you like. It's still multiplication by zero, so it equals zero.
Re: (Score:2)
If the content they scrape has no value then why are they scraping it?
Re: (Score:2)
>If the content they scrape has no value
Who made this claim, and why are you replying to me with this stupid assertion?
Re: (Score:1)
I knock down strawmen here almost every day, thanks. I take pleasure in it.
Re: (Score:2)
robots.txt is just a good-will contract between the client and a web server.
robots.txt is, in effect, a machine-readable/parsable copyright notice. So if an AI company scrapes pages in contravention of what is said in a robots.txt, it should be liable to be sued for breach of copyright. They cannot claim that because they did not read it they may ignore it; imagine what a judge would say if you reproduced a book or piece of music but said you never read the book's or CD's copyright notice.
But AI companies are rich with expensive lawyers and will fight test
Re: (Score:2)
A copyright notice tells who's got the copyright. The robots.txt doesn't talk about copyrights or even author names at all.
My access control (Score:2)
A copyright, a license agreement, and an attorney. Not that that is practical for an individual, with the exception of those in a jurisdiction with a small claims court. But even a moderately sized business should be able to enforce its rights in a civil court.
So be it (Score:2)
One way to protest this is to add instructions or other content for only the AI on your site. This could be used for commercial purposes too, if the return message can be modified.
selfishness (Score:2)
Re: (Score:2)
Exactly. Well, supposedly a "good" company like Google was 20 years ago (ah, for the days of Do No Evil...) I would expect them to honor it but anyone with a "bad" intent would probably use it as a shopping list of where to start indexing.
Otherwise, it needs to be dealt with in config or code (require authentication of some kind, only allow internal LAN/VPN connections, etc)
Killing the deal (Score:4, Interesting)
The deal was, you get to scrape the web, show your ads on your search results pages, and in return the web sites get visitors. If you scrape the web and "summarize" the content and nobody ever visits the web site, you don't uphold your end of the deal and the deal will end. And if copyright legislation doesn't come down on that behavior, the open web will cease to be. Nobody will get access to anything that isn't advertising or propaganda in itself without signing a contract that excludes any non-personal use of the content, even summarizing and other "fair uses".
Re:Killing the deal (Score:4, Insightful)
That worked...for a while. The problem is the ad networks, being the capitalists they are, took the "neutral" approach of "whatever you pay for". This resulted in legitimate businesses being used for the illicit distribution of malware.
The next problem is that ad networks did nothing. They knew they were serving malicious ads; they knew they were selling to bad actors; but they knew they had legal protection and continued to willingly sell malicious adspace under the guise of "we're too big to check".
So now come ad-blockers. It was one thing when they were just annoying; but it's another when there's actual risk of getting hacked. It didn't go over well when the local newspaper infected 2500 local readers from a bad ad. Did they blame the ad? The paper did. Know who the readers blamed? They blamed the newspaper. "You should have taken more responsibility," is what they screamed as they were canceling subscriptions. The same for a local TV station's website when their ad network was serving malicious ads. They could point the finger all they wanted...but everyone was pointing it at the station.
That's the other problem; no one places the blame where it should be placed. Rather than blame the ad networks with no morals, they blame the website operators.
So now we have the ad-blocker wars; and to combat that...more anti-adblock stuff.
The fact is...ad revenue isn't enough anymore. The lack of privacy laws and no oversight on any of this has meant the biggest export is American user data; sold by American companies, to the highest bidder. They don't care about us...we're just a product to profit off of.
Re: (Score:3)
Facebook has a version of that problem now. Every single "Sponsored" article leads straight to malware. Reporting it to Facebook as fraud, which these pages are, gets a response of "we see no violation of our guidelines".
Re: (Score:2)
Yeah...when they're being paid to display it it's never a guideline and there's no concern for users.
I'm waiting for someone to finally get a judge to apply the elimination of those rules and start holding them responsible in civil court. They don't have immunity in civil court anymore over that; it was killed so they could arrest the Backpage guys.
Re: Killing the deal (Score:2)
Re: (Score:2)
If I pay $1
Re: (Score:3)
The websites are a proxy for the malware. They signed a contract with someone serving malware to their innocent users. They are responsible for delivering malware. If they had used a legit ad company that filters out shitty malware ads the users would not have been impacted.
Here's who gets to blame who:
Web site users get to blame the web site
Web site has to take that responsibility for infecting their users while also getting to blame the ad network
The ad network has to take that responsibility while bla
Re: (Score:3)
I never said ad based content models worked or are a good thing. Only pointing out that the content distributors who use shitty ad networks are fully responsible for the malware they deliver to their visitors. Someone said the readers unfairly blamed the online newspapers. Not so. The readers appropriately blamed the newspapers for delivering malware to their browsers.
Re: (Score:1)
When was that the deal, outside of your imagination? Even Google itself clearly states in writing that this is not the deal, in the opening lines of its description of what robots.txt does:
https://developers.google.com/... [google.com]
"A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google. To keep a web page out of Google, block indexing with noindex or password-protect the page."
Re: (Score:3)
Read it once again ;)
The comment isn't about google and robots.txt, it is about "AI" outfits scraping content and then offering it as their own "AI" creation.
Or at least it appears so to me.
Re: (Score:2)
The complaint raised in the topic is specifically about robots.txt:
>Multiple AI Companies Ignore Robots.Txt Files, Scrape Web Content, Says Licensing Firm
Re: (Score:2)
The robots exclusion standard (aka robots.txt) is a red herring [wikipedia.org]. It is a gentlemen's agreement [wikipedia.org], originally intended to indicate to crawlers which parts of a web site are unsuitable for them, where crawling would produce nothing but useless burden for the server and the crawler alike. For example, you wouldn't want a crawler to keep requesting URLs from a procedurally generated infinite tree of documents. Whether it has expanded beyond that and can be legally binding under certain circumstances is left as an
Re: (Score:2)
>an implicit understanding
Isn't the entire argument here that there is in fact no such thing because content producers specifically assert that such agreement doesn't exist and AI companies and google are doing things they don't want google to do, but that they have no desire to actually prevent by gating their content from the public?
Re: (Score:2)
We are in a time of transition. Obviously most web sites do not want to drop out of Google right now, because that still is a sizeable amount of their traffic. However, once people understand "googling" to mean asking a chatbot and getting answers not from the web sites but immediately from the bot, the traffic will dry up and exposing the content to Google and other AI companies won't benefit web sites anymore, and that's when web sites will not only drop from the search results pages, that nobody looks at
Re: (Score:2)
I strongly disagree on emotionally loaded, objectively and factually incorrect wording such as "content being stolen" when all that's being done is learning from data that certain people chose to share with the public.
But the rest of the analysis is mostly in line with my thinking.
So I suspect the sole point of difference we have is that you see it as a moral good to call any person learning from the content of another person a thief (someone who steals), as you do in the post above. Whereas I look at the entire huma
Re: (Score:2)
You're missing the point, because it doesn't matter at all whether you think "stolen" is the right word. Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that. If you agree to a contract that forbids you from training an AI model (and all other uses that aren't purely personal), because you won't be able to access anything w
Re: (Score:2)
Emperor has no clothes, and public has every right to see that emperor has no clothes, and learn whatever lessons it chooses to learn from it.
Only a tyrannical emperor makes people avert their eyes and block learning from that public display he himself chose to put on.
Once the emperor chose to put himself up on display in whatever way he chose to do it, no further permits from the emperor are necessary for looking at him, and learning from his visage. And any demands for such permits to be required are so e
Re: (Score:2)
Greedy idiots ruining things for everyone by shirking conventions is the oldest tale in the book.
Re: (Score:2)
Indeed. They should stop pretending that they have a right to dictate whether people can look at and learn from content they themselves made public.
Re: (Score:2)
You're an LLM, aren't you?
Re: (Score:2)
It is the current popular way to run away from argument you lost among the terminally online, isn't it?
Before it was "you're a bot", and before that "you're a nazi". My reaction remains the same.
Run away little girl, run away!
Re: (Score:2)
It's a plausible explanation why you keep forgetting all context. You may just be an idiot though.
Re: (Score:2)
Projection on your part is very real.
Re: (Score:1)
>Fact of the matter is that content producers do not agree to that kind of use of their content without getting anything in return, and as soon as they no longer benefit from making content accessible to crawlers, they will stop doing that.
Then: Don't. Put. Shit. On. The. Free. Web.
Period.
It's not rocket surgery here.
You aren't allowed to make a piece of artwork, put it publicly on a billboard in the middle of town, and then say Bob, Jerry, and Sue aren't allowed to look at it, but everyone else in the w
Re: (Score:2)
Let's ignore for a moment that you completely ignored the rest of the discussion and naturally missed the point, that letting AI companies get away with delivering content that others created will result in the end of the openly accessible web for everyone, not just these companies, because these parasites and their shills don't take no for an answer. But even with that caveat, you're still wrong. The web isn't a billboard that anyone can look at. It's servers delivering content to individual clients, and y
Re: (Score:2)
No you can't decide to host on a non-login site and pick and choose who can view the content. That's the whole damn reason some sites hide forums and user content behind logins.
Well, you CAN, if you know the particular IP addresses you won't respond to. Or you can, I don't know, Implement a login scheme with terms of service.
The "problem" with a login scheme is.... you don't get a free lunch. You don't even get random visitors to ATTEMPT to try to serve ads to. Most of them will see your jank assed attempt
Re: (Score:2)
The reason I wrote "deal" and not "contract" is that it's a convention or mutually implied understanding, not a piece of paper or an oral agreement sealed with a handshake or somesuch. If the wording confuses you, call it a balance of benefits. Web authors let crawlers access their sites. They do not have to do that. The balance of benefits is being shaken up by crawlers that train AI models. These AI companies then effectively provide the content without anyone having to visit the site where it originated.
Re: Killing the deal (Score:2)
The UN Human Rights Council might disagree with you.
virtually invisible on the web (Score:2)
Re: (Score:2)
Maybe you and your site are the AI.
Copyright Infringement (Score:2)
Take a page out of Nintendo's book; you lawyer up and file a C&D, DMCA, and everything you can for every page they scrape. If they are ignoring your intellectual property rights and policies, then it's unauthorized access. I mean, the FBI literally just listed SABnzbd as a pirate site... clearly the standard for infringement is low if the FBI isn't even making correct arguments in court.
Re: (Score:2)
Large sites like Youtube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.
Right now small community sites don't have the expertise or the manpower to manually check access logs and trace where the spiders are coming from, or find the contact details to send C&D and DMCA claims. So they do nothing, and that's
Re: (Score:2)
What the community needs is a way to easily and automatically do the C&D + DMCA submissions.
Large sites like Youtube have a process in place, and there's software to automatically scan videos for copyrighted music snippets that can be submitted as DMCA violations.
No way that will be abused by Big Corp lawyers and trolls, right?
Oh wait! It already happens with some DMCA requests made to Youtube by various outsourced "rights management" firms and trolls.
https://www.eff.org/deeplinks/... [eff.org] https://www.businessinsider.co... [businessinsider.com] https://arstechnica.com/tech-p... [arstechnica.com] https://www.thefader.com/2022/... [thefader.com] https://sirtaptap.com/articles... [sirtaptap.com]
Re: (Score:2)
No court has yet ruled on whether scraped data run through AI training programs is a copyright violation or a transformative use.
Does anyone follow robots.txt orders? (Score:2)
It's extra work to pull and read those files, and it slows the search engine even if they're ignored. It's far simpler to just ignore them and build up your "metrics" for the amount of material you've scanned, even when the robots.txt warns you that the content isn't reliable or even stable.
Re: (Score:1)
Like I said above, it gets hits on my webserver. Not everyone ignores them.
Re: (Score:2)
As a list of target URLs to scrape?
Do the files your robots.txt protect ever get grabbed?
Re: (Score:2)
At some point sufficient incompetence is a form of malice.
This stuff is old and anyone capable of writing a web crawler will know about robots.txt.
Re: (Score:1)
Keep in mind that the primary purpose of robots.txt is to provide a list of primary URLs to crawl, as a shortcut for the crawler to get to the stuff that's relevant to index. Yes, it can also be used to advise on what not to fetch, but ostensibly it's in an ethical web spider's best interests to parse and obey this file, as it will save them time and omit unnecessary chaff from the indexed data.
Re: (Score:1)
(Full disclosure: there's actually nothing on my website but the robots.txt and the index.html, so I haven't actually tested spider obedience figures. I just know it's actually being downloaded, and frequently by stuff with words like "crawler" or "spider" in the user-agent string.)
Re: (Score:2)
It's a mixed bag. I've seen some well behaved and others dive right in to fake test directories that are just there to see who is bad.
Occam's Razor (Score:2)
Occam's razor would suggest that these companies simply never thought to look for or use robots.txt. It is designed to inform web crawlers what to index for search engines, and I feel there's a good chance these companies never thought to leverage it, or didn't feel it was applicable to what they were doing. They should have, of course, but I feel there is some wiggle room there to give them the benefit of the doubt in this case.
Not to mention at the end of the day, this is a text file anyone can ignore and
Yawn (Score:5, Interesting)
Remember back in the late 2000s when companies were all about "Reinventing Search" (of the WWW)? It turned out most of them were trying to get juicier results than Google by ignoring robots.txt so they were not actually better and did irritate a lot of people when they ended up recursing indefinitely down programmatically-generated websites whose robots.txt specifically said "don't go here".
It's not news that ignoring robots.txt gets you access to more content on the web. It's also not news that this is usually not going to get you any better content.
Yet another bunch of tech bros has decided they can succeed by ignoring all of the rules, laws, social conventions, and lessons of the past because they're the superior, innovative people. Instead they will just burn money until they run out, then go and start another company and get some more money without ever generating anything useful or profitable.
Re: (Score:2)
It's not news that ignoring robots.txt gets you access to more content on the web.
It's also not news that this is usually not going to get you any better content.
Even Google ignores robots.txt. They made that decision after the California DMV (Department of Motor Vehicles) blocked them with their robots.txt.
And frankly, I can't blame Google.
If you don't want your content to be accessed by everyone, don't put it up on the public internet.
Badly written bots are a separate issue.
technical suggestion (Score:3)
Why would you expect a technical suggestion to work?
It's almost as is (Score:4, Insightful)
Gigantic quasi-monopolies don't respect anyone or any laws, or bother to behave with any sort of decency anymore, since they made themselves untouchable and it does nothing for their shareholders anyway.
I don't think they even bother to pretend to show restraint anymore. Like with the AI stuff infringing copyright on an unprecedented scale: they basically just went "Yeah, that's how it goes now. You can't stop us. Suck it up." It's quite staggering.
Re: (Score:2)
Explain how AI infringes copyright?
Re: (Score:2)
That's one hell of a rock you must have been living under...
Re: (Score:2)
I don't think they even bother to pretend to show restraint anymore.
I think Microsoft's Recall feature demonstrates your point very clearly. No shame at all.
Poison the results. (Score:2)
If an AI is crawling the site, create one page that contains purely random text with a selection of random links, and have the page reachable via any arbitrary URL pointing to any imaginable, purely illusory subdomain.
The AI will harvest however many pages it is set to (possibly all of them), each page diluting and corrupting the AI's neural net.
The AI developers don't give a damn about quality, only the illusion of quality, so will never actually stop and look. But a large enough phantom site should seriou
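A minimal sketch of such a tarpit page generator in Python (all names are hypothetical, and whether this meaningfully degrades a modern model, whose training pipeline filters out low-quality text, is doubtful):

```python
import random
import string

def random_word(rng):
    # A short run of random lowercase letters.
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 10)))

def poison_page(rng=None, words=200, links=10):
    """Build an HTML page of gibberish text plus links to equally
    random URLs, so a crawler that ignores robots.txt can wander
    forever through machine-generated noise."""
    rng = rng or random.Random()
    text = " ".join(random_word(rng) for _ in range(words))
    anchors = "".join(
        f'<a href="/{random_word(rng)}/{random_word(rng)}">{random_word(rng)}</a>\n'
        for _ in range(links)
    )
    return f"<html><body><p>{text}</p>{anchors}</body></html>"
```

Wired to a catch-all route on the web server, every made-up URL resolves to a fresh page of noise, and the links keep the crawler recursing.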
How to fix (Score:3)
You can use fail2ban to block rude web scrapers. Put a hidden link into your web pages that people would not see, but bots would. Include that link in robots.txt. When anyone hits that link, fail2ban will automatically block them based on the rule you implement.
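A rough sketch of that setup, assuming an Nginx/Apache-style access log and a trap path of /bot-trap/ (both names are made up):

```
# /etc/fail2ban/filter.d/bot-trap.conf
[Definition]
failregex = ^<HOST> .*"GET /bot-trap/

# /etc/fail2ban/jail.d/bot-trap.local
[bot-trap]
enabled  = true
port     = http,https
filter   = bot-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

The hidden link (e.g. `<a href="/bot-trap/" style="display:none">`) goes in your pages and `Disallow: /bot-trap/` goes in robots.txt, so only crawlers that ignore robots.txt ever hit it and get banned on the first request.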
Re: (Score:2)
You beat me to the punch. I've done similar things in the past to trap bad bots. My next favorite tool was for use against email harvesters. I generated page after page of fake email addresses for them to collect. Same idea, though. Hidden link on a page, not in the robots.txt though, as the point wasn't to block them but to poison their well.
Linkedin, Microsoft and OpenAI (Score:2)
It's strange that, given Microsoft's involvement in both LinkedIn and OpenAI, Microsoft prevents OpenAI from accessing LinkedIn.
Oh, yes, please scrape my website (Score:2)
No, I won't sue. I'll file criminal charges for theft against the CEOs. They get to go to JAIL.