Can Robots.txt Files Really Stop AI Crawlers? (theverge.com) 97
In the high-stakes world of AI, "The fundamental agreement behind robots.txt [files], and the web as a whole — which for so long amounted to 'everybody just be cool' — may not be able to keep up..." argues the Verge:
For many publishers and platforms, having their data crawled for training data felt less like trading and more like stealing. "What we found pretty quickly with the AI companies," says Medium CEO Tony Stubblebine, "is not only was it not an exchange of value, we're getting nothing in return. Literally zero." When Stubblebine announced last fall that Medium would be blocking AI crawlers, he wrote that "AI companies have leached value from writers in order to spam Internet readers."
Over the last year, a large chunk of the media industry has echoed Stubblebine's sentiment. "We do not believe the current 'scraping' of BBC data without our permission in order to train Gen AI models is in the public interest," BBC director of nations Rhodri Talfan Davies wrote last fall, announcing that the BBC would also be blocking OpenAI's crawler. The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI's models "were built by copying and using millions of The Times's copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more." A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.
It's not just publishers, either. Amazon, Facebook, Pinterest, WikiHow, WebMD, and many other platforms explicitly block GPTBot from accessing some or all of their websites.
On most of these robots.txt pages, OpenAI's GPTBot is the only crawler explicitly and completely disallowed. But there are plenty of other AI-specific bots beginning to crawl the web, like Anthropic's anthropic-ai and Google's new Google-Extended. According to a study from last fall by Originality.AI, 306 of the top 1,000 sites on the web blocked GPTBot, but only 85 blocked Google-Extended and 28 blocked anthropic-ai. There are also crawlers used for both web search and AI. CCBot, which is run by the organization Common Crawl, scours the web for search engine purposes, but its data is also used by OpenAI, Google, and others to train their models. Microsoft's Bingbot is both a search crawler and an AI crawler. And those are just the crawlers that identify themselves — many others attempt to operate in relative secrecy, making it hard to stop or even find them in a sea of other web traffic.
For any sufficiently popular website, finding a sneaky crawler is needle-in-haystack stuff.
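As a rough illustration of the blocking described above, the sketch below builds a robots.txt that disallows the crawler names mentioned in the article and checks it with Python's standard `urllib.robotparser`. The file contents, the `is_allowed` helper and the example.com URL are assumptions made for illustration, not any publisher's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow the AI crawlers named in the article,
# allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt asks user_agent not to stay away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/article"))
```

Note that this only reports what the file requests. A crawler that never reads robots.txt, or reads it and ignores it, is not affected in any way.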
In addition, the article points out, a robots.txt file "is not a legal document — and 30 years after its creation, it still relies on the good will of all parties involved.
"Disallowing a bot on your robots.txt page is like putting up a 'No Girls Allowed' sign on your treehouse — it sends a message, but it's not going to stand up in court."
Velvet rope (Score:5, Insightful)
Re:Velvet rope (Score:5, Informative)
What a stupid question anyway, robots.txt has never stopped anybody or anything from accessing a site, AI or not.
Re: (Score:1)
Hey it’s technically illegal under the very broad CFAA. I know it’s been revised so it’s probably been fixed but even unpleasant conversation could have technically fallen under the CFAA at one time.
Re: (Score:2)
Asking someone to not cross the velvet rope is not a viable theft-deterrent system.
Meh. Who'd steal a velvet rope? There's not a huge "secondhand" market for those!
That's not the point (Score:2)
Just let them crawl (Score:5, Insightful)
The thing about more modern LLMs is that as they continue crawling they start feeding on their own generated content, which is the equivalent of marrying your cousin; in a few generations all LLMs will resemble the famous painting of Charles II. And it's already visible: GPT4 is much worse at coming up with novel ideas and goes much more towards hallucinations and being outright wrong because it is being fed the answer someone reposted from GPT3.
Re: (Score:3)
Sounds like a doom loop. AIs will increasingly suck up their own content recursively. Eventually they will drift off into la-la land and become irrelevant.
Re: (Score:3)
So you're saying the solution isn't robots.txt, but a website that deliberately returns AI-generated crap for every page visited by the AI crawlers?
There's something I rather like about this idea... My ollama install is sufficiently slow that it'll have the added side-effect of slowing down and using extra resources in the crawlers too.
Re: (Score:1)
You can just train with internet data up to a certain cutoff date, pre-AI.
Re: (Score:2)
> The thing about more modern LLMs is that as they continue crawling they start feeding on their own generated content, which is the equivalent of marrying your cousin; in a few generations all LLMs will resemble the famous painting of Charles II. And it's already visible: GPT4 is much worse at coming up with novel ideas and goes much more towards hallucinations and being outright wrong because it is being fed the answer someone reposted from GPT3.
Do you have any evidence to support your claim regarding GPT4?
Re: (Score:2)
I have not looked into this so just shooting from the hip but it seems logical that the more LLM bots are fed from information generated by other LLM bots the more they will drift from reality.
Like the children's game of "telephone".
Re: (Score:1)
The fact that GitHub CoPilot and OpenAI Chat are more and more confidently wrong about even the simplest things whereas just a year ago they were providing at least somewhat decent solutions.
Re: (Score:2)
I wonder if you could poison the training data. Include some invisible text on the site, that only the LLM will read.
Re: (Score:1)
Maybe we need 'Cyber Tresspas' to be a thing (Score:3)
When you enter private property, there are certain commonly accepted rules, plus whatever rules the property owner posts or uses commonly accepted other indicators to advertise.
You can walk up a driveway and ring someone's doorbell, but only if there is no fence and gate blocking your way (no matter how ineffective as physical security) and no signage indicating you are not welcome. Put up a sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.
If I expose a server to the Internet and post a rule in a commonly accepted place like 'robots.txt', violating that rule ought to be considered an act of criminal trespass... and any data downloaded during that act should be considered data theft.
Re: (Score:2)
> Put up sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.
That's... stupid.
I'm not saying you're legally wrong, just that I consider the doorbell not accessible without authorization stupid.
Re: (Score:2)
Even stupider is going onto someone's property when they have clearly said you are not welcome.
You have already lost the game. (Score:2)
Re: (Score:2, Insightful)
That has to be the stupidest response I've read on Slashdot recently.
You're using the Internet, which requires exposing servers to the Internet. If you think connecting computers to the Internet is 'losing the game' perhaps you should disconnect yourself for a while and see how great your Internet experience is without the Internet.
Re: (Score:3)
The only way to mitigate copyrighted material being used without authorization is to litigate. Once again, unless you have an army of lawyers, expect that copyrighted material not to be used as you intended.
Re:Maybe we need 'Cyber Tresspas' to be a thing (Score:4, Informative)
> Put up a sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.
> If I expose a server to the Internet and post a rule in a commonly accepted place like 'robots.txt', violating that rule ought to be considered an act of criminal trespass
I agree on the second part but I've done research on the trespassing rules, and it's very complicated. Here are a few of the things I learned.
1. A no soliciting sign on your door has no legal power. There has to be a clear line a person has to cross before they get to the door.
2. You can't shoot at children who cross into your property, even if they're trespassing.
3. Courts tend to treat back and side yards differently from front yards.
4. In a life threatening emergency your property rights are generally nil.
5. There are a lot of people who can come onto your property even if no trespassing signs are clearly posted.
+-- A. The postman can walk past no soliciting signs to come to your door.
+-- B. Free speech has been interpreted to mean that evangelists and politicians can ignore no soliciting signs.
+-- C. A police officer can knock on your back window in the middle of the night.
+-- D. The electric company can go pretty much anywhere they want.
Re: (Score:2)
I once lived on a property where there was an easement through my backyard as that's where the local main sewer line was buried. At any time, the municipal government had full authority to destroy my fence to drive construction equipment in and excavate the back half of my yard. Never happened, but I didn't care for the possibility.
But yes, your property has a variety of rules covering access. Thankfully, I live in Canada where free speech does NOT grant ridiculous amounts of leeway to con artists to pe
Re: (Score:2)
I have an easement as well but it is literally way out front next to the sidewalk on a space between me and my neighbor. They can get to it without damaging anything while standing in the sidewalk. It's kinda weird your house was put in front like that.
Re: (Score:2)
It's not that weird. I have the same situation as the GP; underground power lines run through the back of my yard, (and everyone else's on the block,) so there's an easement. The electric company can come in and tear my back fence and yard out if necessary. It'll probably never happen, but still.
On top of that, like you, I have an easement through my front yard, near the sidewalk, for a cable line and (I believe) a gas line as well.
Re: Maybe we need 'Cyber Tresspas' to be a thing (Score:2)
> 1. A no soliciting sign on your door has no legal power. There has to be a clear line a person has to cross before they get to the door.
In one word? False. Though it depends on local laws. In Phoenix, solicitors are required to obey such signs. They can walk up to your door, but the moment they try to solicit, as a matter of law it's the same as if they jumped over your fence into your back yard. Bonus: If you're in a gated community, the moment they pass through an open gate no matter who opened it, they're already trespassing.
I've had them say some shit like "I have a peddler's permit that lets me ignore that stuff" which is a complete li
Re: Maybe we need 'Cyber Tresspas' to be a thing (Score:2)
Oh, and in Phoenix, missionaries are solicitors under the law.
Re: (Score:2)
Re: (Score:2)
Tell Palo Alto to refund my parking ticket fine in a shopping mall lot. :-)
Re: (Score:2)
Re: (Score:2)
Which is why I specified Palo Alto. I got it. Was just funning.
Re: (Score:2)
My understanding is that if you do that here, the owner can just call a tow company, who will be more than happy to pick it up within mere minutes of the call, as it's a very lucrative business. No police necessary, though I don't know the exact rules around it here and I haven't yet seen it happen to anybody. In Phoenix, they just have to have a sign posted basically saying "tow away zone" whilst citing the ARS code that they'll have you towed under. From there the property owner doesn't hold any liability or anythin
Re: (Score:2)
Re: (Score:2)
> 2. You can't shoot at children who cross into your property, even if they're trespassing.
Whoa! What if I'm a really bad shot? Can I shoot them in that case?
Re: (Score:2)
Re: (Score:2)
Ok but how is a special rule/law for not shooting kids any different than the general concept of not shooting random people who aren't a threat?
I don't think it suddenly becomes ok to shoot people when they hit 18. If there's any place where that is true, please let me know where so I can avoid.
Re: (Score:2)
Re: (Score:2)
Oh ok, I thought that's what you meant. Never mind, all good.
Re:Maybe we need 'Cyber Tresspas' to be a thing (Score:4, Informative)
This is probably American bollocks talking, but here in the civilized world, it's certainly bullshit. Generally, postal workers - in the execution of their duties - have a reasonable right of entry onto land to perform those duties. Representatives of the owners of equipment (e.g. the gas, water and electricity meters supplying your premises - they're never sold or leased ; they're always the property of the utility company) have the right of access to their property for inspection. In particular, they have the right to access and inspect it to ascertain that the meter hasn't been bypassed. Various officers of the court have the right of access to property to serve papers, and perform various other duties instructed by the court.
American wet dreams about "my home is my castle" are not applicable outside America - and probably aren't applicable inside America either - watching video of police raids on lunatic isolationists/survivalists is popular entertainment here (with a prayer of thanks to Cthulhu that we're not on the same continent as these nutters).
Sorry Canada and Mexico.
Re: (Score:2)
This isn't someone's private home. You're inviting people into your public showroom by having a website. This is like a shopkeeper posting a sign that says "no robots allowed," but if you don't have a bouncer (i.e. a WAF), you really can't do anything about it. The idea of charging an unmanned, suspected, potentially foreign bot for "trespass" doesn't translate here.
Re: (Score:2)
> violating that rule ought to be considered an act of criminal trespass... and any data downloaded during that act should be considered data theft.
Or you could, you know, not send the data they asked for in the first place.....
This just screams entitlement like the various news agencies. Yes I want you to index me. No I don't want you to display said index of me without paying for it!!!!!
Not to mention, it's fairly easy to detect a web crawler. Just drop some hidden honeypot links that aren't visible to normal users. When the bot crawls too many of them, just block their access. You could also use ads. If the advert isn't displayed, don't send th
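A minimal sketch of the hidden-link honeypot idea, assuming the trap URLs are linked from pages but hidden from human visitors. The trap paths, the threshold of two hits and the `HoneypotTracker` name are all hypothetical; the comment does not say what "too many" means.

```python
from collections import Counter

# Hypothetical trap URLs: linked from pages but hidden from human visitors
# (for example with CSS), so mostly automated link-followers request them.
HONEYPOT_PATHS = {"/trap/a", "/trap/b", "/trap/c"}
MAX_TRAP_HITS = 2  # assumed threshold, not taken from the comment


class HoneypotTracker:
    """Counts trap-link hits per client and blocks clients over the limit."""

    def __init__(self, trap_paths=HONEYPOT_PATHS, max_hits=MAX_TRAP_HITS):
        self.trap_paths = set(trap_paths)
        self.max_hits = max_hits
        self.hits = Counter()
        self.blocked = set()

    def allow(self, client_id, path):
        """Return True if the request should be served."""
        if client_id in self.blocked:
            return False
        if path in self.trap_paths:
            self.hits[client_id] += 1
            if self.hits[client_id] > self.max_hits:
                self.blocked.add(client_id)
                return False
        return True
```

A threshold above one gives some slack to a human who stumbles onto a hidden link, at the cost of letting a crawler fetch a few more pages before it is cut off.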
so use adblock and it's a felony? or EULA bs? (Score:2)
so use adblock and it's a felony? or EULA bs?
Re: (Score:2)
Maybe 'use adblock against same-site ads is cyber trespassing' is something we should accept. Third party's too risky to force on people.
I think if a site expects you to accept ads (third-party hosted or not), it should be liable for any damage caused by malicious code served up. If Slashdot connects me to malware, I should have a fairly easy legal remedy to charge it for any damages related to data theft or even simply the time or money put into cleaning my system.
I suspect that if consumers had rights a
Re: (Score:2)
How about, you can't have it both ways!
If you don't want your data to be public, don't make it publicly available.
If you make it publicly available, crawlers (AI or not) will find it.
Seems OK to me!
Re: (Score:1)
Public is public (Score:2, Insightful)
If you post something in public, it means that anybody or any bot can see it
Want security and control? Don't post in public
Re: (Score:2)
Re: (Score:1)
You're talking about things that are covered by copyright law. AI training is not. That's why the companies are mad.
Re: (Score:2)
> You're talking about things that are covered by copyright law. AI training is not.
This is nonsense.
Sure, the actual training of the AI is not covered by copyright law. But much of its training material is very much covered by copyright law. And given the current long copyright terms, I suspect only a tiny fraction of publicly available sources are in the public domain.
About as well as ... (Score:2)
Microsoft and Google can force acceptance (Score:1)
There is an alternative (Score:3)
The alternative is to Clickwrap your website with a No Bots agreement, and require User action to enter.
There are also things you can put in front of a web server that dynamically rewrite all the page links to be session-specific Link codes. Then you ban the session if there are too many requests in a short period of time.
Since the webpage links are session-specific; if they try to evade using a new IP address, they have to start all over. Because the links to some random GUID don't even provide the browser any way to uniquely identify each link from session to session.
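The session-specific link scheme described above could be sketched roughly as follows. The `Session` class, the limit of five requests per ten seconds and the use of `uuid4` for link codes are illustrative assumptions, not a description of any particular product.

```python
import time
import uuid
from collections import deque

MAX_REQUESTS = 5        # assumed limit per window
WINDOW_SECONDS = 10.0   # assumed window length


class Session:
    """One visitor session with its own random link codes and a rate limit."""

    def __init__(self, real_paths):
        # Every real path gets a fresh random code valid only in this session.
        self.code_for = {path: uuid.uuid4().hex for path in real_paths}
        self.path_for = {code: path for path, code in self.code_for.items()}
        self.recent = deque()  # timestamps of recent requests
        self.banned = False

    def resolve(self, code, now=None):
        """Return the real path for a link code, or None if banned or unknown."""
        now = time.monotonic() if now is None else now
        if self.banned:
            return None
        # Forget requests that fell out of the sliding window.
        while self.recent and now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        self.recent.append(now)
        if len(self.recent) > MAX_REQUESTS:
            self.banned = True
            return None
        return self.path_for.get(code)
```

Because the codes are random per session, a code harvested in one session resolves to nothing in another, which is the "start all over" property the comment relies on.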
"Disallowing a bot on your robots.txt page is like putting up a 'No Girls Allowed' sign
True. But actually the sign on a treehouse might have some weight depending on how the trespassing law is written.
There's little precedent for the robots.txt
Humans also train on this material.. (Score:1)
Re: Humans also train on this material.. (Score:1)
If you read a book, the probability of you using it to write anything remotely similar is next to zero, based on the number of writers relative to the whole population. If AI reads it, the probability of it using it is 1.0. It is reading it with an explicit goal of using the data.
Re: (Score:2)
> If ai reads it, the probability of it using it is 1.0.
Not really. The AI is using the data to strengthen relationships between ideas. If the only thing the training strengthens from reading your site is "dogs wag their tails when happy" or "climate change is increasing" then not much is retained. If your site is the only one talking about the breakthrough in perovskite photovoltaic efficiency, then yes, your data is probably going to be a singular node and regurgitated more directly.
LOL NO (Score:4, Informative)
Re: (Score:2)
I've been making some progress with GDPR. If they scrape the web then they have to respond to SARs because they might have scraped your personal data.
Re: (Score:2)
> because they might have scraped your personal data.
If personal data is available to any automated system scraping the web then the GDPR compliance problem is your website, not the crawler.
Re: (Score:2)
GDPR doesn't work like that. Just because you found some data in a public place, doesn't mean you can take it, or process it, or refuse SARs.
Re: (Score:2)
>"Robots.txt is only going to stop a scraper that chooses to abide by it."
Correct. I have seen systems (bots) actively scraping my site, completely ignoring my robots.txt
>"It's not mandatory and there are no laws enforcing it being followed."
Even if there were laws, they would probably still be ignored. And/or enforcement is pretty much impossible. Case in point- the ridiculous concept of "gun-free zones." The only ones who respect it are the law-abiding, who were never the problem. And then you
Re: (Score:2)
open season (Score:2)
The Larger Questions (Score:2)
The larger questions are how much we can "trust" AI and how that affects current laws.
No one is proposing a technical solution to that genie now that it's out of the bottle but I'm sure that some kind of digital Robocop, fraught with its own problems, is coming soon. "That's Tron. He fights for the users."
robots.txt != copyright infringement (Score:2)
That's two different things. Clearly. (...Like push and pull logic, client-server, whatever.)
Firewall+robots.txt = very useful (Score:5, Interesting)
Put a block in your robots.txt for a path that doesn't exist.
Anyone who tries to hit that path is obviously abusing your robots.txt.
Drop their IP in your firewall.
Yes they can change their IP and likely use many IPs but they're not getting very far in most cases.
I consider robots.txt abusers to be in the same class of scum as spammers and Nigerian prince scammers and crypto boys. Automation won't stop them all but it can dramatically reduce their impact.
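The steps above can be sketched as a robots.txt that disallows a path existing nowhere on the site, plus a scan of the access log for anyone who requested it. The trap path, the `(ip, path)` log format and both function names are hypothetical, and the firewall step itself is left out because it depends on the firewall in use.

```python
TRAP_PATH = "/no-such-dir/"  # hypothetical path that is linked from nowhere


def robots_txt_with_trap(trap_path=TRAP_PATH):
    """Build a robots.txt whose only rule disallows the trap path."""
    return f"User-agent: *\nDisallow: {trap_path}\n"


def offenders(access_log, trap_path=TRAP_PATH):
    """Return the set of IPs that requested anything under the trap path.

    access_log is assumed to be an iterable of (ip, requested_path) pairs.
    """
    return {ip for ip, path in access_log if path.startswith(trap_path)}


if __name__ == "__main__":
    log = [
        ("192.0.2.10", "/index.html"),
        ("203.0.113.7", "/no-such-dir/"),
        ("203.0.113.7", "/no-such-dir/secret.html"),
        ("192.0.2.10", "/about.html"),
    ]
    print(robots_txt_with_trap())
    print(sorted(offenders(log)))  # candidates for the firewall drop list
```

Since nothing links to the trap path, the only way to learn it is to read robots.txt, so a request for it is a reasonably strong signal that the file is being used as a map rather than as a set of instructions.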
Re: (Score:2)
Your approach might work for crawlers that actually try to follow the paths in robots.txt. But I don't think that's how most crawlers work. They're actually following links in the home page itself. Reputable crawlers would then filter out links discovered in this way, based on the list in robots.txt. But if your page doesn't contain a link to your non-existent page, crawlers would have no reason to go there.
Re: (Score:2)
But the idea is to filter out crawlers that are not reputable and use the robots.txt as a list of places to go rather than avoid.
Filtering bad network bits of any sort is a series of filters. There is no one magic way to filter all spam or all sms crap or all bad web crawlers but the ones abusing robots.txt can certainly be stopped or at least dramatically slowed.
Re: (Score:2)
Or you could just make your private content...private, as in, don't allow it to be seen without authentication. Then you don't have to worry about crawlers following links in robots.txt.
Re: (Score:2)
"you could just make your private content...private"
Only some of the concern is about content that should have been made private. The greater concern is dumb crawlers pounding your servers repeatedly.
Re: (Score:2)
That concern--of dumb crawlers pounding servers--is not expressed in the article or the summary, as far as I can tell. And to the extent that it's a concern outside of the article's scope, the concern is no more serious for regular crawlers, than AI crawlers.
Re: (Score:2)
Because I don't want my honest customers and respectful crawlers to have increased friction because a minority of people are assholes abusing my robots.txt.
Re: (Score:2)
Welcome to the real world, which is full of assholes.
Stopping robots is just stupid. (Score:4, Insightful)
Stopping robots is just stupid. If you post stuff publicly (non-protected webpages), then your stuff is fair game.
If you don't want robots, protect the information with logins.
"but logins will reduce our viewership" - You can't have your cake and eat it too.
This is nothing new (Score:2)
Hell... here I am reading a news aggregation website that relies on the same principle.
When will we stop seeing AI as the bogeyman of the 21st century and realise that we've been doing this stuff all along. AI has just revealed
Even if they do respect robots.txt... (Score:2)
Even if they don't crawl, out of respect for robots.txt, the data will still get scraped anyway -- just from other sources. The syndicated reposters, the companies that just wholesale regurgitate shit for clicks and SEO.
Their content is wide-spread, they really can't robots.txt their way out of being involved in training.
Re: (Score:2)
Which is why they need to stop being so polite.
Just say "We consider using our content for training language models for commercial use not fair use, we will get around to suing you eventually, be warned." In fact, just put that as a comment in the robots.txt.
Oh but Search Engines Get a Pass? (Score:1)
Re: (Score:2)
No, it took a few lawsuits, the DMCA and EU copyright directives to give them those privileges.
Or to turn it around, OpenAI probably doesn't have the privilege unless government jumps up and gives it or the supreme court rules it fair use. Otherwise they will be exAI.
"No X allowed" might still be enforceable (Score:2)
A "No girls allowed" sign on a treehouse might, if the person putting the sign up had ownership rights, be enforceable against those who violate the sign as being trespassers though.
Robots.txt does serve as a kind of "no trespassing" sign, and it'd be interesting to see how it holds up in court in terms of serving as commonly recognized notice of limitations on permission to access a service, for automated systems. Quite often just putting someone on notice that they're not allowed to access something is su
trespassers like signs can't block screen readers (Score:2)
trespassers like signs can't block screen readers under the ADA law.
Not even the DMCA allows you to block them.
So if you had a screen reader that is coded not to read the robots file (which is not shown on screen when you visit), you could have an issue of trespassing, but in a way that the law says you can't enforce. A local cop may not jail or ticket someone for that, but a hacking charge is something you may need to fight out in court.
Re: trespassers like signs can't block screen read (Score:2)
automated system vs say an automated adblock for a (Score:2)
An automated system vs, say, an automated adblock in a live view: to be legally binding, the prohibition would need to be presented in a way that the live viewer actually sees it. Or do you want a visit to a website to = prison time?
Re: automated system vs say an automated adblock f (Score:2)
Robots.txt is the de facto standard, the industry norm, for limiting or denying automated systems access to servers.
And yes, if you go on a website and it has a notice you're only allowed to visit the website if you are a member of group X (eg, employees) then that can be legally binding too (albei
I suspect many will not like this take (Score:1)
Not going to stand up to court? (Score:1)
Wider issue (Score:2)
The wider issue is that the whole system of copyright/patents/"intellectual property" etc. was introduced in a world where a physical copy of some sort was required to transfer information (book, paper, tape, vinyl disc, optical disk, clay tablet, wax cylinder etc. etc.). We no longer live in that world.
We now live in a world where any information that is stored digitally can be immediately replicated any number of times. At will. So the whole system of copyright and "intellectual property" is no longer fi
Honeypot (Score:2)
In your robots.txt simply put a few lines like "Disallow: /foobar"
Then, whenever anything ever asks for your page /foobar, you block that IP for 12 hours
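A minimal sketch of the time-limited block, assuming the caller passes in the current time in seconds. The `TimedBlocklist` class is hypothetical; `/foobar` and the 12-hour duration come from the comment.

```python
BLOCK_SECONDS = 12 * 60 * 60  # 12 hours, as suggested in the comment


class TimedBlocklist:
    """Blocks an IP for a fixed period after it requests the trap path."""

    def __init__(self, trap_path="/foobar", block_seconds=BLOCK_SECONDS):
        self.trap_path = trap_path
        self.block_seconds = block_seconds
        self.blocked_until = {}  # ip -> time at which the block expires

    def allow(self, ip, path, now):
        """Return True if the request from ip at time `now` should be served."""
        until = self.blocked_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self.blocked_until[ip]  # block expired, forget it
        if path == self.trap_path:
            self.blocked_until[ip] = now + self.block_seconds
            return False
        return True
```

Letting the block expire keeps the list from growing forever and limits the damage when a shared or reassigned IP address gets caught.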
Captcha Re:Honeypot (Score:2)
Add a form with an "I am not a robot" check-box and marking buses, boats, store-fronts, street-lights... ;O)
You see, Robots.txt files (Score:1)
robots.txt has its uses, but... (Score:1)