Google Suggests Robots.txt File Updates for 'Emerging AI' Use Cases (blog.google) 58
For a "vibrant content ecosystem," Google's VP of Trust says web publishers need "choice and control over their content, and opportunities to derive value from participating in the web ecosystem." (Does this mean Google wants to buy the right to scrape your content?)
In a blog post, Google's VP of trust starts by saying that, unfortunately, "existing web publisher controls" like your robots.txt file (a community-developed web standard) date from nearly 30 years ago, "before new AI and research use cases..."

"We believe it's time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases. Today, we're kicking off a public discussion, inviting members of the web and AI communities to weigh in on approaches to complementary protocols. We'd like a broad range of voices from across web publishers, civil society, academia and more fields from around the world to join the discussion, and we will be convening those interested in participating over the coming months."
They're announcing an "AI web publisher controls" mailing list (which you can sign up for at the bottom of Google's blog post).
Am I missing something? It seems like this should be as easy as adding a syntax for opting in, like
AI-ok: *
Thanks to Slashdot reader terrorubic for sharing the article.
It already works, why change it (Score:5, Informative)
Shouldn't AI crawlers already follow the robots.txt? No need to change it; sites just need to define it correctly. This might cause other spiders to not index all their content, but that might be the price you need to pay.
Re:It already works, why change it (Score:5, Insightful)
Re: (Score:2)
Re:It already works, why change it (Score:4, Insightful)
Google wants AI scrape to be opt-out, because that will give them access rights to pretty much everything instantly.
Re: (Score:2)
Re: (Score:2)
Yup. Google's moving the goalposts, and hoping some folks don't hear that the move happened. Big business being big business. Yay.
Re: (Score:3)
It works as a total blacklist, but not as a blacklist specifically against AI scraping (while keeping search engine visibility). That's mostly what this is about: coming up with a new robots.txt parameter that tells AI bots to scram.
Re: (Score:3)
It works as a total blacklist, but not as a blacklist specifically against AI scraping (while keeping search engine visibility).
Of course, Google chose to contribute to this problem by essentially saying "if you want our search engine to index your site, you also have to let us slurp up your site's data for training our AI".
This can be solved with robots.txt. All that's really needed is for Google and the others to give their AI scrapers unique identifiable user-agent identities.
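For illustration, a minimal sketch of how that would look in robots.txt, assuming the AI scraper announced itself under its own token ("ExampleAIBot" here is made up, not a real crawler name):

User-agent: Googlebot
Allow: /

User-agent: ExampleAIBot
Disallow: /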
Re:It already works, why change it (Score:5, Insightful)
Of course, Google chose to contribute to this problem by essentially saying "if you want our search engine to index your site, you also have to let us slurp up your site's data for training our AI".
I'm confused.
This is literally why this discussion is happening.
Google, like many, many others, has realized that we need more ways to indicate what kind of behavior you specifically want to allow or disallow from scraping entities.
They have a right to scrape. This is very well established. They don't even have to try to be nice about it.
This can be solved with robots.txt. All that's really needed is for Google and the others to give their AI scrapers unique identifiable user-agent identities.
This is absolutely a solution-ish. But only ish.
The paradigm you suggest only really solves the problem if there's a different user agent for every reason a scraper might index or chew through a site.
E.g., you, as a site operator, may want Google to index your page for translation purposes, but not for LLM training purposes.
Or you may want it to index the text, but not cache the site.
The problem is that robots.txt is not expressive as to what your permissions are. And meta tags suck ass, because nothing even remotely approaching a standardized meaning exists for the robots meta keyword.
Unfortunately, the solution that the community at large comes up with is more likely to use something like meta tags, because kids these days don't like simple shit like parsing a robots.txt.
If it were me, I'd fix robots.txt to just be more expressive. You could solve the problem quite easily with one new verb: Permissions.
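For instance, a robots.txt extended with such a verb might read like this; the "Permissions" directive and its tokens below are invented for this sketch, not part of any standard:

User-agent: *
Allow: /
Permissions: index
Permissions: translate
Permissions: no-llm-training
Permissions: no-cache

A conforming crawler would check its purpose against those tokens before fetching anything; a non-conforming one would ignore the file entirely, same as today.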
But ultimately, the file/tags we're discussing here are a good-faith protocol. They don't have to use it, and you can't force them to.
This protocol must be negotiated. There are consequences to bad faith in that negotiation on both sides.
I see a lot of morons on this site talking some shit that's frankly laughable.
It's becoming more and more obvious to me that there are very few people on this site who actually matter to any of the shit they have opinions on. Which is sad, because that used to be the awesomeness of this site. We got the opinions of professionals and experts, not stupid fucking meme traffickers and education-by-youtube dumbfucks who think they matter.
Having a hot take on something the industry is doing might be cool to one's friends, but all it does is ensure that you do not matter, and that nobody will give a fuck what your stupid opinion is.
This isn't personally aimed at you, but you are straying close to the area of being unhelpful.
Re: It already works, why change it (Score:3)
Yeah, no. Sundar knows that robots.txt already covers the use case. He just wants to make up a new rule to have an AI-scraping "excuse" because the new rule has nobody following it by default, and he intends Google to be the primary recipient of that pretext.
The cruelty is the point.
Re: (Score:2)
robots.txt is woefully insufficient for this purpose. It already lags behind the current hodgepodge of more expressive scraping-control mechanisms.
There needs to be a standardized way to expressively inform the scraper what you'd like them to do with what's on your site.
You very well may want Google to index your site, but not cache it. Or not train LLMs with it. How, precisely, do you tell it to do that with robots.txt, which gives
Re: It already works, why change it (Score:2)
No, the parent poster is probably realistic and not cynical.
Google wants to scrape sites for AI and finds robots.txt too restrictive.
If it was broken, why wait so long and explicitly mention AI? This is the 'public comments' period of a change they intend to make.
Re: (Score:2)
No, the parent poster is probably realistic and not cynical.
Nope. You share their particular form of damage, clearly.
Google wants to scrape sites for AI and finds robots.txt too restrictive.
Stupid argument.
Google wants to give you more control, not less.
robots.txt has been insufficient for crawler bot restriction for a long time now, which is why we have robots meta tags.
Just because you can't imagine a situation where you want bots to be able to do some things on your site and not others doesn't mean such a need doesn't exist. It just means you're lazy, or not very intelligent.
If it was broken, why wait so long and explicitly mention AI? This is the 'public comments' period of a change they intend to make.
No, this isn't.
This is an invitation to get people toge
Re: (Score:2)
robots.txt has been insufficient for crawler bot restriction for a long time now, which is why we have robots meta tags.
I was under the impression that robots meta tags were created for those on shared hosting who can't modify HTTP response headers, or otherwise can't have a single robots.txt file located in the host's webroot.
Re: (Score:2)
I was under the impression that robots meta tags were created for those on shared hosting who can't modify HTTP response headers, or otherwise can't have a single robots.txt file located in the host's webroot.
It did in fact start life that way, back in... 1996. It never saw wide adoption, however, because by design it was just as limited as robots.txt in terms of permissions. It was only useful for the specific use case you mentioned, back when shared non-virtual hosting was common. These days, sharedhosting.com/~johnbob/ is a rare sight.
When SEO guys and indexers started becoming aware that additional permissions were needed, for whatever reason, they chose to start that work on the robots meta tag rather th
Re: (Score:3)
Re: (Score:2)
Re: (Score:3)
Robots.txt isn't a very good solution, so these days most web crawlers ignore it. It was widely abused for SEO and malware purposes, to make it harder to detect dodgy websites.
For AI what we should have is opt-in. If the website doesn't opt in, complete with a link to the licence that the material is available under, don't take it and use it to train your AI. We will probably need a law that makes using AI trained on material not explicitly licenced for such use copyright infringement.
Re: (Score:3)
Robots.txt isn't a very good solution, so these days most web crawlers ignore it. It was widely abused for SEO and malware purposes, to make it harder to detect dodgy websites.
The only real problem with robots.txt is that it is voluntary, requiring people running crawlers to respect the wishes of site owners even when doing so poses an inconvenience for them.
The reason crawlers ignore robots.txt is because their owners are asshats who believe they are entitled to do as they please.
Over the years there have been a string of ridiculous public excuses for such behavior: Other sites ignore it too, we're special different and unique - robots.txt was never intended to apply to *our*
Re: (Score:3)
For AI scraping / training, I don't see a fault with Robots.txt. Most sites don't want that kind of traffic because they filter the ADs and block legitimate human users. (An effective DDoS attack.) While most AI users don't want the ADs con
Re: (Score:2)
Whatever the solution, it should be opt in. Assume no permission unless the site explicitly grants it.
Re: (Score:2)
Websites are public by default, they can use HTTP 401 or other solutions if non-public behaviour is required.
Re: It already works, why change it (Score:1)
Re: (Score:2)
Maybe some web site operators want search engines to crawl their sites, but not AI bots (at least, not for free). Stack Overflow is one such site.
Re: It already works, why change it (Score:1)
Re: (Score:2)
I have no idea what your "argument" is; your sentences are rather jumbled. But you state that the "AI domain" cannot be defined by a static set of computers. Why not?
And data does not travel freely. It travels encumbered with intellectual property restrictions, where the author allows his/her data to be shared only under specific conditions. That's what copyright is, you the author get to decide how your work is used, regardless of the ease with which the data itself can be shared.
It's not about the technic
Re: It already works, why change it (Score:1)
Re: (Score:2)
Yes, some level of data sharing is required for the operation of the web or software (keystroke logging or telemetry). But that need does not infringe on the ownership rights of an IP holder. This is why you must accept terms and conditions every time you access many websites: you are granting the publisher access to some of your otherwise private data in exchange for using the site.
AI is different from the web in how it presents data to the user, not because it's AI. Traditional web sites control exactly
Re: (Score:2)
Sure they might want that, but it doesn't mean they are entitled to that. The old saying about cake and consuming it and all.
Re: (Score:2)
By law, authors certainly are entitled to that level of control, if they want it. This is what copyright law covers. The author legally has complete control over when and how their work is shared, to whatever level of granularity they want. In some cases, that level of control may not be available, so the author might have to make compromises.
Streaming presented content owners with a very similar issue a few years ago. Just because a publisher had permission to make a DVD, did not automatically give them pe
Re: (Score:2)
As far as I know, no law compels anyone to fetch or respect the contents of robots.txt. While respecting it is "The Right Thing To Do," doing so is a courtesy, not a legal requirement. Thus, I suggest that efforts to expand the expressiveness of robots.txt, or to create additional files that are similar, will have the effect of giving the false impression that something useful has occurred. It will attract a great deal of discussion and effort without actually changing the legal, and enforceable, obligation
You've got the main bit (Score:3)
It needs to be a bit more selective than that, but yes it should be just as easy as adding a robots.txt-like syntax for AI. I wouldn't piggy-back on robots.txt itself, let that control overall automated access, but have an ai-robots.txt file specifically for AI scanning that can indicate the scanning tool or purpose and which parts of the site are allowed or blocked for that tool/purpose. My recommendation would be for web server software to by default set this to deny-all and during installation or setup prompt the user to select a desired configuration (defaulting to "no change").
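As a sketch of what such a file might contain (the file name, the "Purpose" field, and the purpose tokens are all hypothetical):

# ai-robots.txt -- deny all AI scanning by default
User-agent: *
Purpose: *
Disallow: /

# allow one named tool to scan public docs, for summarization only
User-agent: ExampleAIBot
Purpose: summarization
Allow: /docs/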
Re: (Score:2)
Why not add it to robots.txt? This is all about scraping sites with automated tools. So far it's the wild west as to whether services will honor these flags, with the most infamous scraper having a single command-line option to ignore any flags/markings meant to stop it.
At the end of the day, when my server identifies a scraper that oversteps what robots.txt allows, it starts sending it down the path of 1xx response tar pits. You follow robots.txt or you don't get to play.
Re: (Score:2)
Yes, they're making a nice strawman argument with this proposal. Unfortunately they're not alone in this; there is also a W3C community group about this, which is making the same strawman argument [github.com] (somehow AI scraping is not "normal" scraping, because it's AI).
And their end goal is to make this opt-out - basically if you don't add the specific rules to your web server, they will scrape everything.
It's all AI enthusiasts and what they all want is to skip following the robots.txt rules because it "hinders innovation."
Re: (Score:3)
Yes, they're making a nice strawman argument with this proposal. Unfortunately they're not alone in this; there is also a W3C community group about this, which is making the same strawman argument [github.com] (somehow AI scraping is not "normal" scraping, because it's AI).
There's no strawman anywhere. What an absurd claim.
The argument being made is not that AI scraping is not normal scraping. The argument being made is that lots of scraping happens for different reasons, and robots.txt does not allow you to specify what reasons you're OK with.
The industry has moved around this with meta tags, but they're a shit-show.
If you didn't know that, then you're already not relevant to this discussion.
And their end goal is to make this opt-out - basically if you don't add the specific rules to your web server, they will scrape everything.
Nobody has proposed that.
It's all AI enthusiasts and what they all want is to skip following the robots.txt rules because it "hinders innovation."
Nobody has suggested this.
People don't want another fu
Re: (Score:2)
Well, the main valid reason is that it's so easy to ignore robots.txt. If there's another valid reason, I don't know it.
Re: (Score:2)
Because I may want to allow spiders to index my site for search-results purposes but not for AI training. A separate file allows that without leaving a loophole for AI scanners that use the same user-agent string as for more traditional spiders.
Google wants AI to stop scraping for free (Score:3)
Google is hoping it can price out AI competition.
Google's VP of trust (Score:2, Interesting)
Re:Google's VP of trust (Score:4, Insightful)
So? If Russia has a Minister of Justice [wikipedia.org], Google can have a VP of trust.
Bots don't care (Score:1)
Just make it simple (Score:2)
What's the code for "fuck off, AI"?
I doubt anyone would need anything else.
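For what it's worth, the closest real-world equivalent today is blocking the AI crawlers that do identify themselves; OpenAI's GPTBot and Common Crawl's CCBot are documented user-agent tokens that claim to honor robots.txt, though anything that doesn't announce itself sails right past this:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /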
Posted on this yesterday .. (Score:1)
Robots.txt is ignored by google (Score:2, Insightful)
Google is one of the bad actors; your robots.txt file will be ignored by Google if you do a blanket wildcard disallow, because they don't recognize the wildcard disallow as a valid part of the protocol. And so they refuse to honor it.
Re: Robots.txt is ignored by google (Score:3)
Yup. I had to ban IP blocks and add specific user-agent bans to deal with a lot of spiders. It's yet another case of dominant players intentionally trying to break the internet.
here is a revolutionary concept (Score:2)
Here's a Thought (Score:2)
How about a permanent hard 403 for all Google.com servers without exception?
Get the fuck out and stay out. I was here first. Google doesn't dictate policy on my servers.
Meanwhile, you can take your AI and ram it.
"Google's VP of Trust" (Score:2)
Dealing with the devil (Score:2)
This is the same Google that decided to ignore indexing instructions in robots.txt against the explicit wishes of site owners.
Hypocrites (Score:2)
Only positive consent is valid (Score:1)
As long as my website does not have explicit permission stated, they'd better stay the hell away from using it as free AI training data.
ignore my robots.txt? (Score:2)
Ignore my robots.txt and my very stupid non-AI script will blackhole your IP. Basically touch too many unique URLs in too short of a time and you go on a naughty list. I consider crawlers a violation of the terms of use of my website, and a good portion of them play nice and honor robots.txt. And I consider it contrary to the copyright on my website for corporations to archive the contents for their own use. The major players already weaponized robots.txt against their competitors, such as Amazon still main
It's just another backdoor attempt (Score:2)
Am I missing something? It seems like this should be as easy as adding a syntax for opting in, like
AI-ok: *
Yeah, but that wouldn't let Google (and the rest of the Big Five butt buddies) shoehorn in their own replacement for robots.txt (which will in all likelihood be completely human-illegible, but "easily" managed with a complimentary Google Webmaster service widget of some sort).
AI bots do not identify themselves (Score:1)
That's pretty rich (Score:2)