Google Suggests Robots.txt File Updates for 'Emerging AI' Use Cases (blog.google)

For a "vibrant content ecosystem," Google's VP of Trust says web publishers need "choice and control over their content, and opportunities to derive value from participating in the web ecosystem." (Does this mean Google wants to buy the right to scrape your content?)

In a blog post, Google's VP of Trust starts by saying that unfortunately, "existing web publisher controls" like your robots.txt file (a community-developed web standard) came from nearly 30 years ago, "before new AI and research use cases..."

"We believe it's time for the web and AI communities to explore additional machine-readable means for web publisher choice and control for emerging AI and research use cases. Today, we're kicking off a public discussion, inviting members of the web and AI communities to weigh in on approaches to complementary protocols. We'd like a broad range of voices from across web publishers, civil society, academia and more fields from around the world to join the discussion, and we will be convening those interested in participating over the coming months."

They're announcing an "AI web publisher controls" mailing list (which you can sign up for at the bottom of Google's blog post).

Am I missing something? It seems like this should be as easy as adding a syntax for opting in, like

AI-ok: *


Thanks to Slashdot reader terrorubic for sharing the article.
  • by raynet ( 51803 ) on Sunday July 09, 2023 @09:44AM (#63670673) Homepage

    Shouldn't AI crawlers already follow the robots.txt? No need to change it, sites just need to define it correctly. This might cause other spiders to not index all their content, but that might be the price you need to pay.

    • by jythie ( 914043 ) on Sunday July 09, 2023 @10:10AM (#63670769)
      They should, but by creating a 'new' standard, Google could claim that if a website does not have the new file then all permissions are given, maybe even ignoring the robots.txt since that would be for 'non-AI use cases'.
    • by Gherald ( 682277 )

      It works as a total blacklist, but not as a blacklist specifically against AI scraping (while keeping search engine visibility). That's mostly what this is about, coming up with a new robots.txt parameter that tells AI bots to scram

      • It works as a total blacklist, but not as a blacklist specifically against AI scraping (while keeping search engine visibility).

        Of course, Google chose to contribute to this problem by essentially saying "if you want our search engine to index your site, you also have to let us slurp up your site's data for training our AI".

        This can be solved with robots.txt. All that's really needed is for Google and the others to give their AI scrapers unique identifiable user-agent identities.
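A minimal sketch of the suggestion above, assuming AI scrapers did announce their own user-agent token. `ExampleAIBot` and `ExampleSearchBot` are made-up names, not real crawlers, and the robots.txt content is hypothetical; the parsing uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleAIBot" is an assumed name for an
# AI-training crawler; every other crawler falls under the "*" group.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "ExampleSearchBot"):
        print(agent, allowed(agent, "https://example.com/articles/1"))
```

This only works if the crawler identifies itself honestly and uses a different token per purpose, which is the limitation the replies below argue about.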

        • by DamnOregonian ( 963763 ) on Sunday July 09, 2023 @02:04PM (#63671453)

          Of course, Google chose to contribute to this problem by essentially saying "if you want our search engine to index your site, you also have to let us slurp up your site's data for training our AI".

          I'm confused.
          This is literally why this discussion is happening.

          Google, like many, many others, has realized that we need more ways to indicate what kind of behavior you specifically want to allow or disallow by scraping entities.

          They have a right to scrape. This is very well established. They don't even have to try to be nice about it.

          This can be solved with robots.txt. All that's really needed is for Google and the others to give their AI scrapers unique identifiable user-agent identities.

          This is absolutely a solution-ish. But only ish.
          The paradigm you suggest only really fulfills the needs of the problem if you have a different UserAgent for every reason a scraper may index or chew through a site.

          I.e., you, as a site operator may want Google to index your page for translation purposes, but not LLM training purposes.
          Or you may want it to index the text, but not cache the site.

          The problem is that robots.txt is not expressive as to what your permissions are. And meta tags suck ass due to nothing even remotely approaching standardization of meaning for the robots meta keywords.
          Unfortunately, the solution that the community at large comes up with is more likely to use something like meta tags, because kids these days don't like simple shit like parsing a robots.txt.

          If it were me, I'd fix robots.txt to just be more expressive. You could solve the problem quite easily with one new verb: Permissions.
          But ultimately, the file/tags we're discussing here are a good-faith protocol. They don't have to use it, and you can't force them to.
          This protocol must be negotiated. There are consequences to bad faith in that negotiation on both sides.
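One way to picture the "one new verb" idea, purely as a sketch: the `Permissions:` directive below is not part of any robots.txt standard, and its syntax, the purpose names (`index`, `translate`, `llm-training`), and the `ExampleBot` agent are all assumptions made up for illustration.

```python
# Hypothetical robots.txt extension: a "Permissions:" line listing the
# purposes a crawler group may use the content for. Not a real standard.
EXAMPLE = """\
User-agent: ExampleBot
Permissions: index, translate

User-agent: *
Permissions: index
"""


def parse_permissions(text: str) -> dict[str, set[str]]:
    """Map each user-agent token to the set of purposes it is granted."""
    policy: dict[str, set[str]] = {}
    current_agents: list[str] = []
    reading_agents = False  # True while consecutive User-agent lines are read
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
                reading_agents = True
            current_agents.append(value.lower())
        else:
            reading_agents = False
            if field == "permissions":
                purposes = {p.strip().lower() for p in value.split(",") if p.strip()}
                for agent in current_agents:
                    policy.setdefault(agent, set()).update(purposes)
    return policy


def is_permitted(policy: dict[str, set[str]], agent: str, purpose: str,
                 default: bool = False) -> bool:
    """Check the agent's own group first, then "*", then the default."""
    for key in (agent.lower(), "*"):
        if key in policy:
            return purpose.lower() in policy[key]
    return default


if __name__ == "__main__":
    policy = parse_permissions(EXAMPLE)
    print(is_permitted(policy, "ExampleBot", "translate"))
    print(is_permitted(policy, "ExampleBot", "llm-training"))
```

The `default` argument is where the opt-in versus opt-out argument in this thread would land; the sketch assumes deny when nothing is stated, and like robots.txt itself it only binds crawlers that choose to honor it.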

          I see a lot of morons on this site talking some shit that's frankly laughable.
          It's becoming more and more obvious to me that there are very few people on this site who actually matter to any of the shit they have opinions on. Which is sad, because that used to be the awesomeness of this site. We got the opinions of professionals and experts, not stupid fucking meme traffickers and education-by-youtube dumbfucks who think they matter.

          Having a hot take on something the industry is doing might be cool to one's friends, but all it does is ensure that you do not matter, and that nobody will give a fuck what your stupid opinion is.

          This isn't personally aimed at you, but you are straying close to the area of being unhelpful.

    • Yeah, no. Sundar knows that robots.txt already covers the use case. He just wants to make up a new rule to have an AI-scraping "excuse" because the new rule has nobody following it by default, and he intends Google to be the primary recipient of that pretext.

      The cruelty is the point.

      • Incorrect. You're just ignorant and/or stupid.

        robots.txt is woefully insufficient for this purpose. It's already insufficient for the current hodgepodge regime of more expressive scraping behavior regulation systems.

        There needs to be a standardized way to expressively inform the scraper what you'd like them to do with what's on your site.
        You very well may want Google to index your site, but not cache it. Or not train LLMs with it. How, precisely, do you tell it to do that with robots.txt, which gives
        • No, the parent poster is probably realistic and not cynical
            Google wants to scrape site for AI and finds robots.txt too restrictive.

          If it was broken why wait so long and explicitly mention AI? This is the 'public comments' period of a change they intend making.

          • No, the parent poster is probably realistic and not cynical

            Nope. You share their particular form of damage, clearly.

            Google wants to scrape site for AI and finds robots.txt too restrictive.

            Stupid argument.
            Google wants to give you more control, not less.
            robots.txt has been insufficient for crawler bot restriction for a long time now, which is why we have robots meta tags.
            Just because you can't imagine a situation where you want bots to be able to do some things on your site, and not others doesn't mean such a need doesn't exist. It just means you're lazy, or not very intelligent.

            If it was broken why wait so long and explicitly mention AI? This is the 'public comments' period of a change they intend making.

            No, this isn't.
            This is an invitation to get people toge

            • robots.txt has been insufficient for crawler bot restriction for a long time now, which is why we have robots meta tags.

              I was under the impression that robot meta tags were created for those with shared hosting, can't modify HTTP response headers, or otherwise can't have a single robots.txt file located on the host's webroot.

              • I was under the impression that robot meta tags were created for those with shared hosting, can't modify HTTP response headers, or otherwise can't have a single robots.txt file located on the host's webroot.

                It did in fact start life that way, back in... 1996. It never saw wide adoption, however, due to the specific design of being as limited as robots.txt in terms of permissions. It was only useful for the specific use case you mentioned, when shared non-virtual hosting was common. These days, sharedhosting.com/~johnbob/ is a rare sight.

                When SEO guys and indexers started becoming aware that additional permissions were needed, for whatever reason, they chose to start that work on the robots meta tag rather th

        • What they really need is the ability to specify "fuck you - pay me". Index on search engine: OK! Any other reason: payment per byte, sent to Bitcoin address/bank acct/whatever.
    • by AmiMoJo ( 196126 )

      Robots.txt isn't a very good solution, so these days most web crawlers ignore it. It was widely abused for SEO and malware, by trying to make it harder to detect dodgy websites.

      For AI what we should have is opt-in. If the website doesn't opt in, complete with a link to the licence that the material is available under, don't take it and use it to train your AI. We will probably need a law that makes using AI trained on material not explicitly licenced for such use copyright infringement.

      • Robots.txt isn't a very good solution, so these days most web crawlers ignore it. It was widely abused for SEO and malware, by trying to make it harder to detect dodgy websites.

        The only real problem with robots.txt is that it is voluntary, requiring people running crawlers to be respectful of the wishes of site owners even when it poses an inconvenience for them.

        The reason crawlers ignore robots.txt is because their owners are asshats who believe they are entitled to do as they please.

        Over the years there have been a string of ridiculous public excuses for such behavior: Other sites ignore it too, we're special different and unique - robots.txt was never intended to apply to *our*

      • I'd argue the reason why Robots.txt is considered "broken" is because so many websites want the web crawler to index them, but then want total control over how that index entry looks. (Which robots.txt doesn't allow.) See also: Australia / Canada and major news outlets.

        For AI scraping / training, I don't see a fault with Robots.txt. Most sites don't want that kind of traffic because they filter the ADs and block legitimate human users. (An effective DDoS attack.) While most AI users don't want the ADs con
        • by AmiMoJo ( 196126 )

          Whatever the solution, it should be opt in. Assume no permission unless the site explicitly grants it.

          • by raynet ( 51803 )

            Websites are public by default, they can use HTTP 401 or other solutions if non-public behaviour is required.

    • Do users need an easier way to enter their info when they sign up for a web service then an easier way to upload/download content? They will not each be developing their own generative assistant to run natively or writing their own coding assistant for gcc I'm sure. So this is just about an element in a development environment that users will never see or touch? Or maybe I'm missing the point....
    • Maybe some web site operators want search engines to crawl their sites, but not AI bots (at least, not for free). Stack Overflow is one such site.

      • The web environment is the domain and the AI is the domain and cannot be defined by a static set of computers is my argument. I acknowledge trust relationships and gateways and subnets.....and the like but, there is a concept where data travelling freely from one point to another, in other words between domains is the only definition of. So...Data scraping is therefore a trivial discussion about what happens on the most fundamental level on the internet.
        • I have no idea what your "argument" is, your sentences are rather jumbled. But you state that the "AI domain" cannot be defined by a static set of computers. Why not?

          And data does not travel freely. It travels encumbered with intellectual property restrictions, where the author allows his/her data to be shared only under specific conditions. That's what copyright is, you the author get to decide how your work is used, regardless of the ease with which the data itself can be shared.

          It's not about the technic

          • No matter. You seem to understand well enough and it really sounds like you simply disagree, ok.....Do you or for that matter anyone even know what an AI is? Do you know the scope of a web service and all affiliated services when you visit a site? Some part of what I think you are naming intellectual property is within the affiliated domain so any data provided is shared within it. And if you are a developer? Then, unless you own and operate your own physical network and develop your own infrastructure your
            • Yes, some level of data sharing is required for the operation of the web or software (keystroke logging or telemetry). But that need does not infringe on the ownership rights of an IP holder. This is why one must accept terms and conditions every time you access many websites, you are granting the publisher access to some of your otherwise private data, in exchange for using the site.

              AI is different from the web in how it presents data to the user, not because it's AI. Traditional web sites control exactly

      • by raynet ( 51803 )

        Sure they might want that, but it doesn't mean they are entitled to that. The old saying about cake and consuming it and all.

        • By law, authors certainly are entitled to that level of control, if they want it. This is what copyright law covers. The author legally has complete control over when and how their work is shared, to whatever level of granularity they want. In some cases, that level of control may not be available, so the author might have to make compromises.

          Streaming presented content owners with a very similar issue a few years ago. Just because a publisher had permission to make a DVD, did not automatically give them pe

    • As far as I know, no law compels anyone to fetch or respect the contents of robots.txt. While respecting it is "The Right Thing To Do," doing so is a courtesy, not a legal requirement. Thus, I suggest that efforts to expand the expressiveness of robots.txt, or to create additional files that are similar, will have the effect of giving the false impression that something useful has occurred. It will attract a great deal of discussion and effort without actually changing the legal, and enforceable, obligation

  • by Todd Knarr ( 15451 ) on Sunday July 09, 2023 @09:53AM (#63670717) Homepage

    It needs to be a bit more selective than that, but yes it should be just as easy as adding a robots.txt-like syntax for AI. I wouldn't piggy-back on robots.txt itself, let that control overall automated access, but have an ai-robots.txt file specifically for AI scanning that can indicate the scanning tool or purpose and which parts of the site are allowed or blocked for that tool/purpose. My recommendation would be for web server software to by default set this to deny-all and during installation or setup prompt the user to select a desired configuration (defaulting to "no change").

    • by Barny ( 103770 )

      Why not add it to robots.txt? This is all about scraping sites with automated tools. So far, it's the wild west as to whether services will honor them, with the most infamous scraper having a single command line option to ignore any flags/marking to stop it.

      At the end of the day, when my server identifies a scraper that oversteps what robots.txt allows, it starts sending it down the path of 1xx response tar pits. You follow robots.txt or you don't get to play.

      • Yes, they're making a nice strawman argument with this proposal. Unfortunately they're not alone in this, there is also a w3c community group about this, who is making the same strawman argument [github.com] (somehow AI scraping is not "normal" scraping, because it's AI).

        And their end goal is to make this opt-out - basically if you don't add the specific rules to your web server, they will scrape everything.

        It's all AI enthusiasts and what they all want is to skip following the robots.txt rules because it "hinders innov

        • Yes, they're making a nice strawman argument with this proposal. Unfortunately they're not alone in this, there is also a w3c community group about this, who is making the same strawman argument [github.com] (somehow AI scraping is not "normal" scraping, because it's AI).

          There's no strawman anywhere. What an absurd claim.
          The argument being made is not that AI scraping is not normal scraping. The argument being made is that lots of scraping happens for different reasons, and robots.txt does not allow you to specify what reasons you're OK with.
          The industry has moved around this with meta tags, but they're a shit-show.

          If you didn't know that, then you're already not relevant to this discussion.

          And their end goal is to make this opt-out - basically if you don't add the specific rules to your web server, they will scrape everything.

          Nobody has proposed that.

          It's all AI enthusiasts and what they all want is to skip following the robots.txt rules because it "hinders innovation."

          Nobody has suggested this.
          People don't want another fu

      • by HiThere ( 15173 )

        Well, the main valid reason is that it's so easy to ignore robots.txt. If there's another valid reason, I don't know it.

      • Because I may want to allow spiders to index my site for search-results purposes but not for AI training. A separate file allows that without leaving a loophole for AI scanners that use the same user-agent string as for more traditional spiders.

  • by S_Stout ( 2725099 ) on Sunday July 09, 2023 @10:17AM (#63670787)
    Google can afford to scrape. If AI does it for free then AI will eventually replace Google as it does a better job and does not have to show ads like Google does.

    Google is hoping it can price out AI competition.
  • What an oxymoron.
  • There are several big bots which don't pay attention to robots.txt, so why would AI bots be any different?
  • What's the code for "fuck off, AI"?

    I doubt anyone would need anything else.

  • by Anonymous Coward

    Google is one of the bad actors; your robots.txt file will be ignored by Google if you do a blanket wildcard disallow, because they don't recognize the wildcard disallow as a valid part of the protocol. And so they refuse to honor it.

  • How about we invent something called intellectual property rights that are enforced by fines of, say, 100,000 dollars per offense, with one web page counting as one offense?
  • How about a permanent hard 403 for all Google.com servers without exception?

    Get the fuck out and stay out. I was here first. Google doesn't dictate policy on my servers.

    Meanwhile, you can take your AI and ram it.

  • Google and Trust in the same sentence. What a Hoot!!
  • This is the same Google that decided to ignore indexing instructions in robots.txt against the explicit wishes of site owners.

  • For a company that has NEVER obeyed the desires of those hosting networks, this is just annoying. I ran a server hosting a band's website. I had updated the robots.txt to tell crawlers not to scan the site. One page had no direct link from any other page on the website. You either knew the URL or you didn't. Google, using their outright brutal DNS servers, found the page and started posting the band's music in Google searches. I filled out the forms to have it removed and now, eight years later, it can still be found in the land of EVIL.
  • As long as my website does not state an explicit permission, they better stay the hell away from using it as free AI training data.

  • Ignore my robots.txt and my very stupid non-AI script will blackhole your IP. Basically touch too many unique URLs in too short of a time and you go on a naughty list. I consider crawlers a violation of the terms of use of my website, and a good portion of them play nice and honor robots.txt. And I consider it contrary to the copyright on my website for corporations to archive the contents for their own use. The major players already weaponized robots.txt against their competitors, such as Amazon still main

  • Am I missing something? It seems like this should be as easy as adding a syntax for opting in, like

    AI-ok: *

    Yeah, but that wouldn't let Google (and the rest of the Big Five butt buddies) shoehorn their own replacement for robots.txt (which will in all likelihood be completely human-illegible, but "easily" managed with a complimentary Google Webmaster service widget of some sort).

  • I have a really old website (20 years) and I manage everything there. AI spider bots are very common. Not a single one identifies itself as a bot, all of them come from VPSes of different origins, and all of them abuse the website with a lot of requests per minute. For example, I have had to block a lot of IP addresses from Hetzner.de, a VPS company; they don't care. The User-Agent of these spiders carries no bot identification, all of them use a normal browser identification, so I can only block them b
  • Google, the company that considers robots.txt a suggestion rather than a directive...
