AI The Internet

Can Robots.txt Files Really Stop AI Crawlers? (theverge.com) 97

In the high-stakes world of AI, "The fundamental agreement behind robots.txt [files], and the web as a whole — which for so long amounted to 'everybody just be cool' — may not be able to keep up..." argues the Verge: For many publishers and platforms, having their data crawled for training data felt less like trading and more like stealing. "What we found pretty quickly with the AI companies," says Medium CEO Tony Stubblebine, "is not only was it not an exchange of value, we're getting nothing in return. Literally zero." When Stubblebine announced last fall that Medium would be blocking AI crawlers, he wrote that "AI companies have leached value from writers in order to spam Internet readers."

Over the last year, a large chunk of the media industry has echoed Stubblebine's sentiment. "We do not believe the current 'scraping' of BBC data without our permission in order to train Gen AI models is in the public interest," BBC director of nations Rhodri Talfan Davies wrote last fall, announcing that the BBC would also be blocking OpenAI's crawler. The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI's models "were built by copying and using millions of The Times's copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more." A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.

It's not just publishers, either. Amazon, Facebook, Pinterest, WikiHow, WebMD, and many other platforms explicitly block GPTBot from accessing some or all of their websites.

On most of these robots.txt pages, OpenAI's GPTBot is the only crawler explicitly and completely disallowed. But there are plenty of other AI-specific bots beginning to crawl the web, like Anthropic's anthropic-ai and Google's new Google-Extended. According to a study from last fall by Originality.AI, 306 of the top 1,000 sites on the web blocked GPTBot, but only 85 blocked Google-Extended and 28 blocked anthropic-ai. There are also crawlers used for both web search and AI. CCBot, which is run by the organization Common Crawl, scours the web for search engine purposes, but its data is also used by OpenAI, Google, and others to train their models. Microsoft's Bingbot is both a search crawler and an AI crawler. And those are just the crawlers that identify themselves — many others attempt to operate in relative secrecy, making it hard to stop or even find them in a sea of other web traffic.

For any sufficiently popular website, finding a sneaky crawler is needle-in-haystack stuff.
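To make the blocking described above concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URL are assumptions made up for illustration, not any publisher's actual file; the user-agent tokens are the ones named in the article. It shows only what a crawler that chooses to honor the file would conclude. Nothing in it stops a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow the AI crawlers named in the article,
# allow everyone else. Not copied from any real site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot", "Bingbot"]
    for agent, ok in allowed_agents(
        ROBOTS_TXT, agents, "https://example.com/story"
    ).items():
        print(f"{agent}: {'allowed' if ok else 'disallowed'}")
```

A well-behaved crawler would run a check like this before each fetch; the file itself has no enforcement power.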

In addition, the article points out, a robots.txt file "is not a legal document — and 30 years after its creation, it still relies on the good will of all parties involved.

"Disallowing a bot on your robots.txt page is like putting up a 'No Girls Allowed' sign on your treehouse — it sends a message, but it's not going to stand up in court."
This discussion has been archived. No new comments can be posted.

  • Velvet rope (Score:5, Insightful)

    by sinij ( 911942 ) on Sunday February 18, 2024 @11:38AM (#64249146)
    Asking someone to not cross the velvet rope is not a viable theft-deterrent system.
  • by guruevi ( 827432 ) on Sunday February 18, 2024 @11:47AM (#64249156)

    The thing about more modern LLMs is that as they continue crawling they start feeding on their own generated content, which is the equivalent of marrying your cousin; in a few generations all LLMs will resemble the famous painting of Charles II. And it's already visible: GPT4 is much worse at coming up with novel ideas and goes much more towards hallucinations and being outright wrong because it is being fed the answer someone reposted from GPT3.

    • by mspohr ( 589790 )

      Sounds like a doom loop. AIs will increasingly suck up their own content recursively. Eventually they will drift off into la-la land and become irrelevant.

      • So you're saying the solution isn't robots.txt, but a website that deliberately returns AI-generated crap for every page visited by the AI crawlers?

        There's something I rather like about this idea... My ollama install is sufficiently slow that it'll have the added side-effect of slowing down and using extra resources in the crawlers too.

      • You can just train with internet data up to a certain cutoff date, pre-AI.

    • The thing about more modern LLMs is that as they continue crawling they start feeding on their own generated content, which is the equivalent of marrying your cousin; in a few generations all LLMs will resemble the famous painting of Charles II. And it's already visible: GPT4 is much worse at coming up with novel ideas and goes much more towards hallucinations and being outright wrong because it is being fed the answer someone reposted from GPT3.

      Do you have any evidence to support your claim regarding GPT4?

      • I have not looked into this so just shooting from the hip but it seems logical that the more LLM bots are fed from information generated by other LLM bots the more they will drift from reality.

        Like the children's game of "telephone".

      • by guruevi ( 827432 )

        The fact that GitHub CoPilot and OpenAI Chat are more and more confidently wrong about even the simplest things whereas just a year ago they were providing at least somewhat decent solutions.

    • by AmiMoJo ( 196126 )

      I wonder if you could poison the training data. Include some invisible text on the site, that only the LLM will read.

      • I have a bunch of abuse@... E-Mail addresses with the text the same color as the background on my web site for any spammer that wants to scrape/steal E-Mail addresses from my web site. I have thought about creating a folder of obviously bogus contradictory data for AI systems to learn. Of course, I'd list it in a robots.txt file as a defense against my disinformation being used.
  • by Baron_Yam ( 643147 ) on Sunday February 18, 2024 @11:52AM (#64249164)

    When you enter private property, there are certain commonly accepted rules, plus whatever rules the property owner posts or uses commonly accepted other indicators to advertise.

    You can walk up a driveway and ring someone's doorbell, but only if there is no fence and gate blocking your way (no matter how ineffective as physical security) and no signage indicating you are not welcome. Put up a sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.

    If I expose a server to the Internet and post a rule in a commonly accepted place like 'robots.txt', violating that rule ought to be considered an act of criminal trespass... and any data downloaded during that act should be considered data theft.

    • > Put up sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.

      That's... stupid.
      I'm not saying you're legally wrong, just that I consider the doorbell not accessible without authorization stupid.

    • quote: 'If I expose a server to the Internet'
      • Re: (Score:2, Insightful)

        by Baron_Yam ( 643147 )

        That has to be the stupidest response I've read on Slashdot recently.

        You're using the Internet, which requires exposing servers to the Internet. If you think connecting computers to the Internet is 'losing the game' perhaps you should disconnect yourself for a while and see how great your Internet experience is without the Internet.

        • Context matters, here losing the game is losing control of published data. As a mere end user I only lose the game if I put or post any data I care to keep to myself on your server. We are at the point that trusting anyone with valuable private data is a false promise.

          The only way to mitigate copyrighted material being used without authorization is to litigate. Once again, unless you have an army of lawyers, expect that copyrighted material to not be used as you intended.
    • by dcooper_db9 ( 1044858 ) on Sunday February 18, 2024 @01:06PM (#64249290)

      Put up sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.

      If I expose a server to the Internet and post a rule in a commonly accepted place like 'robots.txt', violating that rule ought to be considered an act of criminal trespass

      I agree on the second part but I've done research on the trespassing rules, and it's very complicated. Here are a few of the things I learned.

      1. A no soliciting sign on your door has no legal power. There has to be a clear line a person has to cross before they get to the door.
      2. You can't shoot at children who cross into your property, even if they're trespassing.
      3. Courts tend to treat back and side yards differently from front yards.
      4. In a life threatening emergency your property rights are generally nil.
      5. There are a lot of people who can come onto your property even if no trespassing signs are clearly posted.
      +-- A. The postman can walk past no soliciting signs to come to your door.
      +-- B. Free speech has been interpreted to mean that evangelists and politicians can ignore no soliciting signs.
      +-- C. A police officer can knock on your back window in the middle of the night.
      +-- D. The electric company can go pretty much anywhere they want.

      • I once lived on a property where there was an easement through my backyard as that's where the local main sewer line was buried. At any time, the municipal government had full authority to destroy my fence to drive construction equipment in and excavate the back half of my yard. Never happened, but I didn't care for the possibility.

        But yes, your property has a variety of rules covering access. Thankfully, I live in Canada where free speech does NOT grant ridiculous amounts of leeway to con artists to pe

        • I have an easement as well but it is literally way out front next to the sidewalk on a space between me and my neighbor. They can get to it without damaging anything while standing in the sidewalk. It's kinda weird your house was put in front like that.

          • It's not that weird. I have the same situation as the GP; underground power lines run through the back of my yard, (and everyone else's on the block,) so there's an easement. The electric company can come in and tear my back fence and yard out if necessary. It'll probably never happen, but still.

            On top of that, like you, I have an easement through my front yard, near the sidewalk, for a cable line and, (I believe,) a gas line as well.

      • 1. A no soliciting sign on your door has no legal power. There has to be a clear line a person has to cross before they get to the door.

        In one word? False. Though it depends on local laws. In Phoenix, solicitors are required to obey such signs. They can walk up to your door, but the moment they try to solicit, as a matter of law it's the same as if they jumped over your fence into your back yard. Bonus: If you're in a gated community, the moment they pass through an open gate no matter who opened it, they're already trespassing.

        I've had them say some shit like "I have a peddler's permit that lets me ignore that stuff" which is a complete li

        • Oh, and in Phoenix, missionaries are solicitors under the law.

        • In the City of Los Angeles, shopping malls generally have the curbs painted red and posted as No Parking, to keep them open for the Fire Department if needed. If a shop keeper calls the police to ticket cars parked in that No Parking Zone, the cops will refuse because it's private property and they have no jurisdiction. If they're called out because of peddlers blocking the sidewalks, they'll refuse because it's public property.
          • Tell Palo Alto to refund my parking ticket fine in a shopping mall lot. :-)

          • My understanding if you do that here the owner can just call a tow company who will be more than happy to pick it up within mere minutes of the call as it's a very lucrative business. No police necessary, though I don't know the exact rules around it here and I haven't yet seen it happen to anybody. In Phoenix, they just have to have a sign posted basically saying "tow away zone" whilst citing the ARS code that they'll have you towed under. From there the property owner doesn't hold any liability or anythin

      • > 2. You can't shoot at children who cross into your property, even if they're trespassing.

        Whoa! What if I'm a really bad shot? Can I shoot them in that case?

        • It might seem obvious that you can't shoot (or threaten to shoot) at children, but that's on the list for a reason. I know of three different cases where people shot or threatened to shoot children for trespassing. One shot a wayward football. One threatened to shoot a kid for walking across his front yard. One posted on Facebook that she'd shoot any kids who pulled a doorbell prank on her. There are way too many unstable people with access to weapons.
          • Ok but how is a special rule/law for not shooting kids any different than the general concept of not shooting random people who aren't a threat?

            I don't think it suddenly becomes ok to shoot people when they hit 18. If there's any place where that is true, please let me know where so I can avoid.
             

    • by RockDoctor ( 15477 ) on Sunday February 18, 2024 @02:24PM (#64249438) Journal

      Put up sign or a gate, and suddenly anyone ringing your doorbell without prior authorization is trespassing.

      This is probably American bollocks talking, but here in the civilized world, it's certainly bullshit. Generally, postal workers - in the execution of their duties - have a reasonable right of entry onto land to perform those duties. Representatives of the owners of equipment (e.g. the gas, water and electricity meters supplying your premises - they're never sold or leased ; they're always the property of the utility company) have the right of access to their property for inspection. In particular, they have the right to access and inspect it to ascertain that the meter hasn't been bypassed. Various officers of the court have the right of access to property to serve papers, and perform various other duties instructed by the court.

      American wet dreams about "my home is my castle" are not applicable outside America - and probably aren't applicable inside America too - watching video of police raids on lunatic isolationists/survivalists is popular entertainment here (with a prayer of thanks to Cthulhu that we're not on the same continent as these nutters).

      Sorry Canada and Mexico.

    • This isn't someone's private home. You're inviting people into your public showroom by having a website. This is like a shopkeeper posting a sign that says "no robots allowed," but if you don't have a bouncer (i.e. a WAF), you really can't do anything about it. The idea of charging an unmanned, suspected, potentially foreign bot for "trespass" doesn't translate here.

    • violating that rule ought to be considered an act of criminal trespass... and any data downloaded during that act should be considered data theft.

      Or you could, you know, not send the data they asked for in the first place.....

      This just screams entitlement like the various news agencies. Yes I want you to index me. No I don't want you to display said index of me without paying for it!!!!!

      Not to mention, it's fairly easy to detect a web crawler. Just drop some hidden honeypot links that aren't visible to normal users. When the bot crawls too many of them, just block their access. You could also use ADs. If the advert isn't displayed, don't send th
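The honeypot-link idea in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: the trap paths, the threshold, and the `HoneypotTracker` class are all hypothetical names invented here, and a real deployment would sit inside whatever web server or middleware the site already uses.

```python
from collections import Counter

# Hypothetical hidden links that no human visitor would normally follow.
HONEYPOT_PATHS = {"/trap/a", "/trap/b", "/trap/c"}
# Assumed threshold: how many honeypot hits are tolerated before blocking.
MAX_HONEYPOT_HITS = 2


class HoneypotTracker:
    """Counts honeypot hits per client and blocks clients over the threshold."""

    def __init__(self, honeypot_paths=HONEYPOT_PATHS, max_hits=MAX_HONEYPOT_HITS):
        self.honeypot_paths = set(honeypot_paths)
        self.max_hits = max_hits
        self.hits = Counter()
        self.blocked = set()

    def record(self, client_ip, path):
        """Record one request; return True if it should still be served."""
        if client_ip in self.blocked:
            return False
        if path in self.honeypot_paths:
            self.hits[client_ip] += 1
            if self.hits[client_ip] > self.max_hits:
                self.blocked.add(client_ip)
                return False
        return True


if __name__ == "__main__":
    tracker = HoneypotTracker()
    for path in ["/index.html", "/trap/a", "/trap/b", "/trap/c", "/index.html"]:
        print(path, tracker.record("203.0.113.5", path))
```

Tolerating a couple of hits before blocking is a design choice to reduce false positives from prefetchers or curious users; a threshold of zero would match a stricter reading of the comment.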

    • so use adblock and it's a felony? or EULA bs?

      • Maybe 'use adblock against same-site ads is cyber trespassing' is something we should accept. Third party's too risky to force on people.

        I think if a site expects you to accept ads (third-party hosted or not), it should be liable for any damage caused by malicious code served up. If Slashdot connects me to malware, I should have a fairly easy legal remedy to charge it for any damages related to data theft or even simply the time or money put into cleaning my system.

        I suspect that if consumers had rights a

    • How about, you can't have it both ways!

      If you don't want your data to be public, don't make it publicly available.

      If you make it publicly available, crawlers (AI or not) will find it.

      Seems OK to me!

    • And it actually is a thing, at least in EU. Machine readable opt-out needs to be respected per the EU copyright directive, this includes robots.txt but also ToS and metadata.
  • Public is public (Score:2, Insightful)

    by MpVpRb ( 1423381 )

    If you post something in public, it means that anybody or any bot can see it
    Want security and control? Don't post in public

    • That anyone can see it is true but that doesn't mean that anyone can copy it, alter it, and republish it as a new work. Even the fair use rules generally require credit to the original author.
      • by Anonymous Coward

        You're talking about things that are covered by copyright law. AI training is not. That's why the companies are mad.

        • by Okind ( 556066 )

          You're talking about things that are covered by copyright law. AI training is not.

          This is nonsense.

          Sure, the actual training of the AI is not covered by copyright law. But much of its training material is very much covered by copyright law. And given the current long copyright terms, I suspect only a tiny fraction of publicly available sources are in the public domain.

  • ... Neville Chamberlain.

  • Due to their dominance in search, Microsoft and Google hold all the power here. They can technically honor a "Do Not Train AI" flag for sites that set it in robots.txt. ...but they can also then stop indexing those sites for general search (because technically their ranking algorithms also employ some level of "AI"). Those sites will be essentially forced to allow AI training if they don't want their traffic to dry up. At the end of the day, Medium, NYT, et al need Google and Bing more than the other way ar
  • by mysidia ( 191772 ) on Sunday February 18, 2024 @12:13PM (#64249208)

    The alternative is to Clickwrap your website with a No Bots agreement, and require User action to enter.

    There are also things you can put in front of a web server that dynamically rewrite all the page links to be session-specific Link codes. Then you ban the session if there are too many requests in a short period of time.

    Since the webpage links are session-specific; if they try to evade using a new IP address, they have to start all over. Because the links to some random GUID don't even provide the browser any way to uniquely identify each link from session to session.
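A rough sketch of the session-specific link scheme described above, assuming an in-memory store and a simple sliding-window rate limit. `SessionGate`, the `/l/` URL prefix, and the limits are hypothetical names and numbers chosen for illustration; the comment does not name a particular product or implementation.

```python
import time
import uuid
from collections import defaultdict, deque

MAX_REQUESTS = 5        # assumed limit per window
WINDOW_SECONDS = 10.0   # assumed window length


class SessionGate:
    """Maps random per-session link codes to real paths and bans busy sessions."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.links = defaultdict(dict)    # session_id -> {code: real_path}
        self.recent = defaultdict(deque)  # session_id -> request timestamps
        self.banned = set()

    def rewrite(self, session_id, real_path):
        """Return an opaque link that is only valid inside this session."""
        code = uuid.uuid4().hex
        self.links[session_id][code] = real_path
        return f"/l/{code}"

    def resolve(self, session_id, code, now=None):
        """Return the real path, or None if the session is banned or the code is unknown."""
        now = time.monotonic() if now is None else now
        if session_id in self.banned:
            return None
        stamps = self.recent[session_id]
        stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        if len(stamps) > self.max_requests:
            self.banned.add(session_id)
            return None
        return self.links[session_id].get(code)
```

Because each code is random and stored per session, a scraper that switches to a new session gets entirely new codes and cannot reuse links harvested earlier, which is the property the comment relies on.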

    "Disallowing a bot on your robots.txt page is like putting up a 'No Girls Allowed' sign

    True. But actually the sign on a treehouse might have some weight depending on how the trespassing law is written.
    There's little precedent for the robots.txt

  • It’s for sure a weird area.. if I read a text book or a news article I’m generally going to use parts of those works to “better” myself. If I read all of Hemingway’s books, or listen to all of Radiohead’s discography there is a good chance that’s going to influence my future recordings or text I write.. that would be considered fair use.. I sort of agree though, I fail to see what good this is really creating. I foresee a future where most of us will be employed to explain to
    • If you read a book, the probability of you using it to write anything remotely similar is next to zero, based on the number of writers relative to the whole population. If ai reads it, the probability of it using it is 1.0. It is reading it with an explicit goal of using the data.

      • If ai reads it, the probability of it using it is 1.0.

        Not really. The AI is using the data to strengthen relationships between ideas. If the only thing the training strengthens from reading your site is "dogs wag their tails when happy" or "climate change is increasing" then not much is retained. If your site is the only one talking about the breakthrough in perovskite photovoltaic efficiency, then yes, your data is probably going to be a singular node and regurgitated more directly.

  • LOL NO (Score:4, Informative)

    by anoncoward69 ( 6496862 ) on Sunday February 18, 2024 @12:30PM (#64249238)
    Robots.txt is only going to stop a scraper that chooses to abide by it. It's not mandatory and there are no laws enforcing it being followed. Probably your only choice is to put your users through CAPTCHA hell and put it all behind one. "AI" will probably have those figured out soon enough if not already.
    • by AmiMoJo ( 196126 )

      I've been making some progress with GDPR. If they scrape the web then they have to respond to SARs because they might have scraped your personal data.

      • because they might have scraped your personal data.

        If personal data is available to any automated system scraping the web then the GDPR compliance problem is your website, not the crawler.

        • by AmiMoJo ( 196126 )

          GDPR doesn't work like that. Just because you found some data in a public place, doesn't mean you can take it, or process it, or refuse SARs.

    • >"Robots.txt is only going to stop a scraper that chooses to abide by it."

      Correct. I have seen systems (bots) actively scraping my site, completely ignoring my robots.txt

      >"It's not mandatory and there are no laws enforcing it being followed."

      Even if there were laws, they would probably still be ignored. And/or enforcement is pretty much impossible. Case in point- the ridiculous concept of "gun-free zones." The only ones who respect it are the law-abiding, who were never the problem. And then you

  • Comment removed based on user account deletion
  • feel free to ignore robots.txt but don't complain when your bot ends up swallowing a few gigs of procedurally generated nonsense after detection.
  • The larger questions are how much we can "trust" AI and how that affects current laws.

    No one is proposing a technical solution to that genie now that it's out of the bottle but I'm sure that some kind of digital Robocop, fraught with its own problems, is coming soon. "That's Tron. He fights for the users."

  • That's two different things. Clearly. (...Like push and pull logic, client -server, whatever.)

  • by iAmWaySmarterThanYou ( 10095012 ) on Sunday February 18, 2024 @02:26PM (#64249442)

    Put a block in your robots.txt for a path that doesn't exist.

    Anyone who tries to hit that path is obviously abusing your robots.txt.

    Drop their IP in your firewall.

    Yes they can change their IP and likely use many IP's but they're not getting very far in most cases.

    I consider robots.txt abusers to be in the same class of scum as spammers and Nigerian prince scammers and crypto boys. Automation won't stop them all but it can dramatically reduce their impact.
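The trap-path approach above could look something like this minimal sketch. It assumes the web server writes Common Log Format access logs; the trap prefix, the sample log lines, and the `offending_ips` helper are hypothetical. The final step of feeding the addresses to a firewall is deliberately left out, since the exact command depends on the firewall in use.

```python
import re

# Hypothetical path that exists nowhere on the site and is linked from no page.
# The only place it appears is the Disallow line in robots.txt.
TRAP_PREFIX = "/no-such-dir/"

ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PREFIX}\n"

# Start of a Common Log Format line: client IP, two fields, [timestamp], "METHOD path ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)'
)


def offending_ips(log_lines, trap_prefix=TRAP_PREFIX):
    """Return the set of client IPs that requested anything under the trap path."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(trap_prefix):
            offenders.add(match.group("ip"))
    return offenders


if __name__ == "__main__":
    sample_log = [
        '203.0.113.7 - - [18/Feb/2024:14:26:00 +0000] "GET /no-such-dir/x.html HTTP/1.1" 404 153',
        '198.51.100.2 - - [18/Feb/2024:14:26:05 +0000] "GET /index.html HTTP/1.1" 200 5120',
    ]
    print(sorted(offending_ips(sample_log)))
```

As the reply below points out, this only catches crawlers that read robots.txt and then visit the disallowed paths; crawlers that simply follow page links will never request the trap.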

    • Your approach might work for crawlers that actually try to follow the paths in robots.txt. But I don't think that's how most crawlers work. They're actually following links in the home page itself. Reputable crawlers would then filter out links discovered in this way, based on the list in robots.txt. But if your page doesn't contain a link to your non-existent page, crawlers would have no reason to go there.

      • But the idea is to filter out crawlers that are not reputable and use the robots.txt as a list of places to go rather than avoid.

        Filtering bad network bits of any sort is a series of filters. There is no one magic way to filter all spam or all sms crap or all bad web crawlers but the ones abusing robots.txt can certainly be stopped or at least dramatically slowed.

        • Or you could just make your private content...private, as in, don't allow it to be seen without authentication. Then you don't have to worry about crawlers following links in robots.txt.

          • "you could just make your private content...private"

            Only some of the concern is about content that should have been made private. The greater concern is dumb crawlers pounding your servers repeatedly.

            • That concern--of dumb crawlers pounding servers--is not expressed in the article or the summary, as far as I can tell. And to the extent that it's a concern outside of the article's scope, the concern is no more serious for regular crawlers, than AI crawlers.

          • Because I don't want my honest customers and respectful crawlers to have increased friction because a minority of people are assholes abusing my robots.txt.

  • by arfonrg ( 81735 ) on Sunday February 18, 2024 @03:04PM (#64249518)

    Stopping robots is just stupid. If you post stuff publicly (non-protected webpages), then your stuff is fair game.

    If you don't want robots, protect the information with logins.

    "but logins will reduce our viewership" - You can't have your cake and eat it too.

  • How many times have you Googled for an answer to something, only to find a dozen websites that regurgitate the same thread with a different wrapping. The web is fundamentally founded on scraping other people's data and re-presenting it as your own - in order to sell advertising.

    Hell... here I am reading a news aggregation website that relies on the same principle.

    When will we stop seeing AI as the bogeyman of the 21st century and realise that we've been doing this stuff all along. AI has just revealed
  • Even if they don't crawl, in respect for robots.txt, the data will still get scraped anyway -- Just from other sources. The syndicated reposters, the companies that just wholesale regurgitate shit for clicks and SEO.

    Their content is wide-spread, they really can't robots.txt their way out of being involved in training.

    • Which is why they need to stop being so polite.

      Just say "We consider using our content for training language models for commercial use not fair use, we will get around to suing you eventually, be warned." In fact, just put that as a comment in the robots.txt.

  • The humorous thing is that they actually think there is a difference between an AI scraping your content and Google/Bing scraping your content! smh
    • No, it took a few lawsuits, the DMCA and EU copyright directives to give them those privileges.

      Or to turn it around, OpenAI probably doesn't have the privilege unless government jumps up and gives it or the supreme court rules it fair use. Otherwise they will be exAI.

  • A "No girls allowed" sign on a treehouse might, if the person putting the sign up had ownership rights, be enforceable against those who violate the sign as being trespassers though.

    Robots.txt does serve as a kind of "no trespassing" sign, and it'd be interesting to see how it holds up in court in terms of serving as commonly recognized notice of limitations on permission to access a service, for automated systems. Quite often just putting someone on notice that they're not allowed to access something is su

    • trespassers like signs can't block screen readers under the ADA law.

      Not even the DMCA allows you to block them.

      So if you had a screen reader that is coded to not read the robots file (which is not shown on screen when you visit), you can have an issue of trespassing, but in a way that the law says you can't enforce. Now a local cop may not jail or ticket someone for that. But a hacking charge you may need to fight out in court.

      • Trespassing is just an analogy here. Unauthorised access is the actual issue. Robots.txt is an industry norm for advising whether an automated system is permitted to have access to the system. Generally, you don't have to have a substantive enforcement mechanism for such a prohibition to be legally binding - in the absence of permission you can't access that system.
        • automated system vs say an automated adblock for an live view will need to be have the prohibition be listed in an way that an live view is shown to the live viewer to be legally binding. Or do you want an visit to an website to = prison time?

          • Adblock is very different. Adblock is someone choosing not to download some parts of something, rather than downloading something from a system the owner of the system has said they're not allowed to download from.
            Robots.txt is the de facto standard, the industry norm, for limiting or denying automated systems access to servers.
            And yes, if you go on a website and it has a notice you're only allowed to visit the website if you are a member of group X (eg, employees) then that can be legally binding too (albei
  • If a site is visible at all to the world, then a human can presumably gain value from it, and in fact need only cite the source if using the information (not through plagiarism, but by summarizing or simply learning from it) to learn or publish or whatever. Why should a non-human intelligence, one capable of absorbing that information in some way, such as the current large language models, be treated any differently? If the information needs to be monetized, perhaps it should be placed behind a paywall. If
  • EU directives require any and all machine readable opt-out to be respected. It includes robots.txt of course but also ToS.
  • The wider issue is that the whole system of copyright/patents/"intellectual property" etc. was introduced in a world where a physical copy of some sort was required to transfer information (book, paper, tape, vinyl disc, optical disk, clay tablet, wax cylinder etc. etc.). We no longer live in that world.

    We now live in a world where any information that is stored digitally can be immediately replicated any number of times. At will. So the whole system of copyright and "intellectual property" is no longer fi

  • In your robots.txt simply put a few lines like /foobar

    Then, whenever anything ever asks for your page /foobar, you block that IP for 12 hours
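One way to sketch the 12-hour block is a ban list with expiry times. This assumes the trap appears in robots.txt as a line such as `Disallow: /foobar`; the `TimedBanList` class is a hypothetical in-memory helper, and the `now` parameter exists only so the expiry logic can be exercised without waiting.

```python
import time

BAN_SECONDS = 12 * 60 * 60   # 12 hours, as suggested in the comment
TRAP_PATHS = {"/foobar"}     # listed under Disallow in robots.txt, linked nowhere


class TimedBanList:
    """Bans any IP that requests a trap path, and lifts the ban after a fixed time."""

    def __init__(self, trap_paths=TRAP_PATHS, ban_seconds=BAN_SECONDS):
        self.trap_paths = set(trap_paths)
        self.ban_seconds = ban_seconds
        self.banned_until = {}  # ip -> timestamp when the ban expires

    def allow(self, ip, path, now=None):
        """Return True if the request should be served."""
        now = time.time() if now is None else now
        expiry = self.banned_until.get(ip)
        if expiry is not None:
            if now < expiry:
                return False
            del self.banned_until[ip]  # ban has expired
        if path in self.trap_paths:
            self.banned_until[ip] = now + self.ban_seconds
            return False
        return True
```

A temporary ban rather than a permanent one limits the damage when an address is later reassigned to an innocent user.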

  • You see, Robots.txt files are a collection of instructions for web robots that tell them which portions of a website they may access and index. It's like handing people a map with certain places designated as off-limits. However, their efficiency in preventing AI crawlers depends on a variety of criteria. Traditional web crawlers frequently follow these guidelines, but AI crawlers can be more intelligent. They may interpret the rules differently or even reject them entirely in some situations. It's li
  • I feel that for AI it makes sense to limit it with the robots.txt file, but for tools like reverse image search and copyright monitoring tools, it doesn't make sense to have to follow the robots.txt file. If I had Taylor Swift's nudes posted on my site and wanted to be discoverable but not switch up by copyright bots, I'd just block all bots besides the search engines. Then it's a whack-a-mole game and only if someone reports them or you find them yourself, which also requires bots that don't follow the rob
