Websites are Blocking the Wrong AI Scrapers (404media.co)

An anonymous reader shares a report: Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions to their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt. In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic's real (and new) scraper bot unblocked.

This is an example of "how much of a mess the robots.txt landscape is right now," the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers -- many of them operated by AI companies -- and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work. "The ecosystem of agents is changing quickly, so it's basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively," they added.
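
A quick way to see the problem in practice is to check a site's robots.txt against a current list of AI crawler names. The sketch below uses Python's standard urllib.robotparser; the site URL is a placeholder, and the user-agent list (ClaudeBot as Anthropic's current crawler, anthropic-ai and claude-web as the outdated names, plus the Apple and Meta agents mentioned above) is an assumption that should be checked against a tracker such as Dark Visitors.

    # Check which AI crawler user agents a site's robots.txt actually blocks.
    # The agent list is an assumption -- keep it synced with a tracker like
    # Dark Visitors, since new crawler names appear constantly.
    from urllib import robotparser

    SITE = "https://example.com"  # placeholder site
    AI_AGENTS = ["ClaudeBot", "anthropic-ai", "claude-web",
                 "GPTBot", "Applebot-Extended", "Meta-ExternalAgent"]

    rp = robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    for agent in AI_AGENTS:
        blocked = not rp.can_fetch(agent, SITE + "/")
        print(f"{agent:22s} {'blocked' if blocked else 'NOT blocked'}")

Running this against a site that only lists the outdated names would show the current crawler as NOT blocked, which is exactly the mistake the report describes.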

  • by Apotekaren ( 904220 ) on Monday July 29, 2024 @01:31PM (#64664482)

    It should be opt-in, not opt-out.

    • Yeah, it's not as if the AI companies are trying to surreptitiously circumvent website owners' wishes, is it?
      • That's the thing - TFS implies it's a robots.txt problem ("how much of a mess the robots.txt landscape is right now"), when the real issue is the sleazy people running Anthropic intentionally taking steps to circumvent the obvious wishes of website owners.

        • If banks put up a sign out front: "Bank robbers aren't welcome here." ...do you think sleazy people would also disobey THAT sign? No way, bruh. They're not THAT sleazy. ...are they?

          But yes, you seem to have broken through into a new level of truth: sleazy people exist and ruin shit for the rest of us.
          And now I'd like to introduce you to some more truth: robots.txt is dumb AF, because sleazy people exist.

          • It makes more sense to put a legally binding copyright notice on every page & in the metadata explicitly prohibiting the use of the content for AI. Who knows, after a while, we may even see the notice coming out in LLM outputs?!
    • It should be opt-in, not opt-out.

      It is opt-in if you write your robots.txt file that way. Just disallow * and allow the bots you want. The site owner gets the choice.

      • robots.txt is such a bullshit way to handle any kind of site scraping. It's the fucking honor system, there's no law that states they have to abide by some random txt file sitting on your site.
      • by chrish ( 4714 )

        I mean, that's what my robots.txt looks like:

        # Block all user agents, for the whole site
        User-agent: *
        Disallow: /

        # Allow these user agents (each group needs a rule; Allow: / grants full access)
        User-agent: DuckDuckBot
        User-agent: Googlebot
        Allow: /

        Suppose I could remove Google from there now that they don't bother indexing the web.

        The problem is that the AI scrapers are terrible Internet citizens, and ignore robots.txt. Given how frequently they change IP addresses, they're basically DDoS malware.

    • You can block everything but Google in robots.txt. I believe this is what Reddit does.
    • I see it as very similar to how Google was summarising news items that people searched for. The content of the click-through was displayed just below the search box, so there was little reason to go to the origin of the news.

      If AI bots are allowed to summarise like that, why was Google not allowed to summarise, but AI LLMs are?

      The only reason is perhaps that the investors have some clout.

  • Sweet Slashvertisement, "anonymous reader"!

    • Last I read, the company Dark Visitors ran a child sex ring beneath a pizza shop in DC and is murdering babies, too. Some very fine people are saying the company, Dark Visitors, is terrible and in decline.
  • So these companies, Google and all that, make money from scraping, and now they're mad when others do what they saw fit to do for themselves and their own benefit. No different than a drug dealer calling the cops on a rival.

    • Companies like Google create indexes that link back to the original content. AI LLMs don't cite sources. They create what may or may not be derivative works but never with any attribution. Those scenarios are entirely different.
  • This will be solved by law or the web will die. No more access without logging in.

    • by kwalker ( 1383 )

      Yeah, because "no access without logging in" isn't killing the web right now.

      • I feel like, for a while now, most reasonably interesting communities and content have been behind logins or invite-only. Not that I like it, but those are the only places that have managed to preserve the collaborative, creative, open-minded spirit that used to be the Internet's promise. AI is most definitely going to seal the deal, though, and maybe lead to more real-world personal identification requirements.
    • by Misagon ( 1135 )

      > No more access without logging in.

      Or having to solve a CAPTCHA in the Cookie Consent dialog ...

  • I think it is going to wind up as an arms race between websites that want their content to reach people and AI scrapers.

    One way of dealing with this arms race was an old Outlook feature, where, to send an email, a proof-of-work token [1] was generated. This might be something where, if a client wanted to view a page, a PoW token would be computed with a low order of work... enough to slow down and tarpit the massive spiders, but not enough to affect a web page's responsiveness.

    Another idea might be throwaway client certificates...
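
The proof-of-work idea in the comment above can be sketched in a few lines of Python. This is a minimal hashcash-style illustration, not Outlook's actual "stamp" format: the server hands out a random challenge, the client must find a nonce whose SHA-256 hash has enough leading zero bits, and the server verifies the result with a single hash.

    # Minimal hashcash-style proof-of-work sketch (illustrative only).
    import hashlib, os

    DIFFICULTY_BITS = 18  # tune so a real browser barely notices the delay

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
            else:
                return bits + (8 - byte.bit_length())
        return bits

    def solve(challenge: bytes) -> int:
        # Client side: burn CPU until the hash clears the difficulty bar.
        nonce = 0
        while True:
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if leading_zero_bits(digest) >= DIFFICULTY_BITS:
                return nonce
            nonce += 1

    def verify(challenge: bytes, nonce: int) -> bool:
        # Server side: one hash to check the work was done.
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS

    challenge = os.urandom(16)   # server issues a fresh challenge per request
    nonce = solve(challenge)     # client pays the (small) CPU cost
    assert verify(challenge, nonce)

At 18 bits the client does roughly 260,000 hashes on average, which is imperceptible for a single page view but adds up quickly for a crawler fetching millions of pages.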

    • by Kaenneth ( 82978 )

      Only serve pages encrypted with a complex decryption system?

      Like a system where you quickly encrypt the page with a key like '123456', but only tell the client '1234XX' and they have to brute-force a few missing digits.

      • Pretty much. That is how Outlook used to do a "stamp", which was a proof of work, like guess the first n digits of a hash by trying random numbers.

        One could also hand the browser ciphertext and the encryption key with a ton of rounds necessary to decrypt it, which might be more elegant, as the parent mentions. However, this means the web server would have to do the same number of iterations to encrypt. Having the web server just verify that work was done is probably the best approach.
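
The "tell the client 1234XX" scheme described above can also be sketched briefly. The toy cipher here is just a SHA-256 keystream XOR, chosen only to keep the example self-contained; the point is that the client must burn CPU guessing the withheld digits before it can read the page, while the server encrypts once with the full key.

    # Sketch of the partial-key idea: encrypt with "123456", publish only "1234",
    # and make the client brute-force the last two digits (illustrative only).
    import hashlib
    from itertools import product

    def keystream_xor(data: bytes, key: bytes) -> bytes:
        # Toy stream cipher: XOR with a SHA-256-derived keystream (symmetric).
        stream = bytearray()
        counter = 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
            counter += 1
        return bytes(b ^ s for b, s in zip(data, stream))

    # Server side: encrypt once with the full key, publish a plaintext checksum
    # so the client can tell when its guess is right.
    page = b"<html>hello, human reader</html>"
    full_key = b"123456"
    ciphertext = keystream_xor(page, full_key)
    checksum = hashlib.sha256(page).digest()

    # Client side: knows the prefix "1234", tries the two missing digits.
    prefix = full_key[:4]
    for a, b in product(b"0123456789", repeat=2):
        guess = prefix + bytes([a, b])
        plaintext = keystream_xor(ciphertext, guess)
        if hashlib.sha256(plaintext).digest() == checksum:
            print("recovered with key", guess.decode(), "->", plaintext.decode())
            break

Only two digits are withheld here so the example runs instantly; in practice the server would withhold enough key material (or add hash iterations) to make the client's search cost meaningful.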

  • by ebunga ( 95613 ) on Monday July 29, 2024 @03:38PM (#64664952)

    That web site had it coming, looking all hot and desirable with all that fresh... content. And did you see that robots.txt? She was asking for it. It's her fault.

  • No. Say what it really means: the chatbots are *DELIBERATELY* changing names to DELIBERATELY violate the wishes of the website owners.
