Websites are Blocking the Wrong AI Scrapers (404media.co)
An anonymous reader shares a report: Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions to their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt. In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic's real (and new) scraper bot unblocked.
This is an example of "how much of a mess the robots.txt landscape is right now," the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers -- many of them operated by AI companies -- and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work. "The ecosystem of agents is changing quickly, so it's basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively," they added.
Easy solution, impossible to implement. (Score:3)
It should be opt-in, not opt-out.
Re: (Score:2)
That's the thing - TFS implies it's a robots.txt problem ("how much of a mess the robots.txt landscape is right now"), when the real issue is the sleazy people running Anthropic intentionally taking steps to circumvent the obvious wishes of website owners.
Re: (Score:2)
If banks put up a sign out front: "Bank robbers aren't welcome here." ...do you think sleazy people would also disobey THAT sign? No way, bruh. They're not THAT sleazy. ...are they?
But yes, you seem to have broken through into a new level of truth: Sleazy people exist and ruin shit for the rest of us.
And now I'd like to introduce you to some more truth: robots.txt is dumb AF, because sleazy people exist.
Re: (Score:2)
It should be opt-in, not opt-out.
It is opt-in if you write your robots.txt file that way. Just disallow * and allow the bots you want. The site owner gets the choice.
Re: (Score:2)
I mean, that's what my robots.txt looks like:
# Block all user agents, for the whole site
User-agent: *
Disallow: /

# Allow these user agents (an empty Disallow means "nothing is off-limits")
User-agent: DuckDuckBot
User-agent: Googlebot
Disallow:

Suppose I could remove Google from there now that they don't bother indexing the web.
The problem is that the AI scrapers are terrible Internet citizens, and ignore robots.txt. Given how frequently they change IP addresses, they're basically DDoS malware.
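For what it's worth, this is exactly what a *compliant* crawler is supposed to do before fetching anything. A minimal sketch with Python's standard-library `urllib.robotparser`, using rules shaped like the example above ("ClaudeBot" stands in here for any bot not on the allow list):

```python
from urllib.robotparser import RobotFileParser

# Same shape as the robots.txt above: block everyone by default,
# then allow two named bots (an empty Disallow allows everything).
rules = """\
User-agent: *
Disallow: /

User-agent: DuckDuckBot
User-agent: Googlebot
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved bot asks before fetching; a scraper that ignores
# robots.txt simply never makes this call.
assert rp.can_fetch("Googlebot", "https://example.com/page")      # on the allow list
assert not rp.can_fetch("ClaudeBot", "https://example.com/page")  # caught by the default block
```

Of course, this only constrains bots that bother to check; nothing here is enforced server-side.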
Re: (Score:2)
I see it as very similar to how Google was summarising news items that people searched for. The content of the click-through was displayed just below the search box, so there was little reason to go to the source of the news.
If Google wasn't allowed to summarise content like that, why are AI LLMs?
The only reason, perhaps, is that their investors have some clout.
Slashvertisment (Score:2)
Sweet Slashvertisment, "anonymous reader"!
Re: Slashvertisment (Score:1, Funny)
AI scrapers (Score:2)
So these companies, Google and all that, made money from scraping; now they're mad when others do what they saw fit to do for themselves and their own benefit. No different than a drug dealer calling the cops on a rival.
Re: (Score:3)
Do we pretend that they adhere to robots.txt now? (Score:2)
This will be solved by law or the web will die. No more access without logging in.
Re: (Score:2)
Yeah, because "no access without logging in" isn't killing the web right now.
Re: Do we pretend that they adhere to robots.txt n (Score:1)
Re: (Score:2)
> No more access without logging in.
Or having to solve a CAPTCHA in the Cookie Consent dialog ...
Re: (Score:2)
AIs are better at captchas than humans now.
Reminds me of the spammer arms race... (Score:2)
I think it is going to wind up an arms race between websites wanting their content for people, versus AI scrapers.
One way of dealing with this arms race was an old Outlook feature, where to send an email, a proof-of-work token [1] was generated. This might be something where if a client wanted to view a page, a PoW token would be computed with a low order of work... enough to slow down and tarpit the massive spiders, but not enough to affect a web page's responsiveness.
Another idea might be throwaway client certificates...
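The proof-of-work idea above can be sketched in a few lines: the client grinds nonces until a hash of the request plus nonce has enough leading zero bits, and the server verifies with a single hash. A toy illustration only, not Outlook's actual scheme; the difficulty value and function names are made up here:

```python
import hashlib
from itertools import count

DIFFICULTY_BITS = 16  # leading zero bits required; a toy value

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def mint_stamp(challenge: str) -> int:
    """Client side: grind nonces until the hash is 'expensive enough'."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify_stamp(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, however long minting took."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

nonce = mint_stamp("GET /page HTTP/1.1")
assert verify_stamp("GET /page HTTP/1.1", nonce)
```

The asymmetry is the point: minting costs ~2^16 hashes on average per request, which is nothing for a human reader but adds up fast for a spider hammering millions of pages.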
Re: (Score:2)
Only serve pages encrypted, with a deliberately expensive decryption step?
Like a system where you quickly encrypt the page with a key like '123456', but only tell the client '1234XX', so they have to brute-force the few missing digits.
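That masked-key scheme can be sketched roughly like this. Everything here is invented for illustration (the toy XOR "cipher", the `MAGIC` marker the client uses to recognise a correct guess, the function names); a real version would use a proper cipher:

```python
import hashlib
from itertools import product
from string import digits

MAGIC = b"OK:"  # known plaintext marker so the client can tell when a guess is right

def xor_stream(key: str, data: bytes) -> bytes:
    # Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    stream = hashlib.sha256(key.encode()).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

def serve_page(page: bytes, key: str, masked_digits: int = 2):
    """Server side: encrypt cheaply with the full key, hand out a masked key."""
    ciphertext = xor_stream(key, MAGIC + page)
    hint = key[:-masked_digits] + "X" * masked_digits
    return ciphertext, hint

def client_decrypt(ciphertext: bytes, hint: str) -> bytes:
    """Client side: brute-force the masked digits (10^n tries at most)."""
    masked = hint.count("X")
    for guess in product(digits, repeat=masked):
        key = hint[:-masked] + "".join(guess)
        plaintext = xor_stream(key, ciphertext)
        if plaintext.startswith(MAGIC):
            return plaintext[len(MAGIC):]
    raise ValueError("no key found")

ct, hint = serve_page(b"<html>hello</html>", "123456")
assert client_decrypt(ct, hint) == b"<html>hello</html>"
```

One encryption for the server, up to a hundred trial decryptions for the client; tune the number of masked digits to set the cost.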
Re: (Score:2)
Pretty much. That is how Outlook used to do a "stamp", which was a proof of work, like guess the first n digits of a hash by trying random numbers.
One could also hand the browser ciphertext and an encryption key with a ton of rounds necessary to decrypt it, which might be more elegant, as the parent mentions. However, this means the web server would have to do the same amount of iterations to encrypt. Having the web server just verify that work was done is probably the best thing.
Did you see what she was wearing? (Score:3)
That web site had it coming, looking all hot and desirable with all that fresh... content. And did you see that robots.txt? She was asking for it. It's her fault.
The ecosystem is changing? (Score:2)
No. Say what it really means: the chatbots are *DELIBERATELY* changing names to DELIBERATELY violate the wishes of the website owners.
Re: The more furious they block AI scrapers now .. (Score:1)