Websites are Blocking the Wrong AI Scrapers (404media.co)
An anonymous reader shares a report: Hundreds of websites trying to block the AI company Anthropic from scraping their content are blocking the wrong bots, seemingly because they are copy/pasting outdated instructions to their robots.txt files, and because companies are constantly launching new AI crawler bots with different names that will only be blocked if website owners update their robots.txt. In particular, these sites are blocking two bots no longer used by the company, while unknowingly leaving Anthropic's real (and new) scraper bot unblocked.
This is an example of "how much of a mess the robots.txt landscape is right now," the anonymous operator of Dark Visitors told 404 Media. Dark Visitors is a website that tracks the constantly-shifting landscape of web crawlers and scrapers -- many of them operated by AI companies -- and which helps website owners regularly update their robots.txt files to prevent specific types of scraping. The site has seen a huge increase in popularity as more people try to block AI from scraping their work. "The ecosystem of agents is changing quickly, so it's basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively," they added.
Easy solution, impossible to implement. (Score:3)
It should be opt-in, not opt-out.
Re: (Score:2)
That's the thing - TFS implies it's a robots.txt problem ("how much of a mess the robots.txt landscape is right now"), when the real issue is the sleazy people running Anthropic intentionally taking steps to circumvent the obvious wishes of website owners.
Re: (Score:2)
If banks put up a sign out front: "Bank robbers aren't welcome here." ...do you think sleazy people would also disobey THAT sign? No way, bruh. They're not THAT sleazy. ...are they?
But yes, you seem to have broken through into a new level of truth: Sleazy people exist and ruin shit for the rest of us.
And now I'd like to introduce you to some more truth: robots.txt is dumb AF, because sleazy people exist.
Re: (Score:2)
It should be opt-in, not opt-out.
It is opt-in if you write your robots.txt file that way. Just disallow * and allow the bots you want. The site owner gets the choice.
Re: (Score:2)
I mean, that's what my robots.txt looks like:
# Block all user agents, for the whole site
User-agent: *
Disallow: /

# Allow these user agents (an empty Disallow means "nothing is off-limits")
User-agent: DuckDuckBot
User-agent: Googlebot
Disallow:

Suppose I could remove Google from there now that they don't bother indexing the web.
The problem is that the AI scrapers are terrible Internet citizens, and ignore robots.txt. Given how frequently they change IP addresses, they're basically DDoS malware.
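For what it's worth, this is exactly what a *compliant* crawler is supposed to do before fetching anything. A minimal sketch with Python's standard-library `urllib.robotparser`, using rules shaped like the example above ("ClaudeBot" stands in here for any bot not on the allow list):

```python
from urllib.robotparser import RobotFileParser

# Same shape as the robots.txt above: block everyone by default,
# then allow two named bots (an empty Disallow allows everything).
rules = """\
User-agent: *
Disallow: /

User-agent: DuckDuckBot
User-agent: Googlebot
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved bot asks before fetching; a scraper that ignores
# robots.txt simply never makes this call.
assert rp.can_fetch("Googlebot", "https://example.com/page")      # on the allow list
assert not rp.can_fetch("ClaudeBot", "https://example.com/page")  # caught by the default block
```

Of course, this only constrains bots that bother to check; nothing here is enforced server-side.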
Re: (Score:2)
I see it as very similar to how Google was summarising news items that people searched for. The content of the click-through was displayed just below the search box, so there was little reason to go to the source of the news.
If Google wasn't allowed to summarise content like that, why are AI LLMs?
The only reason, perhaps, is that their investors have some clout.
Slashvertisment (Score:2)
Sweet Slashvertisment, "anonymous reader"!
Re: Slashvertisment (Score:1, Funny)
AI scrapers (Score:2)
So these companies, Google and all that, made money from scraping; now they're mad when others do what they saw fit to do for themselves and their own benefit. No different than a drug dealer calling the cops on a rival.
Re: (Score:3)
Do we pretend that they adhere to robots.txt now? (Score:2)
This will be solved by law or the web will die. No more access without logging in.
Re: (Score:2)
Yeah, because "no access without logging in" isn't killing the web right now.
Re: Do we pretend that they adhere to robots.txt n (Score:1)
Re: (Score:2)
> No more access without logging in.
Or having to solve a CAPTCHA in the Cookie Consent dialog ...
Re: (Score:2)
AIs are better at captchas than humans now.
Reminds me of the spammer arms race... (Score:2)
I think it is going to wind up an arms race between websites wanting their content for people, versus AI scrapers.
One way of dealing with this arms race was an old Outlook feature, where to send an email, a proof-of-work token [1] was generated. This might be something where if a client wanted to view a page, a PoW token would be computed with a low order of work... enough to slow down and tarpit the massive spiders, but not enough to affect a web page's responsiveness.
Another idea might be throwaway client certificates...
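The proof-of-work idea above can be sketched in a few lines: the client grinds nonces until a hash of the request plus nonce has enough leading zero bits, and the server verifies with a single hash. A toy illustration only, not Outlook's actual scheme; the difficulty value and function names are made up here:

```python
import hashlib
from itertools import count

DIFFICULTY_BITS = 16  # leading zero bits required; a toy value

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def mint_stamp(challenge: str) -> int:
    """Client side: grind nonces until the hash is 'expensive enough'."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify_stamp(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, however long minting took."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

nonce = mint_stamp("GET /page HTTP/1.1")
assert verify_stamp("GET /page HTTP/1.1", nonce)
```

The asymmetry is the point: minting costs ~2^16 hashes on average per request, which is nothing for a human reader but adds up fast for a spider hammering millions of pages.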
Re: (Score:2)
Only serve pages encrypted, with a deliberately expensive decryption step?
Like a system where you quickly encrypt the page with a key like '123456', but only tell the client '1234XX', so they have to brute-force the few missing digits.
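That masked-key scheme can be sketched roughly like this. Everything here is invented for illustration (the toy XOR "cipher", the `MAGIC` marker the client uses to recognise a correct guess, the function names); a real version would use a proper cipher:

```python
import hashlib
from itertools import product
from string import digits

MAGIC = b"OK:"  # known plaintext marker so the client can tell when a guess is right

def xor_stream(key: str, data: bytes) -> bytes:
    # Toy symmetric cipher: XOR with a SHA-256-derived keystream.
    stream = hashlib.sha256(key.encode()).digest()
    while len(stream) < len(data):
        stream += hashlib.sha256(stream).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

def serve_page(page: bytes, key: str, masked_digits: int = 2):
    """Server side: encrypt cheaply with the full key, hand out a masked key."""
    ciphertext = xor_stream(key, MAGIC + page)
    hint = key[:-masked_digits] + "X" * masked_digits
    return ciphertext, hint

def client_decrypt(ciphertext: bytes, hint: str) -> bytes:
    """Client side: brute-force the masked digits (10^n tries at most)."""
    masked = hint.count("X")
    for guess in product(digits, repeat=masked):
        key = hint[:-masked] + "".join(guess)
        plaintext = xor_stream(key, ciphertext)
        if plaintext.startswith(MAGIC):
            return plaintext[len(MAGIC):]
    raise ValueError("no key found")

ct, hint = serve_page(b"<html>hello</html>", "123456")
assert client_decrypt(ct, hint) == b"<html>hello</html>"
```

One encryption for the server, up to a hundred trial decryptions for the client; tune the number of masked digits to set the cost.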
Re: (Score:2)
Pretty much. That is how Outlook used to do a "stamp", which was a proof of work, like guess the first n digits of a hash by trying random numbers.
One could also hand the browser ciphertext and an encryption key with a ton of rounds necessary to decrypt it, which might be more elegant, as the parent mentions. However, this means the web server would have to do the same amount of iterations to encrypt. Having the web server just verify that work was done is probably the best thing.
Did you see what she was wearing? (Score:3)
That web site had it coming, looking all hot and desirable with all that fresh... content. And did you see that robots.txt? She was asking for it. It's her fault.
The ecosystem is changing? (Score:2)
No. Say what it really means: the chatbots are *DELIBERATELY* changing names to DELIBERATELY violate the wishes of the website owners.
Re: The more furious they block AI scrapers now .. (Score:1)