
AI Crawlers Haven't Learned To Play Nice With Websites (theregister.com)
SourceHut, an open-source-friendly git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data. From a report: "SourceHut continues to face disruptions due to aggressive LLM crawlers," the biz reported Monday on its status page. "We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users."
SourceHut said it had deployed Nepenthes, a tar pit to catch web crawlers that scrape data primarily for training large language models, and noted that doing so might degrade access to some web pages for users. "We have unilaterally blocked several cloud providers, including GCP [Google Cloud] and [Microsoft] Azure, for the high volumes of bot traffic originating from their networks," the biz said, advising administrators of services that integrate with SourceHut to get in touch to arrange an exception to the blocking.
They don't know right from wrong (Score:5, Interesting)
AI companies are barely a step away from committing armed robbery just to feed their training datasets.
Re:They don't know right from wrong (Score:4, Interesting)
Didn't someone go to prison for scraping data from a web site then offering it to the public?
So why does AI get a pass?
Re:They don't know right from wrong (Score:5, Insightful)
Corporations can get away with this sort of stuff. Same goes for Facebook literally stalking you in ways that would be illegal if you as a private individual did what they do.
Aaron Swartz (Score:5, Insightful)
Aaron did it for the greater good which cannot be allowed. That sort of thing gets you hounded to death.
Re: (Score:3, Informative)
They redistribute it to the public for profit though.
https://blogs.microsoft.com/on... [microsoft.com]
We have incorporated filters and other technologies that are designed to reduce the likelihood that Copilots return infringing content.
Reduce the likelihood means they infringe sometimes.
So is it OK for me to go unpunished if I only break the law sometimes?
Re: (Score:2)
Re: (Score:3)
Re: (Score:3, Interesting)
My GF has a fake account on Linkedin.
I mean: it is not her account. It is an account "in her name", kind of a ghost account.
Astonishingly, her name is written "correctly". Which is interesting, as the name on her passport is "wrong": there is a mix-up of one vowel. That is understandable, since her name is Thai and it is an "unwritten" vowel. All her family has an "A" there, and she has an "O".
Her LinkedIn name uses the O, which is, well, astonishing. She is probably the only person in Thailand with that name.
Re: (Score:1)
No, the real reason, I think, is either (a) the crawlers don't bother checking whether the URL is already in Common Crawl, or (b) the website was built in such a way that it can only be crawled by something like Selenium-driven Chrome workers running on people's botnetted home PCs.
Re: (Score:1)
There is trillions of dollars at stake here (Score:3, Interesting)
We could radically restructure our civilization to prevent that, but frankly at this point I don't think there's a snowball's chance in hell. Old people are in charge, and old people don't want to do that. Speaking as one of the few old farts who would be on board for it, I can tell you there are damn few of us, and we suck at politics. And this is very much a political thing.
So getting back on track AI bots are going to steal and take and devour everything and anything they can get their hands on without any concern of what's legal let alone moral because of the insane amounts of money and power involved here.
Meanwhile old farts who don't like change will continue to deny what's happening. Some of them might even be lucky enough to retire or die before the worst of it hits. Good for them I guess.
Re:There is trillions of dollars at stake here (Score:5, Insightful)
Re: (Score:2)
It's both. It's a piece of capital that works to acquire other people's capital.
Fortunately it was recently ruled that AI-generated work is not eligible for copyright. Unfortunately I don't expect that to last =/
Re: (Score:1)
Capital is "those durable produced goods that are in turn used as productive inputs for further production," per Wikipedia.
It's still capital (Score:2)
Re:There is trillions of dollars at stake here (Score:5, Interesting)
AI is going to replace trillions of dollars of human labor and the output of that is going to belong to whoever owns the AI.
You could be forgiven for believing that in 2023. Things were new and seemed to be moving at a breakneck pace. There was a lot of excitement about what the very near future would bring. It's now 2025, and you've had more than enough time either to learn how the technology works or to notice that the miracles you expected to disrupt every industry two years ago never happened.
Even a surprising number of very qualified people got swept up in the excitement, convincing themselves that the magic of emergence would somehow overcome what are fairly obvious fundamental limitations. The truckloads of money being tossed around probably also helped a few folks overcome their initial misgivings...
While you expected a dramatic shift in the world order by now, I expected investor money to dry up. What neither one of us expected was that both would happen for an entirely different reason! I mean, of course, the mad king that has turned the US from the seemingly unshakable pillar of stability that held up the world's greatest period of prosperity ... into its greatest threat.
The only hope we have as a nation is to show the world that we are willing and able to deal with a threat like this quickly and without delay. While that won't erase the damage overnight, it just might be enough to save us from the worst of the consequences. We really, really, don't want to transition away from being the world's reserve currency. We need to display our reliability and stability now. That will mean his removal. There's no way around that. Unfortunately, that will either mean assassination or a sudden return to rationality on the right. The pain will get them there, it just might take longer than we have...
Re: (Score:2)
AI, as Implemented, is Massive Data Theft (Score:4, Insightful)
Re:AI, as Implemented, is Massive Data Theft (Score:4, Insightful)
Are we ... starting to root for RIAA and MPAA?
Re:AI, as Implemented, is Massive Data Theft (Score:5, Informative)
It's a complicated issue, we can have complicated opinions.
For example you could be in favor of reasonably limited copyright terms, but also be against mass scraping to replace human creativity.
Re: (Score:2)
Let *them* fight. Just stay out from underfoot when they do.
Re: (Score:3)
Re: (Score:3)
That's why open AI (not OpenAI) is important. The big players will sooner or later try to gatekeep AI. Make sure to have your local models before lobbying leads to a world where only big companies use AI, giving you only limited, paid access to some API that doesn't let you use AI to the potential the big players do.
Re: (Score:2)
There is no honorable AI implementation. AI, as used is nothing more than a copyright breaking data theft machine.
But it's profitable, so that makes it alright. Right? I mean, profit is the only measure of worth we accept, so all other concerns can just be swept away with the magic-erase power of Wall Street.
Re: (Score:1)
That is nonsense.
Most AI is trained by humans.
But that is not my point.
-------------
The copyright-breaking part, if any, is in text generation or image generation.
And currently, unless the laws change, it is not copyright infringement.
It might be morally flawed, but it is absolutely within the law to scrape the internet, use what you find to train an AI, and let that AI use whatever it learned to generate output. And that output is under the copyright of the person using the AI as a tool. Just like a 3D printer or a
Not necessarily even crawling (Score:2)
easy to mimic you (Score:2)
Re: (Score:2)
Can you reproduce it by reading "the entire site"? That does sound inefficient, and when I look at how fast, for example, Perplexity downloads pages to add them to its search results, I doubt they make more than 1-3 requests for it.
I could imagine some companies using the request as a starting point to (re)crawl the site for training, though. They now know that at least one user was interested in it.
Re: (Score:2)
Does Gemini have to read the entire site directly? Or do they get training data forwarded from the same processes which Google uses to crawl the web, which hmm I'm noticing by the way they don't seem to be offering cached pages any more. I bet that was getting crawled by competing AIs.
Ran Into It This Week (Score:5, Interesting)
My own web server suddenly had the load spike to 100 from its normal 5 or so and was having database errors due to too many connections. It turned out to be an AI crawler that wasn't identifying itself as a crawler and was hitting multiple times per second. I had to block several IP ranges to stop it.
The annoying part is that I don't even mind sharing my site's content with the world, including such crawlers; it's non-profit and I'm fine with it. Just don't destroy it for everyone in the process!
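Detecting that kind of unidentified, multiple-hits-per-second traffic from access-log data is mostly bookkeeping. A minimal sketch in Python (the function name, window, and threshold are made-up illustrations, not what the poster used):

```python
from collections import defaultdict, deque

# Hypothetical detector for the pattern described above: an IP making
# "multiple hits per second". Window and threshold are invented values.
def find_aggressive_ips(events, window=1.0, threshold=5):
    """events: iterable of (unix_timestamp, ip) pairs from an access log.
    Returns the set of IPs that made more than `threshold` requests
    within any `window`-second span."""
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(events):
        q = recent[ip]
        q.append(ts)
        while ts - q[0] > window:  # drop hits older than the window
            q.popleft()
        if len(q) > threshold:
            flagged.add(ip)
    return flagged
```

Anything this flags can then be fed into whatever blocking mechanism the server already has.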
Re: (Score:3)
Couldn't you solve that with rate limiting, if a goal is to permit crawlers? Or were the requests coming from many IPs even before you started blocking?
Re:Ran Into It This Week (Score:5, Informative)
The requests were coming from dozens of different IPs, all within certain ranges. That's why I blocked the whole ranges.
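For what it's worth, the range check itself is a one-liner with Python's standard `ipaddress` module; a sketch (the CIDRs below are documentation-reserved placeholder ranges from RFC 5737, not the networks actually blocked):

```python
import ipaddress

# Sketch of range-based blocking. These CIDRs are placeholders.
BLOCKED_RANGES = [ipaddress.ip_network(cidr)
                  for cidr in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    """True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```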
Re: (Score:3)
Yeah, I'd block them for that on principle no matter how I felt about them otherwise.
Re: Ran Into It This Week (Score:3)
Re: (Score:2)
That's exactly what it is. Different motive, though just as malicious, and in the end, it's all the same thing.
Re: Ran Into It This Week (Score:2)
We had traffic from tens of thousands of IPs over several days, all over the globe, all identifying with a Google Chrome user agent ... 2-3 requests from each. Very tough to catch, but it was clearly a bot. No resources loaded, just the HTML page. Difficult to block...
Re: (Score:2)
Be careful, sometimes it is a Google bot. But most of the time I see first the Google bot and then an iPhone user agent hitting pages. I think they are checking at the same time whether you serve different content to their bot and whether your site is mobile friendly, hence the request with a mobile user agent.
Re: (Score:2)
Even if you can, most server software doesn't do it by default and doesn't prominently document it. That's a problem that didn't come up often until recently. And now people would rather block the crawlers than rate limit them, which also doesn't produce much documentation on how to tame them instead of blocking them. The crawler operators could also try to rate limit their damn crawlers. It can't be that hard: if you crawl the complete web anyway, you can round-robin the requests between different sites instead.
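The per-host politeness being asked for really is a few lines of bookkeeping on the crawler's side. A minimal sketch (the class and method names are invented for illustration):

```python
import time
from collections import defaultdict

# Hypothetical courtesy throttle for a crawler: enforce a minimum
# delay per host, so fetches rotate across sites ("round robin")
# instead of hammering one site back to back.
class PerHostThrottle:
    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable for testing
        self.last_hit = defaultdict(lambda: float("-inf"))

    def ready(self, host):
        """True if enough time has passed since the last fetch of host."""
        return self.clock() - self.last_hit[host] >= self.min_interval

    def record(self, host):
        """Note that host was just fetched."""
        self.last_hit[host] = self.clock()
```

A crawl loop would skip any host where `ready()` is False and move on to the next site in its queue.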
Re: (Score:2)
I had the exact same thing happen, except my hosted server is pretty robust. It would have handled one bot crawling the site, but not three or four at the same time.
At least all but one of them obeyed the robots.txt. The remaining one will never access my server again.
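The robots.txt check that the well-behaved crawlers performed is available in Python's standard library; a sketch with a made-up robots.txt and bot name:

```python
from urllib.robotparser import RobotFileParser

# The check a polite crawler does before fetching. The robots.txt
# content and the "MyBot" user agent are invented for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/data"))  # False
print(rp.crawl_delay("MyBot/1.0"))                                    # 10
```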
(I also had to block every Chinese IP address I could identify, but that was actually a malicious DOS attack.)
can this be done (Score:1)
Re: (Score:2)
Whenever an AI scraper starts harvesting data, can it be detected, and can the server being harvested start feeding the AI useless trash data like lorem ipsum or some other sort of garbage? Do that enough times and maybe the AI owners will quit, or start respecting other people's servers.
The hallucinations would certainly become more entertaining for a brief flash of a moment, if nothing else.
Re: (Score:2)
Re: (Score:2)
You're better off just blocking them. Serving lorem ipsum still costs you bandwidth and sometimes CPU (people have already tried feeding them generated plausible-looking text). And don't get your hopes up: you won't ruin the next model by serving bullshit. The web has so many spam sites that if spam content could ruin the models, they would already be crap.
Why we can't have nice things (Score:3)
They should pay for hosting costs (Score:2)
Feed the crawlers BS (Score:1)
Identify all the crawlers and purposefully feed them a ton of BS and gibberish. Apart from using up your bandwidth, this is one effective way to fight back. If more people do this, we'll see a real result.
Re: (Score:3)
Identify all the crawlers and purposefully feed them a ton of BS and gibberish. Apart from using up your bandwidth, this is one effective way to fight back. If more people do this, we'll see a real result.
That is what Nepenthes (mentioned in the summary) attempts to do (although it feeds the data slowly, so it may not itself use much of your bandwidth). While such solutions can slow some of the crawling, the companies spinning up additional crawlers, plus all the other new businesses starting to crawl, are expanding faster than the tar pits can capture them.
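The tar-pit trick itself is easy to illustrate with a toy generator (this is not Nepenthes' actual implementation, just the shape of the idea: endless slow pages whose only links lead deeper into the pit):

```python
import itertools
import random
import time

# Toy tar-pit sketch: serve an endless chain of junk-filled pages so a
# misbehaving crawler wastes its time there instead of on real content.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]

def tarpit_pages(delay=0.0, seed=0):
    """Yield an infinite sequence of pages; each links only to the next."""
    rng = random.Random(seed)
    for n in itertools.count(1):
        time.sleep(delay)  # a real tar pit trickles bytes out slowly
        body = " ".join(rng.choice(WORDS) for _ in range(20))
        yield f'<p>{body}</p><a href="/pit/{n}">next page</a>'
```

A crawler that blindly follows every link never escapes `/pit/`, while the `delay` keeps its connection tied up at almost no cost to the server.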
The problem is the competition (Score:3)
The problem is not individual crawlers. The big companies even respect robots.txt perfectly well (though some people claim otherwise because of faked user agents). The problem is that there are tens of companies building their own datasets, and your site is crawled by all of them. And that boils down to the companies not sharing datasets: first because of competition, and second because copyright doesn't allow them to.
You can use the TDM (text and data mining) exception to train your AI, but if you give others access to your dataset, you are providing access to the data in a way not covered by TDM. Maybe if they had some kind of agreement that the receiving party also uses the data only for TDM they could share, but who would get them to invest the time to negotiate something like that, when they can just run their own crawler, with fine control over their own dataset, hoping it will be better, larger, and more recent than the competition's.
LOL Headline (Score:2, Insightful)
"AI Crawlers Haven't Learned To Play Nice With Websites"
More like:
"AI Crawlers Have Been Coded Explicitly To Not Play Nice With Websites"
There's a tool for that (Score:2)
Don't block them, poison them (Score:2)
Instead of trying to block them, if you can identify the traffic then use that to serve junk content, preferably generated by LLMs.
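One cheap way to produce such junk, far cheaper per request than calling an LLM, is a tiny Markov chain; a hedged sketch (the function names are invented):

```python
import random

# Sketch of "poisoning" identified scraper traffic: a word-level Markov
# chain stands in for LLM-generated filler, since whatever is served
# per-request has to cost the defender almost nothing.
def build_chain(text):
    """Map each word to the list of words that follow it in the seed text."""
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain, start, length=30, seed=0):
    """Generate up to `length` words of plausible-looking nonsense."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Seeded with the site's own real pages, the output looks statistically similar to genuine content while carrying no information, which is the point.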