
AI Crawlers Haven't Learned To Play Nice With Websites (theregister.com)
SourceHut, an open-source-friendly git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data. From a report: "SourceHut continues to face disruptions due to aggressive LLM crawlers," the biz reported Monday on its status page. "We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users."
SourceHut said it had deployed Nepenthes, a tar pit to catch web crawlers that scrape data primarily for training large language models, and noted that doing so might degrade access to some web pages for users. "We have unilaterally blocked several cloud providers, including GCP [Google Cloud] and [Microsoft] Azure, for the high volumes of bot traffic originating from their networks," the biz said, advising administrators of services that integrate with SourceHut to get in touch to arrange an exception to the blocking.
They don't know right from wrong (Score:5, Interesting)
AI companies are barely a step away from committing armed robbery just to feed their training datasets.
Re:They don't know right from wrong (Score:4, Interesting)
Didn't someone go to prison for scraping data from a web site then offering it to the public?
So why does AI get a pass?
Re:They don't know right from wrong (Score:5, Insightful)
Corporations can get away with this sort of stuff. Same goes for Facebook literally stalking you in ways that would be illegal if you as a private individual did what they do.
Aaron Swartz (Score:5, Insightful)
Aaron did it for the greater good which cannot be allowed. That sort of thing gets you hounded to death.
Re: (Score:3, Informative)
They redistribute it to the public for profit though.
https://blogs.microsoft.com/on... [microsoft.com]
We have incorporated filters and other technologies that are designed to reduce the likelihood that Copilots return infringing content.
Reduce the likelihood means they infringe sometimes.
So is it OK for me to go unpunished if I only break the law sometimes?
Re: (Score:2)
Re: (Score:3)
Re: (Score:3, Interesting)
My GF has a fake account on Linkedin.
I mean: it is not her account. It is an account "in her name", kind of a ghost account.
Astonishingly, her name is written "correctly". Which is interesting, as the name on her passport is "wrong": there is a mix-up of one vowel. That is understandable, since her name is Thai and it is an "unwritten" vowel. All her family has an "A" there, and she has an "O".
Her LinkedIn name uses the O, which is, well, astonishing. She is probably the only person in Thailand with that name.
Re: (Score:1)
No, the real reason, I think, is either (a) the crawlers don't bother checking whether the URL is already in Common Crawl, or (b) the website was built in such a way that it can only be crawled by something like Selenium-driven Chrome workers running on people's botnetted home PCs.
Re: (Score:1)
There is trillions of dollars at stake here (Score:3, Interesting)
We could radically restructure our civilization to prevent that, but frankly at this point I don't think there's a snowball's chance in hell. Old people are in charge, and old people don't want to do that. Speaking as one of the few old farts who would be on board for it, I can tell you there are damn few of us, and we suck at politics. And this is very much a political thing.
So getting back on track AI bots are going to steal and take and devour everything and anything they can get their hands on without any concern of what's legal let alone moral because of the insane amounts of money and power involved here.
Meanwhile old farts who don't like change will continue to deny what's happening. Some of them might even be lucky enough to retire or die before the worst of it hits. Good for them I guess.
Re:There is trillions of dollars at stake here (Score:5, Insightful)
Re: (Score:2)
It's both. It's a piece of capital that works to acquire other people's capital.
Fortunately it was recently ruled that AI-generated work is not eligible for copyright. Unfortunately I don't expect that to last =/
Re: (Score:1)
Capital is "those durable produced goods that are in turn used as productive inputs for further production," per Wikipedia.
It's still capital (Score:2)
Re:There is trillions of dollars at stake here (Score:5, Interesting)
AI is going to replace trillions of dollars of human labor and the output of that is going to belong to whoever owns the AI.
You could be forgiven for believing that in 2023. Things were new and seemed to be moving at a breakneck pace. There was a lot of excitement about what the very near future would bring. It's now 2025, and you've had more than enough time either to learn how the technology works or to notice that the miracles you expected to disrupt every industry two years ago never happened.
Even a surprising number of very qualified people got swept up in the excitement, convincing themselves that the magic of emergence would somehow overcome what are fairly obvious fundamental limitations. The truckloads of money being tossed around probably also helped a few folks overcome their initial misgivings...
While you expected a dramatic shift in the world order by now, I expected investor money to dry up. What neither one of us expected was that both would happen for an entirely different reason! I mean, of course, the mad king that has turned the US from the seemingly unshakable pillar of stability that held up the world's greatest period of prosperity ... into its greatest threat.
The only hope we have as a nation is to show the world that we are willing and able to deal with a threat like this quickly and without delay. While that won't erase the damage overnight, it just might be enough to save us from the worst of the consequences. We really, really, don't want to transition away from being the world's reserve currency. We need to display our reliability and stability now. That will mean his removal. There's no way around that. Unfortunately, that will either mean assassination or a sudden return to rationality on the right. The pain will get them there, it just might take longer than we have...
Re: (Score:2)
AI, as Implemented, is Massive Data Theft (Score:4, Insightful)
Re:AI, as Implemented, is Massive Data Theft (Score:4, Insightful)
Are we ... starting to root for RIAA and MPAA?
Re:AI, as Implemented, is Massive Data Theft (Score:5, Informative)
It's a complicated issue, we can have complicated opinions.
For example you could be in favor of reasonably limited copyright terms, but also be against mass scraping to replace human creativity.
Re: (Score:2)
Let *them* fight. Just stay out from underfoot when they do.
Re: (Score:3)
Re: (Score:3)
That's why open AI (not OpenAI) is important. The big players will sooner or later try to gatekeep AI. Make sure to have your local models before lobbying leads to a world where only big companies use AI, giving you only limited, paid access to some API that doesn't let you use AI to the potential the big players do.
Re: (Score:2)
There is no honorable AI implementation. AI, as used is nothing more than a copyright breaking data theft machine.
But it's profitable, so that makes it alright. Right? I mean, profit is the only measure of worth we accept, so all other concerns can just be swept away with the magic-erase power of Wall Street.
Re: (Score:1)
That is nonsense.
Most AI is trained by humans.
But that is not my point.
-------------
The copyright-breaking part, if any, is in text generation or image generation.
And currently, unless the laws change, it is not copyright infringement.
It might be morally flawed, but it is absolutely within the law to scrape the internet, use what you find to train an AI, and let that AI use whatever it learned to generate output. And that output is under the copyright of the person using the AI as a tool. Just like a 3D printer or a
Not necessarily even crawling (Score:2)
easy to mimic you (Score:2)
Re: (Score:2)
Can you reproduce it by reading "the entire site"? That does sound inefficient, and when I look at how fast, for example, Perplexity downloads pages to add them to its search results, I doubt they make more than 1-3 requests for it.
I could imagine some companies using the request as a starting point to (re)crawl the site for training, though. They now know that at least one user was interested in it.
Re: (Score:2)
Does Gemini have to read the entire site directly? Or do they get training data forwarded from the same processes which Google uses to crawl the web, which hmm I'm noticing by the way they don't seem to be offering cached pages any more. I bet that was getting crawled by competing AIs.
Ran Into It This Week (Score:5, Interesting)
My own web server suddenly had the load spike to 100 from its normal 5 or so and was having database errors due to too many connections. It turned out to be an AI crawler that wasn't identifying itself as a crawler and was hitting multiple times per second. I had to block several IP ranges to stop it.
The annoying part is that I don't even mind sharing my site's content with the world, including such crawlers; it's non-profit and I'm fine with it. Just don't destroy it for everyone in the process!
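Detecting that kind of unidentified, multiple-hits-per-second traffic from access-log data is mostly bookkeeping. A minimal sketch in Python (the function name, window, and threshold are made-up illustrations, not what the poster used):

```python
from collections import defaultdict, deque

# Hypothetical detector for the pattern described above: an IP making
# "multiple hits per second". Window and threshold are invented values.
def find_aggressive_ips(events, window=1.0, threshold=5):
    """events: iterable of (unix_timestamp, ip) pairs from an access log.
    Returns the set of IPs that made more than `threshold` requests
    within any `window`-second span."""
    recent = defaultdict(deque)   # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in sorted(events):
        q = recent[ip]
        q.append(ts)
        while ts - q[0] > window:  # drop hits older than the window
            q.popleft()
        if len(q) > threshold:
            flagged.add(ip)
    return flagged
```

Anything this flags can then be fed into whatever blocking mechanism the server already has.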
Re: (Score:3)
Couldn't you solve that with rate limiting, if a goal is to permit crawlers? Or were the requests coming from many IPs even before you started blocking?
Re:Ran Into It This Week (Score:5, Informative)
The requests were coming from dozens of different IPs, all within certain ranges. That's why I blocked the whole ranges.
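For what it's worth, the range check itself is a one-liner with Python's standard `ipaddress` module; a sketch (the CIDRs below are documentation-reserved placeholder ranges from RFC 5737, not the networks actually blocked):

```python
import ipaddress

# Sketch of range-based blocking. These CIDRs are placeholders.
BLOCKED_RANGES = [ipaddress.ip_network(cidr)
                  for cidr in ("203.0.113.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    """True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)
```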
Re: (Score:3)
Yeah, I'd block them for that on principle no matter how I felt about them otherwise.
Re: Ran Into It This Week (Score:3)
Re: (Score:2)
That's exactly what it is. Different motive, though just as malicious, and in the end, it's all the same thing.
Re: Ran Into It This Week (Score:2)
We had traffic from tens of thousands of IPs over several days, all over the globe, all identifying with a Google Chrome user agent ... 2-3 requests from each. Very tough to catch, but it was clearly a bot. No resources loaded, just the HTML page. Difficult to block...
Re: (Score:2)
Be careful, sometimes it is a Google bot. But most of the time I see first the Google bot and then an iPhone user agent hitting pages. I think they are checking at the same time whether you serve different content to their bot and whether your site is mobile friendly, hence the request with a mobile user agent.
Re: (Score:2)
Even if you can, most server software doesn't do it by default and doesn't prominently document it. That's a problem that didn't come up often until recently. And now people would rather block the crawlers than rate limit them, which also doesn't produce much documentation on how to tame them instead of blocking them. The crawler operators could also try to rate limit their damn crawlers. It can't be that hard: if you crawl the complete web anyway, you can round-robin the requests between different sites instead.
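The per-host politeness being asked for really is a few lines of bookkeeping on the crawler's side. A minimal sketch (the class and method names are invented for illustration):

```python
import time
from collections import defaultdict

# Hypothetical courtesy throttle for a crawler: enforce a minimum
# delay per host, so fetches rotate across sites ("round robin")
# instead of hammering one site back to back.
class PerHostThrottle:
    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable for testing
        self.last_hit = defaultdict(lambda: float("-inf"))

    def ready(self, host):
        """True if enough time has passed since the last fetch of host."""
        return self.clock() - self.last_hit[host] >= self.min_interval

    def record(self, host):
        """Note that host was just fetched."""
        self.last_hit[host] = self.clock()
```

A crawl loop would skip any host where `ready()` is False and move on to the next site in its queue.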
Re: (Score:2)
I had the exact same thing happen, except my hosted server is pretty robust. It would have handled one bot crawling the site, but not three or four at the same time.
At least all but one of them obeyed the robots.txt. The remaining one will never access my server again.
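The robots.txt check that the well-behaved crawlers performed is available in Python's standard library; a sketch with a made-up robots.txt and bot name:

```python
from urllib.robotparser import RobotFileParser

# The check a polite crawler does before fetching. The robots.txt
# content and the "MyBot" user agent are invented for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyBot/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot/1.0", "https://example.com/private/data"))  # False
print(rp.crawl_delay("MyBot/1.0"))                                    # 10
```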
(I also had to block every Chinese IP address I could identify, but that was actually a malicious DOS attack.)
can this be done (Score:1)
Re: (Score:2)
Whenever an AI scraper starts harvesting data, can it be detected, and can the server being harvested start feeding the AI useless trash data like lorem ipsum or some other sort of garbage? Do that enough times and maybe the AI owners will quit, or start respecting other people's servers.
The hallucinations would certainly become more entertaining for a brief flash of a moment, if nothing else.
Re: (Score:2)
Re: (Score:2)
You're better off just blocking them. Serving lorem ipsum still costs you bandwidth and sometimes CPU (people have already tried feeding them generated plausible-looking text). And don't get your hopes up: you won't ruin the next model by serving bullshit. The web has so many spam sites that if spam content could ruin the models, they would already be crap.
Why we can't have nice things (Score:3)
They should pay for hosting costs (Score:2)
Feed the crawlers BS (Score:1)
Identify all the crawlers and purposefully feed them a ton of BS and gibberish. Apart from using up your bandwidth, this is one effective way to fight back. If more people do this, we'll see a real result.
Re: (Score:3)
Identify all the crawlers and purposefully feed them a ton of BS and gibberish. Apart from using up your bandwidth, this is one effective way to fight back. If more people do this, we'll see a real result.
That is what Nepenthes (mentioned in the summary) attempts to do (although it feeds the data slowly, so it may not itself use much of your bandwidth). While such solutions can slow some of the crawling, the companies spinning up additional crawlers, plus all the other new businesses starting to crawl, are expanding faster than the tar pits can capture them.
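The tar-pit trick itself is easy to illustrate with a toy generator (this is not Nepenthes' actual implementation, just the shape of the idea: endless slow pages whose only links lead deeper into the pit):

```python
import itertools
import random
import time

# Toy tar-pit sketch: serve an endless chain of junk-filled pages so a
# misbehaving crawler wastes its time there instead of on real content.
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]

def tarpit_pages(delay=0.0, seed=0):
    """Yield an infinite sequence of pages; each links only to the next."""
    rng = random.Random(seed)
    for n in itertools.count(1):
        time.sleep(delay)  # a real tar pit trickles bytes out slowly
        body = " ".join(rng.choice(WORDS) for _ in range(20))
        yield f'<p>{body}</p><a href="/pit/{n}">next page</a>'
```

A crawler that blindly follows every link never escapes `/pit/`, while the `delay` keeps its connection tied up at almost no cost to the server.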
The problem is the competition (Score:3)
The problem is not individual crawlers. The big companies even respect robots.txt perfectly well (though some people claim otherwise because of faked user agents). The problem is that there are tens of companies building their own datasets, and your site is crawled by all of them. And that boils down to the companies not sharing datasets: first because of competition, and second because copyright doesn't allow them to.
You can use the TDM (text and data mining) exception to train your AI, but if you give others access to your dataset, you are providing access to the data in a way not covered by TDM. Maybe if they had some kind of agreement that the receiving party also uses the data only for TDM they could share, but who would get them to invest the time to negotiate something like that, when they can just run their own crawler, with fine control over their own dataset, hoping it will be better, larger, and more recent than the competition's.
LOL Headline (Score:2, Insightful)
"AI Crawlers Haven't Learned To Play Nice With Websites"
More like:
"AI Crawlers Have Been Coded Explicitly To Not Play Nice With Websites"
There's a tool for that (Score:2)
Don't block them, poison them (Score:2)
Instead of trying to block them, if you can identify the traffic then use that to serve junk content, preferably generated by LLMs.
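One cheap way to produce such junk, far cheaper per request than calling an LLM, is a tiny Markov chain; a hedged sketch (the function names are invented):

```python
import random

# Sketch of "poisoning" identified scraper traffic: a word-level Markov
# chain stands in for LLM-generated filler, since whatever is served
# per-request has to cost the defender almost nothing.
def build_chain(text):
    """Map each word to the list of words that follow it in the seed text."""
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain, start, length=30, seed=0):
    """Generate up to `length` words of plausible-looking nonsense."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)
```

Seeded with the site's own real pages, the output looks statistically similar to genuine content while carrying no information, which is the point.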