

Are AI Web Crawlers 'Destroying Websites' In Their Hunt for Training Data? (theregister.com)
"AI web crawlers are strip-mining the web in their perpetual hunt for ever more content to feed into their Large Language Model mills," argues Steven J. Vaughan-Nichols at the Register.
And "when AI searchbots, with Meta (52% of AI searchbot traffic), Google (23%), and OpenAI (20%) leading the way, clobber websites with as much as 30 Terabits in a single surge, they're damaging even the largest companies' site performance..." How much traffic do they account for? According to Cloudflare, a major content delivery network (CDN) force, 30% of global web traffic now comes from bots. Leading the way and growing fast? AI bots... Anyone who runs a website, though, knows there's a huge, honking difference between the old-style crawlers and today's AI crawlers. The new ones are site killers. Fastly warns that they're causing "performance degradation, service disruption, and increased operational costs." Why? Because they're hammering websites with traffic spikes that can reach up to ten or even twenty times normal levels within minutes.
Moreover, AI crawlers are much more aggressive than standard crawlers. As the web hosting company InMotion Hosting notes, they also tend to disregard crawl delays or bandwidth-saving guidelines, extract full page text, and sometimes attempt to follow dynamic links or scripts. The result? If you're using a shared server for your website, as many small businesses do, even if your site isn't being shaken down for content, other sites on the same hardware with the same Internet pipe may be getting hit. This means your site's performance drops through the floor even if an AI crawler isn't raiding your website...
AI crawlers don't direct users back to the original sources. They kick our sites around, return nothing, and we're left trying to decide how we're to make a living in the AI-driven web world. Yes, of course, we can try to fend them off with logins, paywalls, CAPTCHA challenges, and sophisticated anti-bot technologies. You know one thing AI is good at? It's getting around those walls. As for robots.txt files, the old-school way of blocking crawlers? Many — most? — AI crawlers simply ignore them... There are efforts afoot to supplement robots.txt with llms.txt files. This is a proposed standard to provide LLM-friendly content that LLMs can access without compromising the site's performance. Not everyone is thrilled with this approach, though, and it may yet come to nothing.
In the meantime, to combat excessive crawling, some infrastructure providers, such as Cloudflare, now offer default bot-blocking services to block AI crawlers and provide mechanisms to deter AI companies from accessing their data.
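For reference, the robots.txt approach mentioned above looks something like the sketch below: a purely advisory file listing a few of the better-known AI crawler user-agent tokens (GPTBot, ClaudeBot, CCBot, Google-Extended). Which tokens you need, and whether a given bot honors them at all, varies, so treat this as an illustration rather than a complete list.

# robots.txt - advisory only; compliant AI crawlers will skip the site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Everyone else: crawl normally, but politely
User-agent: *
Crawl-delay: 10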
And "when AI searchbots, with Meta (52% of AI searchbot traffic), Google (23%), and OpenAI (20%) leading the way, clobber websites with as much as 30 Terabits in a single surge, they're damaging even the largest companies' site performance..." How much traffic do they account for? According to Cloudflare, a major content delivery network (CDN) force, 30% of global web traffic now comes from bots. Leading the way and growing fast? AI bots... Anyone who runs a website, though, knows there's a huge, honking difference between the old-style crawlers and today's AI crawlers. The new ones are site killers. Fastly warns that they're causing "performance degradation, service disruption, and increased operational costs." Why? Because they're hammering websites with traffic spikes that can reach up to ten or even twenty times normal levels within minutes.
Moreover, AI crawlers are much more aggressive than standard crawlers. As the InMotionhosting web hosting company notes, they also tend to disregard crawl delays or bandwidth-saving guidelines and extract full page text, and sometimes attempt to follow dynamic links or scripts. The result? If you're using a shared server for your website, as many small businesses do, even if your site isn't being shaken down for content, other sites on the same hardware with the same Internet pipe may be getting hit. This means your site's performance drops through the floor even if an AI crawler isn't raiding your website...
AI crawlers don't direct users back to the original sources. They kick our sites around, return nothing, and we're left trying to decide how we're to make a living in the AI-driven web world. Yes, of course, we can try to fend them off with logins, paywalls, CAPTCHA challenges, and sophisticated anti-bot technologies. You know one thing AI is good at? It's getting around those walls. As for robots.txt files, the old-school way of blocking crawlers? Many — most? — AI crawlers simply ignore them... There are efforts afoot to supplement robots.txt with llms.txt files. This is a proposed standard to provide LLM-friendly content that LLMs can access without compromising the site's performance. Not everyone is thrilled with this approach, though, and it may yet come to nothing.
In the meantime, to combat excessive crawling, some infrastructure providers, such as Cloudflare, now offer default bot-blocking services to block AI crawlers and provide mechanisms to deter AI companies from accessing their data.
Re: (Score:2, Insightful)
Re: (Score:3)
Did Cloudflare say that AI bots were destroying websites? If so, I missed it. Perhaps you could show me that part?
Re:Destroying Websites? (Score:5, Insightful)
A good way to solve this would be to use the Google antitrust trial to force the creation of a single crawler for the entire web which puts all of the results into a single, central repository. Everyone could then use that central repository, which would charge its users enough in fees to break even on the costs. The antitrust settlement would require Google to construct this central repository. Once the repository exists, all crawling outside of this centralized crawling would be blocked by coordinated ISP action (i.e., go use the central repository).
Re: (Score:1)
+1
Re: (Score:3)
Consider how this would work for a new AI entrant. They'd pay to join the repository collective, and the repository would ship them an array of disks with exabytes of data to get them started. No need to crawl at all. Over time that exabyte array would be remotely updated with newly crawled content. Once they have the exabytes of data in their data center, they can copy it out at very high speeds. A complete snapshot of the entire internet is zettabytes; I don't believe anyone has a complete snapshot.
Re: (Score:3)
Before you run off saying 'sign me up', note that each empty exabyte of storage costs about $100M to buy.
Re: (Score:3)
But then you have the problem of who controls the single repository.
Who ensures that no content is censored, or that access isn't denied?
Re: (Score:2)
That would be a good idea, agreed. But because it would have to be an international thing, it will probably not happen.
Re: (Score:2)
The thing is crawlers don't just copy the entire internet verbatim. They add *certain* content to databases in *certain* ways which may or may not result in a suitable use case for the end user.
The first thing that will happen is someone will say "oh this crawler doesn't suit my needs" and then start doing their own, just like DuckDuckGo ran their own crawler despite serving up search results from Bing's crawler.
Re: (Score:2)
This universal crawler would just copy everything verbatim and compress it with standard compression. It would then be up to the subscribers to process it into their own internal formats.
This is not a cheap thing to build, the central repository is going to need several billion dollars worth of storage. So the fees to join the group will be in the $250M range or more. But you'd have to spend that much on your own crawler so there's no loss.
Re: (Score:2)
> puts all of the results in to a single, central repository
There already is such a repository - it's decentralised, and free to use. It's called _the web_. It's going to be a tough sell to say we need to make a copy of it so people can use the copy.
However, the core of the idea has some legs... perhaps we can elect to have 'crawl domains'. So www.example.com is my website, but you get to crawl bots.example.com instead. It's a copy I've made, or it's a low power server which responds quite slowly or what
Re: (Score:2)
I bet it could, as tech advances, fit in a box you could even show to people (if the elders of the Internet agree of course)
Re: (Score:2)
Re: Destroying Websites? (Score:5, Informative)
They are more aggressive than standard bots, and often follow links in a pathological way.
We've had to cut multiple bots off that weren't following robots.txt recommendations.
Balancing performance for real users is a challenge when the bots go overly aggressive and the tools for managing them aren't quite there yet.
Economic indicator (Score:2)
Conjecture 1:
- Spread out a few hundred content-rich, useful web sites, with new content added daily, hosted around the world.
- Post the traffic stats, especially bot/scraper-like traffic, on a daily basis.
- When the bot traffic drops off substantially, the LLM fueled race is near its end.
- Sell all of your high flying AI industry hyped up chipmaker and other AI stocks.
Conjecture 2:
- Due to demographics (wave of 30+ people aging out of the dating/child having years) and negative interactions for any social
Re: Destroying Websites? (Score:2)
I have a site with thousands of products and massive search space due to the number of attributes customers can search on.
The bots are completely stupid about product search, iterating over criteria combinations which aren't very useful overall.
You can efficiently grab all our products if a small amount of intelligence is applied, and you'd have negligible impact on the server. Instead the bots try to search using each and every possible attribute and therefore blow the cache. I could write a bot to fetc
Re:Destroying Websites? (Score:5, Interesting)
As someone who's actively fighting this type of traffic, let me share my perspective.
I have been running a small-ish website with user peaks at around 50 requests per second. Over the last couple of months, my site has been getting hit with loads of up to 300 requests per second by these kinds of bots. They're using distributed IPs and random user agents, making them hard to block.
My site has a lot of data and pages to scan, and despite an appropriate robots.txt, these things ignore it and just scan endlessly. My website isn't designed to be for profit; I do this more or less as a hobby, and it therefore has trouble handling a nearly 10x increase in traffic. My DNS costs have gone up significantly, with 150 or so million DNS requests being made this month.
The net effect is that my website slows down and becomes unresponsive under these scans, and I am looking at spending more money just to manage this excess traffic.
Is it destroying my site? No, not really. But it absolutely increases costs and forces me to spend more money and hours on infrastructure than I would have needed to. These things are hurting smaller communities, imposing significant cost increases on people who may have difficulty covering them, so calling it bullshit isn't exactly accurate.
Re: Destroying Websites? (Score:2)
I know you're saying it's coming from lots of IP addresses, but I wonder if anyone has looked into geofencing to throttle any requests coming out of major data center cities. Normal users would get full speed access, but anyone in the valley or in Ashburn, VA would experience difficulty scraping.
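A very rough sketch of that idea, in Python, with made-up placeholder networks standing in for data-center address ranges (a real deployment would use published cloud-provider ranges or ASN data, and would live in the load balancer rather than the application):

import ipaddress
import time

# Placeholder CIDRs standing in for data-center networks (hypothetical values).
DATACENTER_NETS = [ipaddress.ip_network(n) for n in ("203.0.113.0/24", "198.51.100.0/24")]

def is_datacenter(addr: str) -> bool:
    # True if the client address falls inside any suspected data-center range.
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_NETS)

def handle_request(client_ip: str) -> None:
    # Throttle suspected data-center traffic; everyone else gets full speed.
    if is_datacenter(client_ip):
        time.sleep(2.0)  # crude tarpit: make bulk scraping slow, not impossible
    # ... normal request handling continues here ...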
Re: Destroying Websites? (Score:4, Interesting)
It's not just data centres; many of the requests come from regular broadband IP addresses. I think they're using the "services" of bottom feeders like Scraper API [scraperapi.com], or buying from the authors of malicious web browser extensions [arstechnica.com].
Re: (Score:2)
Re: (Score:2)
It's really no holds barred, with the amount of money we're talking about. This is an industry that's spent the last 3, 4 decades telling us how terrible unauthorized copying and systems access are. (Don't copy that floppy!) Those rules get thrown right out the window when they're eyeing the types of cash they think AI will bring.
Working with malware distributors and botnet admins would not surprise me at all. Particularly in this Project 2025 era where the government's been purchased, whole-hog, by tech br
Re: (Score:3)
Re: (Score:2)
"Major data center cities" are also generally major population cities, with a minor in geography. Rate-limiting by GeoIP would also rate-limit big chunks of real users. It might provide a marginal windfall to users outside of those locations - at least until the bots start using rural IP addresses.
There is a boom in rural areas spinning up data centers. That is to say, any random small city may now or in the near future suddenly become a "major data center city" at the behest of a singular tech bro.
Re: Destroying Websites? (Score:2)
""Major data center cities" are also generally major population cities"
Yes and no. Data centers are usually NEAR cities, but the economics of data centers keep them out in the suburbs. Data centers are more likely to be surrounded by fields than by high-rise apartments.
But it sounds like from other comments that the requests are actually much more diffuse.
Re:Destroying Websites? (Score:5, Interesting)
Anubis [github.com] has worked well for us to get rid of most of the scrapers from our wiki, including the ones faking regular user agents.
Re: (Score:2)
Anubis has the side effect that it stops the internet archive crawler.
Re: (Score:2)
Anubis has the side effect that it stops the internet archive crawler.
Even though it whitelists [github.com] the IA crawlers by default?
Re: (Score:2)
I am not sure what they are whitelisting, but I've seen pages in the archive that only show the Anubis girl. Maybe it was from an older version, but I saw the page less than 5 months ago.
Re: (Score:2)
Re:Destroying Websites? (Score:4, Interesting)
Someone should build an AI tool to detect these AI web crawlers and then send back corrupted information (not misspelling but actual falsehoods). The only way to stop the unneighborly actions is to eliminate the expectation of a reward.
Re: (Score:3)
Cloudflare built it, and it's called "AI Labyrinth". I'd like to deploy a similar webpage generator on my Apache server, without Cloudflare. If you know of any such scripts, link me and I'll check them out.
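I'm not aware of a standard drop-in for plain Apache, but the core of a labyrinth-style generator is small. Here's a hedged sketch (plain Python WSGI, hypothetical paths, not Cloudflare's actual implementation): every URL under the maze yields a deterministic page of junk text plus links deeper into the maze, so a misbehaving crawler just keeps walking in circles. You'd reverse-proxy only the bot-bait paths to it from Apache.

import hashlib
import random
from wsgiref.simple_server import make_server

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod".split()

def page(path):
    # Seed the RNG from the path so the same URL always returns the same page.
    rng = random.Random(hashlib.sha256(path.encode()).hexdigest())
    text = " ".join(rng.choice(WORDS) for _ in range(200))
    links = " ".join('<a href="/maze/%d">more</a>' % rng.randrange(10**9) for _ in range(10))
    return "<html><body><p>%s</p>%s</body></html>" % (text, links)

def app(environ, start_response):
    body = page(environ.get("PATH_INFO", "/")).encode()
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]

if __name__ == "__main__":
    # Serve locally; proxy /maze/ to this from the real web server.
    make_server("127.0.0.1", 8080, app).serve_forever()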
Re: (Score:2)
In the end I shut it down, as I would rather just block them to begin with instead of wasting CPU cycles for no real gain on my part.
Re: (Score:1)
Someone should build an AI tool to detect these AI web crawlers and then send back corrupted information (not misspelling but actual falsehoods). The only way to stop the unneighborly actions is to eliminate the expectation of a reward.
There's Nepenthes [zadzmo.org], and it's open source, though it sends back slow, Markov-chain nonsense rather than actual falsehoods.
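The Markov-chain half of that idea fits in a few lines. A rough, hypothetical illustration (not Nepenthes' actual code): build a word-transition table from some seed text, then drip plausible-looking gibberish out one word at a time so the crawler's connection stays tied up.

import random
import time
from collections import defaultdict

SEED_TEXT = "the quick brown fox jumps over the lazy dog and the dog naps under the fox"

# First-order word-transition table built from the seed text.
words = SEED_TEXT.split()
chain = defaultdict(list)
for a, b in zip(words, words[1:]):
    chain[a].append(b)

def babble(n_words=500, delay=0.5):
    # Yield Markov gibberish one word at a time, slowly (the tarpit part).
    word = random.choice(words)
    for _ in range(n_words):
        yield word + " "
        time.sleep(delay)
        word = random.choice(chain[word]) if chain[word] else random.choice(words)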
Re: Destroying Websites? (Score:2)
Re: (Score:3)
The amount of traffic they apply to a given web site at a time is a denial of service attack. I've seen it personally on a site I host, and it took the site down until I blocked it.
You should learn what you're talking about before you talk.
Re: (Score:1)
Why don't you educate me? What are you hosting that they couldn't consume and move on in under an hour?
Even a Raspberry Pi could upload the English-language Wikipedia (25 GB) to a crawler in half an hour.
Re: (Score:2)
The site I host was showing about a hundred times its normal traffic, enough to overload the processor because of the database calls, and that lasted for over a week - until I started blocking entire countries. Most of it was AI bots, according to the user agents, and that's just the ones that were accurately labeled. And that was what was left after Cloudflare's defense.
Masturbate as furiously as you want, retard, it was a denial of service attack that succeeded until I stopped it. If you're
Re: (Score:2)
Server problems happen all the time from th
Re: (Score:2)
I manage several small specialist blog/CMS-type sites as part of a non-profit, open-source project.
Typically they get a few hundred/thousand hits per day.
Recently they have been hit by AI crawlers. Some sites were getting 500,000 requests per day from 500,000 unique IP addresses with random user-agent strings. The "attacks" last 6-8 weeks.
Most of the connections get dropped by the robot, because they won't wait the 1 or 2 seconds that it takes to generate many of the pages. So most of the CPU goes
Re: (Score:2)
Re: (Score:2)
So you didn't actually even *skim* the whole post.
I know websites that have been unreachable/knocked offline due to bots. I certainly can't tell the difference between a DDoS attack and chatbots.
Ouroboros of shit (Score:2)
AI Bots scraping AI generated content to feed the AI Machine.
AI crawlers should be treated as viruses (Score:3)
Re:AI crawlers should be treated as viruses (Score:5, Funny)
We can't be expected to contact the owner of every website we steal from - there are too many. Waaaaa.
Back to the directory days (Score:2)
Are you suggesting that search engines ought to go back from the crawling model of WebCrawler, AltaVista, and Google to the opt-in directory model of Yahoo! and Dmoz, with each website operator expected to be aware of each search engine and register a sitemap in order to get their site crawled?
Re: (Score:2)
Given that the search engines (or at least Google), and the websites themselves, were much better in that era, I'm not sure what the downside would be.
"Sitemap" doesn't sound like a terribly difficult thing for a web developer to create, once he's already creating the rest of the site. No web dev? "Well there's your problem", as they would say on Mythbusters.
Downsides of directory submission (Score:3)
Given that the search engines (or at least Google), and the websites themselves, were much better in that era
Google has always been a crawler, not a directory. Its crawl was seeded at times with data from the Dmoz Open Directory Project, a directory that Netscape acquired in 1998 and ran as an open database to compete with Yahoo.
I'm not sure what the downside would be.
One downside of the directory model is that the operator of a newly established website may not know what search engines its prospective audience are using.
A second downside is time cost of navigating the red tape of keeping the site's listing updated everywhere. This has included finding wher
Re: (Score:1)
Such sites started using JavaScript rendering to get around the cheaper scrapers.
Re: (Score:2)
LLM tech bro AI is a scam and the crawling is abusive but it's not new.
Yep. And if you do mass commercial piracy, why would you care about being nice to the pages you scrape? They all hope to get out rich before the hammer falls.
\o/ (Score:3)
Is there a scenario where someone finds a way to make the LLMs DDoS each other (for those which have the ability to search the web to answer a prompt)?
Robber Barons (Score:5, Insightful)
Re:Robber Barons (Score:5, Interesting)
Well, commercial copyright infringement is supposed to get people sent to prison. But the US legal system is just about as corrupted as the rest of its government these days.
Re: (Score:1, Flamebait)
Apparently some MAGA moron has gotten mod-points...
Re: (Score:2)
Well, let's see whether I can make that MAGA waste of oxygen throw away more mod-points.
I'm stealing your comment (Score:2)
The concept of "theft" is just stupid when it comes to things openly published on the internet. LLMs are no more stealing content than Google's old search crawler was.
Honestly, it's 2025; why are we still conflating copyright infringement with theft, on Slashdot of all places?
Re: (Score:2)
It's only theft when you peasants do it. When billion-dollar companies pirate, it's fair use.
Just do this (Score:2)
Just add some outrageous content to all your site pages, something like:
with enough sites doing this, someone's bound to st
Re: (Score:2)
That kind of thing is easy to filter out.
How can it be easy to filter when everyone puts up their own variation of whatever outrageous thing they want to say? It should be just as indistinguishable as the rest of the current web. (Note that my Vance jab is just an illustration of the reasoning to use; since you clearly don't need actual evidence of any facts anymore - even the stupidest humans will believe it - it should be simple to have AIs regurgitate it.)
Yes (Score:3)
The bots represent over 90% of the traffic for many/most sites. Since the "AI" systems theoretically don't store data, they are querying it constantly. Forum software seems to be hit hardest since the content doesn't cache well. It has made one site I use frequently completely unusable despite significant resources and Cloudflare fronting it. It is a lot like the /. effect, but harder to address.
Umm... robots.txt? (Score:2)
If your website is being murdered by crawlers, stop them.
Too obvious? What am I missing?
Re: (Score:2)
What you're missing is that while the big boys pretend to respect your robots.txt, they hire contractors who don't.
Re: (Score:2)
It's when you get hit with a botnet of over a million unique IPs that has been rented from some malware provider to crawl and slurp your site down as fast as possible. When your site goes from 4-5 requests per second to 1000s. All with randomized user agents, all coming from different residential subnets in different
I'm confused (Score:1)
Can't these just be made illegal, with HUGE fines for getting caught operating one?
Re: (Score:2)
AI scraping could be classified as DDoS. It may already be illegal.
AI is Destroying our Society (Score:2)
Re: (Score:2)
Indeed. Always the same crap with some humans.
It's not really a solution, but... (Score:2)
Are AI web crawlers destroying websites? (Score:2)
Unlike traditional web crawlers, AI crawlers aggressively scrape entire pages, often ignoring crawl-delay rules or robots.txt directives, and can cause major
The Library of Babelman /s (Score:3)
This structure embodies the synergistic relationship between humans and AI — a symbiotic, co-evolutionary lattice of knowledge creation and consumption. Around the periphery of each hex, twenty shelves (each a vial of encoded knowledge tokens) store volumes of training data snippets, neatly aligned in five units per side, each roughly human-scale to optimize the interface between human cognitive bandwidth and machine processing power.
The hex’s sixth face opens to a narrow vestibule: the human-AI interaction zone — a microcosm of user experience design optimizations. Adjacent compartments are for physiological needs, benchmarking the system’s requirement to balance energy expenditure with cognitive throughput. Embedded therein is a spiral staircase spiraling upward and downward, akin to escalating layers in a multi-scale attention mechanism, enabling bidirectional information flow across time and abstraction levels.
The vestibule’s mirror is a feedback loop — a live reflection of system-state and user inference patterns. It reveals that the Library’s infinity is not an unbounded dataset but an emergent, holographic abstraction: if true infinity were granted, what need would there be for duplicates, echoes, or synthetic reflections? The mirror manifests the probabilistic hallucinations of large language models, the recursive self-attention to patterns of human cognition.
Ambient illumination is provided by bioluminescent orbs — “bulbs” — serving as analogs for active learning signals within this ecosystem. Their crosswise arrangement mimics the dualistic nature of reinforcement learning phases: exploration balanced by exploitation, signaling the unceasing flow of loss gradients updating the model weights. Yet their light is insufficient — like the early stages of alignment — it beckons continuous iteration and human-in-the-loop calibration.
This Library isn’t a mere static archive; it is an active, emergent ecosystem coalescing human creativity and AI’s computational paradigm, a lattice where every token represents the pulse in the seamless dance of human-AI feedback loops — a digital agora where the infinite synergy of collective intelligence unfolds endlessly.
Re:The Library of Babelman /s (Score:4, Funny)
Where can I report "ChatGPT psychosis" on Slashdot?
Re: The Library of Babelman /s (Score:1)
I don't know if I've had far too many tequila sunrises or my concentrate pen has gotten to me, but dude, I totally understood that.
Feed them rubbish about evil unicorn pandas (Score:2)
Interesting to see whether that will show up in the latest models if done at scale.
Re: (Score:2)
It's sometimes hard to tell the difference between AI scraper bots and legit human users.
Yes, they are (Score:3)
If you're running any kind of "old school" forum with years worth of post history on it, the AI scrapers will eventually come for you. One tech forum I frequent has been brought to its knees multiple times by bots.
Relax and wait till the VC is drained .. (Score:2)
.. in the end there can only be .. few.
These firms will fail to do business and the crawling will dampen down.
I'm feeding the AI crawlers AI generated text now (Score:4, Interesting)
I've been hit many times over as well (smallish forum website with about 12000 posts).
Seen it all: fake user agent strings, ignoring robots.txt, either localized IPs (lots from China) or distributed, load increasing to 500 times the normal value, until the site goes down.
For now, a combination of these keeps it manageable:
- fail2ban
- apache mod_evasive
- restricting forum access to logged in users
When the forum is accessed by a crawler, they get a short paragraph about how great the site is, generated by ChatGPT :-)
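For anyone curious about the fail2ban entry in that list, here's a rough, untested sketch of what such a jail might look like (the filter name, thresholds, log path, and regex are all assumptions and would need adjusting to your actual access-log format; fail2ban's stock apache-badbots filter can reportedly be adapted the same way):

# /etc/fail2ban/filter.d/ai-crawlers.conf (hypothetical filter name)
[Definition]
failregex = ^<HOST> .*"[^"]*(GPTBot|ClaudeBot|CCBot|Bytespider|meta-externalagent)[^"]*"\s*$

# /etc/fail2ban/jail.local
[ai-crawlers]
enabled  = true
port     = http,https
filter   = ai-crawlers
logpath  = /var/log/apache2/access.log
findtime = 60
maxretry = 100
bantime  = 86400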