


Open Source Devs Say AI Crawlers Dominate Traffic, Forcing Blocks On Entire Countries (arstechnica.com) 64
An anonymous reader quotes a report from Ars Technica: Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures -- adjusting robots.txt, blocking known crawler user-agents, and filtering suspicious traffic -- Iaso found that AI crawlers continued evading all attempts to stop them, spoofing user-agents and cycling through residential IP addresses as proxies. Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating "Anubis," a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site. "It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more," Iaso wrote in a blog post titled "a desperate cry for help." "I don't want to have to close off my Gitea server to the public, but I will if I have to."
Iaso's story highlights a broader crisis rapidly spreading across the open source community, as what appear to be aggressive AI crawlers increasingly overload community-maintained infrastructure, causing what amounts to persistent distributed denial-of-service (DDoS) attacks on vital public resources. According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies' bots, dramatically increasing bandwidth costs, service instability, and burdening already stretched-thin maintainers.
Kevin Fenzi, a member of the Fedora Pagure project's sysadmin team, reported on his blog that the project had to block all traffic from Brazil after repeated attempts to mitigate bot traffic failed. GNOME GitLab implemented Iaso's "Anubis" system, requiring browsers to solve computational puzzles before accessing content. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE's GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat. While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. When many people access the same link simultaneously -- such as when a GitLab link is shared in a chat room -- site visitors can face significant delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete, according to the news outlet.
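The idea behind Anubis-style proof of work is a puzzle that is expensive to solve and cheap to verify, so a single human visit is barely slowed while bulk scraping gets costly. A minimal sketch of that general concept in Python (this illustrates the idea only, not Anubis's actual protocol; the difficulty value is an arbitrary example):

```python
import hashlib
import itertools
import os

DIFFICULTY_BITS = 20  # client needs roughly 2**20 hash attempts on average


def make_challenge() -> str:
    # Random challenge the server remembers (or signs) per visitor.
    return os.urandom(16).hex()


def leading_zero_bits(challenge: str, nonce: int) -> int:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return 256 - value.bit_length()


def solve(challenge: str) -> int:
    # What the browser-side script does: brute-force a nonce.
    for nonce in itertools.count():
        if leading_zero_bits(challenge, nonce) >= DIFFICULTY_BITS:
            return nonce


def verify(challenge: str, nonce: int) -> bool:
    # Cheap for the server: a single hash per request.
    return leading_zero_bits(challenge, nonce) >= DIFFICULTY_BITS
```

The asymmetry is the whole point: the server spends one hash checking an answer, while each page fetch costs the client on the order of a million hashes, which is negligible for a person but adds up fast for a crawler requesting every commit of every repository. It also explains the two-minute waits some mobile users report: the same fixed cost that deters bots lands on slow phone CPUs too.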
Idea (Score:1, Troll)
Re: Idea (Score:5, Insightful)
No more shitcoins please.
Re: (Score:2)
Now let's talk about how you communicate the proof-of-work task and result. If I send you a packet, you need to send me a challenge. But when that challenge packet arrives at my end, I send you a challenge before accepting it, and to accept that challenge, you first send me... you see the problem? And if you managed to solve that recursion, we'd need to talk about sending the result of the PoW, which requires even more PoW to be accepted.
Re: (Score:2)
He's a moron; it's not about what he says but what he means. Clearly not every packet can depend on additional packets that are themselves recursively dependent. He doesn't know any better.
Every connection establishment could be, though. Of course, that would be terrible, but at least conceivable. Then there would be workarounds for that. And just imagine the corruption associated with collecting the fees. Truly the spirit of an open internet, right?
Re: (Score:2)
I was making a bit of fun of the OP in my post ...
Here it would be enough to introduce PoW for HTTP. And I won't go searching for it now, because I have a dark feeling that the blockchain fans already have a spec for that.
Re: (Score:2)
"each and every TCP packet should ... contribute to a FOSS-forward cryptocurrency validation system. Make the bots pay for FOSS development."
Note that the suggestion to make "bots" pay in fact forces "each and every TCP packet" to pay. To each according to his ability! Tax everybody to give me free stuff.
Re: (Score:2)
Just in time for TCP traffic to fall thru the floor.
Alphabet, Cloudflare, and the big mobile carriers are going to make HTTP/3 happen. They don't give a rat's arse that moving the connection-orientedness out of the transport layer, where it can be generically understood, and into the application layer while encrypting it will once again further erode anyone else's ability to identify or categorize traffic on their networks, for any reason, including legitimate ones like security, cost control, and performance ma
Re: Idea (Score:2)
Ok so it's coming... what does one do about it as it hits, pragmatically?
Re: (Score:2)
Nothing you can do about it. Just realize you are not going to stop the AI scraping without also getting yourself effectively memory-holed as far as the search majors are concerned.
It's either accept it and deal with the bot traffic, or try to stop the crawlers, in which case the outcome is that you get no traffic at all.
Re: (Score:2)
But hey, enjoy the oversaturated internet that all this AI traffic will cause without such a plan!
It's called captcha, guys. (Score:1)
What is this, 1995 over there?
Re:It's called captcha, guys. (Score:4, Informative)
No, it's 2025 and CAPTCHAs don't work anymore. https://arstechnica.com/ai/202... [arstechnica.com]
Re: (Score:1)
They slow down bots and waste normal people's time, which is what this guy reinvented.
Re: (Score:2)
Captchas don't work against spam because a spammer only needs to submit one form to (potentially) make a profit. That still implies the form is worth the compute spent solving the captcha, and that only scales when you make a direct profit. If you just want to crawl some data, there isn't that much value in it. Running a captcha solver against every site would be way too expensive for any search engine, AI crawler, or social media bot.
Re: It's called captcha, guys. (Score:3)
Re: (Score:2)
It's the same reason they simply can't license each work at a price a human would have to pay. They are dealing with HUGE amounts of data, and every bit of it contributes only a tiny part. But if you need to solve a captcha for every tiny part, you won't get enough data in a reasonable time.
Llama 3 was trained on 15T tokens. I have no estimate of how many tokens a site would provide before you have to solve the next captcha, but you still end up with A LOT of captchas. And other than model training, you have to sol
How about subversion? (Score:5, Interesting)
Instead of blocking crawler traffic, which requires some degree of cooperation, how about making it unprofitable to crawl? The problem right now is that the server detects a crawler and then tries to block future access by that crawler, but is ultimately unable to block all such attempts. Why not detect the crawler and then surreptitiously send back corrupted data? Don't make the corruption obvious or too frequent, just enough to make it futile/dangerous to use as training data.
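A bare-bones sketch of that idea, assuming a Flask app and a purely user-agent-based check (which, as the summary notes, real crawlers evade by spoofing, so a production version would key off behaviour instead); the junk here is obvious for brevity, whereas the suggestion above is to keep the corruption subtle:

```python
import random

from flask import Flask, Response, request

app = Flask(__name__)

# Illustrative signals only; aggressive crawlers spoof or omit these in practice.
BOT_HINTS = ("GPTBot", "CCBot", "Bytespider", "ClaudeBot", "Amazonbot")
FILLER = "lorem ipsum dolor sit amet consectetur adipiscing elit sed do".split()


def looks_like_crawler(req) -> bool:
    ua = req.headers.get("User-Agent", "")
    return any(hint in ua for hint in BOT_HINTS)


@app.route("/articles/<slug>")
def article(slug: str):
    if looks_like_crawler(request):
        # Serve plausible-looking filler instead of the real page.
        junk = " ".join(random.choices(FILLER, k=500))
        return Response(f"<html><body><p>{junk}</p></body></html>", mimetype="text/html")
    return Response(f"<html><body>Real content for {slug}</body></html>", mimetype="text/html")
```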
Re:How about subversion? (Score:5, Informative)
This is actually being done for AI-focused crawling. Feed the bot script- or AI-generated nonsense after directing it into a honeypot directory or some such. We actually had an article about it last week.
Found it - Cloudflare Turns AI Against Itself With Endless Maze of Irrelevant Facts [slashdot.org]
Subversion but with random facts. (Score:2)
The Cloudflare crawler honeypots serve an endless web of verified facts to the crawlers, so as not to increase mis/disinformation. The auto-populated page content and links are designed to NOT include facts relevant to the site.
Re: (Score:2, Informative)
If you can detect the crawler, a 403 is cheaper. And there is little reason to actively make AI dumber; there is only a reason to slow down crawlers on your own site. The problem is that no two companies use the same crawler, so you need to keep updating your robots.txt (and possibly blocking the naughty ones) as new crawlers are created.
Re:How about subversion? (Score:5, Informative)
The whole point of the story is that those crawlers could not care less about your robots.txt file.
Re: (Score:2)
Exactly: they use fake user agents and ignore robots.txt.
I run a small Calcudoku puzzle site, and recently had to take the forum offline because they were hitting it so hard that the server load went up to 300+ (a normal high load is about 3).
For the past few months the crawlers are all from Chinese IPs.
I'm looking to create custom fail2ban rules to detect and block them.
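A crude stand-in for that kind of rule, written here as a standalone Python log scanner rather than an actual fail2ban filter; the log path, user-agent patterns, and threshold below are made-up examples:

```python
import re
import subprocess
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumed combined-format access log
BOT_UA = re.compile(r"Bytespider|GPTBot|ClaudeBot|Amazonbot", re.IGNORECASE)
THRESHOLD = 200                          # matching requests before an IP gets banned


def offending_ips() -> list[str]:
    hits: Counter[str] = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            parts = line.split()
            if parts and BOT_UA.search(line):
                hits[parts[0]] += 1      # first field is the client IP
    return [ip for ip, count in hits.items() if count >= THRESHOLD]


def ban(ip: str) -> None:
    # Same net effect as a fail2ban action; requires root.
    subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=True)


if __name__ == "__main__":
    for ip in offending_ips():
        ban(ip)
```

As the reply below points out, this stops working once the bots spread the load across swarms of IPs, so per-IP bans are at best a first line of defence.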
Re:How about subversion? (Score:4)
FWIW, we tried fail2ban and the bots circumvented it in days - instead of dozens of requests coming from one IP, they went down to one request from each IP, and swarms of IPs coming from all over the place (not within an easy-to-define CIDR range).
The first thing that's been effective for us is Turnstile. A colleague of mine wrote up a general approach in Rails https://bibwild.wordpress.com/... [wordpress.com] and we wrote up our version of that using Traefik https://github.com/pulibrary/p... [github.com]
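For reference, the server-side half of a Turnstile setup is a single call to Cloudflare's siteverify endpoint with the token the widget put in the form; a minimal Python version (how the result gets wired into the reverse proxy so verified visitors are passed through is deployment-specific):

```python
import requests

TURNSTILE_VERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def turnstile_ok(token: str, secret: str, remote_ip: str | None = None) -> bool:
    """Return True if Cloudflare confirms the challenge token is valid."""
    payload = {"secret": secret, "response": token}
    if remote_ip:
        payload["remoteip"] = remote_ip
    resp = requests.post(TURNSTILE_VERIFY_URL, data=payload, timeout=5)
    return bool(resp.json().get("success"))
```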
Re: (Score:2)
That's what "proof of work" is about. The problem is to make that work useful rather than just friction. And it's got to be really easy to check the answer, but not easy to compute. There are lots of problems of that kind, but most aren't useful. The classical problem of that kind is factoring numbers...easy to check, more difficult to do...but turning that into something that's in itself useful is difficult.
Re: (Score:2)
Because of the economics, and the original objective.
If you put something on the web, it is because you want eyeballs on it, in the vast majority of cases. Either there is economic value in the eyeballs, your thing is itself an ad for something you are selling, you get paid for others' ads based on impressions/views, or it's vanity: you wrote something, so you want others to read it.
To get eyeballs you need to be indexed in some kind of search, or possibly sucked into some AI model, either directly or via RAG, so you can
Re: How about subversion? (Score:2)
You absolutely can; look up "AI tarpit".
There are people fighting back.
Bring Back Kitten Net! (Score:2)
Major problem (Score:5, Informative)
This problem is very real. I manage a small club website on an old, simple machine. These bots completely ignore robots.txt (we don't want ANY indexing or crawling), and 99.9999% of traffic to the site is bots. They are so pervasive that they brought my site to a crawl and often made it fail completely.
In desperation one week, trying to get the site up, I had to resort to outright blocking thousands of IP addresses in iptables. That is an almost hopeless game of whack-a-mole, but at least it got the site reachable.
What this ultimately means is that nobody will be able to run their own infrastructure anymore and will be forced onto huge, paid third-party infrastructure/gatekeepers. And those mega-corps will then have access to your data, stats, and traffic info, take more of your money, and even control filtering to your sites if they like.
I don't know the solution. We probably need to treat them like we do spammers and have RBLs at the ISP level. But I can't see ISPs wanting to do that.
Re:Major problem (Score:5, Interesting)
You can "whois" the IP and get the entire IP address range for that carrier from the output, then just block them all at once. They'll switch ISPs maybe 3-5 more times on you that evening, but after that you've usually pretty much silenced them for a few weeks. There's a finite amount of ISPs that willingly allow their customers to carry out this type of abuse while also helping them shuffle IP addresses at will, and once you've found them all just... don't ever unblock them.
Re: (Score:1)
Staying anonymous, since the weasels will attack me/my sites if they can figure out who I am: I regularly block entire ranges based on AFRINIC (Africa), ARIN, APNIC, LACNIC, or RIPE NCC assignments. Mostly because of email spam, often also to deal with WordPress attacks, but so far the scrapers have not hit me hard yet. I'm not messing with clever tricks, I'll just ban them.
Oh, and currently I ban country-code TLDs such as .cn, .ru, .ua, and a few others that have been common bad actors for me.
Re:Major problem (Score:4, Interesting)
I have my own personal genealogy website, running on a small Armbian machine. It recently got hit by a scanning bot army from Huawei Cloud, also completely ignoring my robots.txt, so badly that I had to block all of their IP ranges at my router. This kind of behaviour makes me pine for the internet of the 90s. It wasn't fast, but it was fun.
Re: (Score:2)
Yep, Huawei Cloud was one of my worst abusers too. I blocked huge swathes of China, but ended up password-protecting my site except for the main page, where I give the username and password users should use to access the rest of the site. This has stopped bots cold... for now.
Re: (Score:1)
Hide the origin, proxy through Cloudflare, and let them cache everything. They do it for free.
Re: (Score:2)
We're probably headed to a new form of the web, where only the giants have "open to the world" sites, and everybody else has to have some form of invite token passed from the browser to the server before access is allowed. It's either that, or your nightmare scenario of allowing the giants to control even the small sites. It's amazing what we've done to the open web by allowing greed beyond reason to run amok. Because that's all current gen AI is. Greed manifested in digital form. They need more power, more
Re: (Score:2)
We're probably headed to a new form of the web, where only the giants have "open to the world" sites, and everybody else has to have some form of invite token passed from the browser to the server before access is allowed.
So the "giants" own the internet or it's just a different form of net neutrality. The "giants" are the gatekeepers of the internet.
Re: (Score:3)
Re: (Score:2)
If you are sitting behind a good CDN, or any CDN, you should be able to have this traffic rejected before it even gets to your site.
https://blog.cloudflare.com/de... [cloudflare.com]
Sitting behind any CDN should provide some protection from DDoS attacks and unwanted synthetic traffic. The CDN is analyzing traffic with ML, so it adapts as the AI bots do, in theory.
Geoblocking (Score:5, Interesting)
We've been geoblocking the worst countries for several years. The US, most of the former Soviet bloc, and the Middle East are out. We even geoblocked a fair part of South America.
They still try to get to us via local proxies, but it's down from millions of hits to thousands, and our usual defenses can handle that.
Re: (Score:2)
The US, most of the former Soviet bloc, and the Middle East are out.
Did you remember to block the Roman and Aztec Empires too? The "Soviet bloc" came to a decisive end a third of a century ago. (If you ignore small parts of the UK.)
Re: (Score:2)
Re: Geoblocking (Score:2)
Re: (Score:1)
> We've been geoblocking the worst countries for several years...[such as the US]
It feels odd living in a pariah nation: USA. The Tinted One didn't create that condition, but he did put pariahood on steroids and gave it a big check.
> when the CIA wouldn't leave our servers alone.
I doubt you'd know it was from the CIA. They wouldn't use CIA-identifiable addresses; that defeats the point of being a spy agency. It's possible another group was using/spoofing their IP blocks, though.
Forcing Blocks On Entire Countries (Score:4, Insightful)
That's the lazy approach. Nobody forces you to block a whole country. You're just taking the easy route at the cost of millions of visitors.
Re: (Score:3)
Probably not. My sites are of no use to anyone outside the UK.
Re: Forcing Blocks On Entire Countries (Score:1)
Are you sure? What about some Briton who's currently abroad (and isn't using roaming data)?
Re: (Score:2)
You'd have to weigh the value of that hypothetical wandering Briton's time against the actual time of the site sysadmin, who would probably like to do something else with his free hours than play endless games of unpaid whack-a-mole.
Re: (Score:2)
I would argue they are not millions (of his customers). But the point is that many other sites do have millions of customers in countries someone may block because of their bots.
Re: (Score:2)
Block based on traffic patterns (Score:2)
Re: (Score:2)
Interesting take.
But a honeypot would not need to waste many resources. It could just be a basic page, disallowed in robots.txt, that is linked early from the main page but hidden from human users (JavaScript comes to mind, and for the ones who disable JavaScript, a warning). A hit to that page would land the offender in an IP blocklist for a while.
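A minimal version of that trap, assuming a Flask app and an in-memory blocklist; a real deployment would persist the list, hand it to the firewall, and hide the /trap/ link from humans via JavaScript or CSS as described above:

```python
import time

from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED: dict[str, float] = {}   # client IP -> time the ban expires
BAN_SECONDS = 3600


@app.before_request
def reject_blocked():
    if BLOCKED.get(request.remote_addr, 0.0) > time.time():
        abort(403)


@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers never follow the trap; the bad ones ignore this file.
    return "User-agent: *\nDisallow: /trap/\n", 200, {"Content-Type": "text/plain"}


@app.route("/trap/<path:anything>")
def trap(anything: str):
    # Only a crawler that ignores robots.txt and follows hidden links lands here.
    BLOCKED[request.remote_addr] = time.time() + BAN_SECONDS
    abort(403)
```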
Re: (Score:2)
The funny part is that you can certainly train an AI to find traffic patterns. You would then have the AI (or a human) craft simpler rules that can be evaluated in real time, though. But sifting through large amounts of data and finding patterns is exactly what AI does best.
Why Brazil? (Score:2)
Re: (Score:2)
I can't understand why they blocked all traffic from Brazil. This is an underdeveloped country. We don't have AI capabilities. All of our best engineers are abroad.
Perhaps locations in Brazil were/are being used as proxy nodes?
Re: (Score:2)
confusing (Score:2)
The problem here is that there is too much reading of open source code? Isn't that the goal of open source code?
Or is the problem who reads it and who benefits from the reading? If that's the case, why not make access credentialed?
It appears to me that the problem here is that the source code sites are inadequate for their intended purpose and their operators are hypocrites. I hate AI as much as anyone, but having AI train on open source code is literally the core vision of open source software.
Re: (Score:2)
The main problem is the rate of access. If one entity is using all your bandwidth, nobody else can get in. And you've got to pay for that bandwidth, but it's only visible to those who won't be contributing.
Re: (Score:2)
It's one thing to train on software. It's quite another to hammer a site repeatedly and overload it.
I have a few open-source projects that were hit by this. If the AI companies wanted to train on my software, they could git clone it once and train it that way, which would not put a strain on my server. Instead, they hit the web view of my git repo, pulling each commit's link at high speed. This is about 1000x the load of a simple git clone and ignores my robots.txt file. So banned they are.
Re: (Score:2)
Basic respect for resources would be polite. A friend of mine also crawled a few sites with his own scripts for various purposes, but always with time delays to slow the requests down. And if you crawl the whole web, you should be able to schedule your requests so that the rate on any single site always stays low. The only reason for a high rate at a single site would be getting a more consistent snapshot of data that depends on other data, but the brute-force crawlers have no idea what kind of data they are scr
If you are not behind a CDN.. (Score:2)
This problem is one that CDNs have been working on for years. They're using ML to detect synthetic traffic and block it. The block rate vs. false positives is really good, in my experience. A few times we had some corporate network step over the line into what looked like bot behavior and get blocked, but it's rare. The point is, use ML to detect AI bots, and let the CDN block them before they get anywhere near your servers. Most CDNs provide this even in the free tiers of service.
Most of the traffic comes from Bytespider