Open Source Devs Say AI Crawlers Dominate Traffic, Forcing Blocks On Entire Countries (arstechnica.com) 64

An anonymous reader quotes a report from Ars Technica: Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures -- adjusting robots.txt, blocking known crawler user-agents, and filtering suspicious traffic -- Iaso found that AI crawlers continued evading all attempts to stop them, spoofing user-agents and cycling through residential IP addresses as proxies. Desperate for a solution, Iaso eventually resorted to moving their server behind a VPN and creating "Anubis," a custom-built proof-of-work challenge system that forces web browsers to solve computational puzzles before accessing the site. "It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more," Iaso wrote in a blog post titled "a desperate cry for help." "I don't want to have to close off my Gitea server to the public, but I will if I have to."

Iaso's story highlights a broader crisis rapidly spreading across the open source community, as what appear to be aggressive AI crawlers increasingly overload community-maintained infrastructure, causing what amounts to persistent distributed denial-of-service (DDoS) attacks on vital public resources. According to a comprehensive recent report from LibreNews, some open source projects now see as much as 97 percent of their traffic originating from AI companies' bots, dramatically increasing bandwidth costs, causing service instability, and burdening already stretched-thin maintainers.

Kevin Fenzi, a member of the Fedora Pagure project's sysadmin team, reported on his blog that the project had to block all traffic from Brazil after repeated attempts to mitigate bot traffic failed. GNOME GitLab implemented Iaso's "Anubis" system, requiring browsers to solve computational puzzles before accessing content. GNOME sysadmin Bart Piotrowski shared on Mastodon that only about 3.2 percent of requests (2,690 out of 84,056) passed their challenge system, suggesting the vast majority of traffic was automated. KDE's GitLab infrastructure was temporarily knocked offline by crawler traffic originating from Alibaba IP ranges, according to LibreNews, citing a KDE Development chat. While Anubis has proven effective at filtering out bot traffic, it comes with drawbacks for legitimate users. When many people access the same link simultaneously -- such as when a GitLab link is shared in a chat room -- site visitors can face significant delays. Some mobile users have reported waiting up to two minutes for the proof-of-work challenge to complete, according to the news outlet.
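For illustration, here is a minimal sketch of the kind of proof-of-work gate described above (not Anubis's actual implementation, which is written in Go and runs its solver in the visitor's browser; the hash scheme and difficulty are assumptions): the server issues a random challenge, the client burns CPU finding a nonce whose hash meets a difficulty target, and verifying a solution costs the server a single hash.

```python
# Minimal proof-of-work sketch (illustrative only; parameters are assumptions).
import hashlib
import os

DIFFICULTY = 4  # required leading zero hex digits; higher = more client CPU

def new_challenge() -> str:
    """Server side: issue a random challenge, e.g. stored in a cookie."""
    return os.urandom(16).hex()

def _meets_target(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce; this is the part that costs CPU."""
    nonce = 0
    while not _meets_target(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking a submitted solution is one cheap hash."""
    return _meets_target(challenge, nonce)

if __name__ == "__main__":
    c = new_challenge()
    n = solve(c)          # expensive for the requester
    assert verify(c, n)   # cheap for the server
```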


Comments Filter:
  • Idea (Score:1, Troll)

    by blackomegax ( 807080 )
    each and every TCP packet should face a proof-of-work gate. This proof of work should contribute to a FOSS-forward cryptocurrency validation system. Make the bots pay for FOSS development.
    • Re: Idea (Score:5, Insightful)

      by Samuel Silverstein ( 10475946 ) on Wednesday March 26, 2025 @12:53AM (#65259705)

      No more shitcoins please.

    • by allo ( 1728082 )

      Now let's talk about how you communicate the proof of work task and result. If I send you a packet, you need to send me a challenge. When that packet arrives, I send you a challenge before accepting it, but to accept the challenge, you first send me ... you see the problem? And if you managed to solve that recursion, we need to talk about sending the result of the PoW, requiring even more PoW to be accepted.

      • by dfghjk ( 711126 )

        He's a moron, it's not what he says but what he means. Clearly not every packet can be dependent on additional packets that are recursively dependent. He doesn't know any better.

        Every connection establishment could be, though. Of course, that would be terrible, but at least conceivable. Then there would be workarounds for that. And just imagine the corruption associated with collecting the fees. Truly the spirit of an open internet, right?

        • by allo ( 1728082 )

          I was making fun of the OP a bit in my post ...

          Here it would be enough to introduce PoW for HTTP. And I won't search for it now, because I have a dark feeling that the blockchain fans already have a spec for that.

    • by dfghjk ( 711126 )

      "each and every TCP packet should ... contribute to a FOSS-forward cryptocurrency validation system. Make the bots pay for FOSS development."

      Note that the suggestion to make "bots" pay in fact forces "each and every TCP packet" to pay. To each according to his ability! Tax everybody to give me free stuff.

      • by DarkOx ( 621550 )

        Just in time for TCP traffic to fall thru the floor.

        Alphabet, Cloudflare, and the big mobile carriers are going to make HTTP/3 happen. They don't give a rat's arse that moving the connection-orientedness out of the transport layer, where it can be generically understood, and into the application layer while encrypting it will once again erode anyone else's ability to identify or categorize traffic on their networks for any reason, including legitimate ones like security, cost control, and performance management.

        • Ok so it's coming... what does one do about it as it hits, pragmatically?

          • by DarkOx ( 621550 )

            Nothing you can do about it. Just realize you are not going to stop the AI scraping without also getting yourself effectively memory-holed as far as the search majors are concerned.

            It's either accept it and deal with the bot traffic, or try to stop the crawlers, but then the outcome is that you will be getting no traffic at all.

      • That's the brilliant thing about using PoW. The average home user, using up 1TB of data per month, could be charged a total of 50 cents on that data via their electric bill. But an AI data center would be consuming so much data they'd be paying thousands.

        But hey, enjoy the oversaturated internet that all this AI traffic will cause without such a plan!
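Taking the comment's own 50-cents-per-terabyte figure at face value, the proportionality is the whole argument; a trivial sketch (the crawler volume below is a made-up assumption):

```python
# Back-of-envelope for the proportional PoW-billing idea above.
POW_COST_PER_TB_USD = 0.50        # figure claimed in the parent comment

def monthly_pow_cost(terabytes: float) -> float:
    """Cost scales linearly with data moved, so heavy crawlers pay far more."""
    return terabytes * POW_COST_PER_TB_USD

print(monthly_pow_cost(1))        # home user, 1 TB/month            -> 0.5
print(monthly_pow_cost(10_000))   # hypothetical 10 PB/month crawler -> 5000.0
```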
  • What is this 1995 over there?

    • by Chris Mattern ( 191822 ) on Wednesday March 26, 2025 @02:06AM (#65259775)

      No, it's 2025 and CAPTCHAs don't work anymore. https://arstechnica.com/ai/202... [arstechnica.com]

      • They slow down bots and waste normal people's time, which is what this guy reinvented.

      • by allo ( 1728082 )

        Captchas don't work against spam, as a spammer only needs to submit one form to (potentially) make a profit. That still implies that the form is worth the compute to solve the captcha, and that only scales when you make direct profit. If you just want to crawl some data, there isn't that much value in any single page. Running one captcha solver per site would be way too expensive for any search engine, AI crawler, or social media bot.

        • I think you've missed how much money AI companies are lighting on fire on both compute and being first. If it cost a billion in compute to solve the captchas but they got the data, they would do it. The fundamental problem is companies are being rewarded with billions of investment and stock growth for this behavior. They will continue until the money dries up.
          • by allo ( 1728082 )

            It's the same reason they simply can't license each work at a price a human would have to pay. They are dealing with HUGE amounts of data, and every bit of it only contributes a tiny part. But if you need to solve a captcha for every tiny part, you won't get enough data in a reasonable time.

            Llama 3 was trained on 15T tokens. I have no estimate of how many tokens a site would provide before you have to solve the next captcha, but you still end up with A LOT of captchas. And other than model training, you have to solve...
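A rough scale check of that claim, under made-up per-page assumptions (only the 15T token count comes from the comment):

```python
# How many captchas would a 15T-token crawl cost? All values except
# TRAINING_TOKENS are hypothetical assumptions for illustration.
TRAINING_TOKENS = 15e12      # Llama 3 training-set size, per the comment
TOKENS_PER_PAGE = 2_000      # assumed average tokens recovered per page
PAGES_PER_CAPTCHA = 100      # assume one captcha unlocks ~100 pages of a site

pages_needed = TRAINING_TOKENS / TOKENS_PER_PAGE
captchas_needed = pages_needed / PAGES_PER_CAPTCHA
print(f"{captchas_needed:,.0f} captchas")   # 75,000,000 under these assumptions
```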

  • by larryjoe ( 135075 ) on Wednesday March 26, 2025 @01:22AM (#65259743)

    Instead of blocking crawler traffic, which requires some degree of cooperation, how about making it unprofitable to crawl? Right now the server detects a crawler and then tries to block future access by that crawler, but is ultimately unable to block all such attempts. Why not detect the crawler and then surreptitiously send back corrupted data? Don't make the corruption obvious or too frequent -- just enough to make the data futile or dangerous to use for training.
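A minimal sketch of that mechanism (the detection heuristic and corruption rate are hypothetical placeholders, not a recommendation):

```python
# Sketch of the "poison suspected crawlers" idea above.
import random

def is_suspected_crawler(headers: dict) -> bool:
    """Placeholder heuristic: real detection would use rate, IP reputation, etc."""
    ua = headers.get("User-Agent", "").lower()
    return "bot" in ua or not ua

def subtly_corrupt(text: str, rate: float = 0.02, seed: int = 0) -> str:
    """Swap a small fraction of adjacent word pairs so the text still looks plausible."""
    rng = random.Random(seed)
    words = text.split()
    for i in range(len(words) - 1):
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def render(body: str, headers: dict) -> str:
    """Serve clean text to humans, lightly scrambled text to suspected bots."""
    return subtly_corrupt(body) if is_suspected_crawler(headers) else body
```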

  • Major problem (Score:5, Informative)

    by markdavis ( 642305 ) on Wednesday March 26, 2025 @02:07AM (#65259777)

    This problem is very real. I manage a small club website on an old, simple machine. These bots completely ignore robots.txt (we don't want ANY indexing or crawling) and 99.9999% of traffic to the site is bots. And they are so pervasive that they caused my site to come to a crawl and often fail completely.

    In desperation, one week, trying to get the site up, I had to resort to just outright blocking of thousands of IP addresses in iptables. That is almost hopeless whack-a-mole, but at least it got the site reachable.

    What this ultimately means is that nobody will be able to run their own infrastructure anymore and will be forced to use huge, paid third-party infrastructure/gatekeepers. And those mega-corps will then have access to your data, stats, and traffic info, take more of your money, and even control filtering to your sites if they like.

    I don't know the solution. We probably need to treat the sites like we do spammers and have RBL at the ISP-level. But I can't see them wanting to do that.
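For what it's worth, the whack-a-mole step can at least be scripted; a sketch under assumed log paths and thresholds (an ipset keeps the iptables rule count at one):

```python
# Pull the noisiest client IPs out of an access log and emit firewall commands.
from collections import Counter

LOG = "/var/log/nginx/access.log"   # assumed combined-format access log
THRESHOLD = 1_000                   # requests before an IP gets blocked

hits = Counter()
with open(LOG) as f:
    for line in f:
        ip = line.split(" ", 1)[0]  # first field is the client IP in combined format
        hits[ip] += 1

print("ipset create crawlers hash:ip -exist")
for ip, count in hits.items():
    if count >= THRESHOLD:
        print(f"ipset add crawlers {ip} -exist")
print("iptables -I INPUT -m set --match-set crawlers src -j DROP")
```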

    • Re:Major problem (Score:5, Interesting)

      by Narcocide ( 102829 ) on Wednesday March 26, 2025 @02:20AM (#65259785) Homepage

      You can "whois" the IP and get the entire IP address range for that carrier from the output, then just block them all at once. They'll switch ISPs maybe 3-5 more times on you that evening, but after that you've usually pretty much silenced them for a few weeks. There's a finite amount of ISPs that willingly allow their customers to carry out this type of abuse while also helping them shuffle IP addresses at will, and once you've found them all just... don't ever unblock them.

      • by Anonymous Coward

        Staying anonymous, since the weasels will attack me/my sites if they can figure out who I am. I regularly block entire ranges based on AFRINIC (Africa), ARIN, APNIC, LACNIC, or RIPE NCC assignments. Mostly because of email spam, often also to deal with WordPress attacks, but so far the scrapers have not hit me hard yet. I'm not messing with clever tricks, I'll just ban them.

        Oh, and currently I ban country-code TLDs such as .cn, .ru, .ua, and a few others that have been common bad actors for me.

    • Re:Major problem (Score:4, Interesting)

      by BDeblier ( 155691 ) on Wednesday March 26, 2025 @02:49AM (#65259807) Homepage

      I have my own personal genealogy website, running on a small Armbian machine. It recently got hit by a scanning bot army from Huawei Cloud, also completely ignoring my robots.txt, so badly that I had to block all of their IP ranges at my router. This kind of behaviour makes me pine for the internet of the 90s. It wasn't fast, but it was fun.

      • by dskoll ( 99328 )

        Yep, Huawei Cloud was one of my worst abusers too. I blocked huge swathes of China, but ended up password-protecting my site except for the main page, where I give the username and password users should use to access the rest of the site. This has stopped bots cold... for now.

    • Hide the origin, proxy through Cloudflare, and let them cache everything. They do it for free.

    • This problem is very real. I manage a small club website on an old, simple machine. These bots completely ignore robots.txt (we don't want ANY indexing or crawling) and 99.9999% of traffic to the site is bots. And they are so pervasive that they caused my site to come to a crawl and often fail completely.

      In desperation, one week, trying to get the site up, I had to resort to just outright blocking of thousands of IP addresses in iptables. That is almost hopeless whack-a-mole, but at least it got the site reachable.

      What this ultimately means is that nobody will be able to run their own infrastructure anymore and will be forced to use huge, paid third-party infrastructure/gatekeepers. And those mega-corps will then have access to your data, stats, and traffic info, take more of your money, and even control filtering to your sites if they like.

      I don't know the solution. We probably need to treat the sites like we do spammers and have RBL at the ISP-level. But I can't see them wanting to do that.

      We're probably headed to a new form of the web, where only the giants have "open to the world" sites, and everybody else has to have some form of invite token passed from the browser to the server before access is allowed. It's either that, or your nightmare scenario of allowing the giants to control even the small sites. It's amazing what we've done to the open web by allowing greed beyond reason to run amok. Because that's all current gen AI is. Greed manifested in digital form. They need more power, more

      • We're probably headed to a new form of the web, where only the giants have "open to the world" sites, and everybody else has to have some form of invite token passed from the browser to the server before access is allowed.

        So the "giants" own the internet or it's just a different form of net neutrality. The "giants" are the gatekeepers of the internet.

    • If it is a club web site, you could put it behind a membership wall; that will reduce the traffic load to almost nothing. Also take a look at your sessions, if you are issuing any. Session overload is what brought my custom forum down a couple weeks ago. A complete session overhaul to make sure only logged-in users were issued a session brought session files down from highs of 20k to less than 300. Guest users didn't actually need sessions, and single requests from bots didn't need them either. CPU usage went down accordingly.
    • If you are sitting behind a good CDN, or any CDN, you should be able to have this traffic rejected before it even gets to your site.
      https://blog.cloudflare.com/de... [cloudflare.com]
      Sitting behind any CDN should provide some protection from DDoS attacks and unwanted synthetic traffic. The CDN is analyzing traffic with ML, so it adapts as the AI bots do, in theory.

  • Geoblocking (Score:5, Interesting)

    by Wolfling1 ( 1808594 ) on Wednesday March 26, 2025 @03:03AM (#65259819) Journal
    We've been geoblocking the worst countries for several years, ever since the CIA wouldn't leave our servers alone.

    The US, most of the Soviet bloc, and the Middle East are out. We even geoblocked a fair part of South America.

    They still try to get to us via local proxies, but it's down from millions of hits to thousands, and our usual defenses can handle that.
    • The US, most of the Soviet bloc, and the Middle East are out.

      Did you remember to block the Roman and Aztec Empires too? The "Soviet bloc" came to a decisive end a third of a century ago. (If you ignore small parts of the UK.)

    • by Tablizer ( 95088 )

      > We've been geoblocking the worst countries for several years...[such as the US]

      It feels odd living in a pariah nation: the USA. The Tinted One didn't create that condition, but he did put pariahood on steroids and gave it a big check.

      > when the CIA wouldn't leave our servers alone.

      I doubt you'd know it was from the CIA. They wouldn't use CIA-identifiable addresses; that defeats the point of being a spy. It's possible another group was using/spoofing their IP blocks, though.

  • by allo ( 1728082 ) on Wednesday March 26, 2025 @05:50AM (#65259923)

    That's the lazy approach. Nobody forces you to block a whole country. You're just taking the easy route at the cost of millions of visitors.

    • at the cost of millions of visitors.

      Probably not. My sites are of no use to anyone outside the UK.

      • Are you sure? What about some Briton who's currently abroad (and isn't using roaming data)?

        • by Jeremi ( 14640 )

          You'd have to weigh the value of that hypothetical wandering Briton's time against the actual time of the site sysadmin, who would probably rather do something else with his free hours than play endless games of unpaid whack-a-mole.

        • by allo ( 1728082 )

          I would argue there are not millions (of his customers). But the point is that many other sites do have millions of users in countries someone may block because of their bots.

          • I get the occasional email from a customer saying 'hey... I'm in country X and I can't access the service', and I say 'yeah, we geoblocked that country because it's full of scammers and hackers', and they say 'yeah... fair call'.
  • This isn't my area of expertise, and I'm sure someone has done this in hardware or software in or around the content delivery tier, and there are false positives, limitations, and ways to work around any restrictions, but what about blocking or rerouting traffic based on request patterns, possibly in addition to other signatures? I mean, if something is systematically crawling relatively quickly, then it's probably a bot, and you might want to at least throttle it. If it's ignoring robots.txt, maybe it's not a p
    • Interesting take.

      But a honeypot would not need to waste many resources. It could just be a basic page that is disallowed in robots.txt, linked early from the main page but hidden from human users (JavaScript comes to mind, and for those who disable JavaScript, a warning). A hit to that page would land the offender in an IP blocklist for a while.
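A toy version of that trap, with a hypothetical trap path and block TTL (a real deployment would feed the firewall or a fail2ban-style jail instead of an in-process dict):

```python
# Trap URL that is disallowed in robots.txt and hidden from humans; any client
# that requests it anyway gets its IP put on a temporary blocklist.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PATH = "/staff-only/reports.html"   # also listed under Disallow: in robots.txt
BLOCK_SECONDS = 3600
blocklist: dict[str, float] = {}         # ip -> unblock timestamp

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if blocklist.get(ip, 0) > time.time():
            self.send_error(403, "Blocked")
            return
        if self.path == TRAP_PATH:
            blocklist[ip] = time.time() + BLOCK_SECONDS   # fell for the trap
            self.send_error(403, "Blocked")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```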

    • by allo ( 1728082 )

      The funny part is that you can certainly train an AI to find traffic patterns. You would then have the AI (or a human) craft simpler rules that can be evaluated in real time, though. But sifting through large amounts of data and finding patterns is exactly what AI does best.
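A toy example of the kind of simple, real-time rule that could come out of such an analysis (the window and threshold here are made up; offline analysis would be what picks them):

```python
# Sliding-window request counter per IP as a cheap real-time crawler rule.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120          # more than ~2 req/s sustained looks like a crawler

_history: dict[str, deque] = defaultdict(deque)

def looks_like_crawler(ip: str, now: float | None = None) -> bool:
    """Record this request and report whether the IP exceeds the rate threshold."""
    now = time.time() if now is None else now
    q = _history[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```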

  • I can't understand why they blocked all traffic from Brazil. This is an underdeveloped country. We don't have AI capabilities. All of our best engineers are abroad.
    • by Jeremi ( 14640 )

      I can't understand why they blocked all traffic from Brazil. This is an underdeveloped country. We don't have AI capabilities. All of our best engineers are abroad.

      Perhaps locations in Brazil were/are being used as proxy nodes?

    • Is there a way to be in a different country, but digitally appear to be in Brazil? I doubt the interwebs have figured out how to do that yet.
  • The problem here is that there is too much reading of open source code? Isn't that the goal of open source code?

    Or is the problem who reads it and who benefits from the reading? If that's the case, why not make access credentialed?

    It appears to me that the problem here is that the source code sites are inadequate for their intended purpose and their operators are hypocrites. I hate AI as much as anyone, but having AI train on open source code is literally the core vision of open source software.

    • by HiThere ( 15173 )

      The main problem is the rate of access. If one entity is using all your bandwidth, nobody else can get in. And you've got to pay for that bandwidth, even though the site ends up visible only to those who won't be contributing.

    • by dskoll ( 99328 )

      It's one thing to train on software. It's quite another to hammer a site repeatedly and overload it.

      I have a few open-source projects that were hit by this. If the AI companies wanted to train on my software, they could git clone it once and train it that way, which would not put a strain on my server. Instead, they hit the web view of my git repo, pulling each commit's link at high speed. This is about 1000x the load of a simple git clone and ignores my robots.txt file. So banned they are.

    • by allo ( 1728082 )

      Some basic respect for resources would be polite. A friend also crawled a few sites with his own scripts for different purposes, but always with time delays that slow the requests down. And if you crawl the whole web, you should be able to schedule your requests in a way that keeps the rate on any single site low. The only reason for a high rate at a single site would be to get more consistency across interdependent, changing data, but the brute-force crawlers have no idea what kind of data they are scraping.

  • This problem is one that CDNs have been working on for years. They're using ML to detect synthetic traffic and block it. The block rate vs. false positives is really good, in my experience. A few times we had some corp network step over the line into what looked like bot behavior and get blocked, but it's rare. Point is, use ML to detect AI bots, and let the CDN block them before they get anywhere near your servers. Most CDNs provide this even in the free tiers of service.
    Most of the traffic comes from Bytespider.
