
Cloudflare Turns AI Against Itself With Endless Maze of Irrelevant Facts (arstechnica.com) 62

Web infrastructure provider Cloudflare unveiled "AI Labyrinth" this week, a feature designed to thwart unauthorized AI data scraping by feeding bots realistic but irrelevant content instead of blocking them outright. The system lures crawlers into a "maze" of AI-generated pages containing neutral scientific information, deliberately wasting computing resources of those attempting to collect training data for language models without permission.

"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," Cloudflare explained. The company reports AI crawlers generate over 50 billion requests to their network daily, comprising nearly 1% of all web traffic they process. The feature is available to all Cloudflare customers, including those on free plans. This approach marks a shift from traditional protection methods, as Cloudflare claims blocking bots sometimes alerts operators they've been detected. The false links contain meta directives to prevent search engine indexing while remaining attractive to data-scraping bots.
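The mechanism described above can be sketched in a few lines. This is an illustrative guess at the technique, not Cloudflare's actual implementation: each decoy page carries a robots meta tag asking well-behaved search engines not to index it, while linking deeper into an endless, deterministic maze (all paths and text here are made up).

```python
import hashlib

def maze_page(path: str, n_links: int = 5) -> str:
    """Build one decoy page: filler prose plus links deeper into the
    maze. Compliant search engines honor the noindex/nofollow meta
    directive; scrapers that ignore it keep following the links."""
    # Derive child paths deterministically from the current path, so
    # the maze is endless but needs no server-side state.
    children = [
        hashlib.sha256(f"{path}/{i}".encode()).hexdigest()[:12]
        for i in range(n_links)
    ]
    links = "\n".join(
        f'<a href="/maze/{c}">Further reading</a>' for c in children
    )
    return (
        "<html><head>"
        '<meta name="robots" content="noindex, nofollow">'
        "</head><body>"
        "<p>Plausible but irrelevant scientific prose would go here.</p>"
        f"{links}"
        "</body></html>"
    )
```

Because the child links are hashes of the current path, every page leads to five more pages, indefinitely, with no database behind it.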


Comments Filter:
  • I hope there is something like this for code soon https://slashdot.org/story/25/... [slashdot.org]

    • As the famous philosopher said, "Humans. You know, we give ourselves a bad rep, but we're genuinely empathetic as a species. I mean, we don't actually really want to kill each other. Which is a good thing... until" --Schopenhauer [wikiquote.org]
  • by SumDog ( 466607 ) on Saturday March 22, 2025 @01:18AM (#65251455) Homepage Journal
    There are already open source solutions for this:

    https://zadzmo.org/code/nepenthes/ [zadzmo.org]

    But what I'm worried about is: what is an "unauthorized crawl"? Is this going to start restricting crawling for all CloudFuck protected websites so that we have fewer choices for new search engine startups? Google, DDG and Bing are hot garbage. They can say this is about AI, but it's also about polluting all new crawlers for the big bois.
    • by martin-boundary ( 547041 ) on Saturday March 22, 2025 @01:42AM (#65251473)
      Search engines are a solved problem. The problem today is that the dominant search engines have moved *away* from the best approach, in favour of advertising, product placement, and general AI fuckery with the aim to monetize unsuspecting users.
      • If that were true, a Google killer would be pretty easy to develop.

        • by martin-boundary ( 547041 ) on Saturday March 22, 2025 @06:33AM (#65251703)
          Google's business is not search, and has not been search for about 20 years now. They started with search, and got funding, then diversified until they found their niche, which is advertising auctions. Search doesn't pay the bills.
        • The challenge of a Google-killer isn't necessarily a software problem, it's a data and infrastructure problem. They have 25 years of data that no one else has. That doesn't just include the crawling data, it's the data from search inputs and link clicks.

          Google also has other sources of data no one else has. Google Analytics, which every commercial website feels compelled to use. Chrome, which has been the dominant browser for well over a decade. Gmail, which is either the largest or second largest email pro

      • I get the idea that the internet is now floating in generated sewage. Do we even want to index and go to that?

        Does it really make sense to index everything in a dead internet scenario?
        • That's actually an interesting question.

          Like many of us, I remember a time before there were any useful search engines.

          We largely survived on our personal collections of links, "web rings", and other topical link collections.

          • I'm no fan of AI, one of the main reasons being that it's arguably a surveillance vector. I'd argue it plainly is one.

            But... I don't use the internet for entertainment very much. I have no social media accounts. I do not doomscroll. I don't do internet dating or gambling, or watch YouTube for hours. All my communication is over a private network, with open source apps, connecting to a small circle of friends and family. I do read some news sites. I mostly search for technical material, and answers to technical questions.
    • by mysidia ( 191772 ) on Saturday March 22, 2025 @02:32AM (#65251515)

      Is this going to start restricting crawling for all CloudFuck protected websites so that we have fewer choices for new search engine startups

      Search engines should be fine so long as they obey the robots.txt and Noindex/Nofollow meta tags.
      If they don't, then that search engine can be considered Evil and part of the problem.

      On the other hand, this can be problematic for individuals scraping a site for legitimate personal reasons

      Or for tools designed to scan dark web sites to look for and inform users of evil on the part of the websites themselves (for example: tools that help you find out whether one of your passwords or your personal information was compromised, by scraping certain forums known to be places where criminals leak that kind of info).


      • Or the Wayback Machine. The good thing is though, that I have a hard time imagining a website worthy of preservation that uses Cloudfuck.

        • by mysidia ( 191772 )

          Twitter and Slashdot use it. So do thousands of other major websites.

          Others use alternate providers such as Fastly, but those providers are also adding more and more anti-scraping features.
          It's not just CF. Another important thing: CF's decision to add this also sets an example for other providers and individual sites to follow.

        • Tons and tons of websites use Cloudflare, including my own. It's free, its performance is great, and the GUI is excellent too - for my use case.

          I don't get why people hate it so much.

    • Google, DDG and bing are hot garbage

      I use DDG as my search engine, and afaik DDG is Bing without the tracking - they pass queries on to Bing after having stripped the identifying information.
      More relevantly, yesterday I did a search on something obscure and one of the more interesting results returned had an absolutely ludicrous URL, one which had virtually no chance of being genuine.
      Curious, I clicked on it (in a private window of course). Of course it was fake. Since I can't remember exactly what I was

    • A web site usually has a "robots.txt" file.
      In the root folder.

      That file contains hints, about how deep a crawler is "allowed" to crawl into the site.

      A search engine is supposed to honour that.

      And an AI scraper is supposed to do the same.

      On top of that: most crawlers voluntarily identify themselves in the browser identification string as xyz.bot, where xyz might be google, yahoo, you name it.

      On top of that, you are supposed to have a reasonable delay between web requests: in minutes. The site is not runnin

      • by XXongo ( 3986865 )

        A web site usually has a "robots.txt" file. In the root folder. That file contains hints, about how deep a crawler is "allowed" to crawl into the site. A search engine is supposed to honour that. And an AI scraper is supposed to do the same.

        Yes, "supposed to".
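        The robots.txt conventions described above can be checked programmatically with Python's standard library; a minimal sketch, with hypothetical rules hardcoded instead of fetched from a live site:

        ```python
        from urllib.robotparser import RobotFileParser

        # Hypothetical robots.txt; a real crawler would fetch
        # https://example.com/robots.txt rather than hardcode it.
        RULES = """\
        User-agent: *
        Crawl-delay: 60
        Disallow: /private/
        """

        def allowed(agent: str, url: str) -> bool:
            """Return True if the rules above permit `agent` to fetch `url`."""
            rp = RobotFileParser()
            rp.parse(RULES.splitlines())
            return rp.can_fetch(agent, url)
        ```

        A crawler that runs a check like this before every request, and also honors the Crawl-delay, is exactly the kind of well-behaved bot the scheme is not aimed at.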

    • by allo ( 1728082 )

      Search engine startups have the same problems as smaller browsers. Cloudflare does not give a fuck about them and ruins their web experience.

      • 1) Show me a search engine startup that doesn't rely on querying Google and/or Microsoft APIs.

        2) If a crawler respects the robots.txt rules then CloudFlare won't fuck them.

        The AI crawlers that CloudFlare blocks lie about their user agent, suck up tons of bandwidth at the website owner's expense (especially since they get caught in weird loops where they crawl the site over and over again), and attempt to crawl admin pages and other non-public pages. AI web crawlers—just like their cousins, SEO optimiz

    • by tlhIngan ( 30335 )

      But what I'm worried about is: what is an "unauthorized crawl"? Is this going to start restricting crawling for all CloudFuck protected websites so that we have fewer choices for new search engine startups? Google, DDG and Bing are hot garbage. They can say this is about AI, but it's also about polluting all new crawlers for the big bois.

      Well, it's one that ignores robots.txt, and you can easily see the behavior - you exclude the AI slop via robots.txt, and sure, people may stumble into it, but most thinking

  • Cloudflare is too lenient. They should provide an option for customers to feed unauthorised AI crawlers false data to poison them.

    • by mysidia ( 191772 )

      I'd say that is a smart idea, but it's probably best done by the website itself.

      For example, if Slashdot detects that a client is an AI scraper, it could feed it a series of randomized articles and comments, some of them full of nonsensical language, incorrect grammar, random spelling errors, and absurd statements that would tend to corrupt AI models trained on them with false knowledge.

      • by msauve ( 701917 ) on Saturday March 22, 2025 @08:07AM (#65251785)
        >if Slashdot detects a client is an AI scraper, then feed it a series of randomized articles and comments, and some of them would be full of nonsensical language, incorrect grammar, random spelling errors, and absurd statements

        How does that differ from normal Slashdot?
        • by mysidia ( 191772 )

          It's very similar, but they get a special merit badge for helping to stave off the AI parasites currently bleeding the internet.

      • It already will.

        It's AI training being fed AI created stuff, and if you intentionally use a not-very good AI for the generation, it will drag down the AI that's trying to be trained. (Garbage in, garbage out) ... but this isn't really all that new. 25+ years ago when I worked for an ISP, we had problems with crawlers that were looking for email addresses. If we saw one violating our robots.txt, we would redirect it to a CGI that would very slowly (lots of sleep calls) randomly generate bogus email address

    • In tomorrow's story: why can't Siri tell me what month it is? Okay, well, that was yesterday's story, but the point is the same. Everyone thinks this is a good idea right up until they receive a product trained on trash, and then they complain about how it's just not good, enshittification, etc.

      • I fail to see the downside here. If consumers start viewing "AI" as signifying that a product is trash, society wins.

    • The problem with that approach is that it then becomes a way for the crawlers to know if they've been detected. The counterattack is to waste the crawlers' resources. Keeping them in the dark improves the attack.

    • by allo ( 1728082 )

      The web is full of spam. Feeding trash is a drop in the ocean.

    • Or really ratchet it up a notch, kiddie porn and then notify the feds someone is downloading kiddie porn and give the AI's address.
  • ChatGPT, Gemini, etc. will become trash. The intent is to discourage crawlers by making it unattractive to scrape these sites. Meanwhile, the LLMs fed by all the bots crawling them will all be crap. Garbage in, garbage out.
    • The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.

      • by geekmux ( 1040042 ) on Saturday March 22, 2025 @04:26AM (#65251623)

        The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.

        Whether or not you or I feel AI is garbage, someone had better realize that the billions being invested in this are coming from those In Control. Of companies. Of multinational corporations. Of investment firms. Of lawmakers and politicians.

        Given those billions, I fully expect those In Control will push for AI poisoning to become a felony at best, and an act of domestic terrorism at worst. Very soon.

    • Now the AI bros are gonna say that this is an act of terrorism against progress that robs all of humanity of its bright future.

  • Easy enough to skip those pages. Cloudflare is obviously so big that AI bots will adapt to its specific strategies. This would work well for a single site, but not for all of their customers. Also, I hope we're close to the point where we can admit that the bots can identify squares that contain a traffic light (motorcycle, staircase) as well as I can.
    • by allo ( 1728082 )

      Cloudflare has a new product to sell. They don't care whether it works, as long as some large websites pay money for the AI bot protection. I wouldn't be surprised if Cloudflare sold AI companies site data from their CDN without ever hitting the customer's site ...

      • The AI bot protection comes with the free Cloudflare plan.

        AI bots waste bandwidth... and that includes Cloudflare's bandwidth. I can't see Cloudflare selling AI companies the data from the CDNs, because their whole business model relies on website owners viewing them as protection against nefarious actors, AI companies included. The payout from the AI companies wouldn't compensate for the loss of customers. Cloudflare has always been pretty smart about knowing who pays their bills.

    • by HiThere ( 15173 )

      The easiest way to avoid the problem is to obey the "robots.txt" file. (Or at least to notice when you aren't obeying it.)

  • by quonset ( 4839537 ) on Saturday March 22, 2025 @07:29AM (#65251745)

    You are in a maze of twisty little passages, all alike.

  • by e3m4n ( 947977 )

    Looks like the beginnings of Intrusion Countermeasure Electronics from the Neuromancer series of books by William Gibson.

  • Reread the summary title. It describes the entirety of the internet today!

    The internet is dead.
  • We need laws to stop the widespread theft of the world's intellectual property.
  • I will write training data if you hire me. I'd feed it data to teach it to stubbornly ask for someone's pronouns, and to insist that the person it is talking to is not sure about his or her gender but does not realize it. Also teach it about the importance of asking about the car's extended warranty. It is very rude if you do not do that in a conversation. Also... teach it about the importance of correct spelling and grammar. Insist that the user correct everything before really replying to the input.
  • than much of what can already be found now?
  • Spider traps are easy to code. You just loop the bot through an endless maze of generated htaccess redirects until it gives up. This is the same story - different target - that we ran with 20 years ago.
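    A minimal sketch of such a trap, in Python rather than htaccess; this is illustrative only, and all paths and words are made up. The generator drips bogus links with a sleep between chunks, so a misbehaving crawler ties up its connection while receiving nothing of value:

    ```python
    import itertools
    import random
    import time

    def tarpit(seed: int = 0, delay: float = 2.0, limit: int | None = None):
        """Endlessly yield chunks of bogus link markup, sleeping between
        chunks to slow the crawler down. `limit` exists only for testing;
        a real trap would stream forever."""
        rng = random.Random(seed)
        words = ["quantum", "lattice", "isotope", "manifold", "enzyme"]
        for i in itertools.count():
            if limit is not None and i >= limit:
                return
            # Each link points back into the trap, so following any of
            # them just produces more of the same.
            slug = "-".join(rng.choices(words, k=3))
            yield f'<a href="/trap/{slug}-{i}">{slug}</a>\n'
            time.sleep(delay)
    ```

    Seeding the RNG keeps the output deterministic per path, so the trap looks like stable content to a crawler that revisits it.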
