AI Technology

Cloudflare Turns AI Against Itself With Endless Maze of Irrelevant Facts (arstechnica.com)

Web infrastructure provider Cloudflare unveiled "AI Labyrinth" this week, a feature designed to thwart unauthorized AI data scraping by feeding bots realistic but irrelevant content instead of blocking them outright. The system lures crawlers into a "maze" of AI-generated pages containing neutral scientific information, deliberately wasting computing resources of those attempting to collect training data for language models without permission.

"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," Cloudflare explained. The company reports AI crawlers generate over 50 billion requests to their network daily, comprising nearly 1% of all web traffic they process. The feature is available to all Cloudflare customers, including those on free plans. This approach marks a shift from traditional protection methods, as Cloudflare claims blocking bots sometimes alerts operators they've been detected. The false links contain meta directives to prevent search engine indexing while remaining attractive to data-scraping bots.

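Cloudflare hasn't published the decoy markup itself; the sketch below (Python, with every URL, title, and filler line invented) is only a guess at how such a page might combine the two details the announcement does give: links that lead nowhere but deeper into the maze, and a robots meta directive so that rule-abiding search engines skip the pages.

    # Hypothetical sketch of one "labyrinth" decoy page; all names invented.
    import secrets

    def decoy_page(depth: int) -> str:
        """Render one maze page: noindex/nofollow plus links to more decoys."""
        next_links = "\n".join(
            f'<a href="/maze/{depth + 1}/{secrets.token_hex(8)}">Further reading</a>'
            for _ in range(5)
        )
        return f"""<!DOCTYPE html>
    <html><head>
      <!-- Keeps directive-respecting search engines out of the maze -->
      <meta name="robots" content="noindex, nofollow">
      <title>Notes on membrane transport</title>
    </head>
    <body>
      <p>(Neutral, AI-generated filler text would go here.)</p>
      {next_links}
    </body></html>"""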

Comments Filter:
  • I hope there is something like this for code soon https://slashdot.org/story/25/... [slashdot.org]

    • As the famous philosopher said, "Humans. You know, we give ourselves a bad rep, but we're genuinely empathetic as a species. I mean, we don't actually really want to kill each other. Which is a good thing... until" --Schopenhauer [wikiquote.org]
  • by SumDog ( 466607 ) on Saturday March 22, 2025 @01:18AM (#65251455) Homepage Journal
    There are already open source solutions for this:

    https://zadzmo.org/code/nepenthes/ [zadzmo.org]

    But what I'm worried about is: what counts as an "unauthorized crawl"? Is this going to start restricting crawling for all CloudFuck-protected websites, so that we have fewer choices for new search engine startups? Google, DDG and Bing are hot garbage. They can say this is about AI, but it's also about polluting all new crawlers for the big bois.
    • Search engines are a solved problem. The problem today is that the dominant search engines have moved *away* from the best approach, in favour of advertising, product placement, and general AI fuckery aimed at monetizing unsuspecting users.
    • by mysidia ( 191772 ) on Saturday March 22, 2025 @02:32AM (#65251515)

      Is this going to start restricting crawling for all CloudFuck-protected websites, so that we have fewer choices for new search engine startups

      Search engines should be fine so long as you obey robots.txt and the noindex/nofollow meta tags.
      If you don't, then that search engine can be considered Evil and part of the problem.

      On the other hand, this can be problematic for individuals scraping a site for legitimate personal reasons.

      Or for tools designed to scan dark web sites to look for and inform users of evil on the part of the websites themselves (for example: tools that help you find out whether one of your passwords or your personal information was compromised, by scraping certain forums known to be places where criminals leak that kind of info).


      • Or the Wayback Machine. The good thing, though, is that I have a hard time imagining a website worthy of preservation that uses Cloudfuck.

        • by mysidia ( 191772 )

          Twitter and Slashdot use them. So do thousands of other major websites.

          Others use alternate providers such as Fastly, but those other providers are also adding more and more anti-scraping features.
          It's not just CF. Another important thing: CF's decision to add this also sets an example for other providers and individual sites to follow.

    • Google, DDG and Bing are hot garbage

      I use DDG as my search engine, and afaik DDG is Bing without the tracking - they pass queries on to Bing after having stripped the identifying information.
      More relevantly, yesterday I did a search on something obscure and one of the more interesting results returned had an absolutely ludicrous URL, one which had virtually no chance of being genuine.
      Curious, I clicked on it (in a private window of course). Of course it was fake. Since I can't remember exactly what I was

    • A web site usually has a "robots.txt" file in its root folder.

      That file contains hints about how deep a crawler is "allowed" to crawl into the site.

      A search engine is supposed to honour that.

      And an AI scraper is supposed to do the same.

      On top of that: most crawlers voluntarily identify themselves in the browser identification string as xyz.bot, where xyz might be Google, Yahoo, you name it.

      On top of that, you are supposed to have a reasonable delay between web requests: in minutes. The site is not runnin
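      For reference, the polite-crawler check is short; a minimal sketch using Python's stdlib urllib.robotparser (the site URL, page URL, and agent name are all placeholders):

        import time
        import urllib.robotparser

        # Fetch and parse the site's robots.txt (placeholder URL).
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("https://example.com/robots.txt")
        rp.read()

        USER_AGENT = "examplebot"  # polite crawlers identify themselves honestly

        if rp.can_fetch(USER_AGENT, "https://example.com/some/page"):
            # Honour a declared Crawl-delay, defaulting to a generous pause.
            time.sleep(rp.crawl_delay(USER_AGENT) or 60)
            # ... fetch the page here ...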

      • by XXongo ( 3986865 )

        A web site usually has a "robots.txt" file in its root folder. That file contains hints about how deep a crawler is "allowed" to crawl into the site. A search engine is supposed to honour that. And an AI scraper is supposed to do the same.

        Yes, "supposed to".

    • by allo ( 1728082 )

      Search engine startups have the same problems as smaller browsers. Cloudflare does not give a fuck about them and ruins their web experience.

  • Cloudflare is too lenient. They should provide an option for customers to feed unauthorised AI crawlers false data to poison them.

    • by mysidia ( 191772 )

      I'd say that is a smart idea, but it's probably best done by the website itself.

      For example, if Slashdot detects that a client is an AI scraper, it could feed it a series of randomized articles and comments, some of them full of nonsensical language, incorrect grammar, random spelling errors, and absurd statements that would tend to corrupt AI models trained on them with false knowledge.
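      A minimal sketch of the idea with nothing but the stdlib; the user-agent strings below are real crawler names, but matching on User-Agent alone is far cruder than real detection:

        import random
        from http.server import BaseHTTPRequestHandler, HTTPServer

        SCRAPER_AGENTS = ("GPTBot", "CCBot", "Bytespider")  # illustrative matches
        WORDS = ["teh", "quantum", "obvioulsy", "cheese", "paradigm", "sideways"]

        def nonsense_article(n_words: int = 200) -> str:
            """Deliberately corrupted filler for detected scrapers."""
            return " ".join(random.choice(WORDS) for _ in range(n_words))

        class PoisonHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                ua = self.headers.get("User-Agent", "")
                if any(bot in ua for bot in SCRAPER_AGENTS):
                    body = nonsense_article()   # garbage for suspected scrapers
                else:
                    body = "The real article."  # normal content for everyone else
                self.send_response(200)
                self.send_header("Content-Type", "text/plain; charset=utf-8")
                self.end_headers()
                self.wfile.write(body.encode())

        if __name__ == "__main__":
            HTTPServer(("", 8000), PoisonHandler).serve_forever()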

      • by msauve ( 701917 )
        >if Slashdot detects that a client is an AI scraper, it could feed it a series of randomized articles and comments, some of them full of nonsensical language, incorrect grammar, random spelling errors, and absurd statements

        How does that differ from normal Slashdot?
        • by mysidia ( 191772 )

          It's very similar, but they get a special merit badge for helping to stave off the AI parasites currently bleeding the internet.

      • It already will.

        It's AI training being fed AI-created stuff, and if you intentionally use a not-very-good AI for the generation, it will drag down the AI that's being trained. (Garbage in, garbage out.) ... but this isn't really all that new. 25+ years ago, when I worked for an ISP, we had problems with crawlers that were looking for email addresses. If we saw one violating our robots.txt, we would redirect it to a CGI that would very slowly (lots of sleep calls) randomly generate bogus email addresses.
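        That kind of tarpit is only a few lines; a sketch of the idea as a CGI-style script (every address and domain invented, delay arbitrary):

          #!/usr/bin/env python3
          # Tarpit sketch: drip randomly generated, bogus email addresses to a
          # crawler that ignored robots.txt; the sleeps make each hit expensive.
          import random
          import string
          import sys
          import time

          def fake_email() -> str:
              user = "".join(random.choices(string.ascii_lowercase, k=8))
              domain = "".join(random.choices(string.ascii_lowercase, k=6))
              return f"{user}@{domain}.example"

          print("Content-Type: text/html")
          print()  # blank line ends the CGI headers
          print("<html><body>")
          for _ in range(1000):
              print(f'<a href="mailto:{fake_email()}">{fake_email()}</a><br>')
              sys.stdout.flush()  # push each line out immediately...
              time.sleep(5)       # ...then make the crawler wait for the next
          print("</body></html>")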

    • In tomorrow's story: why can't Siri tell me what month it is? Okay, that was yesterday's story, but the point is the same. Everyone thinks this is a good idea right up until they receive a product trained on trash, and then they complain about how it's just not good, enshittification, etc.

    • The problem with that approach is that it then becomes a way for the crawlers to know whether they've been detected. The counterattack is to waste the crawlers' resources; keeping them in the dark improves the attack.

    • by allo ( 1728082 )

      The web is full of spam. Feeding trash is a drop in the ocean.

  • ChatGPT, Gemini, etc. will become trash. The intent is to discourage crawlers by making it unattractive to scrape these sites. Meanwhile, the LLMs fed by all the bots crawling them will all be crap. Garbage in, garbage out.
    • The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.

      • The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.

        Whether or not you or I feel AI is garbage, someone had better realize that the billions being invested in this are coming from those In Control. Of companies. Of multinational corporations. Of investment firms. Of lawmakers and politicians.

        Given those billions, I fully expect those In Control to push for AI poisoning to be treated as a felony at best, and as an act of domestic terrorism at worst. Very soon.

    • Now the AI bros are gonna say that this is an act of terrorism against progress that robs the whole of humanity of its bright future.

  • Easy enough to skip those pages. Cloudflare is obviously so big that AI bots will adapt to its specific strategies. This would work well for a single site, but not for all of their customers. Also, I hope we're close to the point where we can admit that the bots can identify squares containing a traffic light (motorcycle, staircase) as well as I can.
    • by allo ( 1728082 )

      Cloudflare has a new product to sell. They don't care whether it works, as long as some large websites pay money for the AI bot protection. I wouldn't be surprised if Cloudflare sold AI companies site data straight from their CDN without ever hitting the customer's site ...

    • by HiThere ( 15173 )

      The easiest way to avoid the problem is to obey the "robots.txt" file. (Or at least to notice when you aren't obeying it.)

  • by quonset ( 4839537 ) on Saturday March 22, 2025 @07:29AM (#65251745)

    You are in a maze of twisty little passages, all alike.

  • by e3m4n ( 947977 )

    Looks like the beginnings of the Intrusion Countermeasures Electronics from the Neuromancer series of books by William Gibson.

  • Reread the summary title. It describes the entirety of the internet today!

    The internet is dead.
  • We need laws to stop the widespread theft of the world's intellectual property.
  • I will write training data if you hire me. I'd feed it data to teach it to stubbornly ask for someone's pronouns, and to insist that the person it is talking to is unsure about his or her gender but does not realize it. Also teach it about the importance of asking about the car's extended warranty; it is very rude not to do that in a conversation. Also... teach it about the importance of correct spelling and grammar. Insist that the user correct everything before really replying to the input.