

Cloudflare Turns AI Against Itself With Endless Maze of Irrelevant Facts (arstechnica.com) 62
Web infrastructure provider Cloudflare unveiled "AI Labyrinth" this week, a feature designed to thwart unauthorized AI data scraping by feeding bots realistic but irrelevant content instead of blocking them outright. The system lures crawlers into a "maze" of AI-generated pages containing neutral scientific information, deliberately wasting computing resources of those attempting to collect training data for language models without permission.
"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," Cloudflare explained. The company reports AI crawlers generate over 50 billion requests to their network daily, comprising nearly 1% of all web traffic they process. The feature is available to all Cloudflare customers, including those on free plans. This approach marks a shift from traditional protection methods, as Cloudflare claims blocking bots sometimes alerts operators they've been detected. The false links contain meta directives to prevent search engine indexing while remaining attractive to data-scraping bots.
"When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them," Cloudflare explained. The company reports AI crawlers generate over 50 billion requests to their network daily, comprising nearly 1% of all web traffic they process. The feature is available to all Cloudflare customers, including those on free plans. This approach marks a shift from traditional protection methods, as Cloudflare claims blocking bots sometimes alerts operators they've been detected. The false links contain meta directives to prevent search engine indexing while remaining attractive to data-scraping bots.
Source code (Score:2)
I hope there is something like this for code soon https://slashdot.org/story/25/... [slashdot.org]
Re: (Score:2)
This won't do what they claim (Score:4, Informative)
https://zadzmo.org/code/nepenthes/ [zadzmo.org]
But what I'm worried about is, what is an "unauthorized crawl?" Is this going to start restricting crawling for all CloudFuck protected websites so that we have fewer choices for new search engine startups? Google, DDG and Bing are hot garbage. They can say this is about AI, but it's also about polluting all new crawlers for the big bois.
Re:This won't do what they claim (Score:4, Interesting)
Re: This won't do what they claim (Score:2)
If that were true, a Google killer would be pretty easy to develop.
Re: This won't do what they claim (Score:4, Insightful)
Re: This won't do what they claim (Score:2)
Search absolutely pays a big portion of Alphabet's bills.
"2023 Total Google Search & Other Revenue: $175.04 billion"
Re: This won't do what they claim (Score:2)
Sorry, hit the "Submit" button before finishing up with $32B in search alone. A huge portion of their revenue comes from search.
Re: (Score:2)
The challenge of a Google-killer isn't necessarily a software problem, it's a data and infrastructure problem. They have 25 years of data that no one else has. That doesn't just include the crawling data, it's the data from search inputs and link clicks.
Google also has other sources of data no one else has. Google Analytics, which every commercial website feels compelled to use. Chrome, which has been the dominant browser for well over a decade. Gmail, which is either the largest or second largest email pro
Re: This won't do what they claim (Score:3)
Does it really make sense to index everything in a dead internet scenario?
Re: (Score:2)
That's actually an interesting question.
Like many of us, I remember a time before there were any useful search engines.
We largely survived on our personal collections of links, "web rings", and other topical link collections.
Re: (Score:2)
But... I don't use the internet for entertainment very much. I have no social media accounts. I do not doom scroll. I do no internet dating or gambling or watching youtube for hours. All my communication is over a private network, with open source apps, connecting to a small circle of friends and family. I do read some news sites. I mostly search for technical material, and answers to technical questions.
Re:This won't do what they claim (Score:5, Interesting)
Is this going to star restricting crawling for all CloudFuck protected websites so that we have fewer choices for new search engine startups
Search engines should be fine so long as you obey robots.txt and the noindex/nofollow meta tags.
If you are not, then that search engine can be considered Evil and part of the problem.
On the other hand this can be problematic for individuals scraping a site for legitimate personal reasons
Or tools designed to scan dark web sites to look for and inform users of evil on the part of websites themself (For example: Tools that help you find out if one of your passwords or your personal information was compromised by scraping certain forums known to be places where criminals leak that kind of info).
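The robots.txt rules mentioned above might look like the following. The bot names shown (GPTBot, CCBot) are real AI-crawler user agents; the specific rules are an illustrative sketch, not a recommendation:

```text
# robots.txt — block known AI training crawlers, throttle everyone else
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /admin/
```

Of course, as the thread notes, this only works against crawlers that choose to honour it.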
Re: This won't do what they claim (Score:2)
Or the Wayback Machine. The good thing is though, that I have a hard time imagining a website worthy of preservation that uses Cloudfuck.
Re: (Score:2)
Twitter and Slashdot use them. So do thousands of other major websites.
Others use alternate providers such as Fastly, but those other providers are also adding more and more anti-scraping features.
It's not just CF. Another important thing: CF's decision to add this also sets an example for other providers and individual sites to follow.
Re: This won't do what they claim (Score:2)
Tons and tons of websites use Cloudflare, including my own. It's free, its performance is great, and the GUI is excellent too - for my use case.
I don't get why people hate it so much.
Re: (Score:3)
I use DDG as my search engine, and afaik DDG is Bing without the tracking - they pass queries on to Bing after having stripped the identifying information.
More relevantly, yesterday I did a search on something obscure and one of the more interesting results returned had an absolutely ludicrous URL, one which had virtually no chance of being genuine.
Curious, I clicked on it (in a private window of course). Of course it was fake. Since I can't remember exactly what I was
Re: (Score:2)
A web site usually has a "robots.txt" file.
In the root folder.
That file contains hints, about how deep a crawler is "allowed" to crawl into the site.
A search engine is supposed to honour that.
And an AI scraper is supposed to do the same.
On top of that: most crawlers voluntarily identify themselves in the browser identification string as xyz.bot, where xyz might be Google, Yahoo, you name it.
On top of that, you are supposed to have a reasonable delay between web requests: in minutes. The site is not runnin
Re: (Score:2)
A web site usually has a "robots.txt" file. In the root folder. That file contains hints, about how deep a crawler is "allowed" to crawl into the site. A search engine is supposed to honour that. And an AI scraper is supposed to do the same.
Yes, "supposed to".
Re: (Score:3)
Search engine startups have the same problems as smaller browsers. Cloudflare does not give a fuck about them and ruins their web experience.
Re: (Score:2)
1) Show me a search engine startup that doesn't rely on querying Google and/or Microsoft APIs.
2) If a crawler respects the robots.txt rules then CloudFlare won't fuck them.
The AI crawlers that CloudFlare blocks lie about their user agent, suck up tons of bandwidth at the website owner's expense (especially since they get caught in weird loops where they crawl the site over and over again), and attempt to crawl admin pages and other non-public pages. AI web crawlers—just like their cousins, SEO optimiz
Re: (Score:2)
Well, it's one that ignores robots.txt, and you can easily see the behavior - you exclude the AI slop via robots.txt, and sure, people may stumble into it, but most thinking
Too lenient, feed them trash (Score:2)
Cloudflare is too lenient. They should provide an option for customers to feed unauthorised AI crawlers false data to poison them.
Re: (Score:3)
I'd say that is a smart idea, but it's probably best done by the website itself.
For example if Slashdot detects a client is an AI scraper, then feed it a series of randomized articles and comments, and some of them would be full of nonsensical language, incorrect grammar, random spelling errors, and absurd statements that would tend to corrupt AI models trained on it with false knowledge.
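The poisoning idea in that comment might be sketched like this. The function name and the corruption strategy (swapping adjacent words, transposing letters) are hypothetical — just one way a site could degrade text served to suspected scrapers:

```python
import random

def poison(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Corrupt text served to suspected AI scrapers.

    Randomly swaps adjacent words and transposes letters inside words
    at the given rate, producing subtly broken training data while
    staying superficially readable. Deterministic for a given seed.
    """
    rng = random.Random(seed)
    words = text.split()
    # Swap some adjacent word pairs to scramble grammar.
    for i in range(len(words) - 1):
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
    # Transpose two interior letters in some words to inject typos.
    out = []
    for w in words:
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(w) - 1)
            w = w[:j] + w[j + 1] + w[j] + w[j + 2:]
        out.append(w)
    return " ".join(out)
```

Note the corruption only shuffles what is already there, so the output stays statistically plausible — which is exactly what makes it hard for a training pipeline to filter out.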
Re:Too lenient, feed them trash (Score:5, Funny)
How does that differ from normal Slashdot?
Re: (Score:2)
It's very similar, but they get a special merit badge for helping to stave off the AI parasites currently bleeding the internet.
Re: Too lenient, feed them trash (Score:2)
It already will.
It's AI training being fed AI created stuff, and if you intentionally use a not-very good AI for the generation, it will drag down the AI that's trying to be trained. (Garbage in, garbage out) ... but this isn't really all that new. 25+ years ago when I worked for an ISP, we had problems with crawlers that were looking for email addresses. If we saw one violating our robots.txt, we would redirect it to a CGI that would very slowly (lots of sleep calls) randomly generate bogus email address
Re: (Score:2)
In tomorrow's story: Why can't Siri tell me what month it is. Okay well that was yesterday's story, but the point is the same. Everyone thinks this is a good idea right until they receive a product trained on trash and then they complain how it's just not good, enshittification, etc.
Re: (Score:2)
I fail to see the downside here. If consumers start viewing "AI" as signifying that a product is trash, society wins.
Re: Too lenient, feed them trash (Score:2)
The problem with that approach is that it then becomes a way for the crawlers to know if they've been detected. The counter attack is to waste the crawlers' resources. Keeping them in the dark improves the attack.
Re: (Score:2)
The web is full of spam. Feeding trash is a drop in the ocean.
Re: (Score:2)
Re: (Score:2)
That's a good way to get yourself arrested.
Garbage in, garbage out (Score:2)
Re: (Score:1)
The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.
Garbage in, Felonies Out? (Score:4, Interesting)
The whole AI idea is garbage, so it is irrelevant what goes in or out. It's just a way for corporations to gain even more control over their employees and to launder money by investing it in AI.
Whether or not you or I feel AI is garbage, someone better realize the billions being invested in this are coming from those In Control. Of companies. Of multinational corporations. Of investment firms. Of lawmakers and politicians.
Given those billions, I fully expect those In Control will suggest AI Poisoning will become a felony at best, and an act of domestic terrorism at worst. Very soon.
Re: Garbage in, Felonies Out? (Score:2)
"those In Control will suggest AI Poisoning will become a felony"
And the first amendment will shut them down.
Re: (Score:1)
What makes you think you'll get to argue your case?
Trump will just send you to a prison in another country like he's already doing to people.
Re: Garbage in, Felonies Out? (Score:2)
Re: Garbage in, garbage out (Score:3)
Now the AI bros are gonna say that this is an act of terrorism against progress that robs the whole of humanity of its bright future.
Re: (Score:2)
Re: (Score:2)
The robots.txt thing seems like a tell. (Score:1)
Re: (Score:2)
Cloudflare has a new product to sell. They don't care if it will work, as long as some large websites pay money for the AI bot protection. I won't be surprised if Cloudflare sells AI companies site data from their CDN without hitting the customer's site ...
Re: (Score:2)
The AI bot protection comes with the free Cloudflare plan.
AI bots waste bandwidth... that includes Cloudflare's bandwidth. I can't see Cloudflare selling AI companies the data from the CDNs because their whole business model relies on website owners viewing them as protection against nefarious actors, AI companies included. The payout from the AI companies wouldn't compensate for the loss of customers. Cloudflare has always been pretty smart about knowing who pays their bills.
Re: (Score:2)
The easiest way to avoid the problem is to obey the "robots.txt" file. (Or at least to notice when you aren't obeying it.)
Almost sounds like . . . (Score:3)
You are in a maze of twisty little passages, all alike.
ICE (Score:2)
Looks like the beginnings of Intrusion Countermeasures Electronics from the Neuromancer series of books by William Gibson.
Re: (Score:2)
Everything about our current implementation of AI looks like the Neuromancer series.
An infinite amount of entirely hallucinated content (Score:2)
Oh, the irony! (Score:2)
The internet is dead.
"AI" Causing Far More Harm to Humanity than Help (Score:2)
Hire me (Score:1)
How is that content any different (Score:2)
Nothing to see here -same story ran 20yrs ago (Score:1)