

Perplexity is Using Stealth, Undeclared Crawlers To Evade Website No-Crawl Directives, Cloudflare Says (cloudflare.com)
AI startup Perplexity is deploying undeclared web crawlers that masquerade as regular Chrome browsers to access content from websites that have explicitly blocked its official bots, according to a Cloudflare report published Monday. When Perplexity's declared crawlers encounter robots.txt restrictions or network blocks, the company switches to a generic Mozilla user agent that impersonates "Chrome/124.0.0.0 Safari/537.36" running on macOS, the web infrastructure firm reported.
Cloudflare engineers tested the behavior by creating new domains with robots.txt files prohibiting all automated access. Despite the restrictions, Perplexity provided detailed information about the protected content when queried, while the stealth crawler generated 3-6 million daily requests across tens of thousands of domains. The undeclared crawler rotated through multiple IP addresses and network providers to evade detection.
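For contrast, this is what a well-behaved crawler is expected to do before fetching anything: parse robots.txt and honor it. A minimal sketch using Python's standard `urllib.robotparser` (the robots.txt body and URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body prohibiting all automated access,
# like the one on Cloudflare's test domains.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True only if robots.txt permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)

# A compliant crawler must stop here. The stealth behavior described above
# amounts to skipping this check and swapping in a browser user agent.
print(is_allowed(ROBOTS_TXT, "PerplexityBot", "https://example.com/page"))  # False
```

The check costs one HTTP fetch per site; there is no technical obstacle to doing it, only a commercial incentive not to.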
Imagine the material... (Score:1, Insightful)
Re: (Score:2, Troll)
Re: Imagine the material... (Score:5, Informative)
Huh? They aren't accessing anything secret. Everything they are getting is publicly available to any human surfer. It's just that the content has been marked off-limits to crawlers and they are masquerading as human to get around the restriction.
Re: (Score:2)
Re: Imagine the material... (Score:2)
Re: Imagine the material... (Score:2)
How come this argument doesn't hold when it comes to the Internet Archive, and now we're discussing them having to delete stuff that was openly available on the "public" internet?
Re: (Score:2)
It holds there too. The trouble is caused by thieves claiming IP ownership of the content in their archives.
Re: (Score:1)
There is a difference between "secret" and "Don't Crawl Our Site".
It's actually almost impossible to masquerade as a human; even throttled crawlers are easily identifiable through the many distinctive, often malicious, traits they exhibit.
The kleptocracy of AI (and other) crawlers is what's at issue.
Re: Imagine the material... (Score:5, Informative)
Please share the secret. Currently about 70% of the traffic to some sites I run is from badly behaving robots puppeting residential IPs.
About the only trick that works fairly well is hidden poison links (effectively, "click here to ban your IP"). Some robot farms coordinate the search space, so you have to leave them all over, and they're using huge amounts of IPs, so you'll still get beat to shit.
What is your magic sauce?
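The hidden-poison-link trick described above can be sketched in a few lines: a trap path that no human-visible navigation references, and a handler that bans any IP requesting it. Path names and the response codes are made up for illustration.

```python
# Hypothetical trap paths, linked only from HTML invisible to humans,
# e.g. <a href="/honeypot/do-not-follow" style="display:none">.
TRAP_PATHS = {"/honeypot/do-not-follow", "/secret/admin-backup"}

banned_ips: set[str] = set()

def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status: 403 for banned IPs; hitting a trap path bans."""
    if ip in banned_ips:
        return 403
    if path in TRAP_PATHS:
        banned_ips.add(ip)  # "click here to ban your IP"
        return 403
    return 200

print(handle_request("203.0.113.7", "/index.html"))              # 200
print(handle_request("203.0.113.7", "/honeypot/do-not-follow"))  # 403, now banned
print(handle_request("203.0.113.7", "/index.html"))              # 403
```

As the commenter notes, coordinated robot farms split the search space, so traps have to be scattered across many pages to catch each participating IP.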
Re: Imagine the material... (Score:4, Informative)
Get Wordfence if your site is Wordpress. The controls inside (free version) are enough to rate-limit crawlers effectively.
If you don't have Wordpress, your choices are more complex; you MUST front-end your site with an IP filtering system and rate-limit everyone methodically. Crawlers eventually quit.
Many crawlers identify themselves in the GET/POST sequence, so you have to parse those. If you understand fail2ban conceptually, that's the method: group like-type GETs that arrive at suspiciously high rates, along with folder traversals. Accumulate your list and ban them/null-route/block, or whatever your framework permits.
Yes, you can blackhole through various famous time-wasters, but this also drags down your site performance. Captchas and the like are becoming easier to fool, and for that reason they're not a good strategy.
Once you decide on a filtering strategy, monitor it. Then share your IP ban list with others. Ban the entire CIDR block, because crawlers will attack using randomized IPs within their block. If you get actual customers/viewers, monitor your complaint box and put them on your exemption list.
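The fail2ban-style strategy above can be sketched as a sliding-window counter: track each IP's recent requests, and once one exceeds a rate threshold, ban its whole CIDR block, since crawler farms rotate randomized IPs within a block. The window length, threshold, and /24 block size are illustrative choices, not recommendations.

```python
import ipaddress
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (illustrative)
MAX_REQUESTS = 100    # requests allowed per window (illustrative)

hits: dict[str, deque] = defaultdict(deque)
banned_blocks: set = set()

def is_banned(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in banned_blocks)

def record_request(ip: str, now: float) -> bool:
    """Record one request; return True if it should be blocked."""
    if is_banned(ip):
        return True
    window = hits[ip]
    window.append(now)
    # Drop timestamps that have aged out of the window.
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) > MAX_REQUESTS:
        # Ban the entire /24, per the advice above.
        banned_blocks.add(ipaddress.ip_network(ip + "/24", strict=False))
        return True
    return False
```

In practice the ban set would feed a null route or firewall rule rather than live in the application, and an exemption list for real customers sits in front of the check.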
Re: (Score:2)
I have a fairly custom stack, which is a liability a lot of the time, but can be useful in cases like this. Being different from the herd means attackers spend most of their time on the herd's mitigations instead of whatever weird shit I'm up to.
Re: (Score:2)
There's a lot of info available about attack mitigation (or just hungry crawlers) and how to avoid/blackhole them. Problem is, you have to have control of portions of the network stack to do them effectively.
Security through obscurity only works so long as you can be obscure, which is part of the vibe of the post. It's really stressful sometimes, depending on what's hosted.
Of the sites I don't have behind Cloudflare, the assets aren't worth anything and I truly don't care if they show up in AI. Otherwise, w
Re: (Score:2)
Check out anubis [github.com]
I think it works by making the browser do some JS math that's unnoticeable to a single user, and if an AI crawler implements the JS to do the math, it's too costly in processing power at the scale they operate.
Re: (Score:2)
This is one of the best methods I've seen so far. Make it an economic issue, and voila!
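A rough sketch of the economic idea (not Anubis's actual scheme): the server issues a challenge, and the client must find a nonce whose SHA-256 hash falls below a difficulty target. Verifying a solution is one hash; finding one costs thousands, and that cost multiplies across the millions of daily requests a crawler makes. The challenge string and difficulty here are arbitrary.

```python
import hashlib

def solve(challenge: str, difficulty_bits: int) -> int:
    """Brute-force a nonce whose sha256(challenge:nonce) has the required leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap server-side check of a submitted nonce: a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

# ~4096 hashes on average at 12 bits: trivial once per human visitor,
# expensive when repeated millions of times per day.
nonce = solve("session-token-abc123", difficulty_bits=12)
print(verify("session-token-abc123", nonce, 12))  # True
```

The asymmetry is the point: the site does O(1) work per request while the crawler pays the brute-force cost every time.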
Re: (Score:2)
If you can afford it, also consider Cloudflare; their bot identification is really good. You can use defaults or make your own filters. They're not the only ones that do this, but my experience with them has been positive. Much depends on your skills in how the web actually works, network + site interaction.
Their protections are cheap for the quality/speed. All of the large sites I manage are behind Cloudflare, including their DNS. Their DNS management is superior, and has interesting tricks for mixed-media
Re: (Score:1)
Tell us you don't get it without saying "I'm a clueless twat!"
These scum are sucking in content created by others and surfacing it as their own. If there's no content to consume for regular browsers the internet will be what you and your ilk have always dreamed of. An echo chamber of recursive garbage.
What?! (Score:5, Funny)
Next you're going to tell me that they just ignore robots.txt. Surely they would never do that just for their own selfish interests?
Re:What?! (Score:4, Insightful)
Makes Google look like the white knight.
Re: (Score:3, Interesting)
Time to change the payment model. (Score:2, Insightful)
ISPs just burn the candle at both ends, charging those who produce content AND those who consume content, for internet access. Sometimes they find ways to burn the candle in the middle, too.
Instead, content producers should GET paid per web request. And the payor should be the person making the web request. ISPs would just skim off the top.
Automated crawling would suddenly become a lot more expensive (good), but website owners wouldn't mind since they get paid for that instead of paying for that (good),
Re:Time to change the payment model. (Score:5, Insightful)
Instead, content producers should GET paid per web request. And the payor should be the person making the web request. ISPs would just skim off the top.
The internet is more than just the web, and this is just a bizarre proposal. If you think bandwidth caps are bad, just wait until you can get charged per-connection fees.
Re: (Score:2)
Re: (Score:3)
Re: (Score:2)
users vote with their wallet simply browsing to whatever isn't charging them money.
FTFY.
The idea of charging people per site (let alone per request), died a long time ago. It's NOT coming back, no matter how much good it might do. People get pissed over the idea of paying for yet another subscription. They aren't about to pay $10/month just to access one of TikTok / Facebook / Bluesky / X / etc. Let alone the massive restructuring it would take for all of those sites wanting to push ADs. (ADs would have to be free to request or the users would abandon the site instantly, but they won't
Re: (Score:2)
These aren't circuit-switched networks, they're packet-switched.
Our routers consume wattage per packet, not per constant bitrate circuit.
And could you imagine having to deal with routing through
pngs! (Score:1)
Publish all webpages as PNGs made on the fly. Maybe that won't stop AI webcrawlers, I guess they'd just resort to OCR to glean the info.
Re: (Score:1)
Not just that, they would use AI to OCR, so their energy consumption would increase substantially. Every problem is a nail.
Re: (Score:3)
Re: (Score:3, Insightful)
Public is public (Score:2)
If it's accessible, it's public
To keep things private, don't put them on the internet or use encryption
The old "robots.txt" idea only works when robots politely obey the rules
Re: (Score:2)
Re:Public is public (Score:5, Insightful)
And that problem is spreading into the real world - it existed but now it's growing and we mere sheep can't really do anything either, just like on the internet.
The Internet has already gone into the "Public" (Score:2)
Many things are necessary for society to continue; one of them is being respectful of others' wishes. If anything, the current state of the internet is a very sad commentary on how people act when there is no one to slap them in the face and say 'don't do that'.
On the internet? Have you not seen all the broccoli heads doing wanker shite on TikTok for views -- all outside in the real world? Being "disrespectful" on the internet is so 00s. We are already living with 2 generations who proliferated it into the "touch grass" part.
Re: Public is public (Score:2)
Re:Why are you pissed now? - Block them (Score:5, Interesting)
Re: (Score:2)
I block by ASN now. I have a list of about 20-30 ASNs that removes about 99% of the attack traffic. Unfortunately, that includes DO.
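ASN-level blocking boils down to mapping each ASN to its announced prefixes and rejecting anything inside them. The ASN-to-prefix table below is entirely made up (real data would come from an IP-to-ASN feed or whois, and the ASNs are from the documentation range), but the membership check is the real mechanism:

```python
import ipaddress

# Hypothetical ASN -> announced-prefix table; in practice this is pulled
# from an IP-to-ASN database, not hard-coded.
BLOCKED_ASNS = {
    64496: ["198.51.100.0/24", "203.0.113.0/24"],  # fictional hosting ASN
    64511: ["192.0.2.0/24"],                        # fictional cloud ASN
}

BLOCKED_NETS = [
    ipaddress.ip_network(prefix)
    for prefixes in BLOCKED_ASNS.values()
    for prefix in prefixes
]

def is_blocked(ip: str) -> bool:
    """Reject a request whose source IP sits inside any blocked ASN's prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)

print(is_blocked("198.51.100.42"))  # True: inside a blocked ASN's prefix
print(is_blocked("93.184.216.34"))  # False
```

The collateral damage the commenter mentions follows directly: blocking an ASN blocks every customer of that provider, legitimate or not.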
It is truly automated access (Score:4, Interesting)
Re: (Score:3, Interesting)
Just because something can be done doesn't mean it should be done. This is how we lose 'civilized society'. The internet is warning us, and we won't listen.
Re: (Score:3)
Re: (Score:2)
If anything, the person making the AI prompt should be held legally responsible for the AI's actions. If the AI crashes a site, the person who wrote the prompt to do so gets sued for damages by the site's owners.
Re: (Score:2)
AI (LLM) crawlers do not activate upon user queries at all. The crawlers are used to build the training dataset.
User queries only trigger inference. As I understand it, inference is effectively working from a ROM image. There is just no trigger to go from inference to training.
Re: (Score:2)
This was true a while back, but is totally different with agentic AI.
Re: (Score:1)
Re: (Score:2)
I don't think there needs to be any legally binding contract?
This is basically the "just because the door is unlocked doesn't mean you can help yourself to my toolshed" that they hit people who access systems with all the time.
Re: (Score:2)
I've had huge arguments over this with some security researcher. Their main argument was that if it is online and accessible, it is fair game. Including to monetize the content. I think I'm still worked up over it tbh...
Re: (Score:3)
The language in the statute is "exceeding authorized access"
Re: (Score:2)
Almost sounds like trespassing.
(1) Unsurprising (2) Ironic (Score:5, Interesting)
I've been here a long, long time. And this is one of the worst things I've ever seen. And it's all to feed the insatiable egos and greed of the tech bros who've bet the farm on AI and have yet to realize that "garbage in, garbage out" still applies no matter how much computing capacity you throw at it.
(2) It's ironic that Cloudflare, of all operations, would whine about someone else's abusive conduct. Here's an exercise for the reader: read the article here Scammers Unleash Flood of Slick Online Gaming Sites - Krebs on Security [krebsonsecurity.com]. Then follow the link he provides to the list of domains involved in this. Now look where they're (almost) all hosted.
People with outdated browsers living in DCs (Score:3)
My site gets a lot of traffic by people using quite outdated browsers, all those IPs trace back to data centers.
I've created fail2ban rules for certain patterns. Once those IPs are blocked, different IPs with the same behavior show up.
It is not just Perplexity.
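One pattern the comment above describes, outdated browser versions coming from data-center IPs, is easy to flag. A sketch of the idea: parse the claimed Chrome major version out of the user agent and treat anything implausibly old as suspicious. The version cutoff is illustrative, and note that a crawler spoofing a current Chrome (like the Chrome/124 string in the summary) sails right past this check.

```python
import re

# Treat Chrome major versions below this cutoff as suspicious (illustrative).
MIN_CHROME_MAJOR = 120

CHROME_RE = re.compile(r"Chrome/(\d+)\.")

def is_suspicious_agent(user_agent: str) -> bool:
    """True if the UA claims an implausibly outdated Chrome version."""
    match = CHROME_RE.search(user_agent)
    return match is not None and int(match.group(1)) < MIN_CHROME_MAJOR

print(is_suspicious_agent(
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/79.0.3945.79 Safari/537.36"
))  # True: a years-old Chrome is almost certainly a bot
print(is_suspicious_agent(
    "Mozilla/5.0 (Macintosh) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36"
))  # False: a current version string passes
```

In a fail2ban setup this predicate would run over access-log lines, with matching IPs fed to the ban action.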
Anubis for the win! (Score:2)
Check out Anubis [techaro.lol] to block these rogue crawlers.
Re: (Score:2)
Re: (Score:1)
Anubis isn't there just as a blocker - it's there to also make it computationally infeasible for companies to repeatedly spam servers with junk requests.
An average Joe/Jane Public doesn't usually care if they have to wait a few moments once to access a site. But that computational cost is soon going to add up for bad actors.
Internet of Old (Score:3)
The age of the internet where requests are honored has long been over. The fact that it took this long is more an anomaly than anything.
Websites are going the way of the Dodo due to AI. Once the data has been captured by AI, it is owned by AI.
Re: (Score:1)
This is exactly it. The cat is out of the bag. The world is forever changed and there is nothing anyone can do about it. It will only get worse (or "better" if you're on the AI's side).
It's like the printing press or any other number of technologies. Oh no, it lets people copy text without even knowing how to write, etc. AI is the same.
Impossible to predict what's going to happen in the future when any person on the planet can ask AI for a detailed roadmap on how to secretly build a nuclear bomb in their ga
Re: (Score:3)
On the other hand, on the internet of old, everything published on the net was actually *meant* to be public. That robots.txt file was just meant to keep search engines from cataloging a bunch of garbage, not to make the pages unavailable or secret.
Re: (Score:2)
Re: (Score:2)
Unfortunately, I doubt the "rogue" engines are actually that dumb.
Surprised? (Score:3)
Nice that they call out a specific bad actor, but I always assumed bots were doing this all along.
It's actually surprisingly easy to bypass Cloudflare's bot filters just by setting the right user agent. Their bot detection technology doesn't seem to be as sophisticated as they claim.
Re: (Score:2)
As a site admin that used Cloudflare, I say you are wrong.
We have several settings we can use to fine-tune things, but a bot score equal to 1 really is a bot... and that doesn't even look at the user-agent.
Also, trying to fake Googlebot or another known bot will trigger the "fake googlebot" rule, which you can block or allow.
Re: (Score:2)
Sure, there are lots of settings you can tinker with. It's a bit of security theater. It will gladly block lazily-coded bots with a useragent of Java-http-client/17.0.10, but from the other side, even if you have a known AWS IP address, some very basic steps will let you through. Just change your useragent to Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.6998.166 Safari/537.36 and if you go low and slow, you'll get through. Of course you can ratchet up the pai
Re: (Score:2)
You need to research JA3, JA4, bot score, attack score, and bot detection ID.
There is a lot more than user-agent... I actually only use user-agent to whitelist some requests from certain ASNs; most of the time, they are just totally ignored.
Bot score, on the other hand... bot score == 1 means it's 100% certain to be a bot... and many of those have Chrome or Firefox user-agents.
JA3 and JA4 are standards; go read Wikipedia.
Bot score uses machine learning and static rules to give a score to a request. Score 1 is a bot, score 100
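For reference, a JA3 fingerprint is just an MD5 over fields of the TLS ClientHello, which is why it sees through a spoofed user agent: two clients with identical TLS stacks hash identically no matter what UA string they send. A minimal sketch of the computation (the ClientHello field values below are made up):

```python
import hashlib

def ja3_fingerprint(version: int, ciphers: list, extensions: list,
                    curves: list, point_formats: list) -> str:
    """JA3: MD5 of comma-separated ClientHello fields, each list dash-joined."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Made-up handshake values; changing the User-Agent header changes none of them.
fp = ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10], [29, 23, 24], [0])
print(fp)
```

Matching a fingerprint to a known automation stack (while the headers claim desktop Chrome) is exactly the kind of mismatch a bot score feeds on.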
Naming the bad actors... (Score:3)
When... (Score:2)
... is someone who can claim "standing" going to file *criminal* charges?
Re: (Score:2)
Never, because that's not how criminal prosecution works at all.
Victims report a crime, and prosecutors decide if charges are filed, and to what end.
A party has to have "standing" to file a civil lawsuit. Nothing to do with criminal court.
Re: (Score:2)
Technically, the one who would "have standing" to file criminal charges would be the prosecutor who is assigned the case after a careful investigation.
Not holding my breath, though.
Re:When... (Score:4, Insightful)
Technically, the one who would "have standing" to file criminal charges would be the prosecutor who is assigned the case after a careful investigation.
Technically, just The Government. The Prosecutor merely acts as their agent- but ya. I wouldn't hold your breath, either.
Even though it's a pretty clear violation of the CFAA, they really only enforce that against solitary kids that scrape research papers.
Technical solution (Score:3)
Re: Technical solution (Score:2)
Re: Technical solution (Score:2)
That's shocking! (Score:3)
Imagine that. There are bad actors out there (Score:1)
I'm completely surprised. I can't believe this has happened! There are bad actors out there? Color me surprised! Look guys, you can't just use the userid "rms" with a blank password. People are evil. Grow up. Get out of your naive phase.
AI = parasite (Score:2)
There's nothing new under the sun (Score:2)
... these are the same sort of cockwombles who had startups claiming they had "re-invented Search" back in the early 2000s, and whose only "innovation" was to ignore robots.txt
Robots.txt was never adopted (Score:3)
Let’s stop pretending the web is polite. It never was.