Perplexity is Using Stealth, Undeclared Crawlers To Evade Website No-Crawl Directives, Cloudflare Says (cloudflare.com)

AI startup Perplexity is deploying undeclared web crawlers that masquerade as regular Chrome browsers to access content from websites that have explicitly blocked its official bots, according to a Cloudflare report published Monday. When Perplexity's declared crawlers encounter robots.txt restrictions or network blocks, the company switches to a generic Mozilla user agent that impersonates "Chrome/124.0.0.0 Safari/537.36" running on macOS, the web infrastructure firm reported.

Cloudflare engineers tested the behavior by creating new domains with robots.txt files prohibiting all automated access. Despite the restrictions, Perplexity provided detailed information about the protected content when queried, while the stealth crawler generated 3-6 million daily requests across tens of thousands of domains. The undeclared crawler rotated through multiple IP addresses and network providers to evade detection.
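For context, the Robots Exclusion Protocol is purely advisory: a well-behaved crawler fetches robots.txt and checks its own declared user-agent token before requesting pages. A minimal sketch of that check using Python's standard library (the domain, path, and the "PerplexityBot" token are illustrative):

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (URL is illustrative).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A declared crawler checks its own token; an undeclared one simply never asks.
    for agent in ("PerplexityBot", "Mozilla/5.0"):
        ok = rp.can_fetch(agent, "https://example.com/some/article.html")
        print(f"{agent}: {'allowed' if ok else 'disallowed'}")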


Comments:
  • Think of what you would have in your possession, indexing the web secretly. Immune from takedowns and scrub requests... The blackmail material alone that one would come into possession of should make this beyond illegal.
    • Re: (Score:2, Troll)

      by GoTeam ( 5042081 )
      It can't be illegal, they paid the right fees [bloomberg.com].
    • by Nebulo ( 29412 ) on Monday August 04, 2025 @11:52AM (#65565406)

      Huh? They aren't accessing anything secret. Everything they are getting is publicly available to any human surfer. It's just that the content has been marked off-limits to crawlers and they are masquerading as human to get around the restriction.

      • Look at the news about what's been scrubbed from the Wayback Machine and archives like it in just the last few months. Any crawler that can crawl can mirror... Done in total secret, and increasingly AI-driven, the progression of this has deeper implications. Not everyone would use that information for honorable things. Plenty of extortion can be made from such things.
        • This is shit that is literally posted to the internet for all to see getting indexed and catalogued. If it's secret, the "owner" of said secrets already fucked up if it's getting scraped by a crawler bot (namely, it's posted on the open and free internet).
      • There is a difference between "secret" and "Don't Crawl Our Site".

        It's almost impossible for a crawler to masquerade as a human; even throttled crawlers are easily identifiable through the many different, often hostile traits they exhibit.

        The kleptocracy of AI (and other) crawlers is what's at issue.

        • by abulafia ( 7826 ) on Monday August 04, 2025 @12:43PM (#65565528)
          even throttled crawlers are easily identifiable

          Please share the secret. Currently about 70% of the traffic to some sites I run is from badly behaving robots puppeting residential IPs.

          About the only trick that works fairly well is hidden poison links (effectively, "click here to ban your IP"). Some robot farms coordinate the search space, so you have to leave them all over, and they're using huge amounts of IPs, so you'll still get beat to shit.

          What is your magic sauce?
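          A minimal sketch of the poison-link trick described above, assuming Flask purely for illustration; the route name and the in-memory ban set are hypothetical, and a real setup would push bans into ipset/nftables:

              from flask import Flask, abort, request

              app = Flask(__name__)
              BANNED = set()  # hypothetical in-memory store; use ipset/Redis in practice

              @app.before_request
              def drop_banned_clients():
                  # Refuse everything from addresses that tripped the honeypot earlier.
                  if request.remote_addr in BANNED:
                      abort(403)

              @app.route("/do-not-follow")  # also listed as Disallow in robots.txt
              def honeypot():
                  # Humans never see the hidden link; anything fetching it gets banned.
                  BANNED.add(request.remote_addr)
                  abort(403)

              @app.route("/")
              def index():
                  # The hidden link that only a crawler would follow.
                  return '<a href="/do-not-follow" style="display:none">trap</a>Hello'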

          • by postbigbang ( 761081 ) on Monday August 04, 2025 @12:52PM (#65565558)

            Get Wordfence if your site is WordPress. The controls inside (free version) are enough to rate-limit crawlers effectively.

            If you don't have WordPress, your choices are more complex; you MUST use an IP filtering system and front-end your site with it to rate-limit everyone methodically. Crawlers eventually quit.

            Many crawlers identify themselves in the GET/POST sequence; you have to parse those. If you understand fail2ban conceptually, that's the method: score like-type GETs that arrive at higher rates, along with suspicious folder traversals. Accumulate your list and ban them, null-route them, or block them, whatever your framework permits (a rough sketch follows at the end of this comment).

            Yes, you can blackhole through various famous time-wasters, but this also dogs your site performance. Captcha and others are becoming easier to fool, and for this reason, they're not a good strategy.

            Once you decide on a filtering strategy, monitor it. Then share your IP ban list with others. Ban the entire CIDR block, because crawlers will attack using randomized IPs within their block. If you get actual customers/viewers, monitor your complaint box and put them on your exemption list.
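            A rough sketch of that fail2ban-style scoring, assuming a standard combined access log; the /24 aggregation and the threshold are illustrative:

                import re
                from collections import Counter
                from ipaddress import ip_network

                LOG_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+)\s")  # IPv4 at start of a combined log line
                REQ_THRESHOLD = 1000  # illustrative: requests per source /24 before banning

                def cidrs_to_ban(log_path):
                    counts = Counter()
                    with open(log_path) as fh:
                        for line in fh:
                            m = LOG_RE.match(line)
                            if not m:
                                continue
                            # Aggregate per /24, since crawler farms rotate addresses inside a block.
                            counts[ip_network(f"{m.group(1)}/24", strict=False)] += 1
                    return [net for net, hits in counts.items() if hits >= REQ_THRESHOLD]

                for net in cidrs_to_ban("/var/log/nginx/access.log"):
                    print(f"ban {net}")  # feed into ipset / nftables / a null route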

            • by abulafia ( 7826 )
              Yeah, this is all fairly basic stuff. I don't do Wordpress, and I won't do CloudFlare.

              I have a fairly custom stack, which is a liability a lot of the time, but can be useful in cases like this. Being different from the herd means they're spending most of their time attacking the herd's mitigations instead of whatever weird shit I'm up to.

              • There's a lot of info available about attack mitigation (or just hungry crawlers) and how to avoid/blackhole them. Problem is, you have to have control of portions of the network stack to do them effectively.

                Security through obscurity only works so long as you can be obscure, which is part of the vibe of the post. It's really stressful sometimes, depending on what's hosted.

                Of the sites I don't have behind Cloudflare, the assets aren't worth anything and I truly don't care if they show up in AI. Otherwise, w

            • by Xarius ( 691264 )

              Check out anubis [github.com]

              I think it works by making the browser do some JS maths that's unnoticeable to a user, and if an AI crawler implements the JS to do the maths, it's too costly in processing power at the scale they operate.
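              A toy version of the proof-of-work idea (not Anubis's actual scheme): the client must find a nonce whose SHA-256 hash clears a difficulty target, which is trivial once per human visitor but adds up across millions of automated fetches.

                  import hashlib
                  from itertools import count

                  def solve(challenge: str, bits: int = 20) -> int:
                      # Find a nonce so that sha256(challenge + nonce) has `bits` leading zero bits.
                      target = 1 << (256 - bits)
                      for nonce in count():
                          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
                          if int.from_bytes(digest, "big") < target:
                              return nonce

                  def verify(challenge: str, nonce: int, bits: int = 20) -> bool:
                      # Verification is a single hash, so the server's cost stays negligible.
                      digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
                      return int.from_bytes(digest, "big") < (1 << (256 - bits))

                  nonce = solve("example-challenge")
                  assert verify("example-challenge", nonce)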

          • If you can afford it, also consider Cloudflare; their bot identification is really good. You can use defaults or make your own filters. They're not the only ones that do this, but my experience with them has been positive. Much depends on your skills in how the web actually works, network + site interaction.

            Their protections are cheap for the quality/speed. All of the large sites I manage are behind Cloudflare, including their DNS. Their DNS management is superior, and has interesting tricks for mixed-media

      • Tell us you don't get it without saying "I'm a clueless twat!"

        These scum are sucking in content created by others and surfacing it as their own. If there's no content to consume for regular browsers the internet will be what you and your ilk have always dreamed of. An echo chamber of recursive garbage.

  • What?! (Score:5, Funny)

    by SlashbotAgent ( 6477336 ) on Monday August 04, 2025 @11:28AM (#65565338)

    Next you're going to tell me that they just ignore robots.txt. Surely they would never do that just for their own selfish interests?

    • Re:What?! (Score:4, Insightful)

      by evanh ( 627108 ) on Monday August 04, 2025 @11:35AM (#65565354)

      Makes Google look like the white knight.

      • Re: (Score:3, Interesting)

        by TheWho79 ( 10289219 )
        Really? Google says they download and "render" pages with a browser. They even touted how they'd upgraded it to the latest Chromium last year. Ever seen an IP from Google make an AJAX pull? Yet we know they index some JS, since JS-generated content does show up in Google search. So where-oh-where does that come from?
    • ISPs just burn the candle at both ends, charging those who produce content AND those who consume content, for internet access. Sometimes they find ways to burn the candle in the middle, too.

      Instead, content producers should GET paid per web request. And the payor should be the person making the web request. ISPs would just skim off the top.

      Automated crawling would suddenly become a lot more expensive (good), but website owners wouldn't mind since they get paid for that instead of paying for that (good),

      • by Zak3056 ( 69287 ) on Monday August 04, 2025 @12:40PM (#65565516) Journal

        Instead, content producers should GET paid per web request. And the payor should be the person making the web request. ISPs would just skim off the top.

        The internet is more than just the web, and this is just a bizarre proposal. If you think bandwidth caps are bad, just wait until you can get charged per-connection fees.

      • Your system would collapse internet usage under connection costs. Not just bad bots, but also good usage would decline.
      • We move the packets. Get fucked.
      • users vote with their wallet simply browsing to whatever isn't charging them money.

        FTFY.

        The idea of charging people per site (let alone per request), died a long time ago. It's NOT coming back, no matter how much good it might do. People get pissed over the idea of paying for yet another subscription. They aren't about to pay $10/month just to access one of TikTok / Facebook / Bluesky / X / etc. Let alone the massive restructuring it would take for all of those sites wanting to push ADs. (ADs would have to be free to request or the users would abandon the site instantly, but they won't

        • Furthermore, you're not going to coerce those spending the power to move the packets to collect money by the stream, and further than that, you're not going to tell us that we, who move the fucking packets, are just a tax man on top of this commerce. They're engaging in this commerce over the medium we provide.

          These aren't circuit-switched networks, they're packet-switched.
          Our routers consume wattage per packet, not per constant bitrate circuit.

          And could you imagine having to deal with routing through
  • Publish all webpages as PNGs made on the fly. Maybe that won't stop AI webcrawlers, I guess they'd just resort to OCR to glean the info.
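    A toy sketch of the render-to-PNG idea, assuming Pillow; a real site would rasterize fully rendered HTML, this only draws plain text into an image:

        import io
        from PIL import Image, ImageDraw

        def text_to_png(text: str) -> bytes:
            # Draw the page text onto a white canvas and return PNG bytes.
            img = Image.new("RGB", (800, 40 + 20 * text.count("\n")), "white")
            ImageDraw.Draw(img).text((10, 10), text, fill="black")
            buf = io.BytesIO()
            img.save(buf, format="PNG")
            return buf.getvalue()

        png_bytes = text_to_png("This page is an image.\nGood luck, crawlers.")
        open("page.png", "wb").write(png_bytes)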

    • Not just that, they would use AI to OCR, so their energy consumption would increase substantially. Every problem is a nail.

      • Compressing PNGs 'on the fly' for every visitor would also increase energy consumption substantially. As usual, our environment is the unwilling battleground of computing.
    • Re: (Score:3, Insightful)

      by Anonymous Coward
      Would cause problems with Accessibility. Not everyone has perfect eyesight.
  • If it's accessible, it's public
    To keep things private, don't put them on the internet or use encryption
    The old "robots.txt" idea only works when robots politely obey the rules

    • I agree. The idea that no one will scrape sites because it is ungentlemanly is ridiculous. Even the idea that AI shouldn't be allowed to train on publicly available material seems a bit thin to me. Not saying there shouldn't be laws or regulation necessarily, only that it is hard to understand the rationale of publicly publishing things and then getting upset when something "reads" them and categorizes them, etc
      • by Fly Swatter ( 30498 ) on Monday August 04, 2025 @12:11PM (#65565448) Homepage
        Many things are necessary for society to continue; one of them is being respectful of others' wishes. If anything, the current state of the internet is a very sad commentary on how people act when there is no one to slap them in the face and say 'don't do that'.

        And that problem is spreading into the real world - it existed but now it's growing and we mere sheep can't really do anything either, just like on the internet.
        • Many things are necessary for society to continue; one of them is being respectful of others' wishes. If anything, the current state of the internet is a very sad commentary on how people act when there is no one to slap them in the face and say 'don't do that'.

          On the internet? Have you not seen all the broccoli heads doing wanker shite on TikTok for views -- all outside in the real world? Being "disrespectful" on the internet is so 00s. We are already living with 2 generations who proliferated it into the "touch grass" part.

      • There's also a distinction between crawling everything to feed into the beast, vs a user making a query and the model decides to load the first 20 hits on a web search. To the website the traffic may be the same, but legally you would probably come to a different conclusion if a human was making the original query. However, only the AI service knows the reason for the access and they have no reason to share that with the website, or anyone else.
  • by memory_register ( 6248354 ) on Monday August 04, 2025 @11:38AM (#65565368)
    I think there is a big difference between a search engine crawler that is running on an automated schedule with no particular user behind it, and a web search that is created as part of answering an AI query. I do not think the same robots.txt rules should apply if I go to perplexity and ask it to search the web for me, and then it does so.
    • Re: (Score:3, Interesting)

      by Fly Swatter ( 30498 )
      So because I asked someone to rob a house, it is perfectly fine if they do it? No.

      Just because something can be done doesn't mean it should be done. This is how we lose 'civilized society'. The internet is warning us, and we won't listen.
      • The crawler is representing a real human this time, so it should get the same rights as the real human. Consider it an assistive technology, like TTS or a Braille reader. The best way would be to just use a Chrome tab to perform the operations using a Computer Use agent.
        • TTS and Braille don't crash websites by merely using them. There's a difference.

          If anything, the person making the AI prompt should be held legally responsible for the AI's actions. If the AI crashes a site, the person who wrote the prompt to do so gets sued for damages by the site's owners.
        • by evanh ( 627108 )

          AI (LLM) crawlers totally do not activate upon user queries. The crawlers are used to build the training dataset.

          User queries only trigger inference. As I understand it, inference is effectively working from a ROM image. There is just no trigger to go from inference to training.

  • by Arrogant-Bastard ( 141720 ) on Monday August 04, 2025 @11:43AM (#65565382)
    (1) There are all kinds of abusive things going on with AI crawlers, including ignoring robots.txt - or not even bothering to check for it, using faked user-agents, using end-user systems on commercial ISPs, using systems distributed across various clouds, not rate-limiting queries, etc. That's why there are myriad efforts all working toward mitigating the damage that's being done, and unfortunately, no one technique solves the problem entirely. (And the people running the AI crawlers are responding to these defensive efforts by escalating their attacks.) What's happening is essentially a DDoS against every web site, and it's not only costing a fortune in bandwidth/cycles/etc., it's costing a fortune in human time.

    I've been here a long, long time. And this is one of the worst things I've ever seen. And it's all to feed the insatiable egos and greed of the tech bros who've bet the farm on AI and have yet to realize that "garbage in, garbage out" still applies no matter how much computing capacity you throw at it.

    (2) It's ironic that Cloudflare, of all operations, would whine about someone else's abusive conduct. Here's an exercise for the reader: read the Krebs on Security article Scammers Unleash Flood of Slick Online Gaming Sites [krebsonsecurity.com], then follow the link he provides to the list of domains involved. Now look where they're (almost) all hosted.
  • My site gets a lot of traffic from people using quite outdated browsers; all those IPs trace back to data centers.

    I've created fail2ban rules for certain patterns (a sketch follows at the end of this comment). Once those IPs are blocked, different IPs with the same behavior show up.

    It is not just Perplexity.
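    A small sketch of that kind of pattern matching, assuming the tell is an outdated Chrome major version in the user agent; the cutoff is arbitrary, and real rules would also check the source ASN:

        import re

        CHROME_RE = re.compile(r"Chrome/(\d+)\.")
        MIN_MAJOR = 120  # illustrative cutoff for "outdated"

        def suspicious(user_agent: str) -> bool:
            # Flag user agents claiming a Chrome major version below the cutoff.
            m = CHROME_RE.search(user_agent)
            return bool(m) and int(m.group(1)) < MIN_MAJOR

        print(suspicious("Mozilla/5.0 ... Chrome/124.0.0.0 Safari/537.36"))   # False
        print(suspicious("Mozilla/5.0 ... Chrome/90.0.4430.93 Safari/537.36"))  # True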

  • Check out Anubis [techaro.lol] to block these rogue crawlers.

    • by xack ( 5304745 )
      It's only a matter of time before this gets cracked too; proof of work is nothing to an Epyc processor with all its cores.
      • Anubis isn't there just as a blocker - it's there to also make it prohibitively expensive for companies to repeatedly spam servers with junk requests.

        An average Joe/Jane Public doesn't usually care if they have to wait a few moments once to access a site. But that computational cost is soon going to add up for bad actors.

  • by Archangel Michael ( 180766 ) on Monday August 04, 2025 @11:53AM (#65565410) Journal

    The age of the internet where requests are honored has long been over. The fact that it has taken this long is an anomaly more than anything.

    Websites are going the way of the Dodo due to AI. Once the data has been captured by AI, it is owned by AI.

    • by Anonymous Coward

      This is exactly it. The cat is out of the bag. The world is forever changed and there is nothing anyone can do about it. It will only get worse (or "better" if you're on the AI's side).

      It's like the printing press or any other number of technologies. Oh no, it lets people copy text without even knowing how to write, etc. AI is the same.

      Impossible to predict what's going to happen in the future when any person on the planet can ask AI for a detailed roadmap on how to secretly build a nuclear bomb in their ga

    • On the other hand, on the internet of old, everything published on the net was actually *meant* to be public. That robots.txt file was just meant to keep search engines from cataloging a bunch of garbage, not to make the pages unavailable or secret.

      • OK, put a hidden directory in your website, fill it with several gigabytes of core files and other junk and list it in robots.txt as not to be searched. Honest search engines won't even look at it, but the rogues will spend bandwidth sucking it down and lots of expensive computing time trying to make sense of it. Serves them right!
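        A sketch of that trap; the /trap/ directory name, file count and sizes are arbitrary:

            import os
            import random

            DOCROOT = "/var/www/html"  # adjust to your web root
            TRAP = os.path.join(DOCROOT, "trap")

            # Fill the hidden directory with junk for rogue crawlers to waste bandwidth on.
            os.makedirs(TRAP, exist_ok=True)
            for i in range(10):
                with open(os.path.join(TRAP, f"core.{i}.bin"), "wb") as fh:
                    fh.write(random.randbytes(10 * 1024 * 1024))  # 10 MB of noise each

            # Tell honest crawlers to stay out; only the rogues will go in.
            with open(os.path.join(DOCROOT, "robots.txt"), "a") as fh:
                fh.write("User-agent: *\nDisallow: /trap/\n")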
  • by TwistedGreen ( 80055 ) on Monday August 04, 2025 @11:56AM (#65565414)

    Nice that they call out a specific bad actor, but I always assumed bots were doing this all along.

    It's actually surprisingly easy to bypass Cloudflare's bot filters just by setting the right user agent. Their bot detection technology doesn't seem to be as sophisticated as they claim.

    • by higuita ( 129722 )

      As a site admin who has used Cloudflare, I say you are wrong.
      We have several settings we can use to fine-tune things, but a bot score equal to 1 really is a bot... and that doesn't even look at the user-agent.
      Also, trying to fake Googlebot or another known bot will trigger the "fake googlebot" rule, which you can block or allow.

      • Sure, there are lots of settings you can tinker with. It's a bit of security theater. It will gladly block lazily-coded bots with a useragent of Java-http-client/17.0.10, but from the other side, even if you have a known AWS IP address, some very basic steps will let you through. Just change your useragent to Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.6998.166 Safari/537.36 and if you go low and slow, you'll get through. Of course you can ratchet up the pai

        • by higuita ( 129722 )

          You need to research JA3, JA4, bot score, attack score and bot detection ID.
          There is a lot more than the user-agent... I actually only use the user-agent to whitelist some requests from certain ASNs; most of the time, they are just totally ignored.
          Bot score, on the other hand... a bot score == 1 is 100% certain to be a bot... and many of those have Chrome or Firefox user-agents.

          JA3 and JA4 are standards; go read Wikipedia.

          Bot score uses machine learning and static rules to give a score to a request. A score of 1 is a bot, a score of 100
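          For reference, a JA3 fingerprint is just an MD5 over fields of the TLS ClientHello, with values joined by "-" inside each field and fields joined by ",", GREASE values excluded. A sketch, assuming the fields have already been parsed from a capture (the example numbers are made up):

              import hashlib

              GREASE = {0x0a0a, 0x1a1a, 0x2a2a, 0x3a3a, 0x4a4a, 0x5a5a, 0x6a6a, 0x7a7a,
                        0x8a8a, 0x9a9a, 0xaaaa, 0xbaba, 0xcaca, 0xdada, 0xeaea, 0xfafa}

              def ja3(version, ciphers, extensions, curves, point_formats):
                  # Join values with "-" inside a field, fields with ",", then MD5 the string.
                  def field(values):
                      return "-".join(str(v) for v in values if v not in GREASE)
                  s = ",".join([str(version), field(ciphers), field(extensions),
                                field(curves), field(point_formats)])
                  return hashlib.md5(s.encode()).hexdigest()

              # Made-up ClientHello values, purely for illustration:
              print(ja3(771, [4865, 4866, 49195], [0, 11, 10, 35], [29, 23, 24], [0]))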

  • by js_sebastian ( 946118 ) on Monday August 04, 2025 @12:13PM (#65565454)
    Just listened to an interview with the Cloudflare CEO about this on the Hard Fork podcast. He mentioned several times that some AI companies were behaving well and some were behaving badly, obfuscating their crawling to try to bypass blocking, but he didn't name any names...
  • ... is someone who can claim "standing" going to file *criminal* charges?

    • by Rendus ( 2430 )

      Never, because that's not how criminal prosecution works at all.

      Victims report a crime, and prosecutors decide if charges are filed, and to what end.

      A party has to have "standing" to file a civil lawsuit. Nothing to do with criminal court.

      • by taustin ( 171655 )

        Technically, the one who would "have standing" to file criminal charges would be the prosecutor who is assigned the case after a careful investigation.

        Not holding my breath, though.

        • Re:When... (Score:4, Insightful)

          by DamnOregonian ( 963763 ) on Monday August 04, 2025 @04:00PM (#65566014)

          Technically, the one who would "have standing" to file criminal charges would be the prosecutor who is assigned the case after a careful investigation.

          Technically, just The Government. The Prosecutor merely acts as their agent- but ya. I wouldn't hold your breath, either.
          Even though it's a pretty clear violation of the CFAA, they really only enforce that against solitary kids that scrape research papers.

  • by Visarga ( 1071662 ) on Monday August 04, 2025 @01:26PM (#65565648)
    The technical solution is to avoid multiple parties doing their own separate crawls. If there were a central cache, it could be set up such that crawlers pay the cache and the cache pays the websites, ensuring fast and fresh access.
  • by hcs_$reboot ( 1536101 ) on Monday August 04, 2025 @01:31PM (#65565662)
    It's shocking that they didn't manage to hide their IP addresses or use rented servers like others do.
  • I'm completely surprised. I can't believe this has happened! There are bad actors out there? Color me surprised! Look guys, you can't just use the userid "rms" with a blank password. People are evil. Grow up. Get out of your naive phase.

  • ... this is the same sort of cockwombles who had startups claiming they had "re-invented Search" back in the early 2000s and their only "innovation" was to ignore robots.txt

  • by TheWho79 ( 10289219 ) on Monday August 04, 2025 @09:01PM (#65566624)
    No standards org ever endorsed or adopted robots.txt as a standard - ever. This isn’t a crisis. It’s Monday.

    Let’s stop pretending the web is polite. It never was.
