Perplexity Says Cloudflare's Accusations of 'Stealth' AI Scraping Are Based On Embarrassing Errors (zdnet.com)

In a report published Monday, Cloudflare accused Perplexity of deploying undeclared web crawlers that masquerade as regular Chrome browsers to access content from websites that have explicitly blocked its official bots. Since then, Perplexity has publicly and loudly announced that Cloudflare's claims are baseless and technically flawed. "This controversy reveals that Cloudflare's systems are fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats," says Perplexity in a blog post. "If you can't tell a helpful digital assistant from a malicious scraper, then you probably shouldn't be making decisions about what constitutes legitimate web traffic."

Perplexity continues: "Technical errors in Cloudflare's analysis aren't just embarrassing -- they're disqualifying. When you misattribute millions of requests, publish completely inaccurate technical diagrams, and demonstrate a fundamental misunderstanding of how modern AI assistants work, you've forfeited any claim to expertise in this space."

  • by PsychoSlashDot ( 207849 ) on Tuesday August 05, 2025 @09:39PM (#65568906)
    "Make some shit up about how we're not doing that... except to be helpful... which is totally not the purpose of our product in the first place."
  • by Bradac_55 ( 729235 ) on Tuesday August 05, 2025 @09:44PM (#65568914) Journal

    As fucked up as Cloudflare is I'd believe them over Perplexity any day of the week.

  • What? (Score:5, Informative)

    by viperidaenz ( 2515578 ) on Tuesday August 05, 2025 @09:48PM (#65568916)

    Cloudflare are saying Perplexity is disguising its crawler bots as Chrome users.

    Perplexity counters that their crawlers don't do that; their other AI tools do.

    Seems like Cloudflare correctly determined that Perplexity are using automated tools to access websites that asked not to be accessed by automated tools, despite Perplexity trying to hide their behaviour.

    • Re:What? (Score:5, Interesting)

      by larryjoe ( 135075 ) on Tuesday August 05, 2025 @11:10PM (#65569098)

      Cloudflare are saying Perplexity is disguising its crawler bots as Chrome users.

      Perplexity counters that their crawlers don't do that; their other AI tools do.

      Seems like Cloudflare correctly determined that Perplexity are using automated tools to access websites that asked not to be accessed by automated tools, despite Perplexity trying to hide their behaviour.

      This is an interesting technicality. Perplexity is basically saying that robots.txt prohibitions only apply to training, either in terms of building page ranks for search or accumulating data for AI model training. So, if the data is immediately used, then it doesn't count as training, so robots.txt doesn't apply. Is this true?

      Of course, the curious question is how Perplexity knows what's behind the robots.txt wall to serve as immediate on-the-fly answers if it didn't already previously crawl past the wall and store something in its database/model to indicate that the hidden data was useful.

      • by martin-boundary ( 547041 ) on Wednesday August 06, 2025 @06:25AM (#65569520)
        Who cares? It isn't true because it's an irrelevant question and therefore nonsensical in this context. Don't fall for sophistry designed to pull you into a mental rabbit hole.

        The issue is: has an automated crawl initiated by an AI occurred or not. Intent is irrelevant, as is arbitrary redefinition of terms by Perplexity.

      • by allo ( 1728082 )

        They basically say that their agent is an extension (not as in a browser add-on, but in the sense of reaching farther) of the user's browser, which also neither identifies as a robot nor accesses robots.txt.

    • by flug ( 589009 )
      Exactly.
    • by Rei ( 128717 )

      You really don't know the difference between a crawler and a web search?

      An AI agent is tasked by a user to collect info about a topic - I do this all the time. The agent then does a bunch of successive searches, each time distilling the important things relative to the user's task that may need subsequent followup, and then ultimately sums up everything they found on the topic. What doesn't happen is "Perplexity building up a database of said content". Crawlers try to visit the whole internet (minus site

      • It's still a bot accessing content that was requested not to be accessed by bots. It's also not identifying itself as a bot.

      • So, as a practical matter, your argument means that I - as a site owner, or web host/CDN - have no legal or philosophical right to distinguish between an actual visitor to my page, or consumer of my content - and a bot sent there by another commercial entity, to scrape the contents of my site, and summarize, analyze or aggregate it, as a critical part of their revenue model?

        Think really hard about the logic behind that before you respond. If you follow that thread to its conclusion - then there is no
        • by allo ( 1728082 )

          The point is that users are free to choose their software. If you deny specific software access to content, websites will not only deny AI agents, but also browsers with adblockers. Or maybe just everything that is not Chrome. If I want to read your website with lynx so I don't have to see your ugly background color, it's my right. If I let some AI website fetch and summarize it, it is the same. Yeah, you wanted me to read the whole thing, but I said TL;DR and created a summary out of it, just as I may cho

      • "The agent then does a bunch of successive searches", so an automated system to follow links and download all the content it finds.
        Got it.

    • by allo ( 1728082 )

      It is not unusual to use a browser user agent when accessing websites with a virtual browser.

      Do you run a webserver? Ever noticed how an iPhone visits sites that are little known a few minutes after the Google bot accessed them?
      It isn't even unlikely that it is an iPhone, testing the load times and how well the page renders on mobile to decide on the ranking (especially in mobile searches). Still it doesn't identify as the Google bot, probably because Google also wants to catch you if you serve other conten

      • Do you run a webserver?

        Yes

        Ever noticed how an iPhone visits sites that are little known a few minutes after the Google bot accessed them?

        No

        Still it doesn't identify as Google bot, probably because Google also wants to catch you if you serve other content to the Google bot than to actual users.

        Websites do this all the time. They allow googlebot through their paywalls so they get indexed.

        • by allo ( 1728082 )

          I could confirm that behavior on several domains. As said, I think they are testing for mobile rendering and to access content that is fetched via javascript and not accessible by a pure scraper that doesn't run a browser engine.

  • Soo, who to trust? (Score:5, Insightful)

    by gweihir ( 88907 ) on Tuesday August 05, 2025 @09:54PM (#65568924)

    The content delivery network people with 16 years of experience in this game or the AI liars that peddle a basically fraudulent product based on a massive piracy campaign?

    Hmm. Difficult!

    • by PPH ( 736903 ) on Tuesday August 05, 2025 @10:57PM (#65569070)

      But Perplexity is basically admitting it:

      "If you can't tell a helpful digital assistant from a malicious scraper, then you probably shouldn't be making decisions about what constitutes legitimate web traffic."

      It really doesn't matter if Perplexity thinks they are "a helpful digital assistant". That's not what the robots.txt file says. There's no flag in there to allow only the "helpful" ones to scrape. Just don't scrape, m'kay?

      • by Rei ( 128717 )

        Define "scrape".

        The general definition of "scraping" involves actually, you know, storing the content, not just briefly caching it then summing it up to the user who asked to browse it.

        • Define "scrape".

          Any automated collection of webpage content. It doesn't matter if the agent doing the scraping stores it verbatim, summarises it or throws it in the bit bucket. From the point of view of the web server, all of these are equivalent.

          • It's not automated, dipshit; it's requested by a user.
            Instead of using a browser, they used an LLM to generate a fucking curl call.

            It doesn't matter if the agent doing the scraping stores it verbatim, summarises it or throws it in the bit bucket. From the point of view of the web server, all of these are equivalent.

            Christ, you ignorant fuck.

            If you type a URL into your start bar or whatever stupid web-integrated doohickeys you've got on your OS, and it shows you a preview of the web page, do you expect it to have respected the robots.txt file before pulling that data?

            User-initiated requests are not "bots". There are definitions for this shit. Your ignorance mixed in a bowl with your fee

            • by PPH ( 736903 )

              It's not automated dipshit, it's requested by a user.

              When I access a web site, I send my request directly to that website. Not through an LLM. Semantics about proxy servers aside, if you fiddle with the content while in transit, you are operating an agent which is doing the scraping.

              And you need to medicate right now.

              • When I access a web site, I send my request directly to that website.

                No, you don't. You use a user agent. Though I love the mental image of you trying to speak into your mouse.

                Not through an LLM.

                The LLM in this instance is a user agent.

                Semantics about proxy servers aside, If you fiddle with the content while in transit, you are operating an agent which is doing the scraping.

                Fascinating. So every browser in existence today that does AI summarization of a site for you is guilty of being a scraper, and should consult the site's robots.txt before doing so.

                Chrome will summarize a page for you with Gemini. Further, boyyyy, let's talk about adblocking.
                I think we can safely summarize 85% of user-initiated web traffic as scraping, then.

                • by PPH ( 736903 )

                  So every browser in existence today that does AI summarization of a site for you is guilty of being a scraper

                  Yes. Particularly if they have to cloak a curl call [slashdot.org] as a Chrome browser.

                  let's talk about adblocking. I think we can safely summarize 85% of user-initiated web traffic as scraping, then.

                  I doubt it's as high as 85% at this time. Usually only the tech savvy can do effective ad blocking. But as AI sites such as Perplexity are increasingly being used by the average user this number will rise. And that will break the web's business model as we know it. Be prepared to pay for everything you access. And if Perplexity does it on your behalf, be prepared for a really big bill from them.

                  • Yes. Particularly if they have to cloak a curl call [slashdot.org] as a Chrome browser.

                    cloak?
                    Give me a break.
                    Every UA in existence lets you change how they identify themselves. There's a reason for this, and it isn't nefarious.
                    The web is open, and if you have a right to block based on a self-reported string, then I have a right to change that string to stop you from doing it.
                    Anyone who uses strange third-party browsers is familiar with "cloaking" their browser as Chrome, since if you don't do it- many poorly designed websites won't work right.

                    I doubt it's as high as 85% at this time.

                    We're talking about anything that affects or
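For context on how trivial this "cloaking" is: the User-Agent header is just a self-reported string any HTTP client can set to anything. A minimal Python sketch (the Chrome-like string below is illustrative, not a real browser fingerprint):

```python
import urllib.request

# A Chrome-like User-Agent string (illustrative; real browser strings vary).
CHROME_LIKE_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/126.0.0.0 Safari/537.36")

# Any client can report any identity; this is all "cloaking" amounts to.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": CHROME_LIKE_UA},
)

# urllib stores header keys with this capitalization.
print(req.get_header("User-agent"))
```

This is why blocking on the UA string alone is a losing game, and why Cloudflare instead fingerprints network-level behaviour.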

        • by toddz ( 697874 )
          It's called robots.txt for a reason. Don't really care if it's the '90s model or the '20s model.
          • Stupid people with opinions. Never gets old.

            If I type, "curl https://slashdot.org/ [slashdot.org]", should it have asked robots.txt before pulling the data?
            If I type "https://slashdot.org" into my start bar, should it have asked robots.txt before pulling the data?
            If I type it into an LLM chat window, should it have asked robots.txt before pulling the data?

            The answer to all 3 questions is no.
            If you would like to look less stupid among experts in the future, you can read here. [ietf.org]
        • And how do you propose that a CDN is supposed to know what is being done with the data it serves up? Remember, people pay for CDNs. Those requests cost money. Yes, one request is a trivial amount, and a trivial burden - but we're talking about billions of requests a day. That's a lot of aggregate demand from a free rider.
        • In the case of AI, many of them are using user queries to transform and bolster their AI's training. They're arguably storing the content, even if it's not in its original form.
        • by allo ( 1728082 )

          No, it does not.

          I side with Perplexity on the access not being crawling, not being used to train AI, and not being a request that needs to obey robots.txt.
          But it *IS* scraping by the very definition, i.e., fetching content automatically to parse it and extract information. And most scrapers do not respect robots.txt, as it is made for crawlers, which autonomously follow links, which scrapers don't.

          Of course sometimes crawlers and scrapers are combined, e.g., in the bots that search for content for AI tr

      • Fuck. Idiots everywhere.

        The LLM is operating as a user agent, as in the RFC 3986 form of the term.
        It is not acting as a scraper/indexer.

        You have an opinion, and it's adorable- really, but it's wrong.
        Cloudflare is wrong.

        If you want to deny certain types of user agents that are not bots (as defined in RFC 9309) from accessing your server, that's a different discussion entirely.
  • by Rendus ( 2430 ) <rendus&gmail,com> on Tuesday August 05, 2025 @09:56PM (#65568928)

    Sounds to me like the accusations are true, and Perplexity is deflecting by saying they're harmless and even helpful.

    On my own servers, I see a pattern of behavior of something hitting my robots.txt (which both has a blanket denial for all user agents, AND specific denials for all known bots), and then suddenly a variety of IP addresses start hammering my site. It's bad enough I'm either going to put my servers behind Cloudflare, or at least one of the gatekeeper challenge systems.

    Perplexity's shitty response really does nothing for me but confirm the accusations.

    • by ndsurvivor ( 891239 ) on Tuesday August 05, 2025 @10:01PM (#65568942)
      To play devil's advocate: if a human asks his/her Perplexity agent to buy something off of, or to get information from, your website, is that the AI scraping your data, or a person using your website?
      • by CrankyFool ( 680025 ) on Tuesday August 05, 2025 @10:09PM (#65568956)
        "scraping" is maybe questionable, but there's no question that an AI is accessing your website.
      • by Rendus ( 2430 )

        In my case, I have this in robots.txt, on a forum I don't want to get absolutely destroyed by bots that don't rate-limit themselves:

        User-agent: *
        Disallow: /

        As for agentic actions, it's still an AI performing the actions. I said "No" to robots accessing my site. That they then pivot and access my site anyway is inappropriate no matter the reason.
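A compliant bot would refuse every path under that deny-all file. A minimal sketch using Python's stdlib robots.txt parser (the bot name is hypothetical):

```python
import urllib.robotparser

# Parse the deny-all robots.txt quoted above (normally it would be
# fetched from https://example.com/robots.txt via read()).
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# A well-behaved bot, whatever it calls itself, is denied every path.
print(rp.can_fetch("SomeBot", "https://example.com/forum/thread/42"))  # False
```

The whole dispute is that this check is voluntary: nothing in HTTP forces a client to run it before fetching.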

          • Maybe the concept behind BBSs is a proper way to deal with AI scrapers. AI companies are clearly unwilling to acknowledge or respect any terms or "no scraping allowed" for any website besides their own.

          So, those that know how a BBS works, should be smart/capable enough to pay (money/capabilities/favors) for access to it. The rest of the world none the wiser, and through obscurity and a vigilant BBS admin AI companies will have literally nothing new to scrape anymore.

          Given that many types of data bec

      • by tepples ( 727027 )

        It depends. Does the Perplexity agent faithfully relay the personalized messages from sponsors that would have otherwise been presented adjacent to the information on the website?

        • It looks like a mess to me. No, I don't believe the advertising is getting to the person if the person is using an AI agent.
        • by Rendus ( 2430 )

          Depending on the agent, the output may never be seen by a human. "Constantly monitor eBay for a good deal on a waffle iron, and trigger a notification when found", for example. No eyes will ever see the pages the agent loads. It's just consuming eBay's compute resources. A much better written prompt will also ignore promoted eBay listings, inline advertisements, and so on.

          Even if it's an action taken on behalf of a person, it's very unlikely the ads on the page will be delivered to the user. Keep in mind, t

      • If a human asks an AI agent to do this for them, why would the AI agent access robots.txt? Did you check Robots.txt before posting on Slashdot?

        • by Rendus ( 2430 )

          By that logic, anything executed by a person (such as, say, deploying a scraper) can just ignore robots.txt - a person started the task, after all.

          Robots should respect robots.txt.

          • Robots should respect robots.txt.

            - Asimov's 4th law

          • My point exactly. I was demonstrating how the GP's logic is messed up. Either a robot respects robots.txt or it shouldn't bother reading it. In either case the pattern being displayed is illogical.

          • by allo ( 1728082 )

            How many scrapers do you know that respect robots.txt? That's a standard for crawlers, not for scrapers.

            An example: for some time I had a tool running that scraped webpages from RSS feeds, to amend a headline-only feed into a fulltext feed. That's a scraper, as it gets a fixed list of URLs to process, fetches data, transforms data and then serves it to the user. But as the list is fixed and no new links are added, it doesn't (need to) access robots.txt. The list is to be fetched because the user said so.

            • by Rendus ( 2430 )

              All scrapers, crawlers, and other 'bots' SHOULD respect robots.txt. The original intent was to block what was termed at the time (1994) as "crawlers", but that has evolved as the Internet has evolved.

              Justifying crawling behavior by saying it's "just scraping, and then loading additional pages..." is... Well. Fucky logic to say the least. Following your logic here: If I access a single page, then extract all the links in it and add it to an RSS feed, I'm free to then access all those subsequent pages because

              • by allo ( 1728082 )

                The difference is that a crawler discovers new content itself. An RSS bot fetches a list of links. The site provides the list, the scraper fetches the links, that's it. It will never fetch a link of another site. A crawler on the other hand can start at the RSS feed and end up on a site I never knew existed.

            • by Rendus ( 2430 )

              And I'll add one more point, which I meant to make in my previous comment:

              Agents, AI or otherwise, aren't just pulling down a single page to present to the user. They're performing logic on that page that was accessed, and probably accessing additional pages. AI agents can be given instructions to "collect all the content on slashdot.org". I did that, and here's what ChatGPT did: https://chatgpt.com/share/6894... [chatgpt.com]

              That's behavior that should respect robots.txt, but it apparently uses Chrome, so... It clearly

              • by allo ( 1728082 )

                I thought about it, and I think the conclusion is that robots.txt should apply to following links. Before you follow a link that was not in your list of starting links, you need to check the robots.txt, while you don't for the links you were instructed to fetch. I guess that's basically the same as you are saying, a "Collect all content on ..." that spawns access recursively is similar to a crawler that just puts all links into a queue.

                And somewhere one needs to draw the line, because if agents can act auto
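The line drawn above - check robots.txt before following a discovered link, but not for the URLs the user explicitly supplied - can be sketched in a few lines of Python (all names hypothetical; this is a policy sketch, not anyone's actual implementation):

```python
import urllib.robotparser

def may_fetch(url, seed_urls, robots, agent="ExampleAgent"):
    """Hypothetical policy: user-supplied seed URLs skip the robots.txt
    check (like typing a URL into a browser); links discovered while
    browsing must pass it (like a crawler following links)."""
    if url in seed_urls:
        return True  # explicit user instruction
    return robots.can_fetch(agent, url)  # discovered link: honor robots.txt

# Deny-all policy, as in the forum example above.
robots = urllib.robotparser.RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /"])

seeds = {"https://example.com/article"}
print(may_fetch("https://example.com/article", seeds, robots))  # True (seed)
print(may_fetch("https://example.com/other", seeds, robots))    # False (discovered)
```

Under this policy a one-page fetch on the user's behalf goes through, while a recursive "collect all content" task degrades into crawling and gets blocked.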

      • The Truth of the person asking the agent to do the thing has no bearing on the Truth of an automated agent scraping the site. Your question is an irrelevant distraction. Have a nice day. :)

        Or, to sink to your level, yes, it is still an AI agent scraping data regardless of any attempts to cloud the issue.

        • And you're a fucking idiot.

          An LLM instantiating a web request on your behalf is no different than a browser, a start bar, or a fucking script.

          You are trying to redefine scraping to mean, "accessing my web server using a user agent that I don't like.".
          • I think that if the agent identified itself properly in the user agent string rather than trying to masquerade as a person, there would be less of an issue. However, it is being deliberately deceptive. Hard to make any kind of argument from that starting point.
            • I think that if the agent identified itself properly in the user agent string rather than trying to masquerade as a person, there would be less of an issue.

              Perhaps. But what if you, as a user, ran across a site that blocked Chrome because they don't like what that browser stands for.
              Would you change your UA, so that sites didn't prejudice your choice of user agent?
              Now, what if a user asked Perplexity's LLMified search tool to use a different agent?

              However, it is being deliberately deceptive.

              Or is deliberately trying to make their UA work.
              For example, what if you, as the maker of Brave, tried to make your UA look like a regular Chrome UA so that pages would behave correctly since your renderer was bas

  • Sure, it's a little different to request a site on behalf of someone rather than downloading content to be used to generate the model. However, it seems pretty legitimate to block the "AI Assistants" as well, since lack of eyeballs means lack of ads or memberships.
    • This is starting to split hairs. There are lots of legitimate reasons for pages to be fetched by a "bot". If you post a link to a social media platform, that system will fetch the page to access the HTML meta tags to find things like the page title, an image to represent the page, a description, etc, and that is what is displayed in the post instead of just a plain URL. That request also doesn't result in "eyeballs" and ads are not served. Browsers can pre-fetch URLs on a page, again, not resulting in the

      • Re:Differences (Score:4, Informative)

        by Himmy32 ( 650060 ) on Tuesday August 05, 2025 @10:29PM (#65569000)

        Yes, undoubtedly there are a lot of legitimate reasons for automated requests like the ones you have listed.

        But individual requests for content for AI assistants can be even more problematic than a traditional scrape. Getting scraped once a day isn't that much traffic. But if 2 million people ask an AI assistant which movie some actor starred in, that generates 2 million requests to the movie database; those 2 million eyeballs aren't seeing the ads for upcoming movies, and the host has to serve all that extra traffic.

        It's enough of an issue for web devs that there is now even an open source AI blocker [techaro.lol] that devs are including in their sites, as F5 reports about half of all requests are coming from AI bots [zdnet.com]

        • Perhaps this will finally spell the end of the ad-supported internet. When literally everyone has an ad-blocker installed by default (practically speaking), they're going to have to find new ways to support the business that don't involve spamming the crap out of internet real estate with ads. I know that would be deleterious for some of the existing businesses out there, but perhaps it will lead to a better model. Like pay a microtransaction to access content, every time. Or have monthly subscriptions that yo
      • If Cloudflare is misrepresenting the data to make it appear as if data is being scraped for training purposes when it is not, then that is indeed something different.

        You can guarantee that whatever the trigger for Perplexity's bots collecting data, they'll use it for training too.

        If they were navigating through a site like a person would, I doubt it would be triggering Cloudflare's bot logic.

      • If every person using an AI assistant becomes a mechanical turk for sending data back upstream to the AI provider as a means to bypass robots.txt, that's what isn't cool. Local usage for the benefit of that user should be ok, but not the follow-on effects.
    • This is the first rational fucking observation so far about this.
      You are entirely correct.

      This is problematic because it has a statistical effect on the monetization of views for them.
      It still is not, and never will be expected to adhere to robots.txt as traditionally understood.

      This problem exists for ad blockers, text-only browsers, etc. It's only at a different scale given LLM adoption.
  • Hey wait... (Score:5, Insightful)

    by Tschaine ( 10502969 ) on Tuesday August 05, 2025 @10:15PM (#65568968)

    Cloudflare claimed that Perplexity's AI was able to answer questions about content on pages that Perplexity was prohibited from accessing.

    How exactly does Perplexity explain that phenomenon?

    • by Himmy32 ( 650060 )
      It's not scraped into the model; it's requested on behalf of the users, loaded individually a whole lot of times for a whole lot of users. Totally different, because not only are the users not going to the site, the site also has to pay for a whole bunch of individual traffic.
    • Perplexing, isn't it? That's Clairvoyic AI, the latest advancement. It learns things without having to read them. Knows without training. It's a whole paradigm shift that you're too feeble to understand.

  • They're guilty (Score:5, Insightful)

    by Jeremi ( 14640 ) on Wednesday August 06, 2025 @12:02AM (#65569198) Homepage

    Perplexity's accusatory and belligerent tone is as good as a guilty plea in my book. They sound like someone who is trying to insult the other party into submission so that they won't have to come clean.

    • They sound like someone who is trying to insult the other party into submission so that they won't have to come clean.

      Perhaps the statement was written for them by Karoline Leavitt?

    • Well, they already admit it themselves. "It's not a malicious scraper, it's our AI tool searching the web when web search is enabled". So, "we're doing it" but it's legitimate because... well, they had AI assist in the response so I guess they needed some help answering that one.

  • TBH, there's no difference between what the various AIs do to websites; in all cases they pull/steal data and obscure the presentation the owner intended from the user. I think the blatant personal attack shows how right Cloudflare is here. AI assistants need to be cleared/vetted by the sites they visit, because they steal from providers and obfuscate things from users. And that means they may need to pay where a traditional user may not.
  • From a site owner's perspective, there really is no difference. The bot is still taking my content, giving me nothing in return, and then presenting it to their users with at best just a tiny link at the bottom nobody ever clicks.
  • Robots.txt is there to ask politely to keep out not just web crawlers, but any automated processes that potentially put undue load on your server. Once again it's about being 'civil', which is almost completely dead when it comes to the 'internet.'

    Any automated assistant is still automated regardless of the intention. It's the whole reason robots.txt exists.

    Also, anyone expecting robots.txt to hide content from public view or public use is just an idiot.

    Cloudflare is big and has plenty of resources to make automated travel through content covered by robots.txt more hassle than it's worth. Once again, anything on a networked computer becomes a battle of resources. Security is losing the battle, many forums lost the battle long ago, and now AI is flooding whatever is left with both unending traffic and unending gibberish content. We lost; we now stand in shit.
  • "If you can't tell a helpful digital assistant from a malicious scraper, then you probably shouldn't be making decisions about what constitutes legitimate web traffic."

    If you can't create a "helpful digital assistant" that anyone can easily distinguish from a malicious scraper, then you probably shouldn't be creating traffic on the internet.

  • Perplexity has three types of access to websites.

    The first is crawling. Most of that is not done by them, as they mostly use models trained by others.
    The second is web search. When the AI decides what search terms may find the information for your question, it asks a search engine (e.g. via the Bing API). The search engine crawled the site, and a request may even trigger a recrawling.
    The third is agentic access. It doesn't crawl information, but only accesses it as proxy for the user. This means when the s

  • Perplexity is pretty snippy and judgy in their rebuttal. I think they're caught and trying to misdirect.
