The Internet AI

Cloudflare Rolls Out Feature For Blocking AI Companies' Web Scrapers (siliconangle.com) 40

Cloudflare today unveiled a new feature, part of its content delivery network (CDN), that prevents AI developers from scraping content on the web. According to Cloudflare, the feature is available for both the free and paid tiers of its service. SiliconANGLE reports: The feature uses AI to detect automated content extraction attempts. According to Cloudflare, its software can spot bots that scrape content for LLM training projects even when they attempt to avoid detection. "Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare engineers wrote in a blog post today. "We've monitored this activity over time, and we're proud to say that our global machine learning model has always recognized this activity as a bot."

One of the crawlers that Cloudflare managed to detect is a bot that collects content for Perplexity AI Inc., a well-funded search engine startup. Last month, Wired reported that the manner in which the bot scrapes websites makes its requests appear as regular user traffic. As a result, website operators have struggled to block Perplexity AI from using their content. Cloudflare assigns every website visit that its platform processes a score of 1 to 99. The lower the number, the greater the likelihood that the request was generated by a bot. According to the company, requests made by the bot that collects content for Perplexity AI consistently receive a score under 30.
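
The scoring described above can be sketched in a few lines. This is only an illustration, not Cloudflare's implementation: the function name, the labels, and the choice to use 30 as a cutoff are assumptions drawn from the remark that Perplexity's crawler consistently scores under 30.

```python
# Hypothetical sketch: Cloudflare's real scoring and thresholds are not public here.
# The 1-99 range and "lower means more bot-like" come from the summary above;
# the cutoff of 30 is an assumption based on the Perplexity example.
ASSUMED_BOT_CUTOFF = 30


def classify_request(bot_score: int) -> str:
    """Label a request from its bot score (1 = almost surely a bot, 99 = almost surely not)."""
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score must be between 1 and 99")
    # Scores below the assumed cutoff are treated as likely automated traffic.
    if bot_score < ASSUMED_BOT_CUTOFF:
        return "likely automated"
    return "not flagged"


if __name__ == "__main__":
    for score in (5, 29, 30, 95):
        print(score, classify_request(score))
```

A site operator would presumably pick the cutoff to balance blocked scrapers against false positives for real visitors.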

"When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint," Cloudflare's engineers detailed. "For every fingerprint we see, we use Cloudflare's network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint." Cloudflare will update the feature over time to address changes in AI scraping bots' technical fingerprints and the emergence of new crawlers. As part of the initiative, the company is rolling out a tool that will enable website operators to report any new bots they may encounter.

This discussion has been archived. No new comments can be posted.

  • by Visarga ( 1071662 ) on Thursday July 04, 2024 @02:22AM (#64599927)
    They are fighting fire with fire, AI against AI. I presume many companies/websites would like to opt out of LLM answers based off their content, as it removes the need to visit the site. So they would like to differentiate by usage - plain search is OK, generative search not OK.
    • What about Windows Recall, Google and Apple AI that are already built into personal operating systems? Stopping AI on cloud platforms means very little if you cannot remove AI from the operating systems that most people and businesses are using. https://www.youtube.com/watch?... [youtube.com]
      • The Cloudflare feature has two major uses. It protects content from being used as training data and it shields servers from the inelegant scraping that hoovers up bandwidth. For sysadmins, the latter is particularly important because the AI companies do not respect robots.txt and their methods of scraping appear to be extremely inefficient.

    • by Anonymous Coward

      They are fighting fire with fire, AI against AI. I presume many companies/websites would like to opt out of LLM answers based off their content, as it removes the need to visit the site. So they would like to differentiate by usage - plain search is OK, generative search not OK.

      And when Google abused search for decades, to dominate the entire online search space? Well, that doesn’t count.

      You’re not a “bad actor” if you pay enough politicians, enough money. It’s a Don’t Be Evil trademark.

    • Rock'em Sock'em Robots for the 21st century.
    • by sg_oneill ( 159032 ) on Thursday July 04, 2024 @05:03AM (#64600167)

      As a general rule, if it's behaving itself, doesn't flood people's bandwidth, honestly reports its identity in the user agent field, and follows robots.txt, then it's a good citizen and can go about its business. Google, Bing and so on all do this; they are fine. The AI scrapers tend to absolutely beset people's bandwidth, ignore robots.txt and disguise themselves as Chrome or Firefox. THOSE bots can f*** right off.
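
      For reference, the robots.txt check that a well-behaved crawler is expected to perform can be sketched with Python's standard library. The bot name `ExampleAIBot`, the example URLs, and the robots.txt contents below are all made up for illustration; a real crawler would download the site's own robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named AI crawler entirely and
# keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleAIBot", "https://example.com/article"))
    print(may_fetch("OtherBot", "https://example.com/article"))
    print(may_fetch("OtherBot", "https://example.com/private/notes"))
```

      The check only works if the crawler reports its real name in the user agent, which is exactly the honesty the comment above asks for.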

    • by martin-boundary ( 547041 ) on Thursday July 04, 2024 @07:39AM (#64600287)
      It's not just that it removes the incentive to visit their site, it's that AI probably *misrepresents* the content on their site. Every time a summary of a website happens, there's information compression. And this kind of compression results in some aspects of the content being hidden or described badly. It's inevitable, and quite understandable, that sites do not want to be "summarised" as opposed to being given the chance to show fully what they are about to a real human web surfer.
    • - Blind person uses a browser to read a government web site hosted in Cloudflare
      - Blind person's browser uses scraper-like processing to download, consolidate, and display the government web site in a braille or text-to-speech capacity
      - Cloudflare blocks the web site access
      - Blind person sues the government under ADA and Cloudflare is also named

  • Let's hear it for recursive hypocrisy!

  • by Ken_g6 ( 775014 ) on Thursday July 04, 2024 @02:28AM (#64599939)

    I can never get Cloudflare to understand that I'm human when I use ScriptSafe. I have to disable the add-on to get past the checkbox.

  • by xack ( 5304745 ) on Thursday July 04, 2024 @03:31AM (#64600035)
    Quite often users of non-mainstream Linux browsers get put in an infinite loop by Cloudflare. Cloudflare is even hostile to Firefox some days. Just admit that when it comes to Linux users and other niche OS users, their usage fingerprint is enough to look bot-like, and you shouldn't discriminate against them by using the internet equivalent of racial profiling.
    • by AmiMoJo ( 196126 ) on Thursday July 04, 2024 @07:07AM (#64600249) Homepage Journal

      It happens all the time on Windows too, if you use a VPN or have too many privacy enhancements in your browser.

    • Users often read into things which aren't there. I've experienced a Cloudflare verification loop precisely once, and it was with a Chrome browser. I get the "are you human" thing all the time using Chrome on Android. I get it about as regularly using Firefox as I get it on Chrome or Edge in Windows. I get it all the time on Tor / behind VPNs. But I don't see it any more often on my Seamonkey running on Ubuntu NAS than I do on my Windows 11 desktop.

      No one is being locked out, but humans are great at identifyi

      • I've seen the bot-accusing loop frequently for IPv6 tunnels, which is unfortunate because a lot of ISPs still don't have a true v6 stack, so it's that or IPv4 only.

        • And that's the thing. They aren't targeting a specific browser. They are targeting non-standard behaviour. What's the bet the OP is running some plugins that mess with scripts, obfuscate their behaviour, or end up making their network connection look weird (like VPNs).

          The conspiracy here is just silly.

      • by tlhIngan ( 30335 )

        No one is being locked out, but humans are great at identifying non-existent trends.

        Well, it isn't a non-existent trend. It's just that people are making their traffic look like abusive traffic.

        Cloudflare looks at traffic trends. If they notice a lot of "bad" traffic originating from a certain IP, they put up enhanced protections against it. It just so happens a lot of bad actors hide their activities through VPNs and TOR and Cloudflare notices that.

        I mean, if you're seeing lots of DDoS attacks from an IP,

    • ... you shouldn't discriminate against them by using the internet equivalent of racial profiling.

      As a Linux, Firefox, VPN, and Cookie AutoDelete user, I am constantly asked to verify that I'm human, and I detest how Cloudflare operates. However, operating systems and browsers are not protected classes and are optional, because you can choose to use a different OS or browser, which is why I also recognize it as being a business decision that is entirely impersonal and has nothing to do with any ideology (besides avarice). To call it "the internet equivalent of racial profiling" belittles just how hurtful

  • it's being constantly crawled by amazon, facebook and of course various 'ai' sites.

    Let them eat it.

  • by Eunomion ( 8640039 ) on Thursday July 04, 2024 @05:04AM (#64600169)
    I've already had to terminate business relationships because of skyrocketing false positives.
  • Next, the AI companies will also use AI (not an LLM) to generate scraping patterns that defeat the Cloudflare fingerprinting.

    I am pretty sure I overheard one of the defeated scrapers, just the other day, saying, "I'll Be Back."

    • by cstacy ( 534252 )

      Next, the AI companies will also use AI (not an LLM) to generate scraping patterns that defeat the Cloudflare fingerprinting.

      I am pretty sure I overheard one of the defeated scrapers, just the other day, saying, "I'll Be Back."

      Then will come Scraping Services.
      "Come with me if you want to scrape."

  • IMHO this is excellent news for the web!

    As a rule, the ability to limit unknown browsers from accessing your content, especially if they treat robots.txt as if it was toilet paper, is a welcome addition for small operators who don't have the manpower to do their own filtering.

    I think Cloudflare is broadly on the right track here, as long as they allow the 0.1% of libertarian webmasters to disable the filtering if they want to. There are legitimate, if pretty limited, reasons to allow scraping spiders to
