Cloudflare Rolls Out Feature For Blocking AI Companies' Web Scrapers (siliconangle.com)
Cloudflare today unveiled a new feature in its content delivery network (CDN) that prevents AI developers from scraping content on the web. According to Cloudflare, the feature is available for both the free and paid tiers of its service. SiliconANGLE reports: The feature uses AI to detect automated content extraction attempts. According to Cloudflare, its software can spot bots that scrape content for LLM training projects even when they attempt to avoid detection. "Sadly, we've observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent," Cloudflare engineers wrote in a blog post today. "We've monitored this activity over time, and we're proud to say that our global machine learning model has always recognized this activity as a bot."
One of the crawlers that Cloudflare managed to detect is a bot that collects content for Perplexity AI Inc., a well-funded search engine startup. Last month, Wired reported that the manner in which the bot scrapes websites makes its requests appear as regular user traffic. As a result, website operators have struggled to block Perplexity AI from using their content. Cloudflare assigns every website visit that its platform processes a score of 1 to 99. The lower the number, the greater the likelihood that the request was generated by a bot. According to the company, requests made by the bot that collects content for Perplexity AI consistently receive a score under 30.
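The scoring scheme described above (1 to 99, lower means more bot-like, with the Perplexity crawler reportedly scoring under 30) can be illustrated with a small sketch. This is not Cloudflare's code or API: the function name `classify_request` and the default threshold are hypothetical, chosen only to mirror the numbers in the summary.

```python
def classify_request(bot_score: int, threshold: int = 30) -> str:
    """Label a request from its bot score.

    Assumption: scores run from 1 to 99 and lower means "more likely a bot",
    as the summary describes. The threshold of 30 is borrowed from the
    Perplexity example and is not an official cutoff.
    """
    if not 1 <= bot_score <= 99:
        raise ValueError(f"bot score must be between 1 and 99, got {bot_score}")
    return "likely_bot" if bot_score < threshold else "likely_human"


if __name__ == "__main__":
    # Hypothetical scores for three incoming requests.
    for score in (5, 29, 85):
        print(score, classify_request(score))
```

A site operator would presumably tune the threshold rather than hard-code it, since a stricter cutoff trades fewer missed bots for more humans being challenged.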
"When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint," Cloudflare's engineers detailed. "For every fingerprint we see, we use Cloudflare's network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint." Cloudflare will update the feature over time to address changes in AI scraping bots' technical fingerprints and the emergence of new crawlers. As part of the initiative, the company is rolling out a tool that will enable website operators to report any new bots they may encounter.
They are fighting fire with fire (Score:3)
Re: (Score:2)
Re: They are fighting fire with fire (Score:2)
The Cloudflare feature has two major uses. It protects content from being used as training data and it shields servers from the inelegant scraping that hoovers up bandwidth. For sysadmins, the latter is particularly important because the AI companies do not respect robots.txt and their methods of scraping appear to be extremely inefficient.
Re: (Score:1)
They are fighting fire with fire, AI against AI. I presume many companies/websites would like to opt out of LLM answers based off their content, as it removes the need to visit the site. So they would like to differentiate by usage - plain search is OK, generative search not OK.
And when Google abused search for decades, to dominate the entire online search space? Well, that doesn’t count.
You’re not a “bad actor” if you pay enough politicians, enough money. It’s a Don’t Be Evil trademark.
Re: (Score:2)
Re:They are fighting fire with fire (Score:5, Interesting)
As a general rule, if it's behaving itself, doesn't flood people's bandwidth, honestly reports its identity in the user agent field and follows robots.txt, then it's a good citizen and can go about its business. Google, Bing and so on all do this; they are fine. The AI scrapers tend to absolutely beset people's bandwidth, ignore the robots.txt and disguise themselves as Chrome or Firefox. THOSE bots can f*** right off.
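For anyone wondering what "follows robots.txt and reports its identity" looks like from the crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot names (`ExampleAIBot`, `FriendlySearchBot`), the sample rules and the example.com URLs are all made up for illustration; a real crawler would fetch the site's actual robots.txt and send its real name in the User-Agent header.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is banned outright,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """A well-behaved crawler asks this before every request."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("ExampleAIBot", "FriendlySearchBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, may_fetch(rules, agent, url))
```

The check only works if the bot is honest about its name, which is exactly why spoofing a Chrome or Firefox user agent defeats it.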
Re:They are fighting fire with fire (Score:4, Insightful)
Expecting the troll lawsuits to start (Score:2)
- Blind person uses a browser to read a government web site hosted in Cloudflare
- Blind person's browser uses scraper-like processing to download, consolidate, and display the government web site in a braille or text-to-speech capacity
- Cloudflare blocks the web site access
- Blind person sues the government under ADA and Cloudflare is also named
Uses AI to block AI (Score:1)
Let's hear it for recursive hypocrisy!
Re: Uses AI to block AI (Score:2)
AI in and of itself is not a bad thing. It is the implementation that matters.
Anywhere to report false positives? (Score:3)
I can never get Cloudflare to understand that I'm human when I use ScriptSafe. I have to disable the add-on to get past the checkbox.
Re:Anywhere to report false positives? (Score:4, Funny)
You should try this plugin "I am definitely, most absolutely, really, really, really, not a robot." It uses a form of AI that actually has no idea it's not human.
Re: (Score:2)
Not sure if that's a Funny or Informative!
Re: (Score:2)
They are at fault, when Cloudflare depends on your browser to allow its scripts to exploit fingerprinting vulnerabilities?
Re: (Score:2)
Are you scraping web sites? If not, you probably don't have to worry too much about this.
Re: (Score:2)
You're running a script that interferes with the detection; that is not a false positive.
Cloudflare already locks out alternative browsers (Score:5, Interesting)
Re:Cloudflare already locks out alternative browse (Score:5, Interesting)
It happens all the time on Windows too, if you use a VPN or have too many privacy enhancements in your browser.
Re: (Score:1)
Users often read into things which aren't there. I've experienced a Cloudflare verification loop precisely once, and it was with a Chrome browser. I get the "are you human" thing all the time using Chrome on Android. I get it about as regularly using Firefox as I get it on Chrome or Edge in Windows. I get it all the time on Tor / behind VPNs. But I don't see it any more often on my Seamonkey running on Ubuntu NAS than I do on my Windows 11 desktop.
No one is being locked out, but humans are great at identifyi
Re: Cloudflare already locks out alternative brows (Score:2)
I've seen the bot-accusing loop frequently for IPv6 tunnels, which is unfortunate because a lot of ISPs still don't have a true v6 stack, so it's that or IPv4 only.
Re: (Score:2)
And that's the thing. They aren't targeting a specific browser. They are targeting non-standard behaviour. What's the bet the OP is running some plugins that mess with scripts, obfuscate their behaviour, or end up making their network connection look weird (like VPNs).
The conspiracy here is just silly.
Re: (Score:2)
Well, it isn't a non-existent trend. It's just that people are making their traffic look like abusive traffic.
Cloudflare looks at traffic trends. If they notice a lot of "bad" traffic originating from a certain IP, they put up enhanced protections against it. It just so happens a lot of bad actors hide their activities through VPNs and Tor, and Cloudflare notices that.
I mean, if you're seeing lots of DDoS attacks from an IP,
Re: (Score:2)
... you shouldn't discriminate against them by using the internet equivalent of racial profiling.
As a Linux, Firefox, VPN, and Cookie AutoDelete user who is constantly asked to verify that I'm human, I detest how Cloudflare operates. However, operating systems and browsers are not protected classes, and they are optional: you can choose to use a different OS or browser, which is why I also recognize it as a business decision that is entirely impersonal and has nothing to do with any ideology (besides avarice). To call it "the internet equivalent of racial profiling" belittles just how hurtful
I host a massive repo of information (Score:2)
it's being constantly crawled by amazon, facebook and of course various 'ai' sites.
Let them eat it.
Cloudflare is already broken. (Score:4, Informative)
Re: (Score:2)
Re: (Score:2)
Can someone please roll back this guy off the latest ChatGPT beta? He was more fun when he was a coherent idiot.
Re:Cloudflare is already broken. (Score:4, Interesting)
Which brings up the question, what score did you or I, real humans, receive?
Next Level (Score:2)
Next, the AI companies will also use AI (not an LLM) to generate scraping patterns that defeat the Cloudflare fingerprinting.
I am pretty sure I overheard one of the defeated scrapers, just the other day, saying, "I'll Be Back."
Re: (Score:2)
Next, the AI companies will also use AI (not an LLM) to generate scraping patterns that defeat the Cloudflare fingerprinting.
I am pretty sure I overheard one of the defeated scrapers, just the other day, saying, "I'll Be Back."
Then will come Scraping Services.
"Come with me if you want to scrape."
Re: (Score:2)
that's great news (Score:2)
As a rule, the ability to limit unknown browsers from accessing your content, especially if they treat robots.txt as if it was toilet paper, is a welcome addition for small operators who don't have the manpower to do their own filtering.
I think Cloudflare is broadly on the right track here, as long as they allow the 0.1% of libertarian webmasters to disable the filtering if they want to. There are legitimate, if pretty limited, reasons to allow scraping spiders to