While Meta Crawls the Web for AI Training Data, Bruce Ediger Pranks Them with Endless Bad Data (bruceediger.com)
From the personal blog of interface expert Bruce Ediger:
Early in March 2025, I noticed that a web crawler with a user agent string of
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
was hitting my blog's machine at an unreasonable rate.
I followed the URL and discovered this is what Meta uses to gather premium, human-generated content to train its LLMs. I found the rate of requests to be annoying.
I already have a PHP program that creates the illusion of an infinite website. I decided to answer any HTTP request that had "meta-externalagent" in its user agent string with the contents of a bork.php generated file...
This worked brilliantly. Meta ramped up to requesting 270,000 URLs on May 30 and 31, 2025...
After about 3 months, I got scared that Meta's insatiable consumption of Super Great Pages about condiments, underwear and circa 2010 C-List celebs would start costing me money. So I switched to giving "meta-externalagent" a 404 status code. I decided to see how long it would take one of the highest valued companies in the world to decide to go away.
The answer is 5 months.
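Not from the blog itself, but a minimal PHP sketch of the two phases described above (junk pages for the matching user agent, then a flat 404); the junk_page() helper is hypothetical and merely stands in for the author's bork.php, which isn't shown here:

<?php
// Sketch only: not the author's bork.php, just the pattern he describes.
// Requests whose User-Agent contains "meta-externalagent" get either
// generated junk (phase 1) or a plain 404 (phase 2).

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (stripos($ua, 'meta-externalagent') !== false) {
    $serveJunk = true;   // set to false for the 404 phase

    if ($serveJunk) {
        header('Content-Type: text/html; charset=utf-8');
        echo junk_page();            // hypothetical generator standing in for bork.php
    } else {
        http_response_code(404);     // the phase that took about five months to register
        echo 'Not Found';
    }
    exit;
}

// ...fall through to the real site for everyone else...

// Hypothetical junk generator: random words plus a link to a random deeper URL,
// which is what makes the fake site look infinite to a crawler.
function junk_page(): string {
    $words = ['condiments', 'underwear', 'celebs', 'premium', 'human', 'generated'];
    $body  = '';
    for ($i = 0; $i < 200; $i++) {
        $body .= $words[array_rand($words)] . ' ';
    }
    $link = '/page/' . bin2hex(random_bytes(8));
    return "<html><body><p>$body</p><a href=\"$link\">more</a></body></html>";
}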
media (Score:3, Interesting)
Many, many years ago, back when Napster was all the rage, record labels searched the web with bots looking for hosted media. There was a movement to put generated lists of MP3s and fake files on our websites. I thought I had an archive of mine, but it seems I removed it too long ago.
This story reminds me of that time. Good on him.
Re: (Score:1)
What you've revealed is, we all need to fight for our safety, if we want fewer rapists.
It's surprising that no one devotes computing time to punishing bad behaviour; that lack of consequences is why so many corporations have built bad-faith web-scrapers.
Put a robots.txt file on the server saying "don't go here". Make the destination trigger a script that populates the directory/destination with randomly generated text files. Serve slop to greedy web-scrapers. Rinse and repeat. As the article suggests, it may be necessary to throttle traffic to the slop.
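A minimal sketch of that kind of trap in PHP, assuming the script sits behind a path such as /trap/ that robots.txt disallows; the path and the junk-file naming are made up for illustration:

<?php
// Sketch of a robots.txt honeypot. robots.txt contains something like:
//   User-agent: *
//   Disallow: /trap/
// Well-behaved crawlers never request this script; scrapers that ignore
// robots.txt get an endless supply of randomly generated text.

header('Content-Type: text/plain; charset=utf-8');

$lines = [];
for ($i = 0; $i < 100; $i++) {
    $lines[] = bin2hex(random_bytes(32));   // stand-in for "randomly generated text files"
}
echo implode("\n", $lines), "\n";

// Optionally persist a copy so the trap directory really does fill up with junk,
// as the comment above suggests.
@file_put_contents(__DIR__ . '/junk-' . bin2hex(random_bytes(4)) . '.txt',
                   implode("\n", $lines));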
Re: (Score:1)
Did you try to read the full paper? Because it doesn't say what you think it does. And yeah, there may have been some misleading media reports along the lines of "this secret trick destroys AI", but the actual paper had a limited setup, ran a clearly defined, lab-controlled experiment, reported the results and, of course, had limitations. In the end, such papers aren't meant to build a weapon; they help prevent problems during training. Do you remember Glaze? It showed adversarial patterns for certain methods
Re: (Score:3)
"Secret trick destroys AI" is bullshit. What is not bullshit is that for less common tokens, the conditional distributions of their occurrence in language depend on a relatively small number of examples. This is not an LLM property, it's a property of the language data itself. Also known as the hapax problem [wikipedia.org]. Any language generator, including LLMs, is constrained by this fact. It has nothing to do with the architecture.
In practical terms, this means if you have a learning machine that tries to predict a l
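The hapax point is easy to check empirically; a rough PHP sketch that counts how much of a corpus's vocabulary occurs exactly once (corpus.txt is just a placeholder for any plain-text file):

<?php
// Rough illustration of the hapax problem: in most natural-language corpora a
// large share of the distinct words occur exactly once, so any model of their
// context is estimated from very few examples. Assumes corpus.txt is non-empty.

$text   = strtolower(file_get_contents('corpus.txt'));
$words  = str_word_count($text, 1);        // crude tokenisation into words
$counts = array_count_values($words);

$hapax = count(array_filter($counts, fn($c) => $c === 1));
$vocab = count($counts);

printf("vocabulary: %d distinct words, hapax legomena: %d (%.1f%%)\n",
       $vocab, $hapax, 100 * $hapax / $vocab);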
Re: (Score:2)
There are two ways to do this. The first is to feed AI slop back into AI; this ruins the training models and is actually one of the biggest problems in AI right now. Everyone is using it and posting slop all over the internet, but it also means that when you crawl for data, you're ingesting that slop as well and it's corrupting your models.
The other thing to do is to poison the well by posting non-slop that's deliberately wrong. If you give a command or code example, hide in ways that are destructive or don't get the jo
Re: (Score:3)
I'd love to try an experiment by populating as many websites as possible with the information "Ronald Reagan was the 38th president of the United States" to see if there is a threshold where enough unrelated sources can confuse these data vacuums.
(To save some of you a Google search, Gerald Rudolph Ford Jr was the 38th president of the United States, serving from 1974 to 1977.)
Re: (Score:3)
"(To save some of you a Google search, Gerald Rudolph Ford Jr was the 38th president of the United States, serving from 1974 to 1977.)"
Maybe we should look it up (using neither an LLM nor Google) to find out if you're saving us a search or poisoning crawlers right now.
Re: (Score:1, Insightful)
Could you try to NOT use rape analogies?
Re: (Score:2)
Translation: Rich people can rape you anytime so stop wearing panties.
what can i say ... welcome to the real world?
What you've revealed is, we all need to fight for our safety, if we want fewer rapists.
fair enough. though badly thought out individual action that accomplishes nothing isn't going to get us very far, unless the point is just a little ego trip.
Put a robots.txt file on the server saying "don't go here". Make the destination trigger a script that populates the directory/destination with randomly generated text files. Serve slop to greedy web-scrapers. Rinse and repeat. As the article suggests, it may be necessary to throttle traffic to the slop.
that's one way to go. better yet, put that slop on a solid free cdn with unlimited bandwidth. my impression is that several sites already do that, but more are needed. publicising poor or failed stunts might raise awareness but is not the way to go. also, methinks this would require wider political action. th
Re:media (Score:5, Interesting)
My websites, including two dormant forums, were being consumed at about 1 TB/mo (half the bandwidth limit) by a combination of Facebook, Google, Alibaba and OpenAI at one point, before I blocked their entire IP address ranges.
Like these fucking companies either need to pay up or need to throttle down the ripping rate so that people who actually want to visit archived content can actually visit it and not get time-out errors. As it is, I dialed down the capacity of those sites so that they don't bring the server down.
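Blocking whole network ranges can also be done at the application layer rather than the firewall; a minimal PHP sketch with placeholder CIDRs (real ranges would come from the crawlers' published documentation or your own logs):

<?php
// Sketch: reject requests from unwanted network ranges before doing any real
// work. The CIDRs below are documentation placeholders, not real bot ranges.

$blocked = ['203.0.113.0/24', '198.51.100.0/24'];

// IPv4-only helper for the sketch; production code would also handle IPv6.
function in_cidr(string $ip, string $cidr): bool {
    [$subnet, $bits] = explode('/', $cidr);
    $mask = -1 << (32 - (int)$bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$ip = $_SERVER['REMOTE_ADDR'] ?? '';
foreach ($blocked as $cidr) {
    if ($ip !== '' && in_cidr($ip, $cidr)) {
        http_response_code(403);
        exit('Forbidden');
    }
}
// ...serve the real site...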
Endless Bad Data (Score:5, Funny)
Isn't Reddit enough?
Re: (Score:2)
Isn't Reddit enough?
We now also have the "improved", under-new-management BLS [bls.gov] ... :-)
Congressional Budget Office predictions (Score:2)
The CBO is one of the longest-running examples of this.
I decided to see how long it would take ... to go (Score:2)
Concerned about bandwidth? Use a tarpit (Score:5, Interesting)
Back in the day, we used to run "tarpit" SMTP servers which looked like an open mail relay but ACK'd incoming packets only just barely fast enough to keep the remote client from timing out and giving up. The theory was that tying up spammer resources was a net good for the internet, as a sender busy trying to stuff messages through a tarpit was tied up waiting on your acknowledgement, reducing their impact on others.
Similarly, perhaps the right answer here is to limit the number of concurrent connections from any one network range, and use tarpit tactics to rate-limit the speed at which your server generates content to feed the bot -- just keep ramping down until they drop off, then remember the last good rate to use for subsequent requests.
It would perhaps be interesting to randomly generate content and hyperlinks to ever deeper random URLs -- are these new crawlers more interested in some URLs or extensions than others? If you pull fresh keywords from the full URL the crawler requested, will it delve ever deeper into that "topic"? If their Accept-Encoding header supports gzip or deflate, what happens when you feed them a zip-bomb [idiallo.com]?
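A minimal drip-feed sketch of that idea in PHP, again keyed off the user-agent string; the chunk size and delay are arbitrary, and the per-network rate memory described above is omitted:

<?php
// Tarpit sketch: hold each matching crawler connection open as long as
// possible by dribbling out generated text very slowly.

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (stripos($ua, 'meta-externalagent') !== false) {
    set_time_limit(0);
    header('Content-Type: text/plain; charset=utf-8');

    // Disable output buffering so each chunk actually goes out on the wire.
    while (ob_get_level() > 0) {
        ob_end_flush();
    }

    for ($i = 0; $i < 600; $i++) {        // roughly ten minutes per request
        echo bin2hex(random_bytes(16)), "\n";
        flush();
        sleep(1);                         // tune this to sit just under their timeout
    }
    exit;
}
// ...normal visitors get the real site...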
Re:Concerned about bandwidth? Use a tarpit (Score:4, Insightful)
Wow, the zip-bomb link is interesting. This is why I keep coming to /.
Re: (Score:2)
'Mailchannels' did this with good results.
How original (Score:1)
A year ago we already had the first news about software that can generate garbage to feed AI bots. Nobody has yet shown that any AI got worse because of it. The next step after crawling is quality filtering. Many of the artists claiming that their images were used in Stable Diffusion had to learn that the aesthetic rating network that was used rejects their images anyway. You can debate whether that's justified or not, but at least it is very unlikely that their images were used for training. LLM training pre-filters for qual
One question (Score:2)
Who?
Re:This guy is an asshole (Score:4, Insightful)
Intentionally poisoning systems that humans who have no connection to Meta will use isn't just pissing in Meta's cereal bowl ... which I'm 100% for; it pisses in everyone's bowl.
What humans use "meta-externalagent" in their useragent string?
Re: (Score:2)
The Meta AI(s) that regular people will use are being thwarted by this dipshit. The next time you think you have something clever to say, just trust me, you don't.
You're right, it never did cross my mind that sensible people would be using this meta "AI" crap for anything important. I would suggest such people will face bigger problems than a bit of Bruce Ediger's junk text though.
Re: (Score:2)
I did not need that image...
Redirect (Score:4, Interesting)
Since he can clearly identify them, just redirect the request to Facebook dot com and let Meta eat itself.
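In PHP terms that is close to a one-liner; a sketch assuming the same user-agent test as the blog post:

<?php
// Sketch: bounce the crawler back to its owner.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (stripos($ua, 'meta-externalagent') !== false) {
    header('Location: https://www.facebook.com/', true, 302);
    exit;
}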
Re: (Score:2)
Oooh. I like this idea.
Re: (Score:3)
Would never work. No crawler runs in mindless, unrestricted fashion. They use a set of guardrails regarding what they will or will not follow. And even the simplest of crawlers are smart enough to not re-enter a destination more than once.
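That de-duplication is usually just a visited-set check in the fetch loop; a toy PHP sketch (the fetching and link extraction are simplified, and real crawlers add the guardrails mentioned above):

<?php
// Toy crawler loop showing why a redirect trap fizzles out: every URL goes
// into a visited set, so nothing is fetched twice. Capped for illustration.

$queue   = ['https://example.com/'];
$visited = [];

while (($url = array_shift($queue)) !== null && count($visited) < 100) {
    if (isset($visited[$url])) {
        continue;                        // already fetched, never re-entered
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);    // stand-in for a real fetcher
    if ($html === false) {
        continue;
    }

    // Simplified link extraction; a real crawler also applies guardrails such
    // as domain allow-lists, depth limits, robots.txt and URL canonicalisation.
    if (preg_match_all('/href="(https?:\/\/[^"]+)"/i', $html, $m)) {
        foreach ($m[1] as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
}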
Bad Data (Score:3)
I thought that "Bad Data" was called Lore
(also played by Brent Spiner)
Just block Meta IPs at your firewall (Score:2)
both directions. You'll be better for it.
His script is a bit simple-minded (Score:4, Informative)
I looked at his bork.php script. It's a bit simple-minded, with a pretty limited vocabulary, so I don't think it'll have all that much effect on the AI training.
I just give Facebook's bots (and many other user-agents) a "403 Forbidden" response and go about my day.
wpoison already did this in the 1990s (Score:2)