AI | The Internet

While Meta Crawls the Web for AI Training Data, Bruce Ediger Pranks Them with Endless Bad Data (bruceediger.com)

From the personal blog of interface expert Bruce Ediger: Early in March 2025, I noticed that a web crawler with a user agent string of

meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)

was hitting my blog's machine at an unreasonable rate.

I followed the URL and discovered this is what Meta uses to gather premium, human-generated content to train its LLMs. I found the rate of requests to be annoying.

I already have a PHP program that creates the illusion of an infinite website. I decided to answer any HTTP request that had "meta-externalagent" in its user agent string with the contents of a bork.php generated file...

This worked brilliantly. Meta ramped up to requesting 270,000 URLs on May 30 and 31, 2025...

After about 3 months, I got scared that Meta's insatiable consumption of Super Great Pages about condiments, underwear and circa 2010 C-List celebs would start costing me money. So I switched to giving "meta-externalagent" a 404 status code. I decided to see how long it would take one of the highest valued companies in the world to decide to go away.

The answer is 5 months.
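A minimal sketch of the kind of handler the excerpt describes, in PHP since that is what bork.php is written in. This is not Ediger's actual script: the word list, the /articles/ URL scheme, and the real-page.html fallback are made-up placeholders, the commented-out 404 line corresponds to the later phase of the post, and it assumes the web server routes the crawler's requests to this one script.

    <?php
    // Stand-in for the approach described above: answer Meta's crawler with
    // generated nonsense (or, later, a 404) and let everyone else through.
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';

    if (stripos($ua, 'meta-externalagent') === false) {
        readfile(__DIR__ . '/real-page.html');   // placeholder for the real site
        exit;
    }

    // Later phase of the post: just send the bot away with a 404 instead.
    // http_response_code(404); exit;

    // "Infinite" site: seed the word salad from the requested URL so every
    // fake page is stable and links to five more fake pages.
    mt_srand(crc32($_SERVER['REQUEST_URI'] ?? '/'));
    $words = ['mustard', 'ketchup', 'boxers', 'briefs', 'celebrity', 'gossip', 'relish'];
    $pick = function (int $n) use ($words): string {
        $out = [];
        for ($i = 0; $i < $n; $i++) {
            $out[] = $words[mt_rand(0, count($words) - 1)];
        }
        return implode(' ', $out);
    };

    header('Content-Type: text/html; charset=utf-8');
    echo '<html><body><h1>' . htmlspecialchars($pick(4)) . "</h1>\n";
    for ($p = 0; $p < 5; $p++) {
        echo '<p>' . htmlspecialchars($pick(40)) . "</p>\n";
    }
    for ($l = 0; $l < 5; $l++) {
        $slug = str_replace(' ', '-', $pick(3)) . '-' . mt_rand(1000, 9999);
        echo '<p><a href="/articles/' . htmlspecialchars($slug) . '">' . htmlspecialchars($pick(3)) . "</a></p>\n";
    }
    echo "</body></html>\n";

Every generated page links to pages that do not exist until they are requested, which is what keeps the crawl going indefinitely.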


Comments:
  • media (Score:3, Interesting)

    by wokka1 ( 913473 ) on Saturday November 15, 2025 @04:29PM (#65797907)

    Many, many years ago, back when Napster was all the rage, we had record labels searching the web with bots looking for hosted media. There was a movement to host generated lists of MP3s and fake files on our websites. I thought I had an archive of mine, but it seems I removed it too long ago.

    This story reminds me of that time, good on him.

    • Re:media (Score:5, Interesting)

      by Kisai ( 213879 ) on Saturday November 15, 2025 @07:10PM (#65798059)

      My websites, including two dormant forums were being consumed at about 1TB/mo (half the bandwidth limit) by a combination of facebook, google, alibaba and openai at one point before I blocked the entire IP address range.

      Like, these fucking companies either need to pay up or throttle down the ripping rate, so that people who actually want to visit archived content can do so without getting time-out errors. As it is, I dialed down the capacity of those sites so that they don't bring the server down.

      • It took some time to set up, but I found "anubis" to work really well: the load and crazy traffic to the forum are gone, and the main site can still be found.
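      A minimal sketch of the application-level IP-range block described above, assuming IPv4 only; the CIDR ranges are made-up placeholders rather than real crawler ranges, and in practice a firewall or web-server deny rule is the cheaper place to do this:

          <?php
          // Drop requests from listed crawler IP ranges before doing any real work.
          function ip_in_cidr(string $ip, string $cidr): bool {
              [$subnet, $bits] = explode('/', $cidr);
              $ipLong = ip2long($ip);
              $subnetLong = ip2long($subnet);
              if ($ipLong === false || $subnetLong === false) {
                  return false;                    // IPv6 or garbage: this toy helper only handles IPv4
              }
              $mask = -1 << (32 - (int)$bits);
              return ($ipLong & $mask) === ($subnetLong & $mask);
          }

          $blocked = ['203.0.113.0/24', '198.51.100.0/24'];   // placeholder ranges
          foreach ($blocked as $cidr) {
              if (ip_in_cidr($_SERVER['REMOTE_ADDR'] ?? '', $cidr)) {
                  http_response_code(403);
                  exit("Forbidden\n");
              }
          }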
  • by PPH ( 736903 ) on Saturday November 15, 2025 @04:50PM (#65797921)

    Isn't Reddit enough?

  • but it didn't; it just changed to meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/burgular
  • by Nonesuch ( 90847 ) on Saturday November 15, 2025 @05:52PM (#65797953) Homepage Journal

    Back in the day, we used to run "tarpit" SMTP servers which looked like an open mail relay but ACK'd incoming packets only just barely fast enough to keep the remote client from timing out and giving up. The theory was that tying up spammer resources was a net good for the internet, as a sender busy trying to stuff messages through a tarpit was tied up waiting on your acknowledgement, reducing their impact on others.

    Similarly, perhaps the right answer here is to limit the number of concurrent connections from any one network range, and use tarpit tactics to rate-limit the speed at which your server generates content to feed the bot -- just keep ramping down until they drop off, then remember the last good rate to use for subsequent requests.

    It would perhaps be interesting to randomly generate content and hyperlinks to ever deeper random URLs -- are these new crawlers more interested in some URLs or extensions than others? If you pull fresh keywords from the full URL the crawler requested, will it delve ever deeper into that "topic"? If their Accept-Encoding header supports gzip or deflate, what happens when you feed them a zip-bomb [idiallo.com]?
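    A rough sketch of the tarpit idea applied to HTTP, assuming output buffering is disabled and the web server is not buffering the response itself; the chunk count and delay are arbitrary:

        <?php
        // Dribble the response out slowly so the crawler's connection stays occupied.
        set_time_limit(0);                    // let the script run past the usual limit
        header('Content-Type: text/html; charset=utf-8');
        header('X-Accel-Buffering: no');      // ask nginx (if present) not to buffer
        while (ob_get_level() > 0) {
            ob_end_flush();                   // drop any PHP-level output buffers
        }

        echo "<html><body>\n";
        flush();
        for ($i = 0; $i < 200; $i++) {
            echo "<p>filler paragraph $i</p>\n";
            flush();
            if (connection_aborted()) {
                break;                        // the bot gave up; stop wasting cycles
            }
            sleep(2);                         // roughly seven minutes per "page"
        }
        echo "</body></html>\n";

    Ramping the sleep interval up or down per client, and remembering the last rate the bot tolerated, would get closer to the adaptive scheme described above.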

  • by Anonymous Coward

    A year ago we had the first news about software that can generate garbage to feed AI bots. Nobody has yet shown that any AI got worse because of it. The next step after crawling is quality filtering. Many of the artists claiming that their images were used in Stable Diffusion had to learn that the aesthetic rating network that was used rejects their images anyway. You can debate whether that's justified or not, but at least it is very unlikely that their images were used for training. LLM training pre-filters for quality...

  • Redirect (Score:4, Interesting)

    by kbrannen ( 581293 ) on Saturday November 15, 2025 @06:50PM (#65798029)

    Since he can clearly identify them, just redirect the request to Facebook dot com and let Meta eat itself.

    • Oooh. I like this idea.

    • Would never work. No crawler runs in mindless, unrestricted fashion. They use a set of guardrails regarding what they will or will not follow. And even the simplest of crawlers are smart enough to not re-enter a destination more than once.

    • Or, feed it answers to random questions coming from Meta AI.
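    The redirect suggestion is a one-liner in the same kind of PHP front door (a sketch; as the reply above notes, the crawler will most likely just record the redirect and move on):

        <?php
        if (stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'meta-externalagent') !== false) {
            header('Location: https://www.facebook.com/', true, 302);   // bounce the bot back to its owner
            exit;
        }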
  • by rossdee ( 243626 ) on Saturday November 15, 2025 @07:23PM (#65798073)

    I thought that "Bad Data" was called Lore

    (also played by Brent Spiner)

  • both directions. You'll be better for it.

  • by dskoll ( 99328 ) on Saturday November 15, 2025 @09:07PM (#65798177) Homepage

    I looked at his bork.php script. It's a bit simple-minded with pretty limited vocabulary, so I don't think it'll have all that much effect on the AI training.

    I just give Facebook's bots (and many other user-agents) a "403 Forbidden" response and go about my day.
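    A minimal sketch of that 403 approach at the PHP level; the user-agent substrings are illustrative, not a complete or authoritative list:

        <?php
        // Refuse known AI crawlers outright with 403 Forbidden.
        $badAgents = ['meta-externalagent', 'GPTBot', 'CCBot', 'Bytespider'];
        $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
        foreach ($badAgents as $needle) {
            if (stripos($ua, $needle) !== false) {
                http_response_code(403);
                header('Content-Type: text/plain');
                exit("Forbidden\n");
            }
        }
        // ...otherwise fall through and serve the page as usual.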
