An AI Scraping Tool Is Overwhelming Websites With Traffic
An anonymous reader quotes a report from Motherboard: The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it's "sad" that they are fighting the inevitable rise of AI. "It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it," Romain Beaumont, the creator of the image scraping tool img2dataset, said on its GitHub page. "You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it."
Img2dataset is a free tool Beaumont shared on GitHub that lets users automatically download and resize the images behind a list of URLs. The result is an image dataset, the kind used to train image-generating AI models like OpenAI's DALL-E, the open source Stable Diffusion model, and Google's Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world, which contains more than 5 billion images and is used by Imagen and Stable Diffusion. Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like "X-Robots-Tag: noai" and "X-Robots-Tag: noindex." That means the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in. Beaumont defended img2dataset by comparing it to the way Google indexes all websites online to power its search engine, which benefits anyone who wants to search the internet.
"I directly benefit from search engines as they drive useful traffic to me," Eden told Motherboard. "But, more importantly, Google's bot is respectful and doesn't hammer my site. And most bots respect the robots.txt directive. Romain's tool doesn't. It seems to be deliberately set up to ignore the directives website owners have in place. And, frankly, it doesn't bring any direct benefit to me."
Motherboard notes: "A 'robots.txt' file tells search engine crawlers like Google which part of a site the crawler can access in order to prevent it from overloading the site with requests."
From the GitHub Page: (Score:5, Informative)
"Websites can pass the http headers X-Robots-Tag: noai, X-Robots-Tag: noindex , X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex By default img2dataset will ignore images with such headers."
Followed directly by:
"To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'"
I wonder what option most users will end up enabling. :)
Also, this tool doesn't seem to check robots.txt [from a quick source search, I may be wrong]. I get the impression they don't entirely care about this.
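For site owners who do want to send those opt-out headers, here is a minimal sketch of one way to do it in a Flask-served app. The header names are the ones quoted from the README above; whether any given scraper honors them is of course up to the scraper, and a static-file server would set the same header in its own config instead.

    # Sketch: attach the opt-out headers quoted above to every response.
    # Assumes a Flask-served site; nothing forces other scrapers to
    # respect these headers.
    from flask import Flask

    app = Flask(__name__)

    @app.after_request
    def add_noai_headers(response):
        # Header values are the ones the img2dataset README says it
        # honors by default.
        for directive in ("noai", "noimageai", "noindex", "noimageindex"):
            response.headers.add("X-Robots-Tag", directive)
        return response

    @app.route("/")
    def index():
        return "hello"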
Re:From the GitHub Page: (Score:5, Informative)
> Also this tool doesn't seem to check robots.txt
Right, then he whinged about it, closed the issue, and in a linked issue offered to let other people do the coding for robots.txt support.
What an epic douchelord. Entitled, expects the world to change to suit his behavior and ignores existing solutions while berating folks for not sharing his worldview.
Re:From the GitHub Page: (Score:5, Insightful)
Add honeypot URLs to robots.txt, and reference them in places where crawlers are likely to find them, but end users won't ever find them without inspecting the source code. For example, dead code segments of html or js. Any IP that hits one of those gets blackhole'd for 24 hours. That effectively forces the crawlers to honor robots.txt.
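A minimal sketch of that honeypot idea, assuming a Flask app and a toy in-memory blocklist. A real deployment would push the ban into the firewall or load balancer, and would plant the trap link in markup only bots will read.

    # robots.txt disallows a trap URL; anything that fetches it anyway
    # gets blocked for 24 hours. Paths and the in-memory blocklist are
    # illustrative only.
    import time
    from flask import Flask, request, abort

    app = Flask(__name__)
    blocked = {}                 # ip -> timestamp when the ban expires
    BAN_SECONDS = 24 * 3600

    @app.before_request
    def drop_banned_ips():
        if blocked.get(request.remote_addr, 0) > time.time():
            abort(403)

    @app.route("/robots.txt")
    def robots():
        # Well-behaved crawlers never visit /trap-do-not-crawl.
        body = "User-agent: *\nDisallow: /trap-do-not-crawl\n"
        return body, 200, {"Content-Type": "text/plain"}

    @app.route("/trap-do-not-crawl")
    def honeypot():
        blocked[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)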
Re: (Score:2)
Any IP that hits one of those gets blackhole'd for 24 hours.
Honestly, as it stands, it's trivial to tell software scraping content apart from humans doing so. Simply rate limiting unreasonable access already does a lot to prevent this behaviour.
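For what it's worth, a toy sketch of that kind of rate limiting. The thresholds and in-memory counters are placeholders; real sites usually enforce this at the reverse proxy or load balancer.

    # Sliding-window check: more than LIMIT requests from one client in
    # WINDOW seconds gets flagged. Numbers are placeholders.
    import time
    from collections import defaultdict, deque

    WINDOW = 60        # seconds
    LIMIT = 120        # requests allowed per window

    _hits = defaultdict(deque)     # client ip -> recent request timestamps

    def is_rate_limited(ip, now=None):
        now = time.time() if now is None else now
        q = _hits[ip]
        while q and q[0] <= now - WINDOW:
            q.popleft()
        q.append(now)
        return len(q) > LIMIT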
Re: (Score:2)
Oh no that's easily fooled, particularly with tools like Selenium. And rate limiting is just going to piss off CGNAT users.
Re: (Score:2)
Better than a blackhole: replace all images with goatse.jpeg. If the AI doesn't try to tear out its metaphorical eyeballs, the unlucky user probably will.
Re: (Score:2)
I don't think that would really do much if they're all either the same or basically the same. One other thought I had is to run the JS through an obfuscator (assuming you aren't already doing so for regular traffic) and generate all requested images using other AI software. That would really fuck with it, especially if they're generated to match the context.
Re:From the GitHub Page: (Score:5, Informative)
The developer thinks it's [i]unethical[/i] to do this, lol. Because it's "advancing science" (despite the fact that consent has *always* been a pivotal part of scientific ethics, even if it hasn't always been followed). He does seem like he's at least open to the robots.txt thing.
But yeah, the guy's a scumbag.
Re:From the GitHub Page: (Score:5, Insightful)
If Romain Beaumont was a decent and respectful person, he would have started by honoring robots.txt. He clearly knew of its existence and he clearly knew that many websites use it and rely on it. His refusal to respect it is not some forward-looking visionary tech, it is straight-up selfishness.
Romain Beaumont, show us you are not just another troll by apologizing and modifying your scraper to honor robots.txt.
And, to everyone else, public shaming will not stop this sort of behavior. Even if Romain Beaumont turns out to be a decent person, others will not. You will need to design your websites more defensively if you cannot afford this level of profitless web traffic. It is really sad, but rude people with tech skills are the reason why we can't have nice things.
Re:From the GitHub Page: (Score:5, Informative)
In a past life, I've written this sort of logic at the load balancer level on F5's. It's not trivial, but it's effective as fuck. I eventually self-named the project TheBanHammer, because it did exactly that.
Scape my site? You get the BanHammer.
Think you're clever with your AWS/Azure magic? Well, I hope you have a lot of money to spend on IP addresses, because if you abuse it enough it starts banning at the allocation-block level.
Oh but you're a legit user with a login cookie, but maybe still in the ban block? Oh okay, you may still pass. Thank you for your patronage.
TL;DR: scrapers are nothing new. If you're still reeling from this sort of abuse in 2023, I wish you all the best. Good fucking luck in 2024.
Re:From the GitHub Page: (Score:5, Informative)
Oh, it's worse than that. People *intentionally* write bots that not only disrespect robots.txt, but will actively fuck your site over for spam. I still harbor *deep* resentment over these fuckers after their spambots destroyed a wiki devoted to a Python library I was helping maintain in the late 1990s. They just flat out deleted the content and replaced it with dick pills. This was a much earlier, and more naive, internet, but even though we kept a history of pages, our homebrewed wiki only kept about 5 revisions before binning them because we just didn't have the storage (remember, 1990s: hard drives with storage measured in megabytes, not gigabytes).
They destroyed 2 years of work just for short-term profit and they didn't give a fuck. Worse, every attempt at locking things down with hurdles, they seemingly worked around by hand.
In the end the wiki, and the library, died. Fuck spammers.
Re: (Score:3)
It's even worse than that. Many of the big search engines ignore robots.txt because scammers were abusing it to hide their phishing pages and malware. SEOs were using it to serve different content to search engines and browsers, to avoid being down-ranked for spam.
Robots.txt was never going to work. Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.
Re: (Score:2)
Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.
No. DRM is an attempt to prevent legitimate users from accessing / using what they paid for. Its flawed design doesn't work because the "protected" content was already given to the "enemy" in a form they can use at all. (Even if it's limited, the fact that it can be accessed at all means the device can remove the protections entirely if given the correct instructions.) Legitimate users are the only ones inconvenienced permanently by DRM; the "enemy" is only mildly inconvenienced until the smart cow appears.
Re:From the GitHub Page: (Score:5, Interesting)
Yeah, but at that point in time, kids were basically what we all were.
Hell, the first ISP I was a customer of was an extension of the BBS I had been with; when I phoned in and asked for a routing change for my IP address, they told me to ask in the IRC channel for the root login to the ISP's terminal servers.
Kids now have no idea just how cowboy the early-to-mid-1990s internet really was. And if they did, they'd never listen to us give our "wisdom" again, lol.
Re: (Score:2)
Things were even worse in the corporate world. My dad told me some stories about his years at GTE in the late 70's. Big business mainframes were pretty much just hooked up to phone lines with no security whatsoever. They pretty much relied on not giving out the phone number for the server, otherwise random people could (and did) log in and would be confronted by the admins in realtime.
Re: (Score:2)
Yeah, I remember my grandfather logging into the mainframes at the lab he worked at, oh, late 1970s to early 1980s, sometimes with an "acoustic coupler" tied into a portable printer/typewriter combo thing. He'd grumble about the new VAX requiring a password, 'cos he didn't need one for the IBM. Very different times.
I hope his AI can read usage rights declarations (Score:5, Insightful)
If this person is going to be so irresponsible with his program -- which appears to be the case based on his GitHub comments -- then he had better be ready for rights holders to fight back with very real legal weapons. People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website.
Arguments about not knowing what the content was won't hold water since he's explicitly downloading everything for use with AI which, by definition, can interpret such things.
Re: I hope his AI can read usage rights declarations (Score:2, Interesting)
Re: (Score:2)
Why would you use a hosting provider that doesn't offer unlimited traffic?
Re: (Score:1)
Good way to blame the victim.
Re: (Score:2)
Not blaming anything, I just can't understand why anyone wouldn't use hosting companies with unlimited traffic.
Re: (Score:1)
Maybe they host at home. Maybe unlimited traffic wasn't free. Maybe they don't know any better. Maybe they're already set up somewhere else from long ago and can't afford the technical help to transition to a newer host that provides that.
IMO, it doesn't matter why. The guy scraping sites is a jerk and DoSing them.
If I left my front door wide open that is not an invitation to everyone to raid my fridge and shit on my floor. You can look in from outside if you want, but no more.
Re: (Score:1)
Oh also, even if traffic is free, server hardware isn't. If I pound your small site with a dozen threads, I'll kill your database or disk or CPU or run you out of RAM or something. Whatever comes first.
Re: (Score:2)
People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website
You can declare whatever you want. But once you make something available publicly on the internet, someone else simply downloading it is not legally actionable in any jurisdiction I have ever heard of. The only thing that legally governs usage subsequent to downloading is the limits imposed by copyright/trademark... and it has yet to be decided in the courts whether training a neural network on someone else's copyrighted material is itself a violation of their copyright (IMHO, if it is, then every human who has ever learned from someone else's copyrighted work is in the same boat).
Re: (Score:1)
Doing a DoS attack on their site is illegal in any civilized location.
This isn't about rights and privacy. He's DoSing innocent sites.
Re: (Score:2)
As I read TFA and posts following it:
He's not just running it himself. He built it as an open-source project. And his documentation says it obeys his new robots.txt terms (unless you tell it not to) - but it doesn't
Re: (Score:2)
I strongly recommend that some members of the AI community (who are the motivated parties) head off this disaster by writing and submitting the invited robots.txt-obeying mods.
Also that they fork the project if he doesn't incorporate it immediately and accurately. Perhaps just fork it anyway, since he doesn't seem to realize the importance of the functionality and isn't likely to keep it working.
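The standard library already does most of the work here; a sketch of the kind of check such a fork could run before each download. The user-agent string below is a placeholder, not img2dataset's actual one.

    # Sketch of a robots.txt check a fork could add before downloading.
    # urllib.robotparser is in the Python standard library; the
    # user-agent string is a placeholder.
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    _parsers = {}     # cache one parser per site

    def allowed_by_robots(url, user_agent="example-image-bot"):
        parts = urlparse(url)
        base = f"{parts.scheme}://{parts.netloc}"
        rp = _parsers.get(base)
        if rp is None:
            rp = RobotFileParser()
            rp.set_url(base + "/robots.txt")
            try:
                rp.read()   # fetch and parse the site's robots.txt
            except OSError:
                pass        # unreachable robots.txt -> can_fetch() stays conservative
            _parsers[base] = rp
        return rp.can_fetch(user_agent, url)

    # A scraper would call allowed_by_robots(image_url) and skip the download on False.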
Sounds like a DDOS tool to me (Score:5, Interesting)
Sounds like this guy is treading on extremely thin eggshells, thinking that he’s immune because of his “AI-bro” identity and the magical power of Internet 9.0
Real world incoming in 3... 2...
Dear Romain Beaumont: (Score:2)
what's sad is that you are a dirtbag. Suck eggs.
Re: (Score:2)
Feed them .. (Score:2)
Re: (Score:2)
Came to the comments to suggest just this. Massive website of nothing but AI generated versions of goatse so there's some variation.
Is that the 1990s calling? (Score:3, Funny)
Asking for its slashdotting back?
Sounds familiar (Score:5, Interesting)
that it's "sad" that they are fighting the inevitable rise of AI.
Just replace AI with crypto and you'll find it's the same people saying this shit.
Re: (Score:2)
I'm betting this guy was a crypto bro until that market went bust.
Re: (Score:2)
Crypto? This stuff is really old.
Try replacing 'AI' with 'communism', 'fascism' or even, for some values, 'islam'. Sound familiar?
Malicious code (Score:3)
IMO, the code is actively malicious, as it fakes a User-Agent header to make it seem as if it's Mozilla. Yes, you can add a "(compatible; img2dataset)" tag, but that's optional.
If I ever detected this robot on my site, I'd fail2ban the offending client IP.
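When the optional tag is present, spotting it in an access log is easy enough; a rough sketch follows. The log path and combined-log format are assumptions, and it only catches clients that didn't strip the tag.

    # Pull client IPs whose User-Agent contains "img2dataset" out of an
    # access log, e.g. to feed into fail2ban or a firewall rule. Log path
    # and combined-log format are assumptions; clients that omit the
    # optional tag won't be caught this way.
    import re

    LOG = "/var/log/nginx/access.log"      # assumed location/format
    line_re = re.compile(r'^(\S+) .*"[^"]*img2dataset[^"]*"\s*$')

    def offending_ips(path=LOG):
        ips = set()
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                m = line_re.match(line)
                if m:
                    ips.add(m.group(1))
        return ips

    if __name__ == "__main__":
        for ip in sorted(offending_ips()):
            print(ip)      # pipe into your ban tooling of choice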
Don't simply block obnoxious AI crawlers (Score:5, Informative)
Return garbage to poison their models. If they won't follow the rules they deserve what they get.
Years ago we had issues with a certain IPR narc spider. Got to the point where they were indiscriminately crawling through folders containing terabytes of data. A single crawler was single-handedly consuming a quarter of our overall traffic, even though these folders were explicitly excluded from crawling for all agents in robots.txt. Their entire netblock ended up getting routed thru /dev/null.
Re: (Score:2)
Gets expensive these days. To properly deal with them, you need to send valid images that aren't identical. That means you need to have an image host/generator that doesn't charge for data egress.
If it were just text, it wouldn't be nearly so much of a problem.
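For the image case, "valid images that aren't identical" can be as cheap as noise generated on the fly for requests you've already decided come from a scraper; here's a sketch with Pillow and Flask. The route and image size are arbitrary, and the egress cost the parent mentions still applies.

    # Serve a different, valid-but-worthless JPEG on every request, per the
    # poisoning suggestion above. Only makes sense for requests already
    # identified as scraper traffic; route and image size are arbitrary.
    import io
    import numpy as np
    from PIL import Image
    from flask import Flask, send_file

    app = Flask(__name__)

    @app.route("/images/<path:name>")
    def noise_image(name):
        arr = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
        buf = io.BytesIO()
        Image.fromarray(arr).save(buf, format="JPEG", quality=70)
        buf.seek(0)
        return send_file(buf, mimetype="image/jpeg")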
send some goons over (Score:2)
This is the same guy... (Score:5, Interesting)
That claims ownership of the LAION-5B dataset used to train Stable Diffusion.
Something tells me ignoring copyright is an essential part of his business model.
Interesting (Score:2)
One of his quotes:
This is now becoming my go-to for anyone trying to defend their "AI art" and prompts from "thieves".
There are other solutions to stop these bots (Score:4, Informative)
I use a system on my servers called Web-Shield [github.com] that has proven to be GREAT at stopping unauthorized bots from scraping my web sites, as well as hacks and system probes. It's a simple shell-script-based system that uses ipset/iptables to block specific areas of the web where there aren't legit users - just super cheap web hosting platforms that are the source of most of this activity. It's free. You should check it out. There's also a companion system called "login-shield" that does the same for login ports. Hooray for open source solutions to stuff like this.
another sociopath on the loose (Score:2)
Can I use it to scrape pron? (Score:3)
Blacklist the IPs (Score:2)
If they don't care about our existing privacy directives and settings that the web has agreed upon for decades, they shouldn't be surprised when their bot gets blocked by several large ISPs or AS's for ignoring them. This kid sounds like an entitled douche. Probably an ex-cryptobro who switched to the AI fad when he lost his dogecoins in a rug pull.
Re: (Score:2)
Or use any fair load balancer which prioritises initial requests and small files over large ones and throttles IPs that open multiple streams. And use proper caching and hosts with unlimited traffic.
They are on a hiding to nothing but turnarounds (Score:2)
Malarky (Score:2)
Wow, what a load of horse shit.
If I don't want my doorbell being rung 24/7 so people can see my flight simulator rig, you wouldn't say I'm "against getting more people into aviation".
You fuckwit.
Blacklisting Works (Score:1)
This is just downright disrespectful. I admin a site with a model car forum. We've had this problem for a long time. Every month or so, we identify IP addresses taking way too much bandwidth and block them if they don't turn out to be a search engine. There have been multiple entities, and they keep scraping the same stuff. I'm not upgrading my server to support their projects.
Malicious compliance opportunity (Score:2)
Just make bogus links that real web-browsers won't follow because they get erased via script, and if their scraper follows those links, return pornographic imagery. Mess with their AI. They will learn to follow robots.txt like all other scrapers did when it was made almost 30 years ago...
This is, in fact, the standard web approach: aim malicious scrapers at porn.