An AI Scraping Tool Is Overwhelming Websites With Traffic

An anonymous reader quotes a report from Motherboard: The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it's "sad" that they are fighting the inevitable rise of AI. "It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it," Romain Beaumont, the creator of the image scraping tool img2dataset, said on its GitHub page. "You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it."

Img2dataset is a free tool Beaumont shared on GitHub which lets users automatically download and resize the images from a list of URLs. The result is an image dataset, the kind that trains image-generating AI models like OpenAI's DALL-E, the open source Stable Diffusion model, and Google's Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world, which contains more than 5 billion images and is used by Imagen and Stable Diffusion. Img2dataset will attempt to scrape images from any site unless site owners add HTTP response headers like "X-Robots-Tag: noai" and "X-Robots-Tag: noindex." That means the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in.
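For site owners who do want to opt out, those headers can be attached to every response at the application layer. A minimal sketch using Flask (the header names come from the tool's documentation as quoted in this story; the app itself is purely illustrative):

```python
# Illustrative only: attach the opt-out headers that img2dataset's documentation
# says it honors by default to every response served by this (hypothetical) app.
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_opt_out_headers(response):
    # Crawlers that honor X-Robots-Tag should skip these pages and images.
    response.headers.add("X-Robots-Tag", "noai")
    response.headers.add("X-Robots-Tag", "noindex")
    response.headers.add("X-Robots-Tag", "noimageai")
    response.headers.add("X-Robots-Tag", "noimageindex")
    return response

@app.route("/")
def index():
    return "hello"

if __name__ == "__main__":
    app.run()
```

The same headers can of course be set in the web server or CDN configuration instead of the application.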
Beaumont defended img2dataset by comparing it to the way Google indexes all websites online in order to power its search engine, which benefits anyone who wants to search the internet.

"I directly benefit from search engines as they drive useful traffic to me," Eden told Motherboard. "But, more importantly, Google's bot is respectful and doesn't hammer my site. And most bots respect the robots.txt directive. Romain's tool doesn't. It seems to be deliberately set up to ignore the directives website owners have in place. And, frankly, it doesn't bring any direct benefit to me."

Motherboard notes: "A 'robots.txt' file tells search engine crawlers like Google which part of a site the crawler can access in order to prevent it from overloading the site with requests."
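For reference, honoring robots.txt takes only a few lines with Python's standard library. A minimal sketch of what a well-behaved crawler does before fetching anything (the URL and the user-agent string are made up for illustration):

```python
# Illustrative only: consult robots.txt before fetching, via the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "example-image-bot"   # hypothetical crawler name
target = "https://example.com/images/photo.jpg"

if rp.can_fetch(user_agent, target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows", target)
```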
  • by micksam7 ( 1026240 ) * on Tuesday April 25, 2023 @07:17PM (#63476662)

    "Websites can pass the http headers X-Robots-Tag: noai, X-Robots-Tag: noindex , X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex By default img2dataset will ignore images with such headers."

    Followed directly by:
    "To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'"

    I wonder what option most users will end up enabling. :)

    Also, this tool doesn't seem to check robots.txt [from a quick source search; may be wrong]. Getting the impression they don't entirely care about this.

    • by forgottenusername ( 1495209 ) on Tuesday April 25, 2023 @07:44PM (#63476734)

      > Also this tool doesn't seem to check robots.txt

      Right, then he whinged on about it, closed the issue, and in a linked issue offered to let other people do the coding for robots.txt support.

      What an epic douchelord. Entitled, expects the world to change to suit his behavior and ignores existing solutions while berating folks for not sharing his worldview.

      • by ArmoredDragon ( 3450605 ) on Tuesday April 25, 2023 @08:58PM (#63476844)

        Add honeypot URLs to robots.txt, and reference them in places where crawlers are likely to find them but end users never will without inspecting the source code (for example, dead code segments of the HTML or JS). Any IP that hits one of those gets blackhole'd for 24 hours. That effectively forces the crawlers to honor robots.txt.
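        A minimal sketch of that idea in Python/Flask (the route name, the 24-hour window, and the in-memory ban store are all illustrative; a real deployment would push the ban down to a firewall or load balancer):

```python
# Illustrative honeypot: the path below is Disallowed in robots.txt and only
# referenced from dead markup, so ordinary users never reach it. Any client
# that does is refused for 24 hours. In-memory store for brevity only.
import time
from flask import Flask, abort, request

app = Flask(__name__)
banned = {}                 # ip -> timestamp when the ban expires
BAN_SECONDS = 24 * 3600

@app.before_request
def refuse_banned_ips():
    if banned.get(request.remote_addr, 0) > time.time():
        abort(403)

@app.route("/honeypot-do-not-crawl/")   # hypothetical honeypot path
def honeypot():
    # Reaching this path means robots.txt was ignored.
    banned[request.remote_addr] = time.time() + BAN_SECONDS
    abort(403)

@app.route("/")
def index():
    return "normal page"
```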

        • Any IP that hits one of those gets blackhole'd for 24 hours.

          Honestly, as it is, it's trivial to identify software scraping content vs. humans doing so. Simply rate limiting unreasonable access already does a lot to prevent this behaviour.
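          A minimal per-IP rate limiter sketch in Python (the limit, the window, and the in-memory bookkeeping are arbitrary choices for illustration):

```python
# Illustrative sliding-window limiter: allow at most `limit` requests per
# `window` seconds from each IP and reject the rest.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.time()
        recent = self.hits[ip]
        while recent and now - recent[0] > self.window:
            recent.popleft()             # drop requests outside the window
        if len(recent) >= self.limit:
            return False                 # over the limit: reject
        recent.append(now)
        return True
```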

          • Oh no that's easily fooled, particularly with tools like Selenium. And rate limiting is just going to piss off CGNAT users.

        • by AmiMoJo ( 196126 )

          Better than a blackhole, replace all images with goatse.jpeg. If the AI doesn't try to tear out its metaphorical eyeballs, the unlucky user probably will.

          • I don't think that would really do much if they're all either the same or basically the same. One other thought I had is to run the JS through an obfuscator (assuming you're not already doing so for regular traffic) and generate all requested images using other AI software. That would really fuck with it, especially if they're generated to match the context.

    • by sg_oneill ( 159032 ) on Tuesday April 25, 2023 @07:45PM (#63476738)

      The developer thinks it's [i]unethical[/i] to do this lol. Because it's "advancing science" (despite the fact that consent has *always* been a pivotal part of scientific ethics, even if it hasn't always been followed). He does seem like he's at least open to the robots.txt thing.

      But yeah, the guy's a scumbag.

    • by Brain-Fu ( 1274756 ) on Tuesday April 25, 2023 @07:52PM (#63476748) Homepage Journal

      If Romain Beaumont were a decent and respectful person, he would have started by honoring robots.txt. He clearly knew of its existence, and he clearly knew that many websites use it and rely on it. His refusal to respect it is not some forward-looking visionary tech; it is straight-up selfishness.

      Romain Beaumont, show us you are not just another troll by apologizing and modifying your scraper to honor robots.txt.

      And, to everyone else, public shaming will not stop this sort of behavior. Even if Romain Beaumont turns out to be a decent person, others will not. You will need to design your websites more defensively if you cannot afford this level of profitless web traffic. It is really sad, but rude people with tech skills are the reason why we can't have nice things.

      • by Arethan ( 223197 ) on Tuesday April 25, 2023 @08:47PM (#63476824) Journal

        In a past life, I've written this sort of logic at the load balancer level on F5s. It's not trivial, but it's effective as fuck. I eventually self-named the project TheBanHammer, because it did exactly that.
        Scrape my site? You get the BanHammer.
        Think you're clever with your AWS/Azure magic? Well, I hope you have a lot of money to spend on IP addresses, because if you abused it enough it started to ban at the allocation block level.
        Oh, but you're a legit user with a login cookie, but maybe still in the ban block? Oh okay, you may still pass. Thank you for your patronage.

        TL;DR: scrapers are nothing new. If you're still reeling from this sort of abuse in 2023, I wish you all the best, and good fucking luck in 2024.
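        The same idea sketched in plain Python rather than on an F5 (the thresholds, the /24 escalation, and the session check are placeholders, not anything from the original setup):

```python
# Illustrative ban escalation: repeat offenders get their whole /24 blocked
# for a while, but clients with a valid session are allowed through anyway.
import ipaddress
import time

offences = {}        # ip -> offence count
banned_blocks = {}   # network -> timestamp when the ban expires
BLOCK_THRESHOLD = 5
BAN_SECONDS = 3600

def record_offence(ip):
    offences[ip] = offences.get(ip, 0) + 1
    if offences[ip] >= BLOCK_THRESHOLD:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        banned_blocks[net] = time.time() + BAN_SECONDS

def is_blocked(ip, has_valid_session=False):
    if has_valid_session:
        return False     # logged-in users pass even from a banned block
    addr = ipaddress.ip_address(ip)
    now = time.time()
    return any(addr in net and until > now
               for net, until in banned_blocks.items())
```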

      • by sg_oneill ( 159032 ) on Tuesday April 25, 2023 @08:50PM (#63476830)

        Oh, it's worse than that. People *intentionally* write bots that not only disrespect robots.txt, but will actively fuck your site over for spam. I still harbor *deep* resentment over these fuckers after their spambots destroyed a wiki devoted to a python library I was helping maintain in the late 1990s. They just flat out deleted the content and replaced it with dick pills. This was a much earlier, and more naive, internet, but even though we kept a history of pages, our homebrewed wiki only kept them for about 5 revisions before binning them, because we just didn't have the storage (remember, 1990s, hard drives with storage measured in megabytes, not gigabytes).

        They destroyed 2 years of work just for short-term profit and they didn't give a fuck. Worse, when we tried locking things down with hurdles, they seemingly worked around them manually.

        In the end the wiki, and the library, died. Fuck spammers.

         

        • by AmiMoJo ( 196126 )

          It's even worse than that. Many of the big search engines ignore robots.txt because scammers were abusing it to hide their phishing pages and malware. SEOs were using it to serve different content to search engines and browsers, to avoid being down-ranked for spam.

          Robots.txt was never going to work. Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.

          • Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.

            No. DRM is an attempt to prevent legitimate users from accessing / using what they paid for. Its flawed design doesn't work because the "protected" content was already given to the "enemy" in a form that they can use. (Even if it's limited, the fact that it can be accessed at all means the device can remove the protections entirely if given the correct instructions.) Legitimate users are the only ones inconvenienced permanently by DRM, the "enemy" is only mildly inconvenienced until the smart cow a

  • by Tanman ( 90298 ) on Tuesday April 25, 2023 @07:20PM (#63476668)

    If this person is going to be so irresponsible with his program -- which appears to be the case based on his GitHub comments -- then he'd better be ready for rights holders to fight back with very real legal weapons. People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website.

    Arguments about not knowing what the content was won't hold water since he's explicitly downloading everything for use with AI which, by definition, can interpret such things.

    • What's irresponsible about it? Usage rights can say anything a publisher damn wants and then a court decides. If you publish something I can see for free, why can't I let my machine look at it also? Search engines disrupted the advertising industry scraping text. And by now they've got all the images and video in question in this case anyway. What's the difference? Google still can't grok machine learning and someone has published their own, possibly better, research and datasets in the open. Good luck to t
    • by tomz16 ( 992375 )

      People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website

      You can declare whatever you want. But once you make something available publicly on the internet, someone else simply downloading it is not legally actionable in any jurisdiction I have ever heard of. The only thing that legally governs usage subsequent to downloading are the limits imposed by copyright/trademark... and it has yet to be decided in the courts whether training a neural network on someone else's copyrighted material is itself a violation of their copyright (IMHO, if it is, then every huma

    • ... he'd better be ready for rights holders to fight back with very real legal weapons. People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website.

      As I read TFA and posts following it:

      He's not just running it himself. He built it as an open-source project. And his documentation says it obeys his new robots.txt terms (unless you tell it not to) - but it doesn't

      • I strongly recommend that some members of the AI community (who are the motivated parties) head off this disaster by writing and submitting the invited robots.txt obeying mods.

        Also that they fork the project if he doesn't incorporate it immediately and accurately. Perhaps just fork it anyway, since he doesn't seem to realize the importance of the functionality and isn't likely to keep it working.

  • by hdyoung ( 5182939 ) on Tuesday April 25, 2023 @07:29PM (#63476696)
    Aren’t there laws against that? People who build DDOS tools - can’t they be sued if they’re identified?

    Sounds like this guy is treading on extremely thin eggshells, thinking that he’s immune because of his “AI-bro” identity and the magical power of Internet 9.0

    Real world incoming in 3.2
  • what's sad is that you are a dirtbag. Suck eggs.

  • ... copies of the goatse guy.

    • Came to the comments to suggest just this. Massive website of nothing but AI generated versions of goatse so there's some variation.

  • by blue trane ( 110704 ) on Tuesday April 25, 2023 @07:45PM (#63476736) Homepage Journal

    Asking for its slashdotting back?

  • Sounds familiar (Score:5, Interesting)

    by Berkyjay ( 1225604 ) on Tuesday April 25, 2023 @07:55PM (#63476754)

    that it's "sad" that they are fighting the inevitable rise of AI.

    Just replace AI with crypto and you'll find it's the same people saying this shit.

  • by dskoll ( 99328 ) on Tuesday April 25, 2023 @08:01PM (#63476762) Homepage

    IMO, the code is actively malicious as it fakes a User-Agent: header that makes it seem as if it's Mozilla. Yes, you can add a "(compatible; img2dataset)" tag but that's optional.

    If I ever detected this robot on my site, I'd fail2ban the offending client IP.
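    One way to spot the installs that do announce themselves is a log scan for that optional token. A rough sketch (the log path and the nginx/Apache combined log format are assumptions):

```python
# Illustrative log scan: count hits per client IP where the User-Agent field
# (the last quoted field in a combined-format log line) mentions img2dataset.
# Only installs that opted in to the "(compatible; img2dataset)" tag show up.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumed location
line_re = re.compile(r'^(\S+) .* "[^"]*img2dataset[^"]*"$')

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = line_re.match(line.rstrip("\n"))
        if match:
            hits[match.group(1)] += 1

for ip, count in hits.most_common():
    print(f"{ip}\t{count}")   # feed these into fail2ban / ipset / etc.
```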

  • by WaffleMonster ( 969671 ) on Tuesday April 25, 2023 @08:01PM (#63476766)

    Return garbage to poison their models. If they won't follow the rules they deserve what they get.

    Years ago we had issues with a certain IPR narc spider. It got to the point where they were indiscriminately crawling through folders containing terabytes of data. A single crawler was single-handedly consuming a quarter of our global traffic, even though these folders were explicitly excluded from crawling for all agents in robots.txt. Their entire netblock ended up getting routed through /dev/null.

    • by Barny ( 103770 )

      Gets expensive these days. To properly deal with them, you need to send valid images that aren't identical. That means you need to have an image host/generator that doesn't charge for data egress.

      If it were just text, it wouldn't be nearly so much of a problem.
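      A throwaway sketch of the "valid but worthless images" idea using Pillow (the dimensions and the noise source are arbitrary; the egress-cost problem obviously remains):

```python
# Illustrative decoy generator: produce well-formed JPEGs of pure noise so a
# scraper that ignores opt-outs collects worthless training data. Needs Pillow.
import io
import os

from PIL import Image

def random_jpeg(width=512, height=512):
    noise = os.urandom(width * height * 3)            # 3 bytes per RGB pixel
    image = Image.frombytes("RGB", (width, height), noise)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=70)
    return buffer.getvalue()

if __name__ == "__main__":
    with open("decoy.jpg", "wb") as out:
        out.write(random_jpeg())
```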

  • to smash his computers and kick him in the pants, and quit sounding like a helpless crybaby on the internet
  • by Gideon Fubar ( 833343 ) on Tuesday April 25, 2023 @08:39PM (#63476816) Journal

    That claims ownership of the LAION-5B dataset used to train Stable Diffusion.

    Something tells me ignoring copyright is an essential part of his business model.

  • by mabu ( 178417 ) on Tuesday April 25, 2023 @10:26PM (#63476948)

    I use a system on my servers called Web-Shield [github.com] that has proven to be GREAT at stopping unauthorized bots from scraping my web sites, as well as hacks and system probes. It's a simple shell-script-based system that uses ipset/iptables to block specific areas of the web where there aren't legit users, just super cheap web hosting platforms that are the source of most of this activity. It's free. You should check it out. There's also a companion system called "login-shield" that does the same for login ports. Hooray for open source solutions to stuff like this.

  • He's a Shkreli, and deserves some lawsuits.
  • by Askmum ( 1038780 ) on Wednesday April 26, 2023 @12:43AM (#63477096)
    Just asking for a friend. I believe his interest is furry diaper hentai.
  • Just block the IPs hitting your network if it's overwhelming your bandwidth or server capacity. There are automated versions of this, like an HTTP equivalent of Fail2Ban.

    If they don't care about our existing privacy directives and settings that the web has agreed upon for decades, they shouldn't be surprised when their bot gets blocked by several large ISPs or AS's for ignoring them. This kid sounds like an entitled douche. Probably an ex-cryptobro who switched to the AI fad when he lost his dogecoins in a rug
    • by raynet ( 51803 )

      Or use any fair load balancer which prioritises initial requests and small files over large ones and throttles IPs that open multiple streams. And use proper caching and hosts with unlimited traffic.

  • The 'have to opt out' part made me laugh.
  • "it's "sad" that they are fighting the inevitable rise of AI"

    Wow, what a load of horse shit.
    If I don't want my doorbell being rung 24/7 so people can see my flight simulator rig, you wouldn't say I'm "against getting more people into aviation".
    You fuckwit.
  • This is just downright disrespectful. I admin a site with a model car forum. We've had this problem for a long time. Every month or so, we identify IP addresses taking way too much bandwidth and block them if they don't turn out to be a search engine. There have been multiple entities, and they keep scraping the same stuff. I'm not upgrading my server to support their projects.

  • Just make bogus links that real web-browsers won't follow because they get erased via script, and if their scraper follows those links, return pornographic imagery. Mess with their AI. They will learn to follow robots.txt like all other scrapers did when it was made almost 30 years ago...

    This is in fact the standard web approach -- aim malicious scrapers at porn.
