An AI Scraping Tool Is Overwhelming Websites With Traffic

An anonymous reader quotes a report from Motherboard: The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it's "sad" that they are fighting the inevitable rise of AI. "It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it," Romain Beaumont, the creator of the image scraping tool img2dataset, said on its GitHub page. "You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it."

Img2dataset is a free tool Beaumont shared on GitHub which lets users automatically download and resize the images from a list of URLs. The result is an image dataset, the kind that trains image-generating AI models like OpenAI's DALL-E, the open source Stable Diffusion model, and Google's Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world, which contains more than 5 billion images and is used by Imagen and Stable Diffusion. Img2dataset will attempt to scrape images from any site unless site owners add HTTP response headers like "X-Robots-Tag: noai" and "X-Robots-Tag: noindex." That means the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in.
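For site owners who do want to opt out, those headers can be attached to every response at the application layer. A minimal sketch using Flask (the header names come from the tool's documentation as quoted in this story; the app itself is purely illustrative):

```python
# Illustrative only: attach the opt-out headers that img2dataset's documentation
# says it honors by default to every response served by this (hypothetical) app.
from flask import Flask

app = Flask(__name__)

@app.after_request
def add_opt_out_headers(response):
    # Crawlers that honor X-Robots-Tag should skip these pages and images.
    response.headers.add("X-Robots-Tag", "noai")
    response.headers.add("X-Robots-Tag", "noindex")
    response.headers.add("X-Robots-Tag", "noimageai")
    response.headers.add("X-Robots-Tag", "noimageindex")
    return response

@app.route("/")
def index():
    return "hello"

if __name__ == "__main__":
    app.run()
```

The same headers can of course be set in the web server or CDN configuration instead of the application.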
Beaumont defended img2dataset by comparing it to the way Google indexes all websites online in order to power its search engine, which benefits anyone who wants to search the internet.

"I directly benefit from search engines as they drive useful traffic to me," Eden told Motherboard. "But, more importantly, Google's bot is respectful and doesn't hammer my site. And most bots respect the robots.txt directive. Romain's tool doesn't. It seems to be deliberately set up to ignore the directives website owners have in place. And, frankly, it doesn't bring any direct benefit to me."

Motherboard notes: "A 'robots.txt' file tells search engine crawlers like Google which part of a site the crawler can access in order to prevent it from overloading the site with requests."
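For reference, honoring robots.txt takes only a few lines with Python's standard library. A minimal sketch of what a well-behaved crawler does before fetching anything (the URL and the user-agent string are made up for illustration):

```python
# Illustrative only: consult robots.txt before fetching, via the standard library.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "example-image-bot"   # hypothetical crawler name
target = "https://example.com/images/photo.jpg"

if rp.can_fetch(user_agent, target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows", target)
```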
  • by micksam7 ( 1026240 ) * on Tuesday April 25, 2023 @07:17PM (#63476662)

    "Websites can pass the http headers X-Robots-Tag: noai, X-Robots-Tag: noindex , X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex By default img2dataset will ignore images with such headers."

    Followed directly by:
    "To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'"

    I wonder what option most users will end up enabling. :)

    Also, this tool doesn't seem to check robots.txt [from a quick source search; may be wrong]. Getting the impression they don't entirely care about this.

    • by forgottenusername ( 1495209 ) on Tuesday April 25, 2023 @07:44PM (#63476734)

      > Also this tool doesn't seem to check robots.txt

      Right, then he whinged on about it, closed the issue, and in a linked issue offered to let other people do the coding for robots.txt support.

      What an epic douchelord. Entitled, expects the world to change to suit his behavior and ignores existing solutions while berating folks for not sharing his worldview.

      • by ArmoredDragon ( 3450605 ) on Tuesday April 25, 2023 @08:58PM (#63476844)

        Add honeypot URLs to robots.txt, and reference them in places where crawlers are likely to find them but end users never will without inspecting the source code (for example, dead code segments of the HTML or JS). Any IP that hits one of those gets blackhole'd for 24 hours. That effectively forces the crawlers to honor robots.txt.
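        A minimal sketch of that idea in Python/Flask (the route name, the 24-hour window, and the in-memory ban store are all illustrative; a real deployment would push the ban down to a firewall or load balancer):

```python
# Illustrative honeypot: the path below is Disallowed in robots.txt and only
# referenced from dead markup, so ordinary users never reach it. Any client
# that does is refused for 24 hours. In-memory store for brevity only.
import time
from flask import Flask, abort, request

app = Flask(__name__)
banned = {}                 # ip -> timestamp when the ban expires
BAN_SECONDS = 24 * 3600

@app.before_request
def refuse_banned_ips():
    if banned.get(request.remote_addr, 0) > time.time():
        abort(403)

@app.route("/honeypot-do-not-crawl/")   # hypothetical honeypot path
def honeypot():
    # Reaching this path means robots.txt was ignored.
    banned[request.remote_addr] = time.time() + BAN_SECONDS
    abort(403)

@app.route("/")
def index():
    return "normal page"
```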

        • Any IP that hits one of those gets blackhole'd for 24 hours.

          Honestly, as it is, it's trivial to identify software scraping content vs. humans doing so. Simply rate limiting unreasonable access already does a lot to prevent this behaviour.
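          A minimal per-IP rate limiter sketch in Python (the limit, the window, and the in-memory bookkeeping are arbitrary choices for illustration):

```python
# Illustrative sliding-window limiter: allow at most `limit` requests per
# `window` seconds from each IP and reject the rest.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.time()
        recent = self.hits[ip]
        while recent and now - recent[0] > self.window:
            recent.popleft()             # drop requests outside the window
        if len(recent) >= self.limit:
            return False                 # over the limit: reject
        recent.append(now)
        return True
```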

          • Oh no that's easily fooled, particularly with tools like Selenium. And rate limiting is just going to piss off CGNAT users.

        • by AmiMoJo ( 196126 )

          Better than a blackhole, replace all images with goatse.jpeg. If the AI doesn't try to tear out its metaphorical eyeballs, the unlucky user probably will.

          • I don't think that would really do much if they're all either the same or basically the same. One other thought I had is to run the JS through an obfuscator (assuming you're not already doing so for regular traffic) and generate all requested images using other AI software. That would really fuck with it, especially if they're generated to match the context.

    • by sg_oneill ( 159032 ) on Tuesday April 25, 2023 @07:45PM (#63476738)

      The developer thinks it's [i]unethical[/i] to do this lol. Because it's "advancing science" (despite the fact that consent has *always* been a pivotal part of scientific ethics, even if it hasn't always been followed). He does seem like he's at least open to the robots.txt thing.

      But yeah, the guy's a scumbag.

    • by Brain-Fu ( 1274756 ) on Tuesday April 25, 2023 @07:52PM (#63476748) Homepage Journal

      If Romain Beaumont were a decent and respectful person, he would have started by honoring robots.txt. He clearly knew of its existence, and he clearly knew that many websites use it and rely on it. His refusal to respect it is not some forward-looking visionary tech; it is straight-up selfishness.

      Romain Beaumont, show us you are not just another troll by apologizing and modifying your scraper to honor robots.txt.

      And, to everyone else, public shaming will not stop this sort of behavior. Even if Romain Beaumont turns out to be a decent person, others will not. You will need to design your websites more defensively if you cannot afford this level of profitless web traffic. It is really sad, but rude people with tech skills are the reason why we can't have nice things.

      • by Arethan ( 223197 ) on Tuesday April 25, 2023 @08:47PM (#63476824) Journal

        In a past life, I've written this sort of logic at the load balancer level on F5s. It's not trivial, but it's effective as fuck. I eventually self-named the project TheBanHammer, because it did exactly that.
        Scrape my site? You get the BanHammer.
        Think you're clever with your AWS/Azure magic? Well, I hope you have a lot of money to spend on IP addresses, because if you abused it enough it started to ban at the allocation block level.
        Oh, but you're a legit user with a login cookie, but maybe still in the ban block? Oh okay, you may still pass. Thank you for your patronage.

        TL;DR: scrapers are nothing new. If you're still reeling from this sort of abuse in 2023, I wish you all the best, and good fucking luck in 2024.
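        The same idea sketched in plain Python rather than on an F5 (the thresholds, the /24 escalation, and the session check are placeholders, not anything from the original setup):

```python
# Illustrative ban escalation: repeat offenders get their whole /24 blocked
# for a while, but clients with a valid session are allowed through anyway.
import ipaddress
import time

offences = {}        # ip -> offence count
banned_blocks = {}   # network -> timestamp when the ban expires
BLOCK_THRESHOLD = 5
BAN_SECONDS = 3600

def record_offence(ip):
    offences[ip] = offences.get(ip, 0) + 1
    if offences[ip] >= BLOCK_THRESHOLD:
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        banned_blocks[net] = time.time() + BAN_SECONDS

def is_blocked(ip, has_valid_session=False):
    if has_valid_session:
        return False     # logged-in users pass even from a banned block
    addr = ipaddress.ip_address(ip)
    now = time.time()
    return any(addr in net and until > now
               for net, until in banned_blocks.items())
```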

      • by sg_oneill ( 159032 ) on Tuesday April 25, 2023 @08:50PM (#63476830)

        Oh, it's worse than that. People *intentionally* write bots that not only disrespect robots.txt, but will actively fuck your site over for spam. I still harbor *deep* resentment over these fuckers after their spambots destroyed a wiki devoted to a python library I was helping maintain in the late 1990s. They just flat out deleted the content and replaced it with dick pills. This was a much earlier, and more naive, internet, but even though we kept a history of pages, our homebrewed wiki only kept them for about 5 revisions before binning them, because we just didn't have the storage (remember, 1990s, hard drives with storage measured in megabytes, not gigabytes).

        They destroyed 2 years of work just for short-term profit and they didn't give a fuck. Worse, when we tried locking things down with hurdles, they seemingly worked around them manually.

        In the end the wiki, and the library, died. Fuck spammers.

         

        • by AmiMoJo ( 196126 )

          It's even worse than that. Many of the big search engines ignore robots.txt because scammers were abusing it to hide their phishing pages and malware. SEOs were using it to serve different content to search engines and browsers, to avoid being down-ranked for spam.

          Robots.txt was never going to work. Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.

          • Like DRM, it might be an unsolvable problem where the only way it can kinda work is by making the experience far worse for legitimate users.

            No. DRM is an attempt to prevent legitimate users from accessing / using what they paid for. Its flawed design doesn't work because the "protected" content was already given to the "enemy" in a form that they can use. (Even if it's limited, the fact that it can be accessed at all means the device can remove the protections entirely if given the correct instructions.) Legitimate users are the only ones inconvenienced permanently by DRM, the "enemy" is only mildly inconvenienced until the smart cow a

  • by Tanman ( 90298 ) on Tuesday April 25, 2023 @07:20PM (#63476668)

    If this person is going to be so irresponsible with his program -- which appears to be the case based on his GitHub comments -- then he'd better be ready for rights holders to fight back with very real legal weapons. People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website.

    Arguments about not knowing what the content was won't hold water since he's explicitly downloading everything for use with AI which, by definition, can interpret such things.

    • What's irresponsible about it? Usage rights can say anything a publisher damn wants and then a court decides. If you publish something I can see for free, why can't I let my machine look at it also? Search engines disrupted the advertising industry scraping text. And by now they've got all the images and video in question in this case anyway. What's the difference? Google still can't grok machine learning and someone has published their own, possibly better, research and datasets in the open. Good luck to t
    • by tomz16 ( 992375 )

      People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website

      You can declare whatever you want. But once you make something available publicly on the internet, someone else simply downloading it is not legally actionable in any jurisdiction I have ever heard of. The only thing that legally governs usage subsequent to downloading are the limits imposed by copyright/trademark... and it has yet to be decided in the courts whether training a neural network on someone else's copyrighted material is itself a violation of their copyright (IMHO, if it is, then every huma

    • ... he'd better be ready for rights holders to fight back with very real legal weapons. People will put a rights declaration img/txt/whatever file on their server. Then they will have a basis for suing him when he downloads that file but does not obey the legal rights of the images on the website.

      As I read TFA and posts following it:

      He's not just running it himself. He built it as an open-source project. And his documentation says it obeys his new robots.txt terms (unless you tell it not to) - but it doesn't

      • I strongly recommend that some members of the AI community (who are the motivated parties) head off this disaster by writing and submitting the invited robots.txt obeying mods.

        Also that they fork the project if he doesn't incorporate it immediately and accurately. Perhaps just fork it anyway, since he doesn't seem to realize the importance of the functionality and isn't likely to keep it working.

  • by hdyoung ( 5182939 ) on Tuesday April 25, 2023 @07:29PM (#63476696)
    Aren’t there laws against that? People who build DDOS tools - can’t they be sued if they’re identified?

    Sounds like this guy is treading on extremely thin eggshells, thinking that he’s immune because of his “AI-bro” identity and the magical power of Internet 9.0

    Real world incoming in 3.2
  • what's sad is that you are a dirtbag. Suck eggs.

  • ... copies of the goatse guy.

    • Came to the comments to suggest just this. Massive website of nothing but AI generated versions of goatse so there's some variation.

  • by blue trane ( 110704 ) on Tuesday April 25, 2023 @07:45PM (#63476736) Homepage Journal

    Asking for its slashdotting back?

  • Sounds familiar (Score:5, Interesting)

    by Berkyjay ( 1225604 ) on Tuesday April 25, 2023 @07:55PM (#63476754)

    that it's "sad" that they are fighting the inevitable rise of AI.

    Just replace AI with crypto and you'll find it's the same people saying this shit.

  • by dskoll ( 99328 ) on Tuesday April 25, 2023 @08:01PM (#63476762) Homepage

    IMO, the code is actively malicious as it fakes a User-Agent: header that makes it seem as if it's Mozilla. Yes, you can add a "(compatible; img2dataset)" tag but that's optional.

    If I ever detected this robot on my site, I'd fail2ban the offending client IP.
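    One way to spot the installs that do announce themselves is a log scan for that optional token. A rough sketch (the log path and the nginx/Apache combined log format are assumptions):

```python
# Illustrative log scan: count hits per client IP where the User-Agent field
# (the last quoted field in a combined-format log line) mentions img2dataset.
# Only installs that opted in to the "(compatible; img2dataset)" tag show up.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumed location
line_re = re.compile(r'^(\S+) .* "[^"]*img2dataset[^"]*"$')

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = line_re.match(line.rstrip("\n"))
        if match:
            hits[match.group(1)] += 1

for ip, count in hits.most_common():
    print(f"{ip}\t{count}")   # feed these into fail2ban / ipset / etc.
```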

  • by WaffleMonster ( 969671 ) on Tuesday April 25, 2023 @08:01PM (#63476766)

    Return garbage to poison their models. If they won't follow the rules they deserve what they get.

    Years ago we had issues with a certain IPR narc spider. It got to the point where they were indiscriminately crawling through folders containing terabytes of data. A single crawler was single-handedly consuming a quarter of our global traffic, even though these folders were explicitly excluded from crawling for all agents in robots.txt. Their entire netblock ended up getting routed through /dev/null.

    • by Barny ( 103770 )

      Gets expensive these days. To properly deal with them, you need to send valid images that aren't identical. That means you need to have an image host/generator that doesn't charge for data egress.

      If it were just text, it wouldn't be nearly so much of a problem.
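      A throwaway sketch of the "valid but worthless images" idea using Pillow (the dimensions and the noise source are arbitrary; the egress-cost problem obviously remains):

```python
# Illustrative decoy generator: produce well-formed JPEGs of pure noise so a
# scraper that ignores opt-outs collects worthless training data. Needs Pillow.
import io
import os

from PIL import Image

def random_jpeg(width=512, height=512):
    noise = os.urandom(width * height * 3)            # 3 bytes per RGB pixel
    image = Image.frombytes("RGB", (width, height), noise)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=70)
    return buffer.getvalue()

if __name__ == "__main__":
    with open("decoy.jpg", "wb") as out:
        out.write(random_jpeg())
```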

  • to smash his computers and kick him in the pants, and quit sounding like a helpless crybaby on the internet
  • by Gideon Fubar ( 833343 ) on Tuesday April 25, 2023 @08:39PM (#63476816) Journal

    That claims ownership of the LAION-5B dataset used to train Stable Diffusion.

    Something tells me ignoring copyright is an essential part of his business model.

  • by mabu ( 178417 ) on Tuesday April 25, 2023 @10:26PM (#63476948)

    I use a system on my servers called Web-Shield [github.com] that has proven to be GREAT at stopping unauthorized bots from scraping my web sites, as well as hacks and system probes. It's a simple shell-script-based system that uses ipset/iptables to block specific areas of the web where there aren't legit users, just super cheap web hosting platforms that are the source of most of this activity. It's free. You should check it out. There's also a companion system called "login-shield" that does the same for login ports. Hooray for open source solutions to stuff like this.

  • He's a Shkreli, and deserves some lawsuits.
  • by Askmum ( 1038780 ) on Wednesday April 26, 2023 @12:43AM (#63477096)
    Just asking for a friend. I believe his interest is furry diaper hentai.
  • Just block the IPs hitting your network if it's overwhelming your bandwidth or server capacity. There are automated versions of this, like an HTTP equivalent of Fail2Ban.

    If they don't care about our existing privacy directives and settings that the web has agreed upon for decades, they shouldn't be surprised when their bot gets blocked by several large ISPs or AS's for ignoring them. This kid sounds like an entitled douche. Probably an ex-cryptobro who switched to the AI fad when he lost his dogecoins in a rug
    • by raynet ( 51803 )

      Or use any fair load balancer which prioritises initial requests and small files over large ones and throttles IPs that open multiple streams. And use proper caching and hosts with unlimited traffic.

  • The 'have to opt out' part made me laugh.
  • "it's "sad" that they are fighting the inevitable rise of AI"

    Wow, what a load of horse shit.
    If I don't want my doorbell being rung 24/7 so people can see my flight simulator rig, you wouldn't say I'm "against getting more people into aviation".
    You fuckwit.
  • This is just downright disrespectful. I admin a site with a model car forum. We've had this problem for a long time. Every month or so, we identify IP addresses taking way too much bandwidth and block them if they don't turn out to be a search engine. There have been multiple entities, and they keep scraping the same stuff. I'm not upgrading my server to support their projects.

  • Just make bogus links that real web-browsers won't follow because they get erased via script, and if their scraper follows those links, return pornographic imagery. Mess with their AI. They will learn to follow robots.txt like all other scrapers did when it was made almost 30 years ago...

    This is in fact the standard web approach -- aim malicious scrapers at porn.
