AI Security The Internet

OpenAI's Bot Crushes Seven-Person Company's Website 'Like a DDoS Attack' 48

An anonymous reader quotes a report from TechCrunch: On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."

Triplegangers' website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos -- everything from hands to hair, skin, and full bodies -- to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics. [...] To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.
Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot to scrape its site freely, since crawlers treat the absence of such a file as permission. Robots.txt is not an opt-in system.

Once the file was updated with specific tags to block OpenAI's bot, along with additional defenses like Cloudflare, the scraping stopped. However, robots.txt is not foolproof, since compliance by AI companies is voluntary, leaving the burden on website owners to monitor and block unauthorized access proactively. "[Tomchuk] wants other small online businesses to know that the only way to discover if an AI bot is taking a website's copyrighted belongings is to actively look," reports TechCrunch.


Comments Filter:
  • FEED THE BOT!!!!! Now! OpenAI demands it!

    • FEED THE BOT!!!!! Now! OpenAI demands it!

      Look, it might cost many of you your livelihoods but that’s a risk OpenAI is willing to take.

    • Next up: everyone and their cat putting up gibberish generators to help Sam Altman get to the 100b "AGI" milestone.

      • by chthon ( 580889 )

        That is actually an idea I had some fifteen years ago: to poison the internet with random noise.

        However, whenever I thought about it, there always seemed to be practical and technical obstacles.

        • Everyone's had it; alas, we have neither the time, the bandwidth, nor the processor time to waste on the bots of the likes of Altman.

          Luckily, his company's quite successful in collecting bullshit anyway, judging by the output.

  • Fuck OpenAI (Score:1, Insightful)

    by jdawgnoonan ( 718294 )
    Sam’s plagiarism machine holds nothing good for anyone.
  • So ... (Score:3, Insightful)

    by cascadingstylesheet ( 140919 ) on Saturday January 11, 2025 @08:16AM (#65080629) Journal

    ... they didn't have any caching set up?

    Also, if your website is essential to your business, I would recommend having a website firewall (like Sucuri, or Cloudflare if you must) ... having one in place ahead of time is a good idea.
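    For what it's worth, a minimal sketch of origin-side cache headers (assuming Apache with mod_headers enabled; the paths are just illustrative) that lets a CDN or caching proxy absorb repeat requests instead of the application:

    <IfModule mod_headers.c>
        # Allow proxies/CDNs to cache product pages and images for an hour
        <LocationMatch "^/(products|images)/">
            Header set Cache-Control "public, max-age=3600"
        </LocationMatch>
    </IfModule>

    That only helps if something actually sits in front of the origin, of course; 65,000 unique pages fetched once each still hit the application.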

    (No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)

    • by Barny ( 103770 )

      Yeah, I was a bit perplexed by that too. A good CDN and reverse proxy (as you say, Cloudflare, which they did install after the attack) are kinda mandatory on the net these days—and are free with basic features.

      Cloudflare in particular has specific "Anti-AI Bot" options.

  • Look I basically despise OpenAI for a lot of reasons, but there's no story here.

    "Website has improperly configured robots.txt, gets fucked by Web bots" is all this is.

    The sob-context of "they're just a little company" is fluff.

    The "it's not opt-in" and "web bots have to choose to comply"... Also fluff. Show me where bots are ignoring it, then it's an actionable story.

    Feels like this is a simple case where whoever built their site is liable for some mitigation and AWS costs, end of story.

    • by Entrope ( 68843 ) on Saturday January 11, 2025 @08:29AM (#65080651) Homepage

      Is your argument that OpenAI is too stupid to scale its bots' traffic based on how well-known a web site is? Or that it is reasonable to use 600+ IPs to crawl a single random web site, with no apparent rate-limiting?

      • I await websites having automated responses to AI scraping that all but fork-bomb the goatse image with random filenames and minor filesize/checksum tweaks for the AIs to "learn" from.

        • by Luthair ( 847766 )
          I think the real answer is to detect bots and start serving poison pills. They're scraping the site because they want nicely tagged data, so serve them random junk that would be bad for a model to use.
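          A minimal sketch of that idea (hypothetical, using Flask; the user-agent markers and the junk generator are placeholders, not anyone's real bot list):

          import random
          import string

          from flask import Flask, request

          app = Flask(__name__)

          # Hypothetical user-agent substrings to poison; a real deployment would
          # maintain a proper list of crawler signatures.
          BOT_MARKERS = ("gptbot", "claudebot", "bytespider")

          def is_scraper() -> bool:
              ua = request.headers.get("User-Agent", "").lower()
              return any(marker in ua for marker in BOT_MARKERS)

          def junk_text(n_words: int = 200) -> str:
              # Random word-shaped noise: plausible enough to ingest, useless to train on.
              return " ".join(
                  "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
                  for _ in range(n_words)
              )

          @app.route("/products/<product_id>")
          def product_page(product_id):
              if is_scraper():
                  return junk_text(), 200
              return f"real product page for {product_id}"  # stand-in for the actual view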
    • by gweihir ( 88907 )

      Nope. This is for commercial use. For that, you need explicit, positive consent. Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

      • Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

        What "specific exception" are you referring to and where can I read it?

        Also, ignoring the poor behavior of the 600+ IPs for a moment, how is OpenAI different from a search engine in that they both scrape sites of content for the express purpose of private commercial gain?

      • Nope. This is for commercial use. For that, you need explicit, positive consent.

        No. https://en.wikipedia.org/wiki/... [wikipedia.org]

        Search-engines have a specific exception.

        No.

        OpenAI does not. They just take because they can. Like a shoplifter.

        No. No. No.

        See for example https://fairuse.stanford.edu/p... [stanford.edu] (a suit against Google over indexing and caching that got thrown out).

    • If you leave your home unlocked yes it can be ransacked.

      However, using that analogy, you're blaming the person for not locking the door while 600 people walked right in and stole everything down to the bare nubs. Yes, he should have locked the door. That does not at all justify the 600 people who came in and took everything. Yes you should lock your doors, but protection from theft should not be opt-in.

      • by Barny ( 103770 )

        The issue here is that each and every AI bot requires its own specific config in robots.txt.

        They aren't obeying blanket "block all" settings.
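        For example, per-bot entries for the big AI crawlers look something like this (GPTBot and ChatGPT-User are OpenAI's published user agents; ClaudeBot is Anthropic's):

        User-agent: GPTBot
        Disallow: /

        User-agent: ChatGPT-User
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /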

        • It's easy to deny everything.

          User-agent: *
          Disallow: /

          Are you saying OpenAI is ignoring that? Because that's not what TFA says.

    • Also the whole point of AWS is that you can provision more server capacity...
    • by jsonn ( 792303 )
      Most "AI" scambots (and Chinese bots in general) do not honor robots.txt. They are the reason we can't have nice content without login or captcha barriers anymore...
    • by AmiMoJo ( 196126 )

      Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

        • Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

        Citation needed.

        Google explicitly says that they honor robots.txt.

        My own experience is that the major search engines do not crawl excluded directories.

        TFA says the site didn't have a proper robots.txt, which implies that OpenAI would've honored it if it had one.

  • Fucking criminals (Score:2, Insightful)

    by gweihir ( 88907 )

    Not only do they steal whatever they can get their hands on, they also destroy. All commercially, which makes them a criminal enterprise.

    Why are they allowed to do this?

  • by xack ( 5304745 ) on Saturday January 11, 2025 @09:01AM (#65080699)
    As a human I have a speed limit on how fast I can browse the web; a bot can view millions of websites at a time. It seems it's time AWS billed the crawlers and not the innocent website owners.
  • by davide marney ( 231845 ) on Saturday January 11, 2025 @09:10AM (#65080715) Journal

    Music is constructed from repeating patterns, exactly the kind of patterns that LLMs excel at reproducing. The AI companies clearly are charging ahead with training on anything they can get their electronic hands on, copyrights be damned. No doubt the chickens will come home to roost, but it's going to be difficult in the extreme to show exactly where one's copied material was used to create the fake composite.

    Having my life's work being sucked out of my control and into the unregulated (and uncompensated) maw of AI is among my greatest worries.

    • by Khyber ( 864651 )

      "Music is constructed from repeating patterns"

      Sure, if you're a no-talent pop icon.

      Meanwhile Paul Gilbert just laughs and wanks that 7-string.

    • As a fellow human and music lover, I feel your concerns. I suppose the real question is what might the human audience not only accept, but ultimately prefer?

      Once AI-generated music steps up to the mic, will the audience embrace singers hitting notes no human could? Will they prefer insane guitar solos coming from virtual 24-string double-neck guitars at warp speed? Classical compositions designed in ways we haven’t even thought of before, using virtual instruments with unique sound signatures?

  • Just Why? (Score:5, Interesting)

    by walkerp1 ( 523460 ) on Saturday January 11, 2025 @09:30AM (#65080755)
    Wow. How do you have an e-commerce site with an improperly configured robots.txt file? It sounds like they learned a relatively inexpensive lesson. Perhaps robots.txt should have a configuration to limit bandwidth and/or access frequency. That would allow crawlers responsible access without locking them out entirely. Still, a large crawler like OpenAI's also needs some self-throttling heuristics to keep from crushing poorly configured websites. It's pretty easy to tell you might be stressing a target when its latency suddenly increases or packets start getting dropped. For that matter, why not interleave queries across sites rather than dropping the bomb site by site?
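    As an aside, there is a non-standard Crawl-delay directive that some crawlers (Bing and Yandex, for instance, though not Google) honor, asking for a minimum number of seconds between requests:

    User-agent: *
    Crawl-delay: 10

    Whether any given AI crawler respects it is, of course, the same voluntary-compliance problem.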

    Anyway, it's good to see these stories every once in a while so that those who learn by example have some grist for their mills.
    • Re:Just Why? (Score:4, Insightful)

      by medusa-v2 ( 3669719 ) on Saturday January 11, 2025 @11:32AM (#65080967)

      Eh. I don't have a no-trespassing sign up, but that doesn't mean I should have to worry that an entire fleet of self-driving cars is going to run me over while I'm sleeping in my bed.

      If you're in the scraping business, the absolute minimum you can do is rate limit your own tools so you don't run people over just because they are there. OpenAI is way out of line.

  • by tommycw1 ( 3529625 ) on Saturday January 11, 2025 @09:54AM (#65080789)
    I also run a small website, and it got taken offline for exceeding our plan's rate limit on database queries.
    Our plan's limit is 150,000 database queries per hour and Claude was going way beyond that. We went from a normal background of ~10k hits per day to ~1.2M hits per day for 4 days.

    Yeah, some dudz are gonna say: "Well, you should have configured your robots.txt properly."
    1. Sure, it is now, but it had been fine for ~15 years, and I suspect large swaths of the internet are configured this way.
    2. There are tons of reports all over the internet that ClaudeBot doesn't respect robots.txt.

    So, yes, I had not configured my robots.txt properly, but I also contacted Anthropic who, to their credit, apologized and put me on their do-not-crawl list, and I've seen almost zero traffic from them since.

  • by dskoll ( 99328 ) on Saturday January 11, 2025 @10:20AM (#65080829) Homepage

    I make use of robots.txt, but I also return 403 for various bot user-agents.

    If OpenAI ever spoofs its user-agent to be a normal browser, that'll mean war and I'll have to do IP-level blocking.

    This is an excerpt from my Apache config:

    RewriteEngine on
    # Return 403 Forbidden to any request whose User-Agent matches one of these
    # crawler names ([NC] = case-insensitive match, [OR] = chain to the next condition).
    RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} DataForSeoBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} CensysInspect [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} openai [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} GenomeCrawlerd [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} imagesift [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Barkrowler [NC]
    RewriteRule . - [R=403,L]
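    If it ever comes to IP-level blocking, something along these lines would do it (Apache 2.4 with mod_authz_core; the CIDR range is a placeholder, not a real crawler range):

    <Location "/">
        <RequireAll>
            Require all granted
            # Substitute the crawler's actual published CIDR blocks here
            Require not ip 203.0.113.0/24
        </RequireAll>
    </Location>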

  • Google would do the same "back in the day". So why is this news? Every new search engine runs into issues like this. Google did, Microsoft did, AltaVista did, Inktomi did, Teoma/Ask did, Lycos did, Infoseek was notorious for it, Excite did...

    So why is this news?

    Oh right, it is in vogue to trash OpenAI, give Google a pass, and throw roses at the feet of Apple. Got it... Grow the F up, people, and mark the original post as a troll.

  • Copyright BS (Score:1, Insightful)

    by gavron ( 1300111 )

    > ...an AI bot is taking a website's copyrighted belongings...

    If you make content available on the Internet, anyone who views your page has --by definition-- downloaded it.

    Don't whine about copyrights when you chose to put it up there. Fire your IT guy who couldn't be bothered to do rate-limiting, and NO, "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.

    But hey, Weekend Slashdot and BeauSD. Every week.

    • Rate limiting does nothing against Chinese bots that use thousands of IPs simultaneously; it's basically a DDoS. FYI, some companies scale just for regular traffic and don't have the resources to cope with traffic spikes. Of course they can use Cloudflare, but I still don't think it's fair to abuse servers just because they're public. It's the same with my mailbox: just because I have one in front of my house doesn't mean I want to find it filled with spam. And in my country there's legislation for that.

  • "Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot..."

    "Allowed the bot" is the wrong phrase; you meant "politely asked the bot not to". In fact, you correct yourself in the next paragraph, so why did you misspeak at all?

  • Sit back on the site that inspired the word Slashdotting and set the wayback machine for 1999. Substitute the word Google for OpenAI, and this is the same thing.

    Aside from whether OpenAI should be scraping at all, mistakes happen, unexpected traffic happens.

    When using AWS, make sure to avail yourself of logging, alarms, and most importantly, the billing cost controls. Better to be offline and have to figure out what happened than to have an enormous bill.
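    A minimal sketch of such a billing alarm (assuming Python with boto3, an existing SNS topic for notifications, and billing metrics enabled on the account; the alarm name, threshold, and ARN are placeholders):

    import boto3

    # AWS publishes billing metrics only in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-charges-over-200-usd",   # hypothetical name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                # evaluate every six hours
        EvaluationPeriods=1,
        Threshold=200.0,             # alert once the estimated bill passes $200
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
    )

    AWS Budgets can do the same thing without code; the point is to learn about a traffic spike from an alert rather than from the invoice.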

"Those who will be able to conquer software will be able to conquer the world." -- Tadahiro Sekimoto, president, NEC Corp.

Working...