AI, Security, The Internet

OpenAI's Bot Crushes Seven-Person Company's Website 'Like a DDoS Attack'

An anonymous reader quotes a report from TechCrunch: On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."

Triplegangers' website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos -- everything from hands to hair, skin, and full bodies -- to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics. [...] To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.
Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot to freely scrape its site since the system interprets the absence of such a file as permission. It's not an opt-in system.

Once the file was updated with specific tags to block OpenAI's bot, along with additional defenses like Cloudflare, the scraping stopped. However, robots.txt is not foolproof since compliance by AI companies is voluntary, leaving the burden on website owners to monitor and block unauthorized access proactively. "[Tomchuk] wants other small online businesses to know that the only way to discover if an AI bot is taking a website's copyrighted belongings is to actively look," reports TechCrunch.
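
For site owners who want to opt out the same way, the relevant robots.txt entries are short. OpenAI documents GPTBot as its training crawler's user-agent token, with ChatGPT-User and OAI-SearchBot as its other published bots; a minimal sketch, with the caveat that compliance remains voluntary:

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /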


Comments Filter:
  • FEED THE BOT!!!!! Now! OpenAI demands it!

    • FEED THE BOT!!!!! Now! OpenAI demands it!

      Look, it might cost many of you your livelihoods but that’s a risk OpenAI is willing to take.

    • Next up: everyone and their cat putting up gibberish generators to help Sam Altman get to the 100b "AGI" milestone.

      • by chthon ( 580889 )

        That is actually an idea I had some fifteen years ago: to poison the internet with random noise.

        However, when thinking about this, there always seem to be practical and technical obstacles.

  • Fuck OpenAI (Score:1, Flamebait)

    by jdawgnoonan ( 718294 )
    Sam’s plagiarism machine holds nothing good for anyone.
  • So ... (Score:2, Insightful)

    ... they didn't have any caching set up?

    Also, I would recommend having a website firewall (like Sucuri, or Cloudflare if you must), if your website is essential to your business ... having one ahead of time is a good idea.

    (No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)
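
    To make the caching suggestion concrete: a minimal sketch, assuming an nginx reverse proxy sits in front of the shop (paths, zone names, and timings are illustrative, not anything Triplegangers actually runs):

    # http context: a disk cache keyed in shared memory
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=site_cache:10m max_size=1g inactive=60m;

    server {
        listen 80;
        location / {
            proxy_cache site_cache;
            proxy_cache_valid 200 301 10m;                 # serve repeat hits from cache for 10 minutes
            proxy_cache_use_stale error timeout updating;  # keep serving stale pages if the backend chokes
            proxy_pass http://127.0.0.1:8080;              # the actual application server
        }
    }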

    • by Barny ( 103770 )

      Yeah, I was a bit perplexed by that too. A good CDN and reverse proxy (as you say, Cloudflare, which they did install after the attack) are kinda mandatory on the net these days—and are free with basic features.

      Cloudflare in particular has specific "Anti-AI Bot" options.

  • Look I basically despise OpenAI for a lot of reasons, but there's no story here.

    "Website has improperly configured robots.txt, gets fucked by Web bots" is all this is.

    The sob-context of "they're just a little company" is fluff.

    The "it's not opt-in" and "web bots have to choose to comply"... Also fluff. Show me where bots are ignoring it, then it's an actionable story.

    Feels like this is a simple case where whoever built their site is liable for some mitigation and AWS costs, end of story.

    • by Entrope ( 68843 ) on Saturday January 11, 2025 @08:29AM (#65080651) Homepage

      Is your argument that OpenAI is too stupid to scale its bots' traffic based on how well-known a web site is? Or that it is reasonable to use 600+ IPs to crawl a single random web site, with no apparent rate-limiting?

      • I await websites responding to AI scraping by all but fork-bombing the goatse image at them, with random filenames and minor filesize/checksum tweaks for the AIs to "learn" from.

        • by Luthair ( 847766 )
          I think the real answer is to detect bots and start serving poison pills. They're scraping the site because they want nicely tagged data, so serve them random junk that would be bad for a model to use.
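
          A minimal sketch of that poison-pill idea, assuming a Flask app sits behind the web server; the bot markers and the junk generator are illustrative assumptions, and anything keyed off the User-Agent header is trivially spoofed:

          import random
          import string

          from flask import Flask, request

          app = Flask(__name__)

          # Illustrative list of crawler User-Agent substrings to poison.
          AI_BOT_MARKERS = ("gptbot", "ccbot", "claudebot", "bytespider")

          def junk_paragraph(words=200):
              # Plausible-looking but meaningless pseudo-words.
              return " ".join(
                  "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
                  for _ in range(words)
              )

          @app.before_request
          def poison_ai_bots():
              ua = (request.headers.get("User-Agent") or "").lower()
              if any(marker in ua for marker in AI_BOT_MARKERS):
                  # Short-circuit the real view and feed the bot noise instead.
                  return junk_paragraph(), 200, {"Content-Type": "text/plain"}

          @app.route("/")
          def index():
              return "Real content for real visitors."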
    • by gweihir ( 88907 )

      Nope. This is for commercial use. For that, you need explicit, positive consent. Search engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

      • Search engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

        What "specific exception" are you referring to and where can I read it?

        Also, ignoring the poor behavior of the 600+ IPs for a moment, how is OpenAI different from a search engine, given that both scrape site content for the express purpose of private commercial gain?

    • If you leave your home unlocked, yes, it can be ransacked.

      However, using that analogy, you're blaming the person for not locking the door while 600 people walked right in and stole everything down to the bare nubs. Yes, he should have locked the door. That does not at all justify the 600 people who came in and took everything. Yes you should lock your doors, but protection from theft should not be opt-in.

      • by Barny ( 103770 )

        The issue here is that each and every AI bot requires its own specific config in robots.txt.

        They aren't obeying blanket "block all" settings.

        • It's easy to deny everything.

          User-agent: *
          Disallow: /

          Are you saying OpenAI is ignoring that? Because that's not what TFA says.

    • Also the whole point of AWS is that you can provision more server capacity...
    • by jsonn ( 792303 )
      Most "AI" scambots (and Chinese bots in general) do not honor robots.txt. They are the reason we can't have nice content without login or captcha barriers anymore...
    • by AmiMoJo ( 196126 )

      Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

      • Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

        Citation needed.

        Google explicitly says that they honor robots.txt.

        My own experience is that the major search engines do not crawl excluded directories.

        TFA says the site didn't have a proper robots.txt, which implies that OpenAI would've honored it if it had one.

  • Fucking criminals (Score:2, Insightful)

    by gweihir ( 88907 )

    Not only do they steal whatever they can get their hands on, they also destroy. All commercially, which makes them a criminal enterprise.

    Why are they allowed to do this?

  • As a human I have a speed limit on how fast I can browse the web; a bot can view millions of websites at a time. It seems about time that AWS billed the crawlers and not the innocent website owners.
  • by davide marney ( 231845 ) on Saturday January 11, 2025 @09:10AM (#65080715) Journal

    Music is constructed from repeating patterns, exactly the kind of patterns that LLMs excel at reproducing. The AI companies clearly are charging ahead with training on anything they can get their electronic hands on, copyrights be damned. No doubt the chickens will come home to roost, but it's going to be difficult in the extreme to show exactly where one's copied material was used to create the fake composite.

    Having my life's work being sucked out of my control and into the unregulated (and uncompensated) maw of AI is among my greatest worries.

  • Just Why? (Score:4, Interesting)

    by walkerp1 ( 523460 ) on Saturday January 11, 2025 @09:30AM (#65080755)
    Wow. How do you have an e-commerce site with an improperly configured robots.txt file? It sounds like they learned a relatively inexpensive lesson. Perhaps robots.txt should have a configuration to limit bandwidth and/or access frequency. That would allow crawlers responsible access without locking them out entirely. Still, a large crawler like OpenAI's also needs some self-throttling heuristics to keep from crushing poorly configured websites. It's pretty easy to tell you might be stressing a target when your latency suddenly increases or packets start getting dropped. For that matter, why not interleave queries across sites rather than dropping the bomb site by site?

    Anyway, it's good to see these stories every once in a while so that those who learn by example have some grist for their mills.
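
    On the access-frequency point: there is a non-standard Crawl-delay directive that some crawlers (Bing and Yandex, for example) honor, though Google ignores it and support by AI crawlers is anyone's guess. A sketch:

    User-agent: *
    Crawl-delay: 10

    The value is conventionally read as seconds between requests by the crawlers that respect it.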
  • I also run a small website and it got taken offline due to exceeding our plan's rate limit for database queries.
    Our plan's limit is 150,000 database queries per hour and Claude was going way beyond that. We went from a normal background of ~10k hits per day to ~1.2M hits per day for 4 days.

    Yeah, some dudz are gonna say: "Well, you should have configured your robots.txt properly."
    1. Sure, it is now; however, this has been fine for ~15 years and I suspect large swaths of the internet are configured this way
    • I should have mentioned that the website I manage is for a small .org non-profit, so we are not a commercial entity of any type. Our site just hosts information for our members and has nothing for sale.
  • I make use of robots.txt, but I also return 403 for various bot user-agents.

    If OpenAI ever spoofs its user-agent to be a normal browser, that'll mean war and I'll have to do IP-level blocking.

    This is an excerpt from my Apache config:

    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} DataForSeoBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} CensysI
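
    For the 403 behavior described, a chain like that excerpt would presumably finish with one last condition carrying no OR flag, followed by a forbidden rule; the GPTBot entry here is an assumption (it is OpenAI's documented crawler token), not part of the original config:

    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule .* - [F,L]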

  • Google would do the same "back in the day". So why is this news? Every new search engine runs into issues like this. Google did, Microsoft did, AltaVista did, Inktomi did, Teoma/Ask did, Lycos did, Infoseek was notorious for it, Excite did...

    So why is this news?

    Oh right, it is in vogue to trash on OpenAI, give Google a pass, and throw roses at the feet of Apple. Got it... Grow the F up, people, and mark the original post as a troll.

  • > ...an AI bot is taking a website's copyrighted belongings...

    If you make content available on the Internet, anyone who views your page has --by definition-- downloaded it.

    Don't whine about copyrights when you chose to put it up there. Fire your IT guy who couldn't be bothered to do rate-limiting, and NO, "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.

    But hey, Weekend Slashdot and BeauSD. Every week.
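
    On the rate-limiting point, it really is only a few lines of configuration; a minimal sketch, assuming an nginx reverse proxy in front of the application (zone name, rate, and upstream address are illustrative):

    # http context: track clients by IP, allow an average of 5 requests/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        listen 80;
        location / {
            limit_req zone=perip burst=20 nodelay;  # absorb short bursts, reject sustained floods
            limit_req_status 429;                   # tell well-behaved clients to back off
            proxy_pass http://127.0.0.1:8080;       # the actual application server
        }
    }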

  • "Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot..."

    "Allowed the bot" is the wrong phrase; you meant "politely asked the bot not to". In fact, you correct yourself in the next paragraph, so why did you misspeak at all?
