
OpenAI's Bot Crushes Seven-Person Company's Website 'Like a DDoS Attack' 78

An anonymous reader quotes a report from TechCrunch: On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."

Triplegangers' website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos -- everything from hands to hair, skin, and full bodies -- to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics. [...] To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot.
Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot to freely scrape its site since the system interprets the absence of such a file as permission. It's not an opt-in system.

Once the file was updated with specific tags to block OpenAI's bot, along with additional defenses like Cloudflare, the scraping stopped. However, robots.txt is not foolproof since compliance by AI companies is voluntary, leaving the burden on website owners to monitor and block unauthorized access proactively. "[Tomchuk] wants other small online businesses to know that the only way to discover if an AI bot is taking a website's copyrighted belongings is to actively look," reports TechCrunch.
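
For reference, the robots.txt rule that asks OpenAI's documented crawler (user agent GPTBot) to stay away is only two lines; as noted above, honoring it remains up to the crawler:

User-agent: GPTBot
Disallow: /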
This discussion has been archived. No new comments can be posted.

  • FEED THE BOT!!!!! Now! OpenAI demands it!

    • by burtosis ( 1124179 ) on Saturday January 11, 2025 @08:24AM (#65080643)

      FEED THE BOT!!!!! Now! OpenAI demands it!

      Look, it might cost many of you your livelihoods but that’s a risk OpenAI is willing to take.

    • Next up: everyone and their cat putting up gibberish generators to help Sam Altman get to the 100b "AGI" milestone.

      • by chthon ( 580889 )

        That is actually an idea that I had some fifteen years ago, to poison the internet with random noise.

        However, when thinking about this, there always seem to be practical and technical obstacles.

        • Everyone's had it; alas, we have neither the time, nor the bandwidth, nor the processor time to waste on the bots of the likes of Altman.

          Luckily, his company's quite successful in collecting bullshit anyway, judging by the output.

  • Fuck OpenAI (Score:2, Insightful)

    by jdawgnoonan ( 718294 )
    Sam’s plagiarism machine holds nothing good for anyone.
  • So ... (Score:3, Insightful)

    by cascadingstylesheet ( 140919 ) on Saturday January 11, 2025 @08:16AM (#65080629) Journal

    ... they didn't have any caching set up?

    Also, I would recommend having a website firewall (like Sucuri, or Cloudflare if you must), if your website is essential to your business ... having one ahead of time is a good idea.

    (No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)

    • by Barny ( 103770 )

      Yeah, I was a bit perplexed by that too. A good CDN and reverse proxy (as you say, Cloudflare, which they did install after the attack) are kinda mandatory on the net these days—and are free with basic features.

      Cloudflare in particular has specific "Anti-AI Bot" options.

    • (No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)

      Great, great, just what we need - another Big Water apologist.

      • (No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)

        Great, great, just what we need - another Big Water apologist.

        Okay, that did make me laugh out loud :)

  • Look I basically despise OpenAI for a lot of reasons, but there's no story here.

    "Website has improperly configured robots.txt, gets fucked by Web bots" is all this is.

    The sob-context of "they're just a little company" is fluff.

    The "it's not opt-in" and "web bots have to choose to comply"... Also fluff. Show me where bots are ignoring it, then it's an actionable story.

    Feels like this is a simple case where whoever built their site is liable for some mitigation and AWS costs, end of story.

    • by Entrope ( 68843 ) on Saturday January 11, 2025 @08:29AM (#65080651) Homepage

      Is your argument that OpenAI is too stupid to scale its bots' traffic based on how well-known a web site is? Or that it is reasonable to use 600+ IPs to crawl a single random web site, with no apparent rate-limiting?

      • I await websites having automated responses to AI scraping that all but fork-bomb the goatse image with random filenames and minor filesize/checksum tweaks for the AIs to "learn" from.

      • I didn't have "an argument" though it sounds like you really want one?

        None of those things would have happened if their robots.txt had been configured properly, unless you're answering that OpenAI is breaking norms of behavior, which WOULD be an interesting news story.

        This one is about as interesting as "man cuts himself on broken glass" or "woman falls on icy steps" - inadvertent but predictable, and hardly newsworthy.

        Why are people so fucking tetchy?

        • by Entrope ( 68843 )

          People are tetchy because some of us realize that if a company is going to devote 600+ IPs and lots of network resources to a web crawler, they should implement rate limiting that isn't absolutely moronic. We get annoyed when people try to excuse that by saying "oh well the target should have set up a robots.txt (and hope that all the crawlers obey crawl-delay or whatever other extensions to that non-standard file format exist) so that these non-paying leeches can download everything on the web site".

          • None of those 600 IPs would have even touched this innocent little victim if their shit was set up right.

            I didn't - even by implication - exonerate anyone or excuse it. This company fucked up and suffered for it. OpenAI shouldn't swamp the web with their bots. BOTH CAN BE TRUE, DUH?

            (Don't keep banging on about "hoping that they obey the rules" when there's no sign here that they didn't - that's bullshit by allusion, he said, hoping you haven't murdered any children today....)

    • by gweihir ( 88907 )

      Nope. This is for commercial use. For that, you need explicit, positive consent. Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

      • Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.

        What "specific exception" are you referring to and where can I read it?

        Also, ignoring the poor behavior of the 600+ IPs for a moment, how is OpenAI different from a search engine in that they both scrape sites of content for the express purpose of private commercial gain?

      • by ISayWeOnlyToBePolite ( 721679 ) on Saturday January 11, 2025 @11:47AM (#65080987)

        Nope. This is for commercial use. For that, you need explicit, positive consent.

        No. https://en.wikipedia.org/wiki/... [wikipedia.org]

        Search-engines have a specific exception.

        No.

        OpenAI does not. They just take because they can. Like a shoplifter.

        No.No.No.

        See for example https://fairuse.stanford.edu/p... [stanford.edu] (trying to sue google for indexing and caching but gets thrown out).

        • by gweihir ( 88907 )

          You and the morons that modded you up have your heads up your asses. Indexing is _fundamentally_ different from training AI models on the data. The most important thing about indexing is that all found data refers back to the source via links. AI models do not do that.

          How can you people be so disconnected that you overlook the most obvious things?

          • And you realize that the distinction between a search engine and an LLM is currently varying shades of gray. We are receiving traffic off links in SearchGPT/ChatGPT. Clearly, the line between an LLM database and a search engine index is blurry. That said, OpenAI does run a bunch of different bots with different agent names. All I could find: https://www.searchengineworld.... [searchengineworld.com]
            • by gweihir ( 88907 )

              You are ignorant with regards to how LLMs work. No, you do _not_ receive traffic off ChatGPT because you were in the training data. LLMs fake that by attaching a conventional search. And that means the training data was illegally obtained unless there was explicit permission.

              • Ever hear of Google? They've been "training" their AI on the web for a very long time. You could argue about copyright until the cows come home. If you put it on the web and not behind a paywall, the rule of law in this country is pretty simple, "he with the most money for the best lawyers...wins"
    • If you leave your home unlocked, yes, it can be ransacked.

      However, using that analogy, you're blaming the person for not locking the door while 600 people walked right in and stole everything down to the bare nubs. Yes, he should have locked the door. That does not at all justify the 600 people who came in and took everything. Yes, you should lock your doors, but protection from theft should not be opt-in.

      • by Barny ( 103770 )

        The issue here is that each and every AI bot requires its own specific config in robots.txt.

        They aren't obeying blanket "block all" settings.

        • It's easy to deny everything.

          User-agent: *
          Disallow: /

          Are you saying OpenAI is ignoring that? Because that's not what TFA says.

    • Also the whole point of AWS is that you can provision more server capacity...
    • Re: (Score:3, Interesting)

      by jsonn ( 792303 )
      Most "AI" scambots (and Chinese bots in general) do not honor robots.txt. They are the reason we can't have nice content without login or captcha barriers anymore...
    • by AmiMoJo ( 196126 )

      Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

      • by ShanghaiBill ( 739463 ) on Saturday January 11, 2025 @10:15AM (#65080823)

        Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.

        Citation needed.

        Google explicitly says that they honor robots.txt.

        My own experience is that the major search engines do not crawl excluded directories.

        TFA says the site didn't have a proper robots.txt, which implies that OpenAI would've honored it if it had one.

    • Nope. Robots.txt has nothing to do with this issue. Search engines can still crawl a site even with a full robots.txt block on it. I have sites with portions fully blocked off in robots.txt, and both Google and Bing still crawl them. OpenAI is not under any restriction to crawl even if there is a full bot ban. The only way to stop them is banning their IPs and/or agent names.

      RewriteCond %{HTTP_USER_AGENT} ^.*(DataForSeoBot|AhrefsBot|wp_is_mobile|AppleBot|meerkatseo|LWP|AppleNewsBot|yacy|infotiger|a

    • by jmke ( 776334 )
      > Show me where bots are ignoring it, then it's an actionable story.

      I see plenty of traffic indexing and downloading sites with bad user agents, thousands of hits, ignoring any robots.txt or disallow directives; so there are enough bad actors out there, and this will only increase as everybody wants to crawl and build/train their "bots" for "AI" purposes.
  • Fucking criminals (Score:3, Insightful)

    by gweihir ( 88907 ) on Saturday January 11, 2025 @08:27AM (#65080645)

    Not only do they steal whatever they can get their hands on, they also destroy. All commercially, which makes them a criminal enterprise.

    Why are they allowed to do this?

  • by xack ( 5304745 ) on Saturday January 11, 2025 @09:01AM (#65080699)
    As a human I have a speed limit on how fast I can browse the web; a bot can view millions of websites at a time. It seems time that AWS should bill the crawlers and not the innocent web site owners.
    • by chrish ( 4714 )

      That's the great thing (for AWS)... they get to charge the crawlers AND the web site owners. And also sell infra for "AI" training, hosting, etc.

  • Music is constructed from repeating patterns, exactly the kind of patterns that LLMs excel at reproducing. The AI companies clearly are charging ahead with training on anything they can get their electronic hands on, copyrights be damned. No doubt the chickens will come home to roost, but it's going to be difficult in the extreme to show exactly where one's copied material was used to create the fake composite.

    Having my life's work being sucked out of my control and into the unregulated (and uncompensated)

    • by Khyber ( 864651 )

      "Music is constructed from repeating patterns"

      Sure, if you're a no-talent pop icon.

      Meanwhile Paul Gilbert just laughs and wanks that 7-string.

    • As a fellow human and music lover, I feel your concerns. I suppose the real question is what might the human audience not only accept, but ultimately prefer?

      Once AI-generated music steps up to the mic, will the audience embrace singers hitting notes no human could? Will they prefer insane guitar solos coming from virtual 24-string double-neck guitars at warp speed? Classical compositions designed in ways we haven’t even thought of before, using virtual instruments with unique sound signatures? An

  • Just Why? (Score:5, Insightful)

    by walkerp1 ( 523460 ) on Saturday January 11, 2025 @09:30AM (#65080755)
    Wow. How do you have an e-commerce site with an improperly configured robots.txt file? It sounds like they learned a relatively inexpensive lesson. Perhaps robots.txt should have a configuration to limit bandwidth and/or access frequency (see the sketch below). That would allow crawlers responsible access without locking them out entirely. Still, a large crawler like OpenAI also needs some self-throttling heuristics to keep from crushing poorly configured websites. It's pretty easy to tell you might be stressing a target when your latency suddenly increases or packets start getting dropped. For that matter, why not interleave queries rather than dropping the bomb site by site?

    Anyway, it's good to see these stories every once in a while so that those who learn by example have some grist for their mills.
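
    For what it's worth, a non-standard Crawl-delay directive along those lines already exists; some crawlers honor it and others (Googlebot among them) ignore it, so it is a request rather than an enforced limit. A minimal sketch:

    User-agent: *
    Crawl-delay: 10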
    • Re:Just Why? (Score:5, Insightful)

      by medusa-v2 ( 3669719 ) on Saturday January 11, 2025 @11:32AM (#65080967)

      Eh. I don't have a no trespassing sign up, that doesn't mean I should have to worry that an entire fleet of self driving cars is going to run me over while I'm sleeping in my bed.

      If you're in the scraping business, the absolute minimum you can do is rate limit your own tools so you don't run people over just because they are there. OpenAI is way out of line.

  • by tommycw1 ( 3529625 ) on Saturday January 11, 2025 @09:54AM (#65080789)
    I also run a small website, and it got taken offline for exceeding our plan's rate limit for database queries.
    Our plan's limit is 150,000 database queries per hour, and Claude was going way beyond that. We went from a normal background of ~10k hits per day to ~1.2M hits per day for 4 days.

    Yeah, some dudz are gonna say: "Well, you should have configured your robots.txt properly."
    1. Sure, it is now; however, this had been fine for ~15 years, and I suspect large swaths of the internet are configured this way.
    2. There are tons of reports all over the internet that ClaudeBot doesn't respect robots.txt.

    So, yes, I had not configured my robots.txt properly, but I also contacted Anthropic, who, to their credit, apologized and put me on their do-not-crawl list, and I've seen almost zero traffic from them since.

    • I should have mentioned that the website I manage is for a small .org non-profit, so we are not a commercial entity of any type. Our site just hosts information for our members and has nothing for sale.
      • There's exactly one line of code in an Apache/NGINX config to block by User-Agent. Out of curiosity, how do you cope with Chinese bots that impersonate legit user agents? They come from thousands of IPs in China, the US, Taiwan, etc...
        https://stackoverflow.com/ques... [stackoverflow.com]
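
        For example, a rough nginx sketch of that one-liner idea (untested; goes inside a server block, and the bot names are only illustrative):

        # Return 403 to requests whose User-Agent matches any of these crawler names
        if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|CCBot)") {
            return 403;
        }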

        • Good comments. I do now block by UA, and I've blocked tons of IP blocks as well. Since my .org is essentially only relevant regionally in the US, it's easy to block large ranges of IPs without much concern about blocking legit users.

          Essentially all of the random bots that hit the site constantly (whether they are Chinese or otherwise) have been smart enough to not hit the site so hard they take it offline. This is the background noise of the ~10k hits per day with likely >90% of that being bots. When
          • You cannot block IPs forever; they keep changing, so it's difficult to maintain a reputation list. I recently started looking for a solution and could not find anything open source; Crowdstrike Falcon Pro seems useful, but it's quite pricey at $100/device/year. Until then, Cloudflare's free tier works well; it's blocking a lot of bot traffic.

  • Belt & Suspenders (Score:5, Informative)

    by dskoll ( 99328 ) on Saturday January 11, 2025 @10:20AM (#65080829) Homepage

    I make use of robots.txt, but I also return 403 for various bot user-agents.

    If OpenAI ever spoofs its user-agent to be a normal browser, that'll mean war and I'll have to do IP-level blocking.

    This is an excerpt from my Apache config:

    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} DataForSeoBot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} CensysInspect [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} openai [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} GenomeCrawlerd [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} imagesift [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} Barkrowler [NC]
    RewriteRule . - [R=403,L]

    • Yes, this is good advice. The problem is that with this approach we are good only until one of these companies slightly changes its UA (say from "BaiduSpyder" -> "BaiduBot", for example) so it no longer matches, and/or a new or existing company comes online with a bot you've never heard of with a brand-new UA. That's what happened to me with Claude, and it took down my site.
  • Google would do the same "back in the day". So why is this news? Every new search engine runs into issues like this. Google did, microsoft did, altavista did, inktomi did, teoma/ask did, lycos did, infoseek was notorious for it, excite did...

    So why is this news?

    Oh right, it is in vogue to trash on OpenAI, give Google a pass, and throw roses at the feet of Apple. Got it... Grow the F up, people, and mark the original post as a troll.

    • by jsonn ( 792303 )
      I never had noticeable load spikes due to the Google bot. Bing was somewhat noticeable at some point, but they still honor robots.txt. Never had a problem with the rest either. That's the point: Search engine bots overall tried to be good netizens. The AI scammers don't. The old search engine bots even often documented their IP ranges to make it possible to throttle or block them if you cared enough. The new AI scam bots behave like a DDoS. If you can't tell the difference, I can't help you.
  • Copyright BS (Score:1, Insightful)

    by gavron ( 1300111 )

    > ...an AI bot is taking a website's copyrighted belongings...

    If you make content available on the Internet, anyone who views your page has --by definition-- downloaded it.

    Don't whine about copyrights when you chose to put it up there. Fire your IT guy who couldn't be bothered to do rate-limiting, and NO, "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.

    But hey, Weekend Slashdot and BeauSD. Every week.

    • Rate limit does nothing for Chinese bots who use thousands of IPs simultaneously, basically a DDOS. FYI some companies are scaling just for regular traffic and don't have resources to cope with traffic spikes. Of course they can use CloudFlare, but still I don't think it's fair to abuse the servers just because they're public. It's the same with the mailbox, just because I have one in the front of my house, I don't want to find it filled with spam. And in my country there's a legislation for that, so I don'

    • by allo ( 1728082 )

      > "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.

      It does matter. Even though it is not law, it is a good argument in court if the situation is otherwise unclear. But OpenAI respects robots.txt and even has a unique bot name, so you can exclude their bot while allowing others if you want to.

  • "Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot..."

    "Allowed the bot" is the wrong phrase; you meant "politely asked the bot not to". In fact, you correct yourself in the next paragraph, so why did you misspeak at all?

  • by kamitchell ( 1090511 ) on Saturday January 11, 2025 @10:59AM (#65080919)

    Sit back on the site that inspired the word Slashdotting and set the wayback machine for 1999. Substitute the word Google for OpenAI, and this is the same thing.

    Aside from whether OpenAI should be scraping at all, mistakes happen, unexpected traffic happens.

    When using AWS, make sure to avail yourself of logging, alarms, and most importantly, the billing cost controls. Better to be offline and have to figure out what happened than to have an enormous bill.
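
    A rough sketch of the billing-alarm piece (this assumes CloudWatch billing metrics are enabled for the account, which publishes them in us-east-1, and that an SNS topic for notifications already exists; the topic ARN below is a placeholder):

    # Alarm when month-to-date estimated charges exceed $200
    aws cloudwatch put-metric-alarm \
      --region us-east-1 \
      --alarm-name billing-over-200-usd \
      --namespace AWS/Billing \
      --metric-name EstimatedCharges \
      --dimensions Name=Currency,Value=USD \
      --statistic Maximum \
      --period 21600 \
      --evaluation-periods 1 \
      --threshold 200 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts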

    • by jsonn ( 792303 )
      Google does not bombard a random web site with 100s of concurrent requests. AI scam bots do. The main issue in 1999 with slashdotting was not even the CPU load either, but running out of bandwidth.
  • > "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos."

    So around 200K files times 1MB. Some "businesses" are not meant to be.

  • if their company name didn't sound like some porn site.
  • People are limited in their actions in person. Should that be true on the internet too?

    Just realize that if you claim "yes," there is likely a lot of work required to actually do anything. Part of why the internet works as well as it does is that it ignores physical location. How do you layer that on top without breaking the original?
