OpenAI's Bot Crushes Seven-Person Company's Website 'Like a DDoS Attack'
An anonymous reader quotes a report from TechCrunch: On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."
Triplegangers' website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos -- everything from hands to hair, skin, and full bodies -- to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics. [...] To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot. Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot to freely scrape its site since the system interprets the absence of such a file as permission. It's not an opt-in system.
Once the file was updated with specific tags to block OpenAI's bot, along with additional defenses like Cloudflare, the scraping stopped. However, robots.txt is not foolproof since compliance by AI companies is voluntary, leaving the burden on website owners to monitor and block unauthorized access proactively. "[Tomchuk] wants other small online businesses to know that the only way to discover if an AI bot is taking a website's copyrighted belongings is to actively look," reports TechCrunch.
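For reference, OpenAI's documented crawler identifies itself with the GPTBot user agent, so a minimal opt-out in robots.txt looks like the two lines below (a sketch only; as noted above, whether a crawler honors it is still up to the crawler):
User-agent: GPTBot
Disallow: /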
Feed the bot! (Score:2)
FEED THE BOT!!!!! Now! OpenAI demands it!
Re:Feed the bot! (Score:4, Funny)
FEED THE BOT!!!!! Now! OpenAI demands it!
Look, it might cost many of you your livelihoods but that’s a risk OpenAI is willing to take.
Re: (Score:2)
Next up: everyone and their cat putting up gibberish generators to help Sam Altman get to the 100b "AGI" milestone.
Re: (Score:2)
That is actually an idea I had some fifteen years ago: to poison the internet with random noise.
However, when thinking about this, there always seem to be practical and technical obstacles.
Re: (Score:2)
Everyone's had it; alas, we have neither the time, nor the bandwidth, nor the processor time to waste on the bots of the likes of Altman.
Luckily, his company's quite successful in collecting bullshit anyway, judging by the output.
Fuck OpenAI (Score:1, Insightful)
So ... (Score:3, Insightful)
... they didn't have any caching set up?
Also, I would recommend having a website firewall (like Sucuri, or Cloudflare if you must), if your website is essential to your business ... having one ahead of time is a good idea.
(No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)
Re: (Score:2)
Yeah, I was a bit perplexed by that too. A good CDN and reverse proxy (as you say, Cloudflare, which they did install after the attack) are kinda mandatory on the net these days—and are free with basic features.
Cloudflare in particular has specific "Anti-AI Bot" options.
Re: (Score:2)
(No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)
Great, great, just what we need - another Big Water apologist.
Re: (Score:1)
(No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)
Great, great, just what we need - another Big Water apologist.
Okay, that did make me laugh out loud :)
that's why there is a robots.txt (Score:2, Insightful)
Look I basically despise OpenAI for a lot of reasons, but there's no story here.
"Website has improperly configured robots.txt, gets fucked by Web bots" is all this is.
The sob-context of "they're just a little company" is fluff.
The "it's not opt-in" and "web bots have to choose to comply"... Also fluff. Show me where bots are ignoring it, then it's an actionable story.
Feels like this is a simple case where whoever built their site is liable for some mitigation and AWS costs, end of story.
Re:that's why there is a robots.txt (Score:5, Interesting)
Is your argument that OpenAI is too stupid to scale its bots' traffic based on how well-known a web site is? Or that it is reasonable to use 600+ IPs to crawl a single random web site, with no apparent rate-limiting?
Re: (Score:2)
I await websites having automated responses to AI scraping: all but fork-bombing the bots with the goatse image under random filenames and minor filesize/checksum tweaks for the AIs to "learn" from.
Re:that's why there is a robots.txt (Score:4, Interesting)
Re: (Score:1)
Nope. This is for commercial use. For that, you need explicit, positive consent. Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.
Re: (Score:3)
Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.
What "specific exception" are you referring to and where can I read it?
Also, ignoring the poor behavior of the 600+ IPs for a moment, how is OpenAI different from a search engine, given that both scrape sites' content for the express purpose of private commercial gain?
Re:that's why there is a robots.txt (Score:4, Insightful)
Nope. This is for commercial use. For that, you need explicit, positive consent.
No. https://en.wikipedia.org/wiki/... [wikipedia.org]
Search-engines have a specific exception.
No.
OpenAI does not. They just take because they can. Like a shoplifter.
No. No. No.
See for example https://fairuse.stanford.edu/p... [stanford.edu] (an attempt to sue Google over indexing and caching that got thrown out).
Re: (Score:1)
However, using that analogy, you're blaming the person for not locking the door while 600 people walked right in and stole everything down to the bare nubs. Yes, he should have locked the door. That does not at all justify the 600 people who came in and took everything. Yes you should lock your doors, but protection from theft should not be opt-in.
Re: (Score:3)
The issue here is that each and every AI bot requires its own specific config in robots.txt.
They aren't obeying blanket "block all" settings.
Re: (Score:3)
It's easy to deny everything.
User-agent: *
Disallow: /
Are you saying OpenAI is ignoring that? Because that's not what TFA says.
Re: (Score:2)
Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.
Re: (Score:2)
Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.
Citation needed.
Google explicitly says that they honor robots.txt.
My own experience is that the major search engines do not crawl excluded directories.
TFA says the site didn't have a proper robots.txt, which implies that OpenAI would've honored it if it had one.
Re: (Score:1)
RewriteCond %{HTTP_USER_AGENT} ^.*(DataForSeoBot|AhrefsBot|wp_is_mobile|AppleBot|meerkatseo|LWP|AppleNewsBot|yacy|infotiger|a
Fucking criminals (Score:2, Insightful)
Not only do they steal whatever they can get their hands on, they also destroy. All commercially, which makes them a criminal enterprise.
Why are they allowed to do this?
Re: (Score:2)
Because the law is 5-10 years behind the technology.
Re: (Score:2)
The search engines aren't using 600+ IPs to scrape all at once. So, no, it wouldn't have happened with any of those. OpenAI is a cancerous tumor.
I've seen Bing overwhelm a few sites that didn't have caching. And there are plenty of random lesser bots and spiders out there.
No, OpenAI shouldn't do that, but it still doesn't mean you shouldn't have an umbrella when it's raining.
High speed crawling is vandalism (Score:3, Insightful)
This is my greatest worry as a songwriter (Score:3)
Music is constructed from repeating patterns, exactly the kind of patterns that LLMs excel at reproducing. The AI companies clearly are charging ahead with training on anything they can get their electronic hands on, copyrights be damned. No doubt the chickens will come home to roost, but it's going to be difficult in the extreme to show exactly where one's copied material was used to create the fake composite.
Having my life's work being sucked out of my control and into the unregulated (and uncompensated) maw of AI is among my greatest worries.
Re: (Score:2)
"Music is constructed from repeating patterns"
Sure, if you're a no-talent pop icon.
Meanwhile Paul Gilbert just laughs and wanks that 7-string.
Re: (Score:2)
As a fellow human and music lover, I feel your concerns. I suppose the real question is what might the human audience not only accept, but ultimately prefer?
Once AI-generated music steps up to the mic, will the audience embrace singers hitting notes no human could? Will they prefer insane guitar solos coming from virtual 24-string double-neck guitars at warp speed? Classical compositions designed in ways we haven’t even thought of before, using virtual instruments with unique sound signatures? An
Just Why? (Score:5, Interesting)
Anyway, it's good to see these stories every once in a while so that those who learn by example have some grist for their mills.
Re:Just Why? (Score:5, Insightful)
Eh. I don't have a "no trespassing" sign up, but that doesn't mean I should have to worry that an entire fleet of self-driving cars is going to run me over while I'm sleeping in my bed.
If you're in the scraping business, the absolute minimum you can do is rate limit your own tools so you don't run people over just because they are there. OpenAI is way out of line.
I had the same shit from Anthropic's Claudebot (Score:5, Interesting)
Our plan's limit is 150,000 database queries per hour and Claude was going way beyond that. We went from a normal background of ~10k hits per day to ~1.2M hits per day for 4 days.
Yeah, some dudz are gonna say: "Well, you should have configured your robots.txt properly."
So, yes, I have not configured my robots.txt properly, but I also contacted Anthropic, who, to their credit, apologized and put me on their do-not-crawl list, and I've seen almost zero traffic from them since.
Re: I had the same shit from Anthropic's Claudebot (Score:2)
There's exactly one line of code in Apache/NGINX config to block by User Agent. Out of curiosity, how do you cope with Chinese bots that impersonate legit user agents? They come from thousands of IPs in China, the US, Taiwan, etc...
https://stackoverflow.com/ques... [stackoverflow.com]
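For nginx, a minimal sketch looks like this (the bot names are only examples, and the block has to sit inside a server or location context):
if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider)") {
    return 403;
}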
Re: (Score:1)
Essentially all of the random bots that hit the site constantly (whether they are Chinese or otherwise) have been smart enough to not hit the site so hard they take it offline. This is the background noise of the ~10k hits per day with likely >90% of that being bots. When
Belt & Suspenders (Score:4, Interesting)
I make use of robots.txt, but I also return 403 for various bot user-agents.
If OpenAI ever spoofs its user-agent to be a normal browser, that'll mean war and I'll have to do IP-level blocking.
This is an excerpt from my Apache config:
RewriteEngine on
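# Match known bot user agents (case-insensitive, chained with OR) and answer with 403 Forbidden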
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DataForSeoBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CensysInspect [NC,OR]
RewriteCond %{HTTP_USER_AGENT} openai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GenomeCrawlerd [NC,OR]
RewriteCond %{HTTP_USER_AGENT} imagesift [NC,OR]
RewriteCond %{HTTP_USER_AGENT} claudebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Barkrowler [NC]
RewriteRule . - [R=403,L]
10mins I will never get back, or why is this on /. (Score:2, Insightful)
So why is this news?
Oh right, it is in vogue to trash on OpenAI, give Google a pass, and throw roses at the feet of Apple. Got it... Grow the F up, people, and mark the original post as a troll.
Copyright BS (Score:1, Insightful)
> ...an AI bot is taking a website's copyrighted belongings...
If you make content available on the Internet, anyone who views your page has --by definition-- downloaded it.
Don't whine about copyrights when you chose to put it up there. Fire your IT guy who couldn't be bothered to do rate-limiting, and NO, "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.
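If you want a starting point, a minimal nginx rate-limit sketch looks like this (the zone name and numbers are illustrative, not tuned for any real site):
# http context: track clients by IP, allow ~10 requests/second each
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
# server/location context: apply the limit with a small burst allowance
location / {
    limit_req zone=perip burst=20 nodelay;
}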
But hey, Weekend Slashdot and BeauSD. Every week.
Re: Copyright BS (Score:2)
Rate limiting does nothing against Chinese bots that use thousands of IPs simultaneously; it's basically a DDoS. FYI, some companies scale only for regular traffic and don't have the resources to cope with traffic spikes. Of course they can use Cloudflare, but I still don't think it's fair to abuse servers just because they're public. It's the same with a mailbox: just because I have one in front of my house doesn't mean I want to find it filled with spam. And in my country there's legislation for that, so I don'
proofreading (Score:2)
"Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot..."
"Allowed the bot" is the wrong phrase; you meant "politely asked the bot not to". In fact, you correct yourself in the next paragraph, so why did you misspeak at all?
What's old is new (Score:2)
Sit back on the site that inspired the word Slashdotting and set the wayback machine for 1999. Substitute the word Google for OpenAI, and this is the same thing.
Aside from whether OpenAI should be scraping at all, mistakes happen, unexpected traffic happens.
When using AWS, make sure to avail yourself of logging, alarms, and most importantly, the billing cost controls. Better to be offline and have to figure out what happened than to have an enormous bill.
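As a sketch, a CloudWatch billing alarm can be created from the CLI like this (the alarm name, threshold, and SNS topic ARN are placeholders; billing metrics only exist in us-east-1 and require billing alerts to be enabled first):
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name "monthly-bill-over-100-usd" \
  --namespace "AWS/Billing" \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts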
"We cant handle hosting ~250MB of data!!!" (Score:2)
> "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos."
So around 200K files times 1MB. Some "businesses" are not meant to be.