OpenAI's Bot Crushes Seven-Person Company's Website 'Like a DDoS Attack'
An anonymous reader quotes a report from TechCrunch: On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company's e-commerce site was down. It looked to be some kind of distributed denial-of-service attack. He soon discovered the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site. "We have over 65,000 products, each product has a page," Tomchuk told TechCrunch. "Each page has at least three photos." OpenAI was sending "tens of thousands" of server requests trying to download all of it, hundreds of thousands of photos, along with their detailed descriptions. "OpenAI used 600 IPs to scrape data, and we are still analyzing logs from last week, perhaps it's way more," he said of the IP addresses the bot used to attempt to consume his site. "Their crawlers were crushing our site," he said. "It was basically a DDoS attack."
Triplegangers' website is its business. The seven-employee company has spent over a decade assembling what it calls the largest database of "human digital doubles" on the web, meaning 3D image files scanned from actual human models. It sells the 3D object files, as well as photos -- everything from hands to hair, skin, and full bodies -- to 3D artists, video game makers, anyone who needs to digitally recreate authentic human characteristics. [...] To add insult to injury, not only was Triplegangers knocked offline by OpenAI's bot during U.S. business hours, but Tomchuk expects a jacked-up AWS bill thanks to all of the CPU and downloading activity from the bot. Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot to freely scrape its site since the system interprets the absence of such a file as permission. It's not an opt-in system.
Once the file was updated with specific tags to block OpenAI's bot, along with additional defenses like Cloudflare, the scraping stopped. However, robots.txt is not foolproof since compliance by AI companies is voluntary, leaving the burden on website owners to monitor and block unauthorized access proactively. "[Tomchuk] wants other small online businesses to know that the only way to discover if an AI bot is taking a website's copyrighted belongings is to actively look," reports TechCrunch.
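For what it's worth, "actively look" mostly means reading your own access logs. A minimal sketch of that kind of check in Python, assuming a combined-format log at a placeholder path (adjust for your own server):

#!/usr/bin/env python3
# Rough sketch: count hits per user agent in a combined-format access log,
# so heavy crawlers like GPTBot stand out. The log path is a placeholder.
import collections
import re

LOG_PATH = "/var/log/apache2/access.log"
UA_RE = re.compile(r'"([^"]*)"$')  # combined format ends with the quoted user agent

counts = collections.Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_RE.search(line.rstrip())
        if match:
            counts[match.group(1)] += 1

# Print the noisiest user agents first.
for agent, hits in counts.most_common(20):
    print(f"{hits:8d}  {agent}")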
Feed the bot! (Score:2)
FEED THE BOT!!!!! Now! OpenAI demands it!
Re: (Score:2)
FEED THE BOT!!!!! Now! OpenAI demands it!
Look, it might cost many of you your livelihoods but that’s a risk OpenAI is willing to take.
Re: (Score:2)
Next up: everyone and their cat putting up gibberish generators to help Sam Altman get to the 100b "AGI" milestone.
Re: (Score:2)
That is actually an idea I had some fifteen years ago: to poison the internet with random noise.
However, when thinking about this, there always seem to be practical and technical obstacles.
Fuck OpenAI (Score:1, Flamebait)
So ... (Score:2, Insightful)
... they didn't have any caching set up?
Also, I would recommend a website firewall (like Sucuri, or Cloudflare if you must). If your website is essential to your business, having one ahead of time is a good idea.
(No, I'm not "defending OpenAI" any more than I'm "defending water" by suggesting you have a sump pump.)
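For the caching piece, a minimal sketch with stock Apache modules (mod_cache plus mod_cache_disk); the module paths, cache root, and lifetimes below are placeholders to adjust:

LoadModule cache_module modules/mod_cache.so
LoadModule cache_disk_module modules/mod_cache_disk.so

# Answer repeat requests from the disk cache before hitting the application.
CacheQuickHandler on
CacheEnable disk "/"
CacheRoot "/var/cache/apache2/mod_cache_disk"
CacheDirLevels 2
CacheDirLength 1
CacheDefaultExpire 3600
CacheMaxExpire 86400

A CDN in front does the same job at the edge; either way, one runaway crawler stops forcing your backend to re-render every page.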
Re: (Score:2)
Yeah, I was a bit perplexed by that too. A good CDN and reverse proxy (as you say, Cloudflare, which they did install after the attack) are kinda mandatory on the net these days—and are free with basic features.
Cloudflare in particular has specific "Anti-AI Bot" options.
that's why there is a robots.txt (Score:1, Insightful)
Look I basically despise OpenAI for a lot of reasons, but there's no story here.
"Website has improperly configured robots.txt, gets fucked by Web bots" is all this is.
The sob-context of "they're just a little company" is fluff.
The "it's not opt-in" and "web bots have to choose to comply"... Also fluff. Show me where bots are ignoring it, then it's an actionable story.
Feels like this is a simple case where whoever built their site is liable for some mitigation and AWS costs, end of story.
Re:that's why there is a robots.txt (Score:5, Interesting)
Is your argument that OpenAI is too stupid to scale its bots' traffic based on how well-known a web site is? Or that it is reasonable to use 600+ IPs to crawl a single random web site, with no apparent rate-limiting?
Re: (Score:2)
I await websites' automated responses to AI scraping being all but fork-bombing them with the goatse image under random filenames and minor filesize/checksum tweaks for the AIs to "learn" from.
Re: (Score:2)
Nope. This is for commercial use. For that, you need explicit, positive consent. Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.
Re: (Score:2)
Search-engines have a specific exception. OpenAI does not. They just take because they can. Like a shoplifter.
What "specific exception" are you referring to and where can I read it?
Also, ignoring the poor behavior of the 600+ IPs for a moment, how is OpenAI different from a search engine in that they both scrape sites of content for the express purpose of private commercial gain?
Re: (Score:1)
However, using that analogy, you're blaming the person for not locking the door while 600 people walked right in and stole everything down to the bare nubs. Yes, he should have locked the door. That does not at all justify the 600 people who came in and took everything. Yes you should lock your doors, but protection from theft should not be opt-in.
Re: (Score:3)
The issue here is that each and every AI bot requires its own specific config in robots.txt.
They aren't obeying blanket "block all" settings.
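In practice that means naming each crawler's published user agent, something like the following (GPTBot, CCBot, and ClaudeBot are the documented OpenAI, Common Crawl, and Anthropic agents; the list only grows):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /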
Re: (Score:2)
It's easy to deny everything.
User-agent: *
Disallow: /
Are you saying OpenAI is ignoring that? Because that's not what TFA says.
Re: (Score:2)
Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.
Re: (Score:2)
Nobody takes any notice of robots.txt. None of the search engines do, and OpenAI doesn't.
Citation needed.
Google explicitly says that they honor robots.txt.
My own experience is that the major search engines do not crawl excluded directories.
TFA says the site didn't have a proper robots.txt, which implies that OpenAI would've honored it if it had one.
Fucking criminals (Score:2, Insightful)
Not only do they steal whatever they can get their hands on, they also destroy. All commercially, which makes them a criminal enterprise.
Why are they allowed to do this?
Re: (Score:2)
Because the law is 5-10 years behind the technology.
Re: (Score:2)
The search engines aren't using 600+ IPs to scrape all at once. So, no, it wouldn't have happened with any of those. OpenAI is a cancerous tumor.
I've seen Bing overwhelm a few sites that didn't have caching. And there are plenty of random lesser bots and spiders out there.
No, OpenAI shouldn't do that, but it still doesn't mean you shouldn't have an umbrella when it's raining.
High speed crawling is vandalism (Score:2, Insightful)
This is my greatest worry as a songwriter (Score:3)
Music is constructed from repeating patterns, exactly the kind of patterns that LLMs excel at reproducing. The AI companies clearly are charging ahead with training on anything they can get their electronic hands on, copyrights be damned. No doubt the chickens will come home to roost, but it's going to be difficult in the extreme to show exactly where one's copied material was used to create the fake composite.
Having my life's work being sucked out of my control and into the unregulated (and uncompensated) maw of AI is among my greatest worries.
Just Why? (Score:4, Interesting)
Anyway, it's good to see these stories every once in a while so that those who learn by example have some grist for their mills.
I had the same shit from Anthropic's Claudebot (Score:2)
Our plan's limit is 150,000 database queries per hour and Claude was going way beyond that. We went from a normal background of ~10k hits per day to ~1.2M hits per day for 4 days.
Yeah, some dudz are gonna say: "Well, you should have configured your robots.txt properly."
Belt & Suspenders (Score:2)
I make use of robots.txt, but I also return 403 for various bot user-agents.
If OpenAI ever spoofs its user-agent to be a normal browser, that'll mean war and I'll have to do IP-level blocking.
This is an excerpt from my Apache config:
RewriteEngine on
# Return 403 to any request whose User-Agent matches one of these patterns.
RewriteCond %{HTTP_USER_AGENT} bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BaiduSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DataForSeoBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} meta-externalagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CensysI [NC]
RewriteRule ^ - [F]
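And if it does come to IP-level blocking, a sketch of the Apache 2.4 form (the CIDR below is a documentation placeholder, not a real OpenAI range):

<Location "/">
    <RequireAll>
        Require all granted
        # Replace with the crawler's published address ranges.
        Require not ip 203.0.113.0/24
    </RequireAll>
</Location>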
10mins I will never get back, or why is this on /. (Score:2)
So why is this news?
Oh right, it is in vogue to trash on OpenAI, give Google a pass, and throw roses at the feet of Apple. Got it... Grow the F up, people, and mark the original post as a troll.
Copyright BS (Score:2)
> ...an AI bot is taking a website's copyrighted belongings...
If you make content available on the Internet, anyone who views your page has --by definition-- downloaded it.
Don't whine about copyrights when you chose to put it up there. Fire your IT guy who couldn't be bothered to do rate-limiting, and NO, "robots.txt" is not the answer. As the story says, it's a TOTALLY VOLUNTARY thing that nobody HAS TO ADHERE TO.
But hey, Weekend Slashdot and BeauSD. Every week.
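As for that rate-limiting, one hedged sketch using the third-party mod_evasive module for Apache (the thresholds are illustrative, not recommendations):

# Temporarily return 403 to any client exceeding these per-IP thresholds.
DOSPageCount       10
DOSPageInterval    1
DOSSiteCount       100
DOSSiteInterval    1
DOSBlockingPeriod  60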
proofreading (Score:2)
"Triplegangers initially lacked a properly configured robots.txt file, which allowed the bot..."
"Allowed the bot" is the wrong phrase; you meant "politely asked the bot not to". In fact, you correct yourself in the next paragraph, so why did you misspeak at all?