Bluesky's Open API Means Anyone Can Scrape Your Data for AI Training. It's All Public (techcrunch.com) 95
Bluesky says it will never train generative AI on its users' data. But despite that, "one million public Bluesky posts — complete with identifying user information — were crawled and then uploaded to AI company Hugging Face," reports Mashable (citing an article by 404 Media).
"Shortly after the article's publication, the dataset was removed from Hugging Face," the article notes, with the scraper at Hugging Face posting an apology. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." But TechCrunch noted the incident's real lesson. "Bluesky's open API means anyone can scrape your data for AI training," calling it a timely reminder that everything you post on Bluesky is public. Bluesky might not be training AI systems on user content as other social networks are doing, but there's little stopping third parties from doing so...
Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, [but] the company posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"
Mashable notes Bluesky's response to 404Media — that Bluesky is like a website, and "Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here."
So "While many commentators said that data collection should be opt in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use," according to SiliconRepublic.com.
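To make the robots.txt comparison concrete, here is a minimal sketch of how a well-behaved crawler would check a site's stated preferences before fetching a page. The robots.txt contents, the bot names, and the URL are hypothetical and used only for illustration; the parser is Python's standard-library `urllib.robotparser`. The check is something the crawler chooses to run on itself, so nothing here enforces anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and bot names.
url = "https://example.com/profile/alice"
polite_bot_allowed = is_allowed(ROBOTS_TXT, "ExampleAIBot", url)
other_bot_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", url)

print("ExampleAIBot allowed:", polite_bot_allowed)
print("SomeOtherBot allowed:", other_bot_allowed)
```

A crawler that never calls anything like `is_allowed`, or ignores the answer, can still download the page. That is the sense in which such preferences depend on outside developers respecting them.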
"Shortly after the article's publication, the dataset was removed from Hugging Face," the article notes, with the scraper at Hugging Face posting an apology. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." But TechCrunch noted the incident's real lesson. "Bluesky's open API means anyone can scrape your data for AI training," calling it a timely reminder that everything you post on Bluesky is public. Bluesky might not be training AI systems on user content as other social networks are doing, but there's little stopping third parties from doing so...
Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, [but] the company posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"
Mashable notes Bluesky's response to 404Media — that Bluesky is like a website, and "Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here."
So "While many commentators said that data collection should be opt in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use," according to SiliconRepublic.com.
Nice. Give out (Score:2)
Please also give out publicly the stats on any organization using the API extensively.
Re: Nice. Give out (Score:1)
Re: (Score:2)
it's not a hole. it's a window.
And it should be transparent in both directions.
What's wrong with this? (Score:5, Insightful)
It's public data, available for anyone, including AI bots, to peruse and learn from at their will. All this hubbub about AI stealing my shit is just that -- shit. AI, just like anyone, should have the right to view/read/scan any publicly available data, including copyrighted data if available publicly, to learn and grow. What it should not be able to do, just like real people cannot do, is plagiarize that data by using word for word quotes without proper citations. Authors/creators of data have the right to go after plagiarizing AI, just as they do with plagiarizing humans, if they find their work used without proper credit.
Again, if your work is out there for others to freely access and learn from, then those who can learn from it include AI. If you don't like it, don't publicly publish your work.
Re: What's wrong with this? (Score:4, Interesting)
I know the crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI, but this is patently not the case.
Each US poster on Bluesky patently owns their content whether they've asserted the copyright or not.
Re: (Score:2)
By sharing User Content through Bluesky Social, you grant us permission to:
Use User Content to develop, provide, and improve Bluesky Social, the AT Protocol, and any of our future offerings. For example, we can store and present User Content to other users in Bluesky Social. This allows us to show your posts in the Bluesky app to other users;
Modify or otherwise utilize User Content in any media. This includes reproducing, preparing derivative works, distributing, performing, and displaying your User Content. For example, we can resize your posts to fit the Bluesky mobile or desktop app, or feature examples of User Content for promotional purposes; or
Grant others the right to take the actions above. For example, we can grant content moderation tools access to User Content in order to monitor Bluesky Social;
Re: What's wrong with this? (Score:1)
Re: What's wrong with this? (Score:2)
But still doesn't give grifters the "right" to train their AI models.
Re: (Score:2)
Re: What's wrong with this? (Score:2)
If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.
Re: (Score:2)
If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.
You may be surprised to hear this, but a great deal of the world isn't the US. And I get that you are saying it is technically illegal, but good luck proving I took your post (in combination with as many others as I could) and used it to make my own.
Re: (Score:2)
You may be surprised to hear this, but a great deal of the world isn't the US
Copyright law is mostly standardized between countries.
Berne Convention [wikipedia.org]
Re: (Score:2)
TECHNICALLY speaking, you are able to learn from it and use that information to form opinions and views; if not, why are you posting at all?
You are also able to link to that information, since it is posted on the internet, so why can't you quote it on the page? What difference does it make, apart from making it more convenient for the reader? By posting on a public forum you have implicitly granted the public the right to view that post; if you don't want to do that, then don't post publicly.
Re: What's wrong with this? (Score:1)
Re: (Score:1)
Re: (Score:2)
This statement seems to imply just because you post it on the internet you relinquish all copyright rights
No, it implies that reading isn't copying.
If a human reads a website, no one considers that copying. Incidental caching doesn't count.
If a computer reads a website, is that "copying"? So far, that has not been tested in court.
crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI
NFT "crypto bros" and AI developers are different sets of people with little or no overlap.
Re: (Score:2)
The computer isn't "reading it" in anything approximating a human fashion. What is happening is a company is incorporating the content into a statistical model--they are creating something from the content.
Anthropomorphizing an AI model doesn't mean you can spout your "it's reading" BS and expect people to believe it.
Re: (Score:2)
What is happening is a company is incorporating the content into a statistical model
Which is probably what human brains do, too. Brains are neural networks. They aren't exactly the same as AI neural networks, but the foundational concepts of the neural nets we build are based on our understanding of how biological neurons work.
Re: (Score:2)
"No, it implies that reading isn't copying."
That used to be true. With computerized data storage, it is not true any longer.
Re: (Score:2)
With computerized data storage, it is not true any longer.
Human reading of websites causes caching in "computerized data storage". That is not considered copying.
If an AI learned by re-downloading the page each time it was scanned, without caching, would you drop your objections?
Re: (Score:2)
"Human reading of websites causes caching in "computerized data storage". That is not considered copying."
By whom? It certainly looks like copying to me.
Re: (Score:2)
By whom?
By the courts and by law.
Specifically, by Section 512 of Title 17 of the United States Code.
Other countries have their own laws, but browsers are not illegal in any country, and all browsers use caching.
Re: (Score:2)
The law has been trying to stretch laws written when reading and copying were different things by creating arbitrary definitions to classify "copying" as "not copying". This is working about as well as you might expect.
Re: (Score:2)
The law has been trying to stretch laws written when reading and copying were different things by creating arbitrary definitions to classify "copying" as "not copying".
Not really. 17 U.S. Code 512 doesn't attempt to define copying as not copying, it just specifies that there is no liability for "infringement of copyright by reason of the provider’s transmitting, routing, or providing connections for, material through a system or network controlled or operated by or for the service provider, or by reason of the intermediate and transient storage of that material in the course of such transmitting, routing, or providing connections".
So, it's still copying, but the
Re: (Score:2)
It's considered copying if you read a book, then use sentences, phrases, characters, and to some extent concepts present in that book as part of your own work.
AI is not merely "reading" the text, it is ingesting the text explicitly for the purpose of puking it back out upon request. It doesn't even creatively add to the text it eats, just mixes it with other digested words in a grammatically correct order that, to a non-discerning user, appears to be a coherent thought.
That's copying. it also does this witho
Re: (Score:2)
When you read a text book or do research and use that information to write your own paper is that considered copying? Are you cheating or are you learning? Pretty much all books, films, pictures, and music do this; artists, students, everyone takes information they find and formulates ideas. When a student reads a book for an exam, they are reading for the specific purpose of describing it back on request. If they are asked to analyze it, they will also do that, just like an AI would.
Re: (Score:2)
When you read a text book or do research and use that information to write your own paper is that considered copying?
It depends. The rules are not clear-cut, precisely because there are endless possible variations.
For example, if I publish a book about a young wizard named "Harry Potter" who attends a magic school called "Hogwarts" with his friends "Hermione Granger" and "Ron Weasley", I'll definitely be sued and found liable for producing a derivative work which infringes Rowling's copyrights. If I change enough elements of the characters and setting, eventually I'll end up with something so different that it's cons
Re: (Score:2)
> When you read a text book or do research and use that information to write your own paper is that considered copying?
That's not what AI is doing.
When you write a book report or whatever, you are processing the concepts and meaning of the source material into original thoughts and new meanings in a new context. AI does not understand concepts and is not capable of forming original ideas. It doesn't "know" things and it doesn't "learn" in any sense comparable to what humans do. All it does is build a sta
Re: (Score:2)
The AI grifters and shills are the same people who were shilling blockchain stuff last year. Those aren't the same people as the developers, as the grifters and shills wouldn't know how to program a hello world never mind an AI model.
Re: (Score:2)
Re: (Score:2)
This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.
It most demonstrably is.
Re: (Score:3)
If you don’t like the public nature of the internet, then don’t post on social media. It doesn’t matter what contract or belief in copyright you have, when you’ve put something out in public it’s there for all to see, whether it is by a bot or human.
Feel free to hire a lawyer, but I would suggest avoiding public speech first, to save on those bills.
Re: (Score:2)
What good is copyright when you post the content to a web site where you're agreeing to grant the site a forever non-exclusive license to your content for free? Sure, you own it. But you're also not the only one who can sell it.
Re: (Score:2)
>This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.
It absolutely IS true when talking about restricting others from seeing or using said content. If you post your masterpiece painting in a public place, that self same public has every right to photograph and / or view the posted painting. You also can't exclude picked and chosen portion
Re:What's wrong with this? (Score:5, Interesting)
This is the only thing that makes sense. Social networks are for being social. That means putting the info out into the world. If I wanted to make sure nobody was reading what I was writing, with automated tools or manually, I would use E2E encrypted messages, probably using public key cryptography. And then probably not even the recipient would bother to read them :)
The only things people can publish to Bluesky are 1) short text messages, 2) very poor quality images*, and 3) links. Links are by definition to published content, very poor quality images have little value for AI training, and your short text messages are ostensibly intended for public consumption so there was never going to be any stopping people from using them for training no matter where you posted them. You don't need an API to scrape public comments.
* Not only does Bluesky crunch images up at least as badly as Faceboot but when I post images they are replaced by a black square. I'm told this happens with high-res images, but of the three images I've tried to post, only one of them was over XGA resolution. Maybe it's a result of something I'm doing with ublock origin? Irritating AF.
Re: (Score:2)
"Social" and "public" are related, but distinct concepts.
When my wife and I engage in intimate affairs in our bedroom, it is a social activity but it also very VERY private.
I don't know anything about Bluesky other than what I've heard in the news the last few weeks. So you probably make a very valid point about the type of content that people post on Bluesky and whether or not that content is something that a reasonable person would feel protective about. But I do use Facebook to keep in touch with distant
Re: What's wrong with this? (Score:2)
On Bsky you do not control distribution, it is really just Twitter without Elno in most ways and that is one of them. Unlike Twitter, block still works like you expect. Also unlike Twitter you can create lists of users and anyone can subscribe to your list and either block or follow the members. I have two such block lists (you can also just block people without a list if you want) and one is for blocking the really offensive people (mostly MAGA trolls) and the other is for blocking people whose habits irri
Re: (Score:2)
There is a difference between putting something out into public, even if for free, and relinquishing all rights to it. If I freely distribut
Re: (Score:2)
Re: (Score:2)
People are actually posting videos directly to Bsky? Every social media platform's media player is inferior to Youtube's, I don't understand why anyone would do that to begin with, unless they hate the people who would watch.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Public data on a privately-owned website? Yeah, that's not public data.
Sorry, if your data is viewable by the public, either by posting on the internet by you, allowing a public library to digitally loan out, or any other means, your data is available to the public to access and learn from. If you don't agree with that, don't publish or allow your data to be viewed by the public.
Re:What's wrong with this? (Score:4, Insightful)
OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
Re:What's wrong with this? (Score:5, Insightful)
>>OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
You are freely permitted to record OTA signals regardless of copyright (see Sony Corp. of America v. Universal City Studios, Inc. 1984). Distributing is another matter (and it is also an open question of whether AI systems "distribute" the data they have analyzed).
Re: What's wrong with this? (Score:2)
Re: (Score:2)
Hard to imagine a "commercial or profitmaking purpose" that doesn't involve distribution in some form.
Re: (Score:2)
OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
Try watching some tv shows and then making more like them because there are entire fucking industries based on that.
Re: (Score:2)
"others argued that Bluesky data is publicly available anyway and so the dataset is fair use" OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
And it is fair use. Anyone can watch/listen to TV/radio/online video/podcast and use the information learned to write/create more data, as long as they don't directly copy verbatim the data. If you find an AI spewing out your data word for word, you have every right to sue the company in control of that AI, just as you would have every right to sue an individual or corporation who directly copied your data and passed it on without crediting you.
I really don't understand how people can't grasp this. Our e
Re: (Score:2)
But that's exactly what they're doing, copying the data and then using it as input to an "AI." Your whole educational use rationalization is a red herring - the copyright exception only applies to _non-commercial_ educational use. Beyond that, it's hard to argue that training an AI is bona fide "education."
Re: (Score:2)
Re: (Score:2)
There's a difference between a computer "reading" or "learning from" a work and a human doing the same. There is a certain amount of copying necessary just to transmit the website to you and display it on your screen: copies in your computer's RAM, the graphics card's screen buffer, and the pixels on the screen. Those copies are generally agreed to be implicitly authorized as part of being distributed on a website.
However, if you right click on that page and select "save as...", you've now technically cr
Re: (Score:2)
OK. (Score:1)
Just fill it up with nonsense.
Re: (Score:2)
Re: (Score:3, Insightful)
Trumpers...Trump isn't REALLY a criminal, all those illegal things he did wrong are FINE, because...well, he is Trump! That is where the true Trump Derangement is, giving him a pass for being a con artist and criminal. You know that if you lie on your taxes, that CAN get you thrown in prison, don't you?
Re: (Score:2)
Yep. It's pure projection.
They paint onto you what they are guilty of.
Re: (Score:2)
I don't think we need to bring in your ability to believe el Bunko, the Artist. I have to admit that is an amazing ability, do you have any others?
Re: (Score:2)
Epic (Score:2)
Bet your ass AI startups are already doing it. (Score:2)
So dumping the Bluesky data is 1) Free to do, and 2) Legally and morally ambiguous due to intertwined licenses etc.
The true question is why wouldn't they?
Re: (Score:1)
Re: (Score:2)
Bluesky says it will never train generative AI on its users' data
And advertising companies say they never sell your data. Your data is far too valuable to sell. They use it to target ads themselves.
Flashback (Score:3)
We were talking about this in the comments a month ago.
https://slashdot.org/comments.... [slashdot.org]
Somebody is going to get your data (Score:2)
Your data is going to get used to train AI to replace you. That's just a fact of modern life. The real problem is we never get a piece of the action.
Re: (Score:2)
The one thing that I haven't pointed out to the Bluesky crowd: They're having a discussion with the person who made the dataset. Rather than pushing the guy to block the dataset (which anybody else can secretly make anyway), it's an opportunity to have some grass-roots discussions about ethical use, like "Hey, it's OK, but please anonymize user names, etc."
No casual user without a legal budget has a chance at having a discussion with Meta, OpenAI, Anthropic or Google about their data collection procedures.
You know what? I wouldn't mind, if not... (Score:3)
I wouldn't so much mind all my data being sucked up by the AI training / aggregation routines if not for the fact that they are "owned" by some of the greediest, most self-centered assholes to have ever crawled up out of the slime of the rest of humanity to positions of power. I'd happily feed my manuscripts, such as they are, to an open source / truly free AI, meant to be a public good. But all of these fucking things right now are owned by massive capitalist institutions with mouthpieces that make the Gilded Age masters look like kind-hearted liberal-oriented humanitarians. Yes, I get that it takes money to run these "eats more power per second than entire neighborhoods use in a year" systems, but what good is it doing other than continuing to pull wealth from the entirety of society in order to continue to feed those who have plenty? If AI is going to replace us all, what's the benefit to those of us not already in the owner class? Like it or not, society is built on the shoulders of the lower and middle classes. If the owner class manages to find a way to not need the lower and middle classes through AI or any other means, what's the end-game for us?
The small price of interoperability (Score:4, Insightful)
Whining that the data is accessible is something I expect from movie execs. Now techies too?
Oh noes, we have access to the data, because it's not locked down in a secure enclave! (Data that 100% of the users deliberately uploaded so that it could be read [wikipedia.org] by others.)
Re: (Score:2)
Whining that the data is accessible is something I expect from movie execs. Now techies too?
Yeah, publicly accessible data is publicly accessible. Shocker.
Yep (Score:2)
And your website if you have one. You can bet someone's ignoring your robots file. And google and X, microsoft and all the Meta and your phone, good god, y'all. Every app. Oh and email, never been private.
If I want to find out I can. If they do, they can. If you do, you can, hire a PI. I'm a little more than over it. This is fear mongering, if you weren't aware, here it is. If you're just now afraid. Sorry kid, it gets worse. The heart grows cold.
Never fear! It's fine, it's fine. (Score:2)
Dr. Kleiner says the huggy face humper has been fully debeaked.
Transfering guilt (Score:2)
Re: (Score:2)
Information wants to be free! (Score:2)
When did /. get infected? (Score:2)
Can't have it both ways (Score:2)
People complained when Twitter/X restricted their APIs. Now people are complaining that Bluesky doesn't restrict their APIs. Which one do you want?
Twitter / X (Score:2)
Ever heard of the X (formerly Twitter) firehose API? Everybody who pays enough gets all of X in realtime.
And for federated networks ... everything using ActivityPub even pushes new content to your node.
Ha (Score:1)