Social Networks AI

Bluesky's Open API Means Anyone Can Scrape Your Data for AI Training. It's All Public (techcrunch.com) 109

Bluesky says it will never train generative AI on its users' data. But despite that, "one million public Bluesky posts — complete with identifying user information — were crawled and then uploaded to AI company Hugging Face," reports Mashable (citing an article by 404 Media).

"Shortly after the article's publication, the dataset was removed from Hugging Face," the article notes, with the scraper at Hugging Face posting an apology. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." But TechCrunch noted the incident's real lesson. "Bluesky's open API means anyone can scrape your data for AI training," calling it a timely reminder that everything you post on Bluesky is public. Bluesky might not be training AI systems on user content as other social networks are doing, but there's little stopping third parties from doing so...

Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, [but] the company posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"

Mashable notes Bluesky's response to 404 Media — that Bluesky is like a website, and "Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here."

So "While many commentators said that data collection should be opt in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use," according to SiliconRepublic.com.

  • Please also give out publicly the stats on any organization using the API extensively.

  • by bjwest ( 14070 ) on Monday December 02, 2024 @07:59AM (#64984791)

    It's public data, available for anyone, including AI bots, to peruse and learn from at their will. All this hubbub about AI stealing my shit is just that -- shit. AI, just like anyone, should have the right to view/read/scan any publicly available data, including copyrighted data if available publicly, to learn and grow. What it should not be able to do, just like real people cannot do, is plagiarize that data by using word for word quotes without proper citations. Authors/creators of data have the right to go after plagiarizing AI, just as they do with plagiarizing humans, if they find their work used without proper credit.

    Again, if your work is out there for others to freely access and learn from, then those who can learn from it include AI. If you don't like it, don't publicly publish your work.

    • by magamiako1 ( 1026318 ) on Monday December 02, 2024 @08:03AM (#64984797)
      This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.

      I know the crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI, but this is patently not the case.

      Each US poster on Bluesky patently owns their content whether they've asserted the copyright or not.
      • What's the point of owning the content if you freely license it out to be used? From https://bsky.social/about/supp... [bsky.social]:

        By sharing User Content through Bluesky Social, you grant us permission to:

        Use User Content to develop, provide, and improve Bluesky Social, the AT Protocol, and any of our future offerings. For example, we can store and present User Content to other users in Bluesky Social. This allows us to show your posts in the Bluesky app to other users;

        Modify or otherwise utilize User Content in any media. This includes reproducing, preparing derivative works, distributing, performing, and displaying your User Content. For example, we can resize your posts to fit the Bluesky mobile or desktop app, or feature examples of User Content for promotional purposes; or

        Grant others the right to take the actions above. For example, we can grant content moderation tools access to User Content in order to monitor Bluesky Social;

        • This grants permission to Bluesky, but does not automatically give permission to anyone else. Most of these provisions are necessary for normal operation. I do wish these were not as broad though.
          • Yeah. The unfortunate legalese that exists in order to cover the concept of having an app to view the content is a dumb requirement, but has to be there in order to cover themselves.

            But still doesn't give grifters the "right" to train their AI models.
          • The permission is the same one that lets me copy and paste your comment (and anything else displayed on the internet) and do whatever I want with it. You put it out there and they took it even though you asked nicely not to.
            • TECHNICALLY speaking you aren't legally allowed to do that. I know it's generally not something people follow up on due to limitations of effort, cost, and time; but the point stands.

              If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.
              • If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.

                You may be surprised to hear this, but a great deal of the world isn't the US. I get that you are saying it is technically illegal, but good luck proving I took your post (in combination with as many others as I could) and used it to make my own.

              • TECHNICALLY speaking you are able to learn from it and use that information to form opinions and views; if not, why are you posting at all?

                You are also able to link to that information, since it is posted on the internet, so why can't you quote it on the page? What difference does it make, apart from making it more convenient for the reader? By posting on a public forum you have implicitly granted the public the right to view that post; if you don't want to do that, then don't post publicly.

                • You're deliberately skipping a ton of case law and nuance. There's the case of attribution too. It's one thing to quote Harry Potter, link to source or summarise the plot. It's a whole different thing if you copy paste the entire book and claim it as something you made. Even if some steamy shipping was hallucinated into the output, the result is hardly transformative enough to fly under fair use. There's also other pillars of copyright that are affected. How substantial the similarity to existing work is.
                  • by Anonymous Coward

                    You would be surprised how much of Harry Potter most models do not know. Especially for the smaller ones, you can do simple estimates. Take a model with 10 GB weights (e.g. Llama3 8B), then take content like let's say a compressed Wikipedia dump (24GB compressed*) and wonder how much you can lossy compress it more. Then think about the size of the training material (e.g. 15 trillion tokens for Llama3) and you can choose which 10 GB you put into the weights (minus the amount that is needed for the smarts of

          • That depends on your interpretation of their legalese; is AI training "preparing derivative works"? Or is sharing the content with AI models "distributing"? IANAL... If training AI models is allowed under those terms, then Bluesky can make your data available to others to train AI models ("Grant others the right to take the actions above.").
      • This statement seems to imply just because you post it on the internet you relinquish all copyright rights

        No, it implies that reading isn't copying.

        If a human reads a website, no one considers that copying. Incidental caching doesn't count.

        If a computer reads a website, is that "copying"? So far, that has not been tested in court.

        crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI

        NFT "crypto bros" and AI developers are different sets of people with little or no overlap.

        • by Barny ( 103770 )

          The computer isn't "reading it" in anything approximating a human fashion. What is happening is a company is incorporating the content into a statistical model--they are creating something from the content.

          Anthropomorphizing an AI model doesn't mean you can spout your "it's reading" BS and expect people to believe it.

          • What is happening is a company is incorporating the content into a statistical model

            Which is probably what human brains do, too. Brains are neural networks. They aren't exactly the same as AI neural networks, but the foundational concepts of the neural nets we build are based on our understanding of how biological neurons work.

        • "No, it implies that reading isn't copying."

          That used to be true. With computerized data storage, it is not true any longer.

          • With computerized data storage, it is not true any longer.

            Human reading of websites causes caching in "computerized data storage". That is not considered copying.

            If an AI learned by re-downloading the page each time it was scanned, without caching, would you drop your objections?

            • "Human reading of websites causes caching in "computerized data storage". That is not considered copying."

              By whom? It certainly looks like copying to me.

              • By whom?

                By the courts and by law.

                Specifically, by Section 512 of Title 17 of the United States Code.

                Other countries have their own laws, but browsers are not illegal in any country, and all browsers use caching.

                • The law has been trying to stretch laws written when reading and copying were different things by creating arbitrary definitions to classify "copying" as "not copying". This is working about as well as you might expect.

                  • The law has been trying to stretch laws written when reading and copying were different things by creating arbitrary definitions to classify "copying" as "not copying".

                    Not really. 17 U.S. Code 512 doesn't attempt to define copying as not copying, it just specifies that there is no liability for "infringement of copyright by reason of the provider’s transmitting, routing, or providing connections for, material through a system or network controlled or operated by or for the service provider, or by reason of the intermediate and transient storage of that material in the course of such transmitting, routing, or providing connections".

                    So, it's still copying, but the

            • It's considered copying if you read a book, then use sentences, phrases, characters, and to some extent concepts present in that book as part of your own work.

              AI is not merely "reading" the text, it is ingesting the text explicitly for the purpose of puking it back out upon request. It doesn't even creatively add to the text it eats, just mixes it with other digested words in a grammatically correct order that, to a non-discerning user, appears to be a coherent thought.

              That's copying. it also does this witho

              • When you read a text book or do research and use that information to write your own paper is that considered copying? Are you cheating or are you learning? Pretty much all books, films, pictures, and music do this; artists, students, everyone takes information they find and formulates ideas. When a student reads a book for an exam, they are reading for the specific purpose of describing it back on request. If they are asked to analyze it, they will also do that, just like an AI would.

                • When you read a text book or do research and use that information to write your own paper is that considered copying?

                  It depends. The rules are not clear-cut, precisely because there are endless possible variations.

                  For example, if I publish a book about a young wizard named "Harry Potter" who attends a magic school called "Hogwarts" with his friends "Hermione Granger" and "Ron Weasley", I'll definitely be sued and found liable for producing a derivative work which infringes Rowling's copyrights. If I change enough elements of the characters and setting, eventually I'll end up with something so different that it's cons

                • > When you read a text book or do research and use that information to write your own paper is that considered copying?

                  That's not what AI is doing.

                  When you write a book report or whatever, you are processing the concepts and meaning of the source material into original thoughts and new meanings in a new context. AI does not understand concepts and is not capable of forming original ideas. It doesn't "know" things and it doesn't "learn" in any sense comparable to what humans do. All it does is build a sta

        • The AI grifters and shills are the same people who were shilling blockchain stuff last year. Those aren't the same people as the developers, as the grifters and shills wouldn't know how to program a hello world never mind an AI model.

          • If only it were that easy... you could just identify the set of grifters and shills and ignore them. Unfortunately, it's an almost -- but not quite -- entirely different set of grifters and shills.
      • This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.

        It most demonstrably is.

      • If you don’t like the public nature of the internet, then don’t post on social media. It doesn’t matter what contract or belief in copyright you have, when you’ve put something out in public it’s there for all to see, whether it is by a bot or human.

        Feel free to hire a lawyer, but I would suggest avoiding public speech first, to save on those bills.

      • What good is copyright when you post the content to a web site where you're agreeing to grant the site a forever non-exclusive license to your content for free? Sure, you own it. But you're also not the only one who can sell it.

      • >This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.

        It absolutely IS true when talking about restricting others from seeing or using said content. If you post your masterpiece painting in a public place, that self same public has every right to photograph and / or view the posted painting. You also can't exclude picked and chosen portion

    • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Monday December 02, 2024 @08:42AM (#64984847) Homepage Journal

      This is the only thing that makes sense. Social networks are for being social. That means putting the info out into the world. If I wanted to make sure nobody was reading what I was writing, with automated tools or manually, I would use E2E encrypted messages, probably using public key cryptography. And then probably not even the recipient would bother to read them :)

      The only things people can publish to Bluesky are 1) short text messages, 2) very poor quality images*, and 3) links. Links are by definition to published content, very poor quality images have little value for AI training, and your short text messages are ostensibly intended for public consumption so there was never going to be any stopping people from using them for training no matter where you posted them. You don't need an API to scrape public comments.

      * Not only does Bluesky crunch images up at least as badly as Faceboot but when I post images they are replaced by a black square. I'm told this happens with high-res images, but of the three images I've tried to post, only one of them was over XGA resolution. Maybe it's a result of something I'm doing with ublock origin? Irritating AF.

      • "Social" and "public" are related, but distinct concepts.

        When my wife and I engage in intimate affairs in our bedroom, it is a social activity but it also very VERY private.

        I don't know anything about Bluesky other than what I've heard in the news the last few weeks. So you probably make a very valid point about the type of content that people post on Bluesky and whether or not that content is something that a reasonable person would feel protective about. But I do use Facebook to keep in touch with distant

        • On Bsky you do not control distribution, it is really just Twitter without Elno in most ways and that is one of them. Unlike Twitter, block still works like you expect. Also unlike Twitter you can create lists of users and anyone can subscribe to your list and either block or follow the members. I have two such block lists (you can also just block people without a list if you want) and one is for blocking the really offensive people (mostly MAGA trolls) and the other is for blocking people whose habits irri

      • by flink ( 18449 )

        This is the only thing that makes sense. Social networks are for being social. That means putting the info out into the world.... ... and your short text messages are ostensibly intended for public consumption so there was never going to be any stopping people from using them for training no matter where you posted them. You don't need an API to scrape public comments.

        There is a difference between putting something out into public, even if for free, and relinquishing all rights to it. If I freely distribut

      • If you think images are poor quality, you haven't seen what it does to videos...
        • People are actually posting videos directly to Bsky? Every social media platform's media player is inferior to Youtube's, I don't understand why anyone would do that to begin with, unless they hate the people who would watch.

          • People do that, me included, for game development videos in this case. The answer is easy: uploading directly, you don't inflict on anyone the pain of loading up the YouTube app on their phone, as it's heavyweight and you probably get ads there. YouTube is far more flexible and allows better quality and length, but I'd rather post the video directly even if I don't get the stated benefits and the extra views recorded. On the desktop it doesn't matter all that much, but I bet the vast majority is browsing on th
    • Public data on a privately-owned website? Yeah, that's not public data.
      • by bjwest ( 14070 )

        Public data on a privately-owned website? Yeah, that's not public data.

        Sorry, if your data is viewable by the public, either by posting on the internet by you, allowing a public library to digitally loan out, or any other means, your data is available to the public to access and learn from. If you don't agree with that, don't publish or allow your data to be viewed by the public.

    • by msauve ( 701917 ) on Monday December 02, 2024 @08:44AM (#64984851)
      "others argued that Bluesky data is publicly available anyway and so the dataset is fair use"

      OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
      • by Comboman ( 895500 ) on Monday December 02, 2024 @09:04AM (#64984889)

        >>OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.

        You are freely permitted to record OTA signals regardless of copyright (see Sony Corp. of America v. Universal City Studios, Inc. 1984). Distributing is another matter (and it is also an open question of whether AI systems "distribute" the data they have analyzed).

        • "Sony" applied to personal, non-commercial use. As the ruling stated: "If the Betamax were used to make copies for a commercial or profitmaking purpose, such use would presumptively be unfair."
          • Hard to imagine a "commercial or profitmaking purpose" that doesn't involve distribution in some form.

            • by jezwel ( 2451108 )

              Hard to imagine a "commercial or profitmaking purpose" that doesn't involve distribution in some form.

              IMDB/TVDB.

              It would surprise me if they aren't using LLMs to create new pages for human curation.

      • OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.

        Try watching some tv shows and then making more like them because there are entire fucking industries based on that.

      • by bjwest ( 14070 )

        "others argued that Bluesky data is publicly available anyway and so the dataset is fair use" OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.

        And it is fair use. Anyone can watch/listen to TV/radio/online video/podcast and use the information learned to write/create more data, as long as they don't directly copy verbatim the data. If you find an AI spewing out your data word for word, you have every right to sue the company in control of that AI, just as you would have every right to sue an individual or corporation who directly copied your data and passed it on without crediting you.

        I really don't understand how people can't grasp this. Our e

        • by msauve ( 701917 )
          "as long as they don't directly copy verbatim the data"

          But that's exactly what they're doing, copying the data and then using it as input to an "AI." Your whole educational use rationalization is a red herring - the copyright exception only applies to _non-commercial_ educational use. Beyond that, it's hard to argue that training an AI is bona fide "education."
    • Disagree. Humans have rights. Companies, machines, tools, kitchen appliances etc. exist for our use and benefit and are subject to any arbitrary rules we choose to impose. If you don't like that, find someplace to live that isn't subject to constitutional law.
    • by flink ( 18449 )

      There's a difference between a computer "reading" or "learning from" a work and a human doing the same. There is a certain amount of copying necessary just to transmit the website to you and display it on your screen: copies in your computer's RAM, the graphics card's screen buffer, and the pixels on the screen. Those copies are generally agreed to be implicitly authorized as part of being distributed on a website.

      However, if you right click on that page and select "save as...", you've now technically cr

    • by vlad30 ( 44644 )
      An AI getting data reminds me of a scene in Short Circuit where the robot enters a bookstore and reads all the books, learning just as a human does, only much faster. Humans can read off the internet and learn, and so can a robot/AI, which is essentially what the AI is doing, just at much faster than human speed. One way or another, AIs will get the information and, better than that, they will be able to share and transfer it to each other faster than humans ever could.
    • Agreed. I also have a website and a couple of blogs; those are public on the web and anyone can access them, human or bot. On the other hand, if I want to show a friend a picture on my Instagram, the friend is forced to make an account and, in the process, give a lot of personal data to Meta.

  • Just fill it up with nonsense.

  • Just epic.
  • So dumping the Bluesky data is 1) Free to do, and 2) Legally and morally ambiguous due to intertwined licenses etc.
    The true question is why wouldn't they?

    • by Bgward ( 7183574 )
      Maybe Bluesky says it will never train generative AI on its users' data so that users know they don't have an incentive to make Bluesky more suitable for that. Of course, because it is in the open, it is difficult to prevent third parties from scraping for whatever reason (legal or not). If everybody clearly knows about all that, they can act as they see fit. Like not posting stuff on Bluesky that should remain more private.
      • Bluesky says it will never train generative AI on its users' data

        And advertising companies say they never sell your data. Your data is far too valuable to sell. They use it to target ads themselves.

    • They do this because they do this with ALL the Bluesky competitors, and all the social media competitors. Why is this news? Is this just a story saying "1723rd company discovered to have its data mined by AI startups"?

  • by bill_mcgonigle ( 4333 ) * on Monday December 02, 2024 @08:35AM (#64984833) Homepage Journal

    We were talking about this in the comments a month ago.

    https://slashdot.org/comments.... [slashdot.org]

  • If you go to Twitter, Elon Musk gets it; if you go to Facebook, Mark Zuckerberg gets it. I'd rather everyone get it.

    Your data is going to get used to train AI to replace you. That's just a fact of modern life. The real problem is we never get a piece of the action.
    • The one thing that I haven't pointed out to the Bluesky crowd: They're having a discussion with the person who made the dataset. Rather than pushing the guy to block the dataset (which anybody else can secretly make anyway), it's an opportunity to have some grass-roots discussions about ethical use, like "Hey, it's OK, but please anonymize user names, etc."

      No casual user without a legal budget has a chance at having a discussion with Meta, OpenAI, Anthropic or Google about their data collection procedures.

    • I'd like to see some A/B testing. One AI that only trains on the conspiratorial nonsense that Twitter is spewing by default, and one trained on lots of different social media but without Twitter. (or X, you know what I mean)

    • Your data is going to get used to train AI to replace you.

      Ha ha, wait until they find out I'm not really a professional shitposter.

  • by nightflameauto ( 6607976 ) on Monday December 02, 2024 @09:17AM (#64984909)

    I wouldn't so much mind all my data being sucked up by the AI training / aggregation routines if not for the fact that they are "owned" by some of the greediest, most self-centered assholes to have ever crawled up out of the slime of the rest of humanity to positions of power. I'd happily feed my manuscripts, such as they are, to an open source / truly free AI, meant to be a public good. But all of these fucking things right now are owned by massive capitalist institutions with mouthpieces that make the Gilded Age masters look like kind-hearted liberal-oriented humanitarians. Yes, I get that it takes money to run these "eats more power per second than entire neighborhoods use in a year" systems, but what good is it doing other than continuing to pull wealth from the entirety of society in order to continue to feed those who have plenty? If AI is going to replace us all, what's the benefit to those of us not already in the owner class? Like it or not, society is built on the shoulders of the lower and middle classes. If the owner class manages to find a way to not need the lower and middle classes through AI or any other means, what's the end-game for us?

  • by Sloppy ( 14984 ) on Monday December 02, 2024 @09:22AM (#64984929) Homepage Journal

    Whining that the data is accessible is something I expect from movie execs. Now techies too?

    Oh noes, we have access to the data, because it's not locked down in a secure enclave! (Data that 100% of the users deliberately uploaded so that it could be read [wikipedia.org] by others.)

    • Whining that the data is accessible is something I expect from movie execs. Now techies too?

      Yeah, publicly accessible data is publicly accessible. Shocker.

  • And your website if you have one. You can bet someone's ignoring your robots file. And google and X, microsoft and all the Meta and your phone, good god, y'all. Every app. Oh and email, never been private.
    If I want to find out I can. If they do, they can. If you do, you can, hire a PI. I'm a little more than over it. This is fear mongering, if you weren't aware, here it is. If you're just now afraid. Sorry kid, it gets worse. The heart grows cold.

  • Dr. Kleiner says the huggy face humper has been fully debeaked.

  • Well, anything that has a public API can be used to train on your data. Bluesky is actually cool to be open; how on earth is it related to bad actors training on public data?
    • I was thinking the same... If you publish/post stuff to be read by anyone, then that would be public domain information. I don't see how this is different than posting to Shitter. So the only difference between "reading" and scraping is that scraping is automated and large scale.
  • No! Not like that!!
  • I remember when slashdot was just intelligent discussion by intelligent people. When did the right wing idiots start showing up?!
  • People complained when Twitter/X restricted their APIs. Now people are complaining that Bluesky doesn't restrict their APIs. Which one do you want?

  • Ever heard of the X (formerly Twitter) firehose API? Everybody who pays enough gets all of X in realtime.

    And for federated networks ... everything using ActivityPub even pushes new content to your node.

  • What funded this "miracle site" for those desperate for a place they can ban anyone who disagrees with them? Marketing!
  • I fail to see the problem; the open web was meant to be accessible by anyone. I am tired of walled gardens, where people are forced to sign up and give a lot of personal data to some corporations. I recently made a Bluesky account... guess what? The only private info I had to provide was an email address, everything else is optional.

  • You put something on a web server for the world to see. That kinda was the point.

    Now with copyrighted pieces of art this is a different kind of story, but with text I don't have an issue that AIs are trained on it.
