Bluesky Proposes 'New Standard' for Scraping Data for AI Training (techcrunch.com) 44

An anonymous reader shared this article from TechCrunch: Social network Bluesky recently published a proposal on GitHub outlining new options it could give users to indicate whether they want their posts and data to be scraped for things like generative AI training and public archiving.

CEO Jay Graber discussed the proposal earlier this week, while on-stage at South by Southwest, but it attracted fresh attention on Friday night, after she posted about it on Bluesky. Some users reacted with alarm to the company's plans, which they saw as a reversal of Bluesky's previous insistence that it won't sell user data to advertisers and won't train AI on user posts.... Graber replied that generative AI companies are "already scraping public data from across the web," including from Bluesky, since "everything on Bluesky is public like a website is public." So she said Bluesky is trying to create a "new standard" to govern that scraping, similar to the robots.txt file that websites use to communicate their permissions to web crawlers...

If a user indicates that they don't want their data used to train generative AI, the proposal says, "Companies and research teams building AI training sets are expected to respect this intent when they see it, either when scraping websites, or doing bulk transfers using the protocol itself."
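
To make the mechanism concrete, here is a minimal Python sketch of a crawler-side check that combines the existing robots.txt convention with a hypothetical per-account preference of the kind the proposal describes. The robots.txt handling uses Python's standard library; the "ai-training-intent" field name and the "ExampleAIBot" user agent are illustrative placeholders, not names taken from Bluesky's GitHub proposal.

    # Minimal sketch: a scraper that checks both the site-wide robots.txt and a
    # hypothetical per-account AI-training preference before using any posts.
    # "ai-training-intent" and "ExampleAIBot" are placeholders for illustration.
    from urllib import robotparser

    CRAWLER_USER_AGENT = "ExampleAIBot"  # hypothetical AI-training crawler

    def site_allows_crawl(site_root: str, path: str) -> bool:
        """Honor the site-wide robots.txt, the existing web convention."""
        rp = robotparser.RobotFileParser()
        rp.set_url(site_root.rstrip("/") + "/robots.txt")
        rp.read()  # fetches and parses the robots.txt file
        return rp.can_fetch(CRAWLER_USER_AGENT, path)

    def account_allows_ai_training(account_record: dict) -> bool:
        """Honor a per-account signal of the kind Bluesky proposes.

        Treating a missing value as "disallow" is a policy choice made here
        for illustration; the proposal leaves defaults to be worked out.
        """
        return account_record.get("ai-training-intent") == "allow"

    def may_use_for_training(site_root: str, path: str, account_record: dict) -> bool:
        return site_allows_crawl(site_root, path) and account_allows_ai_training(account_record)

Like robots.txt itself, such a flag is advisory rather than enforceable: nothing technically stops a scraper that ignores it, which is why the proposal only says that companies are "expected to respect this intent."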

Over on Threads, someone had a different wish for our AI-enabled future: "I want to be able to conversationally chat to my feed algorithm. To be able to explain to it the types of content I want to see, and what I don't want to see. I want this to be an ongoing conversation as it refines what it shows me, or my interests change."

"Yeah I want this too," posted top Instagram/Threads executive Adam Mosseri, who said he'd talked about the idea with VC Sam Lessin. "There's a ways to go before we can do this at scale, but I think it'll happen eventually."

Comments Filter:
  • Royalties (Score:5, Insightful)

    by Njovich ( 553857 ) on Monday March 17, 2025 @03:53AM (#65239325)

    The laws are clear, and that's why Google and the like are asking for an exemption from the law: https://www.theverge.com/news/... [theverge.com]

    We need a way for Google and others to pay fair royalties to the people whose work their AI is trained on, both for the training itself and for token generation.

    Obviously that's a billion people and it won't amount to much per person, but it's ludicrous to just let them take other people's data and profit off of it without any form of compensation.

    The argument that "the Chinese are doing it, so we need to do it as well" sounds like begging for a race to the bottom. China allows US films, software, and games to be pirated without taking any action; does that mean we should do the same?

    • Good article (Score:5, Insightful)

      by evanh ( 627108 ) on Monday March 17, 2025 @04:46AM (#65239341)

      And it's funny how the wording of the exemption requests is all in the future tense, as if the copyright infringements have not yet occurred.

    • The laws are clear

      No, they are not.

      The laws are ambiguous, and whether "training" is "copying" has not been established.

      • It's a digital representation, so it's a copy. I don't see the difference. If it is so great, then it could reproduce the original content. It just hasn't been allowed to yet.
        • You are stating an opinion rather than citing established case law, so you appear to agree with me that the law is not settled.

          • Re: Royalties (Score:2, Insightful)

            The law is now under the control of greased palms, so I can't predict that. It literally depends on how much money they send Trump's way.
            • You seem to think that only the Republicans are on the take. It's clear that everyone in DC is on the take; the only difference is which team you're on, and thus whether or not you care.

              They are all hypocrites, talking out of whichever side of their mouths gets them elected again, so they can grift off the taxpayers.

              "But the other side is worse" you might say. Sure, Fine. The shit sandwich is better over there with you.

    • by Rei ( 128717 )

      "The laws are clear" that AI can't be trained on copyrighted material without paying royalties? The only case that's finished (the LAION case) has ruled *against* a requirement that AI can't train on copyrighted material. And in general, of the (numerous) other cases in progress, they in general haven't been going very well for the plaintiffs. Automated processing of copyrighted material to make novel goods and services has long been legal. For an extreme case, read the Google Books ruling.

      You also miss

  • by Pinky's Brain ( 1158667 ) on Monday March 17, 2025 @04:43AM (#65239339)

    There is zero legal justification for opt-in copyright protection. Either training on scraped data from the public internet is fair use, and only a click-through license can protect the content, or it isn't fair use and needs no further protection.

    Bluesky forcing users to opt in or de facto give up their rights is a betrayal of its users.

  • by martin-boundary ( 547041 ) on Monday March 17, 2025 @04:47AM (#65239343)
    public != public domain.

    But I repeat myself.

  • Public is public (Score:3, Informative)

    by bradley13 ( 1118935 ) on Monday March 17, 2025 @05:54AM (#65239385) Homepage

    Do you know how human artists learn their craft? They spend a lot of time looking at work by other artists. Guess what: they don't buy copies or pay royalties for everything that they look at. I don't see why this should be different for AI.

    If you put something onto the public internet, it is going to get looked at. If you don't want it looked at, either put it behind a paywall, or don't put it on the internet. It's that easy.

    • by mccalli ( 323026 )
      OK, but then let me do the same. We can't have constant copyright strikes for regurgitating bits of music while simultaneously allowing large tracts of other copyrighted work to be output by an LLM-based AI.

      I regularly watch synthesiser reviews. Something as common a sound as a filter sweep [youtube.com] (a single dial on most synths; hold down a note and turn it from high to low, or low to high, depending on what you're going for) regularly gets copyright strikes. So OK, let it cut both ways and not need to be some big-company AI deal and
      • by Rei ( 128717 )

        Something as common a sound as a filter sweep [youtube.com] (a single dial on most synths; hold down a note and turn it from high to low, or low to high, depending on what you're going for) regularly gets copyright strikes

        And it shouldn't - that's not copyrightable. You're right, that's an abusive system. But the fix isn't to make it even more abusive.

        Also, one thing I don't think a lot of people understand when they try to impose strict regulations on AI development is... you're hitting Open Source far harder than th

    • by fluffernutter ( 1411889 ) on Monday March 17, 2025 @07:43AM (#65239483)
      A lot of the stuff AI is scraping was downloaded through torrent tracking sites and not put there by the artist/author.
    • Re:Public is public (Score:4, Interesting)

      by ZiggyZiggyZig ( 5490070 ) on Monday March 17, 2025 @09:39AM (#65239621)

      Do you know how human artists learn their craft? They spend a lot of time looking at work by other artists.

      I am tired of seeing this false equivalence argument [wikipedia.org]. A human artist is not the same as an LLM. Let me take myself as a personal example here. I trained for 10 years as an artist. I took formal education in art, I read many books, watched many movies, etc. I took inspiration from many sources, most if not all of which did not give me prior authorization to use them as a source of inspiration. That is exactly what happens when you say "we spend a lot of time looking at work by other artists."

      Now, with all this inspiration I took in, digested, and interpreted, I make my own art (writing, movies, music). This art probably resembles some prior works that I have seen, read, or heard at some point. It is not the same, just like an LLM does not make copies or directly plagiarize its training data. Up to this point, I, a human artist, and an LLM are the same - I agree with that.

      Now, this is me, an artist, an individual person. I promote my art, try to get exhibitions, go to events, sell my stuff. Getting my art known and making a living from it is a full-time job, ironically leaving little time for finding further sources of inspiration and making new art. And it took me about 10 years of training to reach a point where I have some financial stability.

      An LLM is not an artist. It does not go out into the world as its own person, trying to promote its personal way of seeing the world (I am still stating the obvious up to now). Instead, it offers seemingly unlimited artworks, produced more or less instantly, to whoever is able to prompt it correctly.

      Imagine that there were some sort of Amazon Mechanical Turk for artists. Each human artist would be a single instance of this Turk, and each would need to be able to deliver, in an instant, a drawing, a piece of music, a video, a photo... based on a prompt. Let's put aside the fact that no human artist can instantly create a piece of work, ever. There remains the fact that a human artist can only serve one prompt at a time. The capacity of a single human is very limited; the capacity of an AI is effectively infinite. An AI is not a human artist.

      Therefore, framing the debate as whether an AI is doing the same thing as a human artist when it gets trained is not OK. The fact that human artists don't pay royalties to train on copyrighted data does not mean that an AI, which has vastly more capacity, should be treated the same.

      And about the internet - we did not put our stuff online to be plundered by bots. We put it there with the intent of sharing it with other human beings. Bots have been tolerated when they brought us added value - Google would send us visitors, so we would let it download, analyze, and index our stuff. It's a kind of informal contract - the internet used to be about informality. We give it something, it gives us something in return. (And yes, contracts can be informal.)

      What does AI bring to those of us who put things online? Nothing. And if the informal arrangement is broken, then a formal contract needs to be enforced. Sorry, my data is not meant to be used for training a machine that will not bring any visitors to my online realm, offloading to me the cost of data storage and maintenance (because that's what an LLM does: it offloads to webmasters the cost of storing and maintaining its training dataset), with nothing in return. If online actors can't self-regulate (robots.txt comes to mind), then we need the power of law.

      • Your argument is that even if the LLM uses the information the same way, as training data, it's different not because of how the information is used but because an AI can churn out content endlessly? Is that your argument?
  • by fluffernutter ( 1411889 ) on Monday March 17, 2025 @07:41AM (#65239477)
    If AI can download for free, then everyone else had better be able to, that's all I can say. Capitalism should apply the same to everyone, or we should switch to another system.
    • If AI can download for free, then everyone else had better be able to

      The issue is stuff that everyone else can already download for free.

      Can an AI learn from material on the web that is already viewable by the public?

      Capitalism should apply the same to everyone

      That's what the capitalists are saying.

      It's the artists and authors claiming that "AI is different."

      • No one can download it for free. They can have it in their cache and they can read it. But no one is allowed to scrape it.
        • No one can download it for free. They can have it in their cache and they can read it.

          You can't have it in your cache without downloading it.

          But no one is allowed to scrape it.

          It is legal to scrape most public-facing websites. Search engines do it every day.

          • How many people can make use of the things in a cache? The cache's designed purpose is to serve the function of displaying the webpage, not direct consumption. Also, is the text from a site even in the cache?
        • Transient copying for consumption is allowed under copyright law.

          They can have it in their cache and they can read it.

          So after retrieving it, they can keep it for a bit to read.

          But no one is allowed to scrape it.

          But they can't retrieve it first? "Scraping" is an HTTP request. Maybe a regex or two on top if you aren't using human eyes to focus on specific text.
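
          For what it's worth, that description is nearly literal. A minimal sketch using only the Python standard library (the URL and the regex are placeholders, not anyone's actual pipeline):

              # "Scraping" as one HTTP request plus a regex; placeholders only.
              import re
              import urllib.request

              url = "https://example.com/"  # placeholder target page
              with urllib.request.urlopen(url) as resp:
                  html = resp.read().decode("utf-8", errors="replace")

              # The "regex or two on top": pull out the page title.
              match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
              print(match.group(1).strip() if match else "no title found")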

    • You're absolutely right. Capitalism has problems, and thus we should switch to a better system. I, for one, fully advocate for the immediate transition to perfecteconomyism. It was invented by Michael McDoesntExist in the late 1900s, and it's been suppressed by “the man” ever since.
  • This seems like a really terrible idea.

    I am not sure what you get from a general AI, or even an LLM, trained only on material whose creators self-selected to have it included, and missing all the material from people who did not want to be included.

    I won't pretend to guess exactly how that breaks down along various ideological, professional vs. hobbyist, reformer vs. orthodoxy, age, ethnic, or what-have-you fault lines, but I have very little doubt such a flagging protocol will result in all kinds of unkno

  • The whole applied purpose of AI has been the wholesale stealing of other people's work. It's not your data!
  • No

    "I want to be able to conversationally chat to my feed algorithm. To be able to explain to it the types of content I want to see, and what I don't want to see. I want this to be an ongoing conversation as it refines what it shows me, or my interests change."

    Minds this broken need to be... removed from rotation.
