BlueSky Proposes 'New Standard' When Scraping Data for AI Training (techcrunch.com) 52

Posted by EditorDavid on Monday March 17, 2025 @03:34AM from the Robots.txt-2.0 dept.

An anonymous reader shared this article from TechCrunch: Social network Bluesky recently published a proposal on GitHub outlining new options it could give users to indicate whether they want their posts and data to be scraped for things like generative AI training and public archiving.

CEO Jay Graber discussed the proposal earlier this week, while on-stage at South by Southwest, but it attracted fresh attention on Friday night, after she posted about it on Bluesky. Some users reacted with alarm to the company's plans, which they saw as a reversal of Bluesky's previous insistence that it won't sell user data to advertisers and won't train AI on user posts.... Graber replied that generative AI companies are "already scraping public data from across the web," including from Bluesky, since "everything on Bluesky is public like a website is public." So she said Bluesky is trying to create a "new standard" to govern that scraping, similar to the robots.txt file that websites use to communicate their permissions to web crawlers...

If a user indicates that they don't want their data used to train generative AI, the proposal says, "Companies and research teams building AI training sets are expected to respect this intent when they see it, either when scraping websites, or doing bulk transfers using the protocol itself."
Over on Threads someone had a different wish for our AI-enabled future. "I want to be able to conversationally chat to my feed algorithm. To be able to explain to it the types of content I want to see, and what I don't want to see. I want this to be an ongoing conversation as it refines what it shows me, or my interests change."

"Yeah I want this too," posted top Instagram/Threads executive Adam Mosseri, who said he'd talked about the idea with VC Sam Lessin. "There's a ways to go before we can do this at scale, but I think it'll happen eventually."

BlueSky Proposes 'New Standard' When Scraping Data for AI Training

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 52 Comments Log In/Create an Account

Comments Filter:

Royalties (Score:5, Insightful)

by Njovich ( 553857 ) writes: on Monday March 17, 2025 @03:53AM (#65239325)

The laws are clear, and that's why Google and the like are asking for an exemption from the law: https://www.theverge.com/news/... [theverge.com]
We need a way where google and others can pay fair royalties to people whose work their AI is trained on, both for the training itself and for the token generation.
Obviously that's a billion people and it won't amount to much per person, but it's ludicrous to just let them take other people's data and profit off of it without any form of compensation.
The argument that 'chinese are doing it so we need to do it as well' sounds like begging for a race to the bottom. Chinese are allowing US films, software and games to be pirated without any action, does that mean we should do that as well?

- Good article (Score:5, Insightful)
  
  by evanh ( 627108 ) writes: on Monday March 17, 2025 @04:46AM (#65239341)
  
  And it's funny how the wording of the exemption requests are all future tense, as if the copyright infringements have not yet occurred.
  
- Re: (Score:3)
  
  by ShanghaiBill ( 739463 ) writes:
  
  The laws are clear
  No, they are not.
  The laws are ambiguous, and whether "training" is "copying" has not been established.
  - Re: Royalties (Score:2)
    
    by fluffernutter ( 1411889 ) writes:
    
    It's a digital representation so it's a copy. I don't see the difference. If it is so great then it can resemble the original comment. It just hasn't been allowed to yet.
    - Re: (Score:2)
      
      by ShanghaiBill ( 739463 ) writes:
      
      You are stating an opinion rather than citing established case law, so you appear to agree with me that the law is not settled.
      - Re: Royalties (Score:2, Insightful)
        
        by fluffernutter ( 1411889 ) writes:
        
        The law is now under control of greased palms so I can't predict that. It literally depends how much money they send Trump's way.
        
        Re: (Score:2)
        
        by Archangel Michael ( 180766 ) writes:
        
        You seem to think that only the R are on the take. It is clear, everyone in DC is on the take. The only difference is which team you're on as to whether or not you care (or don't care).
        They are all Hypocrites, talking out of whichever side of the mouth gets them elected again, so they can grift off the taxpayers.
        "But the other side is worse" you might say. Sure, Fine. The shit sandwich is better over there with you.
        
        Re: Royalties (Score:3)
        
        by fluffernutter ( 1411889 ) writes:
        
        I just know it was much better when they felt like they needed to hide it.
- Re: (Score:2)
  
  by Rei ( 128717 ) writes:
  
  "The laws are clear" that AI can't be trained on copyrighted material without paying royalties? The only case that's finished (the LAION case) has ruled *against* a requirement that AI can't train on copyrighted material. And in general, of the (numerous) other cases in progress, they in general haven't been going very well for the plaintiffs. Automated processing of copyrighted material to make novel goods and services has long been legal. For an extreme case, read the Google Books ruling.
  You also miss
  - Re: (Score:2)
    
    by phantomfive ( 622387 ) writes:
    
    If any output of the AI is clearly a derived work, then they need to pay royalties for that.
    - Re: (Score:2)
      
      by allo ( 1728082 ) writes:
      
      That's true for any medium you use to create a derived work.
      The point is, that most AI output is not clearly a derived work. And current cases like Andersons are struggling as she did not manage to get the AI to reproduce her works.
Tech bros cooperating on a psyop (Score:4, Insightful)

by Pinky's Brain ( 1158667 ) writes: on Monday March 17, 2025 @04:43AM (#65239339)

There is zero legal justification for opt in copyright protection. Either training on scraped data from public internet is fair use and only a click through license can protect the content or it isn't fair use and needs no further protrction.
BlueSky forcing users to opt in or defacto give up their rights is a betrayal of their users.

- Re: (Score:2)
  
  by kmoser ( 1469707 ) writes:
  
  This. That's why I plan to configure my web server so that when it detects an AI scraper, it inserts "blow me" at random points in the page, along with a link where they can pay me an astronomical sum of money.
public != public domain (Score:5, Insightful)

by martin-boundary ( 547041 ) writes: on Monday March 17, 2025 @04:47AM (#65239343)

public != public domain.
But I repeat myself.

- Re: (Score:2)
  
  by galgon ( 675813 ) writes:
  
  Training != Copying. But I repeat myself.
Public is public (Score:1, Redundant)

by bradley13 ( 1118935 ) writes:

Do you know how human artists learn their craft? They spend a lot of time looking at work by other artists. Guess what: they don't buy copies or pay royalties for everything that they look at. I don't see why this should be different for AI.
If you put something onto the public internet, it is going to get looked at. If you don't want it looked at, either put it behind a paywall, or don't put it on the internet. It's that easy.
- Re: (Score:3)
  
  by mccalli ( 323026 ) writes:
  
  OK, but then let me do the same. Can't have constant copyright strikes for regurgitating bits of music, while simultaneously allowing tracts of other copyright work to be output by an LLM-based AI.
  
  I regularly watch synthesiser reviews. Some a common sound as a filter sweep [youtube.com] (a single dial on most synths, hold down a note and turn it from high to low or low to high depending what you're going for) regularly gets copyright strikes. So ok - let it be both ways and not need to be some big company AI deal and
  - Re: (Score:2)
    
    by Rei ( 128717 ) writes:
    
    Some a common sound as a filter sweep [youtube.com] (a single dial on most synths, hold down a note and turn it from high to low or low to high depending what you're going for) regularly gets copyright strikes"
    And it shouldn't - that's not copyrightable. You're right, that's an abusive system. But the fix isn't to make it even more abusive.
    Also, one thing I don't think a lot of people understand when they try to impose strict regulations on AI development is... you're hitting Open Source far harder than th
- Re: Public is public (Score:5, Insightful)
  
  by fluffernutter ( 1411889 ) writes: on Monday March 17, 2025 @07:43AM (#65239483)
  
  A lot of the stuff AI is scraping was downloaded through torrent tracking sites and not put there by the artist/author.
  
- Re:Public is public (Score:5, Interesting)
  
  by ZiggyZiggyZig ( 5490070 ) writes: on Monday March 17, 2025 @09:39AM (#65239621)
  
  Do you know how human artists learn their craft? They spend a lot of time looking at work by other artists.
  I am tired of seeing this false equivalence argument [wikipedia.org]. A human artist is not the same as an LLM. Let me take myself as a personal example here. I have trained for 10 years as an artist. I took formal education in art, I read many books, watched many movies, etc. I took inspiration from many sources, most if not all did not give me prior authorization from using them as a source of inspiration. That is exactly what happens when you say "we spend a lot of time looking at work by other artists."
  Now, with all this inspiration I took in myself, digested and interpreted, I make my own art (writing, movies, music). This art probably resembles some prior works that I have seen, read or heard at some point. It is not the same, just like a LLM does not make copies or directly plagiarizes its training data. Until this point, I human artist, and a LLM are the same - I agree with that.
  Now, this is me, an artist, an individual person. I promote my art, try to obtain exhibitions, to go at events, sell my stuff. Getting my art known and making a living out of it is a full-time job, leaving ironically little time for obtaining further sources of inspiration and making new art. And it took me about 10 years of training to reach a state where I have some financial stability.
  A LLM is not an artist. It is not going as its own person in the world, trying to promote its personal way of seeing the world (I am still stating the obvious up to now). Instead, it offers seemingly unlimited artworks, produced more or less instantly, to whoever is able to prompt for it correctly.
  Imagine that there would be some sort of Amazon Mechanical Turk for artists. Each human artist would be a single instance of this Turk, and each human artist would need to be able to deliver in an instant a drawing, a piece of music, a video, a photo... based on a prompt. Let's put aside the fact that no human artist can instantly create a piece of work, ever. Remains the fact that a human artist is only able to serve one prompt at a time. The capacity of a single human is very limited. The capacity of an AI is infinite. An AI is not a human artist.
  Therefore, placing the debate on whether an AI is doing the same thing as a human artist when it gets trained, is not OK. The fact that human artists don't pay royalties to train on copyrighted data; does not mean that an AI, which has infinitely more power, should be treated the same.
  And about the internet - we did not put our stuff online to be plundered upon by bots. We put it with the intent to share it with other human beings. Bots have been tolerated when they brought us added value - Google would send us visitors, so we would let it download, analyze and index our stuff. It's a kind of an informal contract - the internet used to be about informality. We give it something, it gives us something in return. (And yes, contracts can be informal).
  What does AI bring to us who put things online? Nothing. And if informality is broken, then a formal contract needs to be enforced. Sorry, my data is not meant to be used for training a machine that will not bring any visitors to my online realm, offloading to me the cost of data storage and maintenance (because that's what a LLM does, offloading to webmasters the cost of storage and maintenance of its training dataset), with nothing in return. If online actors can't self-regulate (robots.txt comes to mind), then we need the power of law.
  
  - Re: (Score:1)
    
    by Iamthecheese ( 1264298 ) writes:
    
    Your argument is that even if the LLM uses the information the same way, as training data, it's different not because of how the information is used but because an AI can churn out content endlessly? Is that your argument?
- Re: Bluesky - The Social Network for LOSER SOY BOY (Score:5, Informative)
  
  by EldoranDark ( 10182303 ) writes: on Monday March 17, 2025 @07:24AM (#65239469)
  
  There's more and more people and companies on BSky every day. I think I migrated 70% of my follows now. And because there's no algorithmic content or ads yet, I actually get to see updates from accounts I follow. Sure, maybe Elon and other celebs didn't switch yet, and maybe if I might get into actual trouble for posting terrible things, but so far the experience is pretty great. No constant crypto spam and sex bots.
  
- Re: (Score:3)
  
  by Rei ( 128717 ) writes:
  
  What on Earth are you talking about? Bluesky grew 50% from November to January. And Bluesky's core moderation is minimal. The moderation system allows users to subscribe to whatever moderation lists they want. If your stance is that you should be able to force people to listen to you whenever they specifically have chosen not to, then yeah, best you go bugger off to 4chan.
- Re: (Score:2)
  
  by BeTeK ( 2035870 ) writes:
  
  Still think that best antisosical media is IRC. Prove me wrong :D
- Re: (Score:3, Informative)
  
  by GloryWacky ( 10410843 ) writes:
  
  So does that make the Democratic People's Republic of Korea a democracy then? After all it says it right in the name, so it must be true.
  The Nazi Party (National Socialist German Workers' Party, or NSDAP) co-opted socialist rhetoric early on to attract workers but was fundamentally opposed to core socialist principles. The Nazis crushed trade unions, outlawed leftist parties, and suppressed socialist and communist movements, they saw Marxism, socialism, and communism as enemies, sending many to concentrati
Capitalism is the problem (Score:3)

by fluffernutter ( 1411889 ) writes: on Monday March 17, 2025 @07:41AM (#65239477)

If AI can download for free than everyone else had better be able to, that's all I can say. Capitalism should apply the same to everyone or we should switch to another system.

- Re: (Score:2)
  
  by ShanghaiBill ( 739463 ) writes:
  
  If AI can download for free than everyone else had better be able to
  The issue is stuff that everyone else can already download for free.
  Can an AI learn from material on the web that is already viewable by the public?
  Capitalism should apply the same to everyone
  That's what the capitalists are saying.
  It's the artists and authors claiming that "AI is different."
  - Re: Capitalism is the problem (Score:2)
    
    by fluffernutter ( 1411889 ) writes:
    
    No one can download it for free. They can have it in their cache and they can read it. But no one is allowed to scrape it.
    - Re: (Score:2)
      
      by ShanghaiBill ( 739463 ) writes:
      
      No one can download it for free. They can have it in their cache and they can read it.
      You can't have it in your cache without downloading it.
      But no one is allowed to scrape it.
      It is legal to scrape most public-facing websites. Search engines do it every day.
      - Re: Capitalism is the problem (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        How many people can make use of the things in cache? The designed purpose of the cache is that it is there to serve the function of displaying the webpage, not for direct consumption. Also is the text from a site even in cache?
    - Re: (Score:2)
      
      by omnichad ( 1198475 ) writes:
      
      Transient copying for consumption is allowed under copyright law.
      They can have it in their cache and they can read it.
      
      So after retrieving it, they can keep it for a bit to read.
      But no one is allowed to scrape it.
      But they can't retrieve it first? "Scraping" is an HTTP request. Maybe a regex or two on top if you aren't using human eyes to focus on specific text.
- Re: (Score:2)
  
  by hdyoung ( 5182939 ) writes:
  
  Youre absolutely right. Capitalism has problems and thus we should switch to a better system. I for one, fully advocate for the immediate transition to perfecteconomyism. It was invented by Michael McDoesntExist in the late 1900s and its been suppressed by “the man” ever since.
  - Re: Capitalism is the problem (Score:2)
    
    by fluffernutter ( 1411889 ) writes:
    
    So stick with a system that has been around since way before everyone had the Internet in their pockets? Maybe Americans haven't learned anything since then?
    - Re: (Score:2)
      
      by hdyoung ( 5182939 ) writes:
      
      Whoosh. Theres no better alternative. Every other economic system has been shown to be inferior by a long shot.
      - Re: Capitalism is the problem (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        You are just showing an inability to think of new systems. Also, Marxist communism has not been tried. Communism is not totalitarianism.
      - Re: (Score:2)
        
        by HiThere ( 15173 ) writes:
        
        Sorry, but EVERY theoretical economic system that I know of, definitely including capitalism, has been shown to be unworkable. And you won't find any in use anywhere on the face of the Earth. ... Well, except communism, and that only in *some* small family groups...usually before any kids are born.
        
        Re: (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        There has not been a cure for cancer. Doesn't mean we don't keep looking for it.
        
        Re: (Score:2)
        
        by HiThere ( 15173 ) writes:
        
        Keeping looking is fine. But don't avoid the current treatments just because they aren't perfect.
        I haven't seen, in this thread, anyone propose some new theoretic economic system. I've my doubts that any new improvement is really possible that isn't a minor tweak on some existing theory. But none of the existing theories is actually workable at scale. So we need capitalism modified by anti-trust regulations modified by a good safety net modified by... No existing simple answer is going to work.
        
        Re: Capitalism is the problem (Score:2)
        
        by fluffernutter ( 1411889 ) writes:
        
        Because there is no point proposing anything until the civil war happens.
DANGER (Score:1)

by DarkOx ( 621550 ) writes:

This seems like a really terrible idea.
I am not sure a general AI or even LLM training on material where the creators were allowed to self select if it should be included in training, and missing all the material from people who did not want to be included.
I won't pretend to guess exactly how that breaks down along various ideological, profession vs hobbyist, reformer vs orthodoxy, age, ethnic, what have your fault lines but I have very little doubt such a flagging protocol will result in all kinds of unkno
Just Stop Stealing! (Score:2)

by BrendaEM ( 871664 ) writes:

The whole applied purpose of AI has been the wholesale stealing the work of others. It's not your data!
NO. NoNoNoNo (Score:2)

by bleedingobvious ( 6265230 ) writes:

No
"I want to be able to conversationally chat to my feed algorithm. To be able to explain to it the types of content I want to see, and what I don't want to see. I want this to be an ongoing conversation as it refines what it shows me, or my interests change."
Minds this broken need to be... removed from rotation.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Royalties (Score:5, Insightful)

Good article (Score:5, Insightful)

Re: (Score:3)

Re: Royalties (Score:2)

Re: (Score:2)

Re: Royalties (Score:2, Insightful)

Re: (Score:2)

Re: Royalties (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Tech bros cooperating on a psyop (Score:4, Insightful)

Re: (Score:2)

public != public domain (Score:5, Insightful)

Re: (Score:2)

Public is public (Score:1, Redundant)

Re: (Score:3)

Re: (Score:2)

Re: Public is public (Score:5, Insightful)

Re:Public is public (Score:5, Interesting)

Re: (Score:1)

Re: Bluesky - The Social Network for LOSER SOY BOY (Score:5, Informative)

Re: (Score:3)

Re: (Score:2)

Re: (Score:3, Informative)

Capitalism is the problem (Score:3)

Re: (Score:2)

Re: Capitalism is the problem (Score:2)

Re: (Score:2)

Re: Capitalism is the problem (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: Capitalism is the problem (Score:2)

Re: (Score:2)

Re: Capitalism is the problem (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: Capitalism is the problem (Score:2)

DANGER (Score:1)

Just Stop Stealing! (Score:2)

NO. NoNoNoNo (Score:2)

Related Links Top of the: day, week, month.

Slashdot Top Deals