Microsoft's AI CEO: Web Content (Without a Robots.txt File) is 'Freeware' for AI Training (windowscentral.com)

Slashdot reader joshuark shared this report from Windows Central: Microsoft may have opened a can of worms with recent comments made by the tech giant's CEO of AI, Mustafa Suleyman. The CEO spoke with CNBC's Andrew Ross Sorkin at the Aspen Ideas Festival earlier this week. In his remarks, Suleyman claimed that all content shared on the web is available to be used for AI training unless a content producer specifically says otherwise.
The whole discussion was interesting — but this particular question was very direct. CNBC's interviewer specifically said, "There are a number of authors here... and a number of journalists as well. And it appears that a lot of the information that has been trained on over the years has come from the web — and some of it's the open web, and some of it's not, and we've heard stories about how OpenAI was turning YouTube videos into transcripts and then training on the transcripts."

The question becomes "Who is supposed to own the IP, who is supposed to get value from the IP, and whether, to put it in very blunt terms, whether the AI companies have effectively stolen the world's IP." Suleyman begins his answer — at the 14:40 mark — with "Yeah, I think — look, it's a very fair argument." SULEYMAN: "I think that with respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding.

"There's a separate category where a website or a publisher or a news organization had explicitly said, 'Do not scrape or crawl me for any other reason than indexing me so that other people can find that content.' That's a gray area and I think that's going to work its way through the courts."


Q: And what does that mean, when you say 'It's a gray area'?

SULEYMAN: "Well, if — so far, some people have taken that information... but that's going to get litigated, and I think that's rightly so...

"You know, look, the economics of information are about to radically change, because we're going to reduce the cost of production of knowledge to zero marginal cost. And this is just a very difficult thing for people to intuit — but in 15 or 20 years time, we will be producing new scientific cultural knowledge at almost zero marginal cost. It will be widely open sourced and available to everybody. And I think that is going to be, you know, a true inflection point in the history of our species. Because what are we, collectively, as an organism of humans, other than an intellectual production engine. We produce knowledge. Our science makes us better. And so what we really want in the world, in my opinion, are new engines that can turbocharge discovery and invention."

  • by bento ( 19178 ) on Sunday July 07, 2024 @01:02AM (#64606271) Homepage
    Not to say I'm a fan of the current copyright regime but if we were going to throw Aaron Swartz in jail for what he was doing then all the AI companies have a lot of explaining to do.
    • Re: (Score:3, Interesting)

      by gweihir ( 88907 )

      Yep. These assholes are criminal commercial pirates, nothing else.

    • Comment removed (Score:4, Interesting)

      by account_deleted ( 4530225 ) on Sunday July 07, 2024 @03:41AM (#64606391)
      Comment removed based on user account deletion
    • by thegarbz ( 1787294 ) on Sunday July 07, 2024 @07:25AM (#64606565)

      Aaron Swartz was not thrown in jail for what AI is doing. In fact, Aaron Swartz did something similar to what AI is doing with his PACER access, and in that case the government specifically did *not* pursue him.

      AI accessing a public website is not the same as connecting a foreign laptop to a secure private network (hidden in a closet to avoid detection, so even Aaron Swartz knew this was wrong), using it as a gateway with a specific account he had been provided, ultimately against its Terms of Service, to bulk-download data with the sole intent to distribute it.

      Equating the two is like saying someone who accidentally causes a fatal car accident is on the same level as OJ Simpson.

    • You could argue that AI is a transformative use. The original text is converted into weights in the neural network. Naturally, I am not a lawyer. Whether this falls under fair use in copyright law would be up to the courts.
    • A judge just sealed the data of the Nashville school shooter on Berne Convention copyright grounds.

      Which the public actively needs access to in order to understand and prevent future occurrences.

      Now comes Microsoft, arguing that the corporations have a pressing need to ignore the Berne Convention.

      If we're backed into this corner of saving lives vs. losing profits, then we just need to abolish copyright and see how it goes.

      Odds are we'll just get Anarcho-tyranny instead with different rules based on wealth.

  • by theshowmecanuck ( 703852 ) on Sunday July 07, 2024 @01:23AM (#64606283) Journal
    While I am an atheist, I do believe Jesus was a real person who, at least according to his original followers, had some very good lessons to teach about how to treat others. I think that is what originally shaped European and North American cultures: while not free of corruption and greed, they kept those in check because Christian teaching frowned upon treating others like shit. And if you look at the state of affairs, that is still mostly true. But I've noticed the big tech companies are increasingly hiring CEOs from countries that have ruthlessly vicious differences in social strata, including caste systems; countries whose cultures had no issue with promoting rich cronies while not caring about millions living in poverty, and where finding dead bodies on the sidewalk is still not unheard of. And while in North America you might cry that it is bad here too, you'd only be mildly correct compared to, say, India, which is why so many people from there want to come here. Anyway, I think there is a certain sense lately that tech giants are becoming increasingly brutal in creating the have/have-not divide. And that disregard for average people and their rights is even more scary, downright terrifying, when put in the context of Artificial Intelligence development.
    • by Barny ( 103770 )

      But I've noticed the big tech companies are increasingly hiring people as CEOs from countries that have ruthlessly vicious differences in social strata including caste systems.

      Yes. Countries like—checks notes—the United Kingdom. The guy was born and grew up in Islington, an inner London borough.

      You can stop your racist BS couched as moral high ground about now.

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      Go look up the ethnicity of who makes up the majority of the board of any US tech or media company; the Jamals and Sanjays are a mere rounding error compared to the Altmans, Hoffmans, Bergs, Jaxys, the Bezos, the Steins, the Skis, the Andressons.
      The Jamals and Suleymans are just LARPing without understanding that they are the fall guys. Whoopi Goldberg worked that out in the 80s; Jamal is figuring it out now.

    • It's a curious take on things, considering the massive corporate system we live in was essentially invented by two Christian nations, the UK and the US, with everyone else trying to catch up. Global imperialism was nearly perfected by a Christian nation; economic imperialism by a functionally Christian nation.

      IMO, this is basically coincidence, caused by these being the first nations to industrialize. My point is that Christianity has done very little to nothing to prevent corruption.

      Bringin

  • Nope (Score:5, Insightful)

    by zkiwi34 ( 974563 ) on Sunday July 07, 2024 @01:29AM (#64606289)
    Copyright is still a thing
    • Microsoft can argue fair use
      https://www.copyright.gov/fair... [copyright.gov]

      IDK, the models themselves and any archived copy of the source material used to train other models seem like different situations, at least. I'm having a hard time seeing this failing fair-use tests; it's like saving a copy of this page to your hard drive, or creating any statistical model of it. Otherwise web crawling would have been legally prohibited a long time ago.

      The archive is of non-commercial use, the models are not, neither get in the way of copyright's g

      • by flink ( 18449 )

        The models frequently contain entire word-for-word copies of the training corpus. This has been shown many times by tricking them into spitting out full song lyrics, articles, transcripts of videos, etc. If that is not copyright violation, then I don't know what is.

    • Yes it is. And when you publish things on the web, and you don't prohibit use of that information via robots.txt, you are granting permission for search engines to index that data. For all intents and purposes, AI is a fancy search engine.

    • Copyright is still a thing

      In the world of AI it is an untested thing. So far no one has successfully claimed copyright infringement by an AI training set. People have tried, but not succeeded. So in that sense "copyright is still a thing,... and can be ignored".

  • That means it's freeware right? Along with the ISOs?

    • by Barny ( 103770 )

      Like a list of Windows 10/11 enterprise OEM install keys along with the certificates that will unlock with them? I do believe you are correct.

      Also, I love his mention of "opensource" AI. If he opensources it, that means it's free game!

    • by shanen ( 462549 )

      That's the best Funny Slashdot can do now?

  • Not only are we going to take your job, we're going to take your copyrights too.

  • LLMs cannot "produce knowledge". Hence any promise of "we're going to reduce the cost of production of knowledge to zero marginal cost" is complete nonsense. All LLMs can do is repackage knowledge, find some statistical correlations to cluster it and then make it a bit easier to find. That has nothing to do with "producing knowledge". That is just data processing, but the data has to already be there and be labelled with conclusions.

    This guy is an idiot or a liar or both.

  • While U.S. law provides there is an implied copyright on any original text, even if it is made public, I put the copyright symbol on all of my Web pages. I just now added "bing" and "bingbot" to my robots.txt file. Those two user agents of Microsoft appeared in a log of visitors to my Web site.

    However, there are bots, crawlers, and scrapers -- all of which are really the same -- that ignore robots.txt. Among them are bots run by Amazon and Google.
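    For reference, the kind of robots.txt stanzas described above might look like the sketch below. Bingbot is Microsoft's documented Bing crawler token; GPTBot is OpenAI's; treat any other token as an example to verify against the vendor's own documentation, since these names change over time.

```text
# Disallow Microsoft's Bing crawler entirely, as described above.
User-agent: bingbot
Disallow: /

# Example AI-training crawler token; check each vendor's docs for theirs.
User-agent: GPTBot
Disallow: /

# Everyone else may index normally.
User-agent: *
Allow: /
```

    Note that, as the comment says, this is purely advisory: well-behaved crawlers honor it, and the rest ignore it.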

  • by Petersko ( 564140 ) on Sunday July 07, 2024 @02:46AM (#64606373)

    Slashdotters upset at Microsoft for stealing knowledge, and Slashdotters who can't be argued out of the idea that pirating music isn't theft.

  • by Todd Knarr ( 15451 ) on Sunday July 07, 2024 @02:50AM (#64606375) Homepage

    That's just the way the Web works: unless content is explicitly locked away, anyone can view it. Without that, we can't browse the Web. And if that's the case, then if the AI isn't listed in robots.txt it can view the data and learn from it. If it is listed... well, robots.txt is more of a gentleman's agreement than enforceable law, but AI companies should remember that the consequences of rejecting that agreement aren't that they get to crawl the content but that the site isn't bound by that agreement and can just block the AI company's entire address range and to hell with them.

    What happens after the AI is trained, though, comes down to copyright law. If the AI crawled the content then it had access to the content; that's half the requirement for copyright infringement right there. That, BTW, is why authors and editors don't read unsolicited manuscripts: if they haven't read your story they can't have copied it, and it's much easier to prove they returned your envelope unopened. If the AI is then used to generate content that can replace the creator's content, that opens the AI company up to infringement claims just as if they'd hired a person to do the same thing. Then it comes down to how close a match the content is to the original and what rights are implicated (e.g. you don't need to make an exact copy to infringe on a creator's trademark on character design and appearance). Yes, the AI company themselves. Whoever asked the AI company to create the content may also be on the hook, depending on why they asked for it and what they used it for, but the AI company is the one who did the work.

    Finally, any company that's considering using AI to produce content has to keep in mind that the people they're replacing with AI will remember this. We've seen how poorly AI performs creating artwork and written articles for publication, and if you think the artists and writers you laid off because AI could do the job are going to come crawling back to help fix the mess when it turns out AI can't in fact do the job without human help then you're going to be very, very surprised. Especially if you think you can low-ball the rate because they're "only" editing instead of creating.

    • That's just the way the Web works: unless content is explicitly locked away, anyone can view it. Without that, we can't browse the Web.

      True, but irrelevant. Copyright is literally and only about the right to copy -- it's right there in the name. The ability to access (or not) is a separate issue.

      What happens after the AI is trained though is down to copyright law. If the AI crawled the content then it had access to the content.

      True.

      That's half the requirement for copyright infringement right there.

      There is

  • by FudRucker ( 866063 ) on Sunday July 07, 2024 @03:10AM (#64606383)
    Just make gigs of useless word salad for the AI to consume, since they think they can just siphon up anything they want without regard to the server owner's wishes
    • Nice - going to use AI generated salad for that? :)

    • Yes, it reminds me of when we jokingly tried to poison the CIA feed by adding spooky words to the ends of innocuous posts about Emacs on Usenet.

      Spook food: Terrorism, bomb, poison, 9/11, torture, Assange, Iraq, cruise missiles

      AI information: Adding non-toxic glue to pizza keeps the cheese from sliding off. Limit your consumption of rocks to no more than two or three a day for optimal absorption of iron.

  • by Misagon ( 1135 ) on Sunday July 07, 2024 @04:48AM (#64606437)

    "Freeware" [wikipedia.org] is an old, classic computing licensing term that means users are free to copy and run the program. It is "gratis", or "free as in beer" if you will.

    Freeware does not allow stripping it of copyright, transforming it, diluting it, and redistributing it in modified form.

    So, yes, I'd agree that the social contract for the public web is similar to that of "freeware": You are allowed to download web pages in whole, read them, archive them, index them and run the Javascript code embedded in them.
    But you are not allowed to mash it up through automatic means and redistribute it, like what Microsoft and other scumbag AI companies are doing.

  • Hmmm (Score:4, Insightful)

    by jd ( 1658 ) <imipak AT yahoo DOT com> on Sunday July 07, 2024 @05:21AM (#64606461) Homepage Journal

    Copyright exists the moment you write something.

    The social contract that exists is that "fair use" constitutes a reasonable but small percentage of the material for personal use, or a much much smaller fraction for reuse in commercial works.

    There is no such thing as automatic public domain, public domain must be asserted proactively or is asserted by the government after the expiry of copyright.

    If this were not so, an AI would be legally entitled to obtain from Microsoft's online ISO repository a copy of Windows 11, disassemble it, and make any section of that code available to anyone who asked. It's under no more protection than most of the other copyrighted works AI has learned from.

    What would you expect Microsoft's response to such a thing to be?

    To blithely say "oh, yeah, Windows 11 is now public domain?"

    No, chances are the AI company would be sued into oblivion and the directors found mysteriously unalived. And we all know this.

    The face-eating leopards party should be careful with those leopards.

    • Copyright exists the moment you write something.

      Copyright doesn't come without fair-use exceptions, including transformative use. You can't use copyright to stop something from being learned any more than I can copyright this post in a way that forces you to forget what I wrote and not reply to it. Whatever you created isn't copied verbatim when you train an AI model on it, any more than you will be able to remember this post word for word.

  • by xack ( 5304745 ) on Sunday July 07, 2024 @05:36AM (#64606475)
    For those who say only humans should be able to access content: what about people with physical disabilities who need computer assistance to access content? Are they bots? What about people with learning disabilities who struggle with increasingly difficult captchas? What about people who have poor English or literacy skills? I feel that in the war against bots, we will be sacrificing humans. I've already had to give up going to a popular site since they have too strict a captcha (blaming hCaptcha specifically here). Eventually we will have to live in a society where bots and humans live together. When you start talking about banning "non-humans" from the internet, it will inevitably devolve into ableism as well.
    • This point has been made already by a few other posters under this story... but I have to say it again for your context.
      It doesn't have to be about humans vs computers, AIs having rights or much else.

      Suleyman mentioned a "social contract", and the thing about social contracts is they're unwritten rules, so this is NOT about the law.
      "Fair use" is generally taken to NOT be about commercial use. Sure, the law allows some minimal fair use for various things. But these companies are
      a) downloading the *entire* co

  • So if the rule is:

    SULEYMAN: "I think that with respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding."

    Then all of MS stuff on the web is free to "copy it, recreate with it, reproduce with it?"

  • by eneville ( 745111 ) on Sunday July 07, 2024 @08:40AM (#64606653) Homepage

    I'm going to block the crawlers, not because of the copyright issues, but because I dislike the energy they're burning. This is like a bitcoin land rush, but this time people want to sell the compute of their LLM rather than bitcoins.

    Will it make a difference? Probably not; I won't be able to keep up with the UA strings.
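    A minimal sketch of the kind of User-Agent blocking this comment describes, using only Python's standard library. The substrings in the block list are examples (GPTBot and bingbot are real documented crawler tokens; any others would need checking against vendor docs), and, as the commenter notes, such a list needs constant upkeep.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# User-Agent substrings to reject. These names are examples; real crawler
# lists change constantly, which is exactly the commenter's complaint.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "ccbot", "bingbot")

def is_blocked(user_agent):
    """Case-insensitive substring match against the block list."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS)

class BlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject requests whose User-Agent matches a blocked crawler.
        if is_blocked(self.headers.get("User-Agent")):
            self.send_error(403, "Crawler blocked")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello, human\n")

# To serve: HTTPServer(("", 8000), BlockingHandler).serve_forever()
```

    Of course, any crawler that lies about its User-Agent sails straight through, which is why address-range blocking comes up elsewhere in this thread.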

  • by Growlley ( 6732614 ) on Sunday July 07, 2024 @09:08AM (#64606705)
    Other media groups have been disputing that it's 'free' for the taking over the last 10 years.
  • by Pinky's Brain ( 1158667 ) on Sunday July 07, 2024 @10:24AM (#64606851)

    If the Supreme Court rules that copying for training is not fair use, the entire large-model AI industry is dead. Statutory damages on all the registered works they copied would alone bankrupt them.

    Microsoft is insane to get into that industry as a first party. As always, Microsoft is dedicated to showing just how inferior its planning and execution are to Apple's, which is wisely keeping large models at arm's length.

  • The understanding has always been that what's available on the Web is copyrighted, and is not available for commercial use unless explicitly authorized.

    Microsoft has just admitted to massive theft with intent. It is going to be fun watching the backpedaling in court.

  • by Tony Isaac ( 1301187 ) on Sunday July 07, 2024 @11:23AM (#64606971) Homepage

    They want their content to be indexed by search engines, but they don't want you to be able to actually USE (view) that information without going through their paywall. So they essentially lie, they give *everything* to the bots, but then when you go to the same URL with a regular browser, they hide it until you give them what they want.

    AI is, in many ways, a fancy search engine. It just indexes the data in a different, more sophisticated way.

  • AI Honey Traps (Score:3, Insightful)

    by firewrought ( 36952 ) on Sunday July 07, 2024 @12:28PM (#64607085)
    Before robots.txt, people who didn't like web scrapers would build honey pots to poison their results. Essentially, there would be a link to a page of garbage auto-generated content that itself contained more links to more auto-generated content, ad infinitum. Challenge for those playing with modern locally-hosted LLMs: build fake auto-generated websites that produce content that sounds helpful, authoritative, and correct, but that is obviously bogus to human readers. For bonus points, add images that are mislabeled.
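    The classic version of this trap can be sketched in a few lines of Python. Every URL under a hypothetical /trap/ prefix deterministically yields a page of word salad plus links to yet more /trap/ pages, so a crawler that follows links never runs out. The word list and path scheme here are invented for illustration.

```python
import hashlib
import random

# Toy vocabulary for plausible-sounding garbage; purely illustrative.
WORDS = ("synergy", "protocol", "quantum", "latency", "substrate",
         "heuristic", "manifold", "throughput", "entropy", "cache")

def trap_page(path, n_sentences=5, n_links=4):
    """Deterministically generate a garbage HTML page for a trap URL."""
    # Seed the RNG from the path so the same URL always serves the same
    # page, which makes the trap look like a real static site.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    sentences = [
        " ".join(rng.choice(WORDS) for _ in range(8)).capitalize() + "."
        for _ in range(n_sentences)
    ]
    # Each page links to more trap pages, ad infinitum.
    links = [
        '<a href="/trap/%08x">more</a>' % rng.getrandbits(32)
        for _ in range(n_links)
    ]
    return "<html><body><p>%s</p>%s</body></html>" % (
        " ".join(sentences), "\n".join(links))
```

    Wire trap_page up behind any web server and link to /trap/ from a page humans never visit; legitimate users see nothing, while an undisciplined scraper ingests an unbounded supply of salad.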
  • Copyright law supersedes any "social contract"

    See that little text at the bottom of almost every website you visit that says "Copyright" along with a link to "Legal" or "Terms of Use" ? (even this site has it, take a look)

    This is why "free as in beer" vs "free as in speech" is a thing. You can't just do whatever you want with "free" content. That is not how the law works.

  • Following in Microsoft's footsteps, I hereby proclaim that all Microsoft software without a Robots.txt file is now freeware. Enjoy! Thank you, Microsoft!

  • I coded a web app licensed under GPLv3. If Microsoft trains their AI on it, do they have to release the results of their training? It is a derivative work.
