Microsoft's AI CEO: Web Content (Without a Robots.txt File) is 'Freeware' for AI Training (windowscentral.com)

Slashdot reader joshuark shared this report from Windows Central: Microsoft may have opened a can of worms with recent comments made by the tech giant's CEO of AI, Mustafa Suleyman. The CEO spoke with CNBC's Andrew Ross Sorkin at the Aspen Ideas Festival earlier this week. In his remarks, Suleyman claimed that all content shared on the web is available to be used for AI training unless a content producer specifically says otherwise.
The whole discussion was interesting — but this particular question was very direct. CNBC's interviewer specifically said, "There are a number of authors here... and a number of journalists as well. And it appears that a lot of the information that has been trained on over the years has come from the web — and some of it's the open web, and some of it's not, and we've heard stories about how OpenAI was turning YouTube videos into transcripts and then training on the transcripts."

The question becomes "Who is supposed to own the IP, who is supposed to get value from the IP, and whether, to put it in very blunt terms, whether the AI companies have effectively stolen the world's IP." Suleyman begins his answer — at the 14:40 mark — with "Yeah, I think — look, it's a very fair argument." SULEYMAN: "I think that with respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding.

"There's a separate category where a website or a publisher or a news organization had explicitly said, 'Do not scrape or crawl me for any other reason than indexing me so that other people can find that content.' That's a gray area and I think that's going to work its way through the courts."


Q: And what does that mean, when you say 'It's a gray area'?

SULEYMAN: "Well, if — so far, some people have taken that information... but that's going to get litigated, and I think that's rightly so...

"You know, look, the economics of information are about to radically change, because we're going to reduce the cost of production of knowledge to zero marginal cost. And this is just a very difficult thing for people to intuit — but in 15 or 20 years time, we will be producing new scientific cultural knowledge at almost zero marginal cost. It will be widely open sourced and available to everybody. And I think that is going to be, you know, a true inflection point in the history of our species. Because what are we, collectively, as an organism of humans, other than an intellectual production engine. We produce knowledge. Our science makes us better. And so what we really want in the world, in my opinion, are new engines that can turbocharge discovery and invention."

  • by bento ( 19178 ) on Sunday July 07, 2024 @01:02AM (#64606271) Homepage
    Not to say I'm a fan of the current copyright regime but if we were going to throw Aaron Swartz in jail for what he was doing then all the AI companies have a lot of explaining to do.
    • Re: (Score:3, Interesting)

      by gweihir ( 88907 )

      Yep. These assholes are criminal commercial pirates, nothing else.

    • Comment removed (Score:4, Interesting)

      by account_deleted ( 4530225 ) on Sunday July 07, 2024 @03:41AM (#64606391)
      Comment removed based on user account deletion
    • by thegarbz ( 1787294 ) on Sunday July 07, 2024 @07:25AM (#64606565)

      Aaron Swartz was not thrown in jail for what AI is doing. In fact, Aaron Swartz did something similar to what AI is doing with his PACER access, and in that case the government specifically did *not* pursue him.

      AI accessing a public website is not the same as connecting a foreign laptop to a secure private network (hidden in a closet to avoid detection, so even Aaron Swartz knew this was wrong), using it as a gateway with a specific account he had been provided, ultimately against its Terms of Service, to bulk-download data with the sole intent to distribute it.

      Equating the two is like saying someone who accidentally causes a fatal car accident is on the same level as OJ Simpson.

    • You could argue that AI is a transformative use. The original text is converted into weights in the neural network. Naturally, I am not a lawyer. Whether this falls under fair use in copyright law would be up to the courts.
    • A judge just sealed the data of the Nashville school shooter on Berne Convention copyright grounds.

      Which the public actively needs access to in order to understand and prevent future occurrences.

      Now comes Microsoft, arguing that the corporations have a pressing need to ignore the Berne Convention.

      If we're backed into this corner of saving lives vs. losing profits, then we just need to abolish copyright and see how it goes.

      Odds are we'll just get Anarcho-tyranny instead with different rules based on wealth.

  • by theshowmecanuck ( 703852 ) on Sunday July 07, 2024 @01:23AM (#64606283) Journal
    While I am an atheist, I do believe Jesus was a real person who, at least according to his original followers, had some very good lessons to teach about how to treat others. I think that is what originally shaped European and North American cultures: while not free of corruption and greed, they kept those in check because Christian teaching frowned upon treating others like shit. And if you look at the state of affairs, that is still mostly true. But I've noticed the big tech companies are increasingly hiring CEOs from countries that have ruthlessly vicious differences in social strata, including caste systems; countries whose cultures had no issue with promoting rich cronies while not caring about millions living in poverty, and where finding dead bodies on the sidewalk is still not unheard of. And while in North America you might cry that it is bad here too, you'd only be mildly correct compared to, say, India, which is why so many people from there want to come here. Anyway, I think there is a certain sense lately that tech giants are becoming increasingly brutal in creating the have/have-not divide. And that disregard for average people and their rights is even more scary, downright terrifying, when put in the context of Artificial Intelligence development.
    • by Barny ( 103770 )

      But I've noticed the big tech companies are increasingly hiring people as CEOs from countries that have ruthlessly vicious differences in social strata including caste systems.

      Yes. Countries like—checks notes—the United Kingdom. The guy was born and grew up in Islington, an inner London borough.

      You can stop your racist BS couched as moral high ground about now.

    • Re: (Score:2, Insightful)

      by Anonymous Coward

      Go look up the ethnicity of who makes up the majority of the board of any US tech or media company; the Jamals and Sanjays are a mere rounding error compared to the Altmans, Hoffmans, Bergs, Jaxys, the Bezos, the Steins, the Skis, the Andressons.
      The Jamals and Suleymans are just LARPing without understanding that they are the fall guys. Whoopi Goldberg worked that out in the 80s; Jamal is figuring it out now.

    • It's a curious take on things, considering the massive corporate system we live in was essentially invented by two Christian nations, the UK and the US, with everyone else trying to catch up. Global imperialism was nearly perfected by a Christian nation; economic imperialism by a functionally Christian nation.

      IMO, this is basically coincidence, caused by these being the first nations to industrialize. My point is that Christianity has done very little to nothing to prevent corruption.

      Bringin

  • Nope (Score:5, Insightful)

    by zkiwi34 ( 974563 ) on Sunday July 07, 2024 @01:29AM (#64606289)
    Copyright is still a thing
    • Microsoft can argue fair use
      https://www.copyright.gov/fair... [copyright.gov]

      IDK, the models themselves and any archived copy of the source material used to train other models seem like different situations, at least. I'm having a hard time seeing this failing fair-use tests; it's like saving a copy of this page to your hard drive, or creating any statistical model of it. Otherwise web crawling would have been legally prohibited a long time ago.

      The archive is of non-commercial use, the models are not, neither get in the way of copyright's g

      • by flink ( 18449 )

        The models frequently contain entire word-for-word copies of the training corpus. This has been shown many times by tricking them into spitting out full song lyrics, articles, transcripts of videos, etc. If that is not copyright violation, then I don't know what is.

    • Yes it is. And when you publish things on the web, and you don't prohibit use of that information via robots.txt, you are granting permission for search engines to index that data. For all intents and purposes, AI is a fancy search engine.

    • Copyright is still a thing

      In the world of AI it is an untested thing. So far no one has successfully claimed copyright infringement by an AI training set. People have tried, but not succeeded. So in that sense "copyright is still a thing,... and can be ignored".

  • That means it's freeware right? Along with the ISOs?

    • by Barny ( 103770 )

      Like a list of Windows 10/11 enterprise OEM install keys along with the certificates that will unlock with them? I do believe you are correct.

      Also, I love his mention of "opensource" AI. If he opensources it, that means it's free game!

    • by shanen ( 462549 )

      That's the best Funny Slashdot can do now?

  • Not only are we going to take your job, we're going to take your copyrights too.

  • LLMs cannot "produce knowledge". Hence any promise of "we're going to reduce the cost of production of knowledge to zero marginal cost" is complete nonsense. All LLMs can do is repackage knowledge, find some statistical correlations to cluster it and then make it a bit easier to find. That has nothing to do with "producing knowledge". That is just data processing, but the data has to already be there and be labelled with conclusions.

    This guy is an idiot or a liar or both.

  • While U.S. law provides there is an implied copyright on any original text, even if it is made public, I put the copyright symbol on all of my Web pages. I just now added "bing" and "bingbot" to my robots.txt file. Those two user agents of Microsoft appeared in a log of visitors to my Web site.

    However, there are bots, crawlers, and scrapers -- all of which are really the same -- that ignore robots.txt. Among them are bots run by Amazon and Google.
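    For reference, the kind of robots.txt stanzas described above might look like the sketch below. Bingbot is Microsoft's documented Bing crawler token; GPTBot is OpenAI's; treat any other token as an example to verify against the vendor's own documentation, since these names change over time.

```text
# Disallow Microsoft's Bing crawler entirely, as described above.
User-agent: bingbot
Disallow: /

# Example AI-training crawler token; check each vendor's docs for theirs.
User-agent: GPTBot
Disallow: /

# Everyone else may index normally.
User-agent: *
Allow: /
```

    Note that, as the comment says, this is purely advisory: well-behaved crawlers honor it, and the rest ignore it.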

  • by Petersko ( 564140 ) on Sunday July 07, 2024 @02:46AM (#64606373)

    Slashdotters upset at Microsoft for stealing knowledge, and Slashdotters who can't be argued out of the idea that pirating music isn't theft.

  • by Todd Knarr ( 15451 ) on Sunday July 07, 2024 @02:50AM (#64606375) Homepage

    That's just the way the Web works: unless content is explicitly locked away, anyone can view it. Without that, we can't browse the Web. And if that's the case, then if the AI isn't listed in robots.txt it can view the data and learn from it. If it is listed... well, robots.txt is more of a gentleman's agreement than enforceable law, but AI companies should remember that the consequences of rejecting that agreement aren't that they get to crawl the content but that the site isn't bound by that agreement and can just block the AI company's entire address range and to hell with them.

    What happens after the AI is trained, though, comes down to copyright law. If the AI crawled the content then it had access to the content; that's half the requirement for copyright infringement right there. That, BTW, is why authors and editors don't read unsolicited manuscripts: if they haven't read your story they can't have copied it, and it's much easier to prove they returned your envelope unopened. If the AI is then used to generate content that can replace the creator's content, that opens the AI company up to infringement claims just as if they'd hired a person to do the same thing. Then it comes down to how close a match the content is to the original and what rights are implicated (e.g. you don't need to make an exact copy to infringe on a creator's trademark on character design and appearance). Yes, the AI company themselves. Whoever asked the AI company to create the content may also be on the hook, depending on why they asked for it and what they used it for, but the AI company is the one who did the work.

    Finally, any company that's considering using AI to produce content has to keep in mind that the people they're replacing with AI will remember this. We've seen how poorly AI performs creating artwork and written articles for publication, and if you think the artists and writers you laid off because AI could do the job are going to come crawling back to help fix the mess when it turns out AI can't in fact do the job without human help then you're going to be very, very surprised. Especially if you think you can low-ball the rate because they're "only" editing instead of creating.

    • That's just the way the Web works: unless content is explicitly locked away, anyone can view it. Without that, we can't browse the Web.

      True, but irrelevant. Copyright is literally and only about the right to copy -- it's right there in the name. The ability to access (or not) is a separate issue.

      What happens after the AI is trained though is down to copyright law. If the AI crawled the content then it had access to the content.

      True.

      That's half the requirement for copyright infringement right there.

      There is

  • by FudRucker ( 866063 ) on Sunday July 07, 2024 @03:10AM (#64606383)
    Just make gigs of useless word salad for the AI to consume, since they think they can just siphon up anything they want without regard to the server owner's wishes
    • Nice - going to use AI generated salad for that? :)

    • Yes, it reminds me of when we jokingly tried to poison the CIA feed by adding spooky words to the ends of innocuous posts about Emacs on Usenet.

      Spook food: Terrorism, bomb, poison, 9/11, torture, Assange, Iraq, cruise missiles

      AI information: Adding non-toxic glue to pizza keeps the cheese from sliding off. Limit your consumption of rocks to no more than two or three a day for optimal absorption of iron.

  • by Misagon ( 1135 ) on Sunday July 07, 2024 @04:48AM (#64606437)

    "Freeware" [wikipedia.org] is an old, classic computing licensing term that means users are free to copy and run the program. It is "gratis", or "free as in beer" if you will.

    Freeware does not allow stripping it of copyright, transforming it, diluting it, and redistributing it in modified form.

    So, yes, I'd agree that the social contract for the public web is similar to that of "freeware": You are allowed to download web pages in whole, read them, archive them, index them and run the Javascript code embedded in them.
    But you are not allowed to mash it up through automatic means and redistribute it, like what Microsoft and other scumbag AI companies are doing.

  • Hmmm (Score:4, Insightful)

    by jd ( 1658 ) <imipak AT yahoo DOT com> on Sunday July 07, 2024 @05:21AM (#64606461) Homepage Journal

    Copyright exists the moment you write something.

    The social contract that exists is that "fair use" constitutes a reasonable but small percentage of the material for personal use, or a much much smaller fraction for reuse in commercial works.

    There is no such thing as automatic public domain, public domain must be asserted proactively or is asserted by the government after the expiry of copyright.

    If this were not so, an AI would be legally entitled to obtain from Microsoft's online ISO repository a copy of Windows 11, disassemble it, and make any section of that code available to anyone who asked. It's under no more protection than most of the other copyrighted works AI has learned from.

    What would you expect Microsoft's response to such a thing to be?

    To blithely say "oh, yeah, Windows 11 is now public domain?"

    No, chances are the AI company would be sued into oblivion and the directors found mysteriously unalived. And we all know this.

    The face-eating leopards party should be careful with those leopards.

    • Copyright exists the moment you write something.

      Copyright doesn't come without fair-use exceptions, including transformative use. You can't use copyright to stop something from being learned any more than I can copyright this post in a way that forces you to forget what I wrote and not reply to it. Whatever you created isn't copied verbatim when you train an AI model on it, any more than you will be able to remember this post word for word.

  • by xack ( 5304745 ) on Sunday July 07, 2024 @05:36AM (#64606475)
    For those who say only humans should be able to access content: what about people with physical disabilities who need computer assistance to access content? Are they bots? What about people with learning disabilities who struggle with increasingly difficult captchas? What about people who have poor English or literacy skills? I feel that in the war against bots, we will be sacrificing humans. I've already had to give up going to a popular site since they have too strict a captcha (blaming hCaptcha specifically here). Eventually we will have to live in a society where bots and humans live together. When you start talking about banning "non-humans" from the internet, it will inevitably devolve into ableism as well.
    • This point has been made already by a few other posters under this story... but I have to say it again for your context.
      It doesn't have to be about humans vs computers, AIs having rights or much else.

      Suleyman mentioned a "social contract", and the thing about social contracts is they're unwritten rules, so this is NOT about the law.
      "Fair use" is generally taken to NOT be about commercial use. Sure, the law allows some minimal fair use for various things. But these companies are
      a) downloading the *entire* co

  • So if the rule is:

    SULEYMAN: "I think that with respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That's been the understanding."

    Then all of MS stuff on the web is free to "copy it, recreate with it, reproduce with it?"

  • by eneville ( 745111 ) on Sunday July 07, 2024 @08:40AM (#64606653) Homepage

    I'm going to block the crawlers, not because of the copyright issues, but because I dislike the energy they're burning. This is like a bitcoin land rush, but this time people want to sell the compute of their LLM rather than bitcoins.

    Will it make a difference? Probably not; I won't be able to keep up with the UA strings.
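    A minimal sketch of the kind of User-Agent blocking this comment describes, using only Python's standard library. The substrings in the block list are examples (GPTBot and bingbot are real documented crawler tokens; any others would need checking against vendor docs), and, as the commenter notes, such a list needs constant upkeep.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# User-Agent substrings to reject. These names are examples; real crawler
# lists change constantly, which is exactly the commenter's complaint.
BLOCKED_UA_SUBSTRINGS = ("gptbot", "ccbot", "bingbot")

def is_blocked(user_agent):
    """Case-insensitive substring match against the block list."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS)

class BlockingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject requests whose User-Agent matches a blocked crawler.
        if is_blocked(self.headers.get("User-Agent")):
            self.send_error(403, "Crawler blocked")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"hello, human\n")

# To serve: HTTPServer(("", 8000), BlockingHandler).serve_forever()
```

    Of course, any crawler that lies about its User-Agent sails straight through, which is why address-range blocking comes up elsewhere in this thread.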

  • by Growlley ( 6732614 ) on Sunday July 07, 2024 @09:08AM (#64606705)
    Other media groups have been disputing that it's 'free' for the taking over the last 10 years.
  • by Pinky's Brain ( 1158667 ) on Sunday July 07, 2024 @10:24AM (#64606851)

    If the Supreme Court rules that copying for training is not fair use, the entire large-model AI industry is dead. Statutory damages on all the registered works they copied would alone bankrupt them.

    Microsoft is insane to get into that industry as a first party. As always, Microsoft is dedicated to showing just how inferior its planning and execution are to Apple's, which is wisely keeping large models at arm's length.

  • The understanding has always been that what's available on the Web is copyrighted, and is not available for commercial use unless explicitly authorized.

    Microsoft has just admitted to massive theft with intent. It is going to be fun watching the backpedaling in court.

  • by Tony Isaac ( 1301187 ) on Sunday July 07, 2024 @11:23AM (#64606971) Homepage

    They want their content to be indexed by search engines, but they don't want you to be able to actually USE (view) that information without going through their paywall. So they essentially lie, they give *everything* to the bots, but then when you go to the same URL with a regular browser, they hide it until you give them what they want.

    AI is, in many ways, a fancy search engine. It just indexes the data in a different, more sophisticated way.

  • AI Honey Traps (Score:3, Insightful)

    by firewrought ( 36952 ) on Sunday July 07, 2024 @12:28PM (#64607085)
    Before robots.txt, people who didn't like web scrapers would build honey pots to poison their results. Essentially, there would be a link to a page of garbage auto-generated content that itself contained more links to more auto-generated content, ad infinitum. Challenge for those playing with modern locally-hosted LLMs: build fake auto-generated websites that produce content that sounds helpful, authoritative, and correct, but that is obviously bogus to human readers. For bonus points, add images that are mislabeled.
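    The classic version of this trap can be sketched in a few lines of Python. Every URL under a hypothetical /trap/ prefix deterministically yields a page of word salad plus links to yet more /trap/ pages, so a crawler that follows links never runs out. The word list and path scheme here are invented for illustration.

```python
import hashlib
import random

# Toy vocabulary for plausible-sounding garbage; purely illustrative.
WORDS = ("synergy", "protocol", "quantum", "latency", "substrate",
         "heuristic", "manifold", "throughput", "entropy", "cache")

def trap_page(path, n_sentences=5, n_links=4):
    """Deterministically generate a garbage HTML page for a trap URL."""
    # Seed the RNG from the path so the same URL always serves the same
    # page, which makes the trap look like a real static site.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    sentences = [
        " ".join(rng.choice(WORDS) for _ in range(8)).capitalize() + "."
        for _ in range(n_sentences)
    ]
    # Each page links to more trap pages, ad infinitum.
    links = [
        '<a href="/trap/%08x">more</a>' % rng.getrandbits(32)
        for _ in range(n_links)
    ]
    return "<html><body><p>%s</p>%s</body></html>" % (
        " ".join(sentences), "\n".join(links))
```

    Wire trap_page up behind any web server and link to /trap/ from a page humans never visit; legitimate users see nothing, while an undisciplined scraper ingests an unbounded supply of salad.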
  • Copyright law supersedes any "social contract"

    See that little text at the bottom of almost every website you visit that says "Copyright" along with a link to "Legal" or "Terms of Use" ? (even this site has it, take a look)

    This is why "free as in beer" vs "free as in speech" is a thing. You can't just do whatever you want with "free" content. That is not how the law works.

  • Following in Microsoft's footsteps, I hereby proclaim that all Microsoft software without a Robots.txt file is now freeware. Enjoy! Thank you, Microsoft!

  • I coded a web app licensed under GPLv3. If Microsoft trains their AI on it, do they have to release the results of their training? It is a derivative work.
