AI Technology

Lawsuit Takes Aim at the Way AI Is Built (nytimes.com)

A programmer is suing Microsoft, GitHub and OpenAI over artificial intelligence technology that generates its own computer code. From a report: In late June, Microsoft released a new kind of artificial intelligence technology that could generate its own computer code. Called Copilot, the tool was designed to speed the work of professional programmers. As they typed away on their laptops, it would suggest ready-made blocks of computer code they could instantly add to their own. Many programmers loved the new tool or were at least intrigued by it. But Matthew Butterick, a programmer, designer, writer and lawyer in Los Angeles, was not one of them. This month, he and a team of other lawyers filed a lawsuit that is seeking class-action status against Microsoft and the other high-profile companies that designed and deployed Copilot.

Like many cutting-edge A.I. technologies, Copilot developed its skills by analyzing vast amounts of data. In this case, it relied on billions of lines of computer code posted to the internet. Mr. Butterick, 52, equates this process to piracy, because the system does not acknowledge its debt to existing work. His lawsuit claims that Microsoft and its collaborators violated the legal rights of millions of programmers who spent years writing the original code. The suit is believed to be the first legal attack on a design technique called "A.I. training," which is a way of building artificial intelligence that is poised to remake the tech industry. In recent years, many artists, writers, pundits and privacy activists have complained that companies are training their A.I. systems using data that does not belong to them.

The lawsuit has echoes in the last few decades of the technology industry. In the 1990s and into the 2000s, Microsoft fought the rise of open source software, seeing it as an existential threat to the future of the company's business. As the importance of open source grew, Microsoft embraced it and even acquired GitHub, a home to open source programmers and a place where they built and stored their code. Nearly every new generation of technology -- even online search engines -- has faced similar legal challenges. Often, "there is no statute or case law that covers it," said Bradley J. Hulbert, an intellectual property lawyer who specializes in this increasingly important area of the law.

This discussion has been archived. No new comments can be posted.

  • I understand, but (Score:4, Insightful)

    by LordofWinterfell ( 90845 ) on Thursday November 24, 2022 @11:12AM (#63076928)

    Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?

    • by Misagon ( 1135 )

      This story is a dupe, so I'll just link to my previous answer [slashdot.org] to a similar question.

      • by znrt ( 2424692 )

        yeah, and it's still fundamentally wrong (seems you didn't bother to even consider some of the good answers you got), plus the story is still the same inane bullshit that yet another attention whoring snowflake is trying to pull off :-)

        • by znrt ( 2424692 )

          forgot to add: but anyway this is prime clickbait material that will probably generate lots and lots of passionate comments, which is why it is a dupe. slashdot has been reusing bullshit for a while now.

      • You wrote that humans read and don't copy? *Bulldoodoo*. At the project definition phase -- top level design, that has to be true on any project not previously done. But at the lower levels -- I'll copy examples out of a manpage rather than retype it. Note: manpages don't usually have anything more than a statement (i.e. less than a function or subroutine), but for trivial examples like

        ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,

    • by Joviex ( 976416 )

      Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?

      Synthesis isn't the same as regurgitation. Humans learn, comprehend, and synthesize new answers.

      AI doesn't learn, doesn't comprehend, nor is it synthesizing. It is akin to compositing, as much as "AI" engineers want to call it not compositing work previously "stored".

      • by dynamo ( 6127 )

        Good luck proving that not synthesizing part. It's as much of a black box as any other programmer except it's incapable of explaining what's going on inside.

        Also, compositing is a subset of synthesizing.

        • No, it's really not. If you've ever used the linear regression function on a graphing calculator in high school, as most of us have, in order to get an answer to a test question, then you've basically done what today's AI does.
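For readers who have not touched that calculator function in a while, here is a minimal sketch of what a least-squares line fit does, in plain Python. The data points are made up for illustration, and `fit_line` is a hypothetical helper name, not something from any tool discussed in this thread.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up points that lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
inside = slope * 2.5 + intercept    # interpolation: inside the observed range
outside = slope * 10.0 + intercept  # extrapolation: beyond the observed range
```

The fitted model is only two numbers, and every prediction is read off the fitted line, whether the input lies inside or outside the range of the data.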

          Basically, it's not synthesizing or even creating anything, it's extrapolating and interpolating based on a provided mathematical model. Let me know when you've figured out a mathematical model for extrapolating the next best selling novel. And even if you do, by the time you have, yo

          • by Rei ( 128717 )

            If you've ever used the linear regression function on a graphing calculator in high school, as most of us have, in order to get an answer to a test question, then you've basically done what today's AI does.

            The key defining aspect of neural networks is that they're nonlinear. The activation function exists specifically to introduce nonlinearity into the network. A neural network without an activation function (or using a linear relation as the activation function) can be mathematically reduced to just a si

            • The key defining aspect of neural networks is that they're nonlinear.

              /facepalm

              That's not at all what I meant. You remind me of a situation a few days ago where I told a guy to use username@server format to log in to his server, and so in his console he literally typed username@server. Please use a bit of common sense, and don't be a moron.

              • by Rei ( 128717 )

                No, it is not some sort of cosmetic difference. Linear regressions can solve for noniterative linear relationships. Neural nets can solve for (far more complex) nonlinear relations with iterative feedback processes. That's literally why they're chosen over linear regression.
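The point about activation functions can be checked with a small sketch in plain Python. The weights below are hypothetical values chosen only for illustration. Two stacked layers with no activation give the same output as a single layer whose weight matrix is their product, while putting a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply(w, v):
    """Apply matrix w to a vector v given as a flat list."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def relu(v):
    """Elementwise ReLU activation."""
    return [max(0.0, x) for x in v]


# Hypothetical weights and input, for illustration only.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers with no activation in between...
stacked = apply(W2, apply(W1, x))
# ...match one linear layer whose weights are the product W2 * W1.
collapsed = apply(matmul(W2, W1), x)

# With a ReLU between the layers, the single collapsed matrix no longer
# reproduces the output.
nonlinear = apply(W2, relu(apply(W1, x)))
```

Here `stacked` and `collapsed` agree, and `nonlinear` differs because the ReLU zeroes the negative intermediate value.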

                You cannot build something like DALL-E or GitHub Copilot with linear regression. Neural nets are a subset of statistics solvers, and linear regression solvers are also a subset of statistics solvers, but neural nets are not a subse

      • by ceoyoyo ( 59147 )

        AI doesn't learn, doesn't comprehend, nor is it synthesizing.

        It's easy to just write a sentence making claims hey? "Comprehend" is pretty loaded, good luck proving you comprehend in any way that's more rigorous than you can do for an AI, but doesn't learn and doesn't synthesize? Lol.

        • AI in the article does not "comprehend". For example, it cannot produce flow-chart of quick sort or write an article on the concept or pros and cons of different sorting algorithms. It is CliffNotes on programming. If you plagiarize the code it produces verbatim - it is cheating.
          • A search engine doesn't comprehend anything either, but imagine the web if we barred google and all the others from downloading and storing derivative information such as indexes. Sure there can be limits - google used to serve up verbatim copies from its cache which was struck down - but being barred from modeling publicly posted content would be terrible.
          • by ceoyoyo ( 59147 )

            You seem to have a problem with articles in your writing. Should we conclude you don't comprehend English? Or just recognize that it's probably not your native language, you lack a few specific skills, but you probably comprehend just fine?

            There's a long history of people pointing out specific "deficits" in AI systems, which researchers take as feedback for improving them. It wouldn't be difficult to train a system that can already synthesize code to make flow charts.

      • by Rei ( 128717 )

        AI doesn't learn, doesn't comprehend, nor is it synthesizing.

        And what are you backing that statement up with apart from tautology and metaphysics?

        Yes, neural networks ARE just incredibly complex statistics solvers.

        But so are we.

        We have a lot more compute power, larger networks, and probably better network architectures too (though also a number of disadvantages to our wetware). But at a fundamental level, we too are just solving statistical relationships between inputs and outputs, minimizing a loss function

      • Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?

        Synthesis isn't the same as regurgitation. Humans learn, comprehend, and synthesize new answers.

        Does not matter. Numbers and the various alphabets were invented by someone, and using them without compensation is theft. Change my mind - what synthesis would you do if you had to invent everything, lest you be infringing?

    • Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?

      Yep. This is WIPO taken to an extreme. Code shouldn't be subject to patents or copyright in the first place - there's nothing new/original under the sun - everything's similar to or a derivative of something else. If we let the trolls get away with this, we'll end up paying them for the privilege of being allowed to write code & only companies with big, expensive legal teams will be able to publish anything.

    • by ceoyoyo ( 59147 )

      Yes. It's silly, but I suppose if you have nothing better to do, silly lawsuits might be lucrative.

      • by Rei ( 128717 )

        It depends. I'm a big supporter of AI tools (I'm particularly into AI art), but one has to draw the line at how transformative the process is (as this is a basic element of copyright law). You can reproduce styles and motifs. But your action must be transformative. So the question comes down to, "How much is the network overtrained to the dataset?"

        With StableDiffusion, contrary to popular perception, you cannot reproduce specific images in the dataset, unless they're in the dataset thousands of times (li

        • by ceoyoyo ( 59147 )

          That's a good point. I was objecting more to the idea that "AI" generated works are infringement because the system learns from examples.

          If a method, human, AI, a million monkeys with typewriters, whatever, produces an infringing work, then it's infringement, as the world's forgers know. If it doesn't, then it's not, as the world's legitimate artists know. We decide which is which on a case-by-case basis. A particular method that you can carefully goad into producing an infringing work isn't globally infring

        • by Rei ( 128717 )

          ** Might lose

    • by ewibble ( 1655195 ) on Thursday November 24, 2022 @02:53PM (#63077370)

      To me both should be acceptable; humans learn by copying things and then hopefully improving them. If AI can help, then why not? We all should benefit.

      I do hope the programmer wins the case though, not because I think it's right, just because companies have been slowly copyrighting everything, charging us in perpetuity for everything they can but using the ideas that humanity has come up with for free. It's just hypocrisy.

      While some copyrights/patents are OK and possibly needed, it has reached crazy levels. But nothing will come of this; at best they will have a trivial fine imposed on them and continue to try to take ownership of everything they can.

      To me, I am so disappointed in the internet. When I first saw it, I thought it would be a great way for people to share ideas and learn from each other for a better world. Unfortunately the internet has turned into a big shop that is used to control people and make everything a product. Oh yes, and to put down people you disagree with.

      • Your baked in assumption is that only big for profit corporations will be able to release models like Copilot. In reality there are free implementations and they are getting better and better. The gap is just a year or a few months.

        Look at large language models this way - they can boost anyone's skills, they are a new form of code reuse, they lower the barrier and democratise access, like a rising tide lifting all the boats. You can use a language model in privacy on your computer, unlike web search whi
    • Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?

      We must seek out and reward the descendents of the person who created the alphabet, as all works developed since then are using the tools the creator of the alphabets created. To do anything otherwise is pure and simple theft.

      Same with math - the theft of the intellectual property by programmers is unconscionable. 8^/

  • Copyright - which is the only legal tool he has here - covers use of the original work. It doesn't cover it if it's been through a neural net and comes back mashed up with other stuff, and there's certainly no law against digesting it if it's free on GitHub. Plenty of sharks tried this with music (Tune A sounds like tune B therefore copyright infringement by author of B). Doesn't fly.

    Whether having autocomplete on steroids is much use in the long run I've no idea, as it'll have little idea of where you want yo

    • Re: (Score:2, Informative)

      by Anonymous Coward
      When machine uses database and the database has copyrights, the resulting new database is also under copyright IMO.

      It is not clear to me whether output of neural networks creating code from the copyrighted works also fall under copyright. I think it should. This is different from humans which have rights to have experience which they can use to create new copyrighted works.
    • by Tx ( 96709 ) on Thursday November 24, 2022 @12:05PM (#63077064) Journal

      There may or may not be any room for judges to interpret copyright law in a way that might cover this case. That doesn't make it pointless; a trial can bring to light circumstances under which current laws have become inadequate, and by the various processes that exist, need to be updated. We are at the point where you can train an AI on an artist's works and then have it spit out new works in his style, and I think it is reasonable to at least ponder whether that is ok, because that sure wasn't possible when existing laws were written.

    • by phantomfive ( 622387 ) on Thursday November 24, 2022 @12:09PM (#63077066) Journal

      [Copyright] doesn't cover it if it's been through a neural net and comes back mashed up with other stuff

      [Citation needed]. There is nowhere in copyright law that explicitly states this, and there are no cases that address it.

    • by Waffle Iron ( 339739 ) on Thursday November 24, 2022 @12:15PM (#63077078)

      Copyright - which is the only legal tool he has here - covers use of the original work.

      You forgot that it also covers "derived works". You cannot distribute your own sequel to "The Lion King", even if you wrote it all.

      It doesn't cover it if it's been through a neural net and comes back mashed up with other stuff, and there's certainly no law against digesting it if it's free on GitHub.

      No, they argue that doing that is creating derivative works from the originals. It doesn't help that this system has been known to regurgitate recognizable chunks of code from its training samples.

      You could argue that these AI systems are so similar to human brains that they're doing the same things people do when they learn from code examples, so it's all good. But then you would open the door to legal theories such as AI systems having basic human rights.

      • You could argue that these AI systems are so similar to human brains that they're doing the same things people do when they learn from code examples, so it's all good.

        Humans have lost copyright cases for pulling recognizable chunks of music out of their neural networks (ie, brains).

      • > You forgot that it also covers "derived works".

        I think you're mixing two concepts here - expression and idea. You can copyright an expression of an idea, but not the idea. It is fair use to learn the ideas, the concepts. Like knowing how you can open a file in Python. So they can train their models on the abstract parts of the copyrighted works, because you can't copyright those. As long as the model does not replicate the training set, it just learned the concepts.
        • But the models don't just store "concepts". Like I said, people have found identifiable chunks of training code in the output. That means that they must be storing at least some of the "expression" itself.

          • Indeed, despite your poor English, I see many sentence fragments that were undoubtedly used by someone else previously. You are therefore guilty of theft and should be fined and prevented from writing.

    • Re: (Score:2, Informative)

      by pjt33 ( 739471 )

      How do you put the original work through a neural net without copying it?

    • Copyright law, at least in Berne Convention signatories, specifically recognizes 'derivative works' and, while those works are copyright protected in themselves, recognizes the right of the copyright holder of the original work to control the creation of derivative works from it. Here [cornell.edu] is the US implementation, exact scope varies somewhat by jurisdiction.

      The only part of it that's an open question is whether AI systems trained on copyrighted inputs will be judged to be sufficiently transformative or other
      • Talking as if it is a hard problem to check strings for duplication. If you have a match, then generate another string, until it passes the originality test. For efficiency you can use bloom filters. But I don't think all snippets should be copyrighted - especially if they appear in many places identically or with slight variations. Like fizz-buzz.
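A minimal sketch of the duplicate check described above, assuming the training texts are available as plain strings and that "duplication" means sharing any fixed-length character window. The `BloomFilter` class, the 40-character window, and the hashing scheme are illustrative assumptions, not a description of how Copilot actually works. A Bloom filter can give false positives (an original string flagged as copied) but never false negatives.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative defaults, not tuned values."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def windows(text, n=40):
    """Yield every n-character window of the whitespace-normalised text."""
    text = " ".join(text.split())
    for i in range(max(len(text) - n + 1, 0)):
        yield text[i:i + n]


def build_index(training_texts, n=40):
    """Add every window of every training text to a Bloom filter."""
    index = BloomFilter()
    for text in training_texts:
        for w in windows(text, n):
            index.add(w)
    return index


def looks_copied(candidate, index, n=40):
    """True if any window of the candidate is (probably) in the training set."""
    return any(w in index for w in windows(candidate, n))
```

A generator could call `looks_copied` on each suggestion and resample when it returns True. Candidates shorter than the window size are never flagged, which also leaves very short, common snippets alone.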
    • by Sique ( 173459 )
      If the code comes out barely changed (e.g. just with other variable and function names), then it's still plagiarism. The hurdle to conquer for a new original work is quite high. The way this A.I. is supposed to work, it does at most create derivatives, which require a license for the original.
    • > it'll have little idea of where you want your code to go next

      The next version of Copilot will allow users to interact with the code in an iterative manner, refining and describing what they want. Another skill is interpreting error messages and auto-fixing simple bugs. The problem you raise might be caused by the UI, not a fundamental limitation for language models.
    • by catprog ( 849688 )

      What about this scenario?

      https://www.claytonutz.com/kno... [claytonutz.com]

  • IANAL, but I see some issues with that:
    1) Like a lot of data used for AI training, it is either actually owned by the company producing the training set, or it is publicly available.
    2) When you work at Microsoft, your code is probably a "work for hire", meaning they own it. This is the case at most companies
    3) When I create an original song "influenced by ZZTop", or an original work of art "influenced by Dali", it's not piracy.
    4) Ever implemented a sorting algorithm, and looked at a textbook example?
    • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Thursday November 24, 2022 @11:42AM (#63077000) Homepage Journal

      IANAL, but I see some issues with that:
      1) Like a lot of data used for AI training, it is either actually owned by the company producing the training set, or it is publicly available.

      The training set is based in part on GPL (and other licensed) code, but the training set is not being made publicly available, nor are the sources which went into it, as the license requires. A training set is not a brain, or in your brain, or part of your intellect, it's just data. And that data is a derivative work of the copyrighted material it's based on. Publicly available doesn't mean unlicensed. Microsoft is not complying with the licenses of the software the training data is based on.

      This is a problem for these models in general, not just Microsoft's. But it's a bigger problem for Microsoft because their tool is spitting out recognizable lines of unique code.

      • Microsoft is not complying with the licenses of the software the training data is based on.

        Now we care about licenses and copyright?

        It would be nice if people would make up their mind. Either licenses and copyright need to be respected or they don't. If they do then penalties apply to everyone who violates said license and/or copyright. You can't make excuses for doing so.

        • Now we care about licenses and copyright?

          Personally, I care about copyleft, but that's powered by copyright...

          It would be nice if people would make up their mind.

          I'm only one person, with one account, so I'm not responsible for "people" making up their mind. Also, this issue is more nuanced than you are trying to make it out to be.

          Either licenses and copyright need to be respected or they don't.

          That's a stupid thing to say, you're reducing a complex issue to a one-line argument.

          If they do then penalties apply to everyone who violates said license and/or copyright.

          Those penalties are enforced extremely unevenly, I note you didn't say equally. But what would that even mean? Percentages of revenue? What about copyright infringement not for commercial

          • Personally, I care about copyleft, but that's powered by copyright...

            It's a mishmash of stupid debating tactics: assume you're making a wildly extreme version of some stance and use that to "prove" that you're dumb because it's inconsistent with a different extreme viewpoint you're also assumed to hold.

            This one, namely that if you think there's anything at all to criticize in copyright law then you hate all copyright and obviously love the GPL so you clearly hate it, was a popular argument like... 20 years

          • That's a stupid thing to say, you're reducing a complex issue to a one-line argument.

            It's not a complex issue. Either you (the big you and not you personally) believe licenses and copyright should be respected or you don't. There isn't a middle ground in this case. There have been a few stories recently on here where someone complains the government or business is using their software either without paying them or not respecting the license and everyone is jumping to the defense of the programmer and spo

            • Either you (the big you and not you personally)

              See, this is the first place we break company. There is no big you when it comes to having an opinion. Everybody has their own, even if they've rented it from someone else it's still effectively theirs.

              believe licenses and copyright should be respected or you don't.

              No, I think that some licenses shouldn't be legal at all, and aspects of others should be mandatory. I don't have to agree with the existing IP regime, and suggesting that I have to is both wrong, and inherently offensive.

        • It would be nice if people would make up their mind.

          It'd be even nicer if people stopped ignoring that groups are not some monolith, and nuance means leaning one way on one issue, and another on another issue that is related, based on specifics WHILE remaining logically consistent.

          Seriously folks, this is stupid.

      • by galabar ( 518411 )
        Do we know that Microsoft uses anything other than truly open source code (Berkeley/Apache/etc. License)?
        • Nearly all of those licenses require that the license, attribution, or something stay with the code.

      • by narcc ( 412956 ) on Thursday November 24, 2022 @05:02PM (#63077586) Journal

        And that data is a derivative work of the copyrighted material it's based on.

        I'd stay away from that particular legal term here. It doesn't make any sense, as I'll explain.

        Publicly available doesn't mean unlicensed. Microsoft is not complying with the licenses of the software the training data is based on.

        It's important to remember that the training data is not the model. What is encoded by the model is essentially just a little statistical information about the training data. It doesn't really make sense to talk about copyright or licensing here.

        As an example, let's try a simple Markov chain text generator. Why a Markov chain? It doesn't leave any room for mystery. It's intuitive, you can understand everything about its operation, and it's very easy to implement. You can also easily find one online if you're not up for making your own.

        Start by feeding it just a few pages of text and take a look at the kind of output you get. If yours was operating on individual characters, you'd be lucky to even get recognizable words! Feed it until you start to get good output, preferably from different authors, and pick out your best examples. If you didn't know anything about how the output was produced, would you say that anything about those violated the copyright of any of the text you included? I'd bet against it. Far too much information about the training data was lost.
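A minimal sketch of such a generator, for anyone who wants to try the experiment. It works on words rather than characters, and the function names and the `order` parameter are illustrative choices rather than any standard implementation.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Random-walk the chain, starting from a randomly chosen state."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    out = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            # Dead end: this state only occurred at the very end of the text.
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)
```

Usage is `generate(build_chain(open("corpus.txt").read()), length=50)`, where `corpus.txt` is a placeholder for whatever text you feed it. One thing worth watching for: with a small corpus and a higher `order`, most states have only one follower, so the output tends to repeat long stretches of the input, and that effect fades as more text from different authors is added.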

        If we can't make the case for our Markov chain, what hope does our NN have? What our NN is encoding isn't obvious, so you might be tempted to say something like "the training data is still in there, we just can't see it". But we don't need to make guesses or lean on our intuition. Just think about what happens when the size of the training data exceeds the size of the model...

        GPT-3, which this is based on, has 175 billion parameters. That's a lot, but it was trained on 45TB of data. That's a lot of text per parameter. Both models lose information, but the NN loses quite a bit more. Anything original about any of the works used during training is completely lost.
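The ratio can be worked out directly. This sketch uses the figures exactly as quoted in the comment above (they are the commenter's numbers and are not verified here) and assumes, purely for illustration, that each parameter is stored as a 16-bit float.

```python
# Figures as quoted in the comment; not independently verified.
parameters = 175e9       # 175 billion parameters
training_bytes = 45e12   # 45 TB, taking 1 TB = 10**12 bytes

bytes_per_parameter = training_bytes / parameters
print(f"{bytes_per_parameter:.0f} bytes of training text per parameter")

# Assumption for illustration: 2 bytes (16-bit float) per parameter.
model_bytes = parameters * 2
size_ratio = training_bytes / model_bytes
print(f"training data is about {size_ratio:.0f}x larger than the model")
```

On these numbers the training data comes to roughly 257 bytes per parameter, or about 129 times the size of the stored model under the 16-bit assumption.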

        But it's a bigger problem for Microsoft because their tool is spitting out recognizable lines of unique code.

        Is it? If you have a link, I'd love to see it. Such a thing shouldn't be possible.

        • Very good reply, I would mod you if I had points. I want to fix one detail - I think they do regurgitate some data, but it is just 1% or less. And it can be easily filtered out by referencing the training set.
        • See this twitter post [twitter.com] where Copilot spits out John Carmack's fast inverse square root from Quake (comments and everything), and then adds an MIT-type license (line by line). Both texts (the code and the license) seem to be stored verbatim in the net.
          • Just for reference, this is the original code. [softwareheritage.org]
          • by narcc ( 412956 )

            Ah, that's because that particular bit of code was included many, many times in the training data. Had it been included just once or just a few times, like you'd expect would be the case for most projects, it wouldn't be possible for it to show up like this at all. With a well-known bit of code that has been copied and discussed endlessly, this isn't terribly surprising. It's worth pointing out that this sort of thing could have been prevented. I'm guessing they just didn't anticipate this or didn't think

      • > their tool is spitting out recognizable lines of unique code.

        Saying that like it's murder. And usually it's just one line or a few words that are similar with another repo here and there. Not whole files, because humans interact with it during the process. It generates 99% original code and that 1% can be filtered in post processing.
        • Not murder, just indicative of what's actually happening. This is not just code style or structure, this is code. Just go back to some of the prior discussions on this here, there are examples in 'em

    • IANAL, but I see some issues with that: .... A case based on copying existing blocks of code might have some merit, but one based on the notion that AI generated code is a work derived from the training data seems like a stretch.

      Yes, it's a stretch, but the District Court of East Texas is pretty elastic and has more than a couple of head-scratcher decisions in IP law. Also consider that if Microsoft can make an AI that is able to assist coding, it would be trivial to train the AI to find plagiarized code too.

      • Let's assume we ban generative AI because copyrights. But companies in other countries build on them. Soon we'll be at a disadvantage. Is this the best choice for our future?
  • … except it's mission critical code instead of commas and jeans. Legalities will be dealt with, but… There's still carbon-based testing and QC down the line, right? RIGHT?

  • Heh, students think their textbooks are expensive now. Wait until they have to pay royalties every time they use their learned knowledge!

    • If you copy directly from the textbook and try to make money off the copy, then yeah, you're going to have to pay royalties.

      • If "direct copy" is the standard, then you can rest assured. 99% of the outputs are original in that sense down to 11 consecutive words (11-grams). So easy to generate another one if it comes duplicate.
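A minimal sketch of how a figure like that could be measured, assuming the reference corpus fits in memory and that "original" means no shared run of `n` consecutive words. The function names are illustrative, and the 99% figure above is the commenter's claim, not something this code establishes.

```python
def word_ngrams(text, n=11):
    """Return the set of all runs of n consecutive words in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(output, corpus_texts, n=11):
    """Fraction of the output's word n-grams that also occur in the corpus."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        # Output shorter than n words: nothing to compare.
        return 0.0
    corpus_grams = set()
    for text in corpus_texts:
        corpus_grams |= word_ngrams(text, n)
    return len(out_grams & corpus_grams) / len(out_grams)
```

A result of 0.0 means no run of `n` words was copied. Anything above a chosen threshold could trigger regenerating the suggestion.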
  • You start to type and it suggests whole paragraphs to include in what you are writing as it infers your intent. And by bootstrapping the paragraphs you accept it can keep suggesting paragraphs until you decide you are done.

    We are actually at that point now, though I have not seen a writing tool offering that provides exactly this. We do have email and chat services that suggest the next few words, and short responses (as typing "Sounds good" is too taxing for many people's time and creativity it seems) and

    • You missed the point of the Galactica model - it is a citation search engine. You input a piece of text from a paper, maybe the paper you're currently writing, and it predicts the right citations. It's free lit-research. Another utility is to straighten up the language a bit - many authors are not native English speakers. Another utility is to generate teaching material - you can ask it to explain any section you don't understand, or a general concept.

      Since it is just a language model trained on papers,
  • The short answer to "is this illegal" is "we don't know because it hasn't been adjudicated yet" and that's what courts are for. The argument for it being illegal is pretty straightforward - Microsoft is redistributing code without following the licensing terms of that code. Sounds simple. But the counter argument is that Microsoft isn't actually redistributing code, but using that code privately without redistribution (which is allowed by virtually every open source license I'm aware of) and the resultin
  • This is a silly lawsuit, but even if it goes through Microsoft et al can just laugh and give this guy a giant Fuck You by requiring release to use the code for AI purposes in the EULA. Don't like it, don't store your code there asshole.
  • There are many legal questions about Copilot whose answers I don't know, and maybe nobody knows. And it's likely some of them depend on the country you're in [because of the copyright laws in those countries.] In the U.S. we won't be able to have reliable answers until there are court cases about it, and who knows how many years it'll take for those court cases to arise and be finally decided. So basically what we have is a gigantic amount of uncertainty.

    From Slashdot story: Richard Stallman Shares His Concerns About GitHub's Copilot -- and About GitHub
    https://news.slashdot.org/stor... [slashdot.org]

  • by nester ( 14407 )

    This story is a repost.

  • Wait... isn't machine learning just like, ya know, learning? And isn't that what everyone does? How is this plagiarism, exactly?
  • This is just IP theft on an industrial scale. The AI doesn't learn how to write code it just learns what the code does, this is a sort function, this is a search function, and where to find it. When a programmer says, I need a sort function here, Copilot goes and finds one. It doesn't write it from scratch it steals one from GitHub. I think of it as more like a database of pirated code snippets.
