Lawsuit Takes Aim at the Way AI Is Built (nytimes.com) 83
A programmer is suing Microsoft, GitHub and OpenAI over artificial intelligence technology that generates its own computer code. From a report: In late June, Microsoft released a new kind of artificial intelligence technology that could generate its own computer code. Called Copilot, the tool was designed to speed the work of professional programmers. As they typed away on their laptops, it would suggest ready-made blocks of computer code they could instantly add to their own. Many programmers loved the new tool or were at least intrigued by it. But Matthew Butterick, a programmer, designer, writer and lawyer in Los Angeles, was not one of them. This month, he and a team of other lawyers filed a lawsuit that is seeking class-action status against Microsoft and the other high-profile companies that designed and deployed Copilot.
Like many cutting-edge A.I. technologies, Copilot developed its skills by analyzing vast amounts of data. In this case, it relied on billions of lines of computer code posted to the internet. Mr. Butterick, 52, equates this process to piracy, because the system does not acknowledge its debt to existing work. His lawsuit claims that Microsoft and its collaborators violated the legal rights of millions of programmers who spent years writing the original code. The suit is believed to be the first legal attack on a design technique called "A.I. training," which is a way of building artificial intelligence that is poised to remake the tech industry. In recent years, many artists, writers, pundits and privacy activists have complained that companies are training their A.I. systems using data that does not belong to them.
The lawsuit has echoes in the last few decades of the technology industry. In the 1990s and into the 2000s, Microsoft fought the rise of open source software, seeing it as an existential threat to the future of the company's business. As the importance of open source grew, Microsoft embraced it and even acquired GitHub, a home to open source programmers and a place where they built and stored their code. Nearly every new generation of technology -- even online search engines -- has faced similar legal challenges. Often, "there is no statute or case law that covers it," said Bradley J. Hulbert, an intellectual property lawyer who specializes in this increasingly important area of the law.
I understand, but (Score:4, Insightful)
Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?
Re: (Score:3)
This story is a dupe, so I'll just link to my previous answer [slashdot.org] to a similar question.
Re: (Score:3)
yeah, and it's still fundamentally wrong (seems you didn't bother to even consider some of the good answers you got), plus the story is still the same inane bullshit that yet another attention whoring snowflake is trying to pull off :-)
Re: (Score:2)
forgot to add: but anyway this is prime clickbait material that will probably generate lots and lots of passionate comments, which is why it is a dupe. slashdot has been reusing bullshit for a while now.
Re:I 'think' I understand, but (Score:2)
You wrote that humans read and don't copy? *Bulldoodoo*. At the project definition phase -- top level design, that has to be true on any project not previously done. But at the lower levels -- I'll copy examples out of a manpage rather than retype it. Note: manpages don't usually have anything more than a statement (i.e. less than a function or subroutine), but for trivial examples like
($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
Re: (Score:1)
Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?
Synthesis isn't the same as regurgitation. Humans learn, comprehend, and synthesize new answers.
AI doesn't learn, doesn't comprehend, and isn't synthesizing. It is akin to compositing previously "stored" work, however much "AI" engineers insist that it is not.
Re: (Score:2)
Good luck proving that not synthesizing part. It's as much of a black box as any other programmer except it's incapable of explaining what's going on inside.
Also, compositing is a subset of synthesizing.
Re: I understand, but (Score:1)
No, it's really not. If you've ever used the linear regression function on a graphing calculator in high school, as most of us have, in order to get an answer to a test question, then you've basically done what today's AI does.
Basically, it's not synthesizing or even creating anything, it's extrapolating and interpolating based on a provided mathematical model. Let me know when you've figured out a mathematical model for extrapolating the next best selling novel. And even if you do, by the time you have, yo
Re: (Score:3)
The key defining aspect of neural networks is that they're nonlinear. The activation function exists specifically to introduce nonlinearity into the network. A neural network without an activation function (or using a linear relation as the activation function) can be mathematically reduced to just a si
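The claim about activation functions can be checked numerically. The following is a minimal sketch in plain Python, with made-up weights `W1`, `W2` and input `x` (none of these come from the thread): two stacked linear layers give the same result as one layer whose weights are the product of the two, and putting a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Apply max(0, v) to every entry."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights for a two-layer network; x is a column vector.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [[1.0], [2.0]]

# Without an activation, two layers equal one layer with weights W2 * W1.
two_layers = matmul(W2, matmul(W1, x))
one_layer = matmul(matmul(W2, W1), x)

# With ReLU between the layers, the collapsed matrix no longer gives the same answer.
with_relu = matmul(W2, relu(matmul(W1, x)))

print(two_layers, one_layer, with_relu)
```

The first two results agree, while the ReLU version differs because the negative intermediate value is clipped to zero before the second layer sees it.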
Re: (Score:2)
The key defining aspect of neural networks is that they're nonlinear.
/facepalm
That's not at all what I meant. You remind me of a situation a few days ago where I told a guy to use username@server format to log in to his server, and so in his console he literally typed username@server. Please use a bit of common sense, and don't be a moron.
Re: (Score:2)
No, it is not some sort of cosmetic difference. Linear regressions can solve for non-iterative linear relationships. Neural nets can solve for (far more complex) nonlinear relations with iterative feedback processes. That's literally why they're chosen over linear regression.
You cannot build something like DALL-E or Github Copilot with linear regression. Neural nets are a subset of statistics solvers, and linear regressions solvers are also a subset of statistics solvers, but neural nets are not a subse
Re: (Score:2)
It's easy to just write a sentence making claims hey? "Comprehend" is pretty loaded, good luck proving you comprehend in any way that's more rigorous than you can do for an AI, but doesn't learn and doesn't synthesize? Lol.
Re: (Score:2)
You seem to have a problem with articles in your writing. Should we conclude you don't comprehend English? Or just recognize that it's probably not your native language, you lack a few specific skills, but you probably comprehend just fine?
There's a long history of people pointing out specific "deficits" in AI systems, which researchers take as feedback for improving them. It wouldn't be difficult to train a system that can already synthesize code to make flow charts.
Re: (Score:2)
And what are you backing that statement up with apart from tautology and metaphysics?
Yes, neural networks ARE just incredibly complex statistics solvers.
But so are we.
We have a lot more compute power, larger networks, and probably better network architectures too (though also a number of disadvantages to our wetware). But at a fundamental level, we too are just solving statistical relationships between inputs and outputs, minimizing a loss function
Re: (Score:2)
Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?
Synthesis isn't the same as regurgitation. Humans learn, comprehend, and synthesize new answers.
Doesn't matter. Numbers and the various alphabets were invented by someone, and using them without compensation is theft. Change my mind: what synthesis would you do if you had to invent everything, lest you be infringing?
Re: (Score:2)
Re: (Score:2)
Yes. It's silly, but I suppose if you have nothing better to do, silly lawsuits might be lucrative.
Re: (Score:3)
It depends. I'm a big supporter of AI tools (I'm particularly into AI art), but one has to draw the line at how transformative the process is (as this is a basic element of copyright law). You can reproduce styles and motifs. But your action must be transformative. So the question comes down to, "How much is the network overtrained to the dataset?"
With StableDiffusion, contrary to popular perception, you cannot reproduce specific images in the dataset, unless they're in the dataset thousands of times (li
Re: (Score:2)
That's a good point. I was objecting more to the idea that "AI" generated works are infringement because the system learns from examples.
If a method, human, AI, a million monkeys with typewriters, whatever, produces an infringing work, then it's infringement, as the world's forgers know. If it doesn't, then it's not, as the world's legitimate artists know. We decide which is which on a case-by-case basis. A particular method that you can carefully goad into producing an infringing work isn't globally infring
Re: (Score:2)
** Might lose
Re:I understand, but (Score:4, Insightful)
To me both should be acceptable: humans learn by copying things and then, hopefully, improving on them. If AI can help, then why not? We should all benefit.
I do hope the programmer wins the case though, not because I think it's right, but because companies have been slowly copyrighting everything and charging us in perpetuity for everything they can, while using the ideas that humanity has come up with for free. It's just hypocrisy.
Some copyrights and patents are OK and possibly needed, but it has reached crazy levels. Still, nothing will come of this. At best they will have a trivial fine imposed on them and will continue to try to take ownership of everything they can.
I am so disappointed in the internet. When I first saw it, I thought it would be a great way for people to share ideas and learn from each other for a better world. Unfortunately the internet has turned into a big shop that is used to control people and make everything a product. Oh yes, and to put down people you disagree with.
Re: (Score:2)
Look at large language models this way - they can boost anyone's skills, they are a new form of code reuse, they lower the barrier and democratise access, like a rising tide lifting all the boats. You can use a language model in privacy on your computer, unlike web search whi
Re: (Score:2)
Would that also mean humans, reading code posted online to learn how to code, have to give money to everyone whose works they learned from?
We must seek out and reward the descendants of the person who created the alphabet, as all works developed since then are using the tools the creator of the alphabets created. To do anything otherwise is pure and simple theft.
Same with math - the theft of the intellectual property by programmers is unconscionable. 8^/
Good luck with that (Score:1)
Copyright - which is the only legal tool he has here - covers use of the original work. It doesn't cover it if it's been through a neural net and comes back mashed up with other stuff, and there's certainly no law against digesting it if it's free on github. Plenty of sharks tried this with music (Tune A sounds like tune B therefore copyright infringement by author of B). Doesn't fly.
Whether having autocomplete on steroids is much use in the long run I've no idea, as it'll have little idea of where you want yo
Re: (Score:3)
"Parody" has special protection in US copyright law. That is also how, for instance, pornographers can get away with such things as "Star Wars - The Porn Parody" and "Pirates of the Caribbean - The Porn Parody" without being sued into oblivion by Disney.
And BTW, Weird Al always asks for permission first.
Re: Good luck with that (Score:3, Insightful)
Re: (Score:2)
Re: (Score:2)
"Original music" in this case means: He plays from the same score sheets as the original.
Thanks. It’s amazing how many ACs lack basic comprehension skills.
Re: (Score:2)
Doesn't Weird Al have to secure rights to every song he parodies?
No. But he does anyway, because he wants to be well-liked. SFSG.
Re: (Score:2, Informative)
It is not clear to me whether the output of neural networks that create code from copyrighted works also falls under copyright. I think it should. This is different from humans, who have the right to draw on their experience to create new copyrighted works.
Re:Good luck with that (Score:5, Insightful)
There may or may not be any room for judges to interpret copyright law in a way that might cover this case. That doesn't make it pointless; a trial can bring to light circumstances under which current laws have become inadequate, and by the various processes that exist, need to be updated. We are at the point where you can train an AI on an artist's works and then have it spit out new works in his style, and I think it is reasonable to at least ponder whether that is ok, because that sure wasn't possible when existing laws were written.
Re:Good luck with that (Score:4, Informative)
[Copyright] doesn't cover it if it's been through a neural net and comes back mashed up with other stuff
[Citation needed]. There is nowhere in copyright law that explicitly states this, and there are no cases that address it.
Re: (Score:2)
Well done. It needs to explicitly state it if it IS the case, not if it isn't.
Re:Good luck with that (Score:4, Informative)
Source code is explicitly stated to be copyrightable. There is no affirmative defense for putting it through a neural network. At least, not yet.
Re:Good luck with that (Score:4, Informative)
Copyright - which is the only legal tool he has here - covers use of the original work.
You forgot that it also covers "derived works". You cannot distribute your own sequel to "The Lion King", even if you wrote it all.
It doesn't cover it if it's been through a neural net and comes back mashed up with other stuff, and there's certainly no law against digesting it if it's free on github.
No, they argue that doing that is creating derivative works from the originals. It doesn't help that this system has been known to regurgitate recognizable chunks of code from its training samples.
You could argue that these AI systems are so similar to human brains that they're doing the same things people do when they learn from code examples, so it's all good. But then you would open the door to legal theories such as AI systems having basic human rights.
Re: (Score:2)
You could argue that these AI systems are so similar to human brains that they're doing the same things people do when they learn from code examples, so it's all good.
Humans have lost copyright cases for pulling recognizable chunks of music out of their neural networks (ie, brains).
Re: (Score:3)
I think you're mixing two concepts here - expression and idea. You can copyright an expression of an idea, but not the idea. It is fair use to learn the ideas, the concepts. Like knowing how you can open a file in Python. So they can train their models on the abstract parts of the copyrighted works, because you can't copyright those. As long as the model does not replicate the training set, it just learned the concepts.
Re: (Score:2)
But the models don't just store "concepts". Like I said, people have found identifiable chunks of training code in the output. That means that they must be storing at least some of the "expression" itself.
Re: (Score:2)
Indeed, despite your poor English, I see many sentence fragments that were undoubtedly used by someone else previously. You are therefore guilty of theft and should be fined and prevented from writing.
Re: (Score:2, Informative)
How do you put the original work through a neural net without copying it?
Re: (Score:2)
The only part of it that's an open question is whether AI systems trained on copyrighted inputs will be judged to be sufficiently transformative or other
Re: (Score:2)
The next version of Copilot will allow users to interact with the code in an iterative manner, refining and describing what they want. Another skill is interpreting error messages and auto-fixing simple bugs. The problem you raise might be caused by the UI, not a fundamental limitation for language models.
Re: (Score:2)
What about this scenario?
https://www.claytonutz.com/kno... [claytonutz.com]
Sounds like a stretch (Score:2)
1) Like a lot of data used for AI training, it is either actually owned by the company producing the training set, or it is publicly available.
2) When you work at Microsoft, your code is probably a "work for hire", meaning they own it. This is the case at most companies
3) When I create an original song "influenced by ZZTop", or an original work of art "influenced by Dali", it's not piracy.
4) Ever implemented a sorting algorithm, and looked at a textbook example?
Re:Sounds like a stretch (Score:4, Insightful)
IANAL, but I see some issues with that:
1) Like a lot of data used for AI training, it is either actually owned by the company producing the training set, or it is publicly available.
The training set is based in part on GPL (and other licensed) code, but the training set is not being made publicly available, nor are the sources which went into it, as the license requires. A training set is not a brain, or in your brain, or part of your intellect, it's just data. And that data is a derivative work of the copyrighted material it's based on. Publicly available doesn't mean unlicensed. Microsoft is not complying with the licenses of the software the training data is based on.
This is a problem for these models in general, not just Microsoft's. But it's a bigger problem for Microsoft because their tool is spitting out recognizable lines of unique code.
Re: (Score:1)
Microsoft is not complying with the licenses of the software the training data is based on.
Now we care about licenses and copyright?
It would be nice if people would make up their mind. Either licenses and copyright need to be respected or they don't. If they do then penalties apply to everyone who violates said license and/or copyright. You can't make excuses for doing so.
Re: (Score:2)
Now we care about licenses and copyright?
Personally, I care about copyleft, but that's powered by copyright...
It would be nice if people would make up their mind.
I'm only one person, with one account, so I'm not responsible for "people" making up their mind. Also, this issue is more nuanced than you are trying to make it out to be.
Either licenses and copyright need to be respected or they don't.
That's a stupid thing to say, you're reducing a complex issue to a one-line argument.
If they do then penalties apply to everyone who violates said license and/or copyright.
Those penalties are enforced extremely unevenly, I note you didn't say equally. But what would that even mean? Percentages of revenue? What about copyright infringement not for commercial
Re: (Score:2)
Personally, I care about copyleft, but that's powered by copyright...
It's a mishmash of stupid debating tactics: assume you're making a wildly extreme version of some stance and use that to "prove" that you're dumb because it's inconsistent with a different extreme viewpoint you're also assumed to hold.
This one, namely that if you think there's anything at all to criticize in copyright law then you hate all copyright and obviously love the GPL so you clearly hate it, was a popular argument like... 20 years
Re: (Score:2)
That's a stupid thing to say, you're reducing a complex issue to a one-line argument.
It's not a complex issue. Either you (the big you and not you personally) believe licenses and copyright should be respected or you don't. There isn't a middle ground in this case. There have been a few stories recently on here where someone complains the government or business is using their software either without paying them or not respecting the license and everyone is jumping to the defense of the programmer and spo
Re: (Score:3)
Either you (the big you and not you personally)
See, this is the first place we break company. There is no big you when it comes to having an opinion. Everybody has their own, even if they've rented it from someone else it's still effectively theirs.
believe licenses and copyright should be respected or you don't.
No, I think that some licenses shouldn't be legal at all, and aspects of others should be mandatory. I don't have to agree with the existing IP regime, and suggesting that I have to is both wrong, and inherently offensive.
Re: (Score:3)
It would be nice if people would make up their mind.
It'd be even nicer if people stopped ignoring that groups are not some monolith, and nuance means leaning one way on one issue, and another on another issue that is related, based on specifics WHILE remaining logically consistent.
Seriously folks, this is stupid.
Re: (Score:2)
Re: (Score:2)
Nearly all of those licenses require that the license, attribution, or something stay with the code.
Re:Sounds like a stretch (Score:5, Interesting)
And that data is a derivative work of the copyrighted material it's based on.
I'd stay away from that particular legal term here. It doesn't make any sense, as I'll explain.
Publicly available doesn't mean unlicensed. Microsoft is not complying with the licenses of the software the training data is based on.
It's important to remember that the training data is not the model. What is encoded by the model is essentially just a little statistical information about the training data. It doesn't really make sense to talk about copyright or licensing here.
As an example, let's try a simple Markov chain text generator. Why a Markov chain? It doesn't leave any room for mystery. It's intuitive, you can understand everything about its operation, and it's very easy to implement. You can also easily find one online if you're not up for making your own.
Start by feeding it just a few pages of text and take a look at the kind of output you get. If yours was operating on individual characters, you'd be lucky to even get recognizable words! Feed it until you start to get good output, preferably from different authors, and pick out your best examples. If you didn't know anything about how the output was produced, would you say that anything about those violated the copyright of any of the text you included? I'd bet against it. Far too much information about the training data was lost.
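For anyone who wants to try the experiment described above, here is a minimal sketch of a word-level Markov chain text generator. The function names (`build_chain`, `generate`), the `order` parameter, and the toy `corpus` are all illustrative choices, not anything from the comment; swap in your own text to see how the output changes as the input grows.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` words to the words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)

def generate(chain, length=20, seed=None):
    """Walk the chain from a random starting state, emitting up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state has no recorded follower
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)

# Toy corpus, purely for illustration.
corpus = (
    "the cat sat on the mat and the dog sat on the rug "
    "and the cat saw the dog and the dog saw the cat"
)
chain = build_chain(corpus, order=2)
sample = generate(chain, length=12, seed=0)
print(sample)
```

The only thing the chain stores is which word was seen after each pair of words. On a corpus this small the output will still echo the source closely; the loss of information the comment describes depends on feeding it much more text from mixed sources.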
If we can't make the case for our Markov chain, what hope does our NN have? What our NN is encoding isn't obvious, so you might be tempted to say something like "the training data is still in there, we just can't see it". But we don't need to make guesses or lean on our intuition. Just think about what happens when the size of the training data exceeds the size of the model...
GPT-3, which this is based on, has 175 billion parameters. That's a lot, but it was trained on 45TB of data. That's a lot of text per parameter. Both models lose information, but the NN loses quite a bit more. Anything original about any of the works used during training is completely lost.
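The ratio is easy to work out. This is a back-of-the-envelope sketch that takes the comment's two figures at face value; the storage size per parameter (2 bytes, i.e. 16-bit floats) and the use of decimal terabytes are my assumptions, not something stated above.

```python
# Figures quoted in the comment above.
parameters = 175e9        # GPT-3 parameter count
training_bytes = 45e12    # 45 TB of training data (decimal terabytes assumed)

# Assumption, not from the comment: each parameter is stored as a 16-bit float.
bytes_per_parameter = 2

model_bytes = parameters * bytes_per_parameter
ratio = training_bytes / model_bytes

print(f"training bytes per parameter: {training_bytes / parameters:.0f}")
print(f"model size: {model_bytes / 1e9:.0f} GB")
print(f"training data is about {ratio:.0f}x the model size")
```

Under those assumptions that is roughly 257 bytes of training text per parameter, and a training set on the order of a hundred times larger than the model itself.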
But it's a bigger problem for Microsoft because their tool is spitting out recognizable lines of unique code.
Is it? If you have a link, I'd love to see it. Such a thing shouldn't be possible.
Re: (Score:2)
Ah, that's because that particular bit of code was included many, many, times in the training data. Had it been included just once or just a few times, like you'd expect would be the case for most projects, it wouldn't be possible for it to show up like this at all. With a well-known bit of code that has been copied and discussed endlessly, this isn't terribly surprising. It's worth pointing out that this sort of thing could have been prevented. I'm guessing they just didn't anticipate this or didn't think
Re: (Score:2)
Saying that like it's murder. And usually it's just one line or a few words that are similar with another repo here and there. Not whole files, because humans interact with it during the process. It generates 99% original code and that 1% can be filtered in post processing.
Re: (Score:2)
Not murder, just indicative of what's actually happening. This is not just code style or structure, this is code. Just go back to some of the prior discussions on this here, there are examples in 'em
Re: (Score:2)
IANAL, but I see some issues with that: .... A case based on copying existing blocks of code might have some merit, but one based on the notion that AI generated code is a work derived from the training data seems like a stretch.
Yes it's a stretch, but the District Court of East Texas is pretty elastic and has more than a couple of head-scratcher decisions in IP law. Also consider: if Microsoft can make an AI that is able to assist coding, it would be trivial to train the AI to find plagiarized code too.
Re: (Score:2)
If Clippy and StitchFit were allowed to breed... (Score:2)
... except it's mission critical code instead of commas and jeans. Legalities will be dealt with, but... There's still carbon-based testing and QC down the line, right? RIGHT?
Academic textbook companies will love this (Score:2)
Heh, students think their textbooks are expensive now. Wait until they have to pay royalties every time they use their learned knowledge!
Re: (Score:2)
If you copy directly from the textbook and try to make money off the copy, then yeah, you're going to have to pay royalties.
Re: (Score:2)
Next the AI Enhanced Word Processor (Score:2)
You start to type and it suggests whole paragraphs to include in what you are writing as it infers your intent. And by bootstrapping the paragraphs you accept it can keep suggesting paragraphs until you decide you are done.
We are actually at that point now, though I have not seen a writing tool offering that provides exactly this. We do have email and chat services that suggest the next few words, and short responses (as typing "Sounds good" is too taxing for many people's time and creativity it seems) and
Re: (Score:2)
Since it is just a language model trained on papers,
This was inevitable (Score:2)
Easy solution anyway. (Score:2)
Richard Stallman, GitHub's Copilot, 2021 (Score:2)
There are many legal questions about Copilot whose answers I don't know, and maybe nobody knows. And it's likely some of them depend on the country you're in [because of the copyright laws in those countries.] In the U.S. we won't be able to have reliable answers until there are court cases about it, and who knows how many years it'll take for those court cases to arise and be finally decided. So basically what we have is a gigantic amount of uncertainty.
From Slashdot story: Richard Stallman Shares His Concerns About GitHub's Copilot -- and About GitHub
https://news.slashdot.org/stor... [slashdot.org]
Dupe (Score:2)
This story is a repost.
Next up: suing humans who went to school (Score:2)
Re: (Score:1)
Industrial IP Theft (Score:2)