New 'Open Source AI Definition' Criticized for Not Opening Training Data (slashdot.org) 38
Long-time Slashdot reader samj — also a long-time Debian developer — tells us there's some opposition to the newly-released Open Source AI definition. He calls it a "fork" that undermines the original Open Source definition (which was originally derived from Debian's Free Software Guidelines, written primarily by Bruce Perens), and points us to a new domain with a petition declaring that instead Open Source shall be defined "solely by the Open Source Definition version 1.9. Any amendments or new definitions shall only be recognized with clear community consensus via an open and transparent process."
This move follows some discussion on the Debian mailing list: Allowing "Open Source AI" to hide their training data is nothing but setting up a "data barrier" protecting the monopoly, disabling anybody other than the first party to reproduce or replicate an AI. Once passed, OSI is making a historical mistake towards the FOSS ecosystem.
They're not the only ones worried about data. This week TechCrunch noted an August study which "found that many 'open source' models are basically open source in name only. The data required to train the models is kept secret, the compute power needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex. Instead of democratizing AI, these 'open source' projects tend to entrench and expand centralized power, the study's authors concluded."
samj shares the concern about training data, arguing that training data is the source code and that this new definition has real-world consequences. (On a personal note, he says it "poses an existential threat to our pAI-OS project at the non-profit Kwaai Open Source Lab I volunteer at, so we've been very active in pushing back past few weeks.")
He also solicited a detailed response by asking ChatGPT: what would be the implications of Debian disavowing the OSI's Open Source AI definition? ChatGPT composed a 7-point, 14-paragraph response, concluding that this level of opposition would "create challenges for AI developers regarding licensing. It might also lead to a fragmentation of the open-source community into factions with differing views on how AI should be governed under open-source rules." But "Ultimately, it could spur the creation of alternative definitions or movements aimed at maintaining stricter adherence to the traditional tenets of software freedom in the AI age."
However, the official FAQ for the new Open Source AI definition argues that training data "does not equate to a software source code." Training data is important to study modern machine learning systems. But it is not what AI researchers and practitioners necessarily use as part of the preferred form for making modifications to a trained model.... [F]orks could include removing non-public or non-open data from the training dataset, in order to train a new Open Source AI system on fully public or open data...
[W]e want Open Source AI to exist also in fields where data cannot be legally shared, for example medical AI. Laws that permit training on data often limit the resharing of that same data to protect copyright or other interests. Privacy rules also give a person the rightful ability to control their most sensitive information — like decisions about their health. Similarly, much of the world's Indigenous knowledge is protected through mechanisms that are not compatible with later-developed frameworks for rights exclusivity and sharing.
Read on for the rest of their response...
"There are also many cases where terms of use of publicly-available data may give entity A the confidence that they may use it freely and call it "open data", but not give entity A the confidence they can give entity B guarantees in a different jurisdiction. Meanwhile, entity B may or may not feel confident to use that data in their jurisdiction. An example is so-called public domain data, where the definition of public domain varies country-by-country. Another example is fair-use or private data where the finding of fair use or privacy laws may require a good knowledge of the law of a given jurisdiction. This resharing is not so much limited as lacking legal certainty...
"Some people believe that full unfettered access to all training data (with no distinction of its kind) is paramount, arguing that anything less would compromise full reproducibility of AI systems, transparency and security. This approach would relegate Open Source AI to a niche of AI trainable only on open data... That niche would be tiny, even relative to the niche occupied by Open Source in the traditional software ecosystem. The requirements of Data Information keep the same approach present in the Open Source Definition that doesn't mandate full reproducibility and transparency but enables them (i.e. reproducible builds). At the same time, setting a baseline requiring Data Information doesn't preclude others from formulating and demanding more requirements, like the Digital Public Goods Standard or the Free Systems Distribution Guidelines add requirements to the Open Source Definition.
"One of the key aspects of OSI's mission is to drive and promote Open Source innovation. The approach OSI takes here enables full user choice with Open Source AI. Users can keep the insights derived from training+data pre-processing code and description of unshareable training data and build upon those with their own unshareable data and give the insights derived from further training to everyone, allowing for Open Source AI in areas like healthcare. Or users can obtain the available and public data from the Data Information and retrain their model without any unshareable data resulting in more data transparency in the resulting AI system. Just like with copyleft and permissive licensing, this approach leaves the choice with the user...
"This approach both advances openness in all the components of the AI system and drives more Open Source AI, i.e. in private-first areas such as healthcare."
Levels of Open (Score:3)
I of course asked ChatGPT what key components make up the creation and use of a model:
Training Data, Preprocessing and Data Pipeline, Training Configuration, Training Script, Model Checkpoints, Base Model (if applicable), Fine-tuning / Specialized Training, Trained Model, Inference Code, Deployment Pipeline, Evaluation and Testing Metrics, Post-processing.
I would say Training Data has been the most controversial aspect of AI creation, followed by the censoring that may take place in a handful of the steps, from Training Data through Post-Processing.
I understand that Training is the most "expensive" part of the process, and at best most of us can really only make use of the end model.
It would be awesome to have access to all the training data, adjust it how I would want to, tune the configuration for the training, and generate my own model; and I would say to be labeled a full open AI then all these things should be available. However, I would be happy to at minimum have the model and control of the inference code settings and post processing and still say it is sort of Open AI.
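The tiers this commenter describes can be sketched as a quick checklist. This is a hypothetical illustration only (the component names follow the ChatGPT list above, and the tier labels are the commenter's informal framing, not any official OSI scheme):

```python
# Hypothetical sketch: classify how "open" an AI release is by which
# pipeline components are published. Component names follow the list
# above; the tiers are this comment's informal framing, not OSI's.
FULL_OPEN = {
    "training_data", "preprocessing", "training_config", "training_script",
    "checkpoints", "trained_model", "inference_code", "post_processing",
}
MINIMUM_OPEN = {"trained_model", "inference_code", "post_processing"}

def openness_tier(released):
    """Return an informal openness tier for a set of released components."""
    if FULL_OPEN <= released:
        return "fully open"        # everything needed to retrain from scratch
    if MINIMUM_OPEN <= released:
        return "sort of open"      # usable and tunable, but not reproducible
    return "open in name only"

# A typical "open weights" release: model plus inference code, no data.
print(openness_tier({"trained_model", "inference_code", "post_processing"}))
```

On this framing, most current "open" models would land in the middle tier at best.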
I am not a believer in copyright or censorship; but I think I understand why they exist and how we can live without them. So I am not afraid of the negatives of open data sets or the results of unrestrained AI; I think it will be great, and mildly painful as we adjust to the new better reality!
Re: (Score:3)
Well, I finally took the time to skim the summary....
It really feels like the word "Open" is lost, at least for AI; given OpenAI is not fully Open from what I understand.
If the Open Source Initiative (OSI), "a long-running institution aiming to define and “steward” all things open source", does not "properly" define Open AI as having an open data set, then perhaps it is time to move beyond "Open" and cut the legs out from underneath a seemingly corrupted organization.
I like the word Libre, how a
Re: Levels of Open (Score:3)
Re: (Score:2)
::slow clap:: (Score:5, Interesting)
Re: (Score:2)
The cloud was marketed as the solution for scaling up hardware, and also for outsourcing IT expertise. The end result was that the businesses became sharecroppers, and effectively no longer own the physical means of production for their ideas.
Now AI is being marketed as the solution for business intelligence. The end result will be that businesses no longer know how to generate their own ideas independently. The lucky ones will become pointless franchises.
The dirty secret of LLMs is the training data (Score:5, Insightful)
Most of it was scraped illegally without consent. Parts are illegal to even possess or continue to store after training. Parts of it would cause _huge_ liability issues if the ones it was stolen from find out. Most of it is crap.
There is really no surprise nobody wants to make their training data public.
Re: The dirty secret of LLMs is the training data (Score:2)
Re: (Score:2)
Scraped illegally? From a PUBLIC website?
Public means public.
If they bypassed security to access a private site, that would be illegal.
Re: (Score:2)
Public does _not_ mean "take it, sell it, do whatever you want with it". Are you a moron?
Re: (Score:2)
Before I ramble on, I am curious, gweihir: do you condemn archive.org, the Internet Archive Wayback Machine, in the same way?
It seems they more than anyone "take it, sell it, do whatever [they] want with it"; and I think they are great and do not want them to stop!
I have never been convinced that large language models violate copyright.
I would say that "public" is at best only restricted to large exact copies, and maybe only then an attribution is needed to be kosher.
I learned to write essentially fro
Re: (Score:3)
Before I ramble on, I am curious, gweihir: do you condemn archive.org, the Internet Archive Wayback Machine, in the same way?
Stop putting up strawmen. The Internet Archive falls under the "search" exception, which is in place because consent to be found can be assumed when things are placed online. The archive is a bit of a border case, but still covered by the law. Yes, there is a law in place to allow search engines and archives, and they would be illegal otherwise. I do suspect that the Internet Archive would be illegal if they put up a paywall, though. And I do suspect that if a search engine put up that paywall, they would not
Re: (Score:2)
I appreciate your response.
It is interesting to assert that archive.org is a strawman, and I am more convinced than ever it is not; at minimum it helps narrow my understanding of your initial post.
Perhaps it is because I mostly use ChatGPT in place of Google now, it is simply a search engine to me with a better interface.
In addition, the Internet Archive is the most extreme example: it serves up entire web sites in whole; Google no longer does that (I miss the cached results), only in part, and ChatGPT and th
Re: (Score:2)
Most content available on public websites is under an "all rights reserved" licence. Even images on Wikipedia sometimes have a non-free licence, so you are not supposed to reuse them for commercial purposes (most LLMs are commercial).
Re: (Score:2)
Re: (Score:2)
The law understands the difference between a person and a machine, even if you do not.
Re: The dirty secret of LLMs is the training data (Score:3)
Actually, the law doesn't. Not with respect to training data.
Re:LOL wut? (Score:2)
Scraped illegally? From a PUBLIC website? Public means public. If they bypassed security to access a private site, that would be illegal.
Just say you don't understand how property rights work. You know, every time you walk past your neighbor's house and see something lying there in the easement between the sidewalk and his lawn, you just take it, because "IT'S NOT ON HIS PROPERTY!" okies.
Re: (Score:2)
Re: (Score:2)
The scuttlebutt is they didn't just scrape publicly available websites. They straight-up pirated huge volumes of books and built a billion-dollar product out of it.
Some modern AIs are built off of the equivalent of the PirateBay's "Library of Alexandria clean ebooks no DRM no dupes clean 1TB.torrent", and the people that did it are desperate to hide that fact.
Re: (Score:3)
That's not true, see for example sigma.ai [sigma.ai] or kaggle [kaggle.com].
How useful these data are compared to closed data (that possibly were scraped illegally) is a different matter entirely.
Re: (Score:3)
These are not training datasets for general LLMs. Obviously, there are public datasets. Also, if you have a look at the datasets sigma.ai links (!), you will find that there are various usage limitations.
Re: (Score:2)
There is really no surprise nobody wants to make their training data public.
But the OSI should have made two levels of licences (or more) like Creative Commons made (CC0, BY, NC, SA). At least academics are interested in benchmarking their developments against a number of specific datasets. It could be "the images uploaded to Wikimedia Commons by Nov 1 2024" or "the proceedings of the EU Parliament in its 24 human-generated translations, between two dates". Or it could point at a specific folder with terabytes of data that other academics would back up. Even though these could be c
Re: (Score:2)
Most of it was scraped illegally without consent.. Parts are illegal to even possess or continue to store after training. Parts of it would cause _huge_ liability issues if the ones it was stolen from find out. Most of it is crap.
There is really no surprise nobody wants to make their training data public.
Add to that the massaging they do to match whichever direction the creators lean on various topics.
Re: (Score:2)
Most of it was scraped illegally without consent.
I mean, isn't whether it is illegal or not determined in courts, and on a case by case basis? If that is correct, and the necessary cases have not yet produced some kind of outcome, how can you or I say whether it was either illegal, OR legal?
Re: (Score:2)
In this case, it is illegal until a law is made that makes it legal.
Re: (Score:2)
There are projects like LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs [arxiv.org] looking to solve this problem by creating LLMs from open data. These are helped by a meaningful Open Source Definition [opensourcedefinition.org], and hurt by a weak one.
A quarter century ago nobody wanted to make their source code public either; remember Ballmer: Linux is a cancer [theregister.com]?
If you want to help ensure a future where Open Source AI models are plentiful rather than drowned out by commercial black boxes, the
!Intelligent (Score:3)
Open weights (Score:3)
That's fine. Just share it under a different moniker not diluting "open source". For instance, "open weights" seems to already be in use by quite a few people, and feels fairly descriptive of the actual situation.
That's fine. Similarly, both open source and closed source software have been co-existing for decades and both seem to have their respective audiences. But if you need to use a lot of terms like "permit", "limit" and "protect" to make your case, it's not really "open".
Re: (Score:2)
Open source should probably refer to the actual source code. Absolutely make up other names for other situations.
If someone writes a program that does spline interpolation and distributes the source, but uses hardcoded spline coefficients, is that not open source? Or a marching cubes implementation with its lookup table? Those coefficients are the same (actually the same, not just analogous) to the weights in a deep learning model.
Software that depends on difficult to understand and reproduce ancillary data
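The spline analogy above can be made concrete. A minimal sketch (the coefficients are arbitrary illustration values, standing in for weights whose derivation is not shared):

```python
# Minimal sketch of the analogy above: a program whose source is fully
# published but which ships hardcoded numeric coefficients. The cubic
# coefficients here are made up for illustration, analogous to model
# weights distributed without the data that produced them.
COEFFS = (0.5, -1.2, 3.0, 0.25)  # a, b, c, d for a*x^3 + b*x^2 + c*x + d

def spline_segment(x):
    """Evaluate one cubic segment using the hardcoded coefficients."""
    a, b, c, d = COEFFS
    # Horner's method: fewer multiplications, same polynomial.
    return ((a * x + b) * x + c) * x + d

print(spline_segment(0.0))  # → 0.25: only the constant term survives at x=0
```

Nobody would deny this program is open source, even though the reader cannot reproduce how `COEFFS` was derived. That is the parallel the comment draws to model weights.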
When "open" doesn't really mean "open" (Score:3)
First: for the most part, the engines aren't all that important. Anyone with a modest understanding of language models, neural networks, inference engines, etc. can write their own, and many people are doing just that. Some of these engines are better than others, but not markedly so -- and, over time, just as open source operating systems, compilers, office suites, etc. approached and then exceeded the quality of their non-open source counterparts, we can reasonably expect that open-source engines will do the same. These companies know this.
Second: what distinguishes the commercial AI/LLM offerings -- besides their computing capacity, which is not part of this discussion -- is their training data. What data was used and how it was used are both essential to the product, and unless both of those things are fully provided, then the product isn't open source. These companies are of course resisting this partly because they believe that their choices of data and methods give them an advantage (and it might, for at least a while) but mostly because they know that some/most/all of the data they're using is being used in violation of (variously) TOS, licenses, and copyright.
They know what they've done is wrong. They know what they've done is illegal. And they don't want to be held accountable for it, because accountability is for the little people, not for billionaires and their pet corporations. (Let's note that these are the same companies that would sue any of us into oblivion if we got our hands on any of their private/closed source code and published it.) Like fossil fuel companies trying to greenwash their way out of responsibility for their destructive actions, these companies are trying to open-source wash their way out of culpability.
This effort must be soundly rejected, not just because it's hypocrisy writ large, but because it's a major step toward turning "open source" into a term that doesn't actually mean "open source". Every person responsible for this hypocrisy, this farcical "Open Source AI Definition", must be removed and banned for life from any further participation in any role defining "open source".
Re: (Score:2)
...but because it's a major step toward turning "open source" into a term that doesn't actually mean "open source"
So what?
Maybe it is time to retire the term now that the corporate world has latched on, embraced, and extended it. Should we not consider a new term that encompasses both the software and the essential data embedded in it? Is it not time for a new modern open source initiative based in this century?
Can't Open What's Not Yours (Score:2)