AI Lab PleIAs Releases Fully Open Dataset, as AMD, Ai2 Release Open AI Models (huggingface.co)

Posted by EditorDavid on Saturday November 16, 2024 @12:34PM from the model-citizens dept.

French private AI lab PleIAs "is committed to training LLMs in the open," they write in a blog post at Mozilla.org. "This means not only releasing our models but also being open about every aspect, from the training data to the training code. We define 'open' strictly: all data must be both accessible and under permissive licenses."

Wednesday PleIAs announced they were releasing the largest open multilingual pretraining dataset, according to their blog post at HuggingFace:

Many have claimed that training large language models requires copyrighted data, making truly open AI development impossible. Today, Pleias is proving otherwise with the release of Common Corpus (part of the AI Alliance Open Trusted Data Initiative) — the largest fully open multilingual dataset for training LLMs, containing over 2 trillion tokens of permissibly licensed content with provenance information (2,003,039,184,047 tokens).

As developers are responding to pressures from new regulations like the EU AI Act, Common Corpus goes beyond compliance by making our entire permissibly licensed dataset freely available on HuggingFace, with detailed documentation of every data source. We have taken extensive steps to ensure that the dataset is high-quality and is curated to train powerful models. Through this release, we are demonstrating that there doesn't have to be such a [heavy] trade-off between openness and performance.

Common Corpus is:

— Truly Open: contains only data that is permissively licensed and provenance is documented

— Multilingual: mostly representing English and French data, but contains at least 1B tokens for over 30 languages

— Diverse: consisting of scientific articles, government and legal documents, code, and cultural heritage data, including books and newspapers

— Extensively Curated: spelling and formatting has been corrected from digitized texts, harmful and toxic content has been removed, and content with low educational content has also been removed.

Common corpus builds on a growing ecosystem of large, open datasets, such as Dolma, FineWeb, RefinedWeb. The Common Pile currently in preparation under the coordination of Eleuther is built around the same principle of using permissible content in English language and, unsurprisingly, there were many opportunities for collaborations and shared efforts. But even together, these datasets do not provide enough training data for models much larger than a few billion parameters. So in order to expand the options for open model training, we still need more open data...

Based on an analysis of 1 million user interactions with ChatGPT, the plurality of user requests are for creative compositions... The kind of content we actually need — like creative writing — is usually tied up in copyright restrictions. Common Corpus tackles these challenges through five carefully curated collections...

Last week AMD also released its first series of fully open 1 billion parameter language models, AMD OLMo.

And last month VentureBeat reported that the non-profit Allen Institute for AI had unveiled Molmo, "an open-source family of state-of-the-art multimodal AI models which outperform top proprietary rivals including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 on several third-party benchmarks."

Typography (Score:1)

by bill_mcgonigle ( 4333 ) * writes:

I can't tell if they're lower-case L's, capital i's or pipes.
That's great for a vanity license plate but not great for online success.
Best of luck but fix the name.
- Re: (Score:2)
  
  by test321 ( 8891681 ) writes:
  
  I personally set FF to disregard CSS and display all websites with my font of choice. I use Adobe Source Serif Caption (demo sheet: https://adobe-fonts.github.io/... [github.io]); it is very explicit about the difference between Il|, and oO0 is also clear.

Sam Altman says (Score:2)

by Growlley ( 6732614 ) writes:

why do you hate me being a billionaire?

That may be the only approach with a future (Score:2)

by gweihir ( 88907 ) writes:

All the other datasets are basically based on piracy. They currently hide that by keeping them secret, but that will not last.
