EleutherAI Unveils Massive Licensed Text Dataset for AI Training

EleutherAI, a leading AI research group, has launched one of the largest collections of openly licensed and public-domain text for AI model training.
Named the Common Pile v0.1, the 8-terabyte dataset was developed over two years in collaboration with AI startups Poolside and Hugging Face, along with several academic institutions. It was used to train two new EleutherAI models, Comma v0.1-1T and Comma v0.1-2T, which the organization claims match the performance of models trained on unlicensed, copyrighted data.
AI firms, including OpenAI, face legal challenges over their use of web-scraped data, including copyrighted books and journals, for model training. While some have licensing deals with content providers, many rely on the U.S. fair use doctrine to justify training on copyrighted material without permission.
EleutherAI argues that these lawsuits have significantly reduced transparency in the AI industry, limiting insight into model functionality and weaknesses, which harms the broader research community.
“Legal challenges haven’t significantly altered data sourcing practices for model training, but they’ve sharply reduced the openness of AI companies,” said Stella Biderman, EleutherAI’s executive director, in a Hugging Face blog post on Friday. “Researchers at some firms we’ve spoken with cite lawsuits as the reason they can’t share their data-centric research.”
The Common Pile v0.1, available on Hugging Face’s AI platform and GitHub, was developed with legal consultation and includes sources like 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also utilized OpenAI’s Whisper model to transcribe audio content.
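For a sense of what that transcription step can look like in practice, here is a minimal Python sketch using the openly available whisper package. The file name and model size are hypothetical placeholders, and this is not a description of EleutherAI's actual pipeline.

```python
# A minimal sketch of audio transcription with OpenAI's open-source Whisper
# model, the kind of tool EleutherAI says it used to transcribe audio content.
# The file path and checkpoint size below are illustrative placeholders,
# not details of EleutherAI's actual setup.
import whisper

# Load a pretrained checkpoint; "base" is small and fast, while larger
# checkpoints ("medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe one audio file. The returned dict contains the full transcript
# plus per-segment text and timestamps.
result = model.transcribe("public_domain_lecture.mp3")
print(result["text"])
```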
EleutherAI claims Comma v0.1-1T and Comma v0.1-2T demonstrate the Common Pile v0.1’s quality, enabling developers to create models competitive with proprietary systems. Both models have 7 billion parameters, were trained on only a portion of the dataset, and rival Meta’s original Llama model on coding, image-understanding, and math benchmarks.
Parameters, often called weights, are the internal elements of an AI model that shape its behavior and responses.
“The belief that unlicensed text is essential for high performance is unfounded,” Biderman stated in her post. “As openly licensed and public domain data becomes more accessible, we anticipate significant improvements in models trained on such content.”
The Common Pile v0.1 partly addresses EleutherAI’s past controversies. Years ago, the group released The Pile, an open dataset containing copyrighted material, which drew criticism and legal scrutiny for its use in AI training.
EleutherAI pledges to release open datasets more regularly, collaborating with research and infrastructure partners.
Updated 9:48 a.m. Pacific: Biderman noted on X that the dataset and model release was a joint effort, with significant involvement from partners such as the University of Toronto, which co-led the research.