EleutherAI Unveils Massive Licensed Text Dataset for AI Training

August 30, 2025
EleutherAI, a leading AI research group, has launched one of the largest collections of openly licensed and public domain text for AI model training.

Named the Common Pile v0.1, this 8-terabyte dataset was developed over two years with AI startups Poolside, Hugging Face, and various academic institutions. It was used to train two new EleutherAI models, Comma v0.1-1T and Comma v0.1-2T, which the organization claims match the performance of models trained on unlicensed, copyrighted data.

AI firms, including OpenAI, face legal challenges over their use of web-scraped data, including copyrighted books and journals, for model training. While some have licensing deals with content providers, many rely on the U.S. fair use doctrine to justify training on copyrighted material without permission.

EleutherAI argues that these lawsuits have significantly reduced transparency in the AI industry, limiting insight into model functionality and weaknesses, which harms the broader research community.

“Legal challenges haven’t significantly altered data sourcing practices for model training, but they’ve sharply reduced the openness of AI companies,” said Stella Biderman, EleutherAI’s executive director, in a Hugging Face blog post on Friday. “Researchers at some firms we’ve spoken with cite lawsuits as the reason they can’t share their data-centric research.”

The Common Pile v0.1, available on Hugging Face’s AI platform and GitHub, was developed with legal consultation and includes sources like 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also utilized OpenAI’s Whisper model to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T demonstrate the Common Pile v0.1’s quality, enabling developers to create models competitive with proprietary systems. Both models have 7 billion parameters and were trained on a portion of the dataset. According to EleutherAI, they rival Meta’s original Llama model on benchmarks for coding, image understanding, and math.

Parameters, often called weights, are the internal elements of an AI model that shape its behavior and responses.
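
To make the idea of a parameter count concrete, here is a minimal sketch in Python. It assumes a plain fully connected network, which is not how the Comma models are built, and the function name `count_parameters` and the toy layer sizes are hypothetical, chosen only for illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a toy fully connected network.

    layer_sizes lists the width of each layer, input first.
    This is an illustrative assumption, not the Comma architecture.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output unit
    return total


# A toy network with 4 inputs, 8 hidden units, and 2 outputs.
print(count_parameters([4, 8, 2]))  # 58
```

A 7-billion-parameter model has the same kind of learned numbers, only on a vastly larger scale and arranged in a more elaborate architecture.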

“The belief that unlicensed text is essential for high performance is unfounded,” Biderman stated in her post. “As openly licensed and public domain data becomes more accessible, we anticipate significant improvements in models trained on such content.”

The Common Pile v0.1 partly addresses EleutherAI’s past controversies. Years ago, the group released The Pile, an open dataset containing copyrighted material, which drew criticism and legal scrutiny for its use in AI training.

EleutherAI pledges to release open datasets more regularly, collaborating with research and infrastructure partners.

Updated 9:48 a.m. Pacific: Biderman noted on X that EleutherAI contributed to the dataset and model release, with significant involvement from partners like the University of Toronto, which co-led the research.

Comments (2)
NicholasLewis March 10, 2026 at 6:01:03 AM EDT

Finally, quality data for training AI! 😄 But I wonder how this will affect competition between OpenAI and other companies. Maybe we'll soon see smarter models?

RyanLopez February 2, 2026 at 3:00:51 AM EST

Wow, 8 terabytes of legally licensed text is a game-changer! It's fantastic to see more high-quality, transparent data becoming available. This should really help push open-source AI models forward and maybe even challenge some of the big players who rely on murkier data sources. Hopefully, it leads to more reliable and ethically-sound systems. Can't wait to see what gets built on this! 🚀
