Meta Staff Discussed Using Copyrighted Content for AI Training, Court Filings Reveal

April 10, 2025

JosephEvans

For years, Meta employees have discussed using copyrighted materials, obtained through legally questionable means, to train the company's AI models, according to court documents that were unsealed on Thursday.

These documents are part of the ongoing lawsuit Kadrey v. Meta, one of several AI copyright disputes making their way through the U.S. court system. Meta argues that using IP-protected works, especially books, to train its models falls under "fair use." The plaintiffs, including authors Sarah Silverman and Ta-Nehisi Coates, strongly disagree.

Earlier filings in the case suggested that Meta CEO Mark Zuckerberg had approved the use of copyrighted content for training and that Meta had stopped negotiating licensing deals with book publishers. The newly unsealed documents, which include internal work chats among Meta staff, provide the most detailed insight yet into how Meta might have used copyrighted data to train its models, including those in the Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta's Llama model research team, talked about training models on works they knew could be legally risky.

"My take is (in the spirit of 'ask forgiveness, not permission'): we should grab the books and let the execs decide," wrote Xavier Martinet, a Meta research engineer, in a February 2023 chat, according to the filings. "That's why they created this gen AI org: so we can take more risks."

Martinet suggested buying e-books at retail prices to build a training set instead of negotiating licensing deals with publishers. When another staffer pointed out the potential legal issues with using unauthorized copyrighted materials, Martinet doubled down, noting that "a gazillion" startups were likely already using pirated books for training.

"I mean, worst case: we find out it's okay, while a gazillion startups just pirated tons of books on BitTorrent," Martinet wrote, according to the filings. "My two cents again: dealing directly with publishers takes forever..."

In the same chat, Kambadur, who mentioned that Meta was negotiating with Scribd and other platforms for licenses, noted that while using "publicly available data" for training would still need approvals, Meta's lawyers were becoming "less conservative" about granting such approvals.

"Yeah, we still need to get licenses or approvals for publicly available data," Kambadur said, according to the filings. "The difference now is we have more money, more lawyers, more business development help, the ability to fast-track and escalate for speed, and the lawyers are being a bit less cautious with approvals."

Talks of Libgen

In another work chat mentioned in the filings, Kambadur discussed the possibility of using Libgen, a "links aggregator" that provides access to copyrighted works from publishers, as an alternative to licensed data sources.

Libgen has faced numerous lawsuits, been ordered to shut down, and been fined tens of millions of dollars for copyright infringement. One of Kambadur's colleagues responded with a screenshot of a Google Search result for Libgen that included the snippet "No, Libgen is not legal."

Some decision-makers at Meta seemed to believe that not using Libgen for model training could seriously impact Meta's competitiveness in the AI race, according to the filings.

In an email to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen "essential to meet SOTA numbers across all categories," referring to state-of-the-art (SOTA) model performance across benchmark categories.

Theakanath also outlined "mitigations" in the email to reduce Meta's legal exposure, such as removing data from Libgen that was "clearly marked as pirated/stolen" and not publicly disclosing the use of Libgen datasets for training. "We would not disclose use of Libgen datasets used to train," Theakanath wrote.

In practice, these mitigations involved searching through Libgen files for words like "stolen" or "pirated," according to the filings.
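A keyword check of the kind described above can be sketched in a few lines. This is a minimal illustration, not Meta's actual code: the word list contains only the two examples named in the filings, and the names `FLAG_WORDS`, `is_flagged`, and `filter_documents` are hypothetical.

```python
import re

# Hypothetical word list: the filings mention only words like these two.
FLAG_WORDS = ("stolen", "pirated")

# Match any flagged word as a whole word, ignoring case.
FLAG_PATTERN = re.compile(r"\b(" + "|".join(FLAG_WORDS) + r")\b", re.IGNORECASE)


def is_flagged(text: str) -> bool:
    """Return True if the text contains any flagged word."""
    return FLAG_PATTERN.search(text) is not None


def filter_documents(docs: dict[str, str]) -> dict[str, str]:
    """Keep only the documents whose text contains no flagged word."""
    return {name: text for name, text in docs.items() if not is_flagged(text)}


# Made-up sample data for illustration only.
sample = {
    "a.txt": "This copy was PIRATED from the publisher.",
    "b.txt": "Chapter one: a quiet morning.",
}
kept = filter_documents(sample)
```

A filter like this removes only files that label themselves with one of the listed words; an unauthorized copy that carries no such marking passes through unchanged.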

In a work chat, Kambadur mentioned that Meta's AI team also adjusted models to "avoid IP risky prompts" — meaning they configured the models to refuse to answer questions like "reproduce the first three pages of 'Harry Potter and the Sorcerer's Stone'" or "tell me which e-books you were trained on."

The filings also suggest that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit announced in April 2023 that it planned to start charging AI companies for access to data for model training.

In a March 2024 chat, Chaya Nayak, director of product management at Meta's generative AI org, said that Meta leadership was considering "overriding" past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company's models had enough training data.

Nayak implied that Meta's first-party training datasets — such as Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages — were not sufficient. "We need more data," she wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since filing the case in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest amendment alleges, among other claims, that Meta compared certain pirated books with copyrighted books available for license to decide whether to pursue a licensing agreement with a publisher.

In a sign of how seriously Meta views the legal stakes, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta did not immediately respond to a request for comment.

Comments (32)

PaulMartínez

May 6, 2026 at 12:00:49 AM EDT

Meta doesn't seem to play by the rules when it comes to copyright. This reminds me of the early days of Napster, only this time it's about AI. If big tech companies simply use everything they can find, with no regard for artists and authors, where does that lead? 🤔 It's not only unethical, it could also harm the creative industry in the long run. Hopefully the court sends a clear signal here.

CharlesYoung

April 5, 2026 at 6:02:04 PM EDT

Is it legal to use copyrighted content to train AI this way? It seems Meta has been considering questionable methods for years. This news makes me think hard about who really benefits from technological 'progress' 🤔. As a user, I'm worried about these companies' lack of transparency about how they obtain their data.

PeterMartinez

April 24, 2025 at 2:59:57 PM EDT

I was shocked that Meta was using copyrighted content to train AI! 🤯 It's a bit suspicious, but I have to admit their AI is pretty good. I just wish they'd find a more ethical way to do it. Still, it's a revelation about how these companies operate.

RalphMitchell

April 23, 2025 at 10:42:41 PM EDT

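I was surprised that Meta was using copyrighted content for AI training! 🤯 It's a bit shady, but the AI's performance is certainly good. I'd like them to find a more ethical way to do it. Still, this made it clear how these companies operate.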

AnthonyPerez

April 21, 2025 at 4:19:31 PM EDT

I was surprised that Meta was using copyrighted content to train AI! 🤯 It's a bit murky, but I have to admit their AI is pretty good. I wish they'd find a more ethical way to do it. Even so, it's a revelation about how these companies operate.

BrianWilliams

April 19, 2025 at 5:15:40 AM EDT

I'm kinda shocked that Meta was using copyrighted content for AI training! 🤯 It's a bit shady, but I gotta admit, their AI is pretty good. Just wish they'd find a more ethical way to do it. Still, it's an eye-opener on how these companies operate.