Meta Staff Discussed Using Copyrighted Content for AI Training, Court Filings Reveal

For years, Meta employees have discussed using copyrighted materials, obtained through legally questionable means, to train the company's AI models, according to court documents that were unsealed on Thursday.
The documents were filed in the ongoing lawsuit Kadrey v. Meta, one of several AI copyright disputes making their way through the U.S. court system. Meta argues that using IP-protected works, especially books, to train its models falls under "fair use." The plaintiffs, including authors Sarah Silverman and Ta-Nehisi Coates, strongly disagree.
Earlier filings in the case suggested that Meta CEO Mark Zuckerberg had approved the use of copyrighted content for training and that Meta had stopped negotiating licensing deals with book publishers. The newly unsealed documents, which include internal work chats among Meta staff, provide the most detailed insight yet into how Meta might have used copyrighted data to train its models, including those in the Llama family.
In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta's Llama model research team, talked about training models on works they knew could be legally risky.
"My take is (in the spirit of 'ask forgiveness, not permission'): we should grab the books and let the execs decide," wrote Xavier Martinet, a Meta research engineer, in a February 2023 chat, according to the filings. "That's why they created this gen AI org: so we can take more risks."
Martinet suggested buying e-books at retail prices to build a training set instead of negotiating licensing deals with publishers. When another staffer pointed out the potential legal issues with using unauthorized copyrighted materials, Martinet doubled down, noting that "a gazillion" startups were likely already using pirated books for training.
"I mean, worst case: we find out it's okay, while a gazillion startups just pirated tons of books on BitTorrent," Martinet wrote, according to the filings. "My two cents again: dealing directly with publishers takes forever..."
In the same chat, Kambadur, who mentioned that Meta was negotiating with Scribd and other platforms for licenses, noted that while using "publicly available data" for training would still need approvals, Meta's lawyers were becoming "less conservative" about granting such approvals.
"Yeah, we still need to get licenses or approvals for publicly available data," Kambadur said, according to the filings. "The difference now is we have more money, more lawyers, more business development help, the ability to fast-track and escalate for speed, and the lawyers are being a bit less cautious with approvals."
Talks of Libgen
In another work chat mentioned in the filings, Kambadur discussed the possibility of using Libgen, a "links aggregator" that provides access to copyrighted works from publishers, as an alternative to licensed data sources.
Libgen has faced numerous lawsuits, been ordered to shut down, and been fined tens of millions of dollars for copyright infringement. One of Kambadur's colleagues responded with a screenshot of a Google Search result for Libgen that included the snippet "No, Libgen is not legal."
Some decision-makers at Meta seemed to believe that not using Libgen for model training could seriously impact Meta's competitiveness in the AI race, according to the filings.
In an email to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen "essential to meet SOTA numbers across all categories," referring to state-of-the-art (SOTA) model performance across benchmark categories.
Theakanath also outlined "mitigations" in the email to reduce Meta's legal exposure, such as removing data from Libgen that was "clearly marked as pirated/stolen" and not publicly disclosing the use of Libgen datasets for training. "We would not disclose use of Libgen datasets used to train," Theakanath wrote.
In practice, these mitigations involved searching through Libgen files for words like "stolen" or "pirated," according to the filings.
In a work chat, Kambadur mentioned that Meta's AI team also adjusted models to "avoid IP risky prompts" — meaning they configured the models to refuse to answer questions like "reproduce the first three pages of 'Harry Potter and the Sorcerer's Stone'" or "tell me which e-books you were trained on."
The filings also suggest that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit announced in April 2023 that it planned to start charging AI companies for access to data for model training.
In a March 2024 chat, Chaya Nayak, director of product management at Meta's generative AI org, said that Meta leadership was considering "overriding" past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company's models had enough training data.
Nayak implied that Meta's first-party training datasets — such as Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages — were not sufficient. "We need more data," she wrote.
The plaintiffs in Kadrey v. Meta have amended their complaint several times since filing the case in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. Among other claims, the latest amendment alleges that Meta compared certain pirated books with copyrighted books available for license to decide whether to pursue a licensing agreement with a publisher.
In a sign of how seriously Meta views the legal stakes, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.
Meta did not immediately respond to a request for comment.
Comments
PeterMartinez
April 24, 2025 at 2:59:57 PM EDT
I was shocked that Meta was using copyrighted content to train AI! 🤯 It's a bit suspicious, but I have to admit their AI is pretty good. I just wish they'd find a more ethical way to do it. Still, it's a revelation about how these companies operate.
RalphMitchell
April 23, 2025 at 10:42:41 PM EDT
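I was surprised that Meta was using copyrighted content to train its AI! 🤯 It's a bit shady, but the AI's performance certainly is good. I'd like them to find a more ethical approach. Still, this made it clear how these companies do things.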
AnthonyPerez
April 21, 2025 at 4:19:31 PM EDT
I was surprised that Meta was using copyrighted content to train AI! 🤯 It's a bit murky, but I must admit its AI is quite good. I wish they'd find a more ethical way to do it. Even so, it's a revelation about how these companies operate.
BrianWilliams
April 19, 2025 at 5:15:40 AM EDT
I'm kinda shocked that Meta was using copyrighted content for AI training! 🤯 It's a bit shady, but I gotta admit, their AI is pretty good. Just wish they'd find a more ethical way to do it. Still, it's an eye-opener on how these companies operate.
StevenAllen
April 19, 2025 at 4:39:52 AM EDT
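It's shocking that Meta used copyrighted content for AI training! 🤯 It's somewhat unethical, but the AI's performance really is good. I hope they find a more ethical way. Still, learning how these companies operate was an eye-opener.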
CharlesWhite
April 12, 2025 at 9:05:28 AM EDT
It's a bit suspicious that Meta has been using copyrighted material to train its AI. It's a bit disappointing, honestly. I understand that they want to improve their technology, but maybe they should find a more ethical way to do it. It seems like a shortcut that could backfire.