Meta Staff Discussed Using Copyrighted Content for AI Training, Court Filings Reveal

For years, Meta employees have discussed using copyrighted works, obtained through legally questionable means, to train the company's AI models, according to court documents unsealed on Thursday.
The documents were filed in Kadrey v. Meta, one of several AI copyright disputes making their way through the U.S. court system. Meta argues that training its models on IP-protected works, particularly books, is "fair use." The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree.
Earlier filings in the case suggested that Meta CEO Mark Zuckerberg had approved the use of copyrighted content for training and that Meta had stopped negotiating licensing deals with book publishers. The newly unsealed documents, which include internal work chats among Meta staff, provide the most detailed insight yet into how Meta might have used copyrighted data to train its models, including those in the Llama family.
In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta's Llama model research team, talked about training models on works they knew could be legally risky.
"My take is (in the spirit of 'ask forgiveness, not permission'): we should grab the books and let the execs decide," wrote Xavier Martinet, a Meta research engineer, in a February 2023 chat, according to the filings. "That's why they created this gen AI org: so we can take more risks."
Martinet suggested buying e-books at retail prices to build a training set instead of negotiating licensing deals with publishers. When another staffer pointed out the potential legal issues with using unauthorized copyrighted materials, Martinet doubled down, noting that "a gazillion" startups were likely already using pirated books for training.
"I mean, worst case: we find out it's okay, while a gazillion startups just pirated tons of books on BitTorrent," Martinet wrote, according to the filings. "My two cents again: dealing directly with publishers takes forever..."
In the same chat, Kambadur, who mentioned that Meta was negotiating with Scribd and other platforms for licenses, noted that while using "publicly available data" for training would still need approvals, Meta's lawyers were becoming "less conservative" about granting such approvals.
"Yeah, we still need to get licenses or approvals for publicly available data," Kambadur said, according to the filings. "The difference now is we have more money, more lawyers, more business development help, the ability to fast-track and escalate for speed, and the lawyers are being a bit less conservative with approvals."
Talks of Libgen
In another work chat mentioned in the filings, Kambadur discussed the possibility of using Libgen, a "links aggregator" that provides access to copyrighted works from publishers, as an alternative to licensed data sources.
Libgen has faced numerous lawsuits, been ordered to shut down, and been fined tens of millions of dollars for copyright infringement. One of Kambadur's colleagues responded with a screenshot of a Google Search result for Libgen that included the snippet "No, Libgen is not legal."
Some decision-makers at Meta seemed to believe that not using Libgen for model training could seriously impact Meta's competitiveness in the AI race, according to the filings.
In an email to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen "essential to meet SOTA numbers across all categories," meaning that Meta needed it to match state-of-the-art (SOTA) model performance across benchmark categories.
Theakanath also outlined "mitigations" in the email to reduce Meta's legal exposure, such as removing data from Libgen that was "clearly marked as pirated/stolen" and not publicly disclosing the use of Libgen datasets for training. "We would not disclose use of Libgen datasets used to train," Theakanath wrote.
In practice, these mitigations involved searching through Libgen files for words like "stolen" or "pirated," according to the filings.
In a work chat, Kambadur mentioned that Meta's AI team also tuned models to "avoid IP risky prompts," meaning the models were configured to refuse requests such as "reproduce the first three pages of 'Harry Potter and the Sorcerer's Stone'" or "tell me which e-books you were trained on."
The filings also suggest that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit announced in April 2023 that it planned to start charging AI companies for access to data for model training.
In a March 2024 chat, Chaya Nayak, director of product management at Meta's generative AI org, said that Meta leadership was considering "overriding" past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company's models had enough training data.
Nayak implied that Meta's first-party training datasets (Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages) were not sufficient. "We need more data," she wrote.
The plaintiffs in Kadrey v. Meta have amended their complaint several times since filing the case in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest amendment alleges, among other claims, that Meta compared certain pirated books with copyrighted books available for license to decide whether to pursue a licensing agreement with a publisher.
In a sign of how seriously Meta views the legal stakes, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.
Meta did not immediately respond to a request for comment.
Comments (32)
Meta doesn't seem to play by the rules when it comes to copyright. It reminds me of the early days of Napster, except this time it's about AI. If big tech companies simply use whatever they can find, with no regard for artists and authors, where does that lead? 🤔 It is not only unethical, it could also harm the creative industries in the long run. Hopefully the court sends a clear signal here.
Is it legal to use copyrighted content to train AI this way? It seems Meta has been considering questionable methods for years. This news makes me think hard about who really benefits from technological 'progress' 🤔. As a user, I'm worried about these companies' lack of transparency about how they obtain their data.
I was shocked that Meta was using copyrighted content to train AI! 🤯 It's a bit suspicious, but I have to admit their AI is pretty good. I just wish they would find a more ethical way to do it. Still, it's a revelation about how these companies operate.
I was surprised that Meta was using copyrighted content for AI training! 🤯 It's a little shady, but the AI's performance is certainly good. I'd like them to find a more ethical method. Still, this made it clear how these companies do things.
I was surprised that Meta was using copyrighted content to train AI! 🤯 It's a bit murky, but I have to admit their AI is quite good. I wish they would find a more ethical way to do it. Even so, it's a revelation about how these companies operate.












