Meta Staff Discussed Using Copyrighted Content for AI Training, Court Filings Reveal

Release date: April 10, 2025
Author: JosephEvans

For years, Meta employees have internally discussed using copyrighted materials, obtained through legally questionable means, to train the company's AI models, according to court documents that were unsealed on Thursday.

These documents were filed in the ongoing lawsuit Kadrey v. Meta, one of several AI copyright disputes making their way through the U.S. court system. Meta argues that using IP-protected works, especially books, to train its models falls under "fair use." The plaintiffs, including authors Sarah Silverman and Ta-Nehisi Coates, strongly disagree.

Earlier filings in the case suggested that Meta CEO Mark Zuckerberg had approved the use of copyrighted content for training and that Meta had stopped negotiating licensing deals with book publishers. The newly unsealed documents, which include internal work chats among Meta staff, provide the most detailed insight yet into how Meta might have used copyrighted data to train its models, including those in the Llama family.

In one chat, Meta employees, including Melanie Kambadur, a senior manager on Meta's Llama model research team, talked about training models on works they knew could be legally risky.

"My take is (in the spirit of 'ask forgiveness, not permission'): we should grab the books and let the execs decide," wrote Xavier Martinet, a Meta research engineer, in a February 2023 chat, according to the filings. "That's why they created this gen AI org: so we can take more risks."

Martinet suggested buying e-books at retail prices to build a training set instead of negotiating licensing deals with publishers. When another staffer pointed out the potential legal issues with using unauthorized copyrighted materials, Martinet doubled down, noting that "a gazillion" startups were likely already using pirated books for training.

"I mean, worst case: we find out it's okay, while a gazillion startups just pirated tons of books on BitTorrent," Martinet wrote, according to the filings. "My two cents again: dealing directly with publishers takes forever..."

In the same chat, Kambadur, who mentioned that Meta was negotiating with Scribd and other platforms for licenses, noted that while using "publicly available data" for training would still need approvals, Meta's lawyers were becoming "less conservative" about granting such approvals.

"Yeah, we still need to get licenses or approvals for publicly available data," Kambadur said, according to the filings. "The difference now is we have more money, more lawyers, more business development help, the ability to fast-track and escalate for speed, and the lawyers are being a bit less cautious with approvals."

Talks of Libgen

In another work chat mentioned in the filings, Kambadur discussed the possibility of using Libgen, a "links aggregator" that provides access to copyrighted works from publishers, as an alternative to licensed data sources.

Libgen has faced numerous lawsuits, been ordered to shut down, and been fined tens of millions of dollars for copyright infringement. One of Kambadur's colleagues responded with a screenshot of a Google Search result for Libgen that included the snippet "No, Libgen is not legal."

Some decision-makers at Meta seemed to believe that not using Libgen for model training could seriously impact Meta's competitiveness in the AI race, according to the filings.

In an email to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen "essential to meet SOTA numbers across all categories," referring to matching the best, state-of-the-art (SOTA) AI models across benchmark categories.

Theakanath also outlined "mitigations" in the email to reduce Meta's legal exposure, such as removing data from Libgen that was "clearly marked as pirated/stolen" and not publicly disclosing the use of Libgen datasets for training. "We would not disclose use of Libgen datasets used to train," Theakanath wrote.

In practice, these mitigations involved searching through Libgen files for words like "stolen" or "pirated," according to the filings.

In a work chat, Kambadur mentioned that Meta's AI team also adjusted models to "avoid IP risky prompts" — meaning they configured the models to refuse to answer questions like "reproduce the first three pages of 'Harry Potter and the Sorcerer's Stone'" or "tell me which e-books you were trained on."

The filings also suggest that Meta may have scraped Reddit data for some type of model training, possibly by mimicking the behavior of a third-party app called Pushshift. Notably, Reddit announced in April 2023 that it planned to start charging AI companies for access to data for model training.

In a March 2024 chat, Chaya Nayak, director of product management at Meta's generative AI org, said that Meta leadership was considering "overriding" past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company's models had enough training data.

Nayak implied that Meta's first-party training datasets — such as Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain Meta for Business messages — were not sufficient. "We need more data," she wrote.

The plaintiffs in Kadrey v. Meta have amended their complaint several times since filing the case in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. The latest amendment alleges, among other claims, that Meta compared certain pirated books with copyrighted books available for license to decide whether to pursue a licensing agreement with a publisher.

In a sign of how seriously Meta views the legal stakes, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.

Meta did not immediately respond to a request for comment.

Comments (25)
FrankMartínez April 11, 2025 at 2:36:50 AM GMT

So, Meta's been using copyrighted stuff to train their AI? That's shady as hell. No wonder their AI models are so good, but at what cost? Feels wrong to me. They need to clean up their act or face the music. Thoughts?

WilliamYoung April 11, 2025 at 2:36:50 AM GMT

Meta was training its AI on copyrighted works? That's really shady. Maybe that's why its AI models are so good, but at what cost? It feels wrong to me. Meta should change its ways or take responsibility. What do you think?

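HenryJackson April 11, 2025 at 2:36:50 AM GMT

Meta used copyrighted material to train its AI? That's really unlawful. I don't know if that's why its AI models are good, but what is the cost? It feels wrong to me. Meta needs to improve its behavior or take responsibility. What do you think?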

HarryRoberts April 11, 2025 at 2:36:50 AM GMT

So, Meta was using copyrighted material to train its AI? That's very suspicious. No wonder its AI models are so good, but at what cost? It seems wrong to me. They need to correct course or face the consequences. What do you all think?

JoseJackson April 11, 2025 at 2:36:50 AM GMT

So, Meta has been using copyrighted material to train its AI? That's very suspicious. No wonder its AI models are so good, but at what cost? It seems wrong to me. They need to clean up their act or face the consequences. What do you all think?

AlbertHill April 10, 2025 at 7:16:25 PM GMT

So, Meta's been using copyrighted stuff to train their AI? That's pretty shady if you ask me. I mean, I get wanting to improve your AI, but at what cost? This lawsuit might just open a can of worms. Thoughts?
