option
Home
News
Study: OpenAI Models Memorized Copyrighted Content

Study: OpenAI Models Memorized Copyrighted Content

April 10, 2025
265

A recent study suggests that OpenAI may have indeed used copyrighted material to train some of its AI models, adding fuel to the ongoing legal battles the company faces. Authors, programmers, and other content creators have accused OpenAI of using their works—such as books and code—without permission to develop its AI models. While OpenAI has defended itself by claiming fair use, the plaintiffs argue that U.S. copyright law doesn't provide an exception for training data.

The study, a collaboration between researchers from the University of Washington, the University of Copenhagen, and Stanford, introduces a new technique for detecting "memorized" training data in models accessed through an API, like those from OpenAI. AI models essentially learn from vast amounts of data to recognize patterns, enabling them to create essays, images, and more. Although most outputs aren't direct copies of the training data, some inevitably are due to the learning process. For instance, image models have been known to reproduce movie screenshots, while language models have been caught essentially plagiarizing news articles.

The method described in the study focuses on "high-surprisal" words—words that are unusual in a given context. For example, in the sentence "Jack and I sat perfectly still with the radar humming," "radar" would be a high-surprisal word because it's less expected than words like "engine" or "radio" to precede "humming."

The researchers tested several OpenAI models, including GPT-4 and GPT-3.5, by removing high-surprisal words from excerpts of fiction books and New York Times articles and asking the models to predict these missing words. If the models accurately guessed the words, it suggested they had memorized the text during training.

OpenAI copyright study

An example of having a model “guess” a high-surprisal word.Image Credits:OpenAI
The results indicated that GPT-4 had likely memorized parts of popular fiction books, including those in the BookMIA dataset of copyrighted ebooks. It also appeared to have memorized some New York Times articles, though at a lower frequency.

Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized to TechCrunch that these findings highlight the "contentious data" that might have been used to train these models. "In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander stated. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."

OpenAI has pushed for more relaxed rules on using copyrighted data to develop AI models. Although the company has some content licensing agreements and offers opt-out options for copyright holders, it has lobbied various governments to establish "fair use" rules specifically for AI training.

Related article
OpenAI outlines AI economy with public wealth funds, robot taxes, and four-day week OpenAI outlines AI economy with public wealth funds, robot taxes, and four-day week As governments struggle to manage the economic impact of superintelligent machines, OpenAI has released a set of policy proposals outlining how wealth and work could be reshaped in an "intelligence age." The ideas blend traditional left-leaning mecha
Greg Brockman reveals how Elon Musk departed OpenAI Greg Brockman reveals how Elon Musk departed OpenAI In late August 2017, key figures at OpenAI—then a small nonprofit research lab—met to discuss how they would establish a for-profit entity to commercialize their technology and raise the capital needed to achieve AGI.Elon Musk was demanding full cont
Pentagon signs deals with Nvidia, Microsoft, AWS to deploy AI on classified networks Pentagon signs deals with Nvidia, Microsoft, AWS to deploy AI on classified networks After previously reaching agreements with Google, SpaceX, and OpenAI, the U.S. Defense Department announced Friday that it has now signed deals with Nvidia, Microsoft, Amazon Web Services, and Reflection AI to deploy their AI technologies and models
Related Special Topic Recommendations
Productivity AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels
AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels

Discover the 2026 best AI personal wellness and focus coaches on XIX.AI. Our curated rankings feature top-rated, game-changing tools to manage burnout and boost mental energy. Compare free vs paid options with real-world insights. Unlock your path to peak productivity and well-being today.

10 tools
xix.ai
chatbot Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities
Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities

Discover the 2026 latest top-rated AI romantic chatbots for building genuine, long-term connections. Our curated list features powerful, consistent personalities, free vs paid comparisons, and real-world tests. Find your perfect companion and start building today at XIX.AI.

10 tools
xix.ai
Education and Learning Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows
Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows

Discover the 2026 best AI data science mentors to master SQL, Pandas & ML workflows. Explore our top-rated, curated selection at XIX.AI for powerful, game-changing guidance. Compare free vs paid options with real-world insights. Unlock your data science mastery today.

10 tools
xix.ai
chatbot Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time
Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time

Discover the 2026 best AI flirting and conversation trainers on XIX.AI. Our curated, top-rated selection helps you build social charisma and confidence in real-time. Explore must-try, game-changing tools with free vs paid comparisons and weekly updated rankings. Unlock your social edge today.

10 tools
xix.ai
code Best AI Tools for Automated Unit Testing: Generate Jest, PyTest & JUnit Test Cases in One Click
Best AI Tools for Automated Unit Testing: Generate Jest, PyTest & JUnit Test Cases in One Click

Discover the 2026 latest top-rated AI tools for automated unit testing. Our curated selection features powerful, game-changing solutions to generate Jest, PyTest & JUnit test cases instantly. Compare free vs paid options with real-world tests and weekly updated rankings on XIX.AI. Unlock your AI edge and boost development productivity today.

10 tools
xix.ai
Data Analysis Best AI Data Visualization Tools: Auto-Generate Interactive BI Dashboards from Raw Files
Best AI Data Visualization Tools: Auto-Generate Interactive BI Dashboards from Raw Files

Discover the 2026 best AI data visualization tools at XIX.AI. Our curated, top-rated selection helps you auto-generate powerful, interactive BI dashboards from raw files instantly. Compare free vs paid options with real-world tests and weekly updated rankings. Unlock your data's potential today.

10 tools
xix.ai
Comments (33)
0/500
JackAllen
JackAllen December 30, 2025 at 9:30:40 AM EST

这篇文章提到的版权问题确实让人担忧,以后AI生成的内容会不会都带着'侵权'的标签?想想就觉得挺讽刺的,毕竟这些模型训练数据不透明,普通用户根本不知道输出里夹带了什么'私货'。希望有更严格的管理办法吧。

WilliamGonzalez
WilliamGonzalez August 25, 2025 at 5:01:06 AM EDT

This is wild! OpenAI might’ve gobbled up copyrighted stuff to train their models? I’m not shocked, but it’s kinda shady. Hope those authors and coders get some justice! 😤

GregoryBaker
GregoryBaker August 23, 2025 at 7:01:18 AM EDT

This is wild! OpenAI might've trained their models on copyrighted stuff? 😳 I wonder how many books and code snippets got swept up in that data vacuum. Ethics in AI is such a messy topic right now.

JohnGarcia
JohnGarcia April 23, 2025 at 11:10:14 AM EDT

Me sorprendió un poco que OpenAI podría haber usado material con derechos de autor para entrenar sus modelos. Es un poco decepcionante, pero supongo que es el salvaje oeste allá en el mundo de la IA. 🤔 ¿Quizás deberían ser más cuidadosos la próxima vez?

TimothyMitchell
TimothyMitchell April 21, 2025 at 8:12:42 PM EDT

OpenAIが著作権付きの資料を使ってAIを訓練しているという研究は本当に驚きですね!クリエイターにとっては残念ですが、AIの訓練方法について知るのは面白いです。もっと透明性が必要かもしれませんね?🤔

WillLopez
WillLopez April 21, 2025 at 7:49:05 AM EDT

오픈AI가 저작권 있는 자료를 사용해 AI를 훈련했다는 연구는 정말 충격적이에요! 창작자들에게는 안타까운 일이지만, AI가 어떻게 훈련되는지 아는 건 흥미로워요. 오픈AI가 더 투명해져야 할까요? 🤔

OR