Study: OpenAI Models Memorized Copyrighted Content
April 10, 2025
Ronald Hernández
A recent study suggests that OpenAI may have indeed used copyrighted material to train some of its AI models, adding fuel to the ongoing legal battles the company faces. Authors, programmers, and other content creators have accused OpenAI of using their works—such as books and code—without permission to develop its AI models. While OpenAI has defended itself by claiming fair use, the plaintiffs argue that U.S. copyright law doesn't provide an exception for training data.
The study, a collaboration between researchers from the University of Washington, the University of Copenhagen, and Stanford, introduces a new technique for detecting "memorized" training data in models accessed through an API, like those from OpenAI. AI models essentially learn from vast amounts of data to recognize patterns, enabling them to create essays, images, and more. Although most outputs aren't direct copies of the training data, some inevitably are due to the learning process. For instance, image models have been known to reproduce movie screenshots, while language models have been caught essentially plagiarizing news articles.
The method described in the study focuses on "high-surprisal" words—words that are unusual in a given context. For example, in the sentence "Jack and I sat perfectly still with the radar humming," "radar" would be a high-surprisal word because it's less expected than words like "engine" or "radio" to precede "humming."
The researchers tested several OpenAI models, including GPT-4 and GPT-3.5, by removing high-surprisal words from excerpts of fiction books and New York Times articles and asking the models to predict these missing words. If the models accurately guessed the words, it suggested they had memorized the text during training.
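The probe logic can be sketched in a few lines of Python. The per-word probabilities below are invented for illustration; in the study itself they would come from a language model's token probabilities. A word's "surprisal" is simply the negative log of its probability, so rarer-in-context words score higher:

```python
import math

def surprisal(p):
    """Surprisal in bits: the less probable a word, the higher the score."""
    return -math.log2(p)

def pick_high_surprisal(words, probs):
    """Return the index of the most surprising word in the passage."""
    return max(range(len(words)), key=lambda i: surprisal(probs[i]))

# Hypothetical per-word probabilities for the article's example sentence
words = ["Jack", "and", "I", "sat", "still", "with", "the", "radar", "humming"]
probs = [0.20, 0.90, 0.80, 0.30, 0.40, 0.85, 0.95, 0.01, 0.25]

i = pick_high_surprisal(words, probs)
masked = words[:i] + ["____"] + words[i + 1:]
print(words[i])           # the high-surprisal word the model must recover: "radar"
print(" ".join(masked))   # the fill-in-the-blank prompt shown to the model
```

If the model being probed fills the blank with the exact original word far more often than chance would allow, that is evidence it saw the passage during training.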

An example of having a model "guess" a high-surprisal word. Image credits: OpenAI

The results indicated that GPT-4 had likely memorized parts of popular fiction books, including those in the BookMIA dataset of copyrighted ebooks. It also appeared to have memorized some New York Times articles, though at a lower frequency.
Abhilasha Ravichander, a doctoral student at the University of Washington and co-author of the study, emphasized to TechCrunch that these findings highlight the "contentious data" that might have been used to train these models. "In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically," Ravichander stated. "Our work aims to provide a tool to probe large language models, but there is a real need for greater data transparency in the whole ecosystem."
OpenAI has pushed for more relaxed rules on using copyrighted data to develop AI models. Although the company has some content licensing agreements and offers opt-out options for copyright holders, it has lobbied various governments to establish "fair use" rules specifically for AI training.
Related article
Google Search Introduces 'AI Mode' for Complex, Multi-Part Queries
Google is stepping up its game in the AI arena with the launch of an experimental "AI Mode" feature in its Search engine. Aimed at taking on the likes of Perplexity AI and OpenAI's ChatGPT Search, the new mode was announced on Wednesday.
ChatGPT's Unsolicited Use of User Names Sparks 'Creepy' Concerns Among Some
Some users of ChatGPT have recently encountered an odd new feature: the chatbot occasionally uses their name while working through problems. This wasn't part of its usual behavior before, and many users report that ChatGPT mentions their names without ever being told what to call them.
OpenAI Enhances ChatGPT to Recall Previous Conversations
OpenAI made a big announcement on Thursday about rolling out a fresh feature in ChatGPT called "memory." This nifty tool is designed to make your chats with the AI more personalized by remembering what you've talked about before. Imagine not having to repeat yourself every time you start a new conversation.
Comments (20)
AlbertHernández
April 14, 2025 at 9:39:34 PM GMT
This study about OpenAI using copyrighted material is pretty eye-opening! I mean, it's kind of a bummer for creators, but also fascinating to see how AI is trained. It makes you wonder what else is out there that we don't know about. Maybe OpenAI should start being more transparent? 🤔
TimothyMitchell
April 22, 2025 at 12:12:42 AM GMT
This study showing that OpenAI trained its AI on copyrighted material is really surprising! It's unfortunate for creators, but it's interesting to learn how AI is trained. Maybe more transparency is needed? 🤔
WillLopez
April 21, 2025 at 11:49:05 AM GMT
The study finding that OpenAI used copyrighted material to train its AI is really shocking! It's unfortunate for creators, but learning how AI is trained is fascinating. Should OpenAI become more transparent? 🤔
JamesMiller
April 10, 2025 at 6:07:57 PM GMT
This study about OpenAI using copyrighted material is quite revealing! It's a shame for creators, but also fascinating to see how the AI is trained. It makes you wonder what else is out there that we don't know about. Maybe OpenAI should be more transparent? 🤔
BruceSmith
April 13, 2025 at 1:01:58 AM GMT
This study about OpenAI using copyrighted material is quite revealing. It's a shame for creators, but also fascinating to see how AI is trained. It makes you wonder what else is out there that we don't know about. Maybe OpenAI should be more transparent? 🤔
JohnWilson
April 17, 2025 at 5:16:23 PM GMT
This study on OpenAI's models using copyrighted content is kinda scary! 😱 I mean, it's cool how smart AI is getting, but it feels wrong if they're just copying books and code without asking. Hope they sort it out soon! 🤞