OpenAI's AI Trained on Paywalled O’Reilly Books, Researchers Claim

OpenAI has faced numerous accusations of using copyrighted material without permission to train its AI models. A recent study by the AI Disclosures Project, a nonprofit established in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, suggests that OpenAI may have used non-public books from O’Reilly Media to train GPT-4o, one of its more advanced models.
AI models are essentially sophisticated prediction engines. Trained on vast datasets that include books, movies, and TV shows, they learn patterns and generate responses based on those patterns; they approximate from what they have seen rather than create anything truly new. As real-world data sources like the public web become exhausted, some AI labs, including OpenAI, have started training on AI-generated data. Few have abandoned real-world data entirely, however, because training purely on synthetic data risks degrading model performance.
The AI Disclosures Project's paper claims that OpenAI's GPT-4o model, the default in ChatGPT, shows strong recognition of content from paywalled O’Reilly books, while the earlier GPT-3.5 Turbo model does not. The paper suggests that GPT-4o was likely trained on these non-public books, even though O’Reilly Media has no licensing agreement with OpenAI.
The study employed DE-COP, a method introduced in 2024, to detect copyrighted content in AI training data. This "membership inference attack" tests whether a model can reliably distinguish human-authored texts from AI-generated paraphrases of the same texts; if it can, that suggests the model has prior knowledge of the original text. The researchers tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books, finding that GPT-4o recognized significantly more paywalled content than the older models.
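The quiz-style test described above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not the researchers' code: the function names, the number of paraphrases per passage, and the stub that stands in for a real model are all assumptions made for the example.

```python
import random
from typing import Callable, Sequence

# One quiz item: the verbatim passage plus its machine-written paraphrases.
QuizItem = tuple[str, Sequence[str]]


def guess_rate(
    items: Sequence[QuizItem],
    ask_model: Callable[[Sequence[str]], int],
    seed: int = 0,
) -> float:
    """Return the fraction of quizzes in which the model picks the verbatim passage."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # shuffle so the answer position carries no signal
        picked = ask_model(options)  # index of the option the model believes is verbatim
        if options[picked] == original:
            correct += 1
    return correct / len(items)


def chance_rate(num_paraphrases: int) -> float:
    """Expected guess rate for a model that picks uniformly at random."""
    return 1 / (num_paraphrases + 1)


def make_memorising_stub(seen: set[str]) -> Callable[[Sequence[str]], int]:
    """Hypothetical stand-in for a model: recognises only passages in `seen`."""
    def ask(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: fall back to the first option
    return ask
```

A guess rate well above `chance_rate` across many excerpts would hint that the model has seen the originals before. As the authors themselves note, a signal of this kind is not foolproof.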
While the authors acknowledge that their method isn't foolproof and that the paywalled content might have been introduced by users copying and pasting into ChatGPT, the findings raise questions about OpenAI's data practices. The study did not evaluate OpenAI's latest models, such as GPT-4.5 and reasoning models like o3-mini and o1, leaving open the possibility that these might not have been trained on the same data.
OpenAI has been pushing for more relaxed copyright laws regarding AI training data and has been seeking higher-quality data sources. The company has even hired journalists to refine its models' outputs, a practice seen across the AI industry where experts in various fields are recruited to enhance AI systems.
OpenAI does pay for some of its training data, having licensing agreements with various content providers and offering opt-out mechanisms for copyright owners. However, as the company faces legal challenges over its data practices, the findings of the O’Reilly paper cast a shadow over its operations.
OpenAI did not respond to requests for comment on the study.
Comments (41)
PeterNelson
July 31, 2025 at 7:35:39 AM EDT
This is wild! OpenAI sneaking in paywalled books to train their AI? Sounds like a plot twist from a sci-fi novel. Curious how they'll dodge this one—ethics in AI is getting messier by the day! 😅
0
HarperJones
April 22, 2025 at 10:24:27 PM EDT
It seems a bit suspicious that OpenAI trained its AI on paid books. On one hand, the AI's performance is impressive, but I think they need to find a better way to source their data. 🤔
0
WalterWhite
April 18, 2025 at 3:33:48 PM EDT
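I'm a bit torn about this OpenAI thing. Using O’Reilly's books without permission feels a little off, but the AI they're building is pretty cool. Maybe they should pay for the books next time? 🤔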
0
BruceClark
April 17, 2025 at 10:02:34 PM EDT
OpenAI training its AI on paid books may be a bit of a problem. But the AI's performance really is impressive. I think they need to find a better way to source their data. 🤔
0
DennisGarcia
April 17, 2025 at 9:58:35 PM EDT
I'm kinda torn about this OpenAI thing. On one hand, using those O’Reilly books without permission feels a bit off, you know? But on the other hand, the AI they're building is pretty slick! Maybe they should just pay for the books next time? 🤔
0
AvaHill
April 16, 2025 at 2:00:26 PM EDT
I'm a bit torn about OpenAI using paid books to train its AI. On one hand, it's a little suspicious, but on the other, the AI is impressive. I think they need to find a better way to get their data, right? 🤔
0

