OpenAI's AI Trained on Paywalled O’Reilly Books, Researchers Claim

OpenAI has faced numerous accusations of using copyrighted material without permission to train its AI models. A recent paper from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, suggests that OpenAI may have trained one of its more advanced models, GPT-4o, on non-public books from O’Reilly Media.
AI models, essentially sophisticated prediction engines, are trained on vast datasets including books, movies, and TV shows. They learn patterns from this data and generate responses by extrapolating from those patterns rather than creating anything truly new. As real-world data sources like the public web become exhausted, some AI labs, including OpenAI, have started using AI-generated data for training, though few have abandoned real-world data entirely, because training only on synthetic data risks degrading model performance.
The AI Disclosures Project's paper claims that OpenAI's GPT-4o model, the default in ChatGPT, shows strong recognition of content from paywalled O’Reilly books, while the earlier GPT-3.5 Turbo model does not. The paper suggests that GPT-4o was likely trained on these non-public books, even though O’Reilly Media has no licensing agreement with OpenAI.
The study employed a method called DE-COP, introduced in 2024, to detect copyrighted content in AI training data. This "membership inference attack" tests whether a model can reliably distinguish human-authored texts from AI-generated paraphrases of the same texts; if it can, that suggests the model has prior knowledge of the original text from its training data. The researchers tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books, finding that GPT-4o recognized significantly more paywalled content than the older models.
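The idea behind this kind of test can be illustrated with a minimal sketch. This is not the researchers' code or the actual DE-COP implementation; it is a simplified multiple-choice version under stated assumptions. The names `build_quiz`, `guess_rate`, and `pick_option` are hypothetical, and `pick_option` stands in for whatever call would ask a real model to choose the verbatim passage. The sample texts are invented placeholders, not excerpts from any book.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a function that is shown several candidate passages
# and returns the index of the one it believes is the verbatim original.
Picker = Callable[[List[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the original passage in among its paraphrases.

    Returns the shuffled options and the index of the original.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def guess_rate(items: Sequence[Tuple[str, Sequence[str]]],
               pick_option: Picker, seed: int = 0) -> float:
    """Fraction of quizzes in which the picker selects the original."""
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if pick_option(options) == answer:
            correct += 1
    return correct / len(items)


# Placeholder data: (original passage, machine-written paraphrases).
items = [
    ("Original passage A.", ["Paraphrase A1.", "Paraphrase A2.", "Paraphrase A3."]),
    ("Original passage B.", ["Paraphrase B1.", "Paraphrase B2.", "Paraphrase B3."]),
]
seen_in_training = {original for original, _ in items}


def memorizing_picker(options: List[str]) -> int:
    """Stand-in for a model that has seen the originals before."""
    for i, text in enumerate(options):
        if text in seen_in_training:
            return i
    return 0


def naive_picker(options: List[str]) -> int:
    """Stand-in for a model that always picks a passage it has not seen."""
    for i, text in enumerate(options):
        if text not in seen_in_training:
            return i
    return 0


# With one original and three paraphrases, random guessing would be
# right about 1 time in 4; a rate well above that hints at prior exposure.
chance = 1 / 4
memorized_rate = guess_rate(items, memorizing_picker)
naive_rate = guess_rate(items, naive_picker)
```

In a real study the picker would query a model, and the rates on paywalled versus public excerpts, and on older versus newer models, would be compared rather than read in isolation.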
While the authors acknowledge that their method isn't foolproof and that OpenAI might have collected the paywalled excerpts from users copying and pasting them into ChatGPT, the findings raise questions about OpenAI's data practices. The study did not evaluate OpenAI's latest models, such as GPT-4.5 and reasoning models like o3-mini and o1, leaving open the possibility that these were not trained on the same data.
OpenAI has been pushing for more relaxed copyright laws regarding AI training data and has been seeking higher-quality data sources. The company has even hired journalists to refine its models' outputs, a practice seen across the AI industry where experts in various fields are recruited to enhance AI systems.
OpenAI does pay for some of its training data, having licensing agreements with various content providers and offering opt-out mechanisms for copyright owners. However, as the company faces legal challenges over its data practices, the findings of the O’Reilly paper cast a shadow over its operations.
OpenAI did not respond to requests for comment on the study.












