OpenAI's AI Trained on Paywalled O’Reilly Books, Researchers Claim

April 7, 2025

JuanThomas

OpenAI has faced numerous accusations of using copyrighted material without permission to train its AI models. A recent study by the AI Disclosures Project, a nonprofit established in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, suggests that OpenAI may have used non-public books from O’Reilly Media to train one of its more advanced models, GPT-4o.

AI models are essentially sophisticated prediction engines trained on vast datasets that include books, movies, and TV shows. They learn patterns from that data and generate responses based on them, approximating from what they have seen rather than creating anything truly new. As real-world data sources like the public web become exhausted, some AI labs, including OpenAI, have started using AI-generated data for training, though few have abandoned real-world data completely, because training purely on synthetic data risks degrading model performance.

The AI Disclosures Project's paper claims that GPT-4o, the default model in ChatGPT, shows strong recognition of content from paywalled O’Reilly books, in contrast to the earlier GPT-3.5 Turbo model. The paper suggests that GPT-4o was likely trained on these non-public books, even though O’Reilly Media has no licensing agreement with OpenAI.

The study employed a method called DE-COP, introduced in 2024, to detect copyrighted content in AI training data. This "membership inference attack" tests whether a model can reliably distinguish human-authored texts from AI-generated paraphrases of the same texts; if it can, that suggests the model has prior knowledge of the original text. The researchers tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books, and found that GPT-4o recognized significantly more paywalled content than the older models.
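The idea behind a DE-COP-style test can be illustrated with a minimal Python sketch. It is not the researchers' code: the function names, the multiple-choice framing, and the `choose` callable (a stand-in for a call to a language model) are all assumptions made for illustration, and the study's actual prompts and scoring may differ.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model call: it receives the candidate
# passages and returns the index of the one it believes is verbatim.
Chooser = Callable[[List[str]], int]


def build_quiz(original: str, paraphrases: Sequence[str],
               rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim passage in among its paraphrases.

    Returns the shuffled options and the index of the original.
    Assumes the paraphrases all differ from the original string.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def recognition_rate(items: Sequence[Tuple[str, Sequence[str]]],
                     choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim passage."""
    rng = random.Random(seed)
    hits = 0
    for original, paraphrases in items:
        options, answer = build_quiz(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)


if __name__ == "__main__":
    # Toy data: one "book" passage and three paraphrases of it.
    items = [("the original passage",
              ["a reworded passage", "another rewording", "a third variant"])]
    memorized = {"the original passage"}

    # A toy chooser that has "seen" the original and always recognizes it.
    def memorizer(options: List[str]) -> int:
        return next(i for i, text in enumerate(options) if text in memorized)

    print(recognition_rate(items, memorizer))
```

With one original and three paraphrases per quiz, a chooser that has never seen the text would be expected to land near 25% by guessing. A rate well above that, on text that was never publicly available, is the kind of signal such a test treats as evidence that the passage appeared in training data.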
The authors acknowledge that their method isn't foolproof and that the paywalled content might have reached OpenAI through users copying and pasting it into ChatGPT. Even so, the findings raise questions about OpenAI's data practices. The study did not evaluate OpenAI's latest models, such as GPT-4.5 and reasoning models like o3-mini and o1, leaving open the possibility that these were not trained on the same data.

OpenAI has been pushing for more relaxed copyright laws regarding AI training data and has been seeking higher-quality data sources. The company has even hired journalists to refine its models' outputs, a practice seen across the AI industry, where experts in various fields are recruited to improve AI systems.

OpenAI does pay for some of its training data: it has licensing agreements with various content providers and offers opt-out mechanisms for copyright owners. However, as the company faces legal challenges over its data practices, the findings of the O’Reilly paper cast a shadow over its operations. OpenAI did not respond to requests for comment on the study.

Comments (42)

RichardJackson

November 16, 2025 at 9:30:37 AM EST

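If this kind of thing goes on, it's unbearable for publishers who pay royalties to produce their content... I think there should be much more demand for transparency in AI training data. 🤔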

PeterNelson

July 31, 2025 at 7:35:39 AM EDT

This is wild! OpenAI sneaking in paywalled books to train their AI? Sounds like a plot twist from a sci-fi novel. Curious how they'll dodge this one—ethics in AI is getting messier by the day! 😅

HarperJones

April 22, 2025 at 10:24:27 PM EDT

It seems a bit suspicious that OpenAI trained its AI on paid books. The AI's performance is impressive, but I think they need to find better ways to source their data. 🤔

WalterWhite

April 18, 2025 at 3:33:48 PM EDT

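I'm a bit torn about this OpenAI thing. Using O’Reilly's books without permission feels a little off, but the AI they're building is pretty cool. Maybe they should pay for the books next time? 🤔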

BruceClark

April 17, 2025 at 10:02:34 PM EDT

OpenAI training its AI on paid books may be a bit of a problem. Still, the AI's performance is really impressive. I think they need to find better ways to source their data. 🤔

DennisGarcia

April 17, 2025 at 9:58:35 PM EDT

I'm kinda torn about this OpenAI thing. On one hand, using those O’Reilly books without permission feels a bit off, you know? But on the other hand, the AI they're building is pretty slick! Maybe they should just pay for the books next time? 🤔