OpenAI's AI Trained on Paywalled O’Reilly Books, Researchers Claim
April 7, 2025
JuanThomas

OpenAI has faced numerous accusations of using copyrighted material without permission to train its AI models. A recent paper by the AI Disclosures Project, a nonprofit established in 2024 by media mogul Tim O’Reilly and economist Ilan Strauss, suggests that OpenAI may have used non-public books from O’Reilly Media to train GPT-4o, one of its more advanced models.
AI models are essentially sophisticated prediction engines. Trained on vast datasets that include books, movies, and TV shows, they learn patterns and generate responses from those patterns, approximating from what they have seen rather than creating anything truly new. As real-world data sources like the public web become exhausted, some AI labs, including OpenAI, have started using AI-generated data for training. Few have abandoned real-world data completely, because training purely on synthetic data risks degrading model performance.
The AI Disclosures Project's paper claims that GPT-4o, the default model in ChatGPT, shows strong recognition of content from paywalled O’Reilly books, unlike the earlier GPT-3.5 Turbo model. The paper suggests that GPT-4o was likely trained on these non-public books, even though O’Reilly Media has no licensing agreement with OpenAI.
The study used DE-COP, a method introduced in 2024 for detecting copyrighted content in AI training data. Known as a "membership inference attack," it tests whether a model can reliably distinguish human-authored texts from AI-generated paraphrases of the same texts; if it can, that suggests the model has prior knowledge of the original text. The researchers tested GPT-4o, GPT-3.5 Turbo, and other OpenAI models using 13,962 paragraph excerpts from 34 O’Reilly books, finding that GPT-4o recognized significantly more paywalled content than the older models.
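To make the idea concrete, here is a minimal Python sketch of the kind of test described above. It is an illustration, not the researchers' code: the quiz format (one original passage shuffled among several paraphrases), the `guess_rate` and `chance_level` helpers, and the toy `memorizing_chooser` standing in for a model query are all assumptions made for this example. The intuition is that a model that has never seen a passage should pick the original about as often as chance, while a model that has seen it should do noticeably better.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in for a model query: given candidate passages,
# return the index of the one believed to be the human-authored original.
Chooser = Callable[[Sequence[str]], int]
# One quiz item: (original passage, list of AI-generated paraphrases).
Item = Tuple[str, Sequence[str]]


def guess_rate(items: Sequence[Item], choose: Chooser, seed: int = 0) -> float:
    """Fraction of quiz items where `choose` picks the original passage."""
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)
    correct = 0
    for original, paraphrases in items:
        options = [original, *paraphrases]
        rng.shuffle(options)  # hide the position of the original
        if options[choose(options)] == original:
            correct += 1
    return correct / len(items)


def chance_level(n_options: int) -> float:
    """Expected guess rate for a chooser with no knowledge of the text."""
    return 1 / n_options


def memorizing_chooser(seen: set) -> Chooser:
    """Toy chooser that recognizes passages it has 'seen' before."""
    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no recognition: fall back to the first option
    return choose


if __name__ == "__main__":
    quiz = [
        ("original A", ["paraphrase A1", "paraphrase A2", "paraphrase A3"]),
        ("original B", ["paraphrase B1", "paraphrase B2", "paraphrase B3"]),
    ]
    rate = guess_rate(quiz, memorizing_chooser({"original A", "original B"}))
    print(f"guess rate {rate:.2f} vs. chance {chance_level(4):.2f}")
```

In a real evaluation the chooser would wrap a call to the model under test, and the guess rate on paywalled excerpts would be compared against chance and against excerpts the model could not have seen.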
While the authors acknowledge that their method isn't foolproof and that the paywalled content might have been introduced by users copying and pasting into ChatGPT, the findings raise questions about OpenAI's data practices. The study did not evaluate OpenAI's latest models, such as GPT-4.5 and reasoning models like o3-mini and o1, leaving open the possibility that these might not have been trained on the same data.
OpenAI has been pushing for more relaxed copyright laws regarding AI training data and has been seeking higher-quality data sources. The company has even hired journalists to refine its models' outputs, a practice seen across the AI industry where experts in various fields are recruited to enhance AI systems.
OpenAI does pay for some of its training data, having licensing agreements with various content providers and offering opt-out mechanisms for copyright owners. However, as the company faces legal challenges over its data practices, the findings of the O’Reilly paper cast a shadow over its operations.
OpenAI did not respond to requests for comment on the study.
Comments (40)
RoyPerez
April 11, 2025 at 4:31:26 AM GMT
So, OpenAI's AI got trained on paywalled books? That's a bit shady, isn't it? I mean, I love the tech, but using copyrighted material without permission? Come on, OpenAI, you can do better than that. Maybe they should focus on creating their own content instead.
0
KeithGonzález
April 10, 2025 at 7:27:39 PM GMT
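OpenAI's AI was trained on paid books? That's a bit suspicious, isn't it? I like the technology, but using copyrighted material without permission... OpenAI, you can do better. You should focus on making your own content.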
0
MatthewHill
April 7, 2025 at 7:28:56 PM GMT
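OpenAI's AI was trained on paid books? A bit suspicious, right? I like the technology, but using copyrighted material without permission... OpenAI, you can do better. You should focus on creating your own content.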
0
BenWalker
April 9, 2025 at 1:31:14 PM GMT
So, OpenAI's AI was trained on paid books? That's a bit suspicious, isn't it? I like the technology, but using copyrighted material without permission? Come on, OpenAI, you can do better than that. Maybe they should concentrate on creating their own content.
0
FrankMartínez
April 9, 2025 at 10:03:15 AM GMT
So OpenAI's AI was trained on paid books? That's a bit suspicious, isn't it? I like the technology, but using copyrighted material without permission... Come on, OpenAI, you can do better. Maybe they should focus on creating their own content.
0
LarryHernández
April 10, 2025 at 6:32:40 AM GMT
I'm torn about OpenAI using O’Reilly books to train their AI. On one hand, it's impressive how advanced their models are getting. On the other, it feels a bit shady to use paywalled content. I guess innovation sometimes walks a fine line, huh? Maybe they should just pay for the books next time!
0









