Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

2025/4/4

TechCrunch Industry News

People

Researchers at the AI Disclosures Project

Article author

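@Researchers at the AI Disclosures Project: Our research found that OpenAI may have used a large amount of paywalled book content from O'Reilly Media, without authorization, when training its large language model GPT-4o. Our study used the DE-COP method, which can detect whether specific copyrighted content is present in a model's training data. Testing models including GPT-4o and GPT-3.5 Turbo, we found that GPT-4o recognized paywalled O'Reilly book content at a significantly higher rate than GPT-3.5 Turbo, which suggests that GPT-4o's training data very likely included this paywalled content. We acknowledge that our method is not foolproof, and that OpenAI may have obtained this data from content uploaded by users. Even so, the findings are worrying because they point to possible systematic copyright infringement by OpenAI. We call on OpenAI to respond to this allegation and to take steps to ensure the problem does not recur in future model training.

@Article author: OpenAI has long faced controversy over the sources of its training data and its copyright compliance. OpenAI says it has reached licensing agreements with some news publishers, social networks, and media libraries, and it offers opt-out mechanisms that let copyright owners flag content they do not want used for training. Still, this new paper again highlights the challenges OpenAI faces in its use of data. Although the paper is not direct evidence, it suggests that OpenAI, in pursuing higher-quality training data, may have neglected copyright protection. OpenAI's response, and how it handles similar cases in the future, will have far-reaching effects on the whole AI industry, because they bear directly on the legality and ethics of where AI training data comes from.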

Deep Dive

Shownotes Transcript

OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn’t license to train more sophisticated AI models. AI models are essentially complex prediction engines.

Episode length: 05:45
