New AI Models from OpenAI Exhibit Higher Hallucination Rates in Reasoning Tasks

OpenAI’s newly released o3 and o4-mini AI models excel in many areas, but they hallucinate, that is, fabricate information, more often than several of the company’s earlier models.
Hallucinations remain a persistent challenge in AI, even for top-tier systems. Typically, newer models reduce hallucination rates, but o3 and o4-mini deviate from this trend.
Internal OpenAI tests reveal that o3 and o4-mini, designed as reasoning models, hallucinate more frequently than prior reasoning models like o1, o1-mini, and o3-mini, as well as non-reasoning models like GPT-4o.
More concerning, OpenAI does not yet know what is causing the increase.
OpenAI’s technical report on o3 and o4-mini notes that further research is needed to pinpoint why hallucination rates rise as reasoning models are scaled up. The models perform better in areas like coding and math, but because they make more claims overall, they produce more accurate claims as well as more inaccurate ones, according to the report.
On OpenAI’s PersonQA benchmark, o3 hallucinated in 33% of responses, roughly double the rates of o1 (16%) and o3-mini (14.8%). The o4-mini model performed worse still, hallucinating in 48% of cases.
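To make the comparison concrete, the short Python sketch below computes how many times higher each new model's PersonQA hallucination rate is than each older model's. The rates are the ones quoted above; the dictionary and the `ratio` helper are illustrative names and are not taken from OpenAI's report.

```python
# PersonQA hallucination rates quoted in the article (percent of responses).
personqa_rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}


def ratio(model: str, baseline: str) -> float:
    """Return how many times higher `model`'s rate is than `baseline`'s."""
    return personqa_rates[model] / personqa_rates[baseline]


# Compare each new model against each older reasoning model.
for baseline in ("o1", "o3-mini"):
    for model in ("o3", "o4-mini"):
        print(f"{model} vs {baseline}: {ratio(model, baseline):.2f}x")
```

On these figures, o3 comes out at about 2.06 times the rate of o1 and 2.23 times that of o3-mini, while o4-mini is about 3 times o1 and 3.24 times o3-mini.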
Transluce, a nonprofit AI research group, found o3 fabricating actions, such as claiming it ran code on a 2021 MacBook Pro outside ChatGPT, despite lacking such capabilities.
“We suspect the reinforcement learning used in o-series models may exacerbate issues typically lessened by standard post-training methods,” said Transluce researcher and former OpenAI employee Neil Chowdhury in an email to TechCrunch.
Transluce co-founder Sarah Schwettmann noted that o3’s hallucination rate could reduce its practical utility.
Kian Katanforoosh, Stanford adjunct professor and Workera CEO, told TechCrunch his team found o3 superior for coding workflows but prone to generating broken website links.
While hallucinations can spark creative ideas, they pose challenges for industries like law, where accuracy is critical and errors in documents are unacceptable.
Giving models web search capabilities is one promising way to improve accuracy. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, which suggests search could also reduce hallucination rates in reasoning models, at least when users are willing to allow access to a third-party search provider.
If scaling reasoning models continues to increase hallucinations, finding solutions will become increasingly critical.
“Improving model accuracy and reliability is a key focus of our ongoing research,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.
The AI industry has recently shifted toward reasoning models, which can improve performance without requiring extensive computing resources. However, reasoning also appears to increase the risk of hallucination, which presents a significant challenge.












