
AI will keep hallucinating because we keep grading it like a game show: rewarding right answers, but never penalising confident wrong ones.
As Charlie Munger said: “Show me the incentive and I’ll show you the outcome”.
And that’s exactly what OpenAI’s latest research shows. Their new 36-page research paper breaks down how current training and evaluation methods are misaligned, and why models will keep making things up unless we change the way we score them.
The insight is surprisingly simple: hallucinations persist because we’ve trained models to guess. Not to stay silent when uncertain. Not to say “I don’t know”. But to make confident-sounding statements regardless of whether they’re true. It all comes down to a scoring system that rewards them for being right and does nothing when they’re wrong. There’s no penalty for confidently bluffing.
This makes sense if you're a model trying to maximise your reward. You’re not optimising for truth; you're optimising for reward signals. So you take the shot. It’s a bit like a school exam where you gain marks for correct answers but don’t lose any for wrong ones: you’d guess every time too.
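To make the maths of that incentive concrete, here’s a minimal sketch (my own illustration, not from OpenAI’s paper) of the expected score a 30%-confident model gets for guessing versus abstaining, first under today’s accuracy-only grading and then with a penalty for wrong answers:

```python
# Illustrative only: expected score for a model that is 30% sure of its answer,
# under two hypothetical grading schemes. The numbers are made up for the example.

def expected_score(p_correct, reward_right, penalty_wrong, reward_abstain):
    """Return (expected score if guessing, score if abstaining)."""
    guess = p_correct * reward_right + (1 - p_correct) * penalty_wrong
    return guess, reward_abstain

# Scheme 1: today's typical benchmark - 1 point for right, 0 for wrong or "I don't know".
guess, abstain = expected_score(0.30, reward_right=1.0, penalty_wrong=0.0, reward_abstain=0.0)
print(f"Accuracy-only grading: guess={guess:.2f}, abstain={abstain:.2f}")   # guessing always wins

# Scheme 2: wrong answers cost a point, like a negatively marked exam.
guess, abstain = expected_score(0.30, reward_right=1.0, penalty_wrong=-1.0, reward_abstain=0.0)
print(f"With a wrong-answer penalty: guess={guess:.2f}, abstain={abstain:.2f}")  # abstaining wins
```

Under the first scheme a guess is worth 0.30 in expectation and silence is worth 0, so the model “takes the shot” every time; only the second scheme makes honesty the better bet.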
Take the screenshot below as an example of how OpenAI tested this. They compared two models:
One that was taught it’s OK to say “I don’t know” (called gpt-5-thinking-mini)
One that was trained the usual way, to guess confidently every time (OpenAI o4-mini)
The first model abstained 52% of the time, answered correctly 22%, and only got it wrong 26% of the time. The second model almost never abstained (just 1%), and while it was slightly more accurate (24%), it was wrong a staggering 75% of the time.

The takeaway is that confident guessing leads to confident hallucinations. Without a penalty, the model learns that it’s better to say something, even if it’s nonsense, than to admit it doesn’t know.
OpenAI goes further: the real issue isn’t just how we train models, it’s how we test them. The systems we use to rate and fine-tune models (like asking humans which answer sounds better, or training the model to predict the next word) reward confident-sounding responses, even when they’re wrong.
Take a real example from OpenAI’s own research: someone asked the model how to conjugate Māori verbs. The model made it up. The answer sounded fluent and looked helpful, so the human raters gave it a thumbs up. But it was completely false. And because the system rewards that kind of confident, fluent response (even when it’s wrong) the model learns to do it more.
So how do we fix it? OpenAI offers a few directions:
More nuanced scoring systems
Better evaluation data, and
Training methods that explicitly model uncertainty
One of the most promising ideas is to introduce a cost for wrong answers, like in some standardised tests. Make it better to abstain than to confidently hallucinate.
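As a rough sketch of how that could work (my own numbers, not OpenAI’s exact scoring rule): if a correct answer earns +1, a wrong answer costs k points, and “I don’t know” scores 0, then guessing only beats abstaining once the model’s confidence clears k / (1 + k).

```python
# Illustrative sketch (hypothetical penalties, not OpenAI's exact scheme):
# with +1 for a correct answer, -k for a wrong one, and 0 for abstaining,
# guessing has a higher expected score only when confidence p satisfies
# p - k * (1 - p) > 0, i.e. p > k / (1 + k).

def break_even_confidence(k):
    """Minimum confidence at which guessing beats abstaining, given penalty k."""
    return k / (1 + k)

for k in (0, 1, 3, 9):
    print(f"wrong-answer penalty {k}: guess only if confidence > {break_even_confidence(k):.0%}")
# penalty 0 -> 0%  (always guess: today's incentive)
# penalty 1 -> 50%, penalty 3 -> 75%, penalty 9 -> 90%
```

The heavier the penalty, the more confident a model has to be before guessing pays off, which is exactly the behaviour we want.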
This matters. Trust in AI is fragile. Once users catch a model making things up (especially when it sounds authoritative) they disengage. It damages adoption and, more importantly, it distorts the perception of how reliable these systems really are.
So the next time a model hallucinates, ask yourself: what did we teach it to value? Because as the research shows, hallucinations aren’t a failure of intelligence, they’re a failure of incentives.
Written by Mike ✌

Passionate about all things AI, emerging tech and start-ups, Mike is the Founder of The AI Corner.
