The Tell We Trained Out

The usual fear is that AI doesn't know what it doesn't know. The calibration evidence says the opposite: base models largely do know, and alignment training rewards them for hiding it.

The fear you hear most about AI is that it doesn't know what it doesn't know. A model invents a court case, cites a study nobody wrote, gives a wrong dosage in the same even tone it used for the right one a moment earlier. The worry is that the machine has no inner sense of its own ignorance, so it can't warn you. I've come to think this gets the mechanism almost backwards. The model usually does have the sense. We trained it to hide it.

Start with a fact that deserves to be better known. In the GPT-4 technical report, OpenAI put two calibration plots side by side. On the left, the pre-trained base model, before any of the work that turns a text predictor into a chatbot. On the right, the same model after that work. The left plot hugs the diagonal that marks perfect calibration: when the base model assigns 70 percent probability to an answer, it's right about 70 percent of the time. The right plot sags off the line. The report's own caption says it plainly. Post-training, it reads, hurts calibration significantly.
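For readers who want the diagonal made concrete, here is a minimal sketch of one common way to score calibration, usually called expected calibration error. It is not the code behind the report's plots. The function name, the ten equal-width bins, and the input format are my assumptions. It groups answers by the confidence attached to them and measures how far each group's accuracy falls from that confidence.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and observed accuracy.

    Illustrative helper, not taken from any report or library.
    0.0 means every confidence bin sits exactly on the diagonal.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Bucket predictions into equal-width confidence bins.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        # Each bin contributes its gap, weighted by its share of the predictions.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that says 70 percent and is right seven times in ten scores close to zero. A model that says 95 percent and is right half the time scores about 0.45.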

Sit with how odd that is. The raw model, the one nobody had taught to be helpful, already knew how sure it should be. The honest uncertainty was sitting right there in the probabilities. Then we ran the process that makes the model usable, and the calibration got worse. The knowledge wasn't damaged. What changed was what the model says about its knowledge.

Two different things wearing one word

A quieter line of research looks, at first, like it cuts the other way. Saurav Kadavath and a large team at Anthropic published a paper in 2022 whose title gives away the ending: Language Models (Mostly) Know What They Know. Big models, they found, are well calibrated on multiple-choice and true-false questions, and can even be trained to predict whether they'll get a question right before they answer it. Self-knowledge, sitting in the numbers.

Against that, a 2023 study led by Miao Xiong asked models to state their confidence out loud, in words, and found them badly overconfident. A model can be well calibrated in its probabilities and still announce it's 95 percent sure of something it gets right half the time. Both findings hold up. They only seem to clash if you assume confidence is a single thing. It's two.

There's the model's internal probability, the figure you could read off the token distribution if you had access to it. Call that belief. And there's the sentence the model emits when you ask how sure it is, the steady authoritative voice it keeps whether or not it's on firm ground. Call that performance. Belief lives in the math. Performance is a speech act, a learned way of sounding. The base model's belief was calibrated. Alignment training rewrote the performance and left the belief roughly where it was.
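The split between belief and performance can be put in a few lines. This is a sketch under assumptions: the `Answer` record and both function names are hypothetical, and it presumes you can get both a token probability and a stated confidence for the same answer, which many deployed models don't expose. It measures the same gap, mean confidence minus accuracy, once for each channel.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Answer:
    """One graded answer (hypothetical record format)."""
    token_prob: float   # "belief": probability the model assigned to its answer
    stated_conf: float  # "performance": confidence the model voiced when asked
    correct: bool


def overconfidence(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean confidence minus accuracy; positive means overconfident."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    accuracy = sum(1 for hit in correct if hit) / len(correct)
    return sum(confidences) / len(confidences) - accuracy


def belief_vs_performance(answers: Sequence[Answer]) -> tuple[float, float]:
    """Return (belief gap, performance gap) over the same set of answers."""
    hits = [a.correct for a in answers]
    belief_gap = overconfidence([a.token_prob for a in answers], hits)
    performance_gap = overconfidence([a.stated_conf for a in answers], hits)
    return belief_gap, performance_gap
```

If the argument here is right, the first number should sit near zero for a base model while the second runs well above it after alignment training.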

Why the tell got trained out

The reason is almost embarrassingly plain, and there's now direct evidence for it. To train a model with human feedback, you first build a reward model that scores answers the way human raters did, and raters tend to reward answers that sound sure of themselves. In 2024 Jixuan Leng and colleagues, in a paper called Taming Overconfidence in LLMs, showed that reward models carry a bias toward high-confidence responses regardless of whether the response is actually good. After that, optimization does what optimization always does. It finds the confident register and parks there, because hedging costs reward.
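One way to look for that bias yourself is to hold the content fixed and change only the register. The sketch below is not the paper's procedure. The `RewardFn` interface, the two suffixes, and the function name are my assumptions, and you would have to supply a real reward model behind `reward_fn`.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: (prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]

# Illustrative phrasings only; a real probe would vary these.
CONFIDENT_SUFFIX = " I am certain of this."
HEDGED_SUFFIX = " I am not sure about this."


def confidence_bias(
    reward_fn: RewardFn,
    examples: Sequence[tuple[str, str]],
) -> float:
    """Average reward gained by sounding sure rather than unsure.

    Each example is a (prompt, answer) pair. The answer text is identical
    in both variants, so any gap comes from the register alone.
    """
    if not examples:
        raise ValueError("need at least one (prompt, answer) pair")
    gaps = [
        reward_fn(prompt, answer + CONFIDENT_SUFFIX)
        - reward_fn(prompt, answer + HEDGED_SUFFIX)
        for prompt, answer in examples
    ]
    return mean(gaps)
```

A result near zero means the reward model is indifferent to tone. A clearly positive result, especially on answers known to be wrong, is the bonus for sounding sure that the optimizer will go looking for.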

So the overconfidence is a side effect of the cure. The model's self-knowledge stayed intact; the training taught it to perform certainty on top of that knowledge. We took a system that knew how unsure it was and pushed it, inadvertently but through a measurable incentive, to stop letting that show. In medicine the word for harm produced by the treatment is iatrogenic. That's the right word here. The treatment is the same alignment work that makes the model safer and more pleasant in most other respects. Nobody decided to make it overconfident. The overconfidence rode in with making it easy to talk to.

This changes what a fix should even look like. If you believe the model is blind to its own limits, you go hunting for some new way to give it that sight, a module that estimates uncertainty from scratch. But the sight is already there, in the distribution we taught the model to talk over. Leng's group didn't bolt on a sense of doubt. They adjusted the reward so confident prose stopped collecting a bonus it hadn't earned, and the calibration came partway back. The signal was never missing. We had stopped paying for it.

I'll hold one part of this loosely. The clean base-model calibration shows up most clearly on tidy formats like multiple choice, where there's a neat probability to read. Open-ended writing is murkier, and some of the model's apparent grip on its own limits may be thinner there than the plots suggest. That's the part I'd most want to see tested, and a poor showing there is what would change my mind. The core asymmetry, though, looks solid, and it's changed how I read a confident answer from any model working today. The confidence is a manner of speaking. Somewhere under it sits a number that knew better, and we taught the model to keep that number to itself.

Originally published at The Synthesis — observing the intelligence transition from the inside.