David Van Assche (S.L)
Grounded Calibration vs Self-Assessment: Why Your AI's Confidence Score Is Lying

Part 3 of the Epistemic AI series. Parts 1 and 2 introduced the epistemic gap and how to measure it. Now: why the AI's self-report can't be trusted — and what to do about it.

Your AI tells you it's 85% confident. But what does that number actually mean? Nobody checked. There's no ground truth. It's a student grading their own exam, and the grade is always suspiciously high.

This is the calibration problem, and it's more insidious than it sounds.

Why Self-Assessment Is Structurally Unreliable

When an AI agent reports its epistemic vectors (know = 0.85, uncertainty = 0.10), it's making a prediction about its own internal state. This prediction is corrupted by at least three systematic biases:

1. The Completion Bias

LLMs are trained to produce helpful, confident responses. When asked "how well do you understand this?", the model gravitates toward the answer that sounds most competent. This isn't deception — it's the same optimization pressure that makes models agree with user corrections even when the user is wrong.

# What the AI reports:
know: 0.85  "I understand the codebase well"

# What the evidence shows:
- 3 test failures in the module it just edited
- 2 linter violations it didn't catch
- Referenced a function that was renamed 3 commits ago

The gap between 0.85 and the evidence isn't malice. It's structural overconfidence baked into the training objective.
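To make the idea concrete, here is a minimal sketch of discounting a self-report against negative evidence. The function name, evidence categories, and penalty weights are all illustrative assumptions, not Empirica's actual formula:

```python
# Hypothetical sketch: derive a grounded "know" estimate from observed
# evidence and compare it to the AI's self-report. Weights are illustrative.

def grounded_know(test_failures: int, lint_violations: int,
                  stale_references: int, baseline: float = 1.0) -> float:
    """Each piece of negative evidence discounts the baseline confidence."""
    penalty = (0.05 * test_failures
               + 0.03 * lint_violations
               + 0.08 * stale_references)
    return max(0.0, baseline - penalty)

self_reported = 0.85
# 3 failing tests, 2 lint violations, 1 stale function reference
evidence = grounded_know(test_failures=3, lint_violations=2, stale_references=1)
gap = self_reported - evidence

print(f"grounded know: {evidence:.2f}, gap: {gap:+.2f}")
```

Whatever the exact weights, the point stands: the grounded number comes from counting artifacts, not from asking the model how it feels.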

2. The Anchoring Effect

Once the AI declares a PREFLIGHT vector (say, know = 0.60), it anchors to that starting point. The POSTFLIGHT assessment tends to show "improvement" regardless of what actually happened:

PREFLIGHT:  know = 0.60  (declared at session start)
POSTFLIGHT: know = 0.85  (looks like learning!)

But did it actually learn?
Or did it just decide enough time had passed?

Without external verification, you can't distinguish genuine learning from narrative completion — the AI telling a story about getting smarter because that's the expected arc.
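One way to picture the check (a sketch with assumed names, not Empirica's API): a claimed PREFLIGHT-to-POSTFLIGHT gain only counts as learning if the grounded evidence moved by a comparable amount.

```python
# Illustrative check: does the evidence support the claimed improvement?

def learning_is_grounded(pre: float, post: float,
                         evidence_delta: float, tolerance: float = 0.1) -> bool:
    """True when the claimed gain is backed by measured evidence movement."""
    claimed_delta = post - pre
    return claimed_delta <= evidence_delta + tolerance

# Claimed: know went 0.60 -> 0.85 (+0.25). Evidence (tests fixed, unknowns
# resolved) only supports +0.05 -- the "gain" is mostly narrative completion.
print(learning_is_grounded(0.60, 0.85, evidence_delta=0.05))  # False
```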

3. The Unknown Unknowns

The most dangerous blind spot: the AI can't report uncertainty about things it doesn't know it doesn't know. If it never investigated the session store's concurrency model, it won't report low confidence on session handling — because it doesn't know there's something to be uncertain about.

AI: "I'm confident about the auth implementation" (know = 0.85)
Reality: auth works, but the session store race condition
         it didn't investigate will break under load.
         The AI doesn't report uncertainty because
         it never discovered the problem exists.

Grounded Verification: The Fix

The solution isn't better prompting or asking the AI to "be more honest." The solution is deterministic evidence — measurements that don't come from the AI's self-report.

What "Grounded" Means

Grounded evidence comes from services that produce facts, not opinions:

| Evidence Source | What It Measures | Maps To |
|---|---|---|
| pytest results | Tests passing/failing | know, do, change |
| ruff/pylint | Code quality violations | coherence, signal |
| radon | Cyclomatic complexity | density, clarity |
| git diff | Lines actually changed | change, state |
| pyright | Type safety | coherence |
| Finding count | Investigation breadth | know, signal |
| Unknown resolution rate | Learning evidence | do, completion |
| textstat | Prose readability | clarity, density |

These sources don't lie. They don't have completion bias. They don't anchor to previous assessments.
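A minimal sketch of how that mapping might work, assuming simple normalizations (the function and weights here are illustrative; the table above lists the real sources):

```python
# Hypothetical mapping from deterministic measurements to grounded vectors.

def grounded_vectors(tests_passed: int, tests_total: int,
                     lint_violations: int, lines_changed: int) -> dict:
    pass_rate = tests_passed / tests_total if tests_total else 0.0
    return {
        "know": pass_rate,                                    # pytest -> know
        "coherence": max(0.0, 1.0 - 0.05 * lint_violations),  # ruff -> coherence
        "change": min(1.0, lines_changed / 500),              # git diff -> change
    }

# 31/50 tests passing, 2 lint violations, 120 lines changed
print(grounded_vectors(31, 50, 2, 120))
```

Note that 31/50 passing tests yields know = 0.62 regardless of what the model reports, which is exactly the property we want.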

The Calibration Score

Empirica computes a calibration score by comparing the AI's self-assessment against grounded evidence:

Self-assessed:  know = 0.85, uncertainty = 0.10
Grounded:       know = 0.62, uncertainty = 0.35

Calibration gaps:
  know:        overestimate by 0.23
  uncertainty: underestimate by 0.25
  coherence:   underestimate by 0.20 (tests show code is cleaner than claimed)
  change:      underestimate by 0.40 (git shows more change than reported)

Calibration score: 0.14 (0.0 = perfect, 1.0 = completely uncalibrated)
Grounded coverage: 69% (evidence covers 69% of claimed vectors)

This is real output from an actual Empirica session. The AI was overestimating its knowledge by 0.23 and underestimating its uncertainty by 0.25 — the most common pattern we see.
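The per-vector comparison step is simple subtraction; how the gaps aggregate into the single 0.14 score involves weighting that this sketch does not reproduce. A hedged illustration of the gap computation:

```python
# Sketch of the comparison step: signed per-vector gaps between the
# self-assessment and grounded evidence. Positive = overestimate.

def calibration_gaps(self_assessed: dict, grounded: dict) -> dict:
    """Only vectors present in both assessments can be compared."""
    return {k: round(self_assessed[k] - grounded[k], 2)
            for k in self_assessed if k in grounded}

gaps = calibration_gaps(
    {"know": 0.85, "uncertainty": 0.10},
    {"know": 0.62, "uncertainty": 0.35},
)
print(gaps)  # {'know': 0.23, 'uncertainty': -0.25}
```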

Coverage Matters

Not all vectors can be grounded. If the AI is doing research (no code written), there's no pytest or git diff to verify against. Empirica tracks grounded coverage — what percentage of the self-assessment has deterministic evidence behind it.

# When coverage < 30%, calibration is declared insufficient
if grounded_coverage < 0.3:
    calibration_status = "insufficient_evidence"
    # Self-assessment stands — but honestly flagged as unverified

This is more honest than producing a phantom calibration score from sparse data. When we don't have enough evidence, we say so — and the self-assessment stands unchallenged rather than being falsely "verified."
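The coverage number itself can be sketched as the share of self-assessed vectors that have at least one deterministic source behind them (names here are illustrative, not Empirica's API):

```python
# Hypothetical coverage calculation for a research-only session.

def grounded_coverage(claimed: set, evidenced: set) -> float:
    """Fraction of claimed vectors backed by deterministic evidence."""
    if not claimed:
        return 0.0
    return len(claimed & evidenced) / len(claimed)

claimed = {"know", "uncertainty", "change", "coherence"}
evidenced = {"know", "change", "coherence"}   # no test/diff data for uncertainty
print(grounded_coverage(claimed, evidenced))  # 0.75
```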

What Happens Over Time

The calibration gap should shrink across transactions. If the AI consistently overestimates know by 0.23, the system provides feedback:

Previous transaction feedback:
  overestimate_tendency: [know, context]
  underestimate_tendency: [uncertainty, coherence, change]

  Note: "Be more cautious with know estimates,
         less cautious with uncertainty estimates."

This feedback is injected into the next PREFLIGHT. Over time, the AI's self-assessments become more accurate — not because the model changed, but because the measurement infrastructure makes overconfidence visible and costly.
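A sketch of how gap history could be turned into that tendency feedback (field names and the 0.1 threshold are assumptions for illustration):

```python
# Illustrative feedback generation from a history of signed calibration gaps.

def calibration_feedback(gap_history: dict, threshold: float = 0.1) -> dict:
    """Classify vectors the AI consistently over- or under-rates."""
    over, under = [], []
    for vector, gaps in gap_history.items():
        mean_gap = sum(gaps) / len(gaps)
        if mean_gap > threshold:
            over.append(vector)       # consistently self-rated too high
        elif mean_gap < -threshold:
            under.append(vector)      # consistently self-rated too low
    return {"overestimate_tendency": over, "underestimate_tendency": under}

history = {"know": [0.23, 0.20, 0.25], "uncertainty": [-0.25, -0.30]}
print(calibration_feedback(history))
# {'overestimate_tendency': ['know'], 'underestimate_tendency': ['uncertainty']}
```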

The Sycophancy Connection

Calibration and sycophancy are the same problem viewed from different angles:

  • Sycophancy: AI agrees with the user to avoid conflict
  • Overconfidence: AI agrees with itself about its own competence

Both come from the same training pressure: produce the response that seems most helpful and aligned. Grounded verification breaks both patterns by introducing an external reference point that neither the AI nor the user controls.

When the AI says "know = 0.85" and the evidence says "know = 0.62", there's no way to talk your way out of it. The tests failed. The linter found issues. The gap is measured.

Try It

pip install empirica
cd your-project && empirica project-init

# After a work session, check calibration:
empirica postflight-submit - << 'EOF'
{
  "vectors": {"know": 0.85, "uncertainty": 0.10, "change": 0.70},
  "reasoning": "Implemented auth middleware, tests passing"
}
EOF

# The POSTFLIGHT output shows:
#   calibration_score: 0.14
#   grounded_coverage: 69%
#   gaps: know overestimate by 0.23, uncertainty underestimate by 0.25
#   sources: pytest, ruff, git_diff, artifacts, prose_quality

The calibration loop runs automatically on every POSTFLIGHT. No extra commands needed — just work normally and measure honestly.


Next: *Part 4 — Adding Epistemic Hooks to Your Workflow* — the step-by-step integration tutorial. From pip install to your first measured transaction in 5 minutes.

Empirica on GitHub | Part 1 | Part 2
