David Van Assche (S.L)
Grounded Calibration vs Self-Assessment: Why Your AI's Confidence Score Is Lying

Part 3 of the Epistemic AI series. Parts 1 and 2 introduced the epistemic gap and how to measure it. Now: why the AI's self-report can't be trusted — and what to do about it.

Your AI tells you it's 85% confident. But what does that number actually mean? Nobody checked. There's no ground truth. It's a student grading their own exam, and the grade is always suspiciously high.

This is the calibration problem, and it's more insidious than it sounds.

Why Self-Assessment Is Structurally Unreliable

When an AI agent reports its epistemic vectors (know = 0.85, uncertainty = 0.10), it's making a prediction about its own internal state. This prediction is corrupted by at least three systematic biases:

1. The Completion Bias

LLMs are trained to produce helpful, confident responses. When asked "how well do you understand this?", the model gravitates toward the answer that sounds most competent. This isn't deception — it's the same optimization pressure that makes models agree with user corrections even when the user is wrong.

# What the AI reports:
know: 0.85  "I understand the codebase well"

# What the evidence shows:
- 3 test failures in the module it just edited
- 2 linter violations it didn't catch
- Referenced a function that was renamed 3 commits ago

The gap between 0.85 and the evidence isn't malice. It's structural overconfidence baked into the training objective.
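To make the idea concrete, here is a minimal sketch of discounting a self-report against negative evidence. The function name, evidence categories, and penalty weights are all illustrative assumptions, not Empirica's actual formula:

```python
# Hypothetical sketch: derive a grounded "know" estimate from observed
# evidence and compare it to the AI's self-report. Weights are illustrative.

def grounded_know(test_failures: int, lint_violations: int,
                  stale_references: int, baseline: float = 1.0) -> float:
    """Each piece of negative evidence discounts the baseline confidence."""
    penalty = (0.05 * test_failures
               + 0.03 * lint_violations
               + 0.08 * stale_references)
    return max(0.0, baseline - penalty)

self_reported = 0.85
# 3 failing tests, 2 lint violations, 1 stale function reference
evidence = grounded_know(test_failures=3, lint_violations=2, stale_references=1)
gap = self_reported - evidence

print(f"grounded know: {evidence:.2f}, gap: {gap:+.2f}")
```

Whatever the exact weights, the point stands: the grounded number comes from counting artifacts, not from asking the model how it feels.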

2. The Anchoring Effect

Once the AI declares a PREFLIGHT vector (say, know = 0.60), it anchors to that starting point. The POSTFLIGHT assessment tends to show "improvement" regardless of what actually happened:

PREFLIGHT:  know = 0.60  (declared at session start)
POSTFLIGHT: know = 0.85  (looks like learning!)

But did it actually learn?
Or did it just decide enough time had passed?

Without external verification, you can't distinguish genuine learning from narrative completion — the AI telling a story about getting smarter because that's the expected arc.
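One way to picture the check (a sketch with assumed names, not Empirica's API): a claimed PREFLIGHT-to-POSTFLIGHT gain only counts as learning if the grounded evidence moved by a comparable amount.

```python
# Illustrative check: does the evidence support the claimed improvement?

def learning_is_grounded(pre: float, post: float,
                         evidence_delta: float, tolerance: float = 0.1) -> bool:
    """True when the claimed gain is backed by measured evidence movement."""
    claimed_delta = post - pre
    return claimed_delta <= evidence_delta + tolerance

# Claimed: know went 0.60 -> 0.85 (+0.25). Evidence (tests fixed, unknowns
# resolved) only supports +0.05 -- the "gain" is mostly narrative completion.
print(learning_is_grounded(0.60, 0.85, evidence_delta=0.05))  # False
```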

3. The Unknown Unknowns

The most dangerous blind spot: the AI can't report uncertainty about things it doesn't know it doesn't know. If it never investigated the session store's concurrency model, it won't report low confidence on session handling — because it doesn't know there's something to be uncertain about.

AI: "I'm confident about the auth implementation" (know = 0.85)
Reality: auth works, but the session store race condition
         it didn't investigate will break under load.
         The AI doesn't report uncertainty because
         it never discovered the problem exists.

Grounded Verification: The Fix

The solution isn't better prompting or asking the AI to "be more honest." The solution is deterministic evidence — measurements that don't come from the AI's self-report.

What "Grounded" Means

Grounded evidence comes from services that produce facts, not opinions:

| Evidence Source | What It Measures | Maps To |
|---|---|---|
| pytest results | Tests passing/failing | know, do, change |
| ruff/pylint | Code quality violations | coherence, signal |
| radon | Cyclomatic complexity | density, clarity |
| git diff | Lines actually changed | change, state |
| pyright | Type safety | coherence |
| Finding count | Investigation breadth | know, signal |
| Unknown resolution rate | Learning evidence | do, completion |
| textstat | Prose readability | clarity, density |

These sources don't lie. They don't have completion bias. They don't anchor to previous assessments.
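A minimal sketch of how that mapping might work, assuming simple normalizations (the function and weights here are illustrative; the table above lists the real sources):

```python
# Hypothetical mapping from deterministic measurements to grounded vectors.

def grounded_vectors(tests_passed: int, tests_total: int,
                     lint_violations: int, lines_changed: int) -> dict:
    pass_rate = tests_passed / tests_total if tests_total else 0.0
    return {
        "know": pass_rate,                                    # pytest -> know
        "coherence": max(0.0, 1.0 - 0.05 * lint_violations),  # ruff -> coherence
        "change": min(1.0, lines_changed / 500),              # git diff -> change
    }

# 31/50 tests passing, 2 lint violations, 120 lines changed
print(grounded_vectors(31, 50, 2, 120))
```

Note that 31/50 passing tests yields know = 0.62 regardless of what the model reports, which is exactly the property we want.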

The Calibration Score

Empirica computes a calibration score by comparing the AI's self-assessment against grounded evidence:

Self-assessed:  know = 0.85, uncertainty = 0.10
Grounded:       know = 0.62, uncertainty = 0.35

Calibration gaps:
  know:        overestimate by 0.23
  uncertainty: underestimate by 0.25
  coherence:   underestimate by 0.20 (tests show code is cleaner than claimed)
  change:      underestimate by 0.40 (git shows more change than reported)

Calibration score: 0.14 (0.0 = perfect, 1.0 = completely uncalibrated)
Grounded coverage: 69% (evidence covers 69% of claimed vectors)

This is real output from an actual Empirica session. The AI was overestimating its knowledge by 0.23 and underestimating its uncertainty by 0.25 — the most common pattern we see.
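The per-vector comparison step is simple subtraction; how the gaps aggregate into the single 0.14 score involves weighting that this sketch does not reproduce. A hedged illustration of the gap computation:

```python
# Sketch of the comparison step: signed per-vector gaps between the
# self-assessment and grounded evidence. Positive = overestimate.

def calibration_gaps(self_assessed: dict, grounded: dict) -> dict:
    """Only vectors present in both assessments can be compared."""
    return {k: round(self_assessed[k] - grounded[k], 2)
            for k in self_assessed if k in grounded}

gaps = calibration_gaps(
    {"know": 0.85, "uncertainty": 0.10},
    {"know": 0.62, "uncertainty": 0.35},
)
print(gaps)  # {'know': 0.23, 'uncertainty': -0.25}
```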

Coverage Matters

Not all vectors can be grounded. If the AI is doing research (no code written), there's no pytest or git diff to verify against. Empirica tracks grounded coverage — what percentage of the self-assessment has deterministic evidence behind it.

# When coverage < 30%, calibration is declared insufficient
if grounded_coverage < 0.3:
    calibration_status = "insufficient_evidence"
    # Self-assessment stands — but honestly flagged as unverified

This is more honest than producing a phantom calibration score from sparse data. When we don't have enough evidence, we say so — and the self-assessment stands unchallenged rather than being falsely "verified."
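The coverage number itself can be sketched as the share of self-assessed vectors that have at least one deterministic source behind them (names here are illustrative, not Empirica's API):

```python
# Hypothetical coverage calculation for a research-only session.

def grounded_coverage(claimed: set, evidenced: set) -> float:
    """Fraction of claimed vectors backed by deterministic evidence."""
    if not claimed:
        return 0.0
    return len(claimed & evidenced) / len(claimed)

claimed = {"know", "uncertainty", "change", "coherence"}
evidenced = {"know", "change", "coherence"}   # no test/diff data for uncertainty
print(grounded_coverage(claimed, evidenced))  # 0.75
```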

What Happens Over Time

The calibration gap should shrink across transactions. If the AI consistently overestimates know by 0.23, the system provides feedback:

Previous transaction feedback:
  overestimate_tendency: [know, context]
  underestimate_tendency: [uncertainty, coherence, change]

  Note: "Be more cautious with know estimates,
         less cautious with uncertainty estimates."

This feedback is injected into the next PREFLIGHT. Over time, the AI's self-assessments become more accurate — not because the model changed, but because the measurement infrastructure makes overconfidence visible and costly.
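A sketch of how gap history could be turned into that tendency feedback (field names and the 0.1 threshold are assumptions for illustration):

```python
# Illustrative feedback generation from a history of signed calibration gaps.

def calibration_feedback(gap_history: dict, threshold: float = 0.1) -> dict:
    """Classify vectors the AI consistently over- or under-rates."""
    over, under = [], []
    for vector, gaps in gap_history.items():
        mean_gap = sum(gaps) / len(gaps)
        if mean_gap > threshold:
            over.append(vector)       # consistently self-rated too high
        elif mean_gap < -threshold:
            under.append(vector)      # consistently self-rated too low
    return {"overestimate_tendency": over, "underestimate_tendency": under}

history = {"know": [0.23, 0.20, 0.25], "uncertainty": [-0.25, -0.30]}
print(calibration_feedback(history))
# {'overestimate_tendency': ['know'], 'underestimate_tendency': ['uncertainty']}
```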

The Sycophancy Connection

Calibration and sycophancy are the same problem viewed from different angles:

  • Sycophancy: AI agrees with the user to avoid conflict
  • Overconfidence: AI agrees with itself about its own competence

Both come from the same training pressure: produce the response that seems most helpful and aligned. Grounded verification breaks both patterns by introducing an external reference point that neither the AI nor the user controls.

When the AI says "know = 0.85" and the evidence says "know = 0.62", there's no way to talk your way out of it. The tests failed. The linter found issues. The gap is measured.

Try It

pip install empirica
cd your-project && empirica project-init

# After a work session, check calibration:
empirica postflight-submit - << 'EOF'
{
  "vectors": {"know": 0.85, "uncertainty": 0.10, "change": 0.70},
  "reasoning": "Implemented auth middleware, tests passing"
}
EOF

# The POSTFLIGHT output shows:
#   calibration_score: 0.14
#   grounded_coverage: 69%
#   gaps: know overestimate by 0.23, uncertainty underestimate by 0.25
#   sources: pytest, ruff, git_diff, artifacts, prose_quality

The calibration loop runs automatically on every POSTFLIGHT. No extra commands needed — just work normally and measure honestly.


Next: *Part 4 — Adding Epistemic Hooks to Your Workflow* — the step-by-step integration tutorial. From pip install to your first measured transaction in 5 minutes.

Empirica on GitHub | Part 1 | Part 2
