DEV Community

David Van Assche (S.L)
Measuring What Your AI Learned: Epistemic Vectors in Practice

Part 2 of the Epistemic AI series. In Part 1, we defined the problem: AI tools don't track what they know. Here, we make it measurable.

When we talk about "what the AI knows," we're not being metaphorical. Knowledge has structure, and that structure is measurable — not perfectly, but well enough to catch the failures that matter.

The 13 Epistemic Vectors

Empirica tracks 13 dimensions of an AI's knowledge state. Not as a gimmick — each vector maps to a specific class of failure you've seen in practice:

vectors = {
    "know":        0.65,  # Domain understanding
    "uncertainty": 0.35,  # What I DON'T know (explicit!)
    "context":     0.70,  # Surrounding state awareness
    "clarity":     0.80,  # How clear the path forward is
    "coherence":   0.75,  # Internal consistency
    "signal":      0.60,  # Information quality vs noise
    "density":     0.55,  # Relevant knowledge per unit context
    "state":       0.70,  # Current system/project state
    "change":      0.40,  # Amount of change made
    "completion":  0.30,  # Progress toward goal
    "impact":      0.65,  # Significance of work
    "engagement":  0.85,  # How actively working the problem
    "do":          0.70,  # Ability to execute
}

Why 13? Because we kept finding failure modes that weren't captured:

  • know without uncertainty = overconfident AI
  • clarity without signal = clear path built on noise
  • completion without change = claiming done but nothing happened
  • engagement without do = actively spinning without capability

Each pair creates a tension that prevents gaming. You can't claim high know while uncertainty is also high — the measurement catches the contradiction.
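In Python, the kind of cross-check these tension pairs imply might look like the following sketch. The thresholds and the function name are illustrative assumptions, not Empirica's actual API:

```python
def find_contradictions(v: dict[str, float]) -> list[str]:
    """Flag vector combinations that suggest a gamed or
    contradictory self-assessment. Thresholds are illustrative."""
    flags = []
    if v["know"] > 0.7 and v["uncertainty"] > 0.5:
        flags.append("high know alongside high uncertainty")
    if v["clarity"] > 0.7 and v["signal"] < 0.3:
        flags.append("clear path built on noisy signal")
    if v["completion"] > 0.7 and v["change"] < 0.2:
        flags.append("claims completion but little change")
    if v["engagement"] > 0.7 and v["do"] < 0.3:
        flags.append("active engagement without ability to execute")
    return flags

flags = find_contradictions({
    "know": 0.9, "uncertainty": 0.6, "clarity": 0.8, "signal": 0.6,
    "completion": 0.3, "change": 0.4, "engagement": 0.85, "do": 0.7,
})
# → ["high know alongside high uncertainty"]
```

The point is that each check needs two vectors: no single number can be inflated without another number exposing it.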

The Transaction Lifecycle

Vectors aren't static. They change as the AI works. The epistemic transaction is the measurement window:

PREFLIGHT → [investigate] → CHECK → [implement] → POSTFLIGHT

PREFLIGHT: Declare Your Baseline

Before starting work, the AI declares what it thinks it knows:

empirica preflight-submit - << 'EOF'
{
  "task_context": "Implement JWT auth middleware",
  "vectors": {
    "know": 0.45,
    "uncertainty": 0.40,
    "context": 0.60,
    "clarity": 0.50
  },
  "reasoning": "Read the route definitions but haven't explored 
    the middleware chain yet. Moderate context from project structure."
}
EOF

This is the starting measurement. It's a prediction: "Here's how well I think I understand this before investigating."

Investigation Phase (Noetic)

The AI reads code, searches patterns, builds understanding. Everything it discovers gets logged:

# What you learned
empirica finding-log --finding "Auth middleware uses Express next() 
  pattern at routes/auth.js:45" --impact 0.5

# What you don't know
empirica unknown-log --unknown "How are user roles differentiated? 
  No role field in JWT payload schema."

# What didn't work
empirica deadend-log --approach "Tried passport.js integration" \
  --why-failed "Too heavy for JWT-only auth, would add 12 dependencies"

These aren't just notes — they're grounded evidence that the calibration system uses to verify self-assessments.

CHECK: Gate the Transition

empirica check-submit - << 'EOF'
{
  "vectors": {
    "know": 0.82,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.88
  },
  "reasoning": "Investigated middleware chain, understand JWT flow, 
    found role definitions in JWT claims. Ready to implement."
}
EOF

The system evaluates: did the vectors change in a way that's consistent with the evidence logged? If the AI claims know: 0.82 but logged zero findings and zero unknowns, that's a rushed assessment — the gate catches it.

This is the critical insight: you can't skip investigation and go straight to acting. The measurement forces understanding before execution.
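A minimal sketch of the kind of consistency check the gate could run, assuming the evidence counts come from the logged findings and unknowns (the threshold and function name are mine, not Empirica's):

```python
def gate_check(pre: dict, check: dict, findings: int, unknowns: int) -> bool:
    """Pass the CHECK gate only if a claimed knowledge gain is
    backed by logged evidence. The 0.2 threshold is illustrative."""
    know_gain = check["know"] - pre["know"]
    evidence = findings + unknowns
    if know_gain > 0.2 and evidence == 0:
        return False  # claims learning, but nothing was logged
    return True

pre = {"know": 0.45}
check = {"know": 0.82}
gate_check(pre, check, findings=0, unknowns=0)  # → False: rushed assessment
gate_check(pre, check, findings=5, unknowns=3)  # → True
```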

POSTFLIGHT: Measure the Learning

After implementation:

empirica postflight-submit - << 'EOF'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.80,
    "completion": 1.0
  },
  "reasoning": "Auth middleware implemented with role guards. 
    Unit tests passing. Learned about Express 5 async changes."
}
EOF

The delta between PREFLIGHT and POSTFLIGHT is the learning:

know:        0.45 → 0.90  (+0.45)  # Learned a lot
uncertainty: 0.40 → 0.08  (-0.32)  # Resolved most unknowns
change:      0.00 → 0.80  (+0.80)  # Made substantial changes
completion:  0.00 → 1.00  (+1.00)  # Goal met
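Computing that delta is straightforward; a sketch, using the PREFLIGHT and POSTFLIGHT values above:

```python
def learning_delta(pre: dict, post: dict) -> dict:
    """Per-vector change between PREFLIGHT and POSTFLIGHT.
    Vectors absent from a snapshot default to 0.0."""
    keys = set(pre) | set(post)
    return {k: round(post.get(k, 0.0) - pre.get(k, 0.0), 2) for k in keys}

delta = learning_delta(
    {"know": 0.45, "uncertainty": 0.40, "change": 0.0, "completion": 0.0},
    {"know": 0.90, "uncertainty": 0.08, "change": 0.80, "completion": 1.0},
)
# delta["know"] == 0.45, delta["uncertainty"] == -0.32
```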

This delta IS the measurement. Over time, you can see:

  • Does the AI consistently overestimate its starting knowledge?
  • Does it underestimate uncertainty?
  • Do its estimates get more accurate across sessions?

Grounded Verification: The Part That Keeps It Honest

Self-assessment alone is self-serving. The grounded verification layer compares the AI's claims against deterministic evidence:

# AI claims: know=0.90, change=0.80
# Grounded evidence:
evidence = {
    "test_results": {"passed": 42, "failed": 3},     # 3 failures!
    "ruff_violations": 2,                              # lint issues
    "git_diff_lines": 156,                            # real change metric
    "findings_logged": 5,                              # investigation breadth
    "unknowns_resolved": 3,                            # learning evidence
}

# Grounded calibration:
# - test failures → know is probably ~0.75, not 0.90
# - git diff confirms change=0.80 is reasonable
# - 5 findings + 3 resolved unknowns → investigation was real

The calibration score measures the distance between self-assessment and grounded evidence. A score of 0.0 means perfect calibration. In practice, we see scores of 0.10-0.30 — the AI is usually overconfident, and the grounded layer catches it.
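One simple way to realize "distance between self-assessment and grounded evidence" is a mean absolute gap over the vectors that have grounded coverage. This is a sketch of the idea, not necessarily the exact formula Empirica uses:

```python
def calibration_score(claimed: dict, grounded: dict) -> float:
    """Mean absolute gap between self-assessed and evidence-derived
    vectors, over the vectors both sides cover. 0.0 = perfect."""
    shared = set(claimed) & set(grounded)
    return sum(abs(claimed[k] - grounded[k]) for k in shared) / len(shared)

score = calibration_score(
    {"know": 0.82, "uncertainty": 0.15, "coherence": 0.75},
    {"know": 0.49, "uncertainty": 0.28, "coherence": 0.95},
)
# → ~0.22: the gaps (0.33, 0.13, 0.20) average out
```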

What This Looks Like in Practice

Here's a real POSTFLIGHT from an Empirica session (edited for clarity):

Calibration score: 0.134
Grounded coverage: 69.2%

Gaps:
  know:        overestimate by 0.33  (claimed 0.82, evidence shows 0.49)
  uncertainty: underestimate by 0.13 (claimed 0.15, evidence shows 0.28)
  coherence:   underestimate by 0.20 (claimed 0.75, evidence shows 0.95)

Sources: artifacts, codebase_model, prose_quality, 
         document_metrics, source_quality, action_verification
Sources failed: []  (all evidence collectors healthy)

The AI was overestimating its knowledge and underestimating its uncertainty — the most common pattern. But now we can see it, which means we can correct for it in the next transaction.
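"Correcting for it" can be as simple as subtracting the observed bias from the next session's raw self-assessment. A hypothetical sketch, assuming the signed gaps from the report above:

```python
def corrected_preflight(raw: dict, gaps: dict) -> dict:
    """Nudge a raw self-assessment by the bias observed in the last
    grounded report (positive gap = overestimate), clamped to [0, 1]."""
    return {k: max(0.0, min(1.0, raw[k] - gaps.get(k, 0.0))) for k in raw}

next_pre = corrected_preflight(
    {"know": 0.60, "uncertainty": 0.20},
    {"know": 0.33, "uncertainty": -0.13},  # gaps from the grounded report
)
# know is pulled down, uncertainty is pushed up
```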

Try It

pip install empirica
cd your-project
empirica project-init
empirica setup-claude-code

# Start a measured session:
empirica session-create --ai-id claude-code
# → Opens transaction, gates investigation before action

The framework is open source, the measurement is real, and the calibration improves over time. Not because the model gets better — because the measurement infrastructure makes overconfidence visible.


Next in the series: *Part 3 — Grounded Calibration vs Self-Assessment* — why the AI's self-report is structurally unreliable and how deterministic evidence changes the game.

Empirica on GitHub | Part 1: Your AI Doesn't Know What It Doesn't Know
