Part 4 of the Epistemic AI series. Parts 1-3 explained why measurement matters. Now: how to wire it into your actual workflow.
This is the hands-on article. By the end, you'll have Empirica running in a real project with measured epistemic transactions. Everything here is copy-pasteable.
Prerequisites
- Python 3.10+
- A git repository (any project)
- Claude Code (optional but recommended — gives you the full hook integration)
Step 1: Install
pip install empirica
Verify:
empirica --version
# empirica 1.8.x
Step 2: Initialize Your Project
cd your-project
empirica project-init
This creates .empirica/ in your project root:
.empirica/
├── project.yaml      # Project config (name, evidence profile)
├── config.yaml       # Empirica settings
└── sessions/
    └── sessions.db   # SQLite — all epistemic data lives here
What just happened: Your project is now registered in Empirica's workspace database. Every session, transaction, finding, and calibration score will be tracked here.
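If you ever want to peek inside that database, Python's standard `sqlite3` module is enough. This is a generic inspection sketch, not an Empirica feature; it assumes nothing about the schema and just lists whatever tables exist:

```python
import sqlite3

def list_tables(db_path: str) -> list[str]:
    """Return the table names in an SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# Point it at your project's database to see what Empirica tracks:
# list_tables(".empirica/sessions/sessions.db")
```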
Step 3: Wire Into Claude Code (Recommended)
empirica setup-claude-code
This installs hooks into Claude Code's plugin system:
| Hook | When It Fires | What It Does |
|---|---|---|
| session-init | Conversation starts | Creates session, loads context |
| sentinel-gate | Every tool call | Gates praxic actions behind CHECK |
| pre-compact | Before context compression | Saves epistemic snapshot |
| post-compact | After compression | Restores state, continues transaction |
| session-end | Conversation ends | Auto-POSTFLIGHT if needed |
After this, every Claude Code conversation in this project is automatically measured. No manual commands needed — the hooks handle PREFLIGHT, CHECK gating, and POSTFLIGHT.
The Sentinel: Investigation Before Action
The most important hook is the Sentinel — it intercepts every tool call and checks:
- Is there an open transaction? (PREFLIGHT was run)
- Has CHECK been passed? (Investigation is done)
- Is this a noetic tool (read-only) or praxic (writes/edits)?
Noetic tools (Read, Grep, Glob, search) are always allowed — investigation should never be blocked.
Praxic tools (Edit, Write, Bash commands that modify) require a valid CHECK first. This prevents the AI from jumping straight to implementation without understanding the problem.
Without Sentinel:
User: "Fix the auth bug"
AI: *immediately starts editing files* ← no investigation
With Sentinel:
User: "Fix the auth bug"
AI: *reads code, logs findings* ← forced to investigate
AI: *submits CHECK with what it learned* ← gates the transition
AI: *now allowed to edit* ← acts from understanding
This isn't a bureaucratic slowdown — it's the mechanism that forces the investigation that makes the AI's work better.
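The gating logic reduces to a few lines. This is a minimal sketch, not Empirica's actual implementation; the tool sets are assumptions based on the description above:

```python
NOETIC_TOOLS = {"Read", "Grep", "Glob", "WebSearch"}  # read-only: never blocked
PRAXIC_TOOLS = {"Edit", "Write", "Bash"}              # can modify state: gated

def sentinel_gate(tool: str, transaction_open: bool, check_passed: bool) -> bool:
    """Return True if the tool call may proceed."""
    if tool in NOETIC_TOOLS:
        return True  # investigation is always allowed
    if tool in PRAXIC_TOOLS:
        # praxic actions require an open transaction AND a passed CHECK
        return transaction_open and check_passed
    return False     # unknown tools are denied by default
```

So a `Grep` goes through unconditionally, while an `Edit` before CHECK is rejected — the "forced to investigate" behavior in the dialogue above.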
Step 4: Your First Measured Transaction
If you're NOT using Claude Code (or want to understand the manual flow):
Open the Transaction
empirica session-create --ai-id claude-code
empirica preflight-submit - << 'EOF'
{
  "task_context": "Investigate and fix the auth middleware bug",
  "work_type": "code",
  "vectors": {
    "know": 0.40,
    "uncertainty": 0.50,
    "context": 0.55,
    "clarity": 0.45,
    "do": 0.60,
    "engagement": 0.85
  },
  "reasoning": "Starting auth investigation. Read the bug report but haven't looked at the code yet. Moderate context from project familiarity."
}
EOF
Be honest with the starting vectors. The whole point is measuring the delta — inflating your PREFLIGHT just makes the learning look smaller.
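The honesty point is just arithmetic. A hypothetical sketch (`vector_delta` is not part of Empirica's CLI or API) showing why an inflated PREFLIGHT shrinks the measured learning:

```python
def vector_delta(preflight: dict[str, float], later: dict[str, float]) -> dict[str, float]:
    """Per-vector change between two self-assessments (positive = increase)."""
    return {k: round(later[k] - preflight[k], 2) for k in preflight if k in later}

honest   = {"know": 0.40, "uncertainty": 0.50}
inflated = {"know": 0.70, "uncertainty": 0.20}
check    = {"know": 0.80, "uncertainty": 0.15}

# Same investigation, same CHECK — but the honest baseline shows a know
# delta of 0.40, while the inflated one shows only 0.10.
vector_delta(honest, check)
vector_delta(inflated, check)
```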
Investigate and Log
# What you discover
empirica finding-log \
  --finding "Auth middleware chains Express next() at routes/auth.js:45. JWT validation happens in middleware, not route handler." \
  --impact 0.5

# What you don't know
empirica unknown-log \
  --unknown "How does the session store handle concurrent requests? No locking visible."

# Decisions you make
empirica decision-log \
  --choice "Use httpOnly cookies for refresh tokens instead of localStorage" \
  --rationale "XSS attack surface reduction. localStorage is accessible to any script." \
  --reversibility exploratory \
  --confidence 0.8

# What didn't work
empirica deadend-log \
  --approach "Tried passport.js for JWT auth" \
  --why-failed "Adds 12 dependencies for a problem solvable with 30 lines of middleware"
These aren't just notes — they're grounded evidence that the calibration system uses to verify your self-assessments.
Gate the Transition
empirica check-submit - << 'EOF'
{
  "vectors": {
    "know": 0.80,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.85
  },
  "reasoning": "Investigated auth chain, understand JWT flow, found the bug (session store race condition). Ready to implement fix."
}
EOF
CHECK evaluates whether the vectors are consistent with the evidence you logged. If you claim know: 0.80 but logged zero findings and zero unknowns, it'll flag a rushed assessment.
The decision is either `proceed` (you can start implementing) or `investigate` (go back and learn more).
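A toy version of that consistency check might look like the following. The threshold and return values are illustrative assumptions, not Empirica's real rubric:

```python
def check_decision(know: float, findings: int, unknowns: int) -> str:
    """Crude consistency heuristic: high claimed knowledge needs logged evidence.
    The 0.6 threshold is an illustrative assumption."""
    evidence = findings + unknowns
    if know > 0.6 and evidence == 0:
        return "investigate"  # claims understanding but logged nothing to back it
    return "proceed"
```

With `know: 0.80` and no logged findings or unknowns, this sketch sends you back to investigate; the same claim backed by artifacts proceeds.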
Implement, Then Close
After implementing the fix:
empirica postflight-submit - << 'EOF'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.75,
    "completion": 1.0,
    "do": 0.85
  },
  "reasoning": "Auth middleware fixed. Session store race condition resolved with mutex. Tests passing."
}
EOF
The POSTFLIGHT triggers grounded verification — your self-assessment is compared against deterministic evidence (test results, git diff, linter output, artifact counts). The calibration score measures the gap.
Step 5: Read Your Calibration
The POSTFLIGHT output includes the calibration report:
{
  "calibration_score": 0.14,
  "grounded_coverage": 0.69,
  "phases": {
    "praxic": {
      "gaps": {
        "know": 0.23,
        "uncertainty": -0.25,
        "change": -0.20,
        "coherence": -0.15
      },
      "sources": ["pytest", "ruff", "git_diff", "artifacts", "prose_quality"]
    }
  }
}
Reading the gaps:
- `know: 0.23` — you overestimated knowledge by 0.23 (common)
- `uncertainty: -0.25` — you underestimated uncertainty by 0.25 (also common)
- `change: -0.20` — you underestimated how much you changed (git diff shows more)
- `coherence: -0.15` — code is cleaner than you thought (linter agrees)
Over time, these gaps should shrink. If they don't, the AI isn't learning to predict its own performance — it's just getting more confident without getting more accurate.
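Checking whether the gaps are shrinking is a one-liner over your session history. A hypothetical sketch (`gap_trend` is not an Empirica command; lower calibration score means better calibrated):

```python
def gap_trend(scores: list[float]) -> str:
    """Compare mean calibration score of the first half of sessions vs the second.
    Assumes at least two scores; 'improving' means the average gap is shrinking."""
    half = len(scores) // 2
    early = sum(scores[:half]) / half
    late = sum(scores[half:]) / (len(scores) - half)
    return "improving" if late < early else "stagnating"
```

Feed it your per-session `calibration_score` values: a falling average means the AI is learning to predict its own performance.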
Step 6: Check Your Diagnostic
If anything isn't working:
empirica diagnose
This runs 11 health checks:
✅ Python version: 3.13.7 (>= 3.10)
✅ empirica CLI on PATH
✅ Claude config dir exists (~/.claude/)
✅ Plugin files installed
✅ settings.json valid
✅ Statusline configured
✅ Hooks registered (6/6)
✅ Marketplace registered
✅ Statusline runnable
✅ Project initialized (.empirica/ found)
✅ Active session in DB
If any check fails, the output includes the exact fix command.
What You Get
After a few sessions, you'll have:
- Calibration trajectory — are your estimates getting more accurate?
- Artifact history — findings, unknowns, dead-ends, decisions, all searchable
- Learning deltas — measurable improvement (or stagnation) per transaction
- Grounded evidence — objective measurement that doesn't depend on self-report
- Cross-session persistence — learning survives context compaction
This is epistemic infrastructure. Not a prompt. Not a wrapper. Measurement that makes the invisible visible.
*Next and final: **Part 5 — The Prosodic Memory Layer** — how AI learns your communication patterns and adapts its voice to different platforms.*
Empirica on GitHub | Part 1 | Part 2 | Part 3