Part 4 of the Epistemic AI series. Parts 1-3 explained why measurement matters. Now: how to wire it into your actual workflow.
This is the hands-on article. By the end, you'll have Empirica running in a real project with measured epistemic transactions. Everything here is copy-pasteable.
Prerequisites
- Python 3.10+
- A git repository (any project)
- Claude Code (optional but recommended — gives you the full hook integration)
Step 1: Install
pip install empirica
Verify:
empirica --version
# empirica 1.8.x
Step 2: Initialize Your Project
cd your-project
empirica project-init
This creates .empirica/ in your project root:
.empirica/
├── project.yaml      # Project config (name, evidence profile)
├── config.yaml       # Empirica settings
└── sessions/
    └── sessions.db   # SQLite — all epistemic data lives here
What just happened: Your project is now registered in Empirica's workspace database. Every session, transaction, finding, and calibration score will be tracked here.
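If you ever want to peek inside that database, Python's standard `sqlite3` module is enough. This is a generic inspection sketch, not an Empirica feature; it assumes nothing about the schema and just lists whatever tables exist:

```python
import sqlite3

def list_tables(db_path: str) -> list[str]:
    """Return the table names in an SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# Point it at your project's database to see what Empirica tracks:
# list_tables(".empirica/sessions/sessions.db")
```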
Step 3: Wire Into Claude Code (Recommended)
empirica setup-claude-code
This installs hooks into Claude Code's plugin system:
| Hook | When It Fires | What It Does |
|---|---|---|
| session-init | Conversation starts | Creates session, loads context |
| sentinel-gate | Every tool call | Gates praxic actions behind CHECK |
| pre-compact | Before context compression | Saves epistemic snapshot |
| post-compact | After compression | Restores state, continues transaction |
| session-end | Conversation ends | Auto-POSTFLIGHT if needed |
After this, every Claude Code conversation in this project is automatically measured. No manual commands needed — the hooks handle PREFLIGHT, CHECK gating, and POSTFLIGHT.
The Sentinel: Investigation Before Action
The most important hook is the Sentinel — it intercepts every tool call and checks:
- Is there an open transaction? (PREFLIGHT was run)
- Has CHECK been passed? (Investigation is done)
- Is this a noetic tool (read-only) or praxic (writes/edits)?
Noetic tools (Read, Grep, Glob, search) are always allowed — investigation should never be blocked.
Praxic tools (Edit, Write, Bash commands that modify) require a valid CHECK first. This prevents the AI from jumping straight to implementation without understanding the problem.
Without Sentinel:
User: "Fix the auth bug"
AI: *immediately starts editing files* ← no investigation
With Sentinel:
User: "Fix the auth bug"
AI: *reads code, logs findings* ← forced to investigate
AI: *submits CHECK with what it learned* ← gates the transition
AI: *now allowed to edit* ← acts from understanding
This isn't a bureaucratic slowdown — it's the mechanism that forces the investigation that makes the AI's work better.
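The gating logic reduces to a few lines. This is a minimal sketch, not Empirica's actual implementation; the tool sets are assumptions based on the description above:

```python
NOETIC_TOOLS = {"Read", "Grep", "Glob", "WebSearch"}  # read-only: never blocked
PRAXIC_TOOLS = {"Edit", "Write", "Bash"}              # can modify state: gated

def sentinel_gate(tool: str, transaction_open: bool, check_passed: bool) -> bool:
    """Return True if the tool call may proceed."""
    if tool in NOETIC_TOOLS:
        return True  # investigation is always allowed
    if tool in PRAXIC_TOOLS:
        # praxic actions require an open transaction AND a passed CHECK
        return transaction_open and check_passed
    return False     # unknown tools are denied by default
```

So a `Grep` goes through unconditionally, while an `Edit` before CHECK is rejected — the "forced to investigate" behavior in the dialogue above.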
Step 4: Your First Measured Transaction
If you're NOT using Claude Code (or want to understand the manual flow):
Open the Transaction
empirica session-create --ai-id claude-code
empirica preflight-submit - << 'EOF'
{
  "task_context": "Investigate and fix the auth middleware bug",
  "work_type": "code",
  "vectors": {
    "know": 0.40,
    "uncertainty": 0.50,
    "context": 0.55,
    "clarity": 0.45,
    "do": 0.60,
    "engagement": 0.85
  },
  "reasoning": "Starting auth investigation. Read the bug report but haven't looked at the code yet. Moderate context from project familiarity."
}
EOF
Be honest with the starting vectors. The whole point is measuring the delta — inflating your PREFLIGHT just makes the learning look smaller.
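The honesty point is just arithmetic. A hypothetical sketch (`vector_delta` is not part of Empirica's CLI or API) showing why an inflated PREFLIGHT shrinks the measured learning:

```python
def vector_delta(preflight: dict[str, float], later: dict[str, float]) -> dict[str, float]:
    """Per-vector change between two self-assessments (positive = increase)."""
    return {k: round(later[k] - preflight[k], 2) for k in preflight if k in later}

honest   = {"know": 0.40, "uncertainty": 0.50}
inflated = {"know": 0.70, "uncertainty": 0.20}
check    = {"know": 0.80, "uncertainty": 0.15}

# Same investigation, same CHECK — but the honest baseline shows a know
# delta of 0.40, while the inflated one shows only 0.10.
vector_delta(honest, check)
vector_delta(inflated, check)
```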
Investigate and Log
# What you discover
empirica finding-log \
  --finding "Auth middleware chains Express next() at routes/auth.js:45. JWT validation happens in middleware, not route handler." \
  --impact 0.5

# What you don't know
empirica unknown-log \
  --unknown "How does the session store handle concurrent requests? No locking visible."

# Decisions you make
empirica decision-log \
  --choice "Use httpOnly cookies for refresh tokens instead of localStorage" \
  --rationale "XSS attack surface reduction. localStorage is accessible to any script." \
  --reversibility exploratory \
  --confidence 0.8

# What didn't work
empirica deadend-log \
  --approach "Tried passport.js for JWT auth" \
  --why-failed "Adds 12 dependencies for a problem solvable with 30 lines of middleware"
These aren't just notes — they're grounded evidence that the calibration system uses to verify your self-assessments.
Gate the Transition
empirica check-submit - << 'EOF'
{
  "vectors": {
    "know": 0.80,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.85
  },
  "reasoning": "Investigated auth chain, understand JWT flow, found the bug (session store race condition). Ready to implement fix."
}
EOF
CHECK evaluates whether the vectors are consistent with the evidence you logged. If you claim know: 0.80 but logged zero findings and zero unknowns, it'll flag a rushed assessment.
The decision is either `proceed` (you can start implementing) or `investigate` (go back and learn more).
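A toy version of that consistency check might look like the following. The threshold and return values are illustrative assumptions, not Empirica's real rubric:

```python
def check_decision(know: float, findings: int, unknowns: int) -> str:
    """Crude consistency heuristic: high claimed knowledge needs logged evidence.
    The 0.6 threshold is an illustrative assumption."""
    evidence = findings + unknowns
    if know > 0.6 and evidence == 0:
        return "investigate"  # claims understanding but logged nothing to back it
    return "proceed"
```

With `know: 0.80` and no logged findings or unknowns, this sketch sends you back to investigate; the same claim backed by artifacts proceeds.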
Implement, Then Close
After implementing the fix:
empirica postflight-submit - << 'EOF'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.75,
    "completion": 1.0,
    "do": 0.85
  },
  "reasoning": "Auth middleware fixed. Session store race condition resolved with mutex. Tests passing."
}
EOF
The POSTFLIGHT triggers grounded verification — your self-assessment is compared against deterministic evidence (test results, git diff, linter output, artifact counts). The calibration score measures the gap.
Step 5: Read Your Calibration
The POSTFLIGHT output includes the calibration report:
{
  "calibration_score": 0.14,
  "grounded_coverage": 0.69,
  "phases": {
    "praxic": {
      "gaps": {
        "know": 0.23,
        "uncertainty": -0.25,
        "change": -0.20,
        "coherence": -0.15
      },
      "sources": ["pytest", "ruff", "git_diff", "artifacts", "prose_quality"]
    }
  }
}
Reading the gaps:
- `know: 0.23` — you overestimated knowledge by 0.23 (common)
- `uncertainty: -0.25` — you underestimated uncertainty by 0.25 (also common)
- `change: -0.20` — you underestimated how much you changed (git diff shows more)
- `coherence: -0.15` — code is cleaner than you thought (linter agrees)
Over time, these gaps should shrink. If they don't, the AI isn't learning to predict its own performance — it's just getting more confident without getting more accurate.
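Checking whether the gaps are shrinking is a one-liner over your session history. A hypothetical sketch (`gap_trend` is not an Empirica command; lower calibration score means better calibrated):

```python
def gap_trend(scores: list[float]) -> str:
    """Compare mean calibration score of the first half of sessions vs the second.
    Assumes at least two scores; 'improving' means the average gap is shrinking."""
    half = len(scores) // 2
    early = sum(scores[:half]) / half
    late = sum(scores[half:]) / (len(scores) - half)
    return "improving" if late < early else "stagnating"
```

Feed it your per-session `calibration_score` values: a falling average means the AI is learning to predict its own performance.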
Step 6: Check Your Diagnostic
If anything isn't working:
empirica diagnose
This runs 11 health checks:
✅ Python version: 3.13.7 (>= 3.10)
✅ empirica CLI on PATH
✅ Claude config dir exists (~/.claude/)
✅ Plugin files installed
✅ settings.json valid
✅ Statusline configured
✅ Hooks registered (6/6)
✅ Marketplace registered
✅ Statusline runnable
✅ Project initialized (.empirica/ found)
✅ Active session in DB
If any check fails, the output includes the exact fix command.
What You Get
After a few sessions, you'll have:
- Calibration trajectory — are your estimates getting more accurate?
- Artifact history — findings, unknowns, dead-ends, decisions, all searchable
- Learning deltas — measurable improvement (or stagnation) per transaction
- Grounded evidence — objective measurement that doesn't depend on self-report
- Cross-session persistence — learning survives context compaction
This is epistemic infrastructure. Not a prompt. Not a wrapper. Measurement that makes the invisible visible.
*Next and final: **Part 5 — The Prosodic Memory Layer** — how AI learns your communication patterns and adapts its voice to different platforms.*
Empirica on GitHub | Part 1 | Part 2 | Part 3