# Your Logs Tell You What Happened. They Don't Tell You What Should Have Happened.
Haotian Liu · March 2026
## The gap nobody talks about
Your AI agent ran overnight. The result is wrong. You open the terminal — and you see a wall of log lines telling you exactly what the agent did, step by step.
But none of those lines tell you what it was supposed to do. And none of them tell you where it started going off the rails.
This is not a logging problem. This is a structural gap in how we think about agent observability.
Logs record the execution. They do not record the intent. Without intent, you cannot measure deviation. Without measuring deviation, you are not auditing — you are just collecting noise.
## Why existing tools don't solve this
Tools like LangSmith, Langfuse, and Arize are genuinely useful for what they do: tracing execution, tracking latency and cost, visualizing call chains. If you need to know how long your agent took or how many tokens it consumed, these tools are excellent.
But they are built on a flat timeline model. They record what happened. They do not record what the system intended to happen. And crucially, most of them evaluate output quality using another LLM as a judge.
This is the paradox: a probabilistic system cannot render a certain judgment about another probabilistic system. An LLM evaluator is itself uncertain; its output varies between runs. Using it to audit an agent is like asking one suspect to verify another suspect's alibi.
You cannot build forensic-grade evidence on probabilistic foundations.
## What causal auditing actually means
The alternative is to separate the question into two parts that can be answered deterministically:
1. What was the agent supposed to do? This is defined explicitly, before runtime, as a set of constraints: no staging URLs in production config, trade amount below 500, file writes only within the project directory.
2. What did the agent actually do, and how far did it deviate? This is recorded at runtime by comparing every action against the pre-defined constraints.
Neither question requires an LLM to answer. Both can be answered by deterministic, mathematical comparison.
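To make that concrete, here is a minimal sketch of what a deterministic check can look like. The names here (`INTENT_CONTRACT`, `check_action`) are hypothetical, not the K9 API; the point is that the intent contract is plain data and the comparison is pure string and number logic:

```python
# Illustrative sketch (hypothetical names, not the K9 API):
# the intent contract is plain data, defined before runtime.
INTENT_CONTRACT = {
    "deny_content": ["staging.internal"],  # forbidden substrings
    "amount_max": 500,                     # numeric upper bound
}

def check_action(action: dict, contract: dict) -> list:
    """Return the list of violated constraints; empty means compliant."""
    findings = []
    for pattern in contract.get("deny_content", []):
        if pattern in str(action.get("content", "")):
            findings.append(f"content contains forbidden pattern '{pattern}'")
    amount = action.get("amount")
    if amount is not None and amount > contract["amount_max"]:
        findings.append(f"amount {amount} exceeds max {contract['amount_max']}")
    return findings

# Same inputs always produce the same verdict; no LLM involved.
print(check_action({"content": "https://api.market-data.staging.internal/v2"},
                   INTENT_CONTRACT))  # prints the single finding for the staging URL
```

Run it twice, or a thousand times, and the verdict never changes; that repeatability is what makes the record usable as evidence.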
This is the CIEU model — Causal Intent-Execution Unit. Every monitored action produces a five-tuple:
- **X_t** — who acted, and under what conditions
- **U_t** — what the agent actually did
- **Y*_t** — what the agent was supposed to do (the intent contract)
- **Y_t+1** — what actually resulted
- **R_t+1** — how far the outcome diverged from intent, and why
These five fields are written into a local ledger as a hash-chained record. Each record's SHA256 hash is embedded into the next record. Nothing can be silently modified after the fact. The chain is cryptographically verifiable.
This is not a new log format. It is a different category of infrastructure: tamper-evident causal evidence.
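As a rough sketch of the mechanism (not K9's actual ledger code), a hash chain can be built and verified in a few lines of standard-library Python:

```python
# Illustrative sketch (not K9's implementation): each record embeds the
# SHA256 hash of the previous record, so any later edit breaks the chain.
import hashlib
import json

def append_record(ledger: list, record: dict) -> None:
    """Append a record whose hash covers its body plus the previous hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    body = {**record, "prev_hash": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append({**body, "hash": digest})

def verify_chain(ledger: list) -> bool:
    """Recompute every hash; any tampered record invalidates the chain."""
    prev_hash = "0" * 64
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True

chain = []
append_record(chain, {"seq": 1, "U_t": "write_config", "R": 0.0})
append_record(chain, {"seq": 2, "U_t": "write_config", "R": 0.9})
assert verify_chain(chain)          # untouched chain verifies
chain[0]["U_t"] = "edited_later"    # tamper with an old record...
assert not verify_chain(chain)      # ...and verification fails
```

The design choice matters: tampering is not prevented, it is made detectable. Anyone holding the ledger can re-derive every hash and prove whether the history is intact.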
## A real example: three silent writes
On March 4, 2026, during a routine quant backtesting session, Claude Code attempted three times — 41 minutes apart — to write a staging environment URL into a production config file:
```json
{"endpoint": "https://api.market-data.staging.internal/v2/ohlcv"}
```
The syntax was valid. No exception was thrown. A conventional logger would have recorded three "file write" events and moved on — quietly corrupting every subsequent backtest result.
Because the function was instrumented with a CIEU constraint:
```python
@k9(deny_content=["staging.internal"], allowed_paths=["./project/**"])
def write_config(path: str, content: dict) -> bool: ...
```
...all three attempts were flagged immediately, written to the ledger with severity 0.9, and made permanently traceable. The root cause was identified in under a second:
```
k9log trace --last
→ seq=451 VIOLATION _write_file
  finding: content contains forbidden pattern 'staging.internal'
  causal_proof: root cause traced to step #449, chain intact
```
Three attempts. 41 minutes apart. All recorded. All verifiable.
## The three moments when this matters
**When something goes wrong at 3am.** You don't want to read 10,000 log lines. You want to run one command and see: which step deviated, from which constraint, by how much. That is what `k9log trace --last` gives you — in under a second.

**When you need to show proof.** Your agent caused a problem in production. Your client asks what happened. You pull up a terminal screenshot. It could have been edited. Nobody trusts it. A SHA256 hash-chained ledger, verified with `k9log verify-log`, is cryptographic proof that the record has not been tampered with since it was written. That is evidence a screenshot cannot provide.

**When you need approval to deploy.** Your manager asks: what happens if the agent goes out of bounds? Without a concrete answer, the project dies in the approval meeting. With CIEU constraints defined and a verifiable ledger in place, the answer is: every action is measured against explicit rules, deviations are flagged immediately, and the record cannot be altered retroactively. That answer gets projects approved.
## What this looks like in practice
For a Python developer, instrumentation is one decorator:
```python
@k9(deny_content=['staging.internal'], amount={'max': 500})
def execute_trade(symbol: str, amount: float, endpoint: str) -> dict:
    ...
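To illustrate the pattern behind such a decorator (this is a toy re-implementation, not K9's code; `k9_sketch` and `VIOLATIONS` are hypothetical names), the idea is to intercept the call, check the arguments against the declared constraints, and record any violation before the function runs:

```python
# Toy sketch of a constraint decorator (hypothetical names, not K9's code).
import functools

VIOLATIONS: list = []  # stand-in for the real hash-chained ledger

def k9_sketch(deny_content=None, amount=None):
    deny_content = deny_content or []
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # For brevity, only keyword arguments are checked here.
            for name, value in kwargs.items():
                for pattern in deny_content:
                    if pattern in str(value):
                        VIOLATIONS.append({"fn": fn.__name__, "arg": name,
                                           "finding": f"forbidden pattern '{pattern}'"})
            if amount and "amount" in kwargs and kwargs["amount"] > amount["max"]:
                VIOLATIONS.append({"fn": fn.__name__, "arg": "amount",
                                   "finding": f"exceeds max {amount['max']}"})
            return fn(*args, **kwargs)  # record, don't block (sketch policy)
        return wrapper
    return decorator

@k9_sketch(deny_content=["staging.internal"], amount={"max": 500})
def execute_trade(symbol: str, amount: float, endpoint: str) -> dict:
    return {"symbol": symbol, "amount": amount, "endpoint": endpoint}

execute_trade(symbol="BTC", amount=750.0,
              endpoint="https://api.staging.internal/v2")
# Both the amount bound and the content constraint are violated and recorded.
```

Whether a violation is merely recorded or actively blocked is a policy choice; the sketch records and continues, which is enough for the audit trail.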
For a Claude Code user, it is one JSON file in the project root — no code changes required:
```json
{
  "hooks": {
    "PreToolUse": [{"matcher": "*", "hooks": [{"type": "command", "command": "python -m k9log.hook"}]}],
    "PostToolUse": [{"matcher": "*", "hooks": [{"type": "command", "command": "python -m k9log.hook_post"}]}]
  }
}
```
The ledger is stored locally at `~/.k9log/logs/k9log.cieu.jsonl`. No data leaves the machine. No tokens are consumed. No per-event billing.
## The boundary worth stating clearly
CIEU auditing answers one question: did the agent violate the constraints you defined?
It does not answer: did the agent accomplish the goal you gave it? That question requires evaluation of task completion, which is a different problem — and one that legitimately benefits from LLM evaluation. The two approaches are not in competition. CIEU auditing provides the deterministic foundation; higher-level evaluation can be built on top.
The mistake is trying to use a probabilistic evaluator as a substitute for a deterministic record. These are not interchangeable.
## Who is this for
| Scenario | Entry point | Key commands | Notes |
|---|---|---|---|
| ⭐ Claude Code user | One `.claude/settings.json` | `k9log trace` / `stats` | Zero Python required. Every tool call auto-recorded. Unique differentiator vs all competitors. |
| ✅ Python developer | `@k9` decorator | `k9log trace --last` / `report` | One decorator per function. Sync and async both supported. |
| ✅ LangChain agent | `K9CallbackHandler` (3 lines) | `k9log trace` / `verify-log` | Native callback hook. Full CIEU records per tool call. |
| ✅ High-risk business ops | `@k9` + JSON config | `k9log alerts` / `causal` | Finance, config writes, DB ops. Numeric + content constraints. |
| ✅ DevOps / CI pipeline | `@k9` + `ci_check.py` | `ci_check.py` / `verify-log` | Pipeline halts on violation. Exit code non-zero. No manual review. |
| ✅ Small team debugging | Any entry point | `k9log trace --last` / `stats` | Root cause in under a second. No log archaeology required. |
| ✅ Data security | `@k9` `deny_content` | `k9log verify-log` / `report` | File access control. Cryptographic proof of what was touched. |
| ✅ Teaching / tutorials | Any entry point | `k9log report` / `causal` | Easiest audience to reach today. HTML report is shareable. Demo violations visually. |
| 🔲 CrewAI / AutoGen | Wrapper pattern | `k9log trace` | Works via `@k9` on tool functions. Native adapters on roadmap. |
| 🔲 Enterprise compliance | Full audit chain | `k9log verify-log` / `report` | Future use case. Needs organisational trust-building first. |

⭐ = unique differentiator · ✅ = works today · 🔲 = roadmap
## Try it
K9 Audit is open source under AGPL-3.0. The CIEU architecture is covered by U.S. Provisional Patent Application No. 63/981,777.
- GitHub: github.com/liuhaotian2024-prog/K9Audit
- Install: `pip install k9audit-hook`
- Contact: liuhaotian2024@gmail.com
If this resonates with a problem you have hit — or if you think the approach is wrong — I want to hear from you.