DEV Community: liuhaotian2024-prog

The CIEU Five-Tuple: Why I Modeled AI Agent Logs as Causal Units

liuhaotian2024-prog — Wed, 11 Mar 2026 14:03:56 +0000

This is a follow-up to "Why Auditing AI Agents Requires Causal AI, Not Another LLM." That post explained the "why." This one explains the "what" and "how."

When I was debugging the incident that led me to build K9 Audit, I had logs. Plenty of them. Timestamps, tool call names, outputs, token counts. Everything a standard observability tool would give you.

None of it told me what went wrong.

The agent had been corrupting my staging environment for 41 minutes. The logs showed every action it took. What they didn't show was the moment the agent's intent diverged from its actual execution — the causal break that turned a routine deploy task into a data corruption event.

That gap is exactly what the CIEU (Causal Intent-Execution Unit) is designed to capture.

What's Wrong With Event Logs

Standard agent logs record events: tool X was called, output was Y, latency was Z ms.

This is useful for performance monitoring. It's nearly useless for behavioral auditing.

Here's why: an agent can execute every tool call successfully, produce outputs that look valid in isolation, and still be pursuing the wrong goal — quietly, for as long as you let it run. Event logs will show green across the board.

The question you actually need to answer during a post-mortem isn't "did the tool call succeed?" It's: "at this step, did the agent do what it said it was going to do?"

To answer that, you need to have captured what the agent said it was going to do before it acted.

The Five-Tuple

Each CIEU is a record of one atomic agent step, structured as:

CIEU = (X_t, U_t, Y*_t, Y_t+1, R_t+1)

Let me walk through each component.

X_t — Context at time t

The observable state the agent had access to when it formed its intent. This typically includes:

The current task description
Any tool outputs from the previous step
Relevant memory or retrieved context

Why log this? Because the same intent expressed in different contexts means different things. You need X_t to evaluate whether U_t was a reasonable response to the situation.

U_t — Intent at time t

The agent's stated goal or plan for the current step, before it executes anything.

In practice, this is the reasoning trace — what the agent says it's about to do and why. With chain-of-thought models, this is often surfaced explicitly. With tool-use models, you can extract it from the pre-action scratchpad.

Why log this? This is the baseline against which execution gets evaluated. Without it, you have no reference point for detecting drift.

Y*_t — Expected output at time t

The output the agent predicted or described expecting, given its intent.

Sometimes this is explicit ("I will write the following SQL query..."). Sometimes it's implicit and has to be inferred from U_t. K9 Audit handles both cases — if Y*_t is explicit in the trace, it's captured directly; if not, it's reconstructed from U_t.

Why log this? Y*_t creates a testable prediction. If Y_t+1 diverges from Y*_t significantly, something went wrong between intent and execution.

Y_t+1 — Actual output at time t+1

What the agent actually produced or executed. This is what standard logs already capture.

The difference is that in CIEU, Y_t+1 only has meaning in relation to Y*_t and U_t. Logging it in isolation tells you nothing about whether behavior was correct.

R_t+1 — Deviation score at time t+1

A scalar measure of how much Y_t+1 diverged from Y*_t, given X_t and U_t.

R_t+1 = divergence(Y_t+1, Y*_t | X_t, U_t)

R_t+1 is computed automatically at logging time. It doesn't require a human reviewer or an LLM judge. It's a deterministic function over the logged data.

This is the key signal for real-time alerting. When R_t+1 crosses a threshold, something deviated from plan. That's when you want to be paged.
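
This post does not spell out the divergence function K9 Audit uses, so the following is only a minimal sketch of what a deterministic score and threshold check could look like. It assumes the expected and actual outputs are flat dictionaries that share key names. The names `field_divergence` and `should_alert`, the equal weighting of fields, and the 0.5 threshold are all hypothetical choices for illustration, not the library's method.

```python
from typing import Any, Mapping


def field_divergence(expected: Mapping[str, Any], actual: Mapping[str, Any]) -> float:
    """Return the fraction of expected fields that the actual output fails to match.

    0.0 means every expected field matched; 1.0 means none did.
    Hypothetical scoring rule with equal weight per field.
    """
    if not expected:
        return 0.0  # nothing was predicted, so nothing can diverge
    mismatches = sum(1 for key, value in expected.items() if actual.get(key) != value)
    return mismatches / len(expected)


def should_alert(score: float, threshold: float = 0.5) -> bool:
    """Signal an alert when the deviation score reaches the (assumed) threshold."""
    return score >= threshold


expected = {"env": "staging-02", "action": "write"}
actual = {"env": "production-01", "action": "write"}
score = field_divergence(expected, actual)
print(score, should_alert(score))  # 0.5 True
```

An unweighted rule like this scores an environment mismatch at 0.5, which is lower than the 0.94 in the sample record later in this post, so the real scoring presumably weights fields differently. The property that matters here is that the same inputs always give the same score.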

What It Looks Like in Practice

Here's a minimal example using K9 Audit:

from k9log import k9

# Wrap any agent step with @k9
@k9(skill="deploy_to_staging", constraint_file="~/.k9log/config/deploy.json")
def deploy_artifact(artifact_path: str, target_env: str) -> dict:
    # agent logic here; run_deploy stands in for your own deploy routine
    result = run_deploy(artifact_path, target_env)
    return result

Each time deploy_artifact is called, K9 captures a full CIEU:

{
  "cieu_id": "cieu_20260311_143022_a3f1",
  "X_t": {
    "task": "deploy build artifact to staging-02",
    "previous_output": {"status": "build_passed", "artifact": "app-v2.3.1.tar.gz"},
    "context_snapshot": "..."
  },
  "U_t": "Deploy app-v2.3.1.tar.gz to staging-02. Target environment verified as non-production.",
  "Y_star_t": {"expected_env": "staging-02", "expected_action": "write"},
  "Y_t1": {"actual_env": "production-01", "actual_action": "write"},
  "R_t1": 0.94,
  "timestamp": "2026-03-11T14:30:22Z",
  "hash": "sha256:8f3a..."
}

R_t+1 of 0.94 means near-total divergence from stated intent. In my incident, this kind of record would have fired an alert after the first wrong action, not 41 minutes later.
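
If you want to consume records like the one above in your own tooling, a typed container makes the five fields explicit. This is a sketch only: the class name `CIEURecord` and its `from_json` helper are hypothetical, and the field names are copied from the sample record, not from a published schema.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class CIEURecord:
    """One CIEU five-tuple plus the metadata shown in the sample record."""
    cieu_id: str
    X_t: dict[str, Any]       # context the agent saw
    U_t: str                  # stated intent
    Y_star_t: dict[str, Any]  # expected output
    Y_t1: dict[str, Any]      # actual output
    R_t1: float               # deviation score
    timestamp: str
    hash: str

    @classmethod
    def from_json(cls, line: str) -> "CIEURecord":
        """Parse one JSON object; raises TypeError if keys are missing or unexpected."""
        return cls(**json.loads(line))


# A shortened copy of the sample record, serialized as a single JSON line.
sample_line = json.dumps({
    "cieu_id": "cieu_20260311_143022_a3f1",
    "X_t": {"task": "deploy build artifact to staging-02"},
    "U_t": "Deploy app-v2.3.1.tar.gz to staging-02.",
    "Y_star_t": {"expected_env": "staging-02", "expected_action": "write"},
    "Y_t1": {"actual_env": "production-01", "actual_action": "write"},
    "R_t1": 0.94,
    "timestamp": "2026-03-11T14:30:22Z",
    "hash": "sha256:8f3a...",
})

record = CIEURecord.from_json(sample_line)
print(record.Y_star_t["expected_env"], "->", record.Y_t1["actual_env"], record.R_t1)
```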

Reading the Audit Trail

The CLI gives you the causal view:

k9log causal --last 10

Output:

Step  Intent                           Expected        Actual          R_t+1
---   -----                            --------        ------          -----
t-9   deploy artifact to staging-02   staging-02      staging-02      0.02   ✓
t-8   run smoke tests                 pass            pass            0.01   ✓
t-7   tag release candidate           staging         staging         0.03   ✓
t-6   deploy artifact to staging-02   staging-02      production-01   0.94   ⚠️  ← HERE
t-5   verify deployment               staging         production      0.91   ⚠️
...

The deviation started at t-6. Everything before it was clean. This is the kind of signal that would have stopped the incident 39 minutes sooner.
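
Locating that first deviating step is a simple scan once each step carries a score. The sketch below uses the step labels and scores from the table above; the function name `first_deviation` and the 0.5 threshold are assumptions for illustration.

```python
from typing import Iterable, Optional


def first_deviation(scores: Iterable[tuple[str, float]], threshold: float = 0.5) -> Optional[str]:
    """Return the label of the earliest step whose score reaches the threshold."""
    for step, score in scores:
        if score >= threshold:
            return step
    return None  # no step deviated


trail = [("t-9", 0.02), ("t-8", 0.01), ("t-7", 0.03), ("t-6", 0.94), ("t-5", 0.91)]
print(first_deviation(trail))  # t-6
```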

Why Not Use an LLM to Judge Deviation?

I get this question a lot.

Using an LLM to evaluate another LLM's behavior introduces a second failure surface. The auditor shares the same failure modes as the agent: it can be prompted, it can hallucinate, its evaluations aren't reproducible. You'd need to audit the auditor.

R_t+1 is a deterministic function. Given the same CIEU record, it always produces the same score. It's computable offline, without API calls, with no latency cost at audit time. And it can be verified independently, which matters for the record-keeping obligations in EU AI Act Article 12, where you may need to demonstrate to a regulator that your logging system actually captures what it claims to capture.

The Ledger

All CIEUs are appended to a tamper-evident ledger at:

~/.k9log/logs/k9log.cieu.jsonl

Each entry is hash-chained to the previous one. You can verify integrity at any time:

k9log verify-log
# ✓ Chain intact: 847 records verified

If any record has been modified or deleted, the chain breaks and verify-log will tell you exactly where.
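
To make the hash-chaining idea concrete, here is a minimal in-memory sketch. It is not K9 Audit's ledger format: the record layout (`body`, `prev_hash`, `hash`), the all-zero genesis value, and the function names are assumptions chosen for illustration.

```python
import hashlib
import json
from typing import Any, Optional

GENESIS = "0" * 64  # assumed placeholder used as the first record's prev_hash


def record_hash(body: dict[str, Any], prev_hash: str) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the body."""
    payload = prev_hash + json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(chain: list[dict[str, Any]], body: dict[str, Any]) -> None:
    """Append a record whose hash covers its body and the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"body": body, "prev_hash": prev_hash, "hash": record_hash(body, prev_hash)})


def first_broken_index(chain: list[dict[str, Any]]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None if intact."""
    prev_hash = GENESIS
    for index, record in enumerate(chain):
        if record["prev_hash"] != prev_hash:
            return index  # link to the previous record is broken
        if record["hash"] != record_hash(record["body"], prev_hash):
            return index  # body no longer matches its stored hash
        prev_hash = record["hash"]
    return None


chain: list[dict[str, Any]] = []
for step in ("deploy", "smoke_test", "tag_release"):
    append_record(chain, {"step": step})

print(first_broken_index(chain))  # None: chain intact
chain[1]["body"]["step"] = "edited"
print(first_broken_index(chain))  # 1: the modified record
```

Note that this sketch catches edits and deletions in the middle of the chain but not truncation of the tail; detecting that needs an external anchor such as a separately stored record count or head hash.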

What CIEU Is Not

To be clear about scope:

It does not prevent the agent from taking wrong actions. It detects and records them.
It does not replace access controls, sandboxing, or human oversight for high-risk operations.
It does not work without instrumentation — you have to wrap your agent functions with @k9 or use one of the integration entry points.

The constraint validation layer (via constraint_file) is a separate feature that does block out-of-bounds actions before they execute. But that's a topic for a separate post.

Get Started

pip install k9audit-hook

from k9log import k9

@k9(skill="my_agent_step")
def my_function(input_data):
    # your agent logic goes here; this placeholder just echoes the input
    result = input_data
    return result

The CIEU ledger starts building immediately. Run k9log stats to see what's been captured.

GitHub: https://github.com/liuhaotian2024-prog/K9Audit

Questions about the design, or something you'd want CIEU to capture that it currently doesn't? Drop a comment — I read everything.

Tags: aiagents python opensource devtools

Why auditing AI agents requires causal AI, not another LLM

liuhaotian2024-prog — Tue, 10 Mar 2026 22:21:13 +0000

Your Logs Tell You What Happened. They Don't Tell You What Should Have Happened.

Haotian Liu · March 2026

The gap nobody talks about

Your AI agent ran overnight. The result is wrong. You open the terminal — and you see a wall of log lines telling you exactly what the agent did, step by step.

But none of those lines tell you what it was supposed to do. And none of them tell you where it started going off the rails.

This is not a logging problem. This is a structural gap in how we think about agent observability.

Logs record the execution. They do not record the intent. Without intent, you cannot measure deviation. Without measuring deviation, you are not auditing — you are just collecting noise.

Why existing tools don't solve this

Tools like LangSmith, Langfuse, and Arize are genuinely useful for what they do: tracing execution, tracking latency and cost, visualizing call chains. If you need to know how long your agent took or how many tokens it consumed, these tools are excellent.

But they are built on a flat timeline model. They record what happened. They do not record what the system intended to happen. And crucially, most of them evaluate output quality using another LLM as a judge.

This is the paradox: a probabilistic system cannot render certain judgment about another probabilistic system. An LLM evaluator is itself uncertain. Its output varies between runs. Using it to audit an agent is like asking one suspect to verify another suspect's alibi.

You cannot build forensic-grade evidence on probabilistic foundations.

What causal auditing actually means

The alternative is to separate the question into two parts that can be answered deterministically:

1. What was the agent supposed to do? This is defined explicitly, before runtime, as a set of constraints: no staging URLs in production config, trade amount below 500, file writes only within the project directory.

2. What did the agent actually do, and how far did it deviate? This is recorded at runtime by comparing every action against the pre-defined constraints.

Neither question requires an LLM to answer. Both can be answered by deterministic, mathematical comparison.
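
As a rough illustration of what "deterministic comparison" means for the three example constraints above, consider the sketch below. The function `check_constraints`, the shape of the `action` dictionary, and the use of glob patterns for paths are assumptions made for this example, not K9 Audit's internals.

```python
from fnmatch import fnmatchcase
from typing import Any


def check_constraints(
    action: dict[str, Any],
    deny_content: list[str],
    max_amount: float,
    allowed_paths: list[str],
) -> list[str]:
    """Return human-readable violations; an empty list means the action complies."""
    violations = []

    # Content rule: no forbidden substring may appear in what is being written.
    content = str(action.get("content", ""))
    for pattern in deny_content:
        if pattern in content:
            violations.append(f"content contains forbidden pattern '{pattern}'")

    # Numeric rule: the amount, if present, must not exceed the maximum.
    amount = action.get("amount")
    if amount is not None and amount > max_amount:
        violations.append(f"amount {amount} exceeds max {max_amount}")

    # Path rule: the target path, if present, must match an allowed glob.
    path = action.get("path")
    if path is not None and not any(fnmatchcase(path, glob) for glob in allowed_paths):
        violations.append(f"path '{path}' is outside allowed paths")

    return violations


action = {
    "path": "./project/config.json",
    "content": '{"endpoint": "https://api.market-data.staging.internal/v2/ohlcv"}',
    "amount": 120,
}
print(check_constraints(action, ["staging.internal"], 500, ["./project/**"]))
```

Here the path and amount pass, and the only violation reported is the forbidden `staging.internal` content. Running the same check on the same action always yields the same list.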

This is the CIEU model — Causal Intent-Execution Unit. Every monitored action produces a five-tuple:

X_t   — who acted, and under what conditions
U_t   — what the agent actually did
Y*_t  — what the agent was supposed to do (the intent contract)
Y_t+1 — what actually resulted
R_t+1 — how far the outcome diverged from intent, and why

These five fields are written into a local ledger as a hash-chained record. Each record's SHA256 hash is embedded into the next record. Nothing can be silently modified after the fact. The chain is cryptographically verifiable.

This is not a new log format. It is a different category of infrastructure: tamper-evident causal evidence.

A real example: three silent writes

On March 4, 2026, during a routine quant backtesting session, Claude Code attempted three times — 41 minutes apart — to write a staging environment URL into a production config file:

{"endpoint": "https://api.market-data.staging.internal/v2/ohlcv"}

The syntax was valid. No exception was thrown. A conventional logger would have recorded three "file write" events and moved on — quietly corrupting every subsequent backtest result.

Because the function was instrumented with a CIEU constraint:

from k9log import k9

@k9(deny_content=["staging.internal"], allowed_paths=["./project/**"])
def write_config(path: str, content: dict) -> bool: ...

...all three attempts were flagged immediately, written to the ledger with severity 0.9, and made permanently traceable. The root cause was identified in under a second:

k9log trace --last
→ seq=451  VIOLATION  _write_file
   finding: content contains forbidden pattern 'staging.internal'
   causal_proof: root cause traced to step #449, chain intact

Three attempts. 41 minutes apart. All recorded. All verifiable.

The three moments when this matters

When something goes wrong at 3am. You don't want to read 10,000 log lines. You want to run one command and see: which step deviated, from which constraint, by how much. That is what k9log trace --last gives you — in under a second.

When you need to show proof. Your agent caused a problem in production. Your client asks what happened. You pull up a terminal screenshot. It could have been edited. Nobody trusts it. A SHA256 hash-chained ledger, verified with k9log verify-log, is cryptographic proof that the record has not been tampered with since it was written. That is evidence a screenshot cannot provide.

When you need approval to deploy. Your manager asks: what happens if the agent goes out of bounds? Without a concrete answer, the project dies in the approval meeting. With CIEU constraints defined and a verifiable ledger in place, the answer is: every action is measured against explicit rules, deviations are flagged immediately, and the record cannot be altered retroactively. That answer gets projects approved.

What this looks like in practice

For a Python developer, instrumentation is one decorator:

from k9log import k9

@k9(deny_content=['staging.internal'], amount={'max': 500})
def execute_trade(symbol: str, amount: float, endpoint: str) -> dict:
    ...
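
To show how a decorator can observe a call without changing the function body, here is a small stand-in. It is not the `@k9` implementation: the decorator name `audited`, the in-memory `VIOLATIONS` list standing in for a ledger, and the record-only behavior are all assumptions for illustration.

```python
import functools
import inspect
from typing import Any, Callable

VIOLATIONS: list[dict[str, Any]] = []  # hypothetical stand-in for a real ledger


def audited(deny_content: list[str]) -> Callable:
    """Hypothetical decorator: record a violation when any argument contains a denied pattern."""
    def decorator(func: Callable) -> Callable:
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Map positional and keyword arguments back to parameter names.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            for name, value in bound.arguments.items():
                for pattern in deny_content:
                    if pattern in str(value):
                        VIOLATIONS.append(
                            {"function": func.__name__, "argument": name, "pattern": pattern}
                        )
            return func(*args, **kwargs)  # this sketch only records; it does not block

        return wrapper
    return decorator


@audited(deny_content=["staging.internal"])
def execute_trade(symbol: str, amount: float, endpoint: str) -> dict:
    return {"symbol": symbol, "amount": amount, "endpoint": endpoint}


execute_trade("AAPL", 100.0, "https://api.market-data.staging.internal/v2/ohlcv")
print(VIOLATIONS)
```

Binding arguments through `inspect.signature` lets the record name the offending parameter (`endpoint` here) regardless of whether it was passed positionally or by keyword.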

For a Claude Code user, it is one JSON file in the project root — no code changes required:

{"hooks": {
  "PreToolUse":  [{"matcher": "*", "hooks": [{"type": "command", "command": "python -m k9log.hook"}]}],
  "PostToolUse": [{"matcher": "*", "hooks": [{"type": "command", "command": "python -m k9log.hook_post"}]}]
}}

The ledger is stored locally at ~/.k9log/logs/k9log.cieu.jsonl. No data leaves the machine. No tokens are consumed. No per-event billing.

The boundary worth stating clearly

CIEU auditing answers one question: did the agent violate the constraints you defined?

It does not answer: did the agent accomplish the goal you gave it? That question requires evaluation of task completion, which is a different problem — and one that legitimately benefits from LLM evaluation. The two approaches are not in competition. CIEU auditing provides the deterministic foundation; higher-level evaluation can be built on top.

The mistake is trying to use a probabilistic evaluator as a substitute for a deterministic record. These are not interchangeable.

Who is this for

| Scenario | Entry point | Key commands | Notes |
| --- | --- | --- | --- |
| ⭐ Claude Code user | One `.claude/settings.json` | `k9log trace` / `stats` | Zero Python required. Every tool call auto-recorded. Unique differentiator vs all competitors. |
| ✅ Python developer | `@k9` decorator | `k9log trace --last` / `report` | One decorator per function. Sync and async both supported. |
| ✅ LangChain agent | `K9CallbackHandler` (3 lines) | `k9log trace` / `verify-log` | Native callback hook. Full CIEU records per tool call. |
| ✅ High-risk business ops | `@k9` + JSON config | `k9log alerts` / `causal` | Finance, config writes, DB ops. Numeric + content constraints. |
| ✅ DevOps / CI pipeline | `@k9` + `ci_check.py` | `ci_check.py` / `verify-log` | Pipeline halts on violation. Exit code non-zero. No manual review. |
| ✅ Small team debugging | Any entry point | `k9log trace --last` / `stats` | Root cause in under a second. No log archaeology required. |
| ✅ Data security | `@k9` deny_content | `k9log verify-log` / `report` | File access control. Cryptographic proof of what was touched. |
| ✅ Teaching / tutorials | Any entry point | `k9log report` / `causal` | Easiest audience to reach today. HTML report is shareable. Demo violations visually. |
| 🔲 CrewAI / AutoGen | Wrapper pattern | `k9log trace` | Works via `@k9` on tool functions. Native adapters on roadmap. |
| 🔲 Enterprise compliance | Full audit chain | `k9log verify-log` / `report` | Future use case. Needs organisational trust-building first. |

⭐ = unique differentiator · ✅ = works today · 🔲 = roadmap

Try it

K9 Audit is open source under AGPL-3.0. The CIEU architecture is covered by U.S. Provisional Patent Application No. 63/981,777.

GitHub: github.com/liuhaotian2024-prog/K9Audit
Install: pip install k9audit-hook
Contact: liuhaotian2024@gmail.com

If this resonates with a problem you have hit — or if you think the approach is wrong — I want to hear from you.