Coding agents produce causal DAGs, not logs
I've been building tracing hooks for coding agents — Claude Code, Codex CLI, Copilot, and others. The goal was simple: record what these agents do, so I could debug them when things went wrong.
What I found surprised me. The standard observability abstraction — a flat, chronological log of events — is the wrong primitive for coding agents. These agents don't produce meaningful timelines. They produce causal DAGs.
This post explains why, shows the difference with real traces, and argues that runtime observability for AI coding agents needs a fundamentally different data model.
1. A debugging story
A Claude Code session ran 87 tool calls over 12 minutes. The agent was asked to fix a bug. It read files, grepped for patterns, edited code, ran tests, failed, read more files, edited again, and eventually succeeded.
At the end, I wanted to understand one thing: why did it delete a particular line?
The flat timeline looked like this (simplified):
[10:02:13] Read(file_path=src/parser.py)
[10:02:15] Read(file_path=src/utils.py)
[10:02:18] Grep(pattern="parse_token")
[10:02:20] Edit(file_path=src/parser.py)
[10:02:22] Bash(command=pytest tests/ -x)
[10:02:25] Read(file_path=src/parser.py)
[10:02:28] Edit(file_path=src/parser.py)
[10:02:30] Bash(command=pytest tests/)
[10:02:33] Read(file_path=docs/api.md)
[10:02:35] Edit(file_path=docs/api.md)
Chronological order, zero insight. Two Read(parser.py) calls: which one informed the critical edit? Why the Read(docs/api.md) after the tests passed? The flat timeline records when, but tells you nothing about why.
This isn't a visualization problem. It's an abstraction problem.
2. Why flat timelines fail for coding agents
Conventional logging (syslog, OpenTelemetry spans, application logs) assumes events are independent observations of a system. You log them in sequence, and the sequence itself carries meaning: request A happened, then request B happened.
Coding agent tool calls are different. They form a dependency graph. Every tool call is a direct response to the output of a prior call:
- The agent reads a file because it found a relevant symbol in a grep result
- It edits code because it understood the context from a read
- It runs tests because it made an edit and needs to verify
- It reads documentation because the test failure revealed a misunderstanding
These are not independent events. They are edges in a directed graph where each node is a tool call and each edge is a causal dependency.
A flat timeline encodes zero dependency information. It's a list of nodes without edges — a graph with no structure.
3. The causal DAG model
Instead of a flat list, record each tool call with an explicit parent_event_id — a reference to the event that caused it.
Here's a 9-event session rendered as a causal tree:
[03:13:37] Read(file_path=src/main.py)
└─ [03:13:37] Grep(pattern=FIXME)
└─ [03:13:37] Read(file_path=src/utils.py)
[03:13:37] Read(file_path=src/utils.py) [caused by: need_context]
└─ [03:13:38] Edit(file_path=src/utils.py)
└─ [03:13:38] Bash(command=python -m pytest tests/ -x)
[03:13:38] Grep(pattern=counter)
└─ [03:13:38] Edit(file_path=docs/api.md)
└─ [03:13:38] Bash(command=python -m pytest tests/)
Three independent causal trees, each with a clear "why":
- Tree 1: The agent read main.py, found a FIXME in the grep output, read utils.py for context
- Tree 2 (user-requested context): The user asked for context on utils.py, agent edited it, ran tests to verify
- Tree 3: The agent searched for "counter" in docs, found an out-of-date reference, edited api.md, ran all tests
This is not a nicer visualization of the same data. It's a different data model.
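Concretely, each event in this model is one self-describing record. A minimal sketch of what such a record might look like as a JSONL line (the field names here are illustrative, not necessarily causetrace's actual schema):

```python
import json

# A hypothetical causal-trace event: one JSON object per tool call.
# "parent_event_id" points at the event whose output caused this one.
event = {
    "event_id": "evt_0003",
    "parent_event_id": "evt_0002",
    "tool": "Edit",
    "args": {"file_path": "src/utils.py"},
    "ts": "2025-01-01T03:13:38Z",
}

# Appended as one line of the session's JSONL file.
line = json.dumps(event)
print(line)
```

The edge lives inside the node: every record carries its own provenance, so the graph can be reconstructed from the file alone.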
With parent_event_id, you can:
Trace root cause: Given any event, walk the parent chain to find the original trigger:
causetrace why ses_10d2f16e <event_id>
[03:13:38] Grep(pattern=counter) ──→
[03:13:38] Edit(file_path=docs/api.md) ──→
[03:13:38] Bash(command=python -m pytest tests/) ◀── TARGET
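Under the hood, root-cause tracing is just a walk up the parent chain. A minimal single-parent sketch, with illustrative field names rather than causetrace's actual code:

```python
def trace_why(events_by_id, target_id):
    """Return the causal chain from root trigger to target, oldest first.

    `events_by_id` maps event_id -> event dict whose "parent_event_id"
    is None for a root. Sketch only, not causetrace's implementation.
    """
    chain = []
    cursor = target_id
    while cursor is not None:
        event = events_by_id[cursor]
        chain.append(event)
        cursor = event.get("parent_event_id")
    chain.reverse()  # root trigger first, target last
    return chain

events = {
    "e1": {"event_id": "e1", "parent_event_id": None, "tool": "Grep"},
    "e2": {"event_id": "e2", "parent_event_id": "e1", "tool": "Edit"},
    "e3": {"event_id": "e3", "parent_event_id": "e2", "tool": "Bash"},
}
print([ev["tool"] for ev in trace_why(events, "e3")])  # → ['Grep', 'Edit', 'Bash']
```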
Handle fan-in: A tool call can consume outputs from multiple prior calls. Multi-parent causality is encoded as a comma-separated parent_event_id:
[02:42:40] Bash(pytest) ← Edit(docs/api.md)
[02:42:41] Grep(FIXME) ← Read(main.py)
[02:42:41] Read(utils.py) ← Grep(FIXME)
[02:42:41] Edit(utils.py) ← Read(utils.py)
[02:42:41] Bash(pytest -x) ← Edit(utils.py)
Replay with provenance: Walk the DAG in causal order (not chronological), replaying each event with its context. This reveals the agent's decision path, not just its action sequence.
4. Fidelity varies by agent
Not every coding agent exposes clean causal primitives:
| Agent | Method | Causal Fidelity |
|---|---|---|
| Claude Code | PreToolUse / PostToolUse hooks | Perfect (native event_id / parent_event_id) |
| OpenCode | SQLite DB extraction | High (native parent-child in DB schema) |
| Codex CLI | Rollout JSONL parser | Medium (call_id pairing + heuristic) |
| Aider | Process wrapper (stdout parsing) | Medium (best-effort from output) |
| Continue.dev | Log tailing + temporal inference | Low (~80% heuristic) |
| GitHub Copilot | Extension host log parsing | Low (~80% heuristic) |
For log-based agents without native causality, temporal proximity heuristics recover most of the structure: if event B happened within 2 seconds of event A's completion, B is likely a child of A. It's not perfect, but it recovers the ~80% case.
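The heuristic itself can be very small: attach each event to the most recent prior event that completed within the window. A sketch under those assumptions (2-second window, timestamps in seconds; not the tool's actual code):

```python
WINDOW_S = 2.0  # attach B to A if B starts within 2s of A completing

def infer_parents(events):
    """Best-effort parent inference from timestamps alone.

    `events` is a list of dicts with "event_id", "start", "end"
    (seconds), sorted by start time. Adds "parent_event_id" in place.
    """
    for i, ev in enumerate(events):
        ev["parent_event_id"] = None
        # Most recent prior event that finished just before this one began.
        for prev in reversed(events[:i]):
            if prev["end"] <= ev["start"] <= prev["end"] + WINDOW_S:
                ev["parent_event_id"] = prev["event_id"]
                break
    return events

evs = infer_parents([
    {"event_id": "a", "start": 0.0, "end": 1.0},
    {"event_id": "b", "start": 1.5, "end": 2.0},    # within 2s of a's end
    {"event_id": "c", "start": 10.0, "end": 11.0},  # too far: a new root
])
print([(e["event_id"], e["parent_event_id"]) for e in evs])
```

This is exactly where the heuristic breaks down, too: concurrent tool calls or a slow agent loop will produce false edges, which is why log-tailed agents sit at "Low" fidelity in the table above.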
The important architectural point: causality is not binary. The data model supports perfect chains when available and degrades gracefully to heuristic inference for unstructured logs.
5. What this means for agent observability
Current runtime observability for coding agents has the wrong abstraction.
Tools like OpenTelemetry are designed for distributed systems where spans form trees based on request propagation. Agent tool calls are different — they form trees based on information flow, not request context propagation. An agent reads a file because of what it found in a prior read — that's not a parent span in the OpenTelemetry sense, but it's the essential causal link for understanding agent behavior.
Building the right abstraction means:
- Capture edges, not just nodes. A log is a list of nodes. A causal trace records the edges between them.
- Design for dependency, not chronology. Chronological order is easy to reconstruct from timestamps. Dependency order is not.
- Accept graceful degradation. Native causal hooks → perfect traces. Log tailing → heuristic traces. The data model should accommodate both without breaking.
6. An open-source implementation
I've packaged this into an open-source runtime tracer called causetrace. It records tool calls with explicit parent_event_id chains, renders trees and DAGs, and supports root-cause tracing and replay.
The storage model is intentionally minimal: append-only JSONL per session, zero external dependencies. Each session is a file in ~/.causetrace/data/<session_id>.jsonl.
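An append-only JSONL store needs almost no machinery. A sketch of what such a storage layer could look like, with a configurable directory standing in for ~/.causetrace/data (function names are hypothetical, not causetrace's API):

```python
import json
import tempfile
from pathlib import Path

def append_event(data_dir, session_id, event):
    """Append one event as a JSON line to the session's file."""
    path = Path(data_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(event) + "\n")

def load_session(data_dir, session_id):
    """Read all events back, one dict per line."""
    path = Path(data_dir) / f"{session_id}.jsonl"
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

data_dir = tempfile.mkdtemp()
append_event(data_dir, "ses_demo", {"event_id": "e1", "parent_event_id": None})
append_event(data_dir, "ses_demo", {"event_id": "e2", "parent_event_id": "e1"})
print(len(load_session(data_dir, "ses_demo")))  # → 2
```

Append-only writes also make hook integration cheap: a PreToolUse/PostToolUse hook only ever opens, appends one line, and closes.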
Quick start:
pip install causetrace
causetrace tree ses_10d2f16e # causal tree
causetrace graph ses_3e23bcc8 # multi-parent DAG
causetrace why ses_10d2f16e <e> # root cause trace
causetrace replay ses_10d2f16e # replay with provenance
causetrace doctor # check agent configuration
The project supports 6 coding agent runtimes and is MIT-licensed.
But the point of this post isn't the tool. It's the abstraction: coding agents produce causal DAGs, not logs. If you're building observability for AI coding agents, start from that premise.
Discuss on Hacker News
GitHub: https://github.com/milkoor/causetrace