DEV Community: MilkoorY

Reverse engineering Codex CLI rollout traces

MilkoorY — Thu, 14 May 2026 09:35:33 +0000

I spent a weekend building a DeepSeek proxy for Codex CLI so I could generate real tracing data. The proxying was straightforward. What I found when I looked at the actual trace files was not.

The documented format and the real format don't match.

This is the story of that mismatch: what I expected, what I found, and why it matters if you're building runtime tooling for coding agents.

The setup: why a proxy?

Codex CLI uses OpenAI's Responses API (via their SDK). DeepSeek only supports Chat Completions. To use DeepSeek as the backend, I needed a translation proxy — intercept Responses API calls and translate them to Chat Completions.

The proxy (tools/codex_deepseek_proxy.py) was straightforward:

Accept POST requests at /responses (or /v1/responses)
Translate the input field (Responses API format) to Chat Completions messages
Translate tool definitions from {"name": "bash", "parameters": {...}} to {"type": "function", "function": {"name": "bash", "parameters": {...}}}
Send to DeepSeek, receive a Chat Completions response, translate it back to Responses API event format
Return the response as a JSON body (not SSE stream — DeepSeek's streaming output is incompatible)
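
The tool-definition step in the list above can be sketched as follows. This is a minimal illustration, not the code in tools/codex_deepseek_proxy.py: the function name responses_tool_to_chat and the handling of an optional description field are assumptions made for the example.

```python
from typing import Any


def responses_tool_to_chat(tool: dict[str, Any]) -> dict[str, Any]:
    """Wrap a flat tool definition in the Chat Completions function envelope."""
    function: dict[str, Any] = {
        "name": tool["name"],
        "parameters": tool.get("parameters", {}),
    }
    # Assumption: carry the description across when one is present.
    if "description" in tool:
        function["description"] = tool["description"]
    return {"type": "function", "function": function}


flat = {"name": "bash", "parameters": {"type": "object", "properties": {}}}
wrapped = responses_tool_to_chat(flat)
print(wrapped)
```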

The translation is mechanical, but there was one critical detail: Codex expects function_call items with status "in_progress", not "completed". The status tells Codex the tool has been invoked but hasn't completed yet — it's waiting for function_call_output to arrive. Set it to "completed" and Codex thinks the tool already ran and has no output.

After getting the proxy working, Codex CLI ran happily against DeepSeek. Now I had real trace data.

What I expected to see

Codex CLI stores session data as rollout JSONL files in ~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl. Each line is a JSON object with a type field and a payload.
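
Reading these files takes only a few lines. The sketch below assumes the directory layout described above; the helper names find_rollouts and read_rollout are hypothetical, and the demo writes a throwaway file to a temporary directory instead of touching ~/.codex.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


def find_rollouts(root: Path) -> list[Path]:
    """Return rollout files under root, laid out as YYYY/MM/DD/rollout-*.jsonl."""
    return sorted(root.glob("*/*/*/rollout-*.jsonl"))


def read_rollout(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo against a temporary tree that mimics the sessions directory.
with tempfile.TemporaryDirectory() as tmp:
    day = Path(tmp) / "2026" / "05" / "14"
    day.mkdir(parents=True)
    (day / "rollout-demo.jsonl").write_text(
        '{"type": "session_meta", "payload": {}}\n\n', encoding="utf-8"
    )
    found = find_rollouts(Path(tmp))
    records = [record for path in found for record in read_rollout(path)]
print(len(found), records[0]["type"])
```

For real data, the root would be Path.home() / ".codex" / "sessions".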

Based on reading Codex CLI's source code (specifically protocol.rs), I expected event types like:

exec_command_begin / exec_command_end — command execution boundaries
mcp_tool_call_begin / mcp_tool_call_end — MCP tool call boundaries
agent_reasoning — the model's internal reasoning

I modelled my first parser around these. It produced exactly 1 event per session. Something was very wrong.

What the real format looks like

I dumped a raw rollout file. Here's the actual format (validated against Codex v0.130.0):

{"timestamp":"...","type":"session_meta","payload":{"model_provider":"deepseek","cli_version":"0.130.0"}}
{"timestamp":"...","type":"event_msg","payload":{"type":"task_started",...}}
{"timestamp":"...","type":"response_item","payload":{"type":"message","role":"developer","content":[...]}}
{"timestamp":"...","type":"turn_context","payload":{"model":"deepseek-chat",...}}
{"timestamp":"...","type":"event_msg","payload":{"type":"token_count",...}}
{"timestamp":"...","type":"response_item","payload":{"type":"function_call","name":"exec_command","arguments":"{}","call_id":"call_xxx"}}
{"timestamp":"...","type":"response_item","payload":{"type":"function_call_output","call_id":"call_xxx","output":"..."}}
{"timestamp":"...","type":"response_item","payload":{"type":"message","role":"assistant","content":[{"type":"output_text","text":"..."}]}}
{"timestamp":"...","type":"event_msg","payload":{"type":"agent_message","message":"..."}}
{"timestamp":"...","type":"event_msg","payload":{"type":"task_complete",...}}

The core pattern is:

response_item/function_call — the model requested a tool invocation. Contains name, arguments (JSON string), and call_id.
response_item/function_call_output — the result of that invocation. Contains call_id (paired with the corresponding function_call) and output (string).
event_msg/agent_message — the model's reasoning text. This is where thinking/reasoning blocks live.
response_item/message (role=assistant) — the model's text response to the user.
event_msg/token_count — token usage tracking, interspersed everywhere.
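
A quick way to see this structure in a rollout file is to count records by their outer and inner type. A minimal sketch, assuming records shaped like the sample above (the function name record_kind is hypothetical):

```python
from collections import Counter
from typing import Any


def record_kind(record: dict[str, Any]) -> str:
    """Return 'outer/inner', e.g. 'response_item/function_call', or just 'outer'."""
    outer = record.get("type", "unknown")
    payload = record.get("payload")
    inner = payload.get("type") if isinstance(payload, dict) else None
    # Only response_item and event_msg carry an inner type in the sample above.
    if outer in ("response_item", "event_msg") and inner:
        return f"{outer}/{inner}"
    return outer


sample = [
    {"type": "session_meta", "payload": {"cli_version": "0.130.0"}},
    {"type": "event_msg", "payload": {"type": "token_count"}},
    {"type": "response_item", "payload": {"type": "function_call",
                                          "name": "exec_command",
                                          "arguments": "{}",
                                          "call_id": "call_xxx"}},
    {"type": "event_msg", "payload": {"type": "token_count"}},
]
kinds = Counter(record_kind(record) for record in sample)
print(kinds.most_common())
```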

There are no exec_command_begin, exec_command_end, mcp_tool_call_begin, or mcp_tool_call_end events. At least not in the v0.130.0 rollout format.

The three things that surprised me

1. call_id pairing, not nesting

Tool invocations and their results are flat, linked by call_id — not nested events. The function_call line appears, then later (potentially many lines later) the matching function_call_output appears with the same call_id. Between them can be token_count events, reasoning messages, or other function calls.

This means the parser must buffer pending calls and match them by call_id, rather than assuming a begin/end nesting structure.
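
A minimal sketch of that buffering, assuming records shaped like the rollout lines shown earlier. The function name pair_calls is hypothetical, and this is an illustration rather than the code in codex_parser.py.

```python
from typing import Any


def pair_calls(
    records: list[dict[str, Any]],
) -> tuple[list[tuple[dict, dict]], list[dict]]:
    """Pair function_call payloads with their outputs by call_id.

    Returns (paired, unanswered), where unanswered holds calls whose
    function_call_output never arrived.
    """
    pending: dict[str, dict] = {}
    paired: list[tuple[dict, dict]] = []
    for record in records:
        if record.get("type") != "response_item":
            continue  # token_count and other event_msg lines are skipped
        payload = record.get("payload", {})
        kind = payload.get("type")
        if kind == "function_call":
            pending[payload["call_id"]] = payload
        elif kind == "function_call_output":
            call = pending.pop(payload["call_id"], None)
            if call is not None:
                paired.append((call, payload))
    return paired, list(pending.values())


def _call(call_id: str) -> dict[str, Any]:
    return {"type": "response_item",
            "payload": {"type": "function_call", "call_id": call_id,
                        "name": "exec_command", "arguments": "{}"}}


def _output(call_id: str, text: str) -> dict[str, Any]:
    return {"type": "response_item",
            "payload": {"type": "function_call_output",
                        "call_id": call_id, "output": text}}


records = [
    _call("a"), _call("b"),
    {"type": "event_msg", "payload": {"type": "token_count"}},
    _output("b", "B"), _output("a", "A"),
    _call("c"),  # never answered
]
paired, unanswered = pair_calls(records)
print([(c["call_id"], o["output"]) for c, o in paired])
```

Outputs are matched in arrival order, so interleaved calls pair correctly even when the results come back out of order.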

2. Token counts are everywhere

event_msg/token_count events appear between almost every meaningful event. They don't seem to follow a predictable cadence — sometimes before a function call, sometimes after, sometimes between reasoning blocks. They're noise for causal tracing but you need to handle them without breaking the event chain.

3. No explicit causality

The rollout format has no parent_event_id or equivalent causal field. Causality must be inferred from ordering: the model receives function_call_output, then decides what to do next — so the next function_call or agent_message after an output is causally dependent on it. This is the same temporal heuristic that log-based tailers for Copilot and Continue.dev use.
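
The ordering heuristic can be sketched like this: each function_call or agent_message is attributed to the most recent function_call_output before it, and token_count lines are ignored without resetting that pointer. The function name infer_parents and the use of list indices as event identifiers are assumptions for the example.

```python
from typing import Any, Optional

CAUSED = {("response_item", "function_call"), ("event_msg", "agent_message")}


def infer_parents(records: list[dict[str, Any]]) -> dict[int, Optional[int]]:
    """Map each caused record's index to the index of the latest prior output."""
    parents: dict[int, Optional[int]] = {}
    last_output: Optional[int] = None
    for index, record in enumerate(records):
        payload = record.get("payload", {})
        kind = (record.get("type"), payload.get("type"))
        if kind == ("response_item", "function_call_output"):
            last_output = index
        elif kind in CAUSED:
            parents[index] = last_output
        # Anything else (token_count, metadata) is ignored on purpose.
    return parents


records = [
    {"type": "response_item", "payload": {"type": "function_call", "call_id": "a"}},
    {"type": "event_msg", "payload": {"type": "token_count"}},
    {"type": "response_item", "payload": {"type": "function_call_output", "call_id": "a"}},
    {"type": "event_msg", "payload": {"type": "token_count"}},
    {"type": "event_msg", "payload": {"type": "agent_message", "message": "..."}},
    {"type": "response_item", "payload": {"type": "function_call", "call_id": "b"}},
]
parents = infer_parents(records)
print(parents)
```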

The translation chain

Here's what actually happens when Codex CLI runs through the DeepSeek proxy:

Codex CLI (Responses API)
  → POST /responses { input: [...], tools: [...] }
    → Proxy translates input → messages, tools → function definitions
    → DeepSeek Chat Completions API
      → Proxy translates response → Responses API events
        → Codex receives function_call with call_id
        → Codex executes the tool
          → Codex sends function_call_output back
            → Proxy translates to next request
              → Loop until task_complete
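
The "input → messages" step in this chain can be sketched as below. It is an illustration under assumptions, not the proxy's actual code: the function names are hypothetical, the "developer" role is mapped to "system" on the assumption that the backend does not accept it, and each function_call becomes its own assistant message (a fuller version would merge parallel calls into one).

```python
from typing import Any


def _text(content: Any) -> str:
    """Flatten a list of content parts into a plain string."""
    if isinstance(content, str):
        return content
    return "".join(p.get("text", "") for p in content if isinstance(p, dict))


def input_to_messages(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Translate Responses-style input items into Chat Completions messages."""
    messages: list[dict[str, Any]] = []
    for item in items:
        kind = item.get("type", "message")
        if kind == "message":
            # Assumption: "developer" is not accepted downstream, so use "system".
            role = "system" if item["role"] == "developer" else item["role"]
            messages.append({"role": role, "content": _text(item["content"])})
        elif kind == "function_call":
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": item["call_id"],
                    "type": "function",
                    "function": {"name": item["name"],
                                 "arguments": item["arguments"]},
                }],
            })
        elif kind == "function_call_output":
            messages.append({"role": "tool",
                             "tool_call_id": item["call_id"],
                             "content": item["output"]})
    return messages


items = [
    {"type": "message", "role": "developer",
     "content": [{"type": "input_text", "text": "Be brief."}]},
    {"type": "message", "role": "user",
     "content": [{"type": "input_text", "text": "List files"}]},
    {"type": "function_call", "call_id": "call_1",
     "name": "exec_command", "arguments": "{}"},
    {"type": "function_call_output", "call_id": "call_1", "output": "README.md"},
]
messages = input_to_messages(items)
print([m["role"] for m in messages])
```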

Each loop iteration in the proxy is a single Chat Completions call. The response contains either:

Tool calls (tool_calls in the Chat Completions message) → translate to Responses API function_call output items
A text response (message content) → translate as assistant message
Both (the model can return text + tool calls in the same response)
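
The three cases above can be handled by one function that emits zero or one message item followed by one function_call item per tool call. A minimal sketch, assuming a Chat Completions assistant message as a plain dict; the function name is hypothetical, and the status value follows the "in_progress" detail described earlier in this post.

```python
from typing import Any


def message_to_output_items(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Translate a Chat Completions assistant message into Responses-style items."""
    items: list[dict[str, Any]] = []
    text = message.get("content")
    if text:
        items.append({
            "type": "message",
            "role": "assistant",
            "content": [{"type": "output_text", "text": text}],
        })
    for call in message.get("tool_calls") or []:
        items.append({
            "type": "function_call",
            "call_id": call["id"],
            "name": call["function"]["name"],
            # Arguments stay a JSON-encoded string on both sides.
            "arguments": call["function"]["arguments"],
            # "in_progress" rather than "completed", as noted earlier.
            "status": "in_progress",
        })
    return items


tool_call = {"id": "call_1", "type": "function",
             "function": {"name": "exec_command", "arguments": "{}"}}
text_only = message_to_output_items({"role": "assistant", "content": "Done."})
tools_only = message_to_output_items(
    {"role": "assistant", "content": None, "tool_calls": [tool_call]})
both = message_to_output_items(
    {"role": "assistant", "content": "Checking.", "tool_calls": [tool_call]})
print([item["type"] for item in both])
```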

Implications for runtime tooling

If you're building observability or tracing for coding agents, the Codex CLI format teaches a few lessons:

Don't trust a reading of the source code, trust wire data. protocol.rs suggested one format; the actual rollout files used another. The source code showed the internal data structures, not the serialization format.
call_id pairing is a recurring pattern. Both Codex CLI and OpenAI's Responses API use call_id to link function calls to their outputs. It's not nesting — it's a flat key-value relationship. Parser design should match: buffer by call_id, match on arrival.
Log-based causality is the fallback, not the primary model. Codex CLI rollout data has no causal links. They must be inferred. This is fine for the ~80% case, but it means you can't always tell which function_call_output triggered which subsequent function_call.
The event stream is heterogeneous. Token counts, metadata, and control events are mixed with function calls. A robust parser must distinguish signal from noise without assuming a fixed event sequence.

The open-source implementation

The parser I ended up building (causetrace/hooks/codex_parser.py) handles:

response_item/function_call → creates a pending call tracked by call_id
response_item/function_call_output → matches by call_id, updates the corresponding event with tool_output
event_msg/agent_message → creates a reasoning event with causal parent linking
response_item/message (assistant) → creates a response text event

It turns a 465-line rollout file into 116 causally linked events. The protocol.rs-based attempt recovered effectively none of them; this parser recovers every event discoverable in the real format.

The full source is available at:
https://github.com/milkoor/causetrace/blob/main/causetrace/hooks/codex_parser.py

And the DeepSeek proxy that made the real traces possible:
https://github.com/milkoor/causetrace/blob/main/tools/codex_deepseek_proxy.py

This is the second in a series about coding agent runtime observability. First post: Coding agents produce causal DAGs, not logs.

Coding agents produce causal DAGs, not logs

MilkoorY — Thu, 14 May 2026 09:22:21 +0000

I've been building tracing hooks for coding agents — Claude Code, Codex CLI, Copilot, and others. The goal was simple: record what these agents do, so I could debug them when things went wrong.

What I found surprised me. The standard observability abstraction — a flat, chronological log of events — is the wrong primitive for coding agents. These agents don't produce meaningful timelines. They produce causal DAGs.

This post explains why, shows the difference with real traces, and argues that runtime observability for AI coding agents needs a fundamentally different data model.

1. A debugging story

A Claude Code session ran 87 tool calls over 12 minutes. The agent was asked to fix a bug. It read files, grepped for patterns, edited code, ran tests, failed, read more files, edited again, and eventually succeeded.

At the end, I wanted to understand one thing: why did it delete a particular line?

The flat timeline looked like this (simplified):

[10:02:13] Read(file_path=src/parser.py)
[10:02:15] Read(file_path=src/utils.py)
[10:02:18] Grep(pattern="parse_token")
[10:02:20] Edit(file_path=src/parser.py)
[10:02:22] Bash(command=pytest tests/ -x)
[10:02:25] Read(file_path=src/parser.py)
[10:02:28] Edit(file_path=src/parser.py)
[10:02:30] Bash(command=pytest tests/)
[10:02:33] Read(file_path=docs/api.md)
[10:02:35] Edit(file_path=docs/api.md)

Chronological order, zero insight. Two Read(parser.py) calls: which one led to the critical edit? Why the Read(docs/api.md) after the tests passed? The flat timeline records when, but tells you nothing about why.

This isn't a visualization problem. It's an abstraction problem.

2. Why flat timelines fail for coding agents

Conventional logging (syslog, OpenTelemetry spans, application logs) assumes events are independent observations of a system. You log them in sequence, and the sequence itself carries meaning: request A happened, then request B happened.

Coding agent tool calls are different. They form a dependency graph. Apart from the first call in a chain, every tool call is a direct response to the output of a prior call:

The agent reads a file because it found a relevant symbol in a grep result
It edits code because it understood the context from a read
It runs tests because it made an edit and needs to verify
It reads documentation because the test failure revealed a misunderstanding

These are not independent events. They are edges in a directed graph where each node is a tool call and each edge is a causal dependency.

A flat timeline encodes zero dependency information. It's a list of nodes without edges — a graph with no structure.

3. The causal DAG model

Instead of a flat list, record each tool call with an explicit parent_event_id — a reference to the event that caused it.

Here's a 9-event session rendered as a causal tree:

[03:13:37] Read(file_path=src/main.py)
    └─ [03:13:37] Grep(pattern=FIXME)
      └─ [03:13:37] Read(file_path=src/utils.py)
[03:13:37] Read(file_path=src/utils.py)  [caused by: need_context]
    └─ [03:13:38] Edit(file_path=src/utils.py)
      └─ [03:13:38] Bash(command=python -m pytest tests/ -x)
[03:13:38] Grep(pattern=counter)
    └─ [03:13:38] Edit(file_path=docs/api.md)
      └─ [03:13:38] Bash(command=python -m pytest tests/)

Three independent causal trees, each with a clear "why":

Tree 1: The agent read main.py, found a FIXME in the grep output, read utils.py for context
Tree 2 (user-requested context): The user asked for context on utils.py, agent edited it, ran tests to verify
Tree 3: The agent searched for "counter" in docs, found an out-of-date reference, edited api.md, ran all tests
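
Once every event carries a parent_event_id, rendering a forest like the one above is a short recursive walk. This is a minimal sketch under assumptions: the Event dataclass and render_forest are hypothetical names, not causetrace's actual types, and timestamps are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    event_id: str
    label: str
    parent_event_id: Optional[str] = None


def render_forest(events: list[Event]) -> list[str]:
    """Render events as indented trees, one root per event without a parent."""
    children: dict[Optional[str], list[Event]] = {}
    for event in events:
        children.setdefault(event.parent_event_id, []).append(event)

    lines: list[str] = []

    def walk(event: Event, depth: int) -> None:
        prefix = "  " * depth + ("└─ " if depth else "")
        lines.append(prefix + event.label)
        for child in children.get(event.event_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


events = [
    Event("e1", "Read(file_path=src/main.py)"),
    Event("e2", "Grep(pattern=FIXME)", "e1"),
    Event("e3", "Read(file_path=src/utils.py)", "e2"),
    Event("e4", "Grep(pattern=counter)"),
]
lines = render_forest(events)
print("\n".join(lines))
```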

This is not a nicer visualization of the same data. It's a different data model.

With parent_event_id, you can:

Trace root cause: Given any event, walk the parent chain to find the original trigger:

causetrace why ses_10d2f16e <event_id>

  [03:13:38] Grep(pattern=counter) ──→
  [03:13:38] Edit(file_path=docs/api.md) ──→
  [03:13:38] Bash(command=python -m pytest tests/) ◀── TARGET
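
Under the hood, a trace like this is just a walk up the parent chain. A minimal sketch for the single-parent case, assuming events are dicts keyed by a hypothetical event_id; the function name why mirrors the CLI command but is not causetrace's implementation.

```python
from typing import Any


def why(events_by_id: dict[str, dict[str, Any]], target_id: str) -> list[str]:
    """Return the chain of event ids from the root trigger down to the target."""
    chain: list[str] = []
    seen: set[str] = set()
    current = target_id
    # The seen set guards against malformed data with a parent cycle.
    while current is not None and current not in seen:
        seen.add(current)
        chain.append(current)
        current = events_by_id[current].get("parent_event_id")
    chain.reverse()
    return chain


events_by_id = {
    "grep": {"label": "Grep(pattern=counter)", "parent_event_id": None},
    "edit": {"label": "Edit(file_path=docs/api.md)", "parent_event_id": "grep"},
    "bash": {"label": "Bash(command=python -m pytest tests/)", "parent_event_id": "edit"},
}
chain = why(events_by_id, "bash")
print(" → ".join(events_by_id[e]["label"] for e in chain))
```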

Handle fan-in: A tool call can consume outputs from multiple prior calls. Multi-parent causality is recorded as a comma-separated list in parent_event_id:

  [02:42:40] Bash(pytest)  ← Edit(docs/api.md)
  [02:42:41] Grep(FIXME)   ← Read(main.py)
  [02:42:41] Read(utils.py) ← Grep(FIXME)
  [02:42:41] Edit(utils.py) ← Read(utils.py)
  [02:42:41] Bash(pytest -x) ← Edit(utils.py)
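
With comma-separated parents, the walk up the chain becomes a graph traversal. A minimal sketch, assuming events are dicts with a parent_event_id string; the event ids and the helper names parent_ids and ancestors are hypothetical.

```python
from typing import Any


def parent_ids(event: dict[str, Any]) -> list[str]:
    """Split a comma-separated parent_event_id into a list of ids."""
    raw = event.get("parent_event_id") or ""
    return [part.strip() for part in raw.split(",") if part.strip()]


def ancestors(events_by_id: dict[str, dict[str, Any]], target_id: str) -> set[str]:
    """Collect every event that the target transitively depends on."""
    found: set[str] = set()
    stack = parent_ids(events_by_id[target_id])
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        # Unknown ids are treated as roots rather than raising.
        stack.extend(parent_ids(events_by_id.get(current, {})))
    return found


events_by_id = {
    "read": {"parent_event_id": None},
    "edit_a": {"parent_event_id": "read"},
    "edit_b": {"parent_event_id": None},
    "bash": {"parent_event_id": "edit_a, edit_b"},  # fan-in: two parents
}
deps = ancestors(events_by_id, "bash")
print(sorted(deps))
```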

Replay with provenance: Walk the DAG in causal order (not chronological), replaying each event with its context. This reveals the agent's decision path, not just its action sequence.
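
Causal order is a topological order of the DAG, which Python's standard graphlib module computes directly. The sketch below assumes events are dicts with event_id and a comma-separated parent_event_id; the function name causal_order is hypothetical and this is not causetrace's replay code.

```python
from graphlib import TopologicalSorter
from typing import Any


def causal_order(events: list[dict[str, Any]]) -> list[str]:
    """Return event ids so that every parent comes before its children."""
    graph: dict[str, set[str]] = {}
    for event in events:
        raw = event.get("parent_event_id") or ""
        # Comma-separated ids cover the multi-parent (fan-in) case.
        graph[event["event_id"]] = {p.strip() for p in raw.split(",") if p.strip()}
    # static_order() raises graphlib.CycleError if the links form a cycle.
    return list(TopologicalSorter(graph).static_order())


events = [
    {"event_id": "bash", "parent_event_id": "edit_a,edit_b"},
    {"event_id": "edit_a", "parent_event_id": "read"},
    {"event_id": "edit_b", "parent_event_id": None},
    {"event_id": "read", "parent_event_id": None},
]
order = causal_order(events)
print(order)
```

The relative order of independent events is not meaningful here; only the parent-before-child guarantee is.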

4. Fidelity varies by agent

Not every coding agent exposes clean causal primitives:

Agent	Method	Causal Fidelity
Claude Code	PreToolUse / PostToolUse hooks	Perfect (native `event_id` / `parent_event_id`)
OpenCode	SQLite DB extraction	High (native parent-child in DB schema)
Codex CLI	Rollout JSONL parser	Medium (call_id pairing + heuristic)
Aider	Process wrapper (stdout parsing)	Medium (best-effort from output)
Continue.dev	Log tailing + temporal inference	Low (~80% heuristic)
GitHub Copilot	Extension host log parsing	Low (~80% heuristic)

For log-based agents without native causality, temporal proximity heuristics recover most of the structure: if event B happened within 2 seconds of event A's completion, B is likely a child of A. It's not perfect, but it recovers the ~80% case.
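
The 2-second rule can be sketched as follows. This assumes each event carries start and end times in seconds and that events are sorted by start; the field names and the function infer_parent_by_proximity are hypothetical. When several earlier events qualify, the one that finished most recently wins.

```python
from typing import Any, Optional


def infer_parent_by_proximity(
    events: list[dict[str, Any]], window: float = 2.0
) -> dict[str, Optional[str]]:
    """Guess each event's parent: the latest event that ended within `window` seconds before it started."""
    parents: dict[str, Optional[str]] = {}
    for i, event in enumerate(events):
        parent: Optional[str] = None
        best_end: Optional[float] = None
        for candidate in events[:i]:
            gap = event["start"] - candidate["end"]
            if 0 <= gap <= window and (best_end is None or candidate["end"] > best_end):
                parent, best_end = candidate["event_id"], candidate["end"]
        parents[event["event_id"]] = parent
    return parents


events = [
    {"event_id": "a", "start": 0.0, "end": 1.0},
    {"event_id": "b", "start": 1.5, "end": 2.0},    # 0.5 s after a ends
    {"event_id": "c", "start": 10.0, "end": 11.0},  # too late: new root
    {"event_id": "d", "start": 11.5, "end": 12.0},  # 0.5 s after c ends
]
parents = infer_parent_by_proximity(events)
print(parents)
```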

The important architectural point: causality is not binary. The data model supports perfect chains when available and degrades gracefully to heuristic inference for unstructured logs.

5. What this means for agent observability

Current runtime observability for coding agents has the wrong abstraction.

Tools like OpenTelemetry are designed for distributed systems where spans form trees based on request propagation. Agent tool calls are different — they form trees based on information flow, not request context propagation. An agent reads a file because of what it found in a prior read — that's not a parent span in the OpenTelemetry sense, but it's the essential causal link for understanding agent behavior.

Building the right abstraction means:

Capture edges, not just nodes. A log is a list of nodes. A causal trace records the edges between them.
Design for dependency, not chronology. Chronological order is easy to reconstruct from timestamps. Dependency order is not.
Accept graceful degradation. Native causal hooks → perfect traces. Log tailing → heuristic traces. The data model should accommodate both without breaking.

6. An open-source implementation

I've packaged this into an open-source runtime tracer called causetrace. It records tool calls with explicit parent_event_id chains, renders trees and DAGs, and supports root-cause tracing and replay.

The storage model is intentionally minimal: append-only JSONL per session, zero external dependencies. Each session is a file in ~/.causetrace/data/<session_id>.jsonl.
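
Append-only JSONL needs nothing beyond the standard library. The sketch below illustrates the idea only: the event fields and the helper names append_event and load_session are assumptions, not causetrace's actual schema or API, and the demo writes to a temporary directory rather than ~/.causetrace/data.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def append_event(data_dir: Path, session_id: str, event: dict[str, Any]) -> Path:
    """Append one event as a JSON line to <data_dir>/<session_id>.jsonl."""
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{session_id}.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return path


def load_session(path: Path) -> list[dict[str, Any]]:
    """Read every event back in the order it was appended."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    append_event(data_dir, "ses_demo", {"event_id": "e1", "parent_event_id": None})
    session_path = append_event(
        data_dir, "ses_demo", {"event_id": "e2", "parent_event_id": "e1"})
    loaded = load_session(session_path)
print(loaded)
```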

Quick start:

pip install causetrace

causetrace tree ses_10d2f16e    # causal tree
causetrace graph ses_3e23bcc8   # multi-parent DAG
causetrace why ses_10d2f16e <e> # root cause trace
causetrace replay ses_10d2f16e  # replay with provenance
causetrace doctor               # check agent configuration

The project supports 6 coding agent runtimes and is MIT-licensed.

But the point of this post isn't the tool. It's the abstraction: coding agents produce causal DAGs, not logs. If you're building observability for AI coding agents, start from that premise.

GitHub: https://github.com/milkoor/causetrace