<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MilkoorY</title>
    <description>The latest articles on DEV Community by MilkoorY (@milkoor).</description>
    <link>https://dev.to/milkoor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930862%2Fe0e9ace4-a6f3-44b1-9bb1-0460ca936dd0.png</url>
      <title>DEV Community: MilkoorY</title>
      <link>https://dev.to/milkoor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/milkoor"/>
    <language>en</language>
    <item>
      <title>Reverse engineering Codex CLI rollout traces</title>
      <dc:creator>MilkoorY</dc:creator>
      <pubDate>Thu, 14 May 2026 09:35:33 +0000</pubDate>
      <link>https://dev.to/milkoor/reverse-engineering-codex-cli-rollout-traces-3b9b</link>
      <guid>https://dev.to/milkoor/reverse-engineering-codex-cli-rollout-traces-3b9b</guid>
      <description>&lt;h1&gt;
  
  
  Reverse engineering Codex CLI rollout traces
&lt;/h1&gt;




&lt;p&gt;I spent a weekend building a DeepSeek proxy for Codex CLI so I could generate real tracing data. The proxying was straightforward. What I found when I looked at the actual trace files was not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The documented format and the real format don't match.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the story of that mismatch: what I expected, what I found, and why it matters if you're building runtime tooling for coding agents.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup: why a proxy?
&lt;/h2&gt;

&lt;p&gt;Codex CLI uses OpenAI's Responses API (via their SDK). DeepSeek only supports Chat Completions. To use DeepSeek as the backend, I needed a translation proxy — intercept Responses API calls and translate them to Chat Completions.&lt;/p&gt;

&lt;p&gt;The proxy (&lt;code&gt;tools/codex_deepseek_proxy.py&lt;/code&gt;) was straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept POST requests at &lt;code&gt;/responses&lt;/code&gt; (or &lt;code&gt;/v1/responses&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Translate the &lt;code&gt;input&lt;/code&gt; field (Responses API format) to Chat Completions &lt;code&gt;messages&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Translate tool definitions from &lt;code&gt;{"name": "bash", "parameters": {...}}&lt;/code&gt; to &lt;code&gt;{"type": "function", "function": {"name": "bash", "parameters": {...}}}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Send to DeepSeek, receive a Chat Completions response, translate it back to Responses API event format&lt;/li&gt;
&lt;li&gt;Return the response as a JSON body (not SSE stream — DeepSeek's streaming output is incompatible)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The translation is mechanical, but there was one critical detail: &lt;strong&gt;Codex expects &lt;code&gt;function_call&lt;/code&gt; items with status &lt;code&gt;"in_progress"&lt;/code&gt;, not &lt;code&gt;"completed"&lt;/code&gt;.&lt;/strong&gt; The status tells Codex the tool has been invoked but hasn't completed yet — it's waiting for &lt;code&gt;function_call_output&lt;/code&gt; to arrive. Set it to &lt;code&gt;"completed"&lt;/code&gt; and Codex thinks the tool already ran and has no output.&lt;/p&gt;

&lt;p&gt;After getting the proxy working, Codex CLI ran happily against DeepSeek. Now I had real trace data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I expected to see
&lt;/h2&gt;

&lt;p&gt;Codex CLI stores session data as rollout JSONL files in &lt;code&gt;~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl&lt;/code&gt;. Each line is a JSON object with a &lt;code&gt;type&lt;/code&gt; field and a &lt;code&gt;payload&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Based on reading Codex CLI's source code (specifically &lt;code&gt;protocol.rs&lt;/code&gt;), I expected event types like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;exec_command_begin&lt;/code&gt; / &lt;code&gt;exec_command_end&lt;/code&gt; — command execution boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcp_tool_call_begin&lt;/code&gt; / &lt;code&gt;mcp_tool_call_end&lt;/code&gt; — MCP tool call boundaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent_reasoning&lt;/code&gt; — the model's internal reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I modelled my first parser around these. It produced exactly &lt;strong&gt;1 event per session&lt;/strong&gt;. Something was very wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the real format looks like
&lt;/h2&gt;

&lt;p&gt;I dumped a raw rollout file. Here's the actual format (validated against Codex v0.130.0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"session_meta"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"model_provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"deepseek"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"cli_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"0.130.0"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"event_msg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"task_started"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"response_item"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"developer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"turn_context"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"deepseek-chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"event_msg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"response_item"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"function_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"exec_command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"call_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"call_xxx"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"response_item"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"function_call_output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"call_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"call_xxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"response_item"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"output_text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}]}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"event_msg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"agent_message"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"event_msg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"task_complete"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core pattern is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;response_item/function_call&lt;/code&gt;&lt;/strong&gt; — the model requested a tool invocation. Contains &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;arguments&lt;/code&gt; (JSON string), and &lt;code&gt;call_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;response_item/function_call_output&lt;/code&gt;&lt;/strong&gt; — the result of that invocation. Contains &lt;code&gt;call_id&lt;/code&gt; (paired with the corresponding &lt;code&gt;function_call&lt;/code&gt;) and &lt;code&gt;output&lt;/code&gt; (string).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;event_msg/agent_message&lt;/code&gt;&lt;/strong&gt; — the model's reasoning text. This is where thinking/reasoning blocks live.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;response_item/message&lt;/code&gt; (role=assistant)&lt;/strong&gt; — the model's text response to the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;event_msg/token_count&lt;/code&gt;&lt;/strong&gt; — token usage tracking, interspersed everywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are no &lt;code&gt;exec_command_begin&lt;/code&gt;, &lt;code&gt;exec_command_end&lt;/code&gt;, &lt;code&gt;mcp_tool_call_begin&lt;/code&gt;, or &lt;code&gt;mcp_tool_call_end&lt;/code&gt; events. At least not in the v0.130.0 rollout format.&lt;/p&gt;




&lt;h2&gt;
  
  
  The three things that surprised me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. call_id pairing, not nesting
&lt;/h3&gt;

&lt;p&gt;Tool invocations and their results are &lt;strong&gt;flat, linked by &lt;code&gt;call_id&lt;/code&gt;&lt;/strong&gt; — not nested events. The &lt;code&gt;function_call&lt;/code&gt; line appears, then later (potentially many lines later) the matching &lt;code&gt;function_call_output&lt;/code&gt; appears with the same &lt;code&gt;call_id&lt;/code&gt;. Between them can be token_count events, reasoning messages, or other function calls.&lt;/p&gt;

&lt;p&gt;This means the parser must buffer pending calls and match them by &lt;code&gt;call_id&lt;/code&gt;, rather than assuming a begin/end nesting structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Token counts are everywhere
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;event_msg/token_count&lt;/code&gt; events appear between almost every meaningful event. They don't seem to follow a predictable cadence — sometimes before a function call, sometimes after, sometimes between reasoning blocks. They're noise for causal tracing but you need to handle them without breaking the event chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. No explicit causality
&lt;/h3&gt;

&lt;p&gt;The rollout format has no &lt;code&gt;parent_event_id&lt;/code&gt; or equivalent causal field. Causality must be inferred from ordering: the model receives &lt;code&gt;function_call_output&lt;/code&gt;, then decides what to do next — so the next &lt;code&gt;function_call&lt;/code&gt; or &lt;code&gt;agent_message&lt;/code&gt; after an output is causally dependent on it. This is the same temporal heuristic that log-based tailers for Copilot and Continue.dev use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The translation chain
&lt;/h2&gt;

&lt;p&gt;Here's what actually happens when Codex CLI runs through the DeepSeek proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Codex CLI (Responses API)
  → POST /responses { input: [...], tools: [...] }
    → Proxy translates input → messages, tools → function definitions
    → DeepSeek Chat Completions API
      → Proxy translates response → Responses API events
        → Codex receives function_call with call_id
        → Codex executes the tool
          → Codex sends function_call_output back
            → Proxy translates to next request
              → Loop until task_complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each loop iteration in the proxy is a single Chat Completions call. The response contains either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool calls (function_call items) → translate to Responses API output items&lt;/li&gt;
&lt;li&gt;A text response (message content) → translate as assistant message&lt;/li&gt;
&lt;li&gt;Both (the model can return text + tool calls in the same response)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implications for runtime tooling
&lt;/h2&gt;

&lt;p&gt;If you're building observability or tracing for coding agents, the Codex CLI format teaches a few lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't trust source code comments, trust wire data.&lt;/strong&gt; &lt;code&gt;protocol.rs&lt;/code&gt; suggested one format; the actual rollout files used another. The source code showed the internal data structures, not the serialization format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;call_id pairing is a recurring pattern.&lt;/strong&gt; Both Codex CLI and OpenAI's Responses API use &lt;code&gt;call_id&lt;/code&gt; to link function calls to their outputs. It's not nesting — it's a flat key-value relationship. Parser design should match: buffer by &lt;code&gt;call_id&lt;/code&gt;, match on arrival.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Log-based causality is the fallback, not the primary model.&lt;/strong&gt; Codex CLI rollout data has no causal links. They must be inferred. This is fine for the ~80% case, but it means you can't always tell which &lt;code&gt;function_call_output&lt;/code&gt; triggered which subsequent &lt;code&gt;function_call&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The event stream is heterogeneous.&lt;/strong&gt; Token counts, metadata, and control events are mixed with function calls. A robust parser must distinguish signal from noise without assuming a fixed event sequence.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The open-source implementation
&lt;/h2&gt;

&lt;p&gt;The parser I ended up building (&lt;code&gt;causetrace/hooks/codex_parser.py&lt;/code&gt;) handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;response_item/function_call&lt;/code&gt; → creates a pending call tracked by &lt;code&gt;call_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response_item/function_call_output&lt;/code&gt; → matches by &lt;code&gt;call_id&lt;/code&gt;, updates the corresponding event with &lt;code&gt;tool_output&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;event_msg/agent_message&lt;/code&gt; → creates a reasoning event with causal parent linking&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response_item/message (assistant)&lt;/code&gt; → creates a response text event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It turns a 465-line rollout file into 116 causally-linked events — a parser accuracy improvement from effectively 0% (the protocol.rs-based attempt) to 100% of discoverable events in the real format.&lt;/p&gt;

&lt;p&gt;The full source is available at:&lt;br&gt;
&lt;a href="https://github.com/milkoor/causetrace/blob/main/causetrace/hooks/codex_parser.py" rel="noopener noreferrer"&gt;https://github.com/milkoor/causetrace/blob/main/causetrace/hooks/codex_parser.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the DeepSeek proxy that made the real traces possible:&lt;br&gt;
&lt;a href="https://github.com/milkoor/causetrace/blob/main/tools/codex_deepseek_proxy.py" rel="noopener noreferrer"&gt;https://github.com/milkoor/causetrace/blob/main/tools/codex_deepseek_proxy.py&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is the second in a series about coding agent runtime observability. First post: &lt;a href="https://dev.to/milkoor/coding-agents-produce-causal-dags-not-logs-ne6"&gt;Coding agents produce causal DAGs, not logs&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>reverseengineering</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>Coding agents produce causal DAGs, not logs</title>
      <dc:creator>MilkoorY</dc:creator>
      <pubDate>Thu, 14 May 2026 09:22:21 +0000</pubDate>
      <link>https://dev.to/milkoor/coding-agents-produce-causal-dags-not-logs-ne6</link>
      <guid>https://dev.to/milkoor/coding-agents-produce-causal-dags-not-logs-ne6</guid>
      <description>&lt;h1&gt;
  
  
  Coding agents produce causal DAGs, not logs
&lt;/h1&gt;




&lt;p&gt;I've been building tracing hooks for coding agents — Claude Code, Codex CLI, Copilot, and others. The goal was simple: record what these agents do, so I could debug them when things went wrong.&lt;/p&gt;

&lt;p&gt;What I found surprised me. &lt;strong&gt;The standard observability abstraction — a flat, chronological log of events — is the wrong primitive for coding agents.&lt;/strong&gt; These agents don't produce meaningful timelines. They produce causal DAGs.&lt;/p&gt;

&lt;p&gt;This post explains why, shows the difference with real traces, and argues that runtime observability for AI coding agents needs a fundamentally different data model.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. A debugging story
&lt;/h2&gt;

&lt;p&gt;A Claude Code session ran 87 tool calls over 12 minutes. The agent was asked to fix a bug. It read files, grepped for patterns, edited code, ran tests, failed, read more files, edited again, and eventually succeeded.&lt;/p&gt;

&lt;p&gt;At the end, I wanted to understand one thing: &lt;strong&gt;why did it delete a particular line?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The flat timeline looked like this (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[10:02:13] Read(file_path=src/parser.py)
[10:02:15] Read(file_path=src/utils.py)
[10:02:18] Grep(pattern="parse_token")
[10:02:20] Edit(file_path=src/parser.py)
[10:02:22] Bash(command=pytest tests/ -x)
[10:02:25] Read(file_path=src/parser.py)
[10:02:28] Edit(file_path=src/parser.py)
[10:02:30] Bash(command=pytest tests/)
[10:02:33] Read(file_path=docs/api.md)
[10:02:35] Edit(file_path=docs/api.md)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chronological order, zero insight. Two &lt;code&gt;Read(parser.py)&lt;/code&gt; calls — which one was before the critical edit? Why the &lt;code&gt;Read(docs/api.md)&lt;/code&gt; after the tests passed? The flat timeline records &lt;em&gt;when&lt;/em&gt; but tells you nothing about &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn't a visualization problem. It's an &lt;strong&gt;abstraction problem&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Why flat timelines fail for coding agents
&lt;/h2&gt;

&lt;p&gt;Conventional logging (syslog, OpenTelemetry spans, application logs) assumes events are &lt;strong&gt;independent observations&lt;/strong&gt; of a system. You log them in sequence, and the sequence itself carries meaning: request A happened, then request B happened.&lt;/p&gt;

&lt;p&gt;Coding agent tool calls are different. They form a &lt;strong&gt;dependency graph&lt;/strong&gt;. Every tool call is a direct response to the output of a prior call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent reads a file &lt;strong&gt;because&lt;/strong&gt; it found a relevant symbol in a grep result&lt;/li&gt;
&lt;li&gt;It edits code &lt;strong&gt;because&lt;/strong&gt; it understood the context from a read&lt;/li&gt;
&lt;li&gt;It runs tests &lt;strong&gt;because&lt;/strong&gt; it made an edit and needs to verify&lt;/li&gt;
&lt;li&gt;It reads documentation &lt;strong&gt;because&lt;/strong&gt; the test failure revealed a misunderstanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not independent events. They are &lt;strong&gt;edges in a directed graph&lt;/strong&gt; where each node is a tool call and each edge is a causal dependency.&lt;/p&gt;

&lt;p&gt;A flat timeline encodes zero dependency information. It's a list of nodes without edges — a graph with no structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The causal DAG model
&lt;/h2&gt;

&lt;p&gt;Instead of a flat list, record each tool call with an explicit &lt;code&gt;parent_event_id&lt;/code&gt; — a reference to the event that caused it.&lt;/p&gt;

&lt;p&gt;Here's the same 9-event session rendered as a causal tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[03:13:37] Read(file_path=src/main.py)
    └─ [03:13:37] Grep(pattern=FIXME)
      └─ [03:13:37] Read(file_path=src/utils.py)
[03:13:37] Read(file_path=src/utils.py)  [caused by: need_context]
    └─ [03:13:38] Edit(file_path=src/utils.py)
      └─ [03:13:38] Bash(command=python -m pytest tests/ -x)
[03:13:38] Grep(pattern=counter)
    └─ [03:13:38] Edit(file_path=docs/api.md)
      └─ [03:13:38] Bash(command=python -m pytest tests/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three independent causal trees, each with a clear "why":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tree 1:&lt;/strong&gt; The agent read &lt;code&gt;main.py&lt;/code&gt;, found a FIXME in the grep output, read &lt;code&gt;utils.py&lt;/code&gt; for context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree 2 (user-requested context):&lt;/strong&gt; The user asked for context on &lt;code&gt;utils.py&lt;/code&gt;, agent edited it, ran tests to verify&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tree 3:&lt;/strong&gt; The agent searched for "counter" in docs, found an out-of-date reference, edited &lt;code&gt;api.md&lt;/code&gt;, ran all tests&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not a nicer visualization of the same data. &lt;strong&gt;It's a different data model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;parent_event_id&lt;/code&gt;, you can:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace root cause:&lt;/strong&gt; Given any event, walk the parent chain to find the original trigger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;causetrace why ses_10d2f16e &amp;lt;event_id&amp;gt;

  [03:13:38] Grep(pattern=counter) ──→
  [03:13:38] Edit(file_path=docs/api.md) ──→
  [03:13:38] Bash(command=python -m pytest tests/) ◀── TARGET
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Handle fan-in:&lt;/strong&gt; A tool call can consume outputs from multiple prior calls. Multi-parent causality via comma-separated &lt;code&gt;parent_event_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  [02:42:40] Bash(pytest)  ← Edit(docs/api.md)
  [02:42:41] Grep(FIXME)   ← Read(main.py)
  [02:42:41] Read(utils.py) ← Grep(FIXME)
  [02:42:41] Edit(utils.py) ← Read(utils.py)
  [02:42:41] Bash(pytest -x) ← Edit(utils.py)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Replay with provenance:&lt;/strong&gt; Walk the DAG in causal order (not chronological), replaying each event with its context. This reveals the agent's decision path, not just its action sequence.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Fidelity varies by agent
&lt;/h2&gt;

&lt;p&gt;Not every coding agent exposes clean causal primitives:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Causal Fidelity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;PreToolUse / PostToolUse hooks&lt;/td&gt;
&lt;td&gt;Perfect (native &lt;code&gt;event_id&lt;/code&gt; / &lt;code&gt;parent_event_id&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenCode&lt;/td&gt;
&lt;td&gt;SQLite DB extraction&lt;/td&gt;
&lt;td&gt;High (native parent-child in DB schema)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI&lt;/td&gt;
&lt;td&gt;Rollout JSONL parser&lt;/td&gt;
&lt;td&gt;Medium (call_id pairing + heuristic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;td&gt;Process wrapper (stdout parsing)&lt;/td&gt;
&lt;td&gt;Medium (best-effort from output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Continue.dev&lt;/td&gt;
&lt;td&gt;Log tailing + temporal inference&lt;/td&gt;
&lt;td&gt;Low (~80% heuristic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;Extension host log parsing&lt;/td&gt;
&lt;td&gt;Low (~80% heuristic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For log-based agents without native causality, temporal proximity heuristics recover most of the structure: if event B happened within 2 seconds of event A's completion, B is likely a child of A. It's not perfect, but it recovers the ~80% case.&lt;/p&gt;

&lt;p&gt;The important architectural point: &lt;strong&gt;causality is not binary.&lt;/strong&gt; The data model supports perfect chains when available and degrades gracefully to heuristic inference for unstructured logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What this means for agent observability
&lt;/h2&gt;

&lt;p&gt;Current runtime observability for coding agents has the wrong abstraction.&lt;/p&gt;

&lt;p&gt;Tools like OpenTelemetry are designed for distributed systems where spans form trees based on request propagation. Agent tool calls are different — they form trees based on &lt;strong&gt;information flow&lt;/strong&gt;, not request context propagation. An agent reads a file because of what it found in a prior read — that's not a parent span in the OpenTelemetry sense, but it's the essential causal link for understanding agent behavior.&lt;/p&gt;

&lt;p&gt;Building the right abstraction means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Capture edges, not just nodes.&lt;/strong&gt; A log is a list of nodes. A causal trace records the edges between them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for dependency, not chronology.&lt;/strong&gt; Chronological order is easy to reconstruct from timestamps. Dependency order is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accept graceful degradation.&lt;/strong&gt; Native causal hooks → perfect traces. Log tailing → heuristic traces. The data model should accommodate both without breaking.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  6. An open-source implementation
&lt;/h2&gt;

&lt;p&gt;I've packaged this into an open-source runtime tracer called &lt;strong&gt;causetrace&lt;/strong&gt;. It records tool calls with explicit &lt;code&gt;parent_event_id&lt;/code&gt; chains, renders trees and DAGs, and supports root-cause tracing and replay.&lt;/p&gt;

&lt;p&gt;The storage model is intentionally minimal: append-only JSONL per session, zero external dependencies. Each session is a file in &lt;code&gt;~/.causetrace/data/&amp;lt;session_id&amp;gt;.jsonl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick start:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;causetrace

causetrace tree ses_10d2f16e    &lt;span class="c"&gt;# causal tree&lt;/span&gt;
causetrace graph ses_3e23bcc8   &lt;span class="c"&gt;# multi-parent DAG&lt;/span&gt;
causetrace why ses_10d2f16e &amp;lt;e&amp;gt; &lt;span class="c"&gt;# root cause trace&lt;/span&gt;
causetrace replay ses_10d2f16e  &lt;span class="c"&gt;# replay with provenance&lt;/span&gt;
causetrace doctor               &lt;span class="c"&gt;# check agent configuration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project supports 6 coding agent runtimes and is MIT-licensed.&lt;/p&gt;

&lt;p&gt;But the point of this post isn't the tool. It's the abstraction: &lt;strong&gt;coding agents produce causal DAGs, not logs.&lt;/strong&gt; If you're building observability for AI coding agents, start from that premise.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Discuss on &lt;a href="https://news.ycombinator.com/item?id=48132377" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;GitHub: &lt;a href="https://github.com/milkoor/causetrace" rel="noopener noreferrer"&gt;https://github.com/milkoor/causetrace&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
