agent-replay-trace: Load and Step Through Agent Traces for Debugging

#hermeschallenge #ai #python #agents

1. The 2am debugging session

The agent ran for three hours and produced the wrong answer. The logs were there. A JSONL file with 847 lines, each line a JSON object with a timestamp, event kind, input, output, and some metadata.

I opened it in a text editor. Searched for "error". Found nothing. Searched for "tool_call". Found 34 matches. Tried to understand the sequence manually. Gave up and wrote a 40-line script to parse it.

The script worked. Then it broke when the log format changed slightly. Then I added filtering. Then I added duration calculation. Then I had 200 lines of one-off parsing code in a random utils file that nobody else could use.

agent-replay-trace is the reusable version of that script. Load the trace. Filter by event kind. Compute durations. Step through it one event at a time. All with a consistent API.

The core insight is that agentic debugging has a different shape than regular debugging. You cannot set a breakpoint in production. You have the log. The log is the ground truth. What you need is a way to navigate the log without writing a new parsing script every time the log format drifts by one field name.

The library is intentionally format-light. It requires kind and timestamp. Everything else is optional. If your trace has a correlation_id, durations are computed automatically. If not, you still get filtering and step-through. The library meets your trace where it is rather than demanding a specific schema upfront.
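To make the minimum concrete, here is a small sketch of two trace lines and a check for the required fields, using only the standard library. The field values and the has_required_fields helper are illustrative assumptions, not part of the library.

```python
import json

# Two hypothetical trace lines: the first carries only the required fields,
# the second adds optional ones (name, correlation_id, data).
lines = [
    '{"kind": "llm_call", "timestamp": "2026-05-24T02:00:00+00:00"}',
    '{"kind": "tool_call", "timestamp": "2026-05-24T02:00:03+00:00", '
    '"name": "search", "correlation_id": "c1", "data": {"query": "x"}}',
]

REQUIRED = ("kind", "timestamp")


def has_required_fields(event: dict) -> bool:
    """Return True if the event carries every required field."""
    return all(field in event for field in REQUIRED)


events = [json.loads(line) for line in lines]
valid = [has_required_fields(e) for e in events]
```

Both lines pass the check. Only the second gives the loader a correlation_id to work with.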

2. Shape of the fix

from agent_replay_trace import TraceLoader, Debugger

# Load a trace from a JSONL file
trace = TraceLoader.from_jsonl("/logs/agent-run-2026-05-24.jsonl")

# How many events total?
print(len(trace))  # 847

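# Get all tool calls that took more than 5 seconds.
# duration_s is None when no duration could be computed, so guard before comparing:
slow_calls = trace.where(kind="tool_call").by(lambda e: e.duration_s is not None and e.duration_s > 5)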
for event in slow_calls:
    print(event.name, event.duration_s)

# Find the first error:
first_error = trace.where(kind="error").first()
if first_error:
    print(first_error.timestamp, first_error.data)

# Step through the trace interactively:
dbg = Debugger(trace)
dbg.step()   # prints event 0, advances cursor
dbg.step()   # prints event 1, advances cursor
dbg.back()   # moves cursor back to event 0
dbg.jump(42) # jumps to event 42
dbg.peek()   # prints current event without advancing

For programmatic replay:

# Replay all llm_call events and inspect inputs:
for event in trace.where(kind="llm_call"):
    print(f"Turn {event.turn}: {len(event.data.get('messages', []))} messages")
    print(f"  model: {event.data.get('model')}")
    if event.duration_s is not None:  # None when no duration could be computed
        print(f"  duration: {event.duration_s:.2f}s")

3. What it does NOT do

It does not generate traces. You get traces from your logging layer, agentsnap, agenttrace, or whatever your pipeline writes. This library reads traces that already exist.

It does not modify traces. All operations are read-only. The Debugger has a cursor but never writes back to the file.

It does not support live traces. The TraceLoader reads the whole JSONL file at load time. If the agent is still running and appending to the log, you will not see new events without reloading. Live tail is a different use case.

It does not reconstruct agent state. It shows you events in sequence. It does not maintain a running state object that reflects what variables held what values at each point. For that you need your agent to emit richer state events.

It does not parse non-JSONL formats. One event per line, each line valid JSON. If your logs have a different format, write an adapter that normalizes to that structure before loading.
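An adapter can be very small. The sketch below assumes a made-up plain-text layout of "TIMESTAMP KIND [NAME]" per line and turns it into JSONL strings. The function name text_log_to_jsonl and the input format are hypothetical; adjust the parsing to whatever your logs actually look like.

```python
import json


def text_log_to_jsonl(lines):
    """Convert 'TIMESTAMP KIND [NAME]' lines into JSONL strings.

    The input layout is an invented example format, not something the
    library defines.
    """
    out = []
    for raw in lines:
        parts = raw.split()
        if len(parts) < 2:
            continue  # not enough fields to form an event
        event = {"timestamp": parts[0], "kind": parts[1]}
        if len(parts) > 2:
            event["name"] = parts[2]
        out.append(json.dumps(event))
    return out


sample = [
    "2026-05-24T02:00:00 llm_call",
    "2026-05-24T02:00:03 tool_call search",
    "",  # blank lines are dropped
]
jsonl_lines = text_log_to_jsonl(sample)
```

Write jsonl_lines to a file, one string per line, and the result has the one-event-per-line shape described above.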

4. Inside the library

The repo is at MukundaKatta/agent-replay-trace. There are 28 tests.

Core types:

TraceEvent: dataclass with kind (str), timestamp (datetime), name (str or None), data (dict), duration_s (float or None), turn (int or None).
Trace: sequence of TraceEvent objects. Supports __len__, __iter__, __getitem__. Has where() and by() for filtering.
TraceFilter: returned by trace.where(). Supports by(predicate), first(), last(), all(). Chainable.
TraceLoader: static methods from_jsonl(path) and from_list(events).
Debugger: wraps a Trace with a cursor. Methods: step(), back(), jump(n), peek(), reset(), remaining().

Duration calculation: the loader tries to compute duration_s from consecutive events with matching correlation_id or request_id fields. If neither is present, duration is None. You can also emit events with an explicit duration_ms field in the JSON and the loader will use that.
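The pairing logic can be sketched roughly like this. It is not the library's code: it works on plain dicts rather than TraceEvent objects, it assumes ISO 8601 timestamps, and it assumes the duration belongs on the first event of each matched pair. The attach_durations name is hypothetical.

```python
from datetime import datetime


def attach_durations(events):
    """Set duration_s on the opening event of each correlated pair."""
    open_events = {}  # correlation key -> (opening event, start time)
    for event in events:
        if "duration_ms" in event:
            # An explicit duration wins over any pairing.
            event["duration_s"] = event["duration_ms"] / 1000.0
            continue
        event.setdefault("duration_s", None)
        key = event.get("correlation_id") or event.get("request_id")
        if key is None:
            continue
        ts = datetime.fromisoformat(event["timestamp"])
        if key in open_events:
            opener, started = open_events.pop(key)
            opener["duration_s"] = (ts - started).total_seconds()
        else:
            open_events[key] = (event, ts)
    return events


events = attach_durations([
    {"kind": "tool_call", "timestamp": "2026-05-24T02:00:00", "correlation_id": "c1"},
    {"kind": "tool_result", "timestamp": "2026-05-24T02:00:07", "correlation_id": "c1"},
    {"kind": "llm_call", "timestamp": "2026-05-24T02:00:08", "duration_ms": 1500},
    {"kind": "note", "timestamp": "2026-05-24T02:00:09"},
])
durations = [e["duration_s"] for e in events]
```

Events with no correlation key and no explicit duration end up with duration_s set to None, which is why any comparison on duration_s needs a None check.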

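The where() / by() API is a thin filter chain. where(kind="tool_call") returns a TraceFilter. .by(lambda e: e.duration_s is not None and e.duration_s > 5) applies an additional predicate; the None check matters because duration_s is None for events without a computed duration. .all() materializes to a list. .first() returns the first match or None.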
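For a feel of how little machinery a chain like this needs, here is a stand-alone sketch. MiniFilter and the module-level where function are hypothetical stand-ins that operate on plain dicts; they are not the library's TraceFilter.

```python
class MiniFilter:
    """A lazy chain of predicates over a list of event dicts."""

    def __init__(self, events, predicates=()):
        self._events = events
        self._predicates = tuple(predicates)

    def by(self, predicate):
        # Return a new filter so chains never mutate each other.
        return MiniFilter(self._events, self._predicates + (predicate,))

    def __iter__(self):
        return (e for e in self._events if all(p(e) for p in self._predicates))

    def all(self):
        return list(self)

    def first(self):
        return next(iter(self), None)


def where(events, **fields):
    """Start a chain that matches events whose fields equal the given values."""
    return MiniFilter(events, [lambda e: all(e.get(k) == v for k, v in fields.items())])


events = [
    {"kind": "tool_call", "duration_s": 7.0},
    {"kind": "tool_call", "duration_s": None},
    {"kind": "llm_call", "duration_s": 9.0},
]
slow = where(events, kind="tool_call").by(
    lambda e: e["duration_s"] is not None and e["duration_s"] > 5
)
slow_list = slow.all()
no_error = where(events, kind="error").first()
```

Nothing is evaluated until all(), first(), or iteration runs, so adding predicates stays cheap.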

Line parsing: malformed lines are skipped with a warning. The loader does not fail hard on bad JSON lines. The assumption is that real log files sometimes have partial writes at the tail.
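A tolerant line parser in that spirit might look like the sketch below. The parse_jsonl_lines name and the warning text are assumptions for illustration; the library's actual warning mechanism may differ.

```python
import json
import warnings


def parse_jsonl_lines(lines):
    """Parse JSONL lines, skipping malformed ones with a warning."""
    events = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError as exc:
            warnings.warn(f"skipping malformed line {lineno}: {exc.msg}")
    return events


raw = [
    '{"kind": "llm_call", "timestamp": "2026-05-24T02:00:00"}',
    '{"kind": "tool_call", "timestamp": "2026-05-24T02:00:03"}',
    '{"kind": "tool_res',  # partial write at the tail of the file
]

# Capture warnings here so the example can show how many lines were skipped.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    events = parse_jsonl_lines(raw)
```

The two complete lines load and the truncated one is reported rather than aborting the whole load.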

5. When this is useful, when it is not

Useful when:

Something went wrong and you have the JSONL trace. You need to find where and why.
You are building an eval harness and want to replay historical traces to test how your evaluation logic scores past runs.
You are doing performance work and want to find which tool calls are slow across many trace files.
You are writing tests for agent behavior and want to load fixture traces to assert event sequences.

Not useful when:

Your agent does not write JSONL traces. Integrate a logging layer first. agenttrace and agent-step-log both produce compatible output.
You want live debugging. Use a proper debugger or structured logging with a live viewer.
Your traces are large (millions of events). The whole file loads into memory. For massive traces, use streaming tools.
You need to reconstruct agent state. The library shows you events, not state.

6. Install

The package is pending PyPI publication.

# PyPI (pending; this will not resolve until the package is published):
pip install agent-replay-trace

# From source:
git clone https://github.com/MukundaKatta/agent-replay-trace
cd agent-replay-trace
pip install -e .

No runtime dependencies. Python 3.9+.

# Run the tests:
pytest tests/ -v
# 28 tests, all passing

7. Siblings in the stack

Library	What it does
`agentsnap`	Snapshot agent state at any point
`agenttrace`	Cost and latency aggregation per run
`agent-decision-log`	Structured WHY-layer decision events
`agent-event-bus`	In-process pub/sub for agent events
`agent-shadow-mode`	Record tool calls without executing them
`agent-step-log`	Per-step JSONL logger (produces traces this library reads)

The natural chain: agent-step-log writes events to JSONL as the agent runs. After the run, agent-replay-trace loads the file so you can inspect what happened. agenttrace gives you the cost and latency summary. Together they cover the "what happened and how expensive was it" question.

8. What comes next

Three things I want to add, the third only if time permits.

First, a diff(trace_a, trace_b) method. You run the same agent on the same input twice with different settings. The diff shows you where the event sequences diverge. Useful for A/B testing prompt changes.
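Since diff does not exist yet, the following is only one possible shape for its simplest piece: finding the first index where two sequences of event kinds stop agreeing. The first_divergence helper is hypothetical and compares kinds only, ignoring data and timing.

```python
from itertools import zip_longest


def first_divergence(kinds_a, kinds_b):
    """Return the index where two event-kind sequences first differ, or None."""
    for i, (a, b) in enumerate(zip_longest(kinds_a, kinds_b)):
        if a != b:
            return i
    return None


run_a = ["llm_call", "tool_call", "llm_call"]
run_b = ["llm_call", "tool_call", "tool_call", "llm_call"]
diverged_at = first_divergence(run_a, run_b)
```

If one run is a strict prefix of the other, the function returns the length of the shorter run, which is the point where the longer one keeps going.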

Second, export to timeline formats. Specifically, output a JSON file that a flamegraph viewer can render. Right now you step through events linearly. A flamegraph would show you nested duration blocks for concurrent tool calls.

Third (if time permits), a TraceLoader.from_directory(path) that loads all JSONL files in a directory and merges them into a single Trace, sorted by timestamp. For production systems where each agent run writes a separate log file.
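A rough sketch of the directory merge, again hypothetical since from_directory is not implemented: read every .jsonl file, collect the events as dicts, and sort by timestamp. It assumes all timestamps are ISO 8601 strings in the same format and timezone, so that sorting the strings sorts the times.

```python
import json
import tempfile
from pathlib import Path


def load_directory(path):
    """Load every *.jsonl file under path and merge events by timestamp."""
    events = []
    for file in sorted(Path(path).glob("*.jsonl")):
        with file.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    events.append(json.loads(line))
    # Assumes uniformly formatted ISO 8601 strings, which sort chronologically.
    events.sort(key=lambda e: e["timestamp"])
    return events


# Demo with two throwaway files whose events interleave in time.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run-a.jsonl").write_text(
        '{"kind": "llm_call", "timestamp": "2026-05-24T02:00:05"}\n', encoding="utf-8"
    )
    Path(tmp, "run-b.jsonl").write_text(
        '{"kind": "tool_call", "timestamp": "2026-05-24T02:00:01"}\n', encoding="utf-8"
    )
    merged = load_directory(tmp)
```

This version does not skip malformed lines or tag events with their source file; a real implementation would probably want both.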

The library is small on purpose. It loads traces and lets you walk through them. The analysis layer is yours to build.

Source: github.com/MukundaKatta/agent-replay-trace