DEV Community

Mukunda Rao Katta


agent-replay-trace: step through JSONL agent traces for debugging

Your agent failed in production on Thursday night. You have a trace file. It is 4,000 lines of JSONL.

You know something went wrong around step 23. You know there was a tool call to a retrieval service that returned something unexpected. You know the agent made a bad decision somewhere after that. What you do not know is exactly what state the agent was in when it made that decision, or what the tool returned, or how long any of it took.

So you start grepping. You search for "kind": "tool_call" and count the matches. You pipe through jq to pull out specific fields. You scroll. You lose track of which line you were on. You search again. Forty-five minutes later you have a rough picture of what happened, but you are not sure you caught everything, and you are not confident in the sequence.

A step-through debugger would have taken five minutes.

That is what agent-replay-trace is. It reads trace JSONL files and gives you structured access: filter by event kind, group events, compute duration, or walk the trace one step at a time with a debugger interface.


The shape of the fix

Install it:

pip install agent-replay-trace

Load a trace and start asking questions:

from agent_replay_trace import Trace

trace = Trace.load("run-2026-05-23.jsonl")

# How many events total?
print(len(trace.events))

# Show only tool calls
calls = trace.where(kind="tool_call")
for event in calls:
    print(event)

# Group all events by kind
grouped = trace.by_kind()
for kind, events in grouped.items():
    print(f"{kind}: {len(events)} events")

# How long did the whole trace run?
print(f"Duration: {trace.duration_s():.2f}s")

That covers most of what you need for quick triage. But when you want to walk through the trace step by step, you use Debugger:

from agent_replay_trace import Debugger

dbg = Debugger.load("run-2026-05-23.jsonl")

# Move forward one event at a time
event = dbg.step()
print(event)

# Check current position
print(f"Step {dbg.position} of {dbg.total}")

# Look at full state at this point
print(dbg.current)

# Jump to a specific step
dbg.seek(22)
event = dbg.step()  # now on step 23
print(event)

You load the trace, seek to the step before the suspected failure, and read the exact event. No grepping. No losing your place. No reconstructing sequence from context.
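If the relationship between seek, step, and position is unclear, a tiny cursor makes it concrete. This is a minimal sketch of one plausible reading of those semantics, not the library's implementation: the Cursor class and the sample events are hypothetical, and position is assumed to count the events consumed so far.

```python
class Cursor:
    """Hypothetical stand-in for a step-through debugger over a list of events."""

    def __init__(self, events):
        self.events = list(events)
        self.position = 0  # assumption: number of events consumed so far

    @property
    def total(self):
        return len(self.events)

    @property
    def current(self):
        # The most recently consumed event, or None before the first step.
        return self.events[self.position - 1] if self.position > 0 else None

    def step(self):
        # Return the next event and advance; None once the trace is exhausted.
        if self.position >= self.total:
            return None
        event = self.events[self.position]
        self.position += 1
        return event

    def seek(self, position):
        # Jump so that the next step() returns the event after `position`.
        if not 0 <= position <= self.total:
            raise IndexError(f"position {position} outside 0..{self.total}")
        self.position = position


# Thirty made-up events numbered 1..30.
cursor = Cursor({"kind": "step", "n": n} for n in range(1, 31))
cursor.seek(22)
event = cursor.step()
print(event["n"], cursor.position, cursor.total)
```

Under that reading, seeking to 22 and stepping once lands on the 23rd event, which matches the comment in the example above.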


What it does NOT do

A few things this library deliberately does not handle:

  • It does not produce traces. That is the job of a trace producer such as agentsnap. This library reads the files those tools produce.
  • It does not replay agent execution. It reads the recorded events. It does not re-run your tools or re-invoke your LLM.
  • It does not enforce a schema beyond the kind field. What the other fields mean and which values are valid are up to the trace producer.
  • It does not write to trace files. It is read-only.

The scope is narrow on purpose. Read traces, navigate them, filter them, measure them. That is it.


Inside the lib: the format-agnostic kind_field design

The most interesting design decision in this library is one you might not notice until you try to use it with your own trace format.

Every trace format uses a field to identify the type of event. agentsnap uses kind. Some other tracing libraries use type. Some use event_type. The field name varies.

The naive approach is to hardcode kind and require all callers to transform their traces before loading. That creates friction. If your trace uses type, you either write a preprocessing script or you do not use the library.

agent-replay-trace passes the field name as a parameter:

# Default: reads the "kind" field
trace = Trace.load("run.jsonl")

# Your trace uses "type" instead
trace = Trace.load("run.jsonl", kind_field="type")

# Your trace uses "event_type"
trace = Trace.load("run.jsonl", kind_field="event_type")

That is the whole change. One parameter. But it means the library works with traces you did not produce yourself, traces from third-party tools, traces from older systems that predate the kind convention.

The rest of the library is consistent. where(kind="tool_call") still uses kind as the filter parameter name regardless of what field the raw JSONL uses. The library maps the raw field to the internal kind concept on load. Your filter code does not need to know what the underlying field was named.

The flexibility stops at the field name. The library still expects JSONL where each line is a valid JSON object, and it still expects each object to have the field you tell it to look for. Beyond that, the contents are yours.
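To make the mapping concrete, here is a minimal standard-library sketch of the idea. It is not the library's code: load_events and where are hypothetical helpers, and the sample records (including the tool and model fields) are made up. The point is only that the caller-named field is copied into a uniform kind key once, at load time, so filters never see the original name.

```python
import json
from typing import Any, Dict, Iterable, List


def load_events(lines: Iterable[str], kind_field: str = "kind") -> List[Dict[str, Any]]:
    """Parse JSONL lines and copy the caller-named field into a uniform "kind" key."""
    events = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        if kind_field not in record:
            raise ValueError(f"line {lineno}: missing field {kind_field!r}")
        event = dict(record)
        event["kind"] = record[kind_field]  # normalize once, on load
        events.append(event)
    return events


def where(events: List[Dict[str, Any]], kind: str) -> List[Dict[str, Any]]:
    """Filter on the normalized key, whatever the raw field was called."""
    return [e for e in events if e["kind"] == kind]


# A made-up trace that uses "type" instead of "kind".
raw = [
    '{"type": "llm_call", "model": "m1"}',
    '{"type": "tool_call", "tool": "search"}',
    '{"type": "tool_call", "tool": "fetch"}',
]
events = load_events(raw, kind_field="type")
tool_calls = where(events, kind="tool_call")
print([e["tool"] for e in tool_calls])
```

The filter call is identical for a trace that uses kind, type, or event_type; only the load call changes.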


When this is useful

The main use case is debugging production failures. You have a trace, you know approximately where things went wrong, and you want to walk forward through the events until you find the exact point.

The second use case is writing tests that operate on trace files. You load a fixture trace, assert that certain events appear in order, assert that the total duration is under a threshold, assert that a particular tool was called exactly twice. where and by_kind make those assertions easy to write.
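A sketch of what those assertions look like, written against plain dictionaries so it runs without the library installed. The fixture, the ts and tool fields, and the appears_in_order helper are all assumptions for illustration; with the library, the list comprehensions would be replaced by where and by_kind.

```python
import json

# Hypothetical fixture trace. The "ts" and "tool" fields are assumptions.
FIXTURE = """\
{"kind": "llm_call", "ts": 0.0}
{"kind": "tool_call", "tool": "search", "ts": 0.4}
{"kind": "tool_result", "tool": "search", "ts": 1.1}
{"kind": "tool_call", "tool": "search", "ts": 1.5}
{"kind": "tool_result", "tool": "search", "ts": 2.0}
"""


def parse(text):
    """Parse JSONL text into a list of event dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def appears_in_order(kinds, expected):
    """True if `expected` is a subsequence of `kinds` (gaps between matches allowed)."""
    remaining = iter(kinds)
    return all(any(k == want for k in remaining) for want in expected)


def test_trace_shape():
    events = parse(FIXTURE)
    kinds = [e["kind"] for e in events]

    # Certain events appear in order.
    assert appears_in_order(kinds, ["llm_call", "tool_call", "tool_result"])

    # A particular tool was called exactly twice.
    searches = [e for e in events if e["kind"] == "tool_call" and e["tool"] == "search"]
    assert len(searches) == 2

    # Total duration is under a threshold (seconds).
    duration = events[-1]["ts"] - events[0]["ts"]
    assert duration < 5.0


test_trace_shape()
```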

The third use case is auditing. You have a trace from a run that produced a result you are trying to explain to a stakeholder. You can walk through the trace and narrate each step: here is what the agent saw, here is what tool it called, here is what came back.

All three use cases benefit from the same thing: structured access to a JSONL file you would otherwise read with raw text tools.


When NOT to use it

Do not reach for this if you do not have trace files. If you are not already recording what your agent does at each step, start there. agentsnap produces trace JSONL in the format this library expects. Wire that in first.

Do not use it as a production monitoring tool. It reads files from disk. If you need live event streaming from a running agent, that is a different problem requiring a different approach.

Do not use it to validate trace schema correctness. It does not enforce field types or required fields beyond what you configure. If you need to validate that trace events conform to a specific schema, add a separate validation step before or after loading.
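A separate validation step can be small. The sketch below uses only the standard library; validate_jsonl, its parameters, and the ts field are hypothetical and not part of agent-replay-trace. It collects problems instead of stopping at the first one, so a single pass reports everything wrong with a file.

```python
import json


def validate_jsonl(lines, kind_field="kind", required=()):
    """Return a list of (line_number, problem) tuples; an empty list means the trace passed."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        # The kind field is always required; the caller can name more.
        for field in (kind_field, *required):
            if field not in record:
                problems.append((lineno, f"missing field {field!r}"))
    return problems


# Made-up input: one good line and three bad ones.
sample = [
    '{"kind": "tool_call", "ts": 1.0}',
    '{"ts": 2.0}',
    'not json',
    '[1, 2, 3]',
]
problems = validate_jsonl(sample, required=("ts",))
for lineno, problem in problems:
    print(f"line {lineno}: {problem}")
```

Checking field types or allowed values would go in the same loop; a dedicated schema validator is the better choice once the rules grow past a handful.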


Install

pip install agent-replay-trace

Zero dependencies. Python 3.8+.

Source: github.com/MukundaKatta/agent-replay-trace

28 tests, all passing.


Siblings

These libraries work together in the same tracing and observability layer:

| Lib | Boundary | Repo |
| --- | --- | --- |
| agentsnap | Produces the trace JSONL files this library reads | MukundaKatta/agentsnap |
| agenttrace | Cost and latency rollup over the same trace events | MukundaKatta/agenttrace |
| agent-decision-log | Records WHY events that appear in the trace | MukundaKatta/agent-decision-log |
| agent-resume | Records checkpoint events in the trace for crash recovery | MukundaKatta/agent-resume |

The typical stack: agentsnap records calls, agenttrace computes cost per run, agent-decision-log records the reasoning behind each branch, agent-resume checkpoints progress. When something goes wrong, agent-replay-trace is how you read all of that back.


What is next

The library is stable at v0.1.0. A few things that would make sense to add:

A diff function for comparing two traces side by side. You run the same agent on the same input with two different prompts and you want to see where the event sequences diverge.

A slice method to extract a sub-trace between two step indices, useful when a trace file is large and you only care about events in a specific window.

Better duration handling when events include nested timestamps rather than a single top-level timestamp.
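Until a diff function exists, the core of the first idea can be approximated by comparing the sequences of event kinds from two runs. This is a rough sketch of that approach, not a planned API: first_divergence is a hypothetical helper, and the two kind sequences are made up.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for "this trace has already ended"


def first_divergence(kinds_a, kinds_b):
    """Return the index of the first position where the sequences differ, or None if identical."""
    pairs = zip_longest(kinds_a, kinds_b, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING or a != b:
            return index
    return None


# Kind sequences from two hypothetical runs of the same agent with different prompts.
run_a = ["llm_call", "tool_call", "tool_result", "llm_call"]
run_b = ["llm_call", "tool_call", "tool_result", "tool_call", "tool_result", "llm_call"]

index = first_divergence(run_a, run_b)
print(f"Runs diverge at event index {index}")
```

Here the second run makes an extra tool call where the first goes straight back to the model, so the reported index (0-based) is 3. A real diff would also compare event payloads, not just kinds.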

For now: load, filter, step. That covers the case that costs people 45 minutes in production.
