Mukunda Rao Katta


Debug Agent Failures After They Happen with JSONL Traces

The 3 AM Failure You Were Not Watching

The agent failed at 3:17 AM. The error is in the logs: ToolError: search returned 0 results. But that is not the real question. The real question is: what was the conversation state when it made that tool call? What did the model receive? What prompt was active? What did the previous tool return?

Standard Python logging does not capture any of that. You might have the error line. You do not have the context.

agenttap is a wire-level capture library. It captures the full request and response JSON for every LLM call and every tool call, writing each to a JSONL file in real time. If it was running before the failure, you have everything.

agent-replay-trace reads that JSONL file and lets you step through the run: view each event in order, filter by type, inspect state at any point, and jump to the failure.

Together they are a post-mortem debugger for agent runs.


Main Code Example

Step 1: Wire up agenttap before your run

Install both:

pip install agenttap agent-replay-trace

Wrap your Anthropic client:

from agenttap import Tap
import anthropic

client = anthropic.Anthropic()
tap = Tap(output_path="runs/run-20260524.jsonl")

# Wrap the client: all calls go through the tap
tapped_client = tap.wrap(client)

# Use tapped_client exactly like the original
response = tapped_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the Q1 revenue report."}],
)

For tool calls, report them explicitly:

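# tool_registry is assumed to be a dict mapping tool names to callables.
def run_tool_with_tap(tap, tool_name, args):
    # Record the call first so a crash inside the tool still leaves an entry
    tap.record_tool_call(tool_name=tool_name, args=args)
    try:
        result = tool_registry[tool_name](**args)
    except Exception as exc:
        # Record the failure, then re-raise so the caller still sees it
        tap.record_tool_result(tool_name=tool_name, result=None, error=str(exc))
        raise
    # Success is recorded outside the try so a recording error is not logged as a tool error
    tap.record_tool_result(tool_name=tool_name, result=result, error=None)
    return result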

The JSONL file grows one entry per event. Each entry has a timestamp, event type, and the full payload. File rotation and async writes are handled internally so the tap does not slow down the main path.
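To make the shape of the file concrete, here is a minimal sketch of an append-only JSONL event log written with the standard library alone. It is not agenttap's implementation. The field names (`timestamp`, `event_type`, `payload`) follow the description above, and the helpers `append_event` and `read_events` are hypothetical names used only for illustration.

```python
import json
import time
from pathlib import Path


def append_event(path, event_type, payload):
    """Append one event as a single JSON line (hypothetical helper)."""
    record = {
        "timestamp": time.time(),
        "event_type": event_type,
        "payload": payload,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier events intact; one line per event.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_events(path):
    """Read every non-empty line back as a dict, in file order."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each event is a complete line, a reader never needs to parse the whole file to see the most recent entries.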

Step 2: Load and step through the trace after the failure

from agent_replay_trace import Debugger

dbg = Debugger.load("runs/run-20260524.jsonl")

# How many events in the run?
print(f"Total events: {dbg.count()}")

# What types of events are present?
for event_type, count in dbg.event_counts().items():
    print(f"  {event_type}: {count}")

Output:

Total events: 34
  model.request: 5
  model.response: 5
  tool.called: 12
  tool.result: 12

List every tool call in order:

# Step through all tool calls
for event in dbg.where(event_type="tool.called"):
    print(f"[{event.timestamp}] {event.tool_name}({event.args})")

Find the failure:

failure = dbg.first(event_type="tool.result", where=lambda e: e.error is not None)
if failure:
    print(f"First failure: {failure.tool_name} at {failure.timestamp}")
    print(f"Error: {failure.error}")

    # What was the model's last message before this failure?
    context = dbg.before(failure, event_type="model.response", n=1)
    if context:  # empty when the failure precedes any model response
        print(f"Last model response: {context[0].content[:200]}")

Step through the run interactively:

import json

session = dbg.step_session()

while True:
    event = session.next()
    if event is None:
        print("End of trace.")
        break
    print(f"[{event.index}] {event.event_type} | {event.summary()}")
    cmd = input("(n)ext / (j)ump N / (i)nspect / (q)uit > ").strip()
    if cmd == "q":
        break
    elif cmd.startswith("j "):
        session.jump(int(cmd.split()[1]))
    elif cmd == "i":
        # Dump the full recorded payload for the current event
        print(json.dumps(event.payload, indent=2))

The interactive session lets you jump to any index. If you know the failure was around event 28, you jump to 25 and step forward from there.


What This Does NOT Do

agenttap requires you to have it running before the failure. It is not a retroactive tool. If your agent crashed at 3 AM and you had no tap configured, there is no trace to replay. The JSONL file only exists if you set it up.

agent-replay-trace does not re-execute any code. It reads recorded events. You cannot replay a trace and "fix" the failure mid-stream to see what would have happened differently. It is strictly a viewer.

Neither library captures side effects outside the LLM and tool boundary. If your tool calls a database and the database state changed, the trace shows what the tool returned but not the database state at the time.

The tap does not capture Python exceptions that happen outside the wrapped client or the explicit record_tool_result() calls. If your agent dies in orchestration code before it calls the client, the tap has no entry for it.
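One way to cover that gap is to wrap the orchestration entry point yourself and append a crash record when it raises. The sketch below uses only the standard library and does not touch agenttap. The function name `run_with_crash_record` and the event type `orchestration.error` are hypothetical, and the record is written to a separate file so it cannot interleave with the tap's own writes.

```python
import json
import time
import traceback


def run_with_crash_record(fn, crash_log_path):
    """Run fn(); on any exception append a JSONL crash record, then re-raise."""
    try:
        return fn()
    except Exception as exc:
        record = {
            "timestamp": time.time(),
            "event_type": "orchestration.error",  # hypothetical event type
            "error": repr(exc),
            "traceback": traceback.format_exc(),
        }
        with open(crash_log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        raise
```

Matching the timestamp of the crash record against the last event in the trace shows how far the run got before the orchestration code failed.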


Design Reasoning

The tap writes JSONL, not a binary format, for two reasons. First, JSONL is streamable. You can tail -f a running trace file and see events as they happen. Second, JSONL is inspectable without any library. If you need to grep for a specific tool name across 100 run files, standard shell tools work.
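The same "no library needed" property holds in Python. The following sketch scans a directory of run files for events that mention a given tool, using only `json` and `pathlib`. It assumes each line is a JSON object with a top-level `tool_name` key, which the article does not confirm; adjust the lookup to the real layout. `find_tool_events` is a hypothetical helper name.

```python
import json
from pathlib import Path


def find_tool_events(run_dir, tool_name):
    """Yield (file name, line number, event) for events naming tool_name."""
    for path in sorted(Path(run_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    # A run that died mid-write can leave a truncated last line.
                    continue
                # Assumed layout: a JSON object with a top-level "tool_name".
                if isinstance(event, dict) and event.get("tool_name") == tool_name:
                    yield path.name, line_no, event
```

Skipping undecodable lines matters for post-mortems in particular, since the run you care about is the one most likely to have been cut off.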

Separating capture (agenttap) from replay (agent-replay-trace) is intentional. You want the capture library to be as lightweight as possible. It runs in production. It cannot import a heavy debugging toolkit. The replay library can be as heavy as it needs to be because it only runs when you are investigating a failure.

The before() and after() methods on Debugger are the most useful primitives for post-mortems. The failure event is usually not where the root cause is. The root cause is in the state before the failure. before(failure, event_type="model.request", n=1) gives you the exact prompt that led to the bad tool call.
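To show what a lookup like this involves, here is a minimal sketch of a `before`-style search over a plain list of event dicts. It is an illustration, not the Debugger's code: the signature takes an index instead of an event object, the `event_type` key is assumed, and returning the nearest match first is a choice made for this sketch.

```python
def before(events, anchor_index, event_type=None, n=1):
    """Return up to n events preceding anchor_index, nearest first.

    events is a list of dicts in recorded order. event_type optionally
    filters on the assumed "event_type" key.
    """
    if n <= 0:
        return []
    matches = []
    # Walk backwards from the event just before the anchor.
    for event in reversed(events[:anchor_index]):
        if event_type is None or event.get("event_type") == event_type:
            matches.append(event)
            if len(matches) == n:
                break
    return matches
```

Called with the index of the failing tool result and `event_type="model.request"`, it returns the last request sent before the failure, which is usually the first thing worth reading in a post-mortem.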


When This Applies (and When It Does Not)

Use this approach when:

  • Your agent runs unattended in production and failures must be diagnosed after the fact, with nobody watching the run
  • You have multi-turn conversations where the failure cause is not obvious from the error alone
  • You want to build a suite of regression tests from recorded traces of known failure cases

Skip it when:

  • Your agent is short-lived (single turn, fast) and logs are sufficient
  • You are in early development and re-running the agent is cheap enough that you do not need traces
  • Your production environment cannot write files (ephemeral containers, strict I/O constraints)

For the ephemeral container case, agenttap supports a custom writer callback instead of a file path. You can write to a log aggregation service or object storage instead.
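As a rough illustration of what such a callback could look like, the sketch below batches events in memory and hands them to a sink function. The article does not show how agenttap invokes the callback or which parameter accepts it, so two things here are assumptions: that the callback is called once per event with a dict, and the class name `BatchingWriter` itself. The sink is any callable that takes a string of JSONL text, for example an upload function.

```python
import json


class BatchingWriter:
    """Collect events in memory and hand them to a sink in batches."""

    def __init__(self, sink, batch_size=50):
        self._sink = sink  # callable taking one str of JSONL text
        self._batch_size = batch_size
        self._lines = []

    def __call__(self, event):
        # Assumption: invoked once per event with a JSON-serializable dict.
        self._lines.append(json.dumps(event))
        if len(self._lines) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send any buffered lines to the sink and clear the buffer."""
        if self._lines:
            self._sink("\n".join(self._lines) + "\n")
            self._lines = []
```

Call `flush()` on shutdown, otherwise a partial batch is lost. A smaller batch size costs more sink calls but loses fewer events if the container is killed mid-run.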


Install or Quick-Start

pip install agenttap agent-replay-trace

Minimal tap setup:

from agenttap import Tap
import anthropic

tap = Tap(output_path="trace.jsonl")
client = tap.wrap(anthropic.Anthropic())
# use client normally from here

Minimal replay:

from agent_replay_trace import Debugger

dbg = Debugger.load("trace.jsonl")
for event in dbg.all():
    print(event.summary())

GitHub:


Siblings Table

| Library | What it does | Integration with tap/replay |
| --- | --- | --- |
| agent-step-log | Per-step JSONL logger | Writes step-level entries; different granularity than tap |
| agent-event-bus | In-process pub/sub | Subscribe to tool events; write them to a secondary tap sink |
| llm-fixture-replay | VCR-style record/replay for tests | Uses tap output as the fixture source for test playback |
| agenttrace | Cost and latency aggregation per run | Post-process tap JSONL to extract cost/latency metrics |
| agent-citation | Claim-level citation tracking | Attach citation store state at each model response event |

What is Next

The immediate next feature is a diff viewer: load two trace files for the same agent (before and after a prompt change) and see which events changed. This is useful for prompt engineering: you want to know that your prompt change only affected the target behavior and did not break anything else.
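Until that feature exists, a rough version of the comparison can be built with `difflib` from the standard library. This is a sketch of the idea, not the planned viewer: it reduces each event to a signature of assumed keys (`event_type`, `tool_name`) and reports the regions where the two sequences differ. The function names are hypothetical.

```python
import difflib


def signature(event):
    """Reduce an event to the fields worth comparing (assumed keys)."""
    return (event.get("event_type"), event.get("tool_name"))


def diff_traces(old_events, new_events):
    """Return (tag, old_slice, new_slice) for every non-matching region."""
    old_sigs = [signature(e) for e in old_events]
    new_sigs = [signature(e) for e in new_events]
    matcher = difflib.SequenceMatcher(a=old_sigs, b=new_sigs, autojunk=False)
    return [
        (tag, old_events[i1:i2], new_events[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Comparing signatures instead of full payloads keeps the diff focused on control flow (which tools ran, in what order), since model output text will differ between almost any two runs.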

Integration with llm-fixture-replay is also planned. The idea is that a production trace becomes a test fixture automatically. The first time you see a failure pattern in a trace, you write a test that replays it. llm-fixture-replay handles the replay; agent-replay-trace provides the viewer for understanding the trace before you write the test.

For the Hermes Agent Challenge sprint, post-mortem debugging is the piece that makes everything else trustworthy. Budgets and guardrails tell you when something went wrong. Traces tell you why.
