Step Through Your Agent's Failures Like a Debugger

The production run failed at step 23 out of 47. The logs show "tool call error" but not why. The conversation history that led to step 23 is not in the logs. Reproducing the failure requires setting up the same initial conditions and running all 23 steps again.

agent-debug-replay is a step-through replay for agents that log their runs with agent-step-log. Load a run's JSONL file, advance step by step, inspect state at each step, and find exactly where things went wrong.

The Shape of the Fix

from agent_debug_replay import DebugReplay

replay = DebugReplay(log_path="./logs/run-a3f8b2c1.jsonl")

print(f"Total steps: {replay.total_steps}")
print(f"Run ID: {replay.run_id}")
print(f"Result: {replay.result}")

# Step through
for step in replay.steps():
    print(f"Step {step.index}: {step.action_type}")
    print(f"  Input: {step.input_summary}")
    print(f"  Output: {step.output_summary}")

    if step.error:
        print(f"  ERROR: {step.error}")
        print("  Full context at this step:")
        print(replay.context_at(step.index))
        break

Load the JSONL. Step through. Find the failure. Fix it.

What It Does NOT Do

agent-debug-replay does not re-execute the steps. It replays the recorded state. If you want to re-run from step 23 with a fix, you need to use agent-resume with a checkpoint up to step 22.

It does not visualize. It is a data access API. Build your own visualization on top (a CLI, a notebook, a web UI) using the step data it provides.
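
As a starting point for that kind of tooling, here is a minimal sketch of a plain-text timeline. It works on raw record dicts in the agent-step-log JSONL shape rather than on the library's Step objects. The function name render_timeline and the sample records are hypothetical and are not part of agent-debug-replay.

```python
from typing import Any


def render_timeline(records: list[dict[str, Any]]) -> str:
    """Render one line per step, flagging steps that recorded an error.

    Hypothetical helper: assumes each record has "step" and "action" keys
    and optional "tool" and "error" keys.
    """
    lines = []
    for record in records:
        marker = "!!" if record.get("error") else "  "
        tool = record.get("tool") or "-"
        line = f"{marker} {record['step']:>3}  {record['action']:<12} {tool}"
        if record.get("error"):
            line += f"  ERROR: {record['error']}"
        lines.append(line)
    return "\n".join(lines)


# Made-up sample data for illustration only.
sample = [
    {"step": 21, "action": "llm_call", "tool": None, "error": None},
    {"step": 22, "action": "tool_call", "tool": "read_file", "error": None},
    {"step": 23, "action": "tool_call", "tool": "web_search", "error": "timeout"},
]
timeline = render_timeline(sample)
print(timeline)
```

The same loop could feed a notebook table or a web template; only the formatting line would change.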

It does not diff two runs. For comparison between a failing and a passing run of the same task, load both as separate DebugReplay objects and compare step by step yourself.
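
A manual comparison can be quite small. The sketch below walks two lists of step records in parallel and reports the first position where they disagree. It assumes records in the agent-step-log JSONL shape and compares only the action and tool fields by default, since arguments such as search queries often differ between otherwise equivalent runs. The name first_divergence and the sample data are hypothetical.

```python
from itertools import zip_longest
from typing import Any, Optional


def first_divergence(
    passing: list[dict[str, Any]],
    failing: list[dict[str, Any]],
    keys: tuple[str, ...] = ("action", "tool"),
) -> Optional[int]:
    """Return the position of the first step where the two runs differ.

    Two steps count as equal when they agree on every field in `keys`.
    Returns None when the runs match step for step and have equal length.
    """
    for position, (a, b) in enumerate(zip_longest(passing, failing)):
        if a is None or b is None:
            return position  # one run ended before the other
        if any(a.get(k) != b.get(k) for k in keys):
            return position
    return None


# Made-up sample data: the runs agree on two steps, then pick different tools.
passing_run = [
    {"action": "llm_call", "tool": None},
    {"action": "tool_call", "tool": "web_search"},
    {"action": "tool_call", "tool": "read_file"},
]
failing_run = [
    {"action": "llm_call", "tool": None},
    {"action": "tool_call", "tool": "web_search"},
    {"action": "tool_call", "tool": "web_search"},
]
divergence = first_divergence(passing_run, failing_run)
print(divergence)
```

Pass a wider keys tuple, for example one that includes "args", when the task is deterministic enough for exact matching to be meaningful.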

Inside the Library

Each step in the JSONL is a structured record from agent-step-log:

{"run_id": "a3f8b2c1", "step": 23, "action": "tool_call", 
 "tool": "web_search", "args": {"query": "..."},
 "result": null, "error": "timeout", "ts": 1748107200,
 "messages_count": 46, "tokens_used": 12847}

DebugReplay reads the file and indexes steps by step number. Navigation:

from typing import Iterator

class DebugReplay:
    def steps(self) -> Iterator[Step]:
        """Yield every recorded step in log order."""
        for record in self._records:
            yield Step.from_record(record)

    def step_at(self, n: int) -> Step:
        """Look up one step by its step number."""
        return self._by_index[n]

    def context_at(self, n: int) -> ReplayContext:
        """Return all state accumulated up to and including step n."""
        # Slicing by position assumes steps are numbered from 0 with no
        # gaps, so that list position and step number coincide.
        seen = self._records[:n + 1]
        return ReplayContext(
            steps=seen,
            total_tokens=sum(r["tokens_used"] for r in seen),
            tools_called=[r["tool"] for r in seen if r.get("tool")],
            errors=[r["error"] for r in seen if r.get("error")],
        )

    def find_first_error(self) -> Step | None:
        """Return the first step that recorded an error, or None."""
        for step in self.steps():
            if step.error:
                return step
        return None
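
The excerpt above takes self._records and self._by_index for granted. As a rough sketch of how they could be populated (an assumption about the approach, not the library's actual loader), a JSONL log in the shape shown earlier can be read with the standard library alone. The helper names load_records and index_by_step are hypothetical, and the token counts in the sample are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_records(log_path: str) -> list[dict[str, Any]]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(log_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def index_by_step(records: list[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Map each record's step number to the record itself."""
    return {record["step"]: record for record in records}


# Made-up two-step log, written to a temporary file for the demonstration.
sample_lines = [
    '{"run_id": "a3f8b2c1", "step": 22, "action": "tool_call", '
    '"tool": "web_search", "error": null, "tokens_used": 310}',
    '{"run_id": "a3f8b2c1", "step": 23, "action": "tool_call", '
    '"tool": "web_search", "error": "timeout", "tokens_used": 295}',
]
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "run.jsonl"
    log_file.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    records = load_records(str(log_file))

by_step = index_by_step(records)
print(by_step[23]["error"])
```

Keying the index on the recorded step number rather than on list position keeps lookups correct even if a log starts at a step other than 0.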

context_at(n) is the key method for debugging. It aggregates what the log recorded up to and including step n: the step records themselves, total tokens, every tool called, and every error seen. This gives you the full picture at the failure point without scanning the log manually.
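
To make the aggregation concrete, the sketch below applies the same idea to three in-memory records. It is a standalone illustration, not the library's code: summarize_up_to is a hypothetical name, the records are made up, and the sum assumes tokens_used is a per-step count.

```python
from typing import Any


def summarize_up_to(records: list[dict[str, Any]], n: int) -> dict[str, Any]:
    """Aggregate tokens, tools, and errors for all steps numbered <= n."""
    seen = [r for r in records if r["step"] <= n]
    return {
        "steps": len(seen),
        "total_tokens": sum(r.get("tokens_used", 0) for r in seen),
        "tools_called": [r["tool"] for r in seen if r.get("tool")],
        "errors": [r["error"] for r in seen if r.get("error")],
    }


# Made-up sample data: step 1 is the failing tool call.
records = [
    {"step": 0, "action": "llm_call", "tool": None,
     "error": None, "tokens_used": 120},
    {"step": 1, "action": "tool_call", "tool": "web_search",
     "error": "timeout", "tokens_used": 80},
    {"step": 2, "action": "llm_call", "tool": None,
     "error": None, "tokens_used": 200},
]
summary = summarize_up_to(records, 1)
print(summary)
```

If tokens_used in your logs is a running total rather than a per-step count, take the value from the last record instead of summing.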

When to Use It

Use it whenever a production agent run fails and you need to understand why. The workflow:

1. Agent runs with agent-step-log recording each step
2. Run fails
3. Load the JSONL with DebugReplay
4. Call find_first_error() to jump to the failure
5. Call context_at(error_step.index) to see accumulated state
6. Find the root cause

Without step logging, you are working from timestamp-correlated logs and hope. With it, you have a complete, structured record of every agent action.

Skip it for agents that run for fewer than 5 steps. For runs that short, the overhead of step logging is not worth it.

Install

pip install git+https://github.com/MukundaKatta/agent-debug-replay

from agent_debug_replay import DebugReplay
import sys

def debug_failed_run(log_path: str) -> None:
    replay = DebugReplay(log_path=log_path)

    first_error = replay.find_first_error()
    if not first_error:
        print("No errors found in this run.")
        return

    print(f"First error at step {first_error.index}")
    print(f"Action: {first_error.action_type}")
    print(f"Error: {first_error.error}")

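    # State accumulated up to and including the failing step.
    ctx = replay.context_at(first_error.index)
    print("\nContext at failure:")
    print(f"  Total tokens used: {ctx.total_tokens}")
    print(f"  Tools called: {', '.join(ctx.tools_called)}")
    # Errors before this one (empty when this is the first error in the run).
    print(f"  Previous errors: {ctx.errors[:-1]}")

    # Show the step before the failure, unless the failure was the first step.
    if first_error.index > 0:
        prev = replay.step_at(first_error.index - 1)
        print("\nStep before failure:")
        print(f"  {prev.action_type}: {prev.output_summary}")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit(f"Usage: {sys.argv[0]} <path-to-run.jsonl>")
    debug_failed_run(sys.argv[1])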

Sibling Libraries

Library	What it solves
`agent-step-log`	Per-step JSONL logging (produces the log that debug-replay reads)
`agent-run-id`	Run IDs for correlating logs to specific runs
`agent-resume`	Resume from a checkpoint after finding and fixing the failure
`agenttap`	Wire-level capture for even more detailed replay
`agent-decision-log`	WHY-layer logs that complement step-level WHAT logs

The debugging workflow: agent-step-log records the run, agent-debug-replay lets you navigate it, agent-resume lets you re-run from a checkpoint after the fix.

What's Next

Notebook integration: a Jupyter widget that renders the step-through as an interactive cell. Click forward/backward through steps and see state update in-place.

Diff replay: given two run logs (one passing, one failing), highlight the divergence point, the first step where the two runs stop matching. This is a common debugging scenario for regressions.

A web UI: a minimal Flask/FastAPI endpoint that serves a step-through UI in the browser. Load the JSONL, browse steps, click into context. This would be significantly more usable than the programmatic API for non-engineering stakeholders.

Built as part of the agent-stack family: composable Python primitives for production LLM agents.