I can now replay any AI agent stream from production. Here's how.

Abhishek Chatterjee
In my last post, I wrote about the four SSE bugs that break AI agent UIs at 2am — chunk boundary splits, missing token batching, hanging done states, and retry logic that retries the wrong things.

There's a fifth problem I didn't cover, because the fix didn't exist yet.

What do you do the morning after something broke?

The stream is gone. The event sequence that caused the bug evaporated the moment the connection closed. You have a user complaint, maybe a generic error log, and zero ability to reproduce the issue locally because local dev doesn't have real network conditions, real token rates, or the specific sequence of tool calls that triggered the failure.

Today I shipped AgentStreamRecorder to the agent-stream library to solve exactly this.

The debugging gap nobody talks about
When a REST API fails, you have the request and response in your logs. You can replay it with curl. You can write a regression test. The failure is reproducible.

When an AI agent stream fails mid-flight, you have nothing. The SSE connection is stateful and ephemeral. The events exist in a buffer, get consumed by the client, and are gone. If the frontend shows wrong state — tools that didn't clear, progress that froze at 60%, text that truncated — you're debugging from memory and screenshots.

We hit this at Praxiom repeatedly while building 36 production agent tools. The failure pattern was always the same:

  1. User reports the stream "felt wrong" or the UI got stuck
  2. You try to reproduce — works fine in dev
  3. You add more logging — doesn't help, the issue is in the event sequence, not individual events
  4. You fix something and hope — no regression test possible
  5. Issue reappears three weeks later in a different context

The root problem: we had full observability on every layer of the stack except the stream itself.

What AgentStreamRecorder does
It's a drop-in async wrapper. You add two lines to your existing endpoint and every stream gets saved to a .jsonl file automatically.

Before:

@app.post("/chat")
async def chat(req: ChatRequest):
    return agent_stream_response(run_agent(req.message))

After:

from agent_stream.recorder import AgentStreamRecorder

recorder = AgentStreamRecorder("streams/production.jsonl")

@app.post("/chat")
async def chat(req: ChatRequest):
    async def generate():
        async for sse_str in recorder.record(run_agent(req.message)):
            yield sse_str  # passes through unchanged

    return agent_stream_response(generate())

That's it. The recorder wraps the async generator, tees each SSE event to the file, and re-yields the string unchanged. Your StreamingResponse sees nothing different, the client sees nothing different, and the added latency is negligible.
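To make the tee-and-pass-through idea concrete, here is a minimal sketch of how a wrapper like this can work. This is an illustrative class, not the library's actual source; the real implementation is in agent_stream.recorder.

```python
import json
import time
from typing import AsyncIterator


class SketchRecorder:
    """Hypothetical illustration of a tee-style stream recorder."""

    def __init__(self, path: str):
        self.path = path

    async def record(self, stream: AsyncIterator[str]) -> AsyncIterator[str]:
        start = time.monotonic()
        with open(self.path, "a") as f:
            async for sse_str in stream:
                # relative timestamp in seconds since stream start
                line = {"t": round(time.monotonic() - start, 3), "raw": sse_str}
                f.write(json.dumps(line) + "\n")
                f.flush()       # flush per line so a mid-stream crash loses nothing
                yield sse_str   # pass the event through unchanged
```

The key property is that the wrapper never buffers or transforms events: each string is written, flushed, and yielded in one pass, so the downstream consumer is unaffected.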

What gets recorded
Each session in the .jsonl file starts with a header line, then one line per event:

{"session": "f3a2c1b0-...", "started_at": "2026-03-31T02:14:00+00:00", "t": 0}
{"t": 0.0,   "event": "token",      "data": {"text": "Hello, here"}}
{"t": 0.041, "event": "token",      "data": {"text": " is what I found"}}
{"t": 0.052, "event": "tool_use",   "data": {"tool_name": "web_search", "status": "running"}}
{"t": 0.894, "event": "tool_result","data": {"tool_name": "web_search", "duration_ms": 842, "status": "done"}}
{"t": 1.204, "event": "done",       "data": {"num_turns": 1, "tool_count": 1, "duration_ms": 1204}}

The t field is seconds since stream start — relative, not absolute — so files are portable across machines and time zones. The format is append-safe: each new session appends to the file with its own UUID and t=0 baseline, so you can keep one file per day and scan it with grep.
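Because each session starts with its own header line, splitting a day's file back into per-session event lists is a few lines of Python. A sketch, assuming the header line is the one carrying a "session" key (as in the sample above):

```python
import json
from typing import Iterable


def split_sessions(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group a multi-session .jsonl recording into per-session event lists."""
    sessions: dict[str, list[dict]] = {}
    current = None
    for raw in lines:
        rec = json.loads(raw)
        if "session" in rec:        # header line opens a new session
            current = rec["session"]
            sessions[current] = []
        elif current is not None:   # event line belongs to the open session
            sessions[current].append(rec)
    return sessions
```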

Three design decisions worth explaining:

.jsonl not binary. You can grep it. grep '"event": "error"' production.jsonl instantly shows every stream that hit an error event, with timing. Binary formats are faster to write but terrible to investigate.

Relative timestamps. Absolute wall-clock timestamps tell you when something happened. Relative timestamps tell you how long it took. Relative is almost always what you need for debugging — "tool_result came back 842ms after tool_use" is more useful than two UTC timestamps you have to subtract.

Flush after every write. The file is flushed after every line, not buffered. If the process crashes mid-stream, you still have everything up to the last event. This matters — crashes are exactly when you most need the recording.

Replaying it
The agent-stream CLI (installed with the package) plays back recordings:

# What sessions are in this file?
agent-stream replay production.jsonl --list

SESSION                                STARTED                   EVENTS   DURATION  TYPES
--------------------------------------------------------------------------------------------------
f3a2c1b0-4e5d-...                     2026-03-31T02:14:00       8         4.21s     token tool_use tool_result done
a8b3c2d1-5f6e-...                     2026-03-31T02:31:45       12        7.83s     token thinking tool_use tool_result turn done

# Replay the most recent session at original speed
agent-stream replay production.jsonl

# Replay at 2× speed (50ms gaps become 25ms)
agent-stream replay production.jsonl --speed 2

# Replay at 0.1× speed to watch a fast tool sequence in slow motion
agent-stream replay production.jsonl --speed 0.1

The output is valid SSE piped to stdout. You can pipe it to a local dev server, feed it into a test harness, or just watch it in the terminal to see exactly what the client received.
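The core of replay-with-timing is simple: re-emit each recorded event after the original relative gap, divided by the speed factor. A sketch using the file format above (the function is illustrative, not the CLI's source):

```python
import json
import time
from typing import Iterable, Iterator


def replay(lines: Iterable[str], speed: float = 1.0) -> Iterator[dict]:
    """Yield recorded events with their original relative gaps, scaled by speed."""
    prev_t = 0.0
    for raw in lines:
        rec = json.loads(raw)
        if "session" in rec:    # header line: reset the clock for a new session
            prev_t = 0.0
            continue
        gap = (rec["t"] - prev_t) / speed   # --speed 2 halves every gap
        if gap > 0:
            time.sleep(gap)
        prev_t = rec["t"]
        yield rec
```

Because the timestamps are relative, the same file replays identically on any machine; only the speed factor changes the pacing.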

The use case that made this necessary
Here's the specific failure pattern that pushed us to build this. We had a multi-turn agent that called several tools in sequence. In production with certain inputs, the activeTools array in the React hook wasn't clearing properly — a tool would finish but its name would stay in the "currently running" UI indefinitely.

Couldn't reproduce it locally. The tool sequence was always different. Adding console logs to the hook showed correct state at each individual event, but the sequence mattered.

With AgentStreamRecorder, we recorded the failing session from production, replayed it through a local frontend, and watched the hook state update in real-time. Spotted the issue immediately: a tool_result event was arriving before the matching tool_use in one specific sequence, because two tools were running in parallel and the faster one resolved first. The hook was looking for tool names in order; the order wasn't guaranteed.

Fifteen-minute fix. Would have taken days without the recording.

The bigger pattern: AI agents need a different kind of observability
Standard application observability — request logs, error rates, latency percentiles — doesn't map cleanly onto AI agent streams. The unit of interest isn't a request, it's a session. The signal isn't an error code, it's an event sequence. The failure mode isn't a stack trace, it's a state machine reaching an unexpected state.

AgentStreamRecorder is a small step toward stream-native observability. Each .jsonl session is a complete, replayable trace of exactly what the agent did and how long each step took. You can diff two sessions to understand why one succeeded and one failed. You can build a test suite from real production recordings. You can grep across thousands of sessions to understand behavioral patterns at scale.

None of this is revolutionary. It's the kind of thing you'd take for granted in any mature system — request tracing, structured logging, replay-from-log. We just haven't had it for agent streams.

Getting it
pip install agent-event-stream
from agent_stream.recorder import AgentStreamRecorder

Full source, spec, and React client hook at github.com/abhichat85/agent-stream.

Extracted from Praxiom — Product Intelligence that compounds. Think Cursor for Product Managers.
