You just watched your AI agent make a terrible decision. It sent the wrong email. It queried the database with the wrong filter. It hallucinated a fact and ran with it.
Now you have to figure out why.
Traditional debugging is hard enough. With LLMs it gets worse, because every interaction is non-deterministic: re-running the same prompt with the same input can produce a different output. The bug is gone, and you're back to square one.
The Problem: AI Debugging is Broken
Here's what happens in most teams right now:
- Agent makes a mistake → you get an error message or wrong output
- You try to reproduce it → the agent behaves fine this time (different LLM response)
- You add logging → you sprinkle `console.log()` or similar everywhere
- You trace execution → you manually follow the decision tree to find where it diverged
- You're still lost → what was the exact prompt? What was the LLM thinking?
This is where session replay comes in.
Session Replay: Record Everything, Debug Anything
The core idea is simple: record every decision point in your AI workflow, then replay it to understand what happened.
What you capture:
- Every LLM prompt (exact text sent to Claude/GPT)
- Every LLM response (plus token counts and log probabilities, if the API exposes them)
- Every tool invocation (what your agent called, what it got back)
- Every decision (why the agent chose path A over path B)
- Code changes (what the agent actually modified in your codebase)
- State snapshots (so you can time travel: jump to any point in the session and inspect state)
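Concretely, each captured item can be as simple as a timestamped record appended to an ordered log. A minimal sketch — the field names here are illustrative, not a standard schema:

```python
import json
from datetime import datetime, timezone

# One recorded decision point -- field names are illustrative, not a standard schema.
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event_type": "tool_call",  # prompt | response | tool_call | decision
    "data": {
        "tool": "query_database",
        "args": {"table": "overdue_invoices"},
        "result_count": 500,
    },
}

# One JSON object per line gives an ordered, replayable log.
with open("session.jsonl", "a") as f:
    f.write(json.dumps(event) + "\n")
```

Append-only JSON Lines is enough to start: order is preserved, and any tool can read it back.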
Real Example: The Email Bug
Your agent is supposed to send payment reminders. Yesterday it sent 500 emails to the wrong customers.
Without replay: You manually trace through logs, reconstruct what happened, add a fix, and hope it works.
With replay: You jump to the exact moment the agent decided who to email. You see:
- The prompt: "Send payment reminders to customers with overdue invoices"
- The LLM response: `["customer_1", "customer_2", ...]` (the bad list)
- Why it was bad: the agent queried the `overdue_invoices` table but didn't filter by `active=true`
- The fix: add one more constraint to the prompt or tool definition
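The fix can live in the tool layer rather than the prompt. A hedged sketch, assuming a hypothetical `get_overdue_customers` tool backed by the `overdue_invoices` table from the example:

```python
# Hypothetical tool definition -- table and column names match the example above.
def get_overdue_customers(db) -> list[str]:
    rows = db.execute(
        """
        SELECT customer_id
        FROM overdue_invoices
        WHERE active = true  -- the missing constraint that caused the bad email list
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Hard-coding the constraint in the query is more robust than asking the LLM to remember it on every run.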
Time saved: 2 hours vs 10 minutes.
Why This Matters for Teams
1. Faster Debugging
Instead of: "Let me add more logging and re-run this"
You get: "Let me jump back to the decision point and inspect the exact LLM reasoning"
2. Knowledge Preservation
When one engineer debugs an agent issue, they can save that session as a reference. Other engineers can replay it and learn.
3. Training Your Agents
Replay successful agent interactions to train new ones. Share "how agent X solved this problem" as a replayable workflow.
4. Audit Compliance
For regulated industries (fintech, healthcare), replay gives you a full audit trail: what the agent decided, why, and when.
How to Build Session Replay
Here's a minimal implementation:
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any
import json


@dataclass
class SessionEvent:
    timestamp: datetime
    event_type: str  # "prompt", "response", "decision", "tool_call"
    data: dict[str, Any]


class SessionRecorder:
    def __init__(self):
        self.events: list[SessionEvent] = []

    def record_prompt(self, prompt: str, model: str):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="prompt",
            data={"prompt": prompt, "model": model},
        ))

    def record_response(self, response: str, tokens: int | None = None):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="response",
            data={"response": response, "tokens": tokens},
        ))

    def record_decision(self, decision: str, reasoning: str):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="decision",
            data={"decision": decision, "reasoning": reasoning},
        ))

    def replay(self, from_index: int = 0):
        """Replay from a specific point in the session."""
        return self.events[from_index:]

    def export(self) -> str:
        return json.dumps([
            {
                "timestamp": e.timestamp.isoformat(),
                "type": e.event_type,
                **e.data,
            }
            for e in self.events
        ], indent=2)


# Usage
recorder = SessionRecorder()
recorder.record_prompt("Summarize this article", model="gpt-4")
recorder.record_response("The article discusses...")
recorder.record_decision("send_email", reasoning="User asked for summary, article is relevant")

# Later: replay the exact sequence
print(recorder.export())
```
But real session replay needs more:
- Distributed tracing (across multiple agents/services)
- Time-travel debugging (inspect state at any point)
- Full code diffs (what actually changed in your codebase)
- Search (find sessions that match a pattern)
- Sharing (send a replay to a teammate)
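Of those, search is the cheapest to prototype: once events are structured, finding sessions that match a pattern is a filter. A minimal sketch over dict-shaped events like the ones the recorder exports:

```python
from typing import Any


def find_events(events: list[dict[str, Any]], event_type: str, needle: str) -> list[dict[str, Any]]:
    """Return events of a given type whose data mentions the search string."""
    return [
        e for e in events
        if e["event_type"] == event_type and needle in str(e["data"])
    ]


# Example session -- two events, one matching tool call.
session = [
    {"event_type": "prompt", "data": {"prompt": "Send payment reminders"}},
    {"event_type": "tool_call", "data": {"tool": "query_database", "table": "overdue_invoices"}},
]
hits = find_events(session, "tool_call", "overdue_invoices")
```

A real implementation would index events in a database, but a linear scan is plenty for a first version.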
The Tools Landscape
MCP (Model Context Protocol) handles agent-to-tool communication, but doesn't record sessions.
Bee/Claude agents capture some context, but not in a replayable format.
Session replay for AI is still new. The best tools right now are:
- Mantra — Full session replay for AI workflows (open source, self-hostable)
- PostHog — Product analytics + session recordings (but not AI-specific)
- Datadog — Enterprise observability (heavy, expensive)
- LangSmith — LLM observability (limited to LangChain)
Next Steps
If you're building multi-agent systems, start here:
- Instrument your agents — Log every LLM call, tool invocation, and decision
- Store events sequentially — Timestamp everything, keep the order
- Build a replay viewer — Let engineers jump to any point and inspect state
- Share replays — Make it easy to send a session to a teammate for debugging
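Step 1 can often be done without touching agent logic, by wrapping the LLM client. A sketch, assuming a client object with a hypothetical `complete(prompt)` method:

```python
from datetime import datetime, timezone


class RecordingLLM:
    """Wraps any LLM client so every call is logged without changing agent code.

    Assumes the wrapped client exposes a complete(prompt) -> str method
    (hypothetical -- adapt to your client's actual API).
    """

    def __init__(self, client, events: list):
        self.client = client
        self.events = events

    def complete(self, prompt: str) -> str:
        self.events.append({"ts": datetime.now(timezone.utc).isoformat(),
                            "type": "prompt", "prompt": prompt})
        response = self.client.complete(prompt)
        self.events.append({"ts": datetime.now(timezone.utc).isoformat(),
                            "type": "response", "response": response})
        return response


# Usage with a stub client standing in for a real API:
class StubClient:
    def complete(self, prompt: str) -> str:
        return "stub response"


events: list = []
llm = RecordingLLM(StubClient(), events)
llm.complete("Summarize this article")
# events now holds an ordered prompt/response pair ready for replay
```

The agent code keeps calling `complete()` as before; the wrapper handles recording.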
The difference between blind debugging and session replay debugging is the difference between guessing in the dark and having a video recording of exactly what happened.
Have you hit this problem with your AI agents? How do you currently debug them? Drop a comment below — I'd love to hear your approach.