decker

How to Debug Multi-Agent AI Systems: Session Replay for LLM Workflows

You just watched your AI agent make a terrible decision. It sent the wrong email. It queried the database with the wrong filter. It hallucinated a fact and ran with it.

Now you have to figure out why.

Traditional debugging is hard enough. With LLMs it gets much worse, because every interaction is non-deterministic: re-running the same prompt with the same input can produce a different output. The bug is gone. You're back to square one.

The Problem: AI Debugging is Broken

Here's what happens in most teams right now:

  1. Agent makes a mistake → You get an error message or wrong output
  2. You try to reproduce it → The agent behaves fine this time (different LLM response)
  3. You add logging → You sprinkle console.log() or similar everywhere
  4. You trace execution → Manually follow the decision tree to find where it diverged
  5. You're still lost → What was the exact prompt? What was the LLM thinking?

This is where session replay comes in.

Session Replay: Record Everything, Debug Anything

The core idea is simple: record every decision point in your AI workflow, then replay it to understand what happened.

What you capture:

  • Every LLM prompt (exact text sent to Claude/GPT)
  • Every LLM response (with token counts and log probabilities, if the provider exposes them)
  • Every tool invocation (what your agent called, what it got back)
  • Every decision (why the agent chose path A over path B)
  • Code changes (what the agent actually modified in your codebase)
  • Time travel (jump to any point in the session and inspect state)
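Concretely, each of these capture points becomes a timestamped event in an ordered log. A minimal sketch of what one recorded step might look like (the field names and the tool name are illustrative, not from any specific product):

```python
from datetime import datetime, timezone

# One hypothetical recorded step: the agent calls a tool and the recorder
# stores the call, its result, a timestamp, and an ordering index.
event = {
    "index": 3,  # position within the session, used for time travel
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event_type": "tool_call",
    "data": {
        "tool": "query_overdue_invoices",      # illustrative tool name
        "arguments": {"days_overdue": 30},
        "result_preview": ["customer_1", "customer_2"],
    },
}

# A session is just an ordered list of such events;
# replay means iterating them in order, time travel means slicing by index.
session = [event]
```

Everything else (diffs, decisions, prompts) fits the same shape: a type, a timestamp, and a payload.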

Real Example: The Email Bug

Your agent is supposed to send payment reminders. Yesterday it sent 500 emails to the wrong customers.

Without replay: You manually trace through logs, reconstruct what happened, add a fix, and hope it works.

With replay: You jump to the exact moment the agent decided who to email. You see:

  • The prompt: "Send payment reminders to customers with overdue invoices"
  • The LLM response: ["customer_1", "customer_2", ...] (the bad list)
  • Why it was bad: The agent queried overdue_invoices table but didn't filter by active=true
  • The fix: Add one more constraint to the prompt or tool definition

Debugging time: 10 minutes instead of 2 hours.
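The fix surfaced by the replay can be as small as one added filter. A sketch, assuming a SQL-backed tool behind the agent's recipient lookup (the table and column names come from the example above; the helper function is hypothetical):

```python
# Hypothetical query behind the agent's "who to email" tool.
# The bug: no active filter, so inactive customers were included.
BUGGY_QUERY = "SELECT customer_id FROM overdue_invoices"

# The one-line fix the replay pointed to:
FIXED_QUERY = "SELECT customer_id FROM overdue_invoices WHERE active = true"

def select_recipients(rows: list[dict]) -> list[str]:
    """In-memory equivalent of the fixed query, for illustration."""
    return [r["customer_id"] for r in rows if r["active"]]

rows = [
    {"customer_id": "customer_1", "active": True},
    {"customer_id": "customer_2", "active": False},  # must NOT be emailed
]
assert select_recipients(rows) == ["customer_1"]
```

The same constraint could instead be pushed into the prompt or the tool's schema; the point is that replay shows you exactly which layer to fix.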

Why This Matters for Teams

1. Faster Debugging

Instead of: "Let me add more logging and re-run this"

You get: "Let me jump back to the decision point and inspect the exact LLM reasoning"

2. Knowledge Preservation

When one engineer debugs an agent issue, they can save that session as a reference. Other engineers can replay it and learn.

3. Training Your Agents

Replay successful agent interactions to train new ones. Share "how agent X solved this problem" as a replayable workflow.

4. Audit Compliance

For regulated industries (fintech, healthcare), replay gives you a full audit trail: what the agent decided, why, and when.

How to Build Session Replay

Here's a minimal implementation:

from dataclasses import dataclass
from datetime import datetime
from typing import Any
import json

@dataclass
class SessionEvent:
    timestamp: datetime
    event_type: str  # "prompt", "response", "decision", "tool_call"
    data: dict[str, Any]

class SessionRecorder:
    def __init__(self):
        self.events = []

    def record_prompt(self, prompt: str, model: str):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="prompt",
            data={"prompt": prompt, "model": model}
        ))

    def record_response(self, response: str, tokens: int | None = None):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="response",
            data={"response": response, "tokens": tokens}
        ))

    def record_decision(self, decision: str, reasoning: str):
        self.events.append(SessionEvent(
            timestamp=datetime.now(),
            event_type="decision",
            data={"decision": decision, "reasoning": reasoning}
        ))

    def replay(self, from_index: int = 0):
        """Replay from a specific point in the session"""
        return self.events[from_index:]

    def export(self) -> str:
        return json.dumps([
            {
                "timestamp": e.timestamp.isoformat(),
                "type": e.event_type,
                **e.data
            }
            for e in self.events
        ], indent=2)

# Usage
recorder = SessionRecorder()
recorder.record_prompt("Summarize this article", model="gpt-4")
recorder.record_response("The article discusses...")
recorder.record_decision("send_email", reasoning="User asked for summary, article is relevant")

# Later: replay the exact sequence
print(recorder.export())

But real session replay needs more:

  • Distributed tracing (across multiple agents/services)
  • Time-travel debugging (inspect state at any point)
  • Full code diffs (what actually changed in your codebase)
  • Search (find sessions that match a pattern)
  • Sharing (send a replay to a teammate)

The Tools Landscape

MCP (Model Context Protocol) handles agent-to-tool communication, but doesn't record sessions.

Agent frameworks (Bee, Claude-based agents) capture some context, but not in a replayable format.

Session replay for AI is still new. The best tools right now are:

  • Mantra — Full session replay for AI workflows (open source, self-hostable)
  • PostHog — Product analytics + session recordings (but not AI-specific)
  • Datadog — Enterprise observability (heavy, expensive)
  • LangSmith — LLM observability (deepest integration with LangChain)

Next Steps

If you're building multi-agent systems, start here:

  1. Instrument your agents — Log every LLM call, tool invocation, and decision
  2. Store events sequentially — Timestamp everything, keep the order
  3. Build a replay viewer — Let engineers jump to any point and inspect state
  4. Share replays — Make it easy to send a session to a teammate for debugging
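Step 1 often needs no changes at call sites: wrap your LLM call in a decorator that records both sides automatically. A minimal sketch using a plain list as the event store (`call_llm` is a hypothetical stand-in for a real client call):

```python
import functools

def recorded(events: list):
    """Decorator: log the prompt before the call and the response after."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(prompt: str, model: str, **kwargs):
            events.append({"type": "prompt", "prompt": prompt, "model": model})
            response = fn(prompt, model=model, **kwargs)
            events.append({"type": "response", "response": response})
            return response
        return inner
    return wrap

events: list[dict] = []

@recorded(events)
def call_llm(prompt: str, model: str) -> str:
    # Stand-in for a real API call (e.g. an OpenAI or Anthropic client).
    return f"[{model}] echo: {prompt}"

call_llm("Summarize this article", model="gpt-4")
assert [e["type"] for e in events] == ["prompt", "response"]
```

The same pattern extends to tool invocations and decisions; once every boundary goes through one wrapper, the ordered event list in step 2 falls out for free.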

The difference between blind debugging and session replay debugging is the difference between guessing in the dark and having a video recording of exactly what happened.


Have you hit this problem with your AI agents? How do you currently debug them? Drop a comment below — I'd love to hear your approach.
