Build a DVR for AI Agents: Episode Replay UI That Actually Works
Your autonomous agent just spent 47 minutes and $23.50 failing to book a dinner reservation, and all you have is a chat log that says "task completed successfully."
The Problem: Agent Debugging is Archaeology Without Artifacts
Here's what happens when your AI agent breaks in production: you get a support ticket that says "the AI is acting weird" with zero context about what "weird" means. Was it the prompt? The tool selection? The memory retrieval? The moon phase?
Current debugging approaches are embarrassingly primitive:
- Log spelunking: Grepping through thousands of lines hoping to spot the moment everything went sideways
- Vibes-based debugging: Staring at the final output and reverse-engineering what might have gone wrong
- The nuclear option: Adding more logging and hoping the agent fails again
What developers actually need is a DVR for AI agents — the ability to scrub through an episode timeline, see exactly what the agent observed at each step, and understand the decision tree that led to disaster.
The core challenge isn't just capturing the data (though that's hard enough). It's building a replay interface that doesn't make you want to throw your laptop out the window.
Architecture: How Agent DVR Actually Works
graph TB
subgraph "Agent Runtime"
A[LLM Call] --> G[Gateway]
T[Tool Call] --> G
M[Memory Access] --> G
G --> C[Collector]
end
subgraph "Storage Layer"
C --> S[Session Store]
C --> E[Event Store]
S --> D[(SQLite)]
E --> D
end
subgraph "Replay UI"
D --> API[FastAPI Backend]
API --> UI[React Timeline]
UI --> V[Event Viewer]
V --> P[Prompt Inspector]
V --> R[Response Analyzer]
end
UI --> |scrub| API
API --> |fetch events| D
The architecture has three layers:
- Capture Layer: Gateway intercepts every LLM call, tool execution, and memory access
- Storage Layer: Events get indexed by session, timestamp, and type for efficient timeline queries
- Replay Layer: Timeline UI that lets you scrub through the episode and inspect each decision point
The secret sauce is in the event indexing. We're not just dumping logs — we're creating a queryable timeline where you can jump to any point and see the agent's complete context.
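A minimal sketch of what that indexing buys you, using an in-memory SQLite table with the same (session_id, timestamp) index the collector builds later in this post (the sample events and timestamps are made up):

```python
import sqlite3

# In-memory stand-in for the event store, with the same
# (session_id, timestamp) index the collector creates.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    event_id TEXT PRIMARY KEY, session_id TEXT,
    timestamp TEXT, event_type TEXT)""")
conn.execute("CREATE INDEX idx_session_time ON events(session_id, timestamp)")

conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("e1", "s1", "2024-01-01T10:00:00", "llm_call"),
    ("e2", "s1", "2024-01-01T10:00:05", "tool_call"),
    ("e3", "s2", "2024-01-01T10:00:06", "llm_call"),
])

# "Scrub to 10:00:05 in session s1" becomes a range scan on the index,
# not a full-table grep. Other sessions never enter the picture.
result = [row[0] for row in conn.execute(
    "SELECT event_id FROM events WHERE session_id = ? AND timestamp >= ? "
    "ORDER BY timestamp", ("s1", "2024-01-01T10:00:05"))]
print(result)  # ['e2']
```

ISO-8601 timestamps sort lexicographically, which is why storing them as TEXT still supports range queries.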
Implementation: Building the Agent DVR
Step 1: Event Capture Gateway
First, we need to capture agent events in a way that preserves temporal relationships:
from typing import Dict, Any, List, Optional
import json
import sqlite3
from datetime import datetime
from pydantic import BaseModel
import uuid

class AgentEvent(BaseModel):
    event_id: str
    session_id: str
    timestamp: datetime
    event_type: str  # 'llm_call', 'tool_call', 'memory_access'
    input_data: Dict[str, Any]
    output_data: Dict[str, Any]
    context: Dict[str, Any]
    parent_event_id: Optional[str] = None  # links sub-calls to their parent event
class EventCollector:
def __init__(self, db_path: str = "agent_events.db"):
self.db_path = db_path
self.init_db()
def init_db(self):
conn = sqlite3.connect(self.db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS events (
event_id TEXT PRIMARY KEY,
session_id TEXT,
timestamp TEXT,
event_type TEXT,
input_data TEXT,
output_data TEXT,
context TEXT,
parent_event_id TEXT
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_session_time
ON events(session_id, timestamp)
""")
conn.commit()
conn.close()
def record_event(self, event: AgentEvent):
conn = sqlite3.connect(self.db_path)
conn.execute("""
INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
event.event_id,
event.session_id,
event.timestamp.isoformat(),
event.event_type,
json.dumps(event.input_data),
json.dumps(event.output_data),
json.dumps(event.context),
event.parent_event_id
))
conn.commit()
conn.close()
Step 2: Agent Integration Wrapper
Now wrap your existing agent to capture events without changing its core logic:
import functools
from openai import OpenAI
class DVRAgent:
def __init__(self, session_id: str = None):
self.session_id = session_id or str(uuid.uuid4())
self.collector = EventCollector()
self.client = OpenAI()
self.context_stack = []
    # Defined as a plain function (not a method) so it can be used as a
    # decorator further down in the same class body.
    def record_llm_call(func):
@functools.wraps(func)
def wrapper(self, *args, **kwargs):
event_id = str(uuid.uuid4())
start_time = datetime.now()
# Capture input
input_data = {
'model': kwargs.get('model', 'gpt-3.5-turbo'),
'messages': kwargs.get('messages', []),
'temperature': kwargs.get('temperature', 0.7)
}
try:
# Execute the actual call
response = func(self, *args, **kwargs)
# Capture output
output_data = {
'content': response.choices[0].message.content,
                    'usage': response.usage.model_dump() if response.usage else None,
'model': response.model
}
# Record the event
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=start_time,
event_type='llm_call',
input_data=input_data,
output_data=output_data,
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
return response
except Exception as e:
# Record failures too
output_data = {'error': str(e)}
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=start_time,
event_type='llm_call',
input_data=input_data,
output_data=output_data,
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
raise
return wrapper
@record_llm_call
    def chat_completion(self, messages: List[Dict], **kwargs):
        kwargs.setdefault('model', 'gpt-3.5-turbo')  # the API requires an explicit model
        return self.client.chat.completions.create(
            messages=messages,
            **kwargs
        )
def use_tool(self, tool_name: str, tool_args: Dict) -> Dict:
event_id = str(uuid.uuid4())
        # Execute the tool (stubbed in _execute_tool below)
        result = self._execute_tool(tool_name, tool_args)
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=datetime.now(),
event_type='tool_call',
input_data={'tool': tool_name, 'args': tool_args},
output_data={'result': result},
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
return result
def _execute_tool(self, tool_name: str, args: Dict) -> Dict:
# Your actual tool implementations
if tool_name == 'web_search':
return {'results': f"Search results for {args.get('query')}"}
return {'error': f'Unknown tool: {tool_name}'}
Step 3: Timeline API Backend
Build a FastAPI backend that serves timeline data for the UI:
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from typing import List, Optional
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # fine for local dev; lock this down in production
    allow_methods=["*"],
    allow_headers=["*"],
)
collector = EventCollector()
@app.get("/sessions")
async def get_sessions():
conn = sqlite3.connect(collector.db_path)
cursor = conn.execute("""
SELECT session_id, MIN(timestamp) as start_time,
COUNT(*) as event_count
FROM events
GROUP BY session_id
ORDER BY start_time DESC
""")
sessions = []
for row in cursor.fetchall():
sessions.append({
'session_id': row[0],
'start_time': row[1],
'event_count': row[2]
})
conn.close()
return sessions
@app.get("/sessions/{session_id}/timeline")
async def get_timeline(session_id: str, start_time: Optional[str] = None):
conn = sqlite3.connect(collector.db_path)
query = "SELECT * FROM events WHERE session_id = ?"
params = [session_id]
if start_time:
query += " AND timestamp >= ?"
params.append(start_time)
query += " ORDER BY timestamp"
cursor = conn.execute(query, params)
events = []
for row in cursor.fetchall():
events.append({
'event_id': row[0],
'session_id': row[1],
'timestamp': row[2],
'event_type': row[3],
'input_data': json.loads(row[4]),
'output_data': json.loads(row[5]),
'context': json.loads(row[6]),
'parent_event_id': row[7]
})
conn.close()
return {'events': events}
@app.get("/events/{event_id}")
async def get_event_details(event_id: str):
conn = sqlite3.connect(collector.db_path)
cursor = conn.execute("SELECT * FROM events WHERE event_id = ?", [event_id])
row = cursor.fetchone()
if not row:
raise HTTPException(status_code=404, detail="Event not found")
event = {
'event_id': row[0],
'session_id': row[1],
'timestamp': row[2],
'event_type': row[3],
'input_data': json.loads(row[4]),
'output_data': json.loads(row[5]),
'context': json.loads(row[6]),
'parent_event_id': row[7]
}
conn.close()
return event
Step 4: React Timeline Component
The frontend needs a scrub-friendly timeline that doesn't lag with hundreds of events:
import React, { useState, useEffect, useRef } from 'react';
import axios from 'axios';
const AgentDVR = ({ sessionId }) => {
const [events, setEvents] = useState([]);
const [currentEvent, setCurrentEvent] = useState(null);
const [playbackPosition, setPlaybackPosition] = useState(0);
const [isPlaying, setIsPlaying] = useState(false);
const timelineRef = useRef(null);
useEffect(() => {
loadTimeline();
}, [sessionId]);
const loadTimeline = async () => {
try {
const response = await axios.get(`/sessions/${sessionId}/timeline`);
setEvents(response.data.events);
if (response.data.events.length > 0) {
setCurrentEvent(response.data.events[0]);
}
} catch (error) {
console.error('Failed to load timeline:', error);
}
};
  const playbackRef = useRef(null);

  const stopPlayback = () => {
    // Clearing the interval is what actually stops playback;
    // flipping isPlaying alone would leave the timer ticking.
    clearInterval(playbackRef.current);
    setIsPlaying(false);
  };

  const handleTimelineClick = (eventIndex) => {
    stopPlayback();
    setPlaybackPosition(eventIndex);
    setCurrentEvent(events[eventIndex]);
  };

  const playTimeline = () => {
    if (isPlaying) {
      stopPlayback();
      return;
    }
    setIsPlaying(true);
    playbackRef.current = setInterval(() => {
      setPlaybackPosition(prev => {
        const next = prev + 1;
        if (next >= events.length) {
          stopPlayback();
          return prev;
        }
        setCurrentEvent(events[next]);
        return next;
      });
    }, 1000);
  };
return (
<div className="agent-dvr">
<div className="timeline-container">
<div className="playback-controls">
<button onClick={playTimeline}>
{isPlaying ? '⏸️' : '▶️'}
</button>
<span>{playbackPosition + 1} / {events.length}</span>
</div>
<div className="timeline" ref={timelineRef}>
{events.map((event, index) => (
<div
key={event.event_id}
className={`timeline-event ${index === playbackPosition ? 'active' : ''} ${event.event_type}`}
onClick={() => handleTimelineClick(index)}
>
<div className="event-marker" />
<div className="event-label">{event.event_type}</div>
</div>
))}
</div>
</div>
<div className="event-inspector">
{currentEvent && (
<EventInspector event={currentEvent} />
)}
</div>
</div>
);
};
const EventInspector = ({ event }) => {
return (
<div className="event-details">
<h3>{event.event_type} at {new Date(event.timestamp).toLocaleTimeString()}</h3>
<div className="event-section">
<h4>Input</h4>
<pre>{JSON.stringify(event.input_data, null, 2)}</pre>
</div>
<div className="event-section">
<h4>Output</h4>
<pre>{JSON.stringify(event.output_data, null, 2)}</pre>
</div>
{event.context && (
<div className="event-section">
<h4>Context</h4>
<pre>{JSON.stringify(event.context, null, 2)}</pre>
</div>
)}
</div>
);
};
export default AgentDVR;
Pitfalls: What Will Break and How to Handle It
Performance Death by a Thousand Events
When your agent makes 200 LLM calls in a session, the timeline becomes unusable. Solution: implement virtual scrolling and event aggregation. Group rapid-fire events into collapsed sections.
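Event aggregation can be as simple as collapsing consecutive same-type events that arrive within a short window. A sketch — the 2-second threshold and field names are arbitrary choices, tune them to your traffic:

```python
def collapse_bursts(events, gap_seconds=2.0):
    """Group consecutive same-type events separated by < gap_seconds."""
    groups = []
    for event in events:
        last = groups[-1] if groups else None
        if (last and last["type"] == event["type"]
                and event["t"] - last["end"] < gap_seconds):
            last["count"] += 1       # extend the current burst
            last["end"] = event["t"]
        else:
            groups.append({"type": event["type"], "count": 1,
                           "start": event["t"], "end": event["t"]})
    return groups

events = [
    {"type": "llm_call", "t": 0.0},
    {"type": "llm_call", "t": 0.5},
    {"type": "llm_call", "t": 1.2},
    {"type": "tool_call", "t": 1.5},
    {"type": "llm_call", "t": 10.0},  # far enough away to start a new group
]
summary = [(g["type"], g["count"]) for g in collapse_bursts(events)]
print(summary)  # [('llm_call', 3), ('tool_call', 1), ('llm_call', 1)]
```

The timeline then renders one marker per group with a count badge, and expands on click.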
Memory Explosion
Storing full prompt/response pairs for every call will eat your disk. Use a retention policy and compress old events. Keep full detail for the last 48 hours, summaries beyond that.
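A retention sweep can be a single DELETE against the same events table, run on a schedule. A sketch using the 48-hour cutoff from above (in production you'd summarize before deleting rather than just dropping rows):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def prune_events(conn, hours=48):
    """Drop full-detail events older than the retention window."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
    cur = conn.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # how many rows the sweep removed

# Demo with one stale event and one fresh one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, timestamp TEXT)")
old = (datetime.now(timezone.utc) - timedelta(hours=72)).isoformat()
new = datetime.now(timezone.utc).isoformat()
conn.executemany("INSERT INTO events VALUES (?, ?)", [("e1", old), ("e2", new)])

deleted = prune_events(conn)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(deleted, remaining)  # 1 1
```

Pair this with VACUUM (or SQLite's auto_vacuum) if you actually need the disk space back.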
Context Loss
The hardest part isn't capturing events — it's preserving enough context to understand WHY each decision was made. Always include the agent's memory state and reasoning chain, not just the I/O.
Race Conditions in Event Ordering
Concurrent tool calls can arrive out of order. Use high-resolution timestamps and implement event reordering on the backend. Don't trust arrival order.
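One way to sketch that reordering: give each event a per-process sequence number at capture time and sort on (timestamp, seq) at read time. The field names here are hypothetical, not part of the schema above:

```python
def order_events(events):
    """Sort by timestamp, breaking timestamp ties with a sequence number."""
    return sorted(events, key=lambda e: (e["timestamp"], e["seq"]))

# Two tool calls share a millisecond timestamp and arrived reversed;
# a third, earlier event arrived last of all.
arrived = [
    {"event_id": "b", "timestamp": "2024-01-01T10:00:00.001", "seq": 2},
    {"event_id": "a", "timestamp": "2024-01-01T10:00:00.001", "seq": 1},
    {"event_id": "c", "timestamp": "2024-01-01T10:00:00.000", "seq": 3},
]
ordered = [e["event_id"] for e in order_events(arrived)]
print(ordered)  # ['c', 'a', 'b']
```

A monotonic counter per producer is cheap insurance; wall-clock timestamps alone will betray you under concurrency.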
UI Performance with Large Timelines
React will choke on 1000+ timeline elements. Implement windowing — only render visible events plus a buffer. Libraries like react-window are your friend.
Measurement: How to Know It's Working
Your agent DVR is successful when:
- Time to root cause drops below 2 minutes — You can identify the failing step without reading logs
- Debugging becomes collaborative — You can share a session URL with your team instead of copying/pasting logs
- Patterns emerge — You start noticing repeated failure modes that weren't obvious in raw logs
- Agent performance improves — Better debugging leads to better prompts and tool selection
Track these metrics:
- Average session replay views per debugging session
- Time from bug report to root cause identification
- Number of "I can't reproduce this" tickets (should go to zero)
Next Steps: Try the Full Implementation
The code above is a working foundation, but production requires more sophistication. Airblackbox provides this out of the box with zero configuration — just point your agent at our gateway and you get:
- Automatic event capture for LangChain, CrewAI, AutoGen
- Production-ready timeline UI with search and filtering
- Session sharing and team collaboration
- EU AI Act compliance scanning (6/6 technical checks passing)
Ready to stop debugging AI agents with vibes?
Try the live demo: github.com/airblackbox/agent-dvr
Or spin up the full observability platform: docs.airblackbox.com/quickstart
Your future self — the one debugging agents at 2 AM — will thank you.
Because elegant systems are nice. Observable systems are nicer. Observable systems with DVR replay are how you stop your agents from becoming very expensive goldfish.