Jason Shotwell

Posted on Mar 13

Build a DVR for AI Agents: Episode Replay UI That Actually Works

#airblackbox #aidebugging #observability #tutorial


Your AI agent just burned through $47 in tokens on a conversation that went sideways at message 23 of 847, and now you're debugging it by scrolling through terminal logs like it's 1987.

The Problem: Debugging AI Agents is Archaeological Work

Here's what happens when your CrewAI agent decides to have an existential crisis:

[2024-01-15 14:32:17] Agent: ResearchAgent - Starting task: analyze_market_trends
[2024-01-15 14:32:18] LLM Call: system_prompt="You are a research agent..." 
[2024-01-15 14:32:19] Response: "I'll analyze the market trends..."
[2024-01-15 14:32:20] Agent: ResearchAgent - Task completed
[2024-01-15 14:32:21] Agent: WritingAgent - Starting task: write_report
... 
[847 more lines of this archaeological nightmare]
...
[2024-01-15 14:47:32] Agent: WritingAgent - Error: Token limit exceeded

You know something went wrong around message 23. You know the agent started hallucinating about cryptocurrency trends that don't exist. You know it burned through your token budget faster than a teenager burns through data.

What you don't know is why, because debugging multi-agent conversations with grep and prayer isn't debugging—it's divination.

The existing solutions are equally painful:

Terminal logs: Great if you enjoy archaeological digs through text
Static dashboards: Show you averages, not the specific moment everything went wrong
OpenAI playground: Only shows individual API calls, not agent conversations
Manual instrumentation: Because what every debugging session needs is more code to debug

You need a DVR for AI agents. Rewind to the exact moment. Scrub through the conversation timeline. See what the agent was thinking, what context it had, and where it decided to go rogue.

Architecture: How Agent DVR Actually Works

Here's how we build a proper replay system that doesn't suck:

graph TD
    A[AI Agent] --> B[Airblackbox Gateway]
    B --> C[OpenAI API]
    B --> D[Telemetry Store]

    D --> E[Timeline API]
    E --> F[React UI]

    F --> G[Episode List]
    F --> H[Timeline Scrubber]
    F --> I[Context Inspector]

    G --> J[Agent Conversations]
    H --> K[Message Navigation]
    I --> L[Token Analysis]

    subgraph "DVR Components"
        G
        H
        I
    end

    subgraph "Data Flow"
        B --> |"Records every LLM call"| D
        D --> |"Queries by episode"| E
        E --> |"Serves timeline data"| F
    end

The magic happens in three layers:

Capture Layer: Gateway intercepts every LLM call, preserving full context
Storage Layer: Time-indexed episode data with conversation threading
Replay Layer: Timeline UI that lets you scrub through agent decisions like a video
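To make the three layers concrete, here is a minimal in-memory sketch. It is not how Airblackbox is implemented; `EpisodeRecorder`, `RecordedCall`, `capture`, and `replay` are hypothetical names. Capturing appends a call, storage keeps each episode sorted by timestamp (with an arrival counter to break ties), and replay walks the calls in order.

```python
from bisect import insort
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Iterator, List, Optional


@dataclass(order=True)
class RecordedCall:
    # Only timestamp and seq take part in ordering, giving a time index per episode
    timestamp: datetime
    seq: int
    agent_name: str = field(compare=False)
    prompt: str = field(compare=False)
    response: str = field(compare=False)


class EpisodeRecorder:
    """Hypothetical in-memory stand-in for the capture, storage, and replay layers."""

    def __init__(self) -> None:
        self._episodes: Dict[str, List[RecordedCall]] = defaultdict(list)
        self._seq = 0

    def capture(self, episode_id: str, agent_name: str, prompt: str,
                response: str, timestamp: Optional[datetime] = None) -> None:
        # Capture layer: record one LLM call with its full prompt and response
        self._seq += 1
        call = RecordedCall(timestamp or datetime.now(timezone.utc), self._seq,
                            agent_name, prompt, response)
        # Storage layer: keep each episode sorted by (timestamp, arrival order)
        insort(self._episodes[episode_id], call)

    def replay(self, episode_id: str) -> Iterator[RecordedCall]:
        # Replay layer: walk the episode in time order, like scrubbing a video
        yield from self._episodes.get(episode_id, [])
```

Out-of-order arrivals (common when several agents run concurrently) still replay in time order, which is what makes scrubbing meaningful.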

Implementation: Building the Agent DVR

Step 1: Set Up the Recording Infrastructure

First, install Airblackbox to start capturing your agent conversations:

pip install airblackbox

Configure the gateway to record everything your agents do:

# agent_dvr_setup.py
import os
from airblackbox import configure_gateway
from openai import OpenAI

# Start recording all LLM calls
configure_gateway(
    api_key=os.getenv("OPENAI_API_KEY"),
    endpoint="http://localhost:8080",  # Gateway endpoint
    record_all=True,
    session_tags=["agent_episode", "production"]
)

# Your agents now route through the recording gateway
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="http://localhost:8080/v1"  # Gateway intercepts here
)
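Conceptually, the gateway sits between your code and the model: it forwards each request, then records the request, the response, and timing before handing the response back. A minimal sketch of that capture step as a plain Python wrapper, assuming a chat function that takes a list of `{"role", "content"}` dicts and returns text. `record_calls` and `fake_llm` are hypothetical; a proxy like the gateway does this at the HTTP level instead of wrapping Python functions.

```python
import time
from typing import Callable, Dict, List

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]


def record_calls(chat_fn: ChatFn, log: List[dict], agent_name: str) -> ChatFn:
    """Wrap chat_fn so every call is appended to log with its full context."""

    def wrapped(messages: List[Message]) -> str:
        started = time.monotonic()
        response = chat_fn(messages)
        log.append({
            "agent_name": agent_name,
            # Copy the request so later mutation by the agent can't rewrite history
            "request": [dict(m) for m in messages],
            "response": response,
            "latency_s": time.monotonic() - started,
        })
        return response

    return wrapped


def fake_llm(messages: List[Message]) -> str:
    # Stand-in for a real model call so the sketch runs offline
    return f"echo: {messages[-1]['content']}"
```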

Step 2: Create the Episode Data Model

Define how we structure agent conversations for timeline replay:

# episode_model.py
from dataclasses import dataclass
from datetime import datetime
from typing import List, Dict, Any, Optional
from enum import Enum

class MessageType(Enum):
    SYSTEM = "system"
    USER = "user" 
    ASSISTANT = "assistant"
    TOOL_CALL = "tool_call"
    TOOL_RESPONSE = "tool_response"

@dataclass
class AgentMessage:
    timestamp: datetime
    agent_name: str
    message_type: MessageType
    content: str
    token_count: int
    cost: float
    metadata: Dict[str, Any]

@dataclass
class AgentEpisode:
    episode_id: str
    start_time: datetime
    end_time: Optional[datetime]
    total_cost: float
    total_tokens: int
    agent_names: List[str]
    messages: List[AgentMessage]
    status: str  # "running", "completed", "failed"

    @property
    def duration(self) -> float:
        """Duration in seconds"""
        if not self.end_time:
            return (datetime.now() - self.start_time).total_seconds()
        return (self.end_time - self.start_time).total_seconds()
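The summary fields on `AgentEpisode` (`total_tokens`, `total_cost`, `agent_names`, and the start/end times) can all be derived from `messages`, so one option is to compute them when an episode closes rather than tracking them separately. A minimal sketch; `summarize_messages` is a hypothetical helper, and `AgentMessage` is trimmed and redefined here so the snippet runs on its own.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class AgentMessage:
    # Trimmed copy of the Step 2 model (message_type and metadata omitted)
    timestamp: datetime
    agent_name: str
    content: str
    token_count: int
    cost: float


def summarize_messages(messages: List[AgentMessage]) -> Dict[str, Any]:
    """Compute the AgentEpisode summary fields from its messages."""
    agent_names: List[str] = []
    for m in messages:
        if m.agent_name not in agent_names:  # keep first-appearance order
            agent_names.append(m.agent_name)
    return {
        "total_tokens": sum(m.token_count for m in messages),
        "total_cost": round(sum(m.cost for m in messages), 6),
        "agent_names": agent_names,
        "start_time": min((m.timestamp for m in messages), default=None),
        "end_time": max((m.timestamp for m in messages), default=None),
    }
```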

Step 3: Build the Timeline API

Create the backend that serves episode data to your DVR interface:

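# timeline_api.py
import json
import sqlite3
from datetime import datetime
from typing import Dict, List, Optional

from fastapi import FastAPI, HTTPException

from episode_model import AgentEpisode, AgentMessage, MessageType

app = FastAPI()

class EpisodeStore:
    def __init__(self, db_path: str = "agent_episodes.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS episodes (
                episode_id TEXT PRIMARY KEY,
                start_time TEXT,
                end_time TEXT,
                total_cost REAL,
                total_tokens INTEGER,
                agent_names TEXT,
                status TEXT
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS messages (
                id INTEGER PRIMARY KEY,
                episode_id TEXT,
                timestamp TEXT,
                agent_name TEXT,
                message_type TEXT,
                content TEXT,
                token_count INTEGER,
                cost REAL,
                metadata TEXT,
                FOREIGN KEY (episode_id) REFERENCES episodes (episode_id)
            )
        """)
        conn.commit()
        conn.close()

    @staticmethod
    def _row_to_episode(row, messages: List[AgentMessage]) -> AgentEpisode:
        """Build an AgentEpisode from an episodes-table row"""
        return AgentEpisode(
            episode_id=row[0],
            start_time=datetime.fromisoformat(row[1]),
            end_time=datetime.fromisoformat(row[2]) if row[2] else None,
            total_cost=row[3],
            total_tokens=row[4],
            agent_names=row[5].split(",") if row[5] else [],
            messages=messages,
            status=row[6]
        )

    def get_episodes(self, limit: int = 50) -> List[AgentEpisode]:
        """Get recent episodes for the episode list"""
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.execute("""
                SELECT episode_id, start_time, end_time, total_cost,
                       total_tokens, agent_names, status
                FROM episodes
                ORDER BY start_time DESC
                LIMIT ?
            """, (limit,))
            # Messages are loaded separately, per episode, for performance
            return [self._row_to_episode(row, []) for row in cursor.fetchall()]
        finally:
            conn.close()

    def get_episode_timeline(self, episode_id: str) -> Optional[AgentEpisode]:
        """Get full episode with its ordered message timeline, or None if missing"""
        conn = sqlite3.connect(self.db_path)
        try:
            row = conn.execute("""
                SELECT episode_id, start_time, end_time, total_cost,
                       total_tokens, agent_names, status
                FROM episodes WHERE episode_id = ?
            """, (episode_id,)).fetchone()
            if row is None:
                return None
            message_rows = conn.execute("""
                SELECT timestamp, agent_name, message_type, content,
                       token_count, cost, metadata
                FROM messages WHERE episode_id = ?
                ORDER BY timestamp, id
            """, (episode_id,)).fetchall()
        finally:
            conn.close()

        messages = [
            AgentMessage(
                timestamp=datetime.fromisoformat(m[0]),
                agent_name=m[1],
                message_type=MessageType(m[2]),
                content=m[3],
                token_count=m[4],
                cost=m[5],
                metadata=json.loads(m[6]) if m[6] else {},  # assumes metadata is stored as JSON text
            )
            for m in message_rows
        ]
        return self._row_to_episode(row, messages)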

store = EpisodeStore()

@app.get("/episodes")
async def list_episodes():
    """Get recent episodes for DVR interface"""
    episodes = store.get_episodes()
    return {"episodes": episodes}

@app.get("/episodes/{episode_id}/timeline")
async def get_timeline(episode_id: str):
    """Get episode timeline for scrubber interface"""
    episode = store.get_episode_timeline(episode_id)
    if not episode:
        raise HTTPException(status_code=404, detail="Episode not found")

    return {
        "episode": episode,
        "timeline_markers": _generate_timeline_markers(episode)
    }

def _generate_timeline_markers(episode: AgentEpisode) -> List[Dict]:
    """Generate timeline markers for scrubber UI"""
    markers = []
    for i, message in enumerate(episode.messages):
        # Create markers for key events
        if message.message_type in [MessageType.TOOL_CALL, MessageType.ASSISTANT]:
            markers.append({
                "timestamp": message.timestamp.isoformat(),
                "message_index": i,
                "agent_name": message.agent_name,
                "type": message.message_type.value,
                "cost": message.cost
            })
    return markers
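`EpisodeStore` creates the tables and reads from them, but nothing in the snippet writes episodes. A minimal sketch of a matching write path, assuming you already have episode and message data as plain dicts with ISO-8601 timestamp strings (so `datetime.fromisoformat` can read them back). `save_episode` is hypothetical, and storing `metadata` as JSON text is an assumption; the schema only declares it as `TEXT`.

```python
import json
import sqlite3
from typing import Any, Dict, List

# Same tables as EpisodeStore._init_db, condensed
SCHEMA = """
CREATE TABLE IF NOT EXISTS episodes (
    episode_id TEXT PRIMARY KEY, start_time TEXT, end_time TEXT,
    total_cost REAL, total_tokens INTEGER, agent_names TEXT, status TEXT
);
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY, episode_id TEXT, timestamp TEXT, agent_name TEXT,
    message_type TEXT, content TEXT, token_count INTEGER, cost REAL, metadata TEXT,
    FOREIGN KEY (episode_id) REFERENCES episodes (episode_id)
);
"""


def save_episode(conn: sqlite3.Connection, episode: Dict[str, Any],
                 messages: List[Dict[str, Any]]) -> None:
    """Upsert one episode row and append its messages in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO episodes VALUES (?, ?, ?, ?, ?, ?, ?)",
            (episode["episode_id"], episode["start_time"], episode.get("end_time"),
             sum(m["cost"] for m in messages), sum(m["token_count"] for m in messages),
             # dict.fromkeys de-duplicates agent names while keeping their order
             ",".join(dict.fromkeys(m["agent_name"] for m in messages)), episode["status"]),
        )
        conn.executemany(
            "INSERT INTO messages (episode_id, timestamp, agent_name, message_type,"
            " content, token_count, cost, metadata) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [(episode["episode_id"], m["timestamp"], m["agent_name"], m["message_type"],
              m["content"], m["token_count"], m["cost"], json.dumps(m.get("metadata", {})))
             for m in messages],
        )
```

Each call appends messages, so call it once per finished episode, or switch to per-message inserts if you record live.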

Step 4: Build the DVR Interface

Create a React component that actually feels like using a DVR:

// AgentDVR.jsx
import React, { useState, useEffect } from 'react';

// Assumes a dev-server proxy that forwards /api/* to the FastAPI app, and a
// MessageView component (not shown) that renders one message and its context.
const AgentDVR = () => {
    const [episodes, setEpisodes] = useState([]);
    const [currentEpisode, setCurrentEpisode] = useState(null);
    const [currentMessageIndex, setCurrentMessageIndex] = useState(0);
    const [isPlaying, setIsPlaying] = useState(false);

    useEffect(() => {
        fetch('/api/episodes')
            .then(r => r.json())
            .then(data => setEpisodes(data.episodes));
    }, []);

    // Playback: advance one message per second while playing. The timer lives in
    // an effect, so pausing, scrubbing, or unmounting clears it via the cleanup.
    useEffect(() => {
        if (!isPlaying || !currentEpisode) return;
        if (currentMessageIndex >= currentEpisode.messages.length - 1) {
            setIsPlaying(false); // reached the end of the episode
            return;
        }
        const timer = setTimeout(() => setCurrentMessageIndex(i => i + 1), 1000);
        return () => clearTimeout(timer);
    }, [isPlaying, currentMessageIndex, currentEpisode]);

    const loadEpisode = async (episodeId) => {
        const response = await fetch(`/api/episodes/${episodeId}/timeline`);
        const data = await response.json();
        setIsPlaying(false);
        setCurrentEpisode(data.episode);
        setCurrentMessageIndex(0);
    };

    const scrubToMessage = (messageIndex) => {
        setCurrentMessageIndex(messageIndex);
        setIsPlaying(false);
    };

    const playFromCurrent = () => setIsPlaying(true);

    return (
        <div className="agent-dvr">
            <div className="episode-list">
                <h2>Agent Episodes</h2>
                {episodes.map(episode => (
                    <div 
                        key={episode.episode_id}
                        className="episode-item"
                        onClick={() => loadEpisode(episode.episode_id)}
                    >
                        <div className="episode-info">
                            <span className="episode-time">
                                {new Date(episode.start_time).toLocaleString()}
                            </span>
                            <span className="episode-cost">${episode.total_cost.toFixed(4)}</span>
                            <span className="episode-status">{episode.status}</span>
                        </div>
                    </div>
                ))}
            </div>

            {currentEpisode && (
                <div className="episode-player">
                    <div className="timeline-scrubber">
                        <input
                            type="range"
                            min="0"
                            max={currentEpisode.messages.length - 1}
                            value={currentMessageIndex}
                            onChange={(e) => scrubToMessage(parseInt(e.target.value))}
                            className="timeline-slider"
                        />
                        <div className="timeline-markers">
                            {/* Render timeline markers for key events */}
                        </div>
                    </div>

                    <div className="playback-controls">
                        <button onClick={() => scrubToMessage(0)}>⏮</button>
                        <button onClick={() => scrubToMessage(Math.max(0, currentMessageIndex - 1))}>⏪</button>
                        <button onClick={isPlaying ? () => setIsPlaying(false) : playFromCurrent}>
                            {isPlaying ? '⏸' : '▶️'}
                        </button>
                        <button onClick={() => scrubToMessage(Math.min(currentEpisode.messages.length - 1, currentMessageIndex + 1))}>⏩</button>
                        <button onClick={() => scrubToMessage(currentEpisode.messages.length - 1)}>⏭</button>
                    </div>

                    <div className="message-inspector">
                        {currentEpisode.messages[currentMessageIndex] && (
                            <MessageView 
                                message={currentEpisode.messages[currentMessageIndex]}
                                context={currentEpisode.messages.slice(0, currentMessageIndex + 1)}
                            />
                        )}
                    </div>
                </div>
            )}
        </div>
    );
};
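The marker strip in the component is left as a placeholder. One way to fill it is to have the API attach a 0-100 position to each marker from `_generate_timeline_markers`, so the React side can place markers with a CSS `left` percentage that lines up with the range input. A minimal sketch; `add_marker_positions` and the `position_pct` field are hypothetical names.

```python
from typing import Any, Dict, List


def add_marker_positions(markers: List[Dict[str, Any]], message_count: int) -> List[Dict[str, Any]]:
    """Return copies of the markers with a 0-100 position matching the slider scale."""
    # The range input spans 0..message_count-1, so positions are relative to that span
    span = max(message_count - 1, 1)  # avoid dividing by zero for one-message episodes
    return [
        {**marker, "position_pct": round(100 * marker["message_index"] / span, 2)}
        for marker in markers
    ]
```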

Step 5: Connect Your Agents

Wire up your existing agents to flow through the recording gateway:

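# crew_ai_example.py
import os

from crewai import Agent, Task, Crew
# Current LangChain ships ChatOpenAI in the langchain_openai package;
# older releases exposed it as langchain.chat_models.ChatOpenAI.
from langchain_openai import ChatOpenAI

# Agents automatically use the recording gateway
llm = ChatOpenAI(
    model_name="gpt-4",
    openai_api_base="http://localhost:8080/v1",  # Routes through gateway
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

research_agent = Agent(
    role="Research Analyst",
    goal="Analyze market trends and provide insights",
    backstory="Expert in market analysis with 10 years of experience",
    llm=llm,
    verbose=True
)

research_task = Task(
    description="Analyze current market trends and summarize the key insights",
    expected_output="A short report listing the main trends with supporting evidence",
    agent=research_agent,
)

# Every conversation now gets recorded with full context
crew = Crew(
    agents=[research_agent],
    tasks=[research_task],
    verbose=True
)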

# This entire episode will be replayable in the DVR
result = crew.kickoff()

Pitfalls: What Will Break and How to Handle It

1. Memory Explosion with Long Episodes

Problem: Recording everything means episodes with 10,000+ messages will eat your storage alive.

Solution: Implement smart chunking and compression:

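# episode_compression.py
import hashlib
from dataclasses import replace

from episode_model import AgentEpisode, MessageType

MAX_TOOL_RESPONSE_CHARS = 5000

def compress_episode_content(episode: AgentEpisode) -> AgentEpisode:
    """Return a compressed copy of the episode (the input is left untouched)"""
    seen_prompts = set()
    compressed_messages = []

    for message in episode.messages:
        content = message.content

        # Deduplicate system prompts: keep the first copy, replace identical
        # repeats with a short hash reference
        if message.message_type == MessageType.SYSTEM:
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
            if digest in seen_prompts:
                content = f"[repeated system prompt {digest}]"
            seen_prompts.add(digest)

        # Truncate very long tool responses
        elif message.message_type == MessageType.TOOL_RESPONSE and len(content) > MAX_TOOL_RESPONSE_CHARS:
            content = content[:MAX_TOOL_RESPONSE_CHARS] + "... [truncated]"

        compressed_messages.append(replace(message, content=content))

    return replace(episode, messages=compressed_messages)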
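The function above shrinks individual messages; the chunking half of the advice is about how they are stored. A common approach is to write each episode as fixed-size groups of messages, each stored as gzip-compressed JSON, so the UI only fetches the chunk it is looking at. A minimal standard-library sketch; `pack_chunks`, `unpack_chunk`, `chunk_for_index`, and the chunk size of 100 are assumptions, and messages are plain dicts here.

```python
import gzip
import json
from typing import Any, Dict, List

CHUNK_SIZE = 100  # messages per chunk (an assumed tuning value)


def pack_chunks(messages: List[Dict[str, Any]], chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split messages into fixed-size groups and gzip each group's JSON."""
    return [
        gzip.compress(json.dumps(messages[i:i + chunk_size]).encode("utf-8"))
        for i in range(0, len(messages), chunk_size)
    ]


def unpack_chunk(blob: bytes) -> List[Dict[str, Any]]:
    """Decompress one chunk back into its list of messages."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


def chunk_for_index(message_index: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Which chunk holds a given message, e.g. when the scrubber jumps to it."""
    return message_index // chunk_size
```

Repetitive agent traffic (the same system prompt, similar tool payloads) tends to compress well, and a scrubber jump to message 247 only needs to decompress chunk 2.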

2. Timeline Scrubbing Performance

Problem: Loading 847 messages into the timeline scrubber makes your browser cry.

Solution: Virtual scrolling and lazy loading:

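// VirtualTimeline.jsx
import React, { useState } from 'react';

const ITEM_HEIGHT = 40;       // px per timeline row
const CONTAINER_HEIGHT = 600; // px of visible viewport
const OVERSCAN = 5;           // extra rows rendered above and below the viewport

// MessageMarker is assumed to be defined elsewhere.
const VirtualTimeline = ({ messages, onMessageSelect }) => {
    const [scrollTop, setScrollTop] = useState(0);

    // Render only the rows that intersect the viewport (plus a small overscan)
    const start = Math.max(0, Math.floor(scrollTop / ITEM_HEIGHT) - OVERSCAN);
    const end = Math.min(messages.length, Math.ceil((scrollTop + CONTAINER_HEIGHT) / ITEM_HEIGHT) + OVERSCAN);

    return (
        <div
            className="virtual-timeline"
            style={{ height: CONTAINER_HEIGHT, overflowY: 'auto' }}
            onScroll={(e) => setScrollTop(e.currentTarget.scrollTop)}
        >
            {/* Full-height spacer keeps the scrollbar proportional to the whole episode */}
            <div style={{ position: 'relative', height: messages.length * ITEM_HEIGHT }}>
                {messages.slice(start, end).map((message, offset) => {
                    const index = start + offset;
                    return (
                        <div key={index} style={{ position: 'absolute', top: index * ITEM_HEIGHT, height: ITEM_HEIGHT, width: '100%' }}>
                            <MessageMarker message={message} onClick={() => onMessageSelect(index)} />
                        </div>
                    );
                })}
            </div>
        </div>
    );
};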
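Virtual scrolling keeps the DOM small, but the browser still downloads every message if the API returns the whole episode. Lazy loading also needs a paged query on the server side. A minimal sketch against the `messages` table from Step 3, which could back a hypothetical endpoint such as `/episodes/{episode_id}/messages?offset=0&limit=50`; `fetch_message_page` is an assumed helper, and it skips `content` so the timeline payload stays light.

```python
import sqlite3
from typing import Any, Dict

COLUMNS = ["id", "timestamp", "agent_name", "message_type", "token_count", "cost"]


def fetch_message_page(conn: sqlite3.Connection, episode_id: str,
                       offset: int = 0, limit: int = 50) -> Dict[str, Any]:
    """Return one page of an episode's messages plus the total count."""
    total = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE episode_id = ?", (episode_id,)
    ).fetchone()[0]
    # content is omitted here; fetch it only when a message is selected
    rows = conn.execute(
        """SELECT id, timestamp, agent_name, message_type, token_count, cost
           FROM messages WHERE episode_id = ?
           ORDER BY timestamp, id LIMIT ? OFFSET ?""",
        (episode_id, limit, offset),
    ).fetchall()
    return {
        "total": total,  # lets the client size the scrollbar before loading everything
        "offset": offset,
        "messages": [dict(zip(COLUMNS, row)) for row in rows],
    }
```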

3. Context Window Visualization

Problem: Showing what context the agent had at message 247 requires reconstructing the entire conversation state.

Solution: Pre-compute context snapshots during recording:

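# context_snapshots.py
from typing import Any, Dict, List
from episode_model import AgentMessage, MessageType

def create_context_snapshot(messages: List[AgentMessage], at_index: int,
                            max_context_tokens: int = 8192) -> Dict[str, Any]:
    """Create a snapshot of agent context at a specific message"""
    relevant_messages = messages[:at_index + 1]

    # Context window: the newest messages that fit in max_context_tokens (an assumed limit)
    window, used = [], 0
    for msg in reversed(relevant_messages):
        used += msg.token_count
        if used > max_context_tokens:
            break
        window.insert(0, msg.content)
    tool_calls = [m for m in relevant_messages if m.message_type == MessageType.TOOL_CALL]
    return {
        "conversation_length": len(relevant_messages),
        "token_count": sum(msg.token_count for msg in relevant_messages),
        "agents_present": sorted({msg.agent_name for msg in relevant_messages}),
        "last_tool_call": tool_calls[-1].content if tool_calls else None,
        "context_window": window,
    }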
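Calling `create_context_snapshot` at every index rescans the whole prefix each time, which grows quadratically with episode length. To actually pre-compute during recording, carry the running values forward in a single pass and save a checkpoint every N messages; the inspector then starts from the nearest checkpoint and replays only the few messages after it. A minimal sketch; `Msg`, `Checkpoint`, `build_checkpoints`, `nearest_checkpoint`, and the interval of 50 are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Msg:
    # Trimmed stand-in for AgentMessage
    agent_name: str
    message_type: str  # e.g. "assistant", "tool_call"
    content: str
    token_count: int


@dataclass(frozen=True)
class Checkpoint:
    index: int
    token_count: int
    agents_present: FrozenSet[str]
    last_tool_call: Optional[str]


def build_checkpoints(messages: List[Msg], every: int = 50) -> Dict[int, Checkpoint]:
    """One pass over the episode, saving the running context every `every` messages."""
    checkpoints: Dict[int, Checkpoint] = {}
    tokens, agents, last_tool = 0, set(), None
    for i, msg in enumerate(messages):
        tokens += msg.token_count
        agents.add(msg.agent_name)
        if msg.message_type == "tool_call":
            last_tool = msg.content
        # Save at every interval boundary and always at the final message
        if i % every == every - 1 or i == len(messages) - 1:
            checkpoints[i] = Checkpoint(i, tokens, frozenset(agents), last_tool)
    return checkpoints


def nearest_checkpoint(checkpoints: Dict[int, Checkpoint], index: int) -> Optional[Checkpoint]:
    """Latest checkpoint at or before `index`; replay the remaining messages from there."""
    candidates = [i for i in checkpoints if i <= index]
    return checkpoints[max(candidates)] if candidates else None
```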

Measurement: How to Know It's Working

Your Agent DVR is working when debugging stops feeling like archaeology:

1. Debug Speed Metrics

# Before: "Let me grep through 2000 lines of logs..."
# After: "Jump to message 247, check agent context, replay from there"

debug_time_before = 45  # minutes
debug_time_after = 3   # minutes
efficiency_gain = debug_time_before / debug_time_after  # 15x faster

2. Issue Resolution Rate

# Track how quickly you identify root causes
issues_resolved_per_hour = {
    "before_dvr": 0.8,
    "after_dvr": 4.2
}
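These numbers are only meaningful if you measure them. One low-effort option is to log each debugging session (whether the DVR was used, minutes until the root cause was found, whether it was resolved) and compute the metrics from that log. A minimal sketch; the `DebugSession` record, the use of medians, and defining resolution rate as resolved sessions per hour spent are all assumptions.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class DebugSession:
    used_dvr: bool
    minutes_to_root_cause: float
    resolved: bool


def debug_metrics(sessions: List[DebugSession]) -> Dict[str, float]:
    """Median time to root cause per group, the speedup, and resolutions per hour."""
    def group(flag: bool) -> List[DebugSession]:
        return [s for s in sessions if s.used_dvr == flag]

    def per_hour(group_sessions: List[DebugSession]) -> float:
        hours = sum(s.minutes_to_root_cause for s in group_sessions) / 60
        return sum(s.resolved for s in group_sessions) / hours if hours else 0.0

    before, after = group(False), group(True)
    before_median = median(s.minutes_to_root_cause for s in before)
    after_median = median(s.minutes_to_root_cause for s in after)
    return {
        "median_minutes_before": before_median,
        "median_minutes_after": after_median,
        "speedup": before_median / after_median,
        "resolved_per_hour_before": per_hour(before),
        "resolved_per_hour_after": per_hour(after),
    }
```

Medians keep one marathon session from dominating the comparison. Each group needs at least one session, since `median` raises `StatisticsError` on empty input.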

3. Token Waste Prevention

# Spot runaway agents before they burn your budget
average_cost_per_debug_session = {
    "before": 12.50,  # Re-running agents to reproduce issues
    "after": 0.00     # Just replay the recorded episode
}
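On recorded data, spotting a runaway agent reduces to a running sum: walk the per-message `cost` values and report the first message where the episode exceeds its budget. That index is exactly where you want the scrubber to land. A minimal sketch; `find_budget_breach` and the idea of a per-episode budget are assumptions.

```python
from itertools import accumulate
from typing import List, Optional


def find_budget_breach(message_costs: List[float], budget: float) -> Optional[int]:
    """Index of the first message whose cumulative cost exceeds the budget, or None."""
    for index, running_total in enumerate(accumulate(message_costs)):
        if running_total > budget:
            return index
    return None
```

The same running-total check can run inside the recorder to flag an episode live, instead of after the bill arrives.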

Next Steps: Get Your Agent DVR Running

Ready to stop debugging AI agents like it's 1987?

Clone the demo repo: Complete Agent DVR implementation with React frontend

   git clone https://github.com/airblackboxio/agent-dvr-tutorial
   cd agent-dvr-tutorial
   pip install .

Connect your agents: Route through Airblackbox Gateway in 5 minutes

   pip install airblackbox
   # Follow the quickstart in the repo README

Start recording: Your next agent conversation will be fully replayable

Your agents are about to become a lot less mysterious. And your debugging sessions are about to become a lot shorter.

Because the only thing worse than an AI agent that forgets everything is a developer who can't figure out what the agent forgot.

Try Airblackbox: The flight recorder for autonomous AI agents → airblackbox.io
