Build a DVR for AI Agents: Episode Replay UI That Actually Works
Your AI agent just burned through $47 in tokens debugging a conversation that went sideways at message 23 of 847, and you're debugging it by scrolling through terminal logs like it's 1987.
The Problem: Debugging AI Agents is Archaeological Work
Here's what happens when your CrewAI agent decides to have an existential crisis:
[2024-01-15 14:32:17] Agent: ResearchAgent - Starting task: analyze_market_trends
[2024-01-15 14:32:18] LLM Call: system_prompt="You are a research agent..."
[2024-01-15 14:32:19] Response: "I'll analyze the market trends..."
[2024-01-15 14:32:20] Agent: ResearchAgent - Task completed
[2024-01-15 14:32:21] Agent: WritingAgent - Starting task: write_report
...
[847 more lines of this archaeological nightmare]
...
[2024-01-15 14:47:32] Agent: WritingAgent - Error: Token limit exceeded
You know something went wrong around message 23. You know the agent started hallucinating about cryptocurrency trends that don't exist. You know it burned through your token budget faster than a teenager burns through data.
What you don't know is why, because debugging multi-agent conversations with grep and prayer isn't debugging—it's divination.
The existing solutions are equally painful:
- Terminal logs: Great if you enjoy archaeological digs through text
- Static dashboards: Show you averages, not the specific moment everything went wrong
- OpenAI playground: Only shows individual API calls, not agent conversations
- Manual instrumentation: Because what every debugging session needs is more code to debug
You need a DVR for AI agents. Rewind to the exact moment. Scrub through the conversation timeline. See what the agent was thinking, what context it had, and where it decided to go rogue.
Architecture: How Agent DVR Actually Works
Here's how we build a proper replay system that doesn't suck:
graph TD
A[AI Agent] --> B[Airblackbox Gateway]
B --> C[OpenAI API]
B --> D[Telemetry Store]
D --> E[Timeline API]
E --> F[React UI]
F --> G[Episode List]
F --> H[Timeline Scrubber]
F --> I[Context Inspector]
G --> J[Agent Conversations]
H --> K[Message Navigation]
I --> L[Token Analysis]
subgraph "DVR Components"
G
H
I
end
subgraph "Data Flow"
B --> |"Records every LLM call"| D
D --> |"Queries by episode"| E
E --> |"Serves timeline data"| F
end
The magic happens in three layers:
- Capture Layer: Gateway intercepts every LLM call, preserving full context
- Storage Layer: Time-indexed episode data with conversation threading
- Replay Layer: Timeline UI that lets you scrub through agent decisions like a video
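Concretely, the capture layer boils down to one record per LLM call. Here is a minimal sketch of what such a record might carry — the field names are illustrative, not the gateway's actual schema:

```python
# Hypothetical shape of one captured record; everything the replay layer
# needs (who, when, what, how many tokens, at what cost) travels together.
record = {
    "episode_id": "ep-2024-01-15-001",    # groups calls into one conversation
    "timestamp": "2024-01-15T14:32:18Z",  # time index for the scrubber
    "agent_name": "ResearchAgent",
    "message_type": "assistant",
    "content": "I'll analyze the market trends...",
    "token_count": 42,
    "cost": 0.00126,
}

# The storage layer indexes on (episode_id, timestamp) so the replay layer
# can fetch an ordered timeline with a single query.
print(sorted(record))
```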
Implementation: Building the Agent DVR
Step 1: Set Up the Recording Infrastructure
First, install Airblackbox to start capturing your agent conversations:
pip install airblackbox
Configure the gateway to record everything your agents do:
# agent_dvr_setup.py
import os

from airblackbox import configure_gateway
from openai import OpenAI

# Start recording all LLM calls
configure_gateway(
    api_key=os.getenv("OPENAI_API_KEY"),
    endpoint="http://localhost:8080",  # Gateway endpoint
    record_all=True,
    session_tags=["agent_episode", "production"]
)

# Your agents now route through the recording gateway
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url="http://localhost:8080/v1"  # Gateway intercepts here
)
Step 2: Create the Episode Data Model
Define how we structure agent conversations for timeline replay:
# episode_model.py
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional


class MessageType(Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    TOOL_CALL = "tool_call"
    TOOL_RESPONSE = "tool_response"


@dataclass
class AgentMessage:
    timestamp: datetime
    agent_name: str
    message_type: MessageType
    content: str
    token_count: int
    cost: float
    metadata: Dict[str, Any]


@dataclass
class AgentEpisode:
    episode_id: str
    start_time: datetime
    end_time: Optional[datetime]
    total_cost: float
    total_tokens: int
    agent_names: List[str]
    messages: List[AgentMessage]
    status: str  # "running", "completed", "failed"

    @property
    def duration(self) -> float:
        """Duration in seconds"""
        if not self.end_time:
            return (datetime.now() - self.start_time).total_seconds()
        return (self.end_time - self.start_time).total_seconds()
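As a sanity check on the model: the episode-level totals are just rollups of the per-message fields. A minimal sketch, using trimmed stand-in dataclasses so it runs on its own:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Trimmed stand-ins for the AgentMessage/AgentEpisode dataclasses above,
# reduced to the fields this rollup touches.
@dataclass
class Msg:
    agent_name: str
    token_count: int
    cost: float

@dataclass
class Episode:
    start_time: datetime
    end_time: datetime
    messages: List[Msg] = field(default_factory=list)

    @property
    def duration(self) -> float:
        return (self.end_time - self.start_time).total_seconds()

ep = Episode(
    start_time=datetime(2024, 1, 15, 14, 32, 17),
    end_time=datetime(2024, 1, 15, 14, 47, 32),
    messages=[
        Msg("ResearchAgent", 1200, 0.036),
        Msg("WritingAgent", 800, 0.024),
    ],
)

# total_cost / total_tokens in the episodes table are derived like this
total_cost = sum(m.cost for m in ep.messages)
total_tokens = sum(m.token_count for m in ep.messages)
print(f"{ep.duration:.0f}s, {total_tokens} tokens, ${total_cost:.3f}")
```

Storing the rollups denormalized on the episode row keeps the episode list fast; they can always be recomputed from the messages table.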
Step 3: Build the Timeline API
Create the backend that serves episode data to your DVR interface:
# timeline_api.py
import json
import sqlite3
from datetime import datetime
from typing import Dict, List, Optional

from fastapi import FastAPI, HTTPException

from episode_model import AgentEpisode, AgentMessage, MessageType

app = FastAPI()


class EpisodeStore:
    def __init__(self, db_path: str = "agent_episodes.db"):
        self.db_path = db_path
        self._init_db()

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS episodes (
                episode_id TEXT PRIMARY KEY,
                start_time TEXT,
                end_time TEXT,
                total_cost REAL,
                total_tokens INTEGER,
                agent_names TEXT,
                status TEXT
            )
        """)
        conn.execute("""
            CREATE TABLE IF NOT EXISTS messages (
                id INTEGER PRIMARY KEY,
                episode_id TEXT,
                timestamp TEXT,
                agent_name TEXT,
                message_type TEXT,
                content TEXT,
                token_count INTEGER,
                cost REAL,
                metadata TEXT,
                FOREIGN KEY (episode_id) REFERENCES episodes (episode_id)
            )
        """)
        conn.commit()
        conn.close()

    def _row_to_episode(self, row, messages: List[AgentMessage]) -> AgentEpisode:
        return AgentEpisode(
            episode_id=row[0],
            start_time=datetime.fromisoformat(row[1]),
            end_time=datetime.fromisoformat(row[2]) if row[2] else None,
            total_cost=row[3],
            total_tokens=row[4],
            agent_names=row[5].split(",") if row[5] else [],
            messages=messages,
            status=row[6]
        )

    def get_episodes(self, limit: int = 50) -> List[AgentEpisode]:
        """Get recent episodes for the episode list"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.execute("""
            SELECT episode_id, start_time, end_time, total_cost,
                   total_tokens, agent_names, status
            FROM episodes
            ORDER BY start_time DESC
            LIMIT ?
        """, (limit,))
        # Messages are loaded separately for performance
        episodes = [self._row_to_episode(row, []) for row in cursor.fetchall()]
        conn.close()
        return episodes

    def get_episode_timeline(self, episode_id: str) -> Optional[AgentEpisode]:
        """Get full episode with message timeline"""
        conn = sqlite3.connect(self.db_path)
        row = conn.execute("""
            SELECT episode_id, start_time, end_time, total_cost,
                   total_tokens, agent_names, status
            FROM episodes WHERE episode_id = ?
        """, (episode_id,)).fetchone()
        if row is None:
            conn.close()
            return None
        messages = [
            AgentMessage(
                timestamp=datetime.fromisoformat(m[0]),
                agent_name=m[1],
                message_type=MessageType(m[2]),
                content=m[3],
                token_count=m[4],
                cost=m[5],
                metadata=json.loads(m[6]) if m[6] else {}
            )
            for m in conn.execute("""
                SELECT timestamp, agent_name, message_type, content,
                       token_count, cost, metadata
                FROM messages
                WHERE episode_id = ?
                ORDER BY timestamp
            """, (episode_id,)).fetchall()
        ]
        conn.close()
        return self._row_to_episode(row, messages)
store = EpisodeStore()


@app.get("/episodes")
async def list_episodes():
    """Get recent episodes for DVR interface"""
    episodes = store.get_episodes()
    return {"episodes": episodes}


@app.get("/episodes/{episode_id}/timeline")
async def get_timeline(episode_id: str):
    """Get episode timeline for scrubber interface"""
    episode = store.get_episode_timeline(episode_id)
    if not episode:
        raise HTTPException(status_code=404, detail="Episode not found")
    return {
        "episode": episode,
        "timeline_markers": _generate_timeline_markers(episode)
    }


def _generate_timeline_markers(episode: AgentEpisode) -> List[Dict]:
    """Generate timeline markers for scrubber UI"""
    markers = []
    for i, message in enumerate(episode.messages):
        # Create markers for key events
        if message.message_type in (MessageType.TOOL_CALL, MessageType.ASSISTANT):
            markers.append({
                "timestamp": message.timestamp.isoformat(),
                "message_index": i,
                "agent_name": message.agent_name,
                "type": message.message_type.value,
                "cost": message.cost
            })
    return markers
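One thing the API above leaves implicit is the write path: how rows get into the `messages` table in the first place. Here is a hypothetical sketch of what the gateway's recorder would do per intercepted call — `record_message` is an illustrative helper, not part of the Airblackbox API, and an in-memory database stands in for the real store:

```python
import json
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for agent_episodes.db, with the messages schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY, episode_id TEXT, timestamp TEXT,
        agent_name TEXT, message_type TEXT, content TEXT,
        token_count INTEGER, cost REAL, metadata TEXT
    )
""")

def record_message(conn, episode_id, agent_name, message_type,
                   content, token_count, cost, metadata=None):
    """Hypothetical recorder: one row per intercepted LLM call."""
    conn.execute(
        """INSERT INTO messages (episode_id, timestamp, agent_name,
               message_type, content, token_count, cost, metadata)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (episode_id, datetime.now(timezone.utc).isoformat(), agent_name,
         message_type, content, token_count, cost,
         json.dumps(metadata or {})),
    )
    conn.commit()

record_message(conn, "ep-001", "ResearchAgent", "assistant",
               "I'll analyze the market trends...", 42, 0.00126)
count = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE episode_id = ?", ("ep-001",)
).fetchone()[0]
print(count)
```

Timestamps are written in UTC ISO-8601 so the `ORDER BY timestamp` in the timeline query sorts correctly as plain text.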
Step 4: Build the DVR Interface
Create a React component that actually feels like using a DVR:
// AgentDVR.jsx
import React, { useState, useEffect, useRef } from 'react';

const AgentDVR = () => {
  const [episodes, setEpisodes] = useState([]);
  const [currentEpisode, setCurrentEpisode] = useState(null);
  const [currentMessageIndex, setCurrentMessageIndex] = useState(0);
  const [isPlaying, setIsPlaying] = useState(false);
  const playbackRef = useRef(null); // holds the playback interval id

  useEffect(() => {
    fetch('/api/episodes')
      .then(r => r.json())
      .then(data => setEpisodes(data.episodes));
    return () => clearInterval(playbackRef.current); // clean up on unmount
  }, []);

  const loadEpisode = async (episodeId) => {
    const response = await fetch(`/api/episodes/${episodeId}/timeline`);
    const data = await response.json();
    setCurrentEpisode(data.episode);
    setCurrentMessageIndex(0);
  };

  const stopPlayback = () => {
    clearInterval(playbackRef.current); // pause must actually stop the timer
    playbackRef.current = null;
    setIsPlaying(false);
  };

  const scrubToMessage = (messageIndex) => {
    stopPlayback();
    setCurrentMessageIndex(messageIndex);
  };

  const playFromCurrent = () => {
    setIsPlaying(true);
    playbackRef.current = setInterval(() => {
      setCurrentMessageIndex(prev => {
        if (prev >= currentEpisode.messages.length - 1) {
          stopPlayback();
          return prev;
        }
        return prev + 1;
      });
    }, 1000); // 1 message per second playback
  };

  return (
    <div className="agent-dvr">
      <div className="episode-list">
        <h2>Agent Episodes</h2>
        {episodes.map(episode => (
          <div
            key={episode.episode_id}
            className="episode-item"
            onClick={() => loadEpisode(episode.episode_id)}
          >
            <div className="episode-info">
              <span className="episode-time">
                {new Date(episode.start_time).toLocaleString()}
              </span>
              <span className="episode-cost">${episode.total_cost.toFixed(4)}</span>
              <span className="episode-status">{episode.status}</span>
            </div>
          </div>
        ))}
      </div>
      {currentEpisode && (
        <div className="episode-player">
          <div className="timeline-scrubber">
            <input
              type="range"
              min="0"
              max={currentEpisode.messages.length - 1}
              value={currentMessageIndex}
              onChange={(e) => scrubToMessage(parseInt(e.target.value, 10))}
              className="timeline-slider"
            />
            <div className="timeline-markers">
              {/* Render timeline markers for key events */}
            </div>
          </div>
          <div className="playback-controls">
            <button onClick={() => scrubToMessage(0)}>⏮</button>
            <button onClick={() => scrubToMessage(Math.max(0, currentMessageIndex - 1))}>⏪</button>
            <button onClick={isPlaying ? stopPlayback : playFromCurrent}>
              {isPlaying ? '⏸' : '▶️'}
            </button>
            <button onClick={() => scrubToMessage(Math.min(currentEpisode.messages.length - 1, currentMessageIndex + 1))}>⏩</button>
            <button onClick={() => scrubToMessage(currentEpisode.messages.length - 1)}>⏭</button>
          </div>
          <div className="message-inspector">
            {currentEpisode.messages[currentMessageIndex] && (
              <MessageView
                message={currentEpisode.messages[currentMessageIndex]}
                context={currentEpisode.messages.slice(0, currentMessageIndex + 1)}
              />
            )}
          </div>
        </div>
      )}
    </div>
  );
};
Step 5: Connect Your Agents
Wire up your existing agents to flow through the recording gateway:
# crew_ai_example.py
import os

from crewai import Agent, Task, Crew
from langchain.chat_models import ChatOpenAI

# Agents automatically use the recording gateway
llm = ChatOpenAI(
    model_name="gpt-4",
    openai_api_base="http://localhost:8080/v1",  # Routes through gateway
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

research_agent = Agent(
    role="Research Analyst",
    goal="Analyze market trends and provide insights",
    backstory="Expert in market analysis with 10 years of experience",
    llm=llm,
    verbose=True
)

research_task = Task(
    description="Analyze current market trends and summarize key insights",
    agent=research_agent
)

# Every conversation now gets recorded with full context
crew = Crew(
    agents=[research_agent],
    tasks=[research_task],
    verbose=True
)

# This entire episode will be replayable in the DVR
result = crew.kickoff()
Pitfalls: What Will Break and How to Handle It
1. Memory Explosion with Long Episodes
Problem: Recording everything means episodes with 10,000+ messages will eat your storage alive.
Solution: Implement smart chunking and compression:
# episode_compression.py
def compress_episode_content(episode: AgentEpisode) -> AgentEpisode:
    """Compress large episodes for storage"""
    compressed_messages = []
    for message in episode.messages:
        # Compress repetitive system prompts
        if message.message_type == MessageType.SYSTEM:
            message.content = _compress_system_prompt(message.content)
        # Truncate very long tool responses
        if message.message_type == MessageType.TOOL_RESPONSE and len(message.content) > 5000:
            message.content = message.content[:5000] + "... [truncated]"
        compressed_messages.append(message)
    episode.messages = compressed_messages
    return episode
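The `_compress_system_prompt` helper is left undefined above. One plausible implementation deduplicates identical system prompts, since agent frameworks tend to resend the same one on every call. A sketch, assuming a made-up reference format:

```python
import hashlib

# Cache of previously seen system prompts, keyed by content hash.
_seen_prompts = {}

def _compress_system_prompt(content: str) -> str:
    """Store each distinct system prompt once; replace repeats
    with a short content-hash reference (format is illustrative)."""
    digest = hashlib.sha256(content.encode()).hexdigest()[:12]
    if digest in _seen_prompts:
        return f"[system-prompt-ref:{digest}]"
    _seen_prompts[digest] = content
    return content

prompt = "You are a research agent..."
first = _compress_system_prompt(prompt)   # first sighting: stored verbatim
second = _compress_system_prompt(prompt)  # repeat: replaced by a reference
print(second)
```

The replay layer can expand references back through the same cache, so the DVR still shows the full prompt while storage holds it once.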
2. Timeline Scrubbing Performance
Problem: Loading 847 messages into the timeline scrubber makes your browser cry.
Solution: Virtual scrolling and lazy loading:
// VirtualTimeline.jsx
const VirtualTimeline = ({ messages, onMessageSelect }) => {
  const [visibleRange, setVisibleRange] = useState({ start: 0, end: 50 });

  const handleScroll = (e) => {
    const scrollTop = e.currentTarget.scrollTop;
    const itemHeight = 40;
    const containerHeight = 600;
    const start = Math.floor(scrollTop / itemHeight);
    const end = start + Math.ceil(containerHeight / itemHeight);
    setVisibleRange({ start, end });
  };

  return (
    <div className="virtual-timeline" onScroll={handleScroll}>
      {messages.slice(visibleRange.start, visibleRange.end).map((message, index) => (
        <MessageMarker
          key={visibleRange.start + index}
          message={message}
          onClick={onMessageSelect}
        />
      ))}
    </div>
  );
};
3. Context Window Visualization
Problem: Showing what context the agent had at message 247 requires reconstructing the entire conversation state.
Solution: Pre-compute context snapshots during recording:
# context_snapshots.py
def create_context_snapshot(messages: List[AgentMessage], at_index: int) -> Dict:
    """Create a snapshot of agent context at a specific message"""
    relevant_messages = messages[:at_index + 1]
    return {
        "conversation_length": len(relevant_messages),
        "token_count": sum(msg.token_count for msg in relevant_messages),
        "agents_present": list(set(msg.agent_name for msg in relevant_messages)),
        "last_tool_call": _find_last_tool_call(relevant_messages),
        "context_window": _extract_context_window(relevant_messages)
    }
Measurement: How to Know It's Working
Your Agent DVR is working when debugging stops feeling like archaeology:
1. Debug Speed Metrics
# Before: "Let me grep through 2000 lines of logs..."
# After: "Jump to message 247, check agent context, replay from there"
debug_time_before = 45 # minutes
debug_time_after = 3 # minutes
efficiency_gain = debug_time_before / debug_time_after # 15x faster
2. Issue Resolution Rate
# Track how quickly you identify root causes
issues_resolved_per_hour = {
    "before_dvr": 0.8,
    "after_dvr": 4.2
}
3. Token Waste Prevention
# Spot runaway agents before they burn your budget
average_cost_per_debug_session = {
    "before": 12.50,  # Re-running agents to reproduce issues
    "after": 0.00     # Just replay the recorded episode
}
Next Steps: Get Your Agent DVR Running
Ready to stop debugging AI agents like it's 1987?
- Clone the demo repo: Complete Agent DVR implementation with React frontend
git clone https://github.com/airblackboxio/agent-dvr-tutorial
cd agent-dvr-tutorial
python setup.py install
- Connect your agents: Route through Airblackbox Gateway in 5 minutes
pip install airblackbox
# Follow the quickstart in the repo README
- Start recording: Your next agent conversation will be fully replayable
Your agents are about to become a lot less mysterious. And your debugging sessions are about to become a lot shorter.
Because the only thing worse than an AI agent that forgets everything is a developer who can't figure out what the agent forgot.
Try Airblackbox: The flight recorder for autonomous AI agents → airblackbox.io