Build a DVR for AI Agents: Episode Replay UI That Actually Works
Your autonomous agent just spent 47 minutes and $23.50 failing to book a dinner reservation, and all you have is a chat log that says "task completed successfully."
The Problem: Agent Debugging is Archaeology Without Artifacts
Here's what happens when your AI agent breaks in production: you get a support ticket that says "the AI is acting weird" with zero context about what "weird" means. Was it the prompt? The tool selection? The memory retrieval? The moon phase?
Current debugging approaches are embarrassingly primitive:
- Log spelunking: Grepping through thousands of lines hoping to spot the moment everything went sideways
- Vibes-based debugging: Staring at the final output and reverse-engineering what might have gone wrong
- The nuclear option: Adding more logging and hoping the agent fails again
What developers actually need is a DVR for AI agents — the ability to scrub through an episode timeline, see exactly what the agent observed at each step, and understand the decision tree that led to disaster.
The core challenge isn't just capturing the data (though that's hard enough). It's building a replay interface that doesn't make you want to throw your laptop out the window.
Architecture: How Agent DVR Actually Works
graph TB
subgraph "Agent Runtime"
A[LLM Call] --> G[Gateway]
T[Tool Call] --> G
M[Memory Access] --> G
G --> C[Collector]
end
subgraph "Storage Layer"
C --> S[Session Store]
C --> E[Event Store]
S --> D[(SQLite)]
E --> D
end
subgraph "Replay UI"
D --> API[FastAPI Backend]
API --> UI[React Timeline]
UI --> V[Event Viewer]
V --> P[Prompt Inspector]
V --> R[Response Analyzer]
end
UI --> |scrub| API
API --> |fetch events| D
The architecture has three layers:
- Capture Layer: Gateway intercepts every LLM call, tool execution, and memory access
- Storage Layer: Events get indexed by session, timestamp, and type for efficient timeline queries
- Replay Layer: Timeline UI that lets you scrub through the episode and inspect each decision point
The secret sauce is in the event indexing. We're not just dumping logs — we're creating a queryable timeline where you can jump to any point and see the agent's complete context.
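A minimal sketch of what that indexing buys you, using an in-memory SQLite table with the same (session_id, timestamp) index the collector builds later in this post (the sample events and timestamps are made up):

```python
import sqlite3

# In-memory stand-in for the event store, with the same
# (session_id, timestamp) index the collector creates.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    event_id TEXT PRIMARY KEY, session_id TEXT,
    timestamp TEXT, event_type TEXT)""")
conn.execute("CREATE INDEX idx_session_time ON events(session_id, timestamp)")

conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    ("e1", "s1", "2024-01-01T10:00:00", "llm_call"),
    ("e2", "s1", "2024-01-01T10:00:05", "tool_call"),
    ("e3", "s2", "2024-01-01T10:00:06", "llm_call"),
])

# "Scrub to 10:00:05 in session s1" becomes a range scan on the index,
# not a full-table grep. Other sessions never enter the picture.
result = [row[0] for row in conn.execute(
    "SELECT event_id FROM events WHERE session_id = ? AND timestamp >= ? "
    "ORDER BY timestamp", ("s1", "2024-01-01T10:00:05"))]
print(result)  # ['e2']
```

ISO-8601 timestamps sort lexicographically, which is why storing them as TEXT still supports range queries.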
Implementation: Building the Agent DVR
Step 1: Event Capture Gateway
First, we need to capture agent events in a way that preserves temporal relationships:
from typing import Dict, Any, List, Optional
import json
import sqlite3
from datetime import datetime
from pydantic import BaseModel
import uuid

class AgentEvent(BaseModel):
    event_id: str
    session_id: str
    timestamp: datetime
    event_type: str  # 'llm_call', 'tool_call', 'memory_access'
    input_data: Dict[str, Any]
    output_data: Dict[str, Any]
    context: Dict[str, Any]
    parent_event_id: Optional[str] = None  # links sub-calls to their parent event
class EventCollector:
def __init__(self, db_path: str = "agent_events.db"):
self.db_path = db_path
self.init_db()
def init_db(self):
conn = sqlite3.connect(self.db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS events (
event_id TEXT PRIMARY KEY,
session_id TEXT,
timestamp TEXT,
event_type TEXT,
input_data TEXT,
output_data TEXT,
context TEXT,
parent_event_id TEXT
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_session_time
ON events(session_id, timestamp)
""")
conn.commit()
conn.close()
def record_event(self, event: AgentEvent):
conn = sqlite3.connect(self.db_path)
conn.execute("""
INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
event.event_id,
event.session_id,
event.timestamp.isoformat(),
event.event_type,
json.dumps(event.input_data),
json.dumps(event.output_data),
json.dumps(event.context),
event.parent_event_id
))
conn.commit()
conn.close()
Step 2: Agent Integration Wrapper
Now wrap your existing agent to capture events without changing its core logic:
import functools
from openai import OpenAI
class DVRAgent:
def __init__(self, session_id: str = None):
self.session_id = session_id or str(uuid.uuid4())
self.collector = EventCollector()
self.client = OpenAI()
self.context_stack = []
    # Defined as a plain function (not a method) so it can be used as a
    # decorator further down in the same class body.
    def record_llm_call(func):
@functools.wraps(func)
def wrapper(self, *args, **kwargs):
event_id = str(uuid.uuid4())
start_time = datetime.now()
# Capture input
input_data = {
'model': kwargs.get('model', 'gpt-3.5-turbo'),
'messages': kwargs.get('messages', []),
'temperature': kwargs.get('temperature', 0.7)
}
try:
# Execute the actual call
response = func(self, *args, **kwargs)
# Capture output
output_data = {
'content': response.choices[0].message.content,
                    'usage': response.usage.model_dump() if response.usage else None,
'model': response.model
}
# Record the event
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=start_time,
event_type='llm_call',
input_data=input_data,
output_data=output_data,
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
return response
except Exception as e:
# Record failures too
output_data = {'error': str(e)}
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=start_time,
event_type='llm_call',
input_data=input_data,
output_data=output_data,
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
raise
return wrapper
@record_llm_call
    def chat_completion(self, messages: List[Dict], **kwargs):
        kwargs.setdefault('model', 'gpt-3.5-turbo')  # the API requires an explicit model
        return self.client.chat.completions.create(
            messages=messages,
            **kwargs
        )
def use_tool(self, tool_name: str, tool_args: Dict) -> Dict:
event_id = str(uuid.uuid4())
        # Execute the tool (stubbed in _execute_tool below)
        result = self._execute_tool(tool_name, tool_args)
event = AgentEvent(
event_id=event_id,
session_id=self.session_id,
timestamp=datetime.now(),
event_type='tool_call',
input_data={'tool': tool_name, 'args': tool_args},
output_data={'result': result},
context={'stack': self.context_stack.copy()}
)
self.collector.record_event(event)
return result
def _execute_tool(self, tool_name: str, args: Dict) -> Dict:
# Your actual tool implementations
if tool_name == 'web_search':
return {'results': f"Search results for {args.get('query')}"}
return {'error': f'Unknown tool: {tool_name}'}
Step 3: Timeline API Backend
Build a FastAPI backend that serves timeline data for the UI:
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from typing import List, Optional
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # fine for local dev; lock this down in production
    allow_methods=["*"],
    allow_headers=["*"],
)
collector = EventCollector()
@app.get("/sessions")
async def get_sessions():
conn = sqlite3.connect(collector.db_path)
cursor = conn.execute("""
SELECT session_id, MIN(timestamp) as start_time,
COUNT(*) as event_count
FROM events
GROUP BY session_id
ORDER BY start_time DESC
""")
sessions = []
for row in cursor.fetchall():
sessions.append({
'session_id': row[0],
'start_time': row[1],
'event_count': row[2]
})
conn.close()
return sessions
@app.get("/sessions/{session_id}/timeline")
async def get_timeline(session_id: str, start_time: Optional[str] = None):
conn = sqlite3.connect(collector.db_path)
query = "SELECT * FROM events WHERE session_id = ?"
params = [session_id]
if start_time:
query += " AND timestamp >= ?"
params.append(start_time)
query += " ORDER BY timestamp"
cursor = conn.execute(query, params)
events = []
for row in cursor.fetchall():
events.append({
'event_id': row[0],
'session_id': row[1],
'timestamp': row[2],
'event_type': row[3],
'input_data': json.loads(row[4]),
'output_data': json.loads(row[5]),
'context': json.loads(row[6]),
'parent_event_id': row[7]
})
conn.close()
return {'events': events}
@app.get("/events/{event_id}")
async def get_event_details(event_id: str):
conn = sqlite3.connect(collector.db_path)
cursor = conn.execute("SELECT * FROM events WHERE event_id = ?", [event_id])
row = cursor.fetchone()
if not row:
raise HTTPException(status_code=404, detail="Event not found")
event = {
'event_id': row[0],
'session_id': row[1],
'timestamp': row[2],
'event_type': row[3],
'input_data': json.loads(row[4]),
'output_data': json.loads(row[5]),
'context': json.loads(row[6]),
'parent_event_id': row[7]
}
conn.close()
return event
Step 4: React Timeline Component
The frontend needs a scrub-friendly timeline that doesn't lag with hundreds of events:
import React, { useState, useEffect, useRef } from 'react';
import axios from 'axios';
const AgentDVR = ({ sessionId }) => {
const [events, setEvents] = useState([]);
const [currentEvent, setCurrentEvent] = useState(null);
const [playbackPosition, setPlaybackPosition] = useState(0);
const [isPlaying, setIsPlaying] = useState(false);
const timelineRef = useRef(null);
useEffect(() => {
loadTimeline();
}, [sessionId]);
const loadTimeline = async () => {
try {
const response = await axios.get(`/sessions/${sessionId}/timeline`);
setEvents(response.data.events);
if (response.data.events.length > 0) {
setCurrentEvent(response.data.events[0]);
}
} catch (error) {
console.error('Failed to load timeline:', error);
}
};
  const playbackRef = useRef(null);

  const stopPlayback = () => {
    // Clearing the interval is what actually stops playback;
    // flipping isPlaying alone would leave the timer ticking.
    clearInterval(playbackRef.current);
    setIsPlaying(false);
  };

  const handleTimelineClick = (eventIndex) => {
    stopPlayback();
    setPlaybackPosition(eventIndex);
    setCurrentEvent(events[eventIndex]);
  };

  const playTimeline = () => {
    if (isPlaying) {
      stopPlayback();
      return;
    }
    setIsPlaying(true);
    playbackRef.current = setInterval(() => {
      setPlaybackPosition(prev => {
        const next = prev + 1;
        if (next >= events.length) {
          stopPlayback();
          return prev;
        }
        setCurrentEvent(events[next]);
        return next;
      });
    }, 1000);
  };
return (
<div className="agent-dvr">
<div className="timeline-container">
<div className="playback-controls">
<button onClick={playTimeline}>
{isPlaying ? '⏸️' : '▶️'}
</button>
<span>{playbackPosition + 1} / {events.length}</span>
</div>
<div className="timeline" ref={timelineRef}>
{events.map((event, index) => (
<div
key={event.event_id}
className={`timeline-event ${index === playbackPosition ? 'active' : ''} ${event.event_type}`}
onClick={() => handleTimelineClick(index)}
>
<div className="event-marker" />
<div className="event-label">{event.event_type}</div>
</div>
))}
</div>
</div>
<div className="event-inspector">
{currentEvent && (
<EventInspector event={currentEvent} />
)}
</div>
</div>
);
};
const EventInspector = ({ event }) => {
return (
<div className="event-details">
<h3>{event.event_type} at {new Date(event.timestamp).toLocaleTimeString()}</h3>
<div className="event-section">
<h4>Input</h4>
<pre>{JSON.stringify(event.input_data, null, 2)}</pre>
</div>
<div className="event-section">
<h4>Output</h4>
<pre>{JSON.stringify(event.output_data, null, 2)}</pre>
</div>
{event.context && (
<div className="event-section">
<h4>Context</h4>
<pre>{JSON.stringify(event.context, null, 2)}</pre>
</div>
)}
</div>
);
};
export default AgentDVR;
Pitfalls: What Will Break and How to Handle It
Performance Death by a Thousand Events
When your agent makes 200 LLM calls in a session, the timeline becomes unusable. Solution: implement virtual scrolling and event aggregation. Group rapid-fire events into collapsed sections.
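Event aggregation can be as simple as collapsing consecutive same-type events that arrive within a short window. A sketch — the 2-second threshold and field names are arbitrary choices, tune them to your traffic:

```python
def collapse_bursts(events, gap_seconds=2.0):
    """Group consecutive same-type events separated by < gap_seconds."""
    groups = []
    for event in events:
        last = groups[-1] if groups else None
        if (last and last["type"] == event["type"]
                and event["t"] - last["end"] < gap_seconds):
            last["count"] += 1       # extend the current burst
            last["end"] = event["t"]
        else:
            groups.append({"type": event["type"], "count": 1,
                           "start": event["t"], "end": event["t"]})
    return groups

events = [
    {"type": "llm_call", "t": 0.0},
    {"type": "llm_call", "t": 0.5},
    {"type": "llm_call", "t": 1.2},
    {"type": "tool_call", "t": 1.5},
    {"type": "llm_call", "t": 10.0},  # far enough away to start a new group
]
summary = [(g["type"], g["count"]) for g in collapse_bursts(events)]
print(summary)  # [('llm_call', 3), ('tool_call', 1), ('llm_call', 1)]
```

The timeline then renders one marker per group with a count badge, and expands on click.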
Memory Explosion
Storing full prompt/response pairs for every call will eat your disk. Use a retention policy and compress old events. Keep full detail for the last 48 hours, summaries beyond that.
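A retention sweep can be a single DELETE against the same events table, run on a schedule. A sketch using the 48-hour cutoff from above (in production you'd summarize before deleting rather than just dropping rows):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def prune_events(conn, hours=48):
    """Drop full-detail events older than the retention window."""
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
    cur = conn.execute("DELETE FROM events WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # how many rows the sweep removed

# Demo with one stale event and one fresh one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, timestamp TEXT)")
old = (datetime.now(timezone.utc) - timedelta(hours=72)).isoformat()
new = datetime.now(timezone.utc).isoformat()
conn.executemany("INSERT INTO events VALUES (?, ?)", [("e1", old), ("e2", new)])

deleted = prune_events(conn)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(deleted, remaining)  # 1 1
```

Pair this with VACUUM (or SQLite's auto_vacuum) if you actually need the disk space back.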
Context Loss
The hardest part isn't capturing events — it's preserving enough context to understand WHY each decision was made. Always include the agent's memory state and reasoning chain, not just the I/O.
Race Conditions in Event Ordering
Concurrent tool calls can arrive out of order. Use high-resolution timestamps and implement event reordering on the backend. Don't trust arrival order.
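One way to sketch that reordering: give each event a per-process sequence number at capture time and sort on (timestamp, seq) at read time. The field names here are hypothetical, not part of the schema above:

```python
def order_events(events):
    """Sort by timestamp, breaking timestamp ties with a sequence number."""
    return sorted(events, key=lambda e: (e["timestamp"], e["seq"]))

# Two tool calls share a millisecond timestamp and arrived reversed;
# a third, earlier event arrived last of all.
arrived = [
    {"event_id": "b", "timestamp": "2024-01-01T10:00:00.001", "seq": 2},
    {"event_id": "a", "timestamp": "2024-01-01T10:00:00.001", "seq": 1},
    {"event_id": "c", "timestamp": "2024-01-01T10:00:00.000", "seq": 3},
]
ordered = [e["event_id"] for e in order_events(arrived)]
print(ordered)  # ['c', 'a', 'b']
```

A monotonic counter per producer is cheap insurance; wall-clock timestamps alone will betray you under concurrency.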
UI Performance with Large Timelines
React will choke on 1000+ timeline elements. Implement windowing — only render visible events plus a buffer. Libraries like react-window are your friend.
Measurement: How to Know It's Working
Your agent DVR is successful when:
- Time to root cause drops below 2 minutes — You can identify the failing step without reading logs
- Debugging becomes collaborative — You can share a session URL with your team instead of copying/pasting logs
- Patterns emerge — You start noticing repeated failure modes that weren't obvious in raw logs
- Agent performance improves — Better debugging leads to better prompts and tool selection
Track these metrics:
- Average session replay views per debugging session
- Time from bug report to root cause identification
- Number of "I can't reproduce this" tickets (should go to zero)
Next Steps: Try the Full Implementation
The code above is a working foundation, but production requires more sophistication. Airblackbox provides this out of the box with zero configuration — just point your agent at our gateway and you get:
- Automatic event capture for LangChain, CrewAI, AutoGen
- Production-ready timeline UI with search and filtering
- Session sharing and team collaboration
- EU AI Act compliance scanning (6/6 technical checks passing)
Ready to stop debugging AI agents with vibes?
Try the live demo: github.com/airblackbox/agent-dvr
Or spin up the full observability platform: docs.airblackbox.com/quickstart
Your future self — the one debugging agents at 2 AM — will thank you.
Because elegant systems are nice. Observable systems are nicer. Observable systems with DVR replay are how you stop your agents from becoming very expensive goldfish.