Build a DVR for Your AI Agent: Episode Replay UI Tutorial
Your AI agent just crashed in production, and the only evidence is a vague error message and the cold sweat on your forehead.
The Problem: AI Agent Debugging Is Still Medieval
Your autonomous agent was supposed to book a conference room, send three emails, and update a spreadsheet. Instead, it booked three conference rooms, sent the same email 47 times to your CEO, and deleted half your customer database.
The logs show:
INFO: Agent completed task successfully
ERROR: Database connection lost
INFO: Retrying operation
INFO: Task completed
Helpful. Really.
Traditional debugging assumes deterministic code. But AI agents are probabilistic chaos machines with memory problems and an alarming tendency to hallucinate their way through error handling. Standard logging wasn't built for "the LLM decided that 'delete customer' was the same as 'update customer' because both contain the word 'customer.'"
You need to see what the agent was thinking, not just what it did. You need a DVR for AI agents — something that records every decision, every context window, every tool call, and lets you replay the entire episode step by step.
Architecture: How Agent DVR Actually Works
graph TD
A[AI Agent] -->|Instrumented calls| B[Airblackbox Gateway]
B --> C[OpenAI/Anthropic API]
C --> B
B -->|Records everything| D[Storage Layer]
D --> E[Episode Extractor]
E --> F[Context Reconstructor]
F --> G[Replay UI]
G --> H[Step Debugger]
G --> I[Context Viewer]
G --> J[Decision Tree]
subgraph "What gets recorded"
K[Prompt templates]
L[Full conversations]
M[Tool calls & responses]
N[Token usage & costs]
O[Timing data]
P[Error states]
end
B -.-> K
B -.-> L
B -.-> M
B -.-> N
B -.-> O
B -.-> P
The Airblackbox Gateway sits between your agent and the LLM API, recording everything without changing your code. The Episode Extractor groups related calls into logical sessions. The Context Reconstructor rebuilds the agent's complete mental state at each decision point. The Replay UI lets you step through the episode like a debugger.
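To make the pipeline concrete, here is a sketch of the record each intercepted call produces. The field names mirror the llm_calls schema used later in this tutorial; they are our assumptions for illustration, not the gateway's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class RecordedCall:
    timestamp: float                 # Unix epoch seconds
    model: str                       # e.g. "gpt-4"
    messages: List[Dict[str, str]]   # the full prompt as sent to the API
    response: Dict[str, Any]         # the raw API response body
    metadata: Dict[str, Any] = field(default_factory=dict)  # session/operation tags
    tokens_used: int = 0
    cost: float = 0.0

# One step of the sentiment-analysis episode from this tutorial
call = RecordedCall(
    timestamp=1700000000.0,
    model="gpt-4",
    messages=[{"role": "user", "content": "Classify this issue"}],
    response={"choices": [{"message": {"content": "NEUTRAL"}}]},
    metadata={"operation": "sentiment_analysis"},
    tokens_used=42,
    cost=0.0013,
)
```

Everything downstream (episode extraction, context reconstruction, the replay UI) is just queries and views over a list of these records.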
Implementation: Building Your Agent DVR
Step 1: Install and Configure Airblackbox Gateway
First, get the gateway running:
# Install airblackbox
pip install airblackbox
# Start the gateway (runs on port 8000)
airblackbox gateway --port 8000 --storage sqlite:///agent_episodes.db
Step 2: Instrument Your Agent Code
Here's a realistic AI agent that manages GitHub issues. Notice how we change exactly one line — the base URL:
# github_agent.py
import requests
import openai
from typing import List, Dict

class GitHubAgent:
    def __init__(self, github_token: str):
        # THIS IS THE ONLY LINE YOU CHANGE
        self.client = openai.OpenAI(
            base_url="http://localhost:8000/v1",  # Points to Airblackbox Gateway
            api_key="your_openai_key"  # or read from the OPENAI_API_KEY env var
        )
        self.github_token = github_token
        self.headers = {"Authorization": f"token {github_token}"}

    def get_issues(self, repo: str) -> List[Dict]:
        """Fetch open issues from a GitHub repository"""
        url = f"https://api.github.com/repos/{repo}/issues"
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        return response.json()

    def analyze_issue_sentiment(self, issue: Dict) -> str:
        """Analyze the emotional tone of an issue"""
        prompt = f"""
        Analyze the sentiment of this GitHub issue:
        Title: {issue['title']}
        Body: {(issue.get('body') or '')[:500]}
        Classify as: URGENT, FRUSTRATED, NEUTRAL, or GRATEFUL
        Provide reasoning.
        """
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            metadata={"operation": "sentiment_analysis", "issue_id": issue['id']}
        )
        return response.choices[0].message.content

    def generate_response_draft(self, issue: Dict, sentiment: str) -> str:
        """Generate appropriate response based on sentiment"""
        prompt = f"""
        Draft a response to this {sentiment} GitHub issue:
        Issue: {issue['title']}
        Sentiment: {sentiment}
        Be professional, helpful, and match the appropriate tone.
        """
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            metadata={"operation": "response_generation", "sentiment": sentiment}
        )
        return response.choices[0].message.content

    def process_issue_batch(self, repo: str, max_issues: int = 5):
        """Process a batch of issues with full context"""
        issues = self.get_issues(repo)[:max_issues]
        results = []
        for issue in issues:
            print(f"Processing issue #{issue['number']}: {issue['title']}")
            # Each batch forms a natural "episode" boundary
            sentiment = self.analyze_issue_sentiment(issue)
            response = self.generate_response_draft(issue, sentiment)
            results.append({
                "issue": issue,
                "sentiment": sentiment,
                "draft_response": response
            })
        return results

# Usage
agent = GitHubAgent(github_token="your_token")
results = agent.process_issue_batch("microsoft/vscode")
Step 3: Build the Episode Replay UI
Now create a Flask app that reads from the Airblackbox storage and builds an interactive replay interface:
# episode_replay.py
import json
import sqlite3
from typing import List, Dict
from flask import Flask, render_template, jsonify

app = Flask(__name__)

class EpisodeReplay:
    def __init__(self, db_path: str):
        self.db_path = db_path

    def get_episodes(self) -> List[Dict]:
        """Extract distinct episodes from recorded calls"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        # Bucket calls into hour-wide windows as a rough session heuristic
        cursor.execute("""
            SELECT
                datetime(MIN(timestamp), 'unixepoch') AS session_time,
                COUNT(*) AS call_count,
                MIN(timestamp) AS start_time,
                MAX(timestamp) AS end_time,
                GROUP_CONCAT(json_extract(metadata, '$.operation')) AS operations
            FROM llm_calls
            WHERE metadata LIKE '%operation%'
            GROUP BY CAST(timestamp / 3600 AS INTEGER)  -- one bucket per hour
            ORDER BY start_time DESC
        """)
        episodes = []
        for row in cursor.fetchall():
            episodes.append({
                "id": len(episodes),
                "session_time": row[0],
                "call_count": row[1],
                "start_time": row[2],
                "end_time": row[3],
                "operations": row[4].split(',') if row[4] else []
            })
        conn.close()
        return episodes

    def get_episode_calls(self, episode_id: int) -> List[Dict]:
        """Get all LLM calls for a specific episode"""
        episodes = self.get_episodes()
        if episode_id >= len(episodes):
            return []
        episode = episodes[episode_id]
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute("""
            SELECT timestamp, model, messages, response, metadata, tokens_used, cost
            FROM llm_calls
            WHERE timestamp BETWEEN ? AND ?
            ORDER BY timestamp ASC
        """, (episode['start_time'], episode['end_time']))
        calls = []
        for row in cursor.fetchall():
            calls.append({
                "timestamp": row[0],
                "model": row[1],
                "messages": json.loads(row[2]),
                "response": json.loads(row[3]),
                "metadata": json.loads(row[4]) if row[4] else {},
                "tokens_used": row[5],
                "cost": row[6]
            })
        conn.close()
        return calls

replay = EpisodeReplay("agent_episodes.db")

@app.route('/')
def episodes_list():
    episodes = replay.get_episodes()
    return render_template('episodes.html', episodes=episodes)

@app.route('/episode/<int:episode_id>')
def episode_detail(episode_id: int):
    calls = replay.get_episode_calls(episode_id)
    return render_template('episode_detail.html',
                           episode_id=episode_id,
                           calls=calls)

@app.route('/api/episode/<int:episode_id>/step/<int:step>')
def get_step_context(episode_id: int, step: int):
    """API endpoint for step-by-step debugging"""
    calls = replay.get_episode_calls(episode_id)
    if step >= len(calls):
        return jsonify({"error": "Step out of range"}), 404
    current_call = calls[step]
    context_window = calls[max(0, step - 2):step + 1]  # Include the 2 previous calls
    return jsonify({
        "current_call": current_call,
        "context_window": context_window,
        "step": step,
        "total_steps": len(calls)
    })

if __name__ == '__main__':
    app.run(debug=True, port=5001)
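The hour-bucket heuristic in get_episodes is crude: two unrelated agent runs in the same hour merge into one episode. A gap-based grouper — start a new episode whenever the agent goes idle for longer than a threshold — usually tracks real sessions better. A minimal sketch over the same call dicts (the 5-minute threshold is an assumption to tune):

```python
from typing import Dict, List

def group_by_gap(calls: List[Dict], max_gap_seconds: float = 300.0) -> List[List[Dict]]:
    """Split time-ordered calls into episodes wherever the idle gap
    between consecutive calls exceeds max_gap_seconds."""
    episodes: List[List[Dict]] = []
    for call in sorted(calls, key=lambda c: c["timestamp"]):
        if episodes and call["timestamp"] - episodes[-1][-1]["timestamp"] <= max_gap_seconds:
            episodes[-1].append(call)  # continues the current episode
        else:
            episodes.append([call])    # idle gap too long: new episode
    return episodes

# Three calls in quick succession, then two more after a long pause
calls = [{"timestamp": t} for t in (0, 10, 20, 1000, 1005)]
print([len(e) for e in group_by_gap(calls)])  # → [3, 2]
```

You could swap this into EpisodeReplay by fetching all rows ordered by timestamp and grouping in Python instead of in SQL.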
Step 4: Create the Frontend Templates
Create templates/episode_detail.html for the interactive replay:
<!-- templates/episode_detail.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Episode Replay - Step by Step</title>
    <style>
        .step-debugger { display: flex; height: 100vh; }
        .timeline { width: 200px; background: #f5f5f5; padding: 20px; }
        .content { flex: 1; padding: 20px; }
        .step-item { padding: 10px; cursor: pointer; border-radius: 4px; margin: 5px 0; }
        .step-item.active { background: #007bff; color: white; }
        .call-details { background: #f8f9fa; padding: 20px; border-radius: 8px; margin: 10px 0; }
        .prompt { background: #e3f2fd; padding: 15px; border-radius: 4px; }
        .response { background: #f3e5f5; padding: 15px; border-radius: 4px; margin-top: 10px; }
        .metadata { font-size: 0.9em; color: #666; margin-top: 10px; }
    </style>
</head>
<body>
    <div class="step-debugger">
        <div class="timeline">
            <h3>Episode Steps</h3>
            {% for call in calls %}
            <div class="step-item" onclick="showStep({{ loop.index0 }})"
                 id="step-{{ loop.index0 }}">
                <strong>Step {{ loop.index }}</strong><br>
                {{ call.metadata.operation|default('LLM Call') }}<br>
                <small>{{ call.tokens_used }} tokens</small>
            </div>
            {% endfor %}
        </div>
        <div class="content">
            <div id="step-content">
                <h2>Select a step to replay</h2>
                <p>Use the timeline on the left to step through the agent's decision process.</p>
            </div>
            <div id="context-window" style="display: none;">
                <h3>Context Window</h3>
                <div id="context-calls"></div>
            </div>
        </div>
    </div>

    <script>
        let currentStep = -1;

        function showStep(stepIndex) {
            // Update UI
            document.querySelectorAll('.step-item').forEach(item => {
                item.classList.remove('active');
            });
            document.getElementById(`step-${stepIndex}`).classList.add('active');

            // Fetch step details
            fetch(`/api/episode/{{ episode_id }}/step/${stepIndex}`)
                .then(response => response.json())
                .then(data => {
                    displayStepContent(data);
                    currentStep = stepIndex;
                });
        }

        function displayStepContent(data) {
            const call = data.current_call;
            const content = document.getElementById('step-content');
            content.innerHTML = `
                <h2>Step ${data.step + 1} of ${data.total_steps}</h2>
                <div class="call-details">
                    <h3>Operation: ${call.metadata.operation || 'LLM Call'}</h3>
                    <div class="prompt">
                        <h4>Input Prompt:</h4>
                        <pre>${JSON.stringify(call.messages, null, 2)}</pre>
                    </div>
                    <div class="response">
                        <h4>LLM Response:</h4>
                        <pre>${JSON.stringify(call.response, null, 2)}</pre>
                    </div>
                    <div class="metadata">
                        <strong>Model:</strong> ${call.model}<br>
                        <strong>Tokens:</strong> ${call.tokens_used}<br>
                        <strong>Cost:</strong> $${call.cost.toFixed(4)}<br>
                        <strong>Timestamp:</strong> ${new Date(call.timestamp * 1000).toLocaleString()}
                    </div>
                </div>
            `;

            // Show context window
            if (data.context_window.length > 1) {
                displayContextWindow(data.context_window);
            }
        }

        function displayContextWindow(contextCalls) {
            const contextDiv = document.getElementById('context-window');
            const callsDiv = document.getElementById('context-calls');
            let contextHTML = '';
            contextCalls.forEach((call, index) => {
                const isCurrentCall = index === contextCalls.length - 1;
                contextHTML += `
                    <div class="call-details" style="${isCurrentCall ? 'border: 2px solid #007bff;' : 'opacity: 0.7;'}">
                        <h4>${call.metadata.operation || 'LLM Call'} ${isCurrentCall ? '(Current)' : ''}</h4>
                        <div style="font-size: 0.9em;">
                            <strong>Input:</strong> ${call.messages[0]?.content?.substring(0, 100)}...<br>
                            <strong>Output:</strong> ${call.response.choices?.[0]?.message?.content?.substring(0, 100)}...
                        </div>
                    </div>
                `;
            });
            callsDiv.innerHTML = contextHTML;
            contextDiv.style.display = 'block';
        }

        // Keyboard navigation
        document.addEventListener('keydown', (e) => {
            if (e.key === 'ArrowLeft' && currentStep > 0) {
                showStep(currentStep - 1);
            } else if (e.key === 'ArrowRight' && currentStep < {{ calls|length - 1 }}) {
                showStep(currentStep + 1);
            }
        });
    </script>
</body>
</html>
Pitfalls: What Will Break and How to Handle It
Memory Explosion
Recording everything creates massive datasets. Your SQLite file will grow fast.
Solution: Implement retention policies:
# Cleanup old episodes
cursor.execute("DELETE FROM llm_calls WHERE timestamp < ?",
               (time.time() - 7*24*3600,))  # Keep 1 week
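The snippet above assumes an open cursor and a time import. A self-contained version that also commits and reclaims disk space (function name is ours; the llm_calls table comes from this tutorial's schema):

```python
import sqlite3
import time

def prune_old_calls(db_path: str, max_age_days: int = 7) -> int:
    """Delete recorded calls older than max_age_days and shrink the file.
    Returns the number of rows removed."""
    cutoff = time.time() - max_age_days * 24 * 3600
    conn = sqlite3.connect(db_path)
    try:
        deleted = conn.execute(
            "DELETE FROM llm_calls WHERE timestamp < ?", (cutoff,)
        ).rowcount
        conn.commit()
        conn.execute("VACUUM")  # DELETE alone leaves the file size unchanged
        return deleted
    finally:
        conn.close()
```

Run it from a cron job or a scheduled task; without the VACUUM, SQLite keeps the freed pages and the file never shrinks.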
Context Window Reconstruction Fails
LLMs have context limits. Long conversations get truncated, breaking replay accuracy.
Solution: Store the actual context sent to the model, not just the messages:
# In your gateway configuration
STORE_FULL_CONTEXT = True
MAX_CONTEXT_TOKENS = 8192
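The point is to record both the full history and the slice the model actually received. A minimal sketch of that split (character count as a rough proxy for tokens; function name and budget are our assumptions):

```python
from typing import Dict, List, Tuple

def truncate_for_model(messages: List[Dict[str, str]],
                       max_chars: int = 32000) -> Tuple[List[Dict], List[Dict]]:
    """Keep the most recent messages that fit the budget and return
    (sent, full) so both can be recorded. Replaying from `sent` shows
    exactly what the model saw at this step."""
    sent: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):  # walk backwards from the newest message
        used += len(msg["content"])
        if sent and used > max_chars:
            break  # budget exhausted; always keep at least one message
        sent.insert(0, msg)
    return sent, messages

history = [{"role": "user", "content": "x" * 20000},
           {"role": "assistant", "content": "y" * 20000},
           {"role": "user", "content": "latest question"}]
sent, full = truncate_for_model(history)
print(len(sent), len(full))  # → 2 3
```

If you only store `full`, a replayed step can show context the model never saw, and the "why did it do that?" question becomes unanswerable.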
Session Boundary Detection Is Wrong
The replay UI groups unrelated calls into fake "episodes."
Solution: Add explicit session tracking:
# Add to your agent code
response = self.client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    metadata={
        "session_id": self.session_id,
        "operation": "sentiment_analysis",
        "step_index": self.current_step
    }
)
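The agent needs to mint that session_id somewhere. A minimal sketch of a tracker that starts a fresh ID per episode and counts steps (class and method names are ours, not an Airblackbox API):

```python
import time
import uuid

class SessionTracker:
    """Mint one session_id per episode and number its steps, so every
    recorded call can be grouped exactly instead of by time heuristics."""
    def __init__(self):
        self.start_episode()

    def start_episode(self) -> None:
        # Timestamp prefix keeps IDs sortable; uuid suffix keeps them unique
        self.session_id = f"ep-{int(time.time())}-{uuid.uuid4().hex[:8]}"
        self.current_step = 0

    def next_metadata(self, operation: str) -> dict:
        meta = {"session_id": self.session_id,
                "operation": operation,
                "step_index": self.current_step}
        self.current_step += 1
        return meta

tracker = SessionTracker()
m0 = tracker.next_metadata("sentiment_analysis")
m1 = tracker.next_metadata("response_generation")
```

Call start_episode at the top of process_issue_batch and pass tracker.next_metadata(...) as the metadata argument; the replay UI can then group by session_id instead of guessing.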
Performance Dies with Large Episodes
Rendering 500+ LLM calls in the browser locks up the UI.
Solution: Implement pagination and lazy loading:
// Load steps on demand and cache them so each step is fetched only once
const stepCache = {};

function showStep(stepIndex) {
    if (stepCache[stepIndex]) {
        displayStepContent(stepCache[stepIndex]);
        return;
    }
    fetch(`/api/episode/${episodeId}/step/${stepIndex}`)
        .then(response => response.json())
        .then(data => {
            stepCache[stepIndex] = data;
            displayStepContent(data);
        });
}
Measurement: How to Know It's Working
Test with a Controlled Failure
Create an agent that deliberately fails:
def test_episode_recording():
    agent = GitHubAgent(github_token="invalid_token")
    try:
        # This will fail, creating a clear episode
        results = agent.process_issue_batch("nonexistent/repo")
    except Exception as e:
        print(f"Expected failure: {e}")

    # Check the episode was recorded
    replay = EpisodeReplay("agent_episodes.db")
    episodes = replay.get_episodes()
    assert len(episodes) > 0, "No episodes recorded"

    # Verify step-by-step data
    calls = replay.get_episode_calls(0)
    assert len(calls) > 0, "No calls in episode"
    assert all('timestamp' in call for call in calls), "Missing timestamps"
Verify Context Reconstruction
The real test: can you debug a failure you've never seen before?
- Run your agent on a complex task
- Let it fail mysteriously
- Open the replay UI
- Step through until you find the exact moment it went wrong
- You should see the prompt, the response, and understand why
If you can't identify the failure point in under 5 minutes, your episode recording isn't detailed enough.
Check Storage Performance
Monitor your database size and query speed:
# Check episode database size
ls -lh agent_episodes.db
# Query performance test
time sqlite3 agent_episodes.db "SELECT COUNT(*) FROM llm_calls WHERE timestamp > $(date -d '1 day ago' +%s)"
If queries take longer than 100ms, add indexes:
CREATE INDEX idx_timestamp ON llm_calls(timestamp);
CREATE INDEX idx_metadata ON llm_calls(json_extract(metadata, '$.session_id'));
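You can confirm the index is actually used without timing anything, via EXPLAIN QUERY PLAN (schema assumed from this tutorial):

```python
import sqlite3

# Sanity check: SQLite should SEARCH via idx_timestamp for the range
# filter rather than SCAN the whole llm_calls table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_calls (timestamp REAL, metadata TEXT)")
conn.execute("CREATE INDEX idx_timestamp ON llm_calls(timestamp)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM llm_calls WHERE timestamp > ?",
    (0,),
).fetchall()
# The detail text varies by SQLite version, but should mention idx_timestamp
print(plan[-1][3])
```

If the plan says SCAN instead of SEARCH after you add the index, the query is written in a way the planner can't use it (e.g. wrapping timestamp in a function call).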
Next Steps: Build Your Agent DVR
- Clone the complete demo: github.com/airblackbox/agent-dvr-tutorial
- Start with your existing agent: Change one line (the base URL) and start recording
- Add session tracking: Include session_id and operation in your metadata
- Build your replay UI: Use the Flask template above as a starting point
Your AI agents will still be probabilistic chaos machines. But now when they misbehave, you'll have a complete recording of their decision process. No more debugging by vibes — you'll see exactly where the wheels fell off.
The DVR doesn't prevent AI agent failures. It just makes them debuggable. Which, honestly, is most of the battle.
Try Airblackbox Gateway — your future debugging self will thank you.