Build a DVR for Your AI Agent: Episode Replay UI Tutorial
Your AI agent just crashed in production, and the only evidence is a vague error message and the cold sweat on your forehead.
The Problem: AI Agent Debugging Is Still Medieval
Your autonomous agent was supposed to book a conference room, send three emails, and update a spreadsheet. Instead, it booked three conference rooms, sent the same email 47 times to your CEO, and deleted half your customer database.
The logs show:
INFO: Agent completed task successfully
ERROR: Database connection lost
INFO: Retrying operation
INFO: Task completed
Helpful. Really.
Traditional debugging assumes deterministic code. But AI agents are probabilistic chaos machines with memory problems and an alarming tendency to hallucinate their way through error handling. Standard logging wasn't built for "the LLM decided that 'delete customer' was the same as 'update customer' because both contain the word 'customer.'"
You need to see what the agent was thinking, not just what it did. You need a DVR for AI agents — something that records every decision, every context window, every tool call, and lets you replay the entire episode step by step.
Architecture: How Agent DVR Actually Works
graph TD
A[AI Agent] -->|Instrumented calls| B[Airblackbox Gateway]
B --> C[OpenAI/Anthropic API]
C --> B
B -->|Records everything| D[Storage Layer]
D --> E[Episode Extractor]
E --> F[Context Reconstructor]
F --> G[Replay UI]
G --> H[Step Debugger]
G --> I[Context Viewer]
G --> J[Decision Tree]
subgraph "What gets recorded"
K[Prompt templates]
L[Full conversations]
M[Tool calls & responses]
N[Token usage & costs]
O[Timing data]
P[Error states]
end
B -.-> K
B -.-> L
B -.-> M
B -.-> N
B -.-> O
B -.-> P
The Airblackbox Gateway sits between your agent and the LLM API, recording everything without changing your code. The Episode Extractor groups related calls into logical sessions. The Context Reconstructor rebuilds the agent's complete mental state at each decision point. The Replay UI lets you step through the episode like a debugger.
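To make the pipeline concrete, here is a sketch of the record each intercepted call produces. The field names mirror the llm_calls schema used later in this tutorial; they are our assumptions for illustration, not the gateway's actual storage format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class RecordedCall:
    timestamp: float                 # Unix epoch seconds
    model: str                       # e.g. "gpt-4"
    messages: List[Dict[str, str]]   # the full prompt as sent to the API
    response: Dict[str, Any]         # the raw API response body
    metadata: Dict[str, Any] = field(default_factory=dict)  # session/operation tags
    tokens_used: int = 0
    cost: float = 0.0

# One step of the sentiment-analysis episode from this tutorial
call = RecordedCall(
    timestamp=1700000000.0,
    model="gpt-4",
    messages=[{"role": "user", "content": "Classify this issue"}],
    response={"choices": [{"message": {"content": "NEUTRAL"}}]},
    metadata={"operation": "sentiment_analysis"},
    tokens_used=42,
    cost=0.0013,
)
```

Everything downstream (episode extraction, context reconstruction, the replay UI) is just queries and views over a list of these records.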
Implementation: Building Your Agent DVR
Step 1: Install and Configure Airblackbox Gateway
First, get the gateway running:
# Install airblackbox
pip install airblackbox
# Start the gateway (runs on port 8000)
airblackbox gateway --port 8000 --storage sqlite:///agent_episodes.db
Step 2: Instrument Your Agent Code
Here's a realistic AI agent that manages GitHub issues. Notice how we change exactly one line — the base URL:
# github_agent.py
import requests
import openai
from typing import List, Dict

class GitHubAgent:
    def __init__(self, github_token: str):
        # THIS IS THE ONLY LINE YOU CHANGE
        self.client = openai.OpenAI(
            base_url="http://localhost:8000/v1",  # Points to Airblackbox Gateway
            api_key="your_openai_key"  # or read from the OPENAI_API_KEY env var
        )
        self.github_token = github_token
        self.headers = {"Authorization": f"token {github_token}"}

    def get_issues(self, repo: str) -> List[Dict]:
        """Fetch open issues from a GitHub repository"""
        url = f"https://api.github.com/repos/{repo}/issues"
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        return response.json()

    def analyze_issue_sentiment(self, issue: Dict) -> str:
        """Analyze the emotional tone of an issue"""
        prompt = f"""
        Analyze the sentiment of this GitHub issue:
        Title: {issue['title']}
        Body: {(issue.get('body') or '')[:500]}
        Classify as: URGENT, FRUSTRATED, NEUTRAL, or GRATEFUL
        Provide reasoning.
        """
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            metadata={"operation": "sentiment_analysis", "issue_id": issue['id']}
        )
        return response.choices[0].message.content

    def generate_response_draft(self, issue: Dict, sentiment: str) -> str:
        """Generate appropriate response based on sentiment"""
        prompt = f"""
        Draft a response to this {sentiment} GitHub issue:
        Issue: {issue['title']}
        Sentiment: {sentiment}
        Be professional, helpful, and match the appropriate tone.
        """
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            metadata={"operation": "response_generation", "sentiment": sentiment}
        )
        return response.choices[0].message.content

    def process_issue_batch(self, repo: str, max_issues: int = 5):
        """Process a batch of issues with full context"""
        issues = self.get_issues(repo)[:max_issues]
        results = []
        for issue in issues:
            print(f"Processing issue #{issue['number']}: {issue['title']}")
            # Each batch forms a natural "episode" boundary
            sentiment = self.analyze_issue_sentiment(issue)
            response = self.generate_response_draft(issue, sentiment)
            results.append({
                "issue": issue,
                "sentiment": sentiment,
                "draft_response": response
            })
        return results

# Usage
agent = GitHubAgent(github_token="your_token")
results = agent.process_issue_batch("microsoft/vscode")
Step 3: Build the Episode Replay UI
Now create a Flask app that reads from the Airblackbox storage and builds an interactive replay interface:
# episode_replay.py
import json
import sqlite3
from typing import List, Dict
from flask import Flask, render_template, jsonify

app = Flask(__name__)

class EpisodeReplay:
    def __init__(self, db_path: str):
        self.db_path = db_path

    def get_episodes(self) -> List[Dict]:
        """Extract distinct episodes from recorded calls"""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        # Bucket calls into hour-wide windows as a rough session heuristic
        cursor.execute("""
            SELECT
                datetime(MIN(timestamp), 'unixepoch') AS session_time,
                COUNT(*) AS call_count,
                MIN(timestamp) AS start_time,
                MAX(timestamp) AS end_time,
                GROUP_CONCAT(json_extract(metadata, '$.operation')) AS operations
            FROM llm_calls
            WHERE metadata LIKE '%operation%'
            GROUP BY CAST(timestamp / 3600 AS INTEGER)  -- one bucket per hour
            ORDER BY start_time DESC
        """)
        episodes = []
        for row in cursor.fetchall():
            episodes.append({
                "id": len(episodes),
                "session_time": row[0],
                "call_count": row[1],
                "start_time": row[2],
                "end_time": row[3],
                "operations": row[4].split(',') if row[4] else []
            })
        conn.close()
        return episodes

    def get_episode_calls(self, episode_id: int) -> List[Dict]:
        """Get all LLM calls for a specific episode"""
        episodes = self.get_episodes()
        if episode_id >= len(episodes):
            return []
        episode = episodes[episode_id]
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute("""
            SELECT timestamp, model, messages, response, metadata, tokens_used, cost
            FROM llm_calls
            WHERE timestamp BETWEEN ? AND ?
            ORDER BY timestamp ASC
        """, (episode['start_time'], episode['end_time']))
        calls = []
        for row in cursor.fetchall():
            calls.append({
                "timestamp": row[0],
                "model": row[1],
                "messages": json.loads(row[2]),
                "response": json.loads(row[3]),
                "metadata": json.loads(row[4]) if row[4] else {},
                "tokens_used": row[5],
                "cost": row[6]
            })
        conn.close()
        return calls

replay = EpisodeReplay("agent_episodes.db")

@app.route('/')
def episodes_list():
    episodes = replay.get_episodes()
    return render_template('episodes.html', episodes=episodes)

@app.route('/episode/<int:episode_id>')
def episode_detail(episode_id: int):
    calls = replay.get_episode_calls(episode_id)
    return render_template('episode_detail.html',
                           episode_id=episode_id,
                           calls=calls)

@app.route('/api/episode/<int:episode_id>/step/<int:step>')
def get_step_context(episode_id: int, step: int):
    """API endpoint for step-by-step debugging"""
    calls = replay.get_episode_calls(episode_id)
    if step >= len(calls):
        return jsonify({"error": "Step out of range"}), 404
    current_call = calls[step]
    context_window = calls[max(0, step - 2):step + 1]  # Include the 2 previous calls
    return jsonify({
        "current_call": current_call,
        "context_window": context_window,
        "step": step,
        "total_steps": len(calls)
    })

if __name__ == '__main__':
    app.run(debug=True, port=5001)
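The hour-bucket heuristic in get_episodes is crude: two unrelated agent runs in the same hour merge into one episode. A gap-based grouper — start a new episode whenever the agent goes idle for longer than a threshold — usually tracks real sessions better. A minimal sketch over the same call dicts (the 5-minute threshold is an assumption to tune):

```python
from typing import Dict, List

def group_by_gap(calls: List[Dict], max_gap_seconds: float = 300.0) -> List[List[Dict]]:
    """Split time-ordered calls into episodes wherever the idle gap
    between consecutive calls exceeds max_gap_seconds."""
    episodes: List[List[Dict]] = []
    for call in sorted(calls, key=lambda c: c["timestamp"]):
        if episodes and call["timestamp"] - episodes[-1][-1]["timestamp"] <= max_gap_seconds:
            episodes[-1].append(call)  # continues the current episode
        else:
            episodes.append([call])    # idle gap too long: new episode
    return episodes

# Three calls in quick succession, then two more after a long pause
calls = [{"timestamp": t} for t in (0, 10, 20, 1000, 1005)]
print([len(e) for e in group_by_gap(calls)])  # → [3, 2]
```

You could swap this into EpisodeReplay by fetching all rows ordered by timestamp and grouping in Python instead of in SQL.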
Step 4: Create the Frontend Templates
Create templates/episode_detail.html for the interactive replay:
<!-- templates/episode_detail.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Episode Replay - Step by Step</title>
    <style>
        .step-debugger { display: flex; height: 100vh; }
        .timeline { width: 200px; background: #f5f5f5; padding: 20px; }
        .content { flex: 1; padding: 20px; }
        .step-item { padding: 10px; cursor: pointer; border-radius: 4px; margin: 5px 0; }
        .step-item.active { background: #007bff; color: white; }
        .call-details { background: #f8f9fa; padding: 20px; border-radius: 8px; margin: 10px 0; }
        .prompt { background: #e3f2fd; padding: 15px; border-radius: 4px; }
        .response { background: #f3e5f5; padding: 15px; border-radius: 4px; margin-top: 10px; }
        .metadata { font-size: 0.9em; color: #666; margin-top: 10px; }
    </style>
</head>
<body>
    <div class="step-debugger">
        <div class="timeline">
            <h3>Episode Steps</h3>
            {% for call in calls %}
            <div class="step-item" onclick="showStep({{ loop.index0 }})"
                 id="step-{{ loop.index0 }}">
                <strong>Step {{ loop.index }}</strong><br>
                {{ call.metadata.operation|default('LLM Call') }}<br>
                <small>{{ call.tokens_used }} tokens</small>
            </div>
            {% endfor %}
        </div>
        <div class="content">
            <div id="step-content">
                <h2>Select a step to replay</h2>
                <p>Use the timeline on the left to step through the agent's decision process.</p>
            </div>
            <div id="context-window" style="display: none;">
                <h3>Context Window</h3>
                <div id="context-calls"></div>
            </div>
        </div>
    </div>

    <script>
        let currentStep = -1;

        function showStep(stepIndex) {
            // Update UI
            document.querySelectorAll('.step-item').forEach(item => {
                item.classList.remove('active');
            });
            document.getElementById(`step-${stepIndex}`).classList.add('active');

            // Fetch step details
            fetch(`/api/episode/{{ episode_id }}/step/${stepIndex}`)
                .then(response => response.json())
                .then(data => {
                    displayStepContent(data);
                    currentStep = stepIndex;
                });
        }

        function displayStepContent(data) {
            const call = data.current_call;
            const content = document.getElementById('step-content');
            content.innerHTML = `
                <h2>Step ${data.step + 1} of ${data.total_steps}</h2>
                <div class="call-details">
                    <h3>Operation: ${call.metadata.operation || 'LLM Call'}</h3>
                    <div class="prompt">
                        <h4>Input Prompt:</h4>
                        <pre>${JSON.stringify(call.messages, null, 2)}</pre>
                    </div>
                    <div class="response">
                        <h4>LLM Response:</h4>
                        <pre>${JSON.stringify(call.response, null, 2)}</pre>
                    </div>
                    <div class="metadata">
                        <strong>Model:</strong> ${call.model}<br>
                        <strong>Tokens:</strong> ${call.tokens_used}<br>
                        <strong>Cost:</strong> $${call.cost.toFixed(4)}<br>
                        <strong>Timestamp:</strong> ${new Date(call.timestamp * 1000).toLocaleString()}
                    </div>
                </div>
            `;

            // Show context window
            if (data.context_window.length > 1) {
                displayContextWindow(data.context_window);
            }
        }

        function displayContextWindow(contextCalls) {
            const contextDiv = document.getElementById('context-window');
            const callsDiv = document.getElementById('context-calls');
            let contextHTML = '';
            contextCalls.forEach((call, index) => {
                const isCurrentCall = index === contextCalls.length - 1;
                contextHTML += `
                    <div class="call-details" style="${isCurrentCall ? 'border: 2px solid #007bff;' : 'opacity: 0.7;'}">
                        <h4>${call.metadata.operation || 'LLM Call'} ${isCurrentCall ? '(Current)' : ''}</h4>
                        <div style="font-size: 0.9em;">
                            <strong>Input:</strong> ${call.messages[0]?.content?.substring(0, 100)}...<br>
                            <strong>Output:</strong> ${call.response.choices?.[0]?.message?.content?.substring(0, 100)}...
                        </div>
                    </div>
                `;
            });
            callsDiv.innerHTML = contextHTML;
            contextDiv.style.display = 'block';
        }

        // Keyboard navigation
        document.addEventListener('keydown', (e) => {
            if (e.key === 'ArrowLeft' && currentStep > 0) {
                showStep(currentStep - 1);
            } else if (e.key === 'ArrowRight' && currentStep < {{ calls|length - 1 }}) {
                showStep(currentStep + 1);
            }
        });
    </script>
</body>
</html>
Pitfalls: What Will Break and How to Handle It
Memory Explosion
Recording everything creates massive datasets. Your SQLite file will grow fast.
Solution: Implement retention policies:
# Cleanup old episodes
cursor.execute("DELETE FROM llm_calls WHERE timestamp < ?",
               (time.time() - 7*24*3600,))  # Keep 1 week
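The snippet above assumes an open cursor and a time import. A self-contained version that also commits and reclaims disk space (function name is ours; the llm_calls table comes from this tutorial's schema):

```python
import sqlite3
import time

def prune_old_calls(db_path: str, max_age_days: int = 7) -> int:
    """Delete recorded calls older than max_age_days and shrink the file.
    Returns the number of rows removed."""
    cutoff = time.time() - max_age_days * 24 * 3600
    conn = sqlite3.connect(db_path)
    try:
        deleted = conn.execute(
            "DELETE FROM llm_calls WHERE timestamp < ?", (cutoff,)
        ).rowcount
        conn.commit()
        conn.execute("VACUUM")  # DELETE alone leaves the file size unchanged
        return deleted
    finally:
        conn.close()
```

Run it from a cron job or a scheduled task; without the VACUUM, SQLite keeps the freed pages and the file never shrinks.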
Context Window Reconstruction Fails
LLMs have context limits. Long conversations get truncated, breaking replay accuracy.
Solution: Store the actual context sent to the model, not just the messages:
# In your gateway configuration
STORE_FULL_CONTEXT = True
MAX_CONTEXT_TOKENS = 8192
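The point is to record both the full history and the slice the model actually received. A minimal sketch of that split (character count as a rough proxy for tokens; function name and budget are our assumptions):

```python
from typing import Dict, List, Tuple

def truncate_for_model(messages: List[Dict[str, str]],
                       max_chars: int = 32000) -> Tuple[List[Dict], List[Dict]]:
    """Keep the most recent messages that fit the budget and return
    (sent, full) so both can be recorded. Replaying from `sent` shows
    exactly what the model saw at this step."""
    sent: List[Dict[str, str]] = []
    used = 0
    for msg in reversed(messages):  # walk backwards from the newest message
        used += len(msg["content"])
        if sent and used > max_chars:
            break  # budget exhausted; always keep at least one message
        sent.insert(0, msg)
    return sent, messages

history = [{"role": "user", "content": "x" * 20000},
           {"role": "assistant", "content": "y" * 20000},
           {"role": "user", "content": "latest question"}]
sent, full = truncate_for_model(history)
print(len(sent), len(full))  # → 2 3
```

If you only store `full`, a replayed step can show context the model never saw, and the "why did it do that?" question becomes unanswerable.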
Session Boundary Detection Is Wrong
The replay UI groups unrelated calls into fake "episodes."
Solution: Add explicit session tracking:
# Add to your agent code
response = self.client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    metadata={
        "session_id": self.session_id,
        "operation": "sentiment_analysis",
        "step_index": self.current_step
    }
)
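The agent needs to mint that session_id somewhere. A minimal sketch of a tracker that starts a fresh ID per episode and counts steps (class and method names are ours, not an Airblackbox API):

```python
import time
import uuid

class SessionTracker:
    """Mint one session_id per episode and number its steps, so every
    recorded call can be grouped exactly instead of by time heuristics."""
    def __init__(self):
        self.start_episode()

    def start_episode(self) -> None:
        # Timestamp prefix keeps IDs sortable; uuid suffix keeps them unique
        self.session_id = f"ep-{int(time.time())}-{uuid.uuid4().hex[:8]}"
        self.current_step = 0

    def next_metadata(self, operation: str) -> dict:
        meta = {"session_id": self.session_id,
                "operation": operation,
                "step_index": self.current_step}
        self.current_step += 1
        return meta

tracker = SessionTracker()
m0 = tracker.next_metadata("sentiment_analysis")
m1 = tracker.next_metadata("response_generation")
```

Call start_episode at the top of process_issue_batch and pass tracker.next_metadata(...) as the metadata argument; the replay UI can then group by session_id instead of guessing.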
Performance Dies with Large Episodes
Rendering 500+ LLM calls in the browser locks up the UI.
Solution: Implement pagination and lazy loading:
// Load steps on demand and cache them so each step is fetched only once
const stepCache = {};

function showStep(stepIndex) {
    if (stepCache[stepIndex]) {
        displayStepContent(stepCache[stepIndex]);
        return;
    }
    fetch(`/api/episode/${episodeId}/step/${stepIndex}`)
        .then(response => response.json())
        .then(data => {
            stepCache[stepIndex] = data;
            displayStepContent(data);
        });
}
Measurement: How to Know It's Working
Test with a Controlled Failure
Create an agent that deliberately fails:
def test_episode_recording():
    agent = GitHubAgent(github_token="invalid_token")
    try:
        # This will fail, creating a clear episode
        results = agent.process_issue_batch("nonexistent/repo")
    except Exception as e:
        print(f"Expected failure: {e}")

    # Check the episode was recorded
    replay = EpisodeReplay("agent_episodes.db")
    episodes = replay.get_episodes()
    assert len(episodes) > 0, "No episodes recorded"

    # Verify step-by-step data
    calls = replay.get_episode_calls(0)
    assert len(calls) > 0, "No calls in episode"
    assert all('timestamp' in call for call in calls), "Missing timestamps"
Verify Context Reconstruction
The real test: can you debug a failure you've never seen before?
- Run your agent on a complex task
- Let it fail mysteriously
- Open the replay UI
- Step through until you find the exact moment it went wrong
- You should see the prompt, the response, and understand why
If you can't identify the failure point in under 5 minutes, your episode recording isn't detailed enough.
Check Storage Performance
Monitor your database size and query speed:
# Check episode database size
ls -lh agent_episodes.db
# Query performance test
time sqlite3 agent_episodes.db "SELECT COUNT(*) FROM llm_calls WHERE timestamp > $(date -d '1 day ago' +%s)"
If queries take longer than 100ms, add indexes:
CREATE INDEX idx_timestamp ON llm_calls(timestamp);
CREATE INDEX idx_metadata ON llm_calls(json_extract(metadata, '$.session_id'));
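You can confirm the index is actually used without timing anything, via EXPLAIN QUERY PLAN (schema assumed from this tutorial):

```python
import sqlite3

# Sanity check: SQLite should SEARCH via idx_timestamp for the range
# filter rather than SCAN the whole llm_calls table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE llm_calls (timestamp REAL, metadata TEXT)")
conn.execute("CREATE INDEX idx_timestamp ON llm_calls(timestamp)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM llm_calls WHERE timestamp > ?",
    (0,),
).fetchall()
# The detail text varies by SQLite version, but should mention idx_timestamp
print(plan[-1][3])
```

If the plan says SCAN instead of SEARCH after you add the index, the query is written in a way the planner can't use it (e.g. wrapping timestamp in a function call).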
Next Steps: Build Your Agent DVR
- Clone the complete demo: github.com/airblackbox/agent-dvr-tutorial
- Start with your existing agent: Change one line (the base URL) and start recording
- Add session tracking: Include session_id and operation in your metadata
- Build your replay UI: Use the Flask template above as a starting point
Your AI agents will still be probabilistic chaos machines. But now when they misbehave, you'll have a complete recording of their decision process. No more debugging by vibes — you'll see exactly where the wheels fell off.
The DVR doesn't prevent AI agent failures. It just makes them debuggable. Which, honestly, is most of the battle.
Try Airblackbox Gateway — your future debugging self will thank you.