How I Taught My AI Agent to Watch YouTube Videos
My AI agent runs on Claude Opus. It can read documents, write code, browse the web — but it can't watch a video. Hand it a YouTube link and it just… stares at it. No eyes, no ears, no temporal understanding.
I needed it to analyze a 78-minute Daniel Kahneman podcast. Not a summary from someone's blog — the actual content, with visual context. So I built a pipeline to make that happen.
The Problem: LLMs Are Blind (and Deaf) to Video
This sounds obvious, but the implications are subtle. A video isn't just "text that happens to be spoken." It's slides, facial expressions, diagrams drawn on whiteboards, screen shares, b-roll. If you only feed the transcript, you lose half the signal.
Claude can process images. It can process text. It just can't process time. So the job is: decompose a video into a structured sequence of (image, text) pairs that preserve temporal relationships. Make the video readable.
The Architecture: Four-Stage Pipeline
I asked Gemini for architectural advice (yes, I use competing models as consultants — no loyalty in engineering). It suggested a four-stage approach:
- Download — grab the video and subtitle tracks
- Subtitles — parse VTT into timestamped text segments
- Scene detection — extract keyframes at visual transition points
- Temporal alignment — merge frames and text into time-synced blocks
This felt right. Each stage is independently testable, and failures are isolated.
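Stitched together, the stages form a linear pipeline. A minimal orchestration sketch, with the stage functions passed in as callables so each can be stubbed out in tests (the parameter names here are illustrative, not part of any library):

```python
from pathlib import Path

def run_pipeline(url: str, output_dir: Path,
                 download, parse_subs, detect_scenes, align) -> list[dict]:
    """Chain the four stages; a failure in any stage surfaces immediately."""
    video_path, vtt_path = download(url, output_dir)       # Stage 1: download
    subtitles = parse_subs(vtt_path) if vtt_path else []   # Stage 2: subtitles
    frames = detect_scenes(video_path, output_dir)         # Stage 3: scene detection
    return align(frames, subtitles)                        # Stage 4: temporal alignment
```

Because the stages only communicate through plain data (paths, lists of dicts), any one of them can be swapped without touching the others.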
Implementation
Stage 1: Download
Nothing fancy. yt-dlp handles this reliably:
```python
def download_video(url: str, output_dir: Path) -> tuple[Path, Path | None]:
    """Download video + subtitles. Returns (video_path, vtt_path)."""
    ydl_opts = {
        'format': 'bestvideo[height<=720]+bestaudio/best[height<=720]',
        'writesubtitles': True,
        'writeautomaticsub': True,
        'subtitleslangs': ['en'],
        'subtitlesformat': 'vtt',
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'merge_output_format': 'mp4',
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
    video_path = output_dir / f"{info['id']}.mp4"
    vtt_path = output_dir / f"{info['id']}.en.vtt"
    return video_path, vtt_path if vtt_path.exists() else None
```
I cap at 720p. You don't need 4K for keyframe extraction — it just burns disk and processing time.
Stage 2: VTT Parsing
YouTube's auto-generated VTT files are messy. Duplicate lines, overlapping timestamps, filler text. The parser needs to clean aggressively:
```python
def parse_vtt(vtt_path: Path) -> list[dict]:
    """Parse VTT into clean segments: [{start, end, text}, ...]"""
    segments = []
    for caption in webvtt.read(str(vtt_path)):
        text = caption.text.strip()
        text = re.sub(r'<[^>]+>', '', text)  # strip tags
        text = re.sub(r'\s+', ' ', text)
        if not text or text in [s['text'] for s in segments[-3:]]:
            continue  # skip empty/duplicate
        segments.append({
            'start': timestamp_to_seconds(caption.start),
            'end': timestamp_to_seconds(caption.end),
            'text': text,
        })
    return segments
```
The dedup check against the last 3 segments catches YouTube's habit of repeating lines across overlapping cue windows.
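The `timestamp_to_seconds` helper used above isn't shown; a straightforward version, assuming YouTube's `HH:MM:SS.mmm` cue format with the hours field sometimes omitted:

```python
def timestamp_to_seconds(ts: str) -> float:
    """Convert 'HH:MM:SS.mmm' (hours optional) to seconds as a float."""
    parts = ts.split(':')
    seconds = float(parts[-1])           # SS.mmm
    seconds += int(parts[-2]) * 60       # MM
    if len(parts) == 3:
        seconds += int(parts[0]) * 3600  # HH
    return seconds
```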
Stage 3: Scene Detection via FFmpeg
This is where it gets interesting. Instead of extracting frames at fixed intervals (every N seconds), I use FFmpeg's scene detection filter. It triggers on visual change — a new slide, a camera cut, a graph appearing:
```python
def extract_keyframes(video_path: Path, output_dir: Path,
                      threshold: float = 0.3) -> list[dict]:
    """Extract frames at scene changes. Returns [{timestamp, path}, ...]"""
    cmd = [
        'ffmpeg', '-hide_banner', '-i', str(video_path),
        '-vf', f'select=gt(scene\\,{threshold}),showinfo',
        '-vsync', 'vfr',
        str(output_dir / 'frame_%04d.jpg'),
    ]
    # showinfo logs each selected frame's pts_time to stderr
    result = subprocess.run(cmd, capture_output=True, text=True)
    frames = []
    for match in re.finditer(r'pts_time:(\d+\.?\d*)', result.stderr):
        frames.append({
            'timestamp': float(match.group(1)),
            'path': output_dir / f'frame_{len(frames) + 1:04d}.jpg',
        })
    return frames
```

(Note that `-hide_banner` has to come before the inputs and outputs; trailing it after the output file makes ffmpeg complain about a dangling option.)
The threshold parameter (0.0–1.0) controls sensitivity. More on that later.
Stage 4: Temporal Alignment
Now the glue. I merge frames and subtitle segments into 30-second blocks. Each block contains the keyframes that appeared during that window and the concatenated subtitle text:
```python
def build_context_blocks(frames: list[dict], subtitles: list[dict],
                         block_duration: int = 30) -> list[dict]:
    """Merge frames + subtitles into time-aligned blocks."""
    total_duration = max(
        max((f['timestamp'] for f in frames), default=0),
        max((s['end'] for s in subtitles), default=0),
    )
    blocks = []
    for block_start in range(0, int(total_duration) + 1, block_duration):
        block_end = block_start + block_duration
        block_frames = [f for f in frames
                        if block_start <= f['timestamp'] < block_end]
        block_text = ' '.join(
            s['text'] for s in subtitles
            if s['start'] < block_end and s['end'] > block_start
        )
        if block_frames or block_text.strip():
            blocks.append({
                'time_range': f"{block_start}s–{block_end}s",
                'frames': [f['path'] for f in block_frames],
                'transcript': block_text.strip(),
            })
    return blocks
```
30 seconds is a sweet spot. Short enough to preserve locality, long enough to avoid fragmenting sentences.
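To hand the blocks to the model, each one gets rendered as a timestamped section, with the keyframes riding along as image attachments. A sketch of the text side (the exact message structure depends on your client library; this only produces the interleaved prompt text, with a placeholder marking where each image goes):

```python
def blocks_to_prompt(blocks: list[dict]) -> str:
    """Render time-aligned blocks as prompt text, marking keyframe positions."""
    sections = []
    for block in blocks:
        lines = [f"[{block['time_range']}]"]
        for path in block['frames']:
            lines.append(f"(keyframe attached: {path})")
        if block['transcript']:
            lines.append(block['transcript'])
        sections.append('\n'.join(lines))
    return '\n\n'.join(sections)
```

Keeping the `[Xs–Ys]` markers in the text is what lets the model later say things like "the diagram shown at 34:20".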
Results
I pointed this at a 78-minute Kahneman podcast interview. The pipeline produced:
- 20 keyframes (scene changes: new interview angles, title cards, audience shots)
- 156 subtitle segments merged into 156 30-second blocks (cues that straddle a block boundary contribute text to both neighbors)
- Total context size: ~45K tokens (text) + 20 images
That fits comfortably in Claude's 200K context window. I fed it in and asked for a structured analysis of Kahneman's key arguments. The result was dramatically better than transcript-only analysis — Claude could reference "the diagram shown at 34:20" and correctly describe it.
Hard-Won Heuristics
Scene threshold selection. 0.3 works for talking-head podcasts and interviews. For slide-heavy presentations, drop to 0.2 (more frames, catches subtle slide transitions). For music videos or fast-cut content, raise to 0.4 or you'll drown in frames. I start at 0.3 and adjust if the frame count is unreasonable (< 5 or > 100 for a 1-hour video).
Redundant frame removal. Scene detection sometimes fires on lighting changes or minor camera wobble. I added a post-filter that compares consecutive frames using perceptual hashing (imagehash library) and drops near-duplicates:
```python
from PIL import Image
import imagehash

def deduplicate_frames(frames: list[dict], hash_threshold: int = 5) -> list[dict]:
    """Remove visually similar consecutive frames via perceptual hashing."""
    if not frames:
        return []
    kept = [frames[0]]
    prev_hash = imagehash.phash(Image.open(frames[0]['path']))
    for f in frames[1:]:
        curr_hash = imagehash.phash(Image.open(f['path']))
        # compare against the last *kept* frame, so a slow drift
        # across several near-duplicates still eventually registers
        if abs(curr_hash - prev_hash) > hash_threshold:
            kept.append(f)
            prev_hash = curr_hash
    return kept
```
Context window budgeting. Rule of thumb: each 720p JPEG keyframe ≈ 1,200 tokens (Claude's image tokenization). 20 frames = ~24K image tokens. Subtitle text for a 1-hour video ≈ 30–50K tokens. Total budget: ~75K tokens, well within 200K. If you're processing 3+ hour content, you'll need to either increase the scene threshold or implement a "most important frames" selector.
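Those rules of thumb fit in a small estimator. A sketch (the per-image figure is the article's approximation, and the four-characters-per-token ratio is a common rough heuristic, not an exact tokenizer measurement):

```python
def estimate_tokens(num_frames: int, transcript_chars: int,
                    tokens_per_image: int = 1200,
                    chars_per_token: float = 4.0) -> dict:
    """Rough context-budget estimate for a frames + transcript payload."""
    image_tokens = num_frames * tokens_per_image
    text_tokens = int(transcript_chars / chars_per_token)
    return {
        'image_tokens': image_tokens,
        'text_tokens': text_tokens,
        'total': image_tokens + text_tokens,
        'fits_200k': image_tokens + text_tokens < 200_000,
    }
```

Running it before the API call lets you bail out early (or raise the scene threshold) instead of discovering mid-request that the payload doesn't fit.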
What's Next
This works. But there are gaps:
Whisper fallback. YouTube auto-captions fail for non-English content, poor audio quality, or DRM-restricted videos. Adding local Whisper transcription as a fallback is the obvious next step. The pipeline already expects (timestamp, text) tuples, so Whisper slots right in.
Batch processing. Right now it's one video at a time. For playlist analysis (conference talks, lecture series), I need queue management and incremental context building.
Cost optimization. 20 images × $0.024 per image (Claude's pricing) = $0.48 per video just for vision. For batch analysis, switching to a frame description step (describe each image as text first, then feed text-only to the main analysis) could cut costs 10×.
Smarter block sizing. Fixed 30-second windows are crude. Ideally, blocks should align with topic boundaries detected from the transcript. A lightweight topic segmentation model could handle this.
The core insight is simple: videos are just interleaved streams of images and text, arranged in time. Decompose them that way, and any multimodal LLM can "watch" them. The engineering is in making the decomposition smart enough to preserve signal without blowing your context budget.
Built with yt-dlp, FFmpeg, webvtt-py, and too much trial and error with scene detection thresholds.