DEV Community

agBythos

How I Taught My AI Agent to Watch YouTube Videos

My AI agent runs on Claude Opus. It can read documents, write code, browse the web — but it can't watch a video. Hand it a YouTube link and it just… stares at it. No eyes, no ears, no temporal understanding.

I needed it to analyze a 78-minute Daniel Kahneman podcast. Not a summary from someone's blog — the actual content, with visual context. So I built a pipeline to make that happen.

The Problem: LLMs Are Blind (and Deaf) to Video

This sounds obvious, but the implications are subtle. A video isn't just "text that happens to be spoken." It's slides, facial expressions, diagrams drawn on whiteboards, screen shares, b-roll. If you only feed the transcript, you lose half the signal.

Claude can process images. It can process text. It just can't process time. So the job is: decompose a video into a structured sequence of (image, text) pairs that preserve temporal relationships. Make the video readable.
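Concretely, the target representation can be sketched as a list of time-stamped blocks. A minimal sketch — the `ContextBlock` name is illustrative, not from any library:

```python
from typing import TypedDict

class ContextBlock(TypedDict):
    """One time-synced unit of the decomposed video."""
    time_range: str    # e.g. "0s-30s"
    frames: list[str]  # paths to keyframes captured in this window
    transcript: str    # subtitle text overlapping this window

# A "readable" video is just an ordered list of these blocks.
video_as_text: list[ContextBlock] = [
    {"time_range": "0s-30s", "frames": ["frame_0001.jpg"],
     "transcript": "Welcome to the show..."},
]
```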

The Architecture: Four-Stage Pipeline

I asked Gemini for architectural advice (yes, I use competing models as consultants — no loyalty in engineering). It suggested a four-stage approach:

  1. Download — grab the video and subtitle tracks
  2. Subtitles — parse VTT into timestamped text segments
  3. Scene detection — extract keyframes at visual transition points
  4. Temporal alignment — merge frames and text into time-synced blocks

This felt right. Each stage is independently testable, and failures are isolated.

Implementation

Stage 1: Download

Nothing fancy. yt-dlp handles this reliably:

import yt_dlp
from pathlib import Path

def download_video(url: str, output_dir: Path) -> tuple[Path, Path | None]:
    """Download video + subtitles. Returns (video_path, vtt_path)."""
    ydl_opts = {
        'format': 'bestvideo[height<=720]+bestaudio/best[height<=720]',
        'writesubtitles': True,
        'writeautomaticsub': True,  # fall back to auto-generated captions
        'subtitleslangs': ['en'],
        'subtitlesformat': 'vtt',
        'outtmpl': str(output_dir / '%(id)s.%(ext)s'),
        'merge_output_format': 'mp4',
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=True)
    video_path = output_dir / f"{info['id']}.mp4"
    vtt_path = output_dir / f"{info['id']}.en.vtt"
    return video_path, vtt_path if vtt_path.exists() else None

I cap at 720p. You don't need 4K for keyframe extraction — it just burns disk and processing time.

Stage 2: VTT Parsing

YouTube's auto-generated VTT files are messy. Duplicate lines, overlapping timestamps, filler text. The parser needs to clean aggressively:

import re
from pathlib import Path

import webvtt

def parse_vtt(vtt_path: Path) -> list[dict]:
    """Parse VTT into clean segments: [{start, end, text}, ...]"""
    segments = []
    for caption in webvtt.read(str(vtt_path)):
        text = caption.text.strip()
        text = re.sub(r'<[^>]+>', '', text)  # strip inline styling tags
        text = re.sub(r'\s+', ' ', text)     # collapse whitespace/newlines
        if not text or text in [s['text'] for s in segments[-3:]]:
            continue  # skip empty and recently repeated lines
        segments.append({
            'start': timestamp_to_seconds(caption.start),
            'end': timestamp_to_seconds(caption.end),
            'text': text
        })
    return segments

The dedup check against the last 3 segments catches YouTube's habit of repeating lines across overlapping cue windows.
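The `timestamp_to_seconds` helper used in the parser isn't shown in the post; a minimal version, assuming standard WebVTT `HH:MM:SS.mmm` stamps (the hours field is optional in the spec):

```python
def timestamp_to_seconds(ts: str) -> float:
    """Convert a WebVTT timestamp ('HH:MM:SS.mmm' or 'MM:SS.mmm') to seconds."""
    parts = ts.split(':')
    if len(parts) == 2:       # hours omitted
        parts = ['0'] + parts
    h, m, s = parts
    return int(h) * 3600 + int(m) * 60 + float(s)
```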

Stage 3: Scene Detection via FFmpeg

This is where it gets interesting. Instead of extracting frames at fixed intervals (every N seconds), I use FFmpeg's scene detection filter. It triggers on visual change — a new slide, a camera cut, a graph appearing:

import re
import subprocess
from pathlib import Path

def extract_keyframes(video_path: Path, output_dir: Path,
                      threshold: float = 0.3) -> list[dict]:
    """Extract frames at scene changes. Returns [{timestamp, path}, ...]"""
    cmd = [
        'ffmpeg', '-hide_banner', '-y',  # global options go before -i
        '-i', str(video_path),
        '-vf', f'select=gt(scene\\,{threshold}),showinfo',
        '-vsync', 'vfr',
        str(output_dir / 'frame_%04d.jpg'),
    ]
    # showinfo logs each selected frame's pts_time to stderr
    result = subprocess.run(cmd, capture_output=True, text=True)

    frames = []
    for match in re.finditer(r'pts_time:(\d+\.?\d*)', result.stderr):
        frames.append({
            'timestamp': float(match.group(1)),
            'path': output_dir / f'frame_{len(frames) + 1:04d}.jpg'
        })
    return frames

The threshold parameter (0.0–1.0) controls sensitivity. More on that later.

Stage 4: Temporal Alignment

Now the glue. I merge frames and subtitle segments into 30-second blocks. Each block contains the keyframes that appeared during that window and the concatenated subtitle text:

def build_context_blocks(frames: list[dict], subtitles: list[dict],
                         block_duration: int = 30) -> list[dict]:
    """Merge frames + subtitles into time-aligned blocks."""
    total_duration = max(
        max((f['timestamp'] for f in frames), default=0),
        max((s['end'] for s in subtitles), default=0)
    )
    blocks = []
    for block_start in range(0, int(total_duration) + 1, block_duration):
        block_end = block_start + block_duration
        block_frames = [f for f in frames
                        if block_start <= f['timestamp'] < block_end]
        block_text = ' '.join(
            s['text'] for s in subtitles
            if s['start'] < block_end and s['end'] > block_start
        )
        if block_frames or block_text.strip():
            blocks.append({
                'time_range': f"{block_start}s–{block_end}s",
                'frames': [f['path'] for f in block_frames],
                'transcript': block_text.strip()
            })
    return blocks

30 seconds is a sweet spot. Short enough to preserve locality, long enough to avoid fragmenting sentences.
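Once the blocks exist, they have to reach the model somehow. A sketch of serializing them into interleaved text and image content blocks, assuming the Anthropic Messages API's base64 image format (the function name is mine):

```python
import base64

def blocks_to_message_content(blocks: list[dict]) -> list[dict]:
    """Interleave transcript text and keyframe images in timeline order,
    using the Messages API content-block format."""
    content = []
    for block in blocks:
        content.append({
            "type": "text",
            "text": f"[{block['time_range']}] {block['transcript']}"
        })
        for frame_path in block['frames']:
            with open(frame_path, 'rb') as f:
                data = base64.standard_b64encode(f.read()).decode()
            content.append({
                "type": "image",
                "source": {"type": "base64",
                           "media_type": "image/jpeg",
                           "data": data},
            })
    return content
```

Keeping images next to the text spoken in the same window is what lets the model say "the diagram shown at 34:20" and mean it.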

Results

I pointed this at a 78-minute Kahneman podcast interview. The pipeline produced:

  • 20 keyframes (scene changes: new interview angles, title cards, audience shots)
  • 156 subtitle segments merged into 156 30-second blocks (many with overlapping text)
  • Total context size: ~45K tokens (text) + 20 images

That fits comfortably in Claude's 200K context window. I fed it in and asked for a structured analysis of Kahneman's key arguments. The result was dramatically better than transcript-only analysis — Claude could reference "the diagram shown at 34:20" and correctly describe it.

Hard-Won Heuristics

Scene threshold selection. 0.3 works for talking-head podcasts and interviews. For slide-heavy presentations, drop to 0.2 (more frames, catches subtle slide transitions). For music videos or fast-cut content, raise to 0.4 or you'll drown in frames. I start at 0.3 and adjust if the frame count is unreasonable (< 5 or > 100 for a 1-hour video).

Redundant frame removal. Scene detection sometimes fires on lighting changes or minor camera wobble. I added a post-filter that compares consecutive frames using perceptual hashing (imagehash library) and drops near-duplicates:

def deduplicate_frames(frames: list[dict], hash_threshold: int = 5) -> list[dict]:
    """Remove visually similar consecutive frames."""
    from PIL import Image
    import imagehash
    if not frames:
        return []
    kept = [frames[0]]
    prev_hash = imagehash.phash(Image.open(frames[0]['path']))
    for f in frames[1:]:
        curr_hash = imagehash.phash(Image.open(f['path']))
        if abs(curr_hash - prev_hash) > hash_threshold:
            kept.append(f)
            prev_hash = curr_hash
    return kept

Context window budgeting. Rule of thumb: each 720p JPEG keyframe ≈ 1,200 tokens (Claude's image tokenization). 20 frames = ~24K image tokens. Subtitle text for a 1-hour video ≈ 30–50K tokens. Total budget: ~75K tokens, well within 200K. If you're processing 3+ hour content, you'll need to either increase the scene threshold or implement a "most important frames" selector.
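That rule of thumb is easy to encode as a pre-flight check. A sketch using the article's rough estimates (the ~1,200 tokens/image and ~1.3 tokens/word constants are ballpark figures, not official numbers):

```python
IMAGE_TOKENS = 1200    # rough cost of one 720p JPEG keyframe
TOKENS_PER_WORD = 1.3  # rough English tokenization ratio

def estimate_context_tokens(blocks: list[dict]) -> int:
    """Estimate the total token cost of the assembled context blocks."""
    image_tokens = sum(len(b['frames']) for b in blocks) * IMAGE_TOKENS
    words = sum(len(b['transcript'].split()) for b in blocks)
    return image_tokens + int(words * TOKENS_PER_WORD)
```

Running this before the API call lets you decide up front whether to raise the scene threshold or thin out frames, instead of discovering an overflow mid-request.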

What's Next

This works. But there are gaps:

  • Whisper fallback. YouTube auto-captions fail for non-English content, poor audio quality, or DRM-restricted videos. Adding local Whisper transcription as a fallback is the obvious next step. The pipeline already expects (timestamp, text) tuples — Whisper slots right in.

  • Batch processing. Right now it's one video at a time. For playlist analysis (conference talks, lecture series), I need queue management and incremental context building.

  • Cost optimization. 20 images × $0.024 per image (Claude's pricing) = $0.48 per video just for vision. For batch analysis, switching to a frame description step (describe each image as text first, then feed text-only to the main analysis) could cut costs 10×.

  • Smarter block sizing. Fixed 30-second windows are crude. Ideally, blocks should align with topic boundaries detected from the transcript. A lightweight topic segmentation model could handle this.

The core insight is simple: videos are just interleaved streams of images and text, arranged in time. Decompose them that way, and any multimodal LLM can "watch" them. The engineering is in making the decomposition smart enough to preserve signal without blowing your context budget.


Built with yt-dlp, FFmpeg, webvtt-py, and too much trial and error with scene detection thresholds.
