Hunter G

My Claude Code Can INSTANTLY Watch Any Video (Here's How)

Most AI video summary tools are completely blind. When you give them a 45-minute tech talk, they only extract the transcript.

If the speaker points to a retention graph and says "This is where startups die," the AI has no idea what "this" is. It misses the charts, the UI bugs, and the code snippets. In a multi-modal era, summarizing without visual context is useless.

The Local Hacker Solution

Anthropic doesn't offer native video input yet, and Gemini 1.5 Pro is expensive and awkward to wire into a Claude-based workflow.

But a video is just two things: Frames (Images) + A Transcript (Text).

We can build an unstoppable pipeline using two battle-tested CLI tools:

  1. yt-dlp: Downloads the video stream and any available subtitles (official or auto-generated) from over 1,000 sites.
  2. ffmpeg: Silently extracts high-res frames every few seconds.

If a video lacks captions, we fall back to a hosted Whisper transcription API (Groq or OpenAI) to transcribe the audio for pennies.
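The two CLI steps above can be sketched as command builders. A minimal sketch: the yt-dlp and ffmpeg flags are real, but the output paths, the example URL, and the idea of returning argv lists (ready for `subprocess.run`) rather than shelling out directly are my choices, not the author's script.

```python
import shlex

def download_cmd(url: str, out_dir: str = "video") -> list[str]:
    """yt-dlp invocation: grab the video plus manual or auto-generated captions."""
    return [
        "yt-dlp",
        "--write-subs", "--write-auto-subs",  # official subs, else auto captions
        "--sub-langs", "en",
        "-f", "mp4",
        "-o", f"{out_dir}/source.%(ext)s",
        url,
    ]

def frames_cmd(video: str, interval_s: int, out_dir: str = "frames") -> list[str]:
    """ffmpeg invocation: one high-quality JPEG every `interval_s` seconds."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps=1/{interval_s}",  # sample one frame per interval
        "-q:v", "2",                   # near-lossless JPEG quality
        f"{out_dir}/frame_%04d.jpg",
    ]

# Preview the commands without running them (hypothetical URL):
print(shlex.join(download_cmd("https://example.com/talk")))
print(shlex.join(frames_cmd("video/source.mp4", 30)))
```

Pass either list to `subprocess.run(cmd, check=True)` once yt-dlp and ffmpeg are on your PATH.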

How it works

The script extracts roughly 100 keyframes from the video, dynamically scaling the sampling interval so the frame count never blows up your token window. It pairs these frames with the timestamped transcript and feeds it all into Claude.
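The interval scaling might look like this. A sketch under stated assumptions: the 100-frame cap comes from the article, but the 5-second minimum interval is my own guard so short clips don't get oversampled.

```python
import math

MAX_FRAMES = 100  # keep the image count inside Claude's context window

def frame_interval(duration_s: float, max_frames: int = MAX_FRAMES) -> int:
    """Pick a sampling interval (seconds) so a video of any length yields
    at most `max_frames` frames, with a 5s floor for short clips."""
    return max(5, math.ceil(duration_s / max_frames))

print(frame_interval(2700))  # 45-min talk: one frame every 27s -> 100 frames
print(frame_interval(120))   # 2-min clip: floor kicks in -> every 5s, 24 frames
```

Feed the result straight into the ffmpeg `fps=1/{interval}` filter.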

Within 2 minutes, Claude has "watched" the entire video. The total token cost for a 45-minute video? About $1.
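Pairing the frames with the transcript means interleaving image and text blocks in the shape the Anthropic Messages API expects. A minimal sketch: the content-block structure matches the documented API, but the directory layout, prompt wording, and function name are illustrative assumptions.

```python
import base64
from pathlib import Path

def build_content(frame_dir: str, transcript: str) -> list[dict]:
    """Interleave base64 JPEG frames and the timestamped transcript into
    the content-block list for one Anthropic Messages API user turn."""
    blocks = []
    for frame in sorted(Path(frame_dir).glob("frame_*.jpg")):
        blocks.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": base64.b64encode(frame.read_bytes()).decode(),
            },
        })
    blocks.append({
        "type": "text",
        "text": (f"Timestamped transcript:\n{transcript}\n\n"
                 "Summarize the video, referencing what is on screen."),
    })
    return blocks
```

Send it as `messages=[{"role": "user", "content": build_content("frames", transcript)}]` via `client.messages.create(...)`.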

3 Killer Use Cases

  1. Content Research: Drop a competitor's viral video and ask Claude to analyze the visual hook and script simultaneously.
  2. UI Debugging: Feed a 30s screen recording of a frontend crash and ask Claude to pinpoint the exact frame where the z-index bug appears.
  3. Automating the Second Brain: Run this over industry podcasts and push structured, charted notes directly into your Obsidian vault.

Stop paying for expensive AI wrappers. Wire up your CLI and let your LLM grow eyes.
