
HUANGCHIHHUNG


Your LLM isn't watching the video. It's reading subtitles.

Paste a YouTube link into ChatGPT and ask "what's this video about?" — you'll get an answer. But here's the thing: it read the transcript. The slides, the live demo, the thing the presenter actually showed on screen? All thrown away.

I found this out the hard way, and it bugged me enough to build a tool for it. Last week it hit the front page of Hacker News, and it has just passed 500 GitHub stars, so I figured I'd write down how it works.

The state of "AI watching video" today

  • Claude won't accept a video file at all.
  • ChatGPT takes a YouTube link, reads the subtitles, and answers from those.
  • Gemini genuinely reads video — but it samples at a fixed interval (1 fps by default), so fast cuts slip between samples while a 10-minute static slide burns 600 near-identical frames. And your footage goes to the cloud.
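
To make the fixed-interval problem concrete, here is a toy Python sketch. It is not how Gemini works internally; it only models "grab one frame every 1/fps seconds". The shot list is made up for illustration, and `sample_times` and `missed_shots` are hypothetical helper names.

```python
import math

def sample_times(duration_s: float, fps: float = 1.0) -> list[float]:
    """Timestamps (in seconds) a fixed-interval sampler grabs: 0, 1/fps, 2/fps, ..."""
    count = math.ceil(duration_s * fps)
    return [i / fps for i in range(count)]

def missed_shots(shots: list[tuple[float, float]], fps: float = 1.0) -> list[tuple[float, float]]:
    """Return the shots (start, end) that contain no sample; end is exclusive."""
    missed = []
    for start, end in shots:
        # First sample timestamp at or after the start of this shot.
        first_sample = math.ceil(start * fps) / fps
        if first_sample >= end:
            missed.append((start, end))
    return missed

# A 10-minute static slide still costs one frame per second.
print(len(sample_times(600)))  # 600

# Hypothetical rapid edit: two short shots fall between the 1 s and 2 s samples.
shots = [(0.0, 1.2), (1.2, 1.6), (1.6, 1.9), (1.9, 3.0)]
print(missed_shots(shots))  # [(1.2, 1.6), (1.6, 1.9)]
```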

For talks, tutorials, and demos — where most of the value is on screen, not in the audio — none of these actually work.

What I built instead

claude-real-video takes a URL or a local file and produces a folder any LLM can read:

pip install claude-real-video
crv "https://www.youtube.com/watch?v=..." --grid
# → crv-out/frames/  +  transcript.txt  +  MANIFEST.txt  +  grids/

Three ideas, all boring on purpose:

  1. Grab a frame only when the picture actually changes. Scene-change detection instead of a fixed sampling interval — a 10-minute static slide collapses to one frame, a rapid-fire edit keeps every cut.
  2. Drop what the model already saw. A sliding-window dedup compares each new frame against the last few kept ones, so an A-B-A cutaway doesn't send shot A twice.
  3. Tell the model what it's looking at. One MANIFEST.txt lists every frame with its timestamp, aligned with the Whisper transcript.
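
The first two ideas can be sketched in a few lines of Python. This is a simplified illustration, not the code in the repo: frames are stand-in lists of grayscale pixel values, the difference metric is a plain mean absolute difference, and the thresholds and window size are assumed values picked for the example.

```python
from collections import deque

Frame = list[int]  # flattened grayscale pixels, 0-255 (toy stand-in for a decoded frame)

def frame_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference, normalised to the range 0..1."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (255 * len(a))

def select_keyframes(frames: list[Frame],
                     scene_threshold: float = 0.10,
                     dedup_threshold: float = 0.05,
                     window: int = 3) -> list[int]:
    """Return the indices of the frames worth keeping."""
    kept: list[int] = []
    recent: deque[Frame] = deque(maxlen=window)  # the last few kept frames
    previous = None
    for i, frame in enumerate(frames):
        # Idea 1: only consider a frame when the picture changed from the one before.
        changed = previous is None or frame_diff(previous, frame) > scene_threshold
        previous = frame
        if not changed:
            continue
        # Idea 2: drop it if it is near-identical to a recently kept frame (A-B-A cutaway).
        if any(frame_diff(frame, old) <= dedup_threshold for old in recent):
            continue
        kept.append(i)
        recent.append(frame)
    return kept

# Shot A (3 frames), shot B (2 frames), back to A (2 frames), then shot C.
A, B, C = [0] * 4, [255] * 4, [128] * 4
print(select_keyframes([A, A, A, B, B, A, A, C]))  # [0, 3, 7]
```

The return to shot A at index 5 is detected as a change but dropped by the window check, which is the A-B-A case from idea 2.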

Real numbers from a 58-second clip: fixed 1 fps sampling gives you 58 frames; this keeps the 26 that actually differ.
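
For the third idea, a manifest boils down to pairing each kept frame's timestamp with whatever was being said at that moment. The sketch below is one way to do that in Python. The line layout, the file names, and the `build_manifest` helper are all assumptions for illustration; the real MANIFEST.txt format may differ.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as HH:MM:SS (fractions are truncated)."""
    s = int(seconds)
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

def build_manifest(frames: list[tuple[str, float]],
                   segments: list[tuple[float, float, str]]) -> str:
    """frames: (filename, timestamp_s); segments: (start_s, end_s, text)."""
    lines = []
    for name, ts in frames:
        # Find the transcript segment that covers this frame's timestamp, if any.
        spoken = next((text for start, end, text in segments if start <= ts < end), "")
        lines.append(f"{fmt_ts(ts)}  {name}  {spoken}".rstrip())
    return "\n".join(lines)

# Hypothetical frames and transcript segments.
frames = [("frame_0001.jpg", 0.0), ("frame_0002.jpg", 65.5)]
segments = [(0.0, 4.0, "Welcome."), (60.0, 70.0, "Here is the demo.")]
print(build_manifest(frames, segments))
```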

"Keyframes are not video"

Fairest criticism I got on HN. A stack of stills loses motion and order. v0.4.0's answer is --grid: it packs consecutive keyframes into 3x3 contact sheets, so the model reads a chronological sequence instead of scattered images, and you send up to 9x fewer images while you're at it.
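
The packing step itself is simple chunking. Here is a minimal Python sketch of the layout logic only (no image compositing), assuming frames are laid out left to right, top to bottom; `pack_into_grids` and `cell_position` are hypothetical names, not the tool's API.

```python
def pack_into_grids(frames: list[str], rows: int = 3, cols: int = 3) -> list[list[str]]:
    """Split a chronological frame list into sheets of rows * cols frames each."""
    per_sheet = rows * cols
    return [frames[i:i + per_sheet] for i in range(0, len(frames), per_sheet)]

def cell_position(index_in_sheet: int, cols: int = 3) -> tuple[int, int]:
    """(row, col) of a frame on its sheet, reading left to right, top to bottom."""
    return divmod(index_in_sheet, cols)

# The 26 keyframes from the 58-second clip, with made-up file names.
frames = [f"frame_{n:04d}.jpg" for n in range(1, 27)]
sheets = pack_into_grids(frames)
print([len(sheet) for sheet in sheets])  # [9, 9, 8]
print(cell_position(4))                  # (1, 1), the centre cell
```

With 26 keyframes the last sheet is only partly filled, so you send 3 images instead of 26: a little under the 9x best case.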

It still won't recover true motion or object permanence — I'd rather say that plainly than oversell it. (I'm exploring measured motion data — camera moves, cut rhythm — as a paid add-on called crv Pro, but the free tool stands on its own.)

Everything runs locally

ffmpeg + faster-whisper on your machine. Nothing is uploaded by the tool — what reaches an LLM is only what you choose to paste into one afterwards. MIT licensed.
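
If you want a feel for what a local pipeline like this involves, here is a rough Python sketch using ffmpeg's built-in scene score and the faster-whisper API. It is not the tool's actual invocation: the 0.3 threshold, the output file pattern, the model size, and both function names are assumptions.

```python
from pathlib import Path

def scene_frames_cmd(video: str, out_dir: str, threshold: float = 0.3) -> list[str]:
    """Build an ffmpeg command that writes one JPEG per detected scene change."""
    pattern = str(Path(out_dir) / "frame_%04d.jpg")  # hypothetical naming scheme
    return [
        "ffmpeg", "-i", video,
        # Keep only frames whose scene-change score exceeds the threshold.
        "-vf", f"select='gt(scene,{threshold})'",
        # Variable frame rate output; ffmpeg older than 5.1 spells this "-vsync vfr".
        "-fps_mode", "vfr",
        pattern,
    ]

def transcribe(audio_path: str, model_size: str = "small") -> list[tuple[float, float, str]]:
    """Run Whisper locally and return (start_s, end_s, text) segments."""
    from faster_whisper import WhisperModel  # lazy import: only needed when called
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]

# Print the command instead of running it; pass the list to subprocess.run to execute.
print(" ".join(scene_frames_cmd("talk.mp4", "crv-out/frames")))
```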

If you use Claude Code, there's a ready-made skill in the repo — drop it into ~/.claude/skills and Claude will run the whole pipeline itself when you paste a video link.

GitHub: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

I'm Leo — a liberal-arts founder running a one-person company with an AI team. Happy to answer anything about the approach.
