DEV Community: Leo Huang

1,377 frames in, 60 out, and none of them knew what time it was

Leo Huang — Sat, 25 Jul 2026 19:12:34 +0000

A user opened an issue on my open-source video tool last week that named a gap I had been shipping around for months.

He runs lectures through crv so an LLM can read them. One 22-minute lecture: 1,377 candidate frames extracted, 60 kept after dedup and --max-frames thinning. The 60 frames come out in the right order. That is all they come out with.

His complaint, in one line: the LLM can describe the slide, but it cannot tell you when the slide was on screen.

That breaks more than it sounds like:

you cannot cite visual evidence with a timestamp
you cannot line a chart up against the nearby transcript.json segments
you cannot jump from a keyframe back to that moment in the video
you cannot verify the claim afterwards

The transcript had timestamps the whole time. The frames did not.

Why the timestamps died

The pipeline goes: extract with ffmpeg, drop near-identical frames, thin down to --max-frames, rename everything to frame_001.jpg, frame_002.jpg.

Every one of those steps is lossy for position. Extraction writes files, dedup deletes some, thinning deletes more, renaming closes the gaps. By the time you are holding frame_012.jpg, the only fact left in the filename is "twelfth surviving frame", and twelfth of what is no longer recoverable from the output directory.

The tempting fix is arithmetic: timestamp = frame_number / fps. That is wrong on any variable frame rate source, which covers most screen recordings and a lot of phone video. It gives you a number that looks right and drifts.
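To make the drift concrete, here is a small sketch with made-up numbers. The six timestamps and the 30 fps nominal rate are hypothetical, chosen to mimic a screen recorder that only emits a frame when the picture changes.

```python
# Hypothetical variable-frame-rate source: real presentation timestamps
# (seconds) for six consecutive frames. These values are invented.
real_pts = [0.000, 0.033, 0.067, 0.900, 0.933, 2.500]
nominal_fps = 30.0  # assumed rate claimed by the container header

# Error of the "frame_number / fps" estimate against the real timestamp.
errors = [pts - index / nominal_fps for index, pts in enumerate(real_pts)]

for index, (pts, err) in enumerate(zip(real_pts, errors)):
    naive = index / nominal_fps
    print(f"frame {index}: real {pts:.3f}s  naive {naive:.3f}s  error {err:+.3f}s")
```

The first frames agree almost exactly, which is why the arithmetic looks right in a quick check. The error only shows up after the first long gap, and it never recovers.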

What actually works

ffmpeg already knows. The showinfo filter prints the real PTS of every frame it passes, on the same select pass you are already running:

-vf "select=...,showinfo"

Parse that log and you get true presentation timestamps with no second decode pass. Then you carry them: attach the PTS at extraction, keep it attached through dedup, through thinning, through the rename, and write it out next to the images as frames.json:

{
  "frames": [
    {
      "file": "frame_001.jpg",
      "timestamp_sec": 18.42,
      "timestamp": "00:00:18.420",
      "selection_reason": "scene"
    }
  ]
}

selection_reason records which dedup channel kept the frame. That one is worth adding early: it is what you read when a frame you wanted is missing and you need to know which stage ate it.
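A minimal sketch of the idea, not crv's actual code. It assumes the showinfo lines have already been captured from ffmpeg's stderr as text. The names parse_showinfo, format_timestamp and build_mapping are hypothetical, and the sample log lines are abbreviated and invented (real ones carry more fields).

```python
import json
import re

# Matches the per-frame lines that the showinfo filter logs.
SHOWINFO = re.compile(r"\bn:\s*(\d+)\s+pts:\s*\S+\s+pts_time:\s*([0-9.]+)")

def parse_showinfo(log_text):
    """Return one PTS in seconds per logged frame, in output order."""
    return [float(m.group(2)) for m in SHOWINFO.finditer(log_text)]

def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def build_mapping(pts_list, keep):
    """keep(index) returns a reason string for surviving frames, else None.

    The PTS travels with the frame. Only the output filename is renumbered,
    so closing the gaps in the names loses nothing.
    """
    frames = []
    for index, pts in enumerate(pts_list):
        reason = keep(index)
        if reason is None:
            continue
        frames.append({
            "file": f"frame_{len(frames) + 1:03d}.jpg",
            "timestamp_sec": pts,
            "timestamp": format_timestamp(pts),
            "selection_reason": reason,
        })
    return {"frames": frames}

# Abbreviated, hypothetical log lines.
sample_log = """\
[Parsed_showinfo_1 @ 0x1] n:   0 pts:   9216 pts_time:0.6     duration: ...
[Parsed_showinfo_1 @ 0x1] n:   1 pts: 282931 pts_time:18.42   duration: ...
[Parsed_showinfo_1 @ 0x1] n:   2 pts: 290000 pts_time:18.88   duration: ...
"""
pts_list = parse_showinfo(sample_log)

# Hypothetical filter: drop the first frame, keep the rest as scene changes.
mapping = build_mapping(pts_list, lambda i: "scene" if i > 0 else None)
print(json.dumps(mapping, indent=2))
```

The point of the keep callback is that every stage decides on an index that still has its timestamp attached, instead of deleting files and renumbering afterwards.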

The part I nearly skipped

If the showinfo log and the extracted files ever disagree on count, the tool writes no timestamps at all rather than approximate ones.

That felt overly strict while I was writing it. It is the right call. A missing timestamp makes the model say "I don't know when". A wrong timestamp makes it cite 00:03:41 with total confidence, and nothing downstream can catch it. In a pipeline whose entire job is handing a model verifiable evidence, a plausible wrong number is the worst thing you can emit.
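The rule fits in a few lines. This is a sketch under the assumption that files and timestamps arrive as two parallel lists. The function name attach_timestamps is hypothetical.

```python
def attach_timestamps(image_files, pts_list):
    """Pair files with PTS values only when the counts agree exactly.

    Returns a list of (file, pts) pairs, or None when the two sources
    disagree, so the caller writes no timestamps rather than shifted ones.
    """
    if len(image_files) != len(pts_list):
        return None
    return list(zip(image_files, pts_list))

# One extra log entry: every pairing after the mismatch would be off by
# one frame, so the sketch refuses to pair anything.
print(attach_timestamps(["a.jpg", "b.jpg", "c.jpg"], [0.6, 18.42, 18.88, 19.1]))
print(attach_timestamps(["a.jpg", "b.jpg"], [0.6, 18.42]))
```

Truncating to the shorter list with zip alone would silently produce exactly the plausible wrong numbers described above, which is why the length check comes first.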

Did it hold up

The person who filed the issue re-ran his 22:12 lecture on the new build and checked it himself: 1,377 candidates down to 60 final frames, 60 entries in the mapping, all monotonic, no missing or extra image files. He replayed the full extraction pass against the original source and matched every final image back to its recorded timestamp. 60 out of 60.

I did not ask him to do that. It is the most useful thing anyone has done for this project.
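The cheap half of that check can be sketched as follows. It covers entry count, ordering, and missing or extra files. It assumes the frames.json schema shown earlier and takes the directory listing as a plain list of names. The function name check_mapping is hypothetical, and replaying the extraction pass against the source is not covered here.

```python
def check_mapping(entries, files_on_disk):
    """Return a list of problems. An empty list means the mapping is consistent.

    entries: the "frames" list from frames.json
    files_on_disk: image filenames found in the output directory
    """
    problems = []

    times = [entry["timestamp_sec"] for entry in entries]
    if any(later < earlier for earlier, later in zip(times, times[1:])):
        problems.append("timestamps go backwards")

    listed = [entry["file"] for entry in entries]
    if len(set(listed)) != len(listed):
        problems.append("a file appears more than once in the mapping")

    missing = set(listed) - set(files_on_disk)
    extra = set(files_on_disk) - set(listed)
    if missing:
        problems.append(f"in mapping but not on disk: {sorted(missing)}")
    if extra:
        problems.append(f"on disk but not in mapping: {sorted(extra)}")
    return problems

# Hypothetical data: one image has no entry in the mapping.
entries = [
    {"file": "frame_001.jpg", "timestamp_sec": 18.42},
    {"file": "frame_002.jpg", "timestamp_sec": 41.0},
]
print(check_mapping(entries, ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]))
```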

Takeaway

If you build any extract, filter and rename pipeline that feeds an LLM, decide early where position lives. Threading an identifier through four stages is much cheaper than reconstructing it from a directory listing afterwards. And when the identifier is uncertain, emit nothing instead of something.

crv is MIT and on PyPI:

pip install -U claude-real-video

Source: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

There is also a paid Pro build if you need camera motion, audio and speaker labels on top of frames and transcript: https://capafy.ai/agent/llm-real-video-pro-let-any-llm-watch-videos/5451082151?ct=devto

A 2,181-video field report made my open-source video tool better in one day

Leo Huang — Tue, 21 Jul 2026 09:02:22 +0000

Last week a user emailed me a field report. He had run claude-real-video — my open-source tool that turns a video into something an LLM can actually read — over his entire photo library: 2,181 videos in about four days. Then he sent me the bug list, worst first.

The worst one was a design flaw I had been shipping for months without noticing.

The blind spot

Frame dedup compares downscaled frames and drops a frame when too few pixels changed. Sounds reasonable — until the thing that matters is small in the frame.

A person filmed at phone-camera distance covers roughly 0.5% of the picture. Whatever they do, they can never change 8% of the pixels. So percentage-based dedup calls the crucial second "a duplicate" and deletes it. On his repro clip — a static shot where a vehicle knocks someone down in about one second — extraction produced 69 frames and dedup kept 4. The analysis described "a vehicle passes close to the camera" and missed the incident entirely. Setting the threshold to zero did not help. The math is structurally blind.

The fix

Percentages cannot see small subjects, so 0.7.16 adds a third check that ignores percentages: if a handful of cells change hard, the frame stays. That is the whole idea. On the same repro, the action now survives 10/10 frames and a vision model narrates the event correctly. On normal footage the kept-count barely moves, so you do not pay extra for it.
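A toy illustration of why the two checks disagree. This is not the 0.7.16 implementation: the cell size, both delta thresholds and the "one hot cell is enough" rule are assumptions picked for the example, and frames are plain nested lists of grayscale values. Only the 8% and the roughly 0.5% subject size come from the report above.

```python
def changed_fraction(prev, curr, pixel_delta=25):
    """Fraction of pixels whose grayscale value moved by more than pixel_delta."""
    total = changed = 0
    for row_a, row_b in zip(prev, curr):
        for a, b in zip(row_a, row_b):
            total += 1
            changed += abs(a - b) > pixel_delta
    return changed / total

def hot_cells(prev, curr, cell=8, cell_delta=40):
    """Count cell x cell blocks whose mean absolute difference exceeds cell_delta."""
    height, width = len(prev), len(prev[0])
    count = 0
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            diffs = [abs(prev[y][x] - curr[y][x])
                     for y in range(top, min(top + cell, height))
                     for x in range(left, min(left + cell, width))]
            if sum(diffs) / len(diffs) > cell_delta:
                count += 1
    return count

def keep_frame(prev, curr, min_fraction=0.08, min_hot_cells=1):
    """Keep when enough of the picture changed OR a few cells changed hard."""
    return (changed_fraction(prev, curr) >= min_fraction
            or hot_cells(prev, curr) >= min_hot_cells)

# A static 160x120 scene in which a subject of 12x8 pixels
# (0.5% of the frame) changes sharply between two frames.
prev = [[100] * 160 for _ in range(120)]
curr = [row[:] for row in prev]
for y in range(56, 64):
    for x in range(80, 92):
        curr[y][x] = 220

print(f"changed fraction: {changed_fraction(prev, curr):.3%}")
print(f"hot cells: {hot_cells(prev, curr)}")
print(f"kept: {keep_frame(prev, curr)}")
```

The percentage check alone would drop this frame no matter how violent the change inside the subject is, because the subject's area caps the fraction. The cell check looks at intensity per region instead of share of the whole picture.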

Everything else he reported — a crash on non-UTF-8 metadata, a flag name that means the opposite of what it says, 68 GB of intermediates piling up silently — shipped the same day, also in 0.7.16.

Try it

If you use Claude Code (Codex, Cursor and Gemini CLI work too):

npx skills add HUANGCHIHHUNGLeo/claude-real-video

Claude Code plugin marketplace:

/plugin marketplace add HUANGCHIHHUNGLeo/claude-real-video
/plugin install claude-real-video@claude-real-video

Or run it directly:

pip install claude-real-video
crv "your video URL or file"


For users in Japan

I am glad to see more users from Japan. Installation is just pip install claude-real-video, followed by crv "video URL or file". Questions in Japanese are welcome.

The takeaway

A bug list this long only comes from mileage. If someone runs your tool 2,181 times and writes down everything that broke, that is not criticism — that is the roadmap. Treasure those users.

Repo: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

Your LLM can't actually watch video. Here's the smallest fix (MIT)

Leo Huang — Sun, 19 Jul 2026 03:01:16 +0000

Every model card says "multimodal". Then you hand the model a real video file and discover what that means in practice: ChatGPT reads the subtitle track, Claude doesn't accept video files at all. The model narrates a video it mostly never saw.

I unpack viral videos daily for my own content work, so I couldn't route around this. I built a small tool instead.

The mechanism

claude-real-video converts a video into three things an LLM can genuinely read:

Scene-aware sampled frames — ffmpeg scene scores decide where to sample, so you get a frame when the picture changes, not every N seconds. An --adaptive flag handles slow deformations (a real user bug report: fixed thresholds missed squash/stretch morphs entirely).
A timestamped transcript — whisper by default; if faster-whisper is installed it runs in-process and several times faster, with automatic fallback.
One MANIFEST timeline — frames and transcript merged into a single file, so the model follows the video in order instead of guessing from fragments. A --text-anchors flag force-samples frames at subtitle cues so on-screen text never falls between frames.

Then you point any LLM at the output folder — Claude, GPT, Gemini, or a local model. No API of mine in the middle, everything runs on your machine.
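The merged timeline is the simplest of the three to picture. The sketch below is an assumption about the general shape, not the real MANIFEST format. The segment fields start and text, the line layout, and the sample data are all hypothetical.

```python
def build_timeline(frames, segments):
    """Merge frame entries and transcript segments into one list ordered by time."""
    events = [(f["timestamp_sec"], 0, f"[frame] {f['file']}") for f in frames]
    events += [(s["start"], 1, f"[speech] {s['text']}") for s in segments]
    events.sort()  # by time; frames sort before speech at the same instant
    return [f"{t:8.2f}s  {label}" for t, _, label in events]

# Hypothetical inputs.
frames = [
    {"file": "frame_001.jpg", "timestamp_sec": 18.42},
    {"file": "frame_002.jpg", "timestamp_sec": 41.0},
]
segments = [
    {"start": 17.9, "text": "Here is the revenue chart."},
    {"start": 40.5, "text": "Now the second slide."},
]
timeline = build_timeline(frames, segments)
print("\n".join(timeline))
```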

Usage

pip install claude-real-video
crv "video.mp4" -o out

Honest limitations

Not real-time — a 90-second video takes about 1–2 minutes all-in on an M-series Mac. Frame sampling can still miss motion between frames; the flags above patch the worst cases, both born from real GitHub issues.

It's MIT, currently at 1,731 stars with ~8k installs last month, which taught me the problem was never just mine:
https://github.com/HUANGCHIHHUNGLeo/claude-real-video

The eye corrects the ear: fixing my LLM's video hallucinations with OCR and a VAD gate

Leo Huang — Fri, 17 Jul 2026 09:14:29 +0000

Sequel to My LLM could not tell a timelapse from real time — so I taught it physics.

I build crv, an open-source tool that turns videos into something an LLM can actually read: scene-aware keyframes, a timestamped transcript, and a fused timeline. This week two of its senses started lying to it, and fixing that taught me one lesson worth writing down.

The ear lies: whisper invents captions over music

An 8-second, music-only clip came back with the caption "I'll see you next time." Nobody says anything in the clip. Whisper's decoder has seen too many outros — music at the end of a video "should" have that line, so it writes it.

The standard fix works: switch to faster-whisper and enable its Silero VAD (voice-activity detection) gate:

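from faster_whisper import WhisperModel

model = WhisperModel("small")  # model size here is an assumption
segments, info = model.transcribe(wav,  # wav: path to the audio file
    vad_filter=True,  # Silero VAD gate: skip audio with no detected speech
    vad_parameters={"min_silence_duration_ms": 500},
    condition_on_previous_text=False)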

Segments with no detected speech never reach the model. Nothing goes in, nothing gets invented.

The bug that was mine, not whisper's

Here is the part I have not seen written down anywhere. My pipeline had a fallback: if the fast engine returns nothing, fall back to the whisper CLI. Sounds harmless — until the VAD gate correctly hears no speech and returns an empty segment list. My code read "empty" as "engine failed", fell back to the ungated CLI, and the phantom caption walked right back in through the back door.

An empty result is an answer, not an error. If you put a gated path in front of a fallback path, make sure the gate's verdict can't be overruled by your own plumbing. My manifest now says "the voice-activity gate heard no speech; music/ambient-only audio" — an honest sentence instead of a fake caption.
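In code, the difference is between "the engine raised" and "the engine returned an empty list". A sketch with stand-in engines follows. EngineError, transcribe_with_fallback and the status strings are hypothetical names, not crv's API.

```python
class EngineError(Exception):
    """Raised by an engine that could not run at all."""

def transcribe_with_fallback(audio_path, gated_engine, fallback_engine):
    """Return (segments, status). Only a real failure triggers the fallback."""
    try:
        segments = gated_engine(audio_path)
    except EngineError:
        return fallback_engine(audio_path), "fallback"
    if not segments:
        # The gate heard no speech. That is a verdict, so it is final.
        return [], "no-speech"
    return segments, "gated"

# Stand-in engines for the demonstration.
def gate_hears_nothing(path):
    return []

def gate_is_broken(path):
    raise EngineError("engine not installed")

def ungated_cli(path):
    return ["I'll see you next time."]  # the phantom caption

print(transcribe_with_fallback("clip.wav", gate_hears_nothing, ungated_cli))
print(transcribe_with_fallback("clip.wav", gate_is_broken, ungated_cli))
```

The buggy version is the one that writes `if not segments: return fallback_engine(...)`, which treats both cases the same.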

The eye corrects the ear: OCR as ground truth

Short-form video is wall-to-wall burned-in captions, and those captions are the video's own script. So crv Pro now OCRs every kept frame and places on-screen text on the same timeline as the ASR transcript.

Real example from a Chinese video I processed yesterday: whisper heard 猴狼 ("monkey-wolf" — not a word), while the burned-in caption at the same second clearly said 后浪 ("the rising generation", the video's whole point). Same timestamp, two readings. The manifest tells the reading LLM: prefer on-screen wording over the ASR transcript for names, numbers and terms.

The ear mishears; the eye reads the script. Cross-modal redundancy beats either sense alone.
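One way to put both readings on the same timeline is to attach any on-screen text seen during a transcript segment to that segment. This is a sketch, not crv Pro's implementation. The field names and the timestamps are assumptions, and only the two words come from the example above.

```python
def attach_onscreen_text(segments, ocr_hits):
    """Add an "onscreen" list to each ASR segment with the OCR text seen during it."""
    merged = []
    for seg in segments:
        seen = [hit["text"] for hit in ocr_hits
                if seg["start"] <= hit["timestamp_sec"] <= seg["end"]]
        merged.append({**seg, "onscreen": seen})
    return merged

# Hypothetical timings for the example above.
segments = [{"start": 12.0, "end": 14.5, "text": "猴狼"}]
ocr_hits = [
    {"timestamp_sec": 13.0, "text": "后浪"},
    {"timestamp_sec": 30.0, "text": "unrelated caption"},
]
merged = attach_onscreen_text(segments, ocr_hits)
print(merged)
```

The reading LLM then sees both versions side by side and can follow the instruction to prefer the on-screen wording for names, numbers and terms.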

Takeaways

VAD-gate your ASR. Hallucinated captions poison everything downstream.
Audit your fallback paths — a correct empty result must not trigger a fallback to the thing you were protecting against.
If the video carries its own text, treat it as ground truth and let it correct the transcript.

Everything above ships in claude-real-video 0.7.13 (free, MIT) and crv Pro 0.8.12. All local — nothing leaves your machine.

My LLM could not tell a timelapse from real time — so I taught it physics

Leo Huang — Thu, 16 Jul 2026 14:33:52 +0000

Yesterday I got asked a simple question: "can your tool tell this reel is a timelapse?"

It could not. The tool (claude-real-video, open source, MIT) turns any video into keyframes + a timestamped transcript so an LLM can actually read it. But keyframes alone don't carry playback speed. My model watched a hyperlapse of a guy typing and described it as "a man typing."

Five hours later there was a working prototype. Here's what I learned building it.

The physics is simple

A video is a sampling of time. Detecting speed manipulation reduces to one question: does the motion between frames match the time the container claims passed between them?

Three measurable signals fall out of that:

Trajectory continuity. Post speed-up drops frames from continuously captured footage — motion is fast but trackable. Interval capture (timelapse) never recorded the in-between frames — subjects teleport, optical-flow tracking collapses. Dense frame extraction can recover the former, never the latter.
Duplicate-frame patterns. Slow motion by frame duplication leaves a periodic fingerprint: hold-hold-hold-move. Frame-rate conversion (24→30fps) leaves a different one: one duplicate every five frames. A still slide leaves one long run. Same "duplicate ratio", three different verdicts — run-length structure is the tell.
Camera vs. subject motion. Estimate the global affine transform per frame pair (RANSAC over LK tracks), subtract it, and classify what's left. Skip this and a stabilized sped-up walking tour reads as normal — the "speed" was all in the camera channel.
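The run-length idea from the second signal can be sketched on idealized input. Assume an upstream step has already produced one flag per frame saying whether it is identical to the previous frame. The classifier below is a toy that only recognizes perfectly regular patterns, and real footage would need tolerance for noise. The function names and labels are hypothetical.

```python
from itertools import groupby

def run_lengths(is_dup):
    """Split per-frame duplicate flags into (duplicate_runs, fresh_runs) lengths."""
    dup_runs, fresh_runs = [], []
    for flag, group in groupby(is_dup):
        (dup_runs if flag else fresh_runs).append(len(list(group)))
    return dup_runs, fresh_runs

def classify(is_dup):
    dup_runs, fresh_runs = run_lengths(is_dup)
    if not dup_runs:
        return "no duplicates"
    if len(dup_runs) == 1:
        return "single hold"
    if set(fresh_runs) == {1} and len(set(dup_runs)) == 1:
        # Every new picture is followed by the same number of repeats.
        return "duplicated slow motion"
    if set(dup_runs) == {1} and len(set(fresh_runs)) == 1:
        # One lone repeat at a fixed interval.
        return "frame-rate conversion"
    return "irregular"

slow_2x = [False, True] * 10                        # move, hold, move, hold
hold = [False] * 10 + [True] * 10                   # motion, then a still slide
pulldown = [False, False, False, False, True] * 4   # one duplicate every five frames

for name, flags in [("slow_2x", slow_2x), ("hold", hold), ("pulldown", pulldown)]:
    ratio = sum(flags) / len(flags)
    print(f"{name}: duplicate ratio {ratio:.2f} -> {classify(flags)}")
```

The first two inputs have the same duplicate ratio and get different verdicts, which is the reason to look at run structure instead of a single ratio.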

What the benchmark taught me

I built a labeled corpus the cheap way: took clean YouTube footage, generated known transforms with ffmpeg (2x, 4x, 8x/30x interval sampling, duplicated slow-mo), and ran the classifier against ground truth.

Results after five iterations:

Zero false positives on clean footage — the one metric I refuse to trade away. A forensics tool that cries wolf is worse than no tool.
Heavy manipulation (30x lapse, padded slow-mo, 4x on visible subjects): caught, with per-segment verdicts.
Subtle 2x on slow scenes: missed, and honestly unfixable with displacement statistics alone. A slow camera sped up 2x still moves within normal-camera range. You need a reference clock — something in the frame with a known real-world rate. Human gait (~2 steps/s) is the obvious next channel.

Two bugs were more instructive than the wins:

My own corpus generation manufactured evidence: normalizing 23.976fps film to 30fps created a perfect pulldown pattern that the tool flagged as slow motion. The fix wasn't a threshold — it was teaching the classifier to recognize frame-rate conversion as its own category.
Median motion statistics erased the flagship case. In a reel where typing hands occupy 10% of the frame, the median over 400 tracked corners is the static wall. The manipulation lives in the tail (p90) and in the moving cluster — aggregate stats hide exactly what you're looking for.
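The second bug is easy to reproduce with invented numbers. The displacement values and the 340/60 split of static and moving corners below are hypothetical. The percentile helper uses the nearest-rank definition.

```python
import math
import statistics

def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical per-corner displacement (pixels per frame) for 400 tracked corners:
# most sit on a static wall, a minority sit on the fast-moving hands.
static_wall = [0.2] * 340
typing_hands = [12.0] * 60
displacements = static_wall + typing_hands

median = statistics.median(displacements)
p90 = percentile(displacements, 90)
print(f"median: {median} px/frame")
print(f"p90:    {p90} px/frame")
```

The median reports the wall. The tail reports the hands.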

Never say "normal speed"

The design rule I'm most attached to came from an adversarial review: the tool never outputs "this video is normal speed." It outputs "no reliable evidence of manipulation." Cleanly re-encoded speed-up is theoretically indistinguishable from native low-frame-rate capture — a tool that pretends otherwise is lying. Evidence tiers (strong/moderate/weak/insufficient), never fake certainty.
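As a sketch, the rule amounts to a verdict type with no "normal" value in it. The enum and function names are hypothetical. The four tiers and the fallback sentence are the ones quoted above.

```python
from enum import Enum

class Evidence(Enum):
    STRONG = "strong"
    MODERATE = "moderate"
    WEAK = "weak"
    INSUFFICIENT = "insufficient"

def report(evidence, finding=None):
    """Describe a verdict. There is deliberately no 'normal speed' branch."""
    if finding is None or evidence is Evidence.INSUFFICIENT:
        return "no reliable evidence of manipulation"
    return f"{finding} ({evidence.value} evidence)"

print(report(Evidence.STRONG, "interval capture"))
print(report(Evidence.INSUFFICIENT))
```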

Where this lands

The prototype (387 lines, OpenCV + ffmpeg, no GPU, ~1s per 10s of video) ships as an opt-in --speed-check flag in crv Pro once it passes a benchmark built from real reels and Shorts — because that's what people actually feed these tools, and the current corpus is too polite.

The free base — scene-aware keyframes + timestamped transcript, 100% local — is here: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

If you've worked on temporal forensics (SpeedNet, the recent Cornell "Seeing Fast and Slow" work) I'd genuinely like to hear where this naive-physics approach breaks.

Your LLM isn't watching that video — it's reading the subtitles

Leo Huang — Wed, 15 Jul 2026 14:55:47 +0000

A few months ago I pasted a YouTube link into an AI chat and asked "what happens in this video?"

It answered instantly. Confidently. And completely from the transcript. The video had a sight gag in the middle — the whole point of the clip — and the model had no idea, because nobody ever showed it a single frame.

That bugged me enough to build claude-real-video (crv), a small open-source CLI that turns any video into something an LLM can actually read. It hit the Hacker News front page and just passed 1.6k GitHub stars, so I figured it's time to write up how it works under the hood.

The naive approach fails on tokens

The obvious fix is "extract frames, paste them in." But at a fixed 1 fps, a 58-second clip becomes 58 images. Most are near-duplicates of their neighbours, and vision tokens are expensive — you're paying to show the model the same talking head 40 times.

Fixed-interval sampling has the opposite failure too: a fast cut between two samples just disappears.

What crv does instead

pip install "claude-real-video[whisper]"
crv "https://www.youtube.com/watch?v=..."
# → crv-out/frames/*.jpg + frames.json + transcript.txt/.json + MANIFEST.txt

Everything runs locally. No ML models to download for the core path — it's ffmpeg doing the heavy lifting:

1. Scene-change detection, not a fixed quota. One ffmpeg metadata pass computes a scene score for every frame. Frames are kept where the content actually changes, so that same 58-second clip yields 26 frames instead of 58 — and no cut slips through, because cuts are exactly what scene scores spike on.

2. Sliding-window dedup. Near-duplicates that survive the threshold get compared against a rolling window and dropped. What's left is the minimal set of frames that differ.

3. Contact sheets. --grid packs the survivors into a few labeled grid images. 26 frames become 3 contact sheets. Fewer images, same information, and the timestamps are printed on each cell so the model can reference "at 0:41" correctly.

4. Timestamped transcript. Subtitles when the platform provides them, Whisper when it doesn't — written both as plain text and as transcript.json with per-segment timestamps, so the frames and the words line up on one timeline.

The output is one folder with a MANIFEST.txt on top. Drop it into Claude, ChatGPT or Gemini and ask away.
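Step 1 can be illustrated without ffmpeg by assuming the per-frame scene scores are already available as (timestamp, score) pairs. The 0.3 threshold, the 10 fps rate and the scores are invented for the example, and crv's real thresholds may differ.

```python
def select_scene_changes(scored_frames, threshold=0.3):
    """Keep the first frame plus every frame whose scene score exceeds threshold."""
    kept = []
    for index, (timestamp, score) in enumerate(scored_frames):
        if index == 0 or score > threshold:
            kept.append(timestamp)
    return kept

def sample_fixed(scored_frames, interval=1.0):
    """Fixed-interval sampling for comparison: frames on whole multiples of interval."""
    return [t for t, _ in scored_frames if abs(t / interval - round(t / interval)) < 1e-9]

# Hypothetical 5-second clip at 10 fps: flat scores except a quick
# cutaway that starts at 3.4 s and ends at 3.8 s.
scores = {34: 0.9, 38: 0.85}
scored_frames = [(i / 10, scores.get(i, 0.01)) for i in range(50)]

print("scene-based:", select_scene_changes(scored_frames))
print("fixed 1 fps:", sample_fixed(scored_frames))
```

The fixed sampler takes more frames and still never lands inside the cutaway, while the scene-based pass keeps fewer frames and catches both cuts.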

The two failure modes that took real users to find

The fixed scene-score threshold turned out to have blind spots, and both fixes came from GitHub issues:

Slow morphs never spike. An animator reported that a 2-3 second squash-and-stretch never triggered a single frame — no individual frame differs enough from the previous one. --adaptive fixes this by scoring each frame against its rolling 2-second neighbourhood mean instead of a global constant. Slow change accumulates against the local baseline and gets caught.

Slides don't change when the speaker does. In lectures and screen recordings, the picture can sit still for a minute while the audio moves through three ideas. --text-anchors forces one extra frame at each subtitle-cue timestamp (capped at one per second), so every spoken segment has a matching visual even when scene detection sees nothing.
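The --text-anchors rule is small enough to sketch. This is one reading of "capped at one per second", namely a minimum gap of one second between forced samples. The function name and the cue times are hypothetical.

```python
def text_anchor_times(cue_starts, min_gap=1.0):
    """Return one forced sample time per subtitle cue, skipping cues that
    start less than min_gap seconds after the previous anchor."""
    anchors = []
    for start in sorted(cue_starts):
        if not anchors or start - anchors[-1] >= min_gap:
            anchors.append(start)
    return anchors

# Hypothetical cue start times in seconds.
cues = [12.0, 12.4, 13.5, 40.0, 40.2, 75.0]
print(text_anchor_times(cues))
```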

Why local matters here

The model never needs the video file — it needs the residue: which frames changed, what was said, when. That residue is small enough to compute on any laptop with ffmpeg, which means the video itself never has to leave your machine. What goes to a cloud LLM afterwards is only whatever you choose to paste.

If you're on Claude Code, the repo ships a skill folder — install it and the agent watches videos on its own when you paste a link.

Honest footnote

crv is MIT and stays free. I fund the work with a paid add-on (crv Pro) that adds camera-motion and emotion-timeline analysis for creators — the free core is the complete watching pipeline, not a demo.

Repo: https://github.com/HUANGCHIHHUNGLeo/claude-real-video
PyPI: https://pypi.org/project/claude-real-video/

— Leo Huang (黃志弘, LeoAido), building a one-person company with an AI team.

Your LLM isn't watching the video. It's reading subtitles.

Leo Huang — Fri, 03 Jul 2026 21:57:26 +0000

Paste a YouTube link into ChatGPT and ask "what's this video about?" — you'll get an answer. But here's the thing: it read the transcript. The slides, the live demo, the thing the presenter actually showed on screen? All thrown away.

I found this out the hard way, and it bugged me enough to build a tool for it. Last week it hit the front page of Hacker News and just passed 500 GitHub stars, so I figured I'd write down how it works.

The state of "AI watching video" today

Claude won't accept a video file at all.
ChatGPT takes a YouTube link, reads the subtitles, and answers from those.
Gemini genuinely reads video — but it samples at a fixed interval (1 fps by default), so fast cuts slip between samples while a 10-minute static slide burns 600 near-identical frames. And your footage goes to the cloud.

For talks, tutorials, and demos — where most of the value is on screen, not in the audio — none of these actually work.

What I built instead

claude-real-video takes a URL or a local file and produces a folder any LLM can read:

pip install claude-real-video
crv "https://www.youtube.com/watch?v=..." --grid
# → crv-out/frames/  +  transcript.txt  +  MANIFEST.txt  +  grids/

Three ideas, all boring on purpose:

Grab a frame only when the picture actually changes. Scene-change detection instead of a fixed sampling interval — a 10-minute static slide collapses to one frame, a rapid-fire edit keeps every cut.
Drop what the model already saw. A sliding-window dedup compares each new frame against the last few kept ones, so an A-B-A cutaway doesn't send shot A twice.
Tell the model what it's looking at. One MANIFEST.txt lists every frame with its timestamp, aligned with the Whisper transcript.

Real numbers from a 58-second clip: fixed 1 fps sampling gives you 58 frames; this keeps the 26 that actually differ.
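The A-B-A case is what separates a sliding window from a compare-to-previous check. In this sketch each frame is a tiny tuple of grayscale values standing in for a downscaled image. The window size and both thresholds are assumptions, not crv's defaults.

```python
from collections import deque

def changed_fraction(a, b, pixel_delta=25):
    """Fraction of pixels that differ by more than pixel_delta."""
    return sum(abs(x - y) > pixel_delta for x, y in zip(a, b)) / len(a)

def sliding_window_dedup(frames, window=3, min_fraction=0.08):
    """frames: list of (name, pixels). A frame is dropped when it is a
    near-duplicate of any of the last `window` kept frames."""
    recent = deque(maxlen=window)
    kept = []
    for name, pixels in frames:
        if any(changed_fraction(pixels, other) < min_fraction for other in recent):
            continue
        kept.append(name)
        recent.append(pixels)
    return kept

# Hypothetical A-B-A cutaway: shot A, a near-identical A, shot B, back to A.
shot_a, shot_a_again, shot_b = (10,) * 16, (12,) * 16, (200,) * 16
frames = [("a1", shot_a), ("a2", shot_a_again), ("b1", shot_b), ("a3", shot_a)]

print("window=1:", sliding_window_dedup(frames, window=1))
print("window=3:", sliding_window_dedup(frames, window=3))
```

With a window of one, the return to shot A looks new because it is only compared with shot B. With a wider window it is recognized and dropped.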

"Keyframes are not video"

Fairest criticism I got on HN. A stack of stills loses motion and order. v0.4.0's answer is --grid: it packs consecutive keyframes into 3x3 contact sheets, so the model reads a chronological sequence instead of scattered images — and you send 9x fewer images while you're at it.

It still won't recover true motion or object permanence — I'd rather say that plainly than oversell it. (I'm exploring measured motion data — camera moves, cut rhythm — as a paid add-on called crv Pro, but the free tool stands on its own.)
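The grouping behind --grid is plain chunking. A sketch of the bookkeeping only, with no image composition. The function name is hypothetical.

```python
def contact_sheets(frames, rows=3, cols=3):
    """Split a chronological frame list into consecutive groups of rows x cols."""
    per_sheet = rows * cols
    return [frames[i:i + per_sheet] for i in range(0, len(frames), per_sheet)]

frame_names = [f"frame_{n:03d}.jpg" for n in range(1, 27)]  # 26 keyframes
sheets = contact_sheets(frame_names)
print([len(sheet) for sheet in sheets])
```

The last sheet is simply shorter when the frame count is not a multiple of nine, so order is preserved and nothing is padded or dropped.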

Everything runs locally

ffmpeg + faster-whisper on your machine. Nothing is uploaded by the tool — what reaches an LLM is only what you choose to paste into one afterwards. MIT licensed.

If you use Claude Code, there's a ready-made skill in the repo — drop it into ~/.claude/skills and Claude will run the whole pipeline itself when you paste a video link.

GitHub: https://github.com/HUANGCHIHHUNGLeo/claude-real-video

I'm Leo — a liberal-arts founder running a one-person company with an AI team. Happy to answer anything about the approach.