<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: HUANGCHIHHUNG</title>
    <description>The latest articles on DEV Community by HUANGCHIHHUNG (@huangchihhungleo).</description>
    <link>https://dev.to/huangchihhungleo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4014142%2Ff98d8693-2550-4f9e-9aa6-8c516ba04b84.jpg</url>
      <title>DEV Community: HUANGCHIHHUNG</title>
      <link>https://dev.to/huangchihhungleo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/huangchihhungleo"/>
    <language>en</language>
    <item>
      <title>Your LLM isn't watching the video. It's reading subtitles.</title>
      <dc:creator>HUANGCHIHHUNG</dc:creator>
      <pubDate>Fri, 03 Jul 2026 21:57:26 +0000</pubDate>
      <link>https://dev.to/huangchihhungleo/your-llm-isnt-watching-the-video-its-reading-subtitles-2jjl</link>
      <guid>https://dev.to/huangchihhungleo/your-llm-isnt-watching-the-video-its-reading-subtitles-2jjl</guid>
      <description>&lt;p&gt;Paste a YouTube link into ChatGPT and ask "what's this video about?" — you'll get an answer. But here's the thing: it read the transcript. The slides, the live demo, the thing the presenter actually showed on screen? All thrown away.&lt;/p&gt;

&lt;p&gt;I found this out the hard way, and it bugged me enough to build a tool for it. Last week it hit the front page of Hacker News and just passed 500 GitHub stars, so I figured I'd write down how it works.&lt;/p&gt;

&lt;h2&gt;The state of "AI watching video" today&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; won't accept a video file at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; takes a YouTube link, reads the subtitles, and answers from those.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; genuinely reads video — but it samples at a fixed interval (1 fps by default), so fast cuts slip between samples while a 10-minute static slide burns 600 near-identical frames. And your footage goes to the cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For talks, tutorials, and demos, where most of the value is on screen rather than in the audio, none of these does the job well.&lt;/p&gt;

&lt;h2&gt;What I built instead&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;claude-real-video&lt;/code&gt; takes a URL or a local file and produces a folder any LLM can read:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;claude-real-video
crv &lt;span class="s2"&gt;"https://www.youtube.com/watch?v=..."&lt;/span&gt; &lt;span class="nt"&gt;--grid&lt;/span&gt;
&lt;span class="c"&gt;# → crv-out/frames/  +  transcript.txt  +  MANIFEST.txt  +  grids/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three ideas, all boring on purpose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Grab a frame only when the picture actually changes.&lt;/strong&gt; Scene-change detection instead of a fixed sampling interval — a 10-minute static slide collapses to one frame, a rapid-fire edit keeps every cut.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop what the model already saw.&lt;/strong&gt; A sliding-window dedup compares each new frame against the last few kept ones, so an A-B-A cutaway doesn't send shot A twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tell the model what it's looking at.&lt;/strong&gt; One MANIFEST.txt lists every frame with its timestamp, aligned with the Whisper transcript.&lt;/li&gt;
&lt;/ol&gt;
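
&lt;p&gt;The first two ideas can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: frames are modelled as flat lists of grayscale values, and &lt;code&gt;mean_abs_diff&lt;/code&gt;, &lt;code&gt;select_frames&lt;/code&gt;, both thresholds, and the window size are all hypothetical names and values chosen for the example.&lt;/p&gt;

```python
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple

# A toy frame: (timestamp in seconds, flat list of grayscale pixel values).
Frame = Tuple[float, Sequence[int]]


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Average per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_frames(
    frames: List[Frame],
    change_threshold: float = 20.0,  # assumed value: how big a jump counts as a scene change
    dup_threshold: float = 5.0,      # assumed value: how close counts as "already seen"
    window: int = 3,                 # assumed value: how many kept frames to compare against
) -> List[Frame]:
    kept: List[Frame] = []
    recent: Deque[Sequence[int]] = deque(maxlen=window)
    previous: Optional[Sequence[int]] = None

    for timestamp, pixels in frames:
        # Idea 1: only consider a frame when the picture changed since the last one.
        changed = previous is None or mean_abs_diff(pixels, previous) >= change_threshold
        previous = pixels
        if not changed:
            continue

        # Idea 2: drop it if it matches one of the last few kept frames (A-B-A cutaway).
        if any(dup_threshold >= mean_abs_diff(pixels, old) for old in recent):
            continue

        kept.append((timestamp, pixels))
        recent.append(pixels)

    return kept


if __name__ == "__main__":
    shot_a, shot_b, shot_c = [0] * 4, [200] * 4, [100] * 4
    clip = [(0.0, shot_a), (1.0, shot_a), (2.0, shot_a),
            (3.0, shot_b), (4.0, shot_a), (5.0, shot_c)]
    print([timestamp for timestamp, _ in select_frames(clip)])
```

&lt;p&gt;In this toy clip the static run of shot A collapses to one frame, and the cut back to A at 4.0 s is dropped because A is still inside the comparison window. A real pipeline would compare decoded images rather than four-pixel lists, but the control flow is the same.&lt;/p&gt;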

&lt;p&gt;Real numbers from a 58-second clip: fixed 1 fps sampling gives you 58 frames; this keeps the 26 that actually differ.&lt;/p&gt;
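
&lt;p&gt;The manifest idea can be pictured the same way. The sketch below is an assumption about what such a file might look like, not the real &lt;code&gt;MANIFEST.txt&lt;/code&gt; layout: the line format, the &lt;code&gt;build_manifest&lt;/code&gt; and &lt;code&gt;format_timestamp&lt;/code&gt; helpers, and the frame file names are hypothetical. It pairs each kept frame with the transcript segment being spoken at that moment.&lt;/p&gt;

```python
from typing import List, Tuple

# A transcript segment: (start seconds, end seconds, text).
Segment = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, dropping fractions."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_manifest(frames: List[Tuple[str, float]], segments: List[Segment]) -> str:
    """Return one line per frame: timestamp, file name, and the words spoken then."""
    lines = []
    for name, timestamp in frames:
        # Find the segment whose [start, end) range contains the frame's timestamp.
        spoken = next(
            (text for start, end, text in segments if end > timestamp >= start),
            "(no speech)",
        )
        lines.append(f"{format_timestamp(timestamp)}  {name}  {spoken}")
    return "\n".join(lines)


if __name__ == "__main__":
    frames = [("frame_0001.jpg", 0.0), ("frame_0002.jpg", 65.5), ("frame_0003.jpg", 90.0)]
    segments = [(0.0, 4.0, "Welcome"), (60.0, 70.0, "Here is the demo")]
    print(build_manifest(frames, segments))
```

&lt;p&gt;With a listing like this, a model can tie a claim in the transcript to the frame that was on screen when it was said.&lt;/p&gt;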

&lt;h2&gt;"Keyframes are not video"&lt;/h2&gt;

&lt;p&gt;Fairest criticism I got on HN. A stack of stills loses motion and order. v0.4.0's answer is &lt;code&gt;--grid&lt;/code&gt;: it packs consecutive keyframes into 3x3 contact sheets, so the model reads a chronological sequence instead of scattered images — and you send 9x fewer images while you're at it.&lt;/p&gt;
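
&lt;p&gt;To make the packing concrete, here is a minimal sketch of how keyframes could be assigned to 3x3 sheets in chronological order. It only plans the layout and does no image work. &lt;code&gt;plan_contact_sheets&lt;/code&gt; and the frame names are hypothetical, and the left-to-right, top-to-bottom fill order is an assumption.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


def plan_contact_sheets(
    frame_names: List[str], rows: int = 3, cols: int = 3
) -> List[Dict[str, Tuple[int, int]]]:
    """Group frames into sheets and map each frame to its (row, col) cell."""
    per_sheet = rows * cols
    sheets = []
    for start in range(0, len(frame_names), per_sheet):
        chunk = frame_names[start:start + per_sheet]
        # divmod(i, cols) turns position i into (row, col), filling row by row,
        # so reading a sheet like text follows the video's timeline.
        sheets.append({name: divmod(i, cols) for i, name in enumerate(chunk)})
    return sheets


if __name__ == "__main__":
    names = [f"frame_{i:04d}.jpg" for i in range(1, 27)]  # 26 keyframes
    sheets = plan_contact_sheets(names)
    print(len(sheets), [len(sheet) for sheet in sheets])
```

&lt;p&gt;Under these assumptions, the 26 keyframes from the 58-second clip fit on 3 sheets, the last one partly empty, so the 9x saving is an upper bound that is reached when the frame count is a multiple of 9.&lt;/p&gt;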

&lt;p&gt;It still won't recover true motion or object permanence — I'd rather say that plainly than oversell it. (I'm exploring measured motion data — camera moves, cut rhythm — as a paid add-on called crv Pro, but the free tool stands on its own.)&lt;/p&gt;

&lt;h2&gt;Everything runs locally&lt;/h2&gt;

&lt;p&gt;ffmpeg + faster-whisper on your machine. Nothing is uploaded by the tool — what reaches an LLM is only what you choose to paste into one afterwards. MIT licensed.&lt;/p&gt;

&lt;p&gt;If you use Claude Code, there's a ready-made skill in the repo — drop it into &lt;code&gt;~/.claude/skills&lt;/code&gt; and Claude will run the whole pipeline itself when you paste a video link.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/HUANGCHIHHUNGLeo/claude-real-video" rel="noopener noreferrer"&gt;https://github.com/HUANGCHIHHUNGLeo/claude-real-video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm Leo — a liberal-arts founder running a one-person company with an AI team. Happy to answer anything about the approach.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
