<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhisek Mishra</title>
    <description>The latest articles on DEV Community by Abhisek Mishra (@abhisek_mishra).</description>
    <link>https://dev.to/abhisek_mishra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2173616%2F487fd5eb-23a1-4ba2-b706-6bfbc6b95a11.png</url>
      <title>DEV Community: Abhisek Mishra</title>
      <link>https://dev.to/abhisek_mishra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abhisek_mishra"/>
    <language>en</language>
    <item>
      <title>How I built an AI video clipping pipeline with LangGraph, Whisper and FFmpeg</title>
      <dc:creator>Abhisek Mishra</dc:creator>
      <pubDate>Sun, 24 May 2026 07:20:15 +0000</pubDate>
      <link>https://dev.to/abhisek_mishra/how-i-built-an-ai-video-clipping-pipeline-with-langgraph-whisper-and-ffmpeg-4nfh</link>
      <guid>https://dev.to/abhisek_mishra/how-i-built-an-ai-video-clipping-pipeline-with-langgraph-whisper-and-ffmpeg-4nfh</guid>
      <description>&lt;p&gt;I kept avoiding clipping my own content.&lt;/p&gt;

&lt;p&gt;Not because I didn't want short clips. I did. But the process was genuinely painful — scrub through a long video, find a good moment, trim it, crop for vertical, add captions, export. Repeat three times. Two hours gone.&lt;/p&gt;

&lt;p&gt;So I built a tool that does the whole thing automatically.&lt;/p&gt;

&lt;p&gt;Here's how it works under the hood.&lt;/p&gt;




&lt;h2&gt;The Problem With a Simple Script&lt;/h2&gt;

&lt;p&gt;My first instinct was a single Python script — call Whisper, parse the transcript, run FFmpeg. Done.&lt;/p&gt;

&lt;p&gt;It worked. Until it didn't.&lt;/p&gt;

&lt;p&gt;When the LLM returned a bad clip selection, I had to re-run transcription. When FFmpeg failed on a weird video format, I lost the focus detection results. Debugging meant re-running everything from scratch every single time.&lt;/p&gt;

&lt;p&gt;I needed each step to be isolated. That's where LangGraph came in.&lt;/p&gt;




&lt;h2&gt;Why LangGraph&lt;/h2&gt;

&lt;p&gt;LangGraph lets you model a pipeline as a graph of discrete nodes that read from and write to a shared state. Instead of one big sequential script, the workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transcription → clip_selection → focus_detection → rendering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads only the state keys it needs&lt;/li&gt;
&lt;li&gt;Writes its output back to shared state&lt;/li&gt;
&lt;li&gt;Can be retried independently if it fails&lt;/li&gt;
&lt;li&gt;Can be tested in isolation without running the full graph&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point alone saved me hours of debugging. When clip selection was returning poor moments, I could feed it test transcripts directly without touching Whisper or FFmpeg.&lt;/p&gt;

&lt;p&gt;Conditional edges also let me add error handling cleanly — if focus detection fails, route to a fallback center-crop instead of crashing the whole pipeline.&lt;/p&gt;
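&lt;p&gt;To make the node idea concrete, here is a minimal sketch in plain Python. It deliberately does not use LangGraph's API; it only mimics the pattern of nodes returning partial state updates plus one conditional route. The function names, state keys and the stand-in logic are all hypothetical, not the code from my project.&lt;/p&gt;

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def clip_selection(state: State) -> State:
    # Hypothetical stand-in for the LLM call: pick the longest segment.
    segments = state["transcript"]
    best = max(segments, key=lambda s: s["end"] - s["start"])
    return {"clips": [{"start": best["start"], "end": best["end"]}]}


def focus_detection(state: State) -> State:
    # Hypothetical detector result; None models a failed detection.
    return {"focus_x": state.get("detected_face_x")}


def center_crop_fallback(state: State) -> State:
    # Fall back to the middle of the frame.
    return {"focus_x": 0.5}


def route_after_focus(state: State) -> str:
    # Conditional edge: pick the next node from the current state.
    if state.get("focus_x") is not None:
        return "rendering"
    return "center_crop_fallback"


def run_node(node: Callable[[State], State], state: State) -> State:
    # Merge the node's partial update into a copy of the shared state.
    return {**state, **node(state)}


# Each node can be exercised on its own with a hand-written state.
state = {"transcript": [{"start": 0.0, "end": 4.0}, {"start": 4.0, "end": 12.5}]}
state = run_node(clip_selection, state)
state = run_node(focus_detection, state)
if route_after_focus(state) == "center_crop_fallback":
    state = run_node(center_crop_fallback, state)
```

&lt;p&gt;Because every node is just a function of the state, feeding clip selection a test transcript needs nothing from Whisper or FFmpeg.&lt;/p&gt;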




&lt;h2&gt;The Full Pipeline&lt;/h2&gt;

&lt;h3&gt;Node 1 — Transcription&lt;/h3&gt;

&lt;p&gt;Pulls audio from the video (or YouTube URL via yt-dlp) and runs it through OpenAI Whisper locally. Output is a full transcript with word-level timestamps.&lt;/p&gt;

&lt;p&gt;Word-level timestamps are important — they let you map a selected text moment back to exact video timecodes for cutting.&lt;/p&gt;
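&lt;p&gt;Here is a rough sketch of that mapping. It assumes the transcript is available as a list of dicts with word, start and end keys (times in seconds); that shape and the helper name are assumptions for illustration, not necessarily what my pipeline stores.&lt;/p&gt;

```python
def find_span(words, phrase):
    """Return (start, end) in seconds for the first match of phrase, or None."""

    def norm(token):
        # Ignore case, surrounding whitespace and trailing punctuation.
        return token.strip().strip(".,!?;:").lower()

    target = [norm(t) for t in phrase.split()]
    tokens = [norm(w["word"]) for w in words]
    n = len(target)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            # Cut from the first matched word's start to the last one's end.
            return words[i]["start"], words[i + n - 1]["end"]
    return None


words = [
    {"word": "So", "start": 0.00, "end": 0.20},
    {"word": " I", "start": 0.20, "end": 0.30},
    {"word": " built", "start": 0.30, "end": 0.62},
    {"word": " a", "start": 0.62, "end": 0.70},
    {"word": " tool.", "start": 0.70, "end": 1.10},
]
span = find_span(words, "built a tool")
```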

&lt;h3&gt;Node 2 — Clip Selection&lt;/h3&gt;

&lt;p&gt;Sends the transcript to an LLM with a prompt asking it to identify the three most engaging moments. The model returns start/end timestamps and a brief reason for each selection.&lt;/p&gt;

&lt;p&gt;The prompt explicitly asks for moments that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have a clear beginning and end&lt;/li&gt;
&lt;li&gt;Make sense without surrounding context&lt;/li&gt;
&lt;li&gt;Would stop someone mid-scroll&lt;/li&gt;
&lt;/ul&gt;
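&lt;p&gt;Model output should not be trusted blindly, so a node like this benefits from a validation step. The sketch below assumes the model is asked to reply with a JSON array of objects with start, end and reason fields; that reply format and the function name are assumptions for illustration.&lt;/p&gt;

```python
import json


def parse_clips(reply, duration, max_clips=3):
    """Parse the model's JSON reply and keep only well-formed clips."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    clips = []
    for item in items:
        if not isinstance(item, dict):
            continue
        start, end = item.get("start"), item.get("end")
        if not isinstance(start, (int, float)) or not isinstance(end, (int, float)):
            continue
        # Keep clips that lie inside the video and have positive length.
        if start >= 0 and end > start and duration >= end:
            clips.append({
                "start": float(start),
                "end": float(end),
                "reason": str(item.get("reason", "")),
            })
    return clips[:max_clips]


reply = '[{"start": 12.0, "end": 41.5, "reason": "clear story"}, {"start": 90, "end": 80}]'
clips = parse_clips(reply, duration=120.0)
```

&lt;p&gt;An empty result is a natural signal to retry just this node instead of the whole pipeline.&lt;/p&gt;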

&lt;h3&gt;Node 3 — Focus Detection&lt;/h3&gt;

&lt;p&gt;For each selected clip, runs face/subject detection to find where the main subject is in the frame. This determines the crop position for the 9:16 vertical output.&lt;/p&gt;

&lt;p&gt;For single-speaker content this works well. Multi-person framing is still something I'm working on.&lt;/p&gt;
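&lt;p&gt;Turning a detected subject position into a crop is mostly arithmetic. This sketch assumes a landscape source and a detector that reports the subject's horizontal centre as a fraction of the frame width (0 to 1); both the input format and the function name are assumptions.&lt;/p&gt;

```python
def crop_window(frame_w, frame_h, focus_x):
    """Return (w, h, x, y) of a 9:16 crop centred on focus_x (0..1)."""
    crop_h = frame_h
    crop_w = int(frame_h * 9 / 16)
    crop_w -= crop_w % 2  # keep the width even, which many encoders expect
    crop_w = min(crop_w, frame_w)  # guard for sources narrower than 9:16
    centre = focus_x * frame_w
    x = int(round(centre - crop_w / 2))
    x = max(0, min(x, frame_w - crop_w))  # clamp the window inside the frame
    return crop_w, crop_h, x, 0


centred = crop_window(1920, 1080, 0.5)
near_right_edge = crop_window(1920, 1080, 0.95)
```

&lt;p&gt;The clamp is what keeps a subject near the edge from pushing the crop outside the frame, and a focus of 0.5 doubles as the center-crop fallback.&lt;/p&gt;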

&lt;h3&gt;Node 4 — Rendering&lt;/h3&gt;

&lt;p&gt;FFmpeg renders each clip with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;9:16 crop based on focus detection output&lt;/li&gt;
&lt;li&gt;Auto-generated captions burned into the video&lt;/li&gt;
&lt;li&gt;Output optimised for TikTok / Reels / Shorts&lt;/li&gt;
&lt;/ul&gt;
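&lt;p&gt;As a rough illustration, the render step can be expressed as building an FFmpeg argument list from the earlier outputs. The exact options below (1080x1920 scaling, libx264 and AAC, an SRT file whose timings are relative to the clip start) are assumptions for this sketch rather than my production settings, and the subtitles filter needs an FFmpeg build with libass.&lt;/p&gt;

```python
def build_ffmpeg_cmd(src, dst, start, end, crop, srt_path):
    """Build an argument list that trims, crops to 9:16 and burns in captions."""
    w, h, x, y = crop
    # crop=w:h:x:y, then scale to a common vertical size, then draw subtitles.
    # Assumes srt_path is a simple path with no characters that need escaping.
    vf = f"crop={w}:{h}:{x}:{y},scale=1080:1920,subtitles={srt_path}"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}", "-to", f"{end:.3f}",  # trim on the input side
        "-i", src,
        "-vf", vf,
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]


cmd = build_ffmpeg_cmd("in.mp4", "out.mp4", 12.0, 41.5, (606, 1080, 657, 0), "clip.srt")
```

&lt;p&gt;Passing the list to subprocess.run without a shell avoids quoting problems with file names.&lt;/p&gt;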




&lt;h2&gt;Real-Time Progress in the UI&lt;/h2&gt;

&lt;p&gt;One nice side effect of the LangGraph architecture: real-time progress updates came almost for free.&lt;/p&gt;

&lt;p&gt;As state moves through each node, the backend emits an event. The frontend listens and updates a progress indicator — so instead of staring at a loading spinner for 3 minutes, you watch the pipeline move:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✓ Transcription complete
✓ Clip moments identified
✓ Focus detection done
⏳ Rendering clips...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users told me this was the most reassuring part of the UX. Knowing something is actually happening makes the wait feel shorter.&lt;/p&gt;
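&lt;p&gt;The event flow can be sketched as a generator that yields one event per finished node. The transport here is Server-Sent Events, which is an assumption for the example; the event fields and function names are hypothetical too, and the nodes are trivial stand-ins.&lt;/p&gt;

```python
import json


def run_with_progress(nodes, state):
    """Run nodes in order, yielding one progress event per finished node."""
    for name, node in nodes:
        state = {**state, **node(state)}
        yield {"node": name, "status": "done"}


def to_sse(event):
    # A Server-Sent Events frame: a data line followed by a blank line.
    return "data: " + json.dumps(event) + "\n\n"


nodes = [
    ("transcription", lambda s: {"transcript": "hello world"}),
    ("clip_selection", lambda s: {"clips": [(0.0, 1.0)]}),
]
frames = [to_sse(event) for event in run_with_progress(nodes, {})]
```

&lt;p&gt;A streaming HTTP response can send these frames as they are produced, and the frontend ticks off one step per frame.&lt;/p&gt;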




&lt;h2&gt;Stack&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tech&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Next.js 14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transcription&lt;/td&gt;
&lt;td&gt;OpenAI Whisper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Video Processing&lt;/td&gt;
&lt;td&gt;FFmpeg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pipeline Orchestration&lt;/td&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage &amp;amp; Auth&lt;/td&gt;
&lt;td&gt;Supabase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YouTube ingestion&lt;/td&gt;
&lt;td&gt;yt-dlp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;What Works Well and What Doesn't&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Works well:&lt;/strong&gt;&lt;br&gt;
Talk-heavy content — podcasts, interviews, conference talks, lectures. The transcript is rich and the LLM picks genuinely good moments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still needs work:&lt;/strong&gt;&lt;br&gt;
B-roll-heavy videos where the visuals tell the story more than the words. The transcript alone doesn't capture what makes a moment visually compelling. This is the next problem I want to solve, probably with frame-level visual analysis alongside the transcript.&lt;/p&gt;

&lt;p&gt;Multi-person framing for focus detection is also rough. Single speaker is solid.&lt;/p&gt;




&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;It's free right now, no signup needed: &lt;a href="https://video-generator-six-coral.vercel.app/" rel="noopener noreferrer"&gt;https://video-generator-six-coral.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're curious about the LangGraph architecture or any part of the pipeline, ask in the comments — happy to go deeper on any of it.&lt;/p&gt;

&lt;p&gt;And if you try it on your own content, I'd genuinely love to know if the clip selection actually picks good moments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langgraph</category>
      <category>whisper</category>
      <category>ffmpeg</category>
    </item>
  </channel>
</rss>
