Abhisek Mishra

Posted on May 24 • Edited on Jun 13

How I built an AI video clipping pipeline with LangGraph, Whisper and FFmpeg

#ai #langgraph #whisper #ffmpeg

I kept avoiding clipping my own content.

Not because I didn't want short clips. I did. But the process was genuinely painful — scrub through a long video, find a good moment, trim it, crop for vertical, add captions, export. Repeat three times. Two hours gone.

So I built a tool that does the whole thing automatically.

Here's how it works under the hood.

The Problem With a Simple Script

My first instinct was a single Python script — call Whisper, parse the transcript, run FFmpeg. Done.

It worked. Until it didn't.

When the LLM returned a bad clip selection, I had to re-run transcription. When FFmpeg failed on a weird video format, I lost the focus detection results. Debugging meant re-running everything from scratch every single time.

I needed each step to be isolated. That's where LangGraph came in.

Why LangGraph

LangGraph lets you model a pipeline as a graph of discrete nodes that read from and write to a shared state. Instead of one big sequential script, the workflow looks like this:

transcription → clip_selection → focus_detection → rendering

Each node:

Reads only the parts of the state it needs
Writes its output back to shared state
Can be retried independently if it fails
Can be tested in isolation without running the full graph

That last point alone saved me hours of debugging. When clip selection was returning poor moments, I could feed it test transcripts directly without touching Whisper or FFmpeg.
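To make the isolation idea concrete, here is a minimal plain-Python sketch. It is not the LangGraph API and not the tool's actual code. It assumes each node is a function that takes the shared state and returns only the keys it produced. The names `run_node`, `make_select_clips` and `fake_llm` are hypothetical.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_node(node: Node, state: State, max_attempts: int = 3) -> State:
    """Run one node with retries and merge its output into a copy of the state.

    Earlier results (for example the transcript) are never recomputed:
    a retry only re-runs this node against the state it was handed.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return {**state, **node(state)}
        except Exception as exc:  # broad on purpose, this is only a sketch
            last_error = exc
    name = getattr(node, "__name__", "node")
    raise RuntimeError(f"{name} failed after {max_attempts} attempts") from last_error


def make_select_clips(llm: Callable[[str], List[dict]]) -> Node:
    """Build a clip-selection node around whatever LLM callable is injected."""
    def select_clips(state: State) -> State:
        return {"clips": llm(state["transcript"])}
    return select_clips


def fake_llm(transcript: str) -> List[dict]:
    """Stub standing in for the real model call, so no Whisper or FFmpeg is needed."""
    return [{"start": 1.0, "end": 9.5, "reason": "stub"}]


# Exercise clip selection alone with a hand-written transcript.
state = run_node(make_select_clips(fake_llm), {"transcript": "hello world"})
print(state["clips"])
```

Because the model call is injected, the same node can run against a stub in tests and against the real LLM in the pipeline.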

Conditional edges also let me add error handling cleanly — if focus detection fails, route to a fallback center-crop instead of crashing the whole pipeline.
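The routing logic behind that fallback can be sketched in plain Python. In LangGraph itself this would be wired up with `add_conditional_edges` on the graph builder. The tiny runner below only mimics the idea, and every name in it (`detect_focus`, `center_crop`, `route_after_focus`, the state keys) is an assumption for illustration.

```python
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


def detect_focus(state: State) -> State:
    """Hypothetical detector: records an error in state instead of raising."""
    face = state.get("face_box")  # (x, y, w, h), or None when nothing was found
    if face is None:
        return {"focus_error": "no subject found"}
    x, _, w, _ = face
    return {"focus_x": x + w / 2}


def center_crop(state: State) -> State:
    """Fallback: aim the crop at the horizontal center of the frame."""
    return {"focus_x": state["frame_width"] / 2}


def render(state: State) -> State:
    """Placeholder for the rendering node; it just records what it was given."""
    return {"rendered_with_focus_x": state["focus_x"]}


def route_after_focus(state: State) -> str:
    """Choose the next node from the state, as a conditional edge would."""
    return "center_crop" if "focus_error" in state else "render"


NODES: Dict[str, Callable[[State], State]] = {
    "detect_focus": detect_focus,
    "center_crop": center_crop,
    "render": render,
}
# Static edges: node -> next node. None marks the end of the graph.
EDGES: Dict[str, Optional[str]] = {"center_crop": "render", "render": None}
# Conditional edges: node -> function that picks the next node.
ROUTERS: Dict[str, Callable[[State], str]] = {"detect_focus": route_after_focus}


def run(state: State, start: str = "detect_focus") -> State:
    current: Optional[str] = start
    while current is not None:
        state = {**state, **NODES[current](state)}
        router = ROUTERS.get(current)
        current = router(state) if router else EDGES[current]
    return state


print(run({"frame_width": 1920, "face_box": None})["rendered_with_focus_x"])
```

The failure becomes a value in the state rather than an exception, so the router can steer around it and the clips still get rendered.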

The Full Pipeline

Node 1 — Transcription

Pulls audio from the video (or YouTube URL via yt-dlp) and runs it through OpenAI Whisper locally. Output is a full transcript with word-level timestamps.

Word-level timestamps are important — they let you map a selected text moment back to exact video timecodes for cutting.
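As a rough illustration of that mapping, the sketch below looks up a phrase in a list of timed words and returns its start and end in seconds. The word shape (`word`, `start`, `end`) mirrors what the open-source `whisper` package returns per segment when `word_timestamps=True` is passed to `transcribe`. The helper name `find_timecodes` and the sample data are made up for this example.

```python
import re
from typing import List, Optional, Tuple, TypedDict


class Word(TypedDict):
    word: str
    start: float
    end: float


def _norm(token: str) -> str:
    """Lowercase and strip punctuation so 'Tool.' matches 'tool'."""
    return re.sub(r"[^\w']", "", token.lower())


def find_timecodes(words: List[Word], phrase: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds for the first occurrence of phrase, or None."""
    target = [t for t in map(_norm, phrase.split()) if t]
    # Pair each normalised token with its word, dropping punctuation-only tokens.
    spoken = [(t, w) for t, w in ((_norm(w["word"]), w) for w in words) if t]
    if not target:
        return None
    size = len(target)
    for i in range(len(spoken) - size + 1):
        if [t for t, _ in spoken[i:i + size]] == target:
            return spoken[i][1]["start"], spoken[i + size - 1][1]["end"]
    return None


sample: List[Word] = [
    {"word": " So", "start": 12.0, "end": 12.2},
    {"word": " I", "start": 12.2, "end": 12.3},
    {"word": " built", "start": 12.4, "end": 12.7},
    {"word": " a", "start": 12.7, "end": 12.8},
    {"word": " tool.", "start": 12.8, "end": 13.3},
]
print(find_timecodes(sample, "built a tool"))
```

Exact matching like this is brittle if the LLM paraphrases the transcript, which is one reason to ask the model for timestamps directly and use a lookup only as a cross-check.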

Node 2 — Clip Selection

Sends the transcript to an LLM with a prompt asking it to identify the 3 most engaging moments. The model returns start/end timestamps and a brief reason for each selection.

The prompt explicitly asks for moments that:

Have a clear beginning and end
Make sense without surrounding context
Would stop someone mid-scroll
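A prompt along these lines, plus a strict parser for the reply, might look like the sketch below. The prompt wording, the JSON reply format and the names `build_prompt` and `parse_clips` are assumptions rather than the tool's real prompt. Raising `ValueError` on a bad reply gives the node something it can retry on.

```python
import json
from typing import List, TypedDict


class Clip(TypedDict):
    start: float
    end: float
    reason: str


PROMPT_TEMPLATE = """You are selecting short clips from a video transcript.
Pick the {n} most engaging moments. Each moment must:
- have a clear beginning and end
- make sense without surrounding context
- be likely to stop someone mid-scroll

Reply with only a JSON array of objects with keys "start", "end" (seconds) and "reason".

Transcript (with timestamps):
{transcript}
"""


def build_prompt(transcript: str, n: int = 3) -> str:
    return PROMPT_TEMPLATE.format(n=n, transcript=transcript)


def parse_clips(raw: str, duration: float, n: int = 3) -> List[Clip]:
    """Validate the model's reply and raise ValueError if it cannot be used."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    clips: List[Clip] = []
    for item in items[:n]:
        try:
            start, end = float(item["start"]), float(item["end"])
            reason = str(item.get("reason", ""))
        except (AttributeError, KeyError, TypeError, ValueError) as exc:
            raise ValueError(f"malformed clip entry: {item!r}") from exc
        # Reject timestamps that are reversed or fall outside the video.
        if not (0 <= start < end <= duration):
            raise ValueError(f"clip out of range: {start}-{end}")
        clips.append({"start": start, "end": end, "reason": reason})
    return clips


reply = '[{"start": 12.5, "end": 40, "reason": "strong hook"}]'
print(parse_clips(reply, duration=100.0))
```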

Node 3 — Focus Detection

For each selected clip, runs face/subject detection to find where the main subject is in the frame. This determines the crop position for the 9:16 vertical output.

For single-speaker content this works well. Multi-person framing is still something I'm working on.
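Turning a detected subject position into a crop window is mostly arithmetic. The sketch below assumes a landscape source, a full-height crop, and a single horizontal focus point `focus_x` in pixels. The function name `vertical_crop` and the clamping policy are assumptions for illustration.

```python
from typing import Tuple


def vertical_crop(frame_w: int, frame_h: int, focus_x: float) -> Tuple[int, int, int, int]:
    """Return (w, h, x, y) for a full-height 9:16 crop centered on focus_x."""
    crop_h = frame_h
    crop_w = int(frame_h * 9 / 16)
    crop_w -= crop_w % 2  # keep the width even, which yuv420p H.264 encoding needs
    if crop_w > frame_w:
        raise ValueError("source is narrower than a 9:16 window at full height")
    x = int(round(focus_x - crop_w / 2))
    x = max(0, min(x, frame_w - crop_w))  # clamp so the window stays inside the frame
    return crop_w, crop_h, x, 0


# 1920x1080 source with the subject at the horizontal center.
print(vertical_crop(1920, 1080, 960))  # (606, 1080, 657, 0)
```

For a 1920x1080 frame the window is 606 pixels wide, so a subject near either edge gets a crop pinned to that edge rather than centered on them.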

Node 4 — Rendering

FFmpeg renders each clip with:

9:16 crop based on focus detection output
Auto-generated captions burned into the video
Output optimised for TikTok / Reels / Shorts
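Here is one way the render step could be assembled, as a sketch rather than the tool's actual command. It builds an SRT caption file from timed words and an FFmpeg argument list that uses the `crop`, `scale` and `subtitles` filters. It assumes an FFmpeg build with libx264 and libass, a 1080x1920 target, and a subtitle path with no characters that need escaping in a filter string. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: List[Dict], clip_start: float, clip_end: float, per_line: int = 4) -> str:
    """Group the words inside the clip into short captions timed from clip start."""
    inside = [w for w in words if w["start"] >= clip_start and w["end"] <= clip_end]
    blocks = []
    for n, i in enumerate(range(0, len(inside), per_line), start=1):
        chunk = inside[i:i + per_line]
        t0 = srt_timestamp(chunk[0]["start"] - clip_start)
        t1 = srt_timestamp(chunk[-1]["end"] - clip_start)
        text = " ".join(w["word"].strip() for w in chunk)
        blocks.append(f"{n}\n{t0} --> {t1}\n{text}\n")
    return "\n".join(blocks)


def build_ffmpeg_cmd(src: str, dst: str, start: float, end: float,
                     crop: Tuple[int, int, int, int], srt_path: str) -> List[str]:
    """Build the argument list for one clip; pass it to subprocess.run to render."""
    w, h, x, y = crop
    vf = f"crop={w}:{h}:{x}:{y},scale=1080:1920,subtitles={srt_path}"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",       # input seek, so output timestamps restart at 0
        "-i", src,
        "-t", f"{end - start:.3f}",  # clip duration
        "-vf", vf,
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]


words = [
    {"word": " Hi", "start": 10.0, "end": 10.4},
    {"word": " there", "start": 10.4, "end": 10.9},
]
print(words_to_srt(words, clip_start=10.0, clip_end=12.0))
print(build_ffmpeg_cmd("in.mp4", "clip1.mp4", 10.0, 12.0, (606, 1080, 657, 0), "clip1.srt"))
```

Because `-ss` sits before `-i`, the clip's timeline starts at zero, which is why the captions are written relative to the clip start instead of the original video.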

Real-Time Progress in the UI

One nice side effect of the LangGraph architecture: real-time progress updates came almost for free.

As state moves through each node, the backend emits an event. The frontend listens and updates a progress indicator — so instead of staring at a loading spinner for 3 minutes, you watch the pipeline move:

✓ Transcription complete
✓ Clip moments identified
✓ Focus detection done
⏳ Rendering clips...
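The event side of this can be sketched as a generator that runs the nodes in order and yields one message per finished node. The transport here is an assumption: the messages are formatted as Server-Sent Events frames, which a FastAPI `StreamingResponse` with the `text/event-stream` media type could send to the browser. In a real LangGraph setup the compiled graph's `stream` method would supply the per-node updates instead of this hand-rolled loop. `run_with_progress` and the payload keys are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterator, List, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


def run_with_progress(nodes: List[Tuple[str, Node]], state: State) -> Iterator[str]:
    """Run nodes in order and yield one Server-Sent Events frame per finished node."""
    total = len(nodes)
    for done, (name, node) in enumerate(nodes, start=1):
        state = {**state, **node(state)}
        payload = {"node": name, "done": done, "total": total}
        yield f"data: {json.dumps(payload)}\n\n"


# Stand-in nodes; the real ones would call Whisper, the LLM, and so on.
pipeline: List[Tuple[str, Node]] = [
    ("transcription", lambda s: {"transcript": "..."}),
    ("clip_selection", lambda s: {"clips": []}),
]

for frame in run_with_progress(pipeline, {}):
    print(frame, end="")
```

On the frontend, each frame's `node` value maps to one line of the checklist, and `done` over `total` drives the progress indicator.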

Users told me this was the most reassuring part of the UX. Knowing something is actually happening makes the wait feel shorter.

Stack

Layer	Tech
Backend	FastAPI
Frontend	Next.js 14
Transcription	OpenAI Whisper
Video Processing	FFmpeg
Pipeline Orchestration	LangGraph
Storage & Auth	Supabase
YouTube ingestion	yt-dlp

What Works Well and What Doesn't

Works well:
Talk-heavy content — podcasts, interviews, conference talks, lectures. The transcript is rich and the LLM picks genuinely good moments.

Still needs work:
B-roll-heavy videos where the visual tells the story more than the words. The transcript alone doesn't capture what makes a moment visually compelling. This is the next problem I want to solve — probably with frame-level visual analysis alongside the transcript.

Multi-person framing for focus detection is also rough. Single speaker is solid.

Try It

It's free right now, no signup needed: https://video-generator-six-coral.vercel.app/

If you're curious about the LangGraph architecture or any part of the pipeline, ask in the comments — happy to go deeper on any of it.

And if you try it on your own content, I'd genuinely love to know if the clip selection actually picks good moments.
