<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Garry Williams</title>
    <description>The latest articles on DEV Community by Garry Williams (@garrywilliamss).</description>
    <link>https://dev.to/garrywilliamss</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3930136%2Ff5b0ef49-ba09-4809-a900-02034f768dce.webp</url>
      <title>DEV Community: Garry Williams</title>
      <link>https://dev.to/garrywilliamss</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/garrywilliamss"/>
    <language>en</language>
    <item>
      <title>How LumiClip Finds the Best Moments in Your Video and Reframes Them for Mobile</title>
      <dc:creator>Garry Williams</dc:creator>
      <pubDate>Wed, 13 May 2026 21:49:53 +0000</pubDate>
      <link>https://dev.to/garrywilliamss/how-lumiclip-finds-the-best-moments-in-your-video-and-reframes-them-for-mobile-2hlh</link>
      <guid>https://dev.to/garrywilliamss/how-lumiclip-finds-the-best-moments-in-your-video-and-reframes-them-for-mobile-2hlh</guid>
      <description>&lt;p&gt;When someone uploads an hour-long podcast or a Twitch VOD to &lt;a href="https://lumiclip.ai" rel="noopener noreferrer"&gt;LumiClip&lt;/a&gt;, they expect ten short, vertical, ready-to-post clips back. Two pipelines do the heavy lifting: a highlight finder that decides what's worth clipping, and a reframer that turns landscape footage into something that looks native on a phone screen.&lt;br&gt;
Here's how each one actually works under the hood.&lt;/p&gt;

&lt;h2&gt;The core problem with asking one model to do everything&lt;/h2&gt;

&lt;p&gt;The first thing we tried was the obvious thing: prompt a capable LLM with the transcript and ask it to find the best clips. The signal-to-noise ratio was terrible. A model looking at a raw hour-long transcript has no spatial sense of the video, no understanding of energy or pacing, and no way to know that two candidate clips are basically the same moment from different angles.&lt;/p&gt;

&lt;p&gt;So we scrapped that and built a small assembly line instead. Each step is cheap, focused, and only passes its survivors to the next stage. By the time the most capable model runs, it's looking at a curated shortlist rather than raw noise.&lt;/p&gt;

&lt;h2&gt;The Highlight Pipeline&lt;/h2&gt;

&lt;h3&gt;Step 1 — Transcribe with Deepgram Nova-3&lt;/h3&gt;

&lt;p&gt;Word-level timestamps, speaker diarization, and utterance boundaries are the substrate for everything downstream. Long sources get split into chunks, transcribed in parallel, then merged back into a single timeline. Nova-3 is fast enough that this doesn't become the bottleneck even on 3-hour VODs.&lt;/p&gt;
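&lt;p&gt;The merge is mostly timestamp bookkeeping: each chunk comes back with chunk-relative word timings, so you add the chunk's offset before concatenating. A minimal sketch of that step (the &lt;code&gt;transcribe_chunk&lt;/code&gt; helper is a hypothetical stand-in for the actual Deepgram Nova-3 call, and the chunk size is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor

CHUNK_SECONDS = 600  # illustrative 10-minute chunks

def transcribe_chunk(path, start, end):
    """Hypothetical wrapper around one Deepgram Nova-3 request.

    Assumed to return word dicts like
    {"word": str, "start": float, "end": float, "speaker": int}
    with timestamps relative to the chunk start.
    """
    raise NotImplementedError

def transcribe(path, duration):
    # Fan chunks out in parallel, then merge into one global timeline.
    spans = [(s, min(s + CHUNK_SECONDS, duration))
             for s in range(0, int(duration), CHUNK_SECONDS)]
    with ThreadPoolExecutor() as pool:
        chunk_words = pool.map(lambda sp: transcribe_chunk(path, *sp), spans)

    merged = []
    for (offset, _), words in zip(spans, chunk_words):
        for w in words:
            # Shift chunk-relative timestamps onto the global timeline.
            merged.append({**w, "start": w["start"] + offset,
                           "end": w["end"] + offset})
    return merged
&lt;/code&gt;&lt;/pre&gt;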
&lt;h3&gt;Step 2 — Classify the video type&lt;/h3&gt;

&lt;p&gt;Seven evenly-spaced frames go to a multimodal classifier — a small, fast vision model — that returns one of four buckets: dialogue, screenshare, gaming, or action. This decision changes everything downstream. A podcast doesn't need the same clip-selection heuristics as a Call of Duty stream. A screenshare tutorial has completely different "good moment" criteria than a two-person interview.&lt;/p&gt;

&lt;p&gt;This single classification step rules out the wrong heuristics before any expensive processing runs.&lt;/p&gt;
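&lt;p&gt;The frame sampling itself is a few lines with OpenCV; the classifier is abstracted here as a hypothetical &lt;code&gt;classify_frames&lt;/code&gt; call:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import cv2  # pip install opencv-python

def sample_frames(path, n=7):
    """Grab n evenly-spaced frames from across the whole video."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        # Center each sample in its slice to avoid black intro/outro frames.
        idx = int((i + 0.5) * total / n)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Hypothetical downstream call:
# label = classify_frames(sample_frames("vod.mp4"))
# label is one of: "dialogue", "screenshare", "gaming", "action"
&lt;/code&gt;&lt;/pre&gt;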
&lt;h3&gt;Step 3 — Topic-segment the transcript&lt;/h3&gt;

&lt;p&gt;A second LLM call walks the merged transcript and breaks it into topic blocks — coherent runs of related speech. We score each segment on three axes: how self-contained it is, how hooky the opening is, and how emotionally salient the content is.&lt;/p&gt;

&lt;p&gt;This is where most of the junk gets filtered. A five-minute tangent that goes nowhere scores poorly on self-containment. A mid-sentence cut scores poorly on hooks. Only segments that clear all three thresholds move forward.&lt;/p&gt;
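&lt;p&gt;In code, the gate is just three scores against three thresholds. A sketch with made-up threshold values (the real ones are tuned, not these):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    text: str
    self_contained: float  # 0-1: survives with no surrounding context?
    hook: float            # 0-1: does the opening grab attention?
    salience: float        # 0-1: emotional weight of the content

# Illustrative thresholds only.
MIN_SELF_CONTAINED, MIN_HOOK, MIN_SALIENCE = 0.6, 0.5, 0.5

def passes_gate(seg):
    """A segment moves forward only if it clears all three axes."""
    return (seg.self_contained &gt;= MIN_SELF_CONTAINED
            and seg.hook &gt;= MIN_HOOK
            and seg.salience &gt;= MIN_SALIENCE)

def filter_segments(segments):
    return [s for s in segments if passes_gate(s)]
&lt;/code&gt;&lt;/pre&gt;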
&lt;h3&gt;Step 4 — Score candidate highlights&lt;/h3&gt;

&lt;p&gt;A scoring model evaluates each candidate against criteria like: opens strong, has tension, has payoff, would survive being seen with no setup. Anything below a hard quality floor gets dropped before the next step ever sees it.&lt;/p&gt;

&lt;p&gt;This is the most expensive step in the pipeline. The reason we can afford to run a capable model here is that by this point we've gone from hours of raw content to maybe 15-20 candidate segments. The classifier and topic segmenter did the cheap filtering work so this step can do the quality work.&lt;/p&gt;
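&lt;p&gt;The criteria translate naturally into a structured rubric the scoring model fills in per candidate. A hedged sketch: the field names mirror the criteria above, but the floor value and the weakest-criterion aggregation are our illustration, not the production logic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;RUBRIC_PROMPT = (
    "Score this clip candidate from 0 to 10 on each axis. "
    'Reply as JSON: {"opens_strong": n, "tension": n, '
    '"payoff": n, "survives_no_setup": n}'
)

QUALITY_FLOOR = 6.0  # illustrative hard floor

def clears_floor(scores):
    """Drop a candidate if its weakest criterion is below the floor."""
    return min(scores.values()) &gt;= QUALITY_FLOOR
&lt;/code&gt;&lt;/pre&gt;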
&lt;h3&gt;Step 5 — Final selection&lt;/h3&gt;

&lt;p&gt;A final pass picks the best non-overlapping clips, respects per-tier caps (we ship ten clips per project), and assigns each a viral-score hint that surfaces in the dashboard. The non-overlapping constraint is important — without it you get five clips that are all variations of the same thirty-second moment.&lt;/p&gt;
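&lt;p&gt;Non-overlap selection under a cap is a classic greedy pass: walk the candidates best-score-first and take a clip only if it doesn't intersect anything already taken. A minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def select_clips(candidates, cap=10):
    """Greedy non-overlapping selection.

    candidates: list of (start, end, score) tuples.
    """
    chosen = []
    for start, end, score in sorted(candidates, key=lambda c: -c[2]):
        # Two intervals overlap iff each starts before the other ends.
        if any(start &lt; e and s &lt; end for s, e, _ in chosen):
            continue
        chosen.append((start, end, score))
        if len(chosen) == cap:
            break
    return sorted(chosen)  # return in timeline order
&lt;/code&gt;&lt;/pre&gt;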
&lt;h3&gt;Step 6 — Generate the hook&lt;/h3&gt;

&lt;p&gt;Each clip gets a three-to-seven-word punch line generated by a model that has seen only that clip's transcript. Short, declarative, optimized for the first second of attention. This runs last because you want the hook to reflect what the clip actually is, not what you hoped it would be.&lt;/p&gt;
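&lt;p&gt;The isolation is the important part: the hook model's context contains one clip transcript and nothing else. A hypothetical prompt in that spirit (the wording is ours, not LumiClip's):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;HOOK_PROMPT = (
    "Below is the transcript of one short clip. You know nothing else "
    "about the source video. Write a hook: 3 to 7 words, declarative, "
    "describing what this clip actually contains.\n\n"
    "Transcript:\n{transcript}"
)
&lt;/code&gt;&lt;/pre&gt;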

&lt;h2&gt;Why layering matters for cost and quality&lt;/h2&gt;

&lt;p&gt;The reason the pipeline is structured this way is cost containment without quality sacrifice.&lt;/p&gt;

&lt;p&gt;A small classifier rules out 90% of the work for a screenshare video instantly. A topic segmenter narrows hours of speech to tens of candidates cheaply. Only the survivors get the expensive scoring pass. Running a capable model on a raw hour-long transcript for every upload would be both slow and expensive. Running it on a pre-filtered shortlist of 15 candidates is fast and affordable.&lt;/p&gt;

&lt;p&gt;The quality argument runs the same way: a model looking at 15 curated candidates makes much better decisions than a model drowning in 200 possible segments.&lt;/p&gt;

&lt;h2&gt;The Reframing Pipeline&lt;/h2&gt;

&lt;p&gt;Landscape video on a 9:16 phone screen has a brutal math problem: a full-height 9:16 crop of a 16:9 frame keeps only (9/16) ÷ (16/9) ≈ 32% of the pixels, so roughly two-thirds of the frame is off-canvas (on a 1920×1080 source, the crop window is just 608×1080). The naive fix — a centered static crop — works for a stationary podcaster in the middle of the frame. It fails the moment anyone moves, looks at a side monitor, or shares the frame with a second person sitting apart from them.&lt;/p&gt;

&lt;p&gt;This was the hardest problem to solve well, and the one most other tools consistently get wrong.&lt;/p&gt;
&lt;h3&gt;Step 1 — Face detection on keyframes&lt;/h3&gt;

&lt;p&gt;We run InsightFace's buffalo_l model on sampled keyframes to get bounding boxes plus a face embedding per detection. Sampling keyframes rather than every frame keeps this fast without losing tracking fidelity — faces don't teleport between adjacent frames.&lt;/p&gt;
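&lt;p&gt;A minimal detection pass with the standard &lt;code&gt;insightface&lt;/code&gt; package looks roughly like this; keyframe selection is simplified here to fixed-interval sampling:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # detection + embedding models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 for GPU, -1 for CPU

def detect_faces(path, every_n=10):
    """Detect on every Nth frame; returns (frame_idx, bbox, embedding)."""
    cap = cv2.VideoCapture(path)
    detections, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            for face in app.get(frame):
                detections.append((idx, face.bbox, face.normed_embedding))
        idx += 1
    cap.release()
    return detections
&lt;/code&gt;&lt;/pre&gt;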
&lt;h3&gt;Step 2 — Identify the active speaker&lt;/h3&gt;

&lt;p&gt;Face embeddings let us cluster detections into persistent identities across the clip. We combine this with the diarization data from Deepgram to know not just where faces are, but which face is currently speaking. The active speaker gets priority in the crop target calculation.&lt;/p&gt;

&lt;p&gt;This is the step that makes two-person interview reframing work. Without speaker identification, the crop just averages the two face positions and ends up centered between them — which means neither person is properly in frame. With it, the crop follows whoever is actually talking.&lt;/p&gt;
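&lt;p&gt;One plausible way to wire the two signals together (the post doesn't spell out the exact method) is a co-occurrence vote: for each diarized speaker, count which face identity is on screen while they talk. A sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter, defaultdict

def map_speakers_to_faces(turns, detections):
    """Vote on which face identity co-occurs with each diarized speaker.

    turns:      (start, end, speaker_id) from Deepgram diarization
    detections: (timestamp, identity_id) after embedding clustering
    """
    votes = defaultdict(Counter)
    for t, identity in detections:
        for start, end, speaker in turns:
            if start &lt;= t &lt;= end:
                votes[speaker][identity] += 1
    return {spk: c.most_common(1)[0][0] for spk, c in votes.items()}
&lt;/code&gt;&lt;/pre&gt;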
&lt;h3&gt;Step 3 — Smooth the crop path&lt;/h3&gt;

&lt;p&gt;Raw frame-by-frame crop targets are jittery. Apply them directly and the result is constant, distracting shake. We run the crop coordinates through a smoothing pass that respects the natural movement of the speaker while eliminating micro-jitter. The goal is motion that feels like a camera operator following the subject, not a bounding box chasing pixels.&lt;/p&gt;
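&lt;p&gt;The post doesn't name the exact filter, but an exponential moving average with a small dead-zone produces the described behavior: real movement is followed, micro-jitter is ignored. A sketch on the horizontal crop center:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def smooth_crop_path(xs, alpha=0.15, dead_zone=8.0):
    """Smooth per-keyframe crop centers (pixel coordinates).

    alpha:     smoothing strength; lower means a steadier camera
    dead_zone: ignore target moves smaller than this many pixels
    """
    smoothed = [xs[0]]
    for x in xs[1:]:
        prev = smoothed[-1]
        if abs(x - prev) &lt; dead_zone:
            smoothed.append(prev)                       # hold: micro-jitter
        else:
            smoothed.append(prev + alpha * (x - prev))  # ease toward target
    return smoothed
&lt;/code&gt;&lt;/pre&gt;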
&lt;h3&gt;Step 4 — Handle edge cases&lt;/h3&gt;

&lt;p&gt;Some frames have no detectable face — the speaker looked down, the camera cut away, there's a B-roll insert. We hold the last known crop position through short gaps and interpolate smoothly back when the face reappears. For longer gaps we fall back to a content-aware center crop rather than freezing awkwardly on the last known position.&lt;/p&gt;
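&lt;p&gt;In sketch form, with an illustrative cutoff between "short" and "long" gaps (the real threshold and the content-aware fallback are more involved than a few lines):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;MAX_HOLD = 45  # illustrative: about 1.5 s of missing face at 30 fps

def fill_gaps(targets, frame_center):
    """Replace None crop targets: hold briefly, then drift to center."""
    out, gap, last = [], 0, frame_center
    for t in targets:
        if t is None:
            gap += 1
            if gap &gt; MAX_HOLD:
                last += 0.1 * (frame_center - last)  # long gap: drift home
            out.append(last)
        else:
            gap, last = 0, t
            out.append(t)
    # The Step 3 smoothing pass then eases the jump when a face reappears.
    return out
&lt;/code&gt;&lt;/pre&gt;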

&lt;h2&gt;What we got wrong first&lt;/h2&gt;

&lt;p&gt;The first version of the reframer used a single static crop calculated from the average face position across the whole clip. It looked fine in demos with a stationary speaker and broke on everything else.&lt;/p&gt;

&lt;p&gt;The first version of the highlight pipeline had no video type classification. It applied the same dialogue heuristics to a gaming stream and generated clips of someone quietly farming resources with no audio spike. The scoring model had no idea those clips were bad because it was optimized for spoken content.&lt;/p&gt;

&lt;p&gt;Both mistakes came from the same root cause: building for the clean case and being surprised by real footage. Real footage is messy, unpredictable, and almost never looks like your test videos.&lt;/p&gt;

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;The video type classifier currently handles four buckets. We're working on expanding it and adding per-type scoring models that understand what "good" actually means for each format — a good gaming clip has completely different properties from a good podcast clip, and a single scoring model trying to handle both will always be compromised.&lt;/p&gt;

&lt;p&gt;The reframer handles single-camera footage well. Multi-camera footage with cuts is the next hard problem.&lt;/p&gt;

&lt;p&gt;If you've built something similar or have thoughts on the pipeline, we'd love to hear about it in the comments. And if you want to see what the output actually looks like on your own content: &lt;a href="https://lumiclip.ai" rel="noopener noreferrer"&gt;lumiclip.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
