Mason K

Posted on May 31

Pick a better video thumbnail automatically with FFmpeg, PySceneDetect, and CLIP

#python #tutorial #video #ai

TL;DR

We'll build a pipeline that takes any video file, extracts candidate frames with FFmpeg and PySceneDetect, filters out blurry ones with OpenCV, scores each candidate with OpenCLIP against a small prompt set, and picks the top-K thumbnails with a diversity constraint. ~200 lines of Python, GPU-accelerated, fully local.

📦 Code: github.com/USER/auto-thumbnail-picker (replace before publishing)

The default thumbnail your encoder generates is "the middle frame." For most videos, the middle frame is a motion blur, a transition, or someone mid-blink. We can do much better with about an hour of effort. Here's the pipeline.

Versions

python 3.12
ffmpeg 7.1
pyscenedetect 0.6.x
open-clip-torch 2.x
opencv-python 4.x
torch 2.x (with CUDA or MPS)

The pipeline runs on CPU but is noticeably slower for the CLIP step. A consumer GPU (or Apple Silicon MPS) cuts the per-frame encode to a few milliseconds.

What we're building

Extract candidate frames (shot boundaries + uniform sampling).
Filter blurry frames out with Laplacian variance.
Score each remaining frame with OpenCLIP using a positive / negative prompt set.
Apply structural rules (prefer frames with faces).
Pick top-K with a time-based diversity constraint (no two picks within 2 seconds of each other).

1. Setup

python -m venv .venv && source .venv/bin/activate
pip install \
  open-clip-torch \
  "scenedetect[opencv-headless]" \
  opencv-python-headless \
  pillow \
  torch torchvision

For face detection we'll keep it simple and use OpenCV's built-in Haar cascades. If you need higher accuracy on small faces, swap to insightface later.

2. Extract candidate frames

Two sources of candidates: shot boundaries (one per shot) and uniform sampling (one every N seconds). Combine both, then dedupe by timestamp.

# extract.py
import subprocess, os, pathlib
from scenedetect import open_video, SceneManager, ContentDetector

def shot_boundaries(video_path: str) -> list[float]:
    """Return scene-cut timestamps (seconds) using PySceneDetect."""
    vid = open_video(video_path)
    sm = SceneManager()
    sm.add_detector(ContentDetector(threshold=27.0))
    sm.detect_scenes(vid, show_progress=False)
    scenes = sm.get_scene_list()
    # Take the start of each scene plus a small offset (avoid the literal cut frame).
    return [s[0].get_seconds() + 0.4 for s in scenes]

def uniform_samples(duration: float, every: float = 3.0) -> list[float]:
    # One timestamp every `every` seconds, starting at `every` (t=0 is skipped).
    return [i * every for i in range(1, int(duration // every) + 1)]

def extract_frame(video_path: str, t: float, out_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-y", "-ss", f"{t:.3f}", "-i", video_path,
        "-frames:v", "1", "-q:v", "2", out_path
    ], check=True, capture_output=True)

def probe_duration(video_path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=nokey=1:noprint_wrappers=1", video_path],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    return float(out)

def build_candidates(video_path: str, out_dir: str) -> list[tuple[float, str]]:
    os.makedirs(out_dir, exist_ok=True)
    dur = probe_duration(video_path)
    times = sorted(set(shot_boundaries(video_path) + uniform_samples(dur)))
    out = []
    for i, t in enumerate(times):
        p = pathlib.Path(out_dir) / f"f_{i:04d}_{t:07.2f}.jpg"
        extract_frame(video_path, t, str(p))
        out.append((t, str(p)))
    return out

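💡 Tip: putting -ss before -i makes ffmpeg seek on the input: it jumps to the nearest keyframe before the timestamp and, because we're re-encoding to JPEG rather than stream-copying, decodes forward to the requested time. That is fast and, in current ffmpeg builds, frame-accurate. The 0.4s offset after each cut is what keeps us clear of the transition frames.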
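
One caveat on the dedupe step: sorted(set(...)) only removes timestamps that are exactly equal, so a shot-boundary frame at 3.4s and a uniform sample at 3.0s both survive. If near-duplicates bother you, here is a minimal sketch of a tolerance-based dedupe. The dedupe_times helper and its min_gap default are assumptions for illustration, not part of the code above.

```python
def dedupe_times(times: list[float], min_gap: float = 0.5) -> list[float]:
    """Keep timestamps that are at least `min_gap` seconds after the last kept one.

    Hypothetical helper: `min_gap` is an illustrative default, tune it to taste.
    """
    kept: list[float] = []
    for t in sorted(times):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept


print(dedupe_times([3.0, 3.2, 6.0, 6.4, 9.0]))  # [3.0, 6.0, 9.0]
```

To try it, replace sorted(set(...)) in build_candidates with dedupe_times(...) on the combined list.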

Run it:

$ python -c "from extract import build_candidates; print(len(build_candidates('clip.mp4', 'frames/')))"
27

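Twenty-seven candidates for this clip. Manageable. The count grows with duration: at the default every=3.0, uniform sampling alone adds one frame per three seconds (about 120 for a six-minute video), so raise every for longer videos.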

3. Filter blurry frames

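CLIP is too tolerant of blur: a soft frame can still score well on content alone. We prune with Laplacian variance first, which is also far cheaper than a CLIP forward pass.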

# sharpness.py
import cv2

def is_sharp(image_path: str, threshold: float = 80.0) -> bool:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    return cv2.Laplacian(img, cv2.CV_64F).var() > threshold

For 1080p sources, a threshold around 80 works. For 720p, drop to ~60. Tune on a few samples from your library; the right cutoff depends on resolution, compression, and the camera you're shooting with.
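
One way to approach that tuning, sketched under the assumption that you've already collected the Laplacian variance for a sample of frames: pick the cutoff that rejects a fixed fraction of the blurriest ones. The suggest_threshold helper and the sample numbers below are made up for illustration.

```python
def suggest_threshold(scores: list[float], drop_fraction: float = 0.25) -> float:
    """Return a cutoff so that `score > cutoff` rejects the blurriest frames.

    Hypothetical helper: `scores` are Laplacian variances you measured yourself,
    and `drop_fraction` is the share of frames you are willing to discard.
    """
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    n_drop = int(len(ordered) * drop_fraction)
    if n_drop == 0:
        return 0.0  # too few samples to drop any
    # is_sharp uses a strict '>', so the n_drop-th lowest score is itself rejected.
    return ordered[n_drop - 1]


sample = [12.0, 35.0, 60.0, 95.0, 140.0, 210.0, 300.0, 420.0]
print(suggest_threshold(sample))  # 35.0
```

Pass the result as the threshold argument to is_sharp. A percentile is only a starting point; eyeball the frames on either side of the cutoff before trusting it.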

4. Score with OpenCLIP

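The trick is scoring each frame as positive_similarity - mean(negative_similarity), where positive_similarity is the best match among the positive prompts. The difference is more robust than the raw positive score: every frame has some baseline similarity to any prompt, driven by lighting and general content, and subtracting the negative similarity cancels most of that baseline.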

# score.py
import torch, open_clip
from PIL import Image

DEVICE = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")

MODEL_NAME = "ViT-L-14"
PRETRAINED = "laion2b_s32b_b82k"

POS = [
    "a sharp, well-lit frame from a video, showing the main subject clearly",
]
NEG = [
    "a blurry frame",
    "a black frame",
    "a transition with text overlay or graphic",
    "a frame caught between two shots",
]

class Scorer:
    def __init__(self) -> None:
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            MODEL_NAME, pretrained=PRETRAINED, device=DEVICE
        )
        self.model.eval()
        tokenizer = open_clip.get_tokenizer(MODEL_NAME)
        with torch.no_grad():
            self.pos = self._embed_text(tokenizer(POS))
            self.neg = self._embed_text(tokenizer(NEG))

    def _embed_text(self, tokens):
        with torch.no_grad():
            v = self.model.encode_text(tokens.to(DEVICE))
            return v / v.norm(dim=-1, keepdim=True)

    def score(self, image_path: str) -> float:
        img = self.preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            i = self.model.encode_image(img)
            i = i / i.norm(dim=-1, keepdim=True)
            pos_sim = (i @ self.pos.T).max().item()
            neg_sim = (i @ self.neg.T).mean().item()
        return pos_sim - neg_sim

⚠️ Note: load the model once and reuse. Reloading on every call is the most common reason people think "CLIP is slow"; it's actually the checkpoint load that's slow.

Quick sanity check:

$ python -c "from score import Scorer; s = Scorer(); print(s.score('frames/f_0010_0030.50.jpg'))"
0.0432

Numbers around 0 to 0.1 are typical. Bigger is better.
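
If the subtraction feels abstract, here is the same arithmetic on toy 2-D vectors in plain Python. It only illustrates the formula (max over positive prompts minus mean over negative prompts); the vectors are invented stand-ins for real CLIP embeddings, and contrast_score is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def contrast_score(img: list[float], pos: list[list[float]], neg: list[list[float]]) -> float:
    """Best positive similarity minus the mean negative similarity."""
    pos_sim = max(cosine(img, p) for p in pos)
    neg_sim = sum(cosine(img, n) for n in neg) / len(neg)
    return pos_sim - neg_sim


pos = [[1.0, 0.0]]              # toy "good frame" direction
neg = [[0.0, 1.0], [0.0, 1.0]]  # toy "bad frame" directions

print(contrast_score([1.0, 0.0], pos, neg))  # 1.0  (matches the positive prompt)
print(contrast_score([0.0, 1.0], pos, neg))  # -1.0 (matches the negative prompts)
print(contrast_score([1.0, 1.0], pos, neg))  # halfway between: the two terms cancel
```

Real scores sit in a much narrower band than this toy example, and they can dip below zero when a frame looks more like the negative prompts than the positive one.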

5. Face bonus

For UGC-style content the right thumbnail almost always has a clear face. We boost the score when one is present.

# faces.py
import cv2

_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def has_visible_face(image_path: str, min_size: int = 80) -> bool:
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    faces = _cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5,
                                      minSize=(min_size, min_size))
    return len(faces) > 0

6. Orchestrate

# pick.py
from extract import build_candidates
from sharpness import is_sharp
from faces import has_visible_face
from score import Scorer

def pick_thumbnails(video_path: str, out_dir: str, k: int = 3) -> list[dict]:
    candidates = build_candidates(video_path, out_dir)
    candidates = [(t, p) for t, p in candidates if is_sharp(p)]

    scorer = Scorer()
    rows = []
    for t, p in candidates:
        s = scorer.score(p)
        if has_visible_face(p):
            s += 0.05  # small bias toward faces
        rows.append({"t": t, "path": p, "score": s})
    rows.sort(key=lambda r: r["score"], reverse=True)

    # Diversity: don't pick frames that are within 2 seconds of each other.
    picked: list[dict] = []
    for r in rows:
        if any(abs(r["t"] - p["t"]) < 2.0 for p in picked):
            continue
        picked.append(r)
        if len(picked) >= k:
            break
    return picked

if __name__ == "__main__":
    import sys, json
    print(json.dumps(pick_thumbnails(sys.argv[1], "frames/", k=3), indent=2))

Run (output condensed for readability):

$ python pick.py clip.mp4
[
  { "t": 84.40, "path": "frames/f_0017_0084.40.jpg", "score": 0.1129 },
  { "t": 23.50, "path": "frames/f_0009_0023.50.jpg", "score": 0.0871 },
  { "t": 165.20, "path": "frames/f_0029_0165.20.jpg", "score": 0.0832 }
]

You now have three diverse, sharp, scored candidate thumbnails. Use the first for the default and keep the other two for A/B testing.
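
The 2-second rule can still return two frames from the same long shot. A sketch of a shot-level variant follows, assuming you keep the raw scene start times from PySceneDetect (without the 0.4s offset) and pass them in. The pick_per_shot name and the shot_starts argument are hypothetical.

```python
from bisect import bisect_right


def pick_per_shot(rows: list[dict], shot_starts: list[float], k: int = 3) -> list[dict]:
    """Pick up to k best-scoring rows, allowing at most one row per shot.

    `rows` are dicts with "t" and "score" keys, as built in pick.py.
    `shot_starts` are scene start times in seconds (an assumed extra input).
    """
    starts = sorted(shot_starts)
    picked: list[dict] = []
    used_shots: set[int] = set()
    for r in sorted(rows, key=lambda r: r["score"], reverse=True):
        # Index of the shot containing r["t"]; -1 means before the first start.
        shot = bisect_right(starts, r["t"]) - 1
        if shot in used_shots:
            continue
        used_shots.add(shot)
        picked.append(r)
        if len(picked) >= k:
            break
    return picked


rows = [
    {"t": 10.0, "path": "a.jpg", "score": 0.9},
    {"t": 12.0, "path": "b.jpg", "score": 0.8},
    {"t": 40.0, "path": "c.jpg", "score": 0.5},
]
print([r["t"] for r in pick_per_shot(rows, shot_starts=[0.0, 30.0], k=2)])  # [10.0, 40.0]
```

With the same input, the 2-second rule would have taken the frames at 10s and 12s, both from the first shot.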

7. (Optional) Crop to aspect ratio

If your product shows thumbnails at 16:9 but your source is 9:16 (vertical UGC), or vice versa, you want a smart crop, not a letterbox.

# crop.py
import cv2

def crop_to_ratio(image_path: str, out_path: str, ratio: tuple[int, int] = (16, 9)) -> None:
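    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None instead of raising on a missing/unreadable file.
        raise FileNotFoundError(f"could not read image: {image_path}")
    h, w = img.shape[:2]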
    target = ratio[0] / ratio[1]
    actual = w / h

    if actual > target:
        # too wide, crop sides
        new_w = int(h * target)
        x0 = (w - new_w) // 2
        cropped = img[:, x0:x0 + new_w]
    else:
        # too tall, crop top/bottom toward the upper third
        new_h = int(w / target)
        y0 = max(0, int(h * 0.20))
        y0 = min(y0, h - new_h)
        cropped = img[y0:y0 + new_h, :]
    cv2.imwrite(out_path, cropped)

The "upper third" bias for vertical content is intentional. Faces in vertical video almost always sit in the upper third; a centered crop loses them. For higher quality, swap this for a face-aware crop using the bounding box from step 5.
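
A rough sketch of that face-aware variant for the too-tall branch, assuming you already have one face box as (x, y, w, h) from detectMultiScale in step 5. Only the vertical offset changes. The face_aware_y0 name is hypothetical, and centering the face in the crop is one choice among several.

```python
def face_aware_y0(img_h: int, crop_h: int, face_y: int, face_h: int) -> int:
    """Top edge of a crop window of height crop_h, centered on the face.

    The result is clamped so the window stays inside the image.
    """
    face_center = face_y + face_h / 2
    y0 = int(face_center - crop_h / 2)
    return max(0, min(y0, img_h - crop_h))


# 1080x1920 vertical source cropped to 16:9 gives a 607-pixel-tall window.
print(face_aware_y0(1920, 607, face_y=300, face_h=200))   # 96
print(face_aware_y0(1920, 607, face_y=0, face_h=100))     # 0 (clamped to the top)
print(face_aware_y0(1920, 607, face_y=1800, face_h=100))  # 1313 (clamped to the bottom)
```

In crop_to_ratio, you would use this value in place of the fixed 20% offset whenever a face was detected, and fall back to the upper-third bias otherwise.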

8. Where to put this in your pipeline

Stage	Notes
Inline at upload	Easiest. Adds 5–30s to the "video processing" step.
Background worker (recommended)	Consume an "asset.ready" webhook, run, write back to your DB.
Re-run on edit	Re-compute when the user trims or reorders clips.

If you're using a managed video API, you'll already get an "asset ready" webhook. Hook a background worker to that and write the picked thumbnail back to your CMS row. Platforms such as FastPix and Mux offer ways to override the default thumbnail or poster; check your provider's docs for the exact mechanism.
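
The worker logic itself is small. Here is a minimal sketch with the picker and the database write injected as callables so it stays testable. The event shape (type, asset_id, video_path) is a hypothetical example, not any provider's real payload, and handle_webhook is an illustrative name.

```python
from typing import Callable


def handle_webhook(
    event: dict,
    pick: Callable[[str], list[dict]],
    save: Callable[[str, str], None],
) -> bool:
    """Process one webhook event; return True if a thumbnail was saved.

    `pick` maps a video path to ranked thumbnail rows (e.g. pick_thumbnails).
    `save` writes (asset_id, thumbnail_path) to your own storage.
    """
    if event.get("type") != "asset.ready":
        return False  # ignore every other event type
    picks = pick(event["video_path"])
    if not picks:
        return False  # every candidate was filtered out
    save(event["asset_id"], picks[0]["path"])
    return True


saved: dict[str, str] = {}
ok = handle_webhook(
    {"type": "asset.ready", "asset_id": "a1", "video_path": "clip.mp4"},
    pick=lambda path: [{"t": 84.4, "path": "frames/best.jpg", "score": 0.11}],
    save=lambda asset_id, thumb: saved.update({asset_id: thumb}),
)
print(ok, saved)  # True {'a1': 'frames/best.jpg'}
```

In a real worker, pick would wrap pick_thumbnails from step 6 and save would update your CMS row.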

What's next

Three upgrades that are worth their complexity:

Better face detector. Replace Haar with insightface for small-face robustness.
Learned scorer. If you have click-through data, fine-tune a small head on top of the CLIP embeddings. The hand-tuned prompts get you 80% of the way; learning the last 20% is real ROI.
Vision-language scoring. Ask a small VLM ("rate this thumbnail's appeal 1-10") instead of cosine similarity. More expensive, sometimes worth it for high-stakes content.

The thing this pipeline gets right is that the model is the cheap part. The leverage is the candidate selection, the structural filters, and the diversity constraint. Get those right with a vanilla CLIP and you're already beating most platforms' default thumbnails.

Happy thumbnail picking.

DEV Community