Extracting Video Poster Frames at Scale with Go, FFmpeg, and Scene Detection

#go #ffmpeg #video #backend

Why Poster Frames Matter

At DailyWatch, we surface a global feed of trending videos across regions in English. Every card in the grid needs a thumbnail that loads instantly and represents the actual content, not a black frame at t=0. YouTube provides default thumbnails, but for our internal preview pipeline we wanted full control: deterministic output, no remote dependencies after ingest, and a frame that actually shows a face or a real scene rather than a fade-in or end card.

That meant building a poster frame extractor. We chose Go for the orchestration layer because it sits next to our ingestion workers, talks to FFmpeg via os/exec cleanly, and gives us cheap concurrency for batching across cores.

The Pipeline at a Glance

The extractor runs one goroutine per video and does four things:

Probe the file with ffprobe to get duration, codec, and resolution.
Run FFmpeg with the scene detection filter to extract candidate frames.
Score each candidate by entropy, brightness, and face presence.
Write the winner as a JPEG at the target aspect ratio.

The whole thing is bounded: no allocation surprises, no leftover temp files, no zombie FFmpeg processes. That last part bit us early — we now wrap every invocation with a context timeout and a deferred kill.

Calling FFmpeg from Go

Here is the core invocation. The select filter keeps frames whose scene-change score exceeds 0.4, and showinfo logs each kept frame's timestamp to stderr, where we can parse it.

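import (
    "bytes"
    "context"
    "fmt"
    "os/exec"
    "path/filepath"
    "time"
)

// extractCandidates writes up to 12 scene-change frames from src into outDir
// and returns the paths of the JPEGs it finds there.
func extractCandidates(ctx context.Context, src, outDir string) ([]string, error) {
    // Hard deadline: CommandContext kills FFmpeg once ctx expires.
    ctx, cancel := context.WithTimeout(ctx, 90*time.Second)
    defer cancel()

    cmd := exec.CommandContext(ctx, "ffmpeg",
        "-i", src,
        "-vf", "select='gt(scene,0.4)',scale=640:-1,showinfo",
        "-vsync", "vfr",
        "-frames:v", "12",
        "-q:v", "3",
        filepath.Join(outDir, "cand_%03d.jpg"),
    )
    // showinfo output and FFmpeg diagnostics both arrive on stderr.
    var stderr bytes.Buffer
    cmd.Stderr = &stderr

    if err := cmd.Run(); err != nil {
        return nil, fmt.Errorf("ffmpeg: %w: %s", err, stderr.String())
    }
    return filepath.Glob(filepath.Join(outDir, "cand_*.jpg"))
}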

A few things worth noting:

CommandContext is non-negotiable. A 4K source with a corrupted moov atom can spin FFmpeg forever otherwise.
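-vsync vfr stops FFmpeg from duplicating frames to fill the gaps left by the select filter, so each output file is a distinct selected frame. Newer FFmpeg releases deprecate -vsync in favor of -fps_mode vfr.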
We cap candidates at 12 to bound disk and downstream CPU.
-q:v 3 is a good quality/size sweet spot for intermediate JPEGs we will rescore.
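To make the timestamp parsing concrete, here is a minimal sketch that pulls pts_time values out of captured stderr. It assumes the log lines carry the usual Parsed_showinfo prefix and a pts_time: field; the exact layout can vary between FFmpeg versions, so treat the pattern and the parseShowinfo name as assumptions to check against your own build.

```go
package main

import (
	"regexp"
	"strconv"
)

// showinfoRe matches a showinfo frame line and captures its pts_time value.
// [^\n]*? keeps the match on one line, so config lines without pts_time are skipped.
var showinfoRe = regexp.MustCompile(`Parsed_showinfo[^\n]*?pts_time:\s*(\d+(?:\.\d+)?)`)

// parseShowinfo returns the timestamps (in seconds) of the frames showinfo
// logged, in the order they appear in stderr.
func parseShowinfo(stderr string) []float64 {
	matches := showinfoRe.FindAllStringSubmatch(stderr, -1)
	times := make([]float64, 0, len(matches))
	for _, m := range matches {
		if t, err := strconv.ParseFloat(m[1], 64); err == nil {
			times = append(times, t)
		}
	}
	return times
}
```

Because showinfo sits after select in the filter chain, the nth timestamp should correspond to the nth candidate file, though it is worth verifying that the counts match before relying on the pairing.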

Scoring Candidates

Scene detection finds visually distinct frames, but distinct is not the same as good: a hard cut to a black screen scores high too. We rescore each candidate with a small Python worker that is started once per batch. It loads OpenCV once and processes a queue of paths, which is much cheaper than spawning Python per file.

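import json
import sys

import cv2
import numpy as np

# Frontal-face Haar cascade bundled with OpenCV; loaded once per worker.
FACE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def score(path: str) -> float:
    img = cv2.imread(path)
    if img is None:  # unreadable or missing file
        return 0.0
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Gate: reject near-black and near-white frames outright.
    brightness = float(gray.mean())
    if brightness < 25 or brightness > 235:
        return 0.0
    # Shannon entropy (in bits) of a 64-bin luminance histogram.
    hist = cv2.calcHist([gray], [0], None, [64], [0, 256]).flatten()
    hist /= max(hist.sum(), 1)
    nz = hist[hist > 0]
    entropy = float(-np.sum(nz * np.log2(nz)))
    # Each detected face adds 50% of the entropy score.
    faces = len(FACE.detectMultiScale(gray))
    return entropy * (1.0 + 0.5 * faces)

# One path per stdin line in, one JSON object per stdout line out.
for line in sys.stdin:
    p = line.strip()
    if p:
        print(json.dumps({"path": p, "score": score(p)}), flush=True)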

Why this works for us:

Brightness gates reject pure black or pure white frames cheaply.
Histogram entropy is a cheap proxy for visual richness: flat or near-uniform frames score low, and the coarse 64-bin histogram keeps mild sensor noise from inflating the score.
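Face boosts are a soft preference, not a hard requirement: many of our videos are vlogs or interviews, where a visible face makes the better poster, but a frame without faces can still win on entropy alone.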

Go reads JSON lines from the worker's stdout and picks the max. Keeping the Python process alive across the batch cut per-frame latency from ~180ms to ~22ms.

Wiring It Into PHP

Our public CMS is PHP. Once Go finishes, it writes the poster path to a row in a small SQLite database, which PHP picks up on the next page render. The handoff is intentionally boring:

function posterUrl(PDO $db, string $videoId): string {
    $stmt = $db->prepare('SELECT poster_path FROM video_assets WHERE video_id = ? LIMIT 1');
    $stmt->execute([$videoId]);
    $path = $stmt->fetchColumn();
    return $path ?: '/assets/poster-fallback.jpg';
}

No FFmpeg in the request path, no Python on the web tier. The extractor is a sidecar; the site stays a flat PHP read.

What Broke in Production

A few lessons that are not in the FFmpeg docs:

HLS sources need -allowed_extensions ALL when pointing at a local m3u8 with unusual segment names.
Rotation metadata lies. Some phone uploads carry a rotation tag that FFmpeg honors but OpenCV ignores. Normalize with transpose early, and pass -noautorotate when you do so the frame is not rotated twice.
Animated thumbnails are a different pipeline. Do not bolt them onto this — the scoring criteria are opposite (you want motion, not stillness).
Disk is the bottleneck. We moved candidate temp dirs to tmpfs and throughput doubled.
FFmpeg's exit code is not enough. Always parse stderr; a zero exit status can still mean FFmpeg wrote zero frames.

What's Next

The current extractor handles roughly 8,000 videos per hour per worker on a modest VM. We're swapping the Haar cascade for a small ONNX face model to cut false positives on hands and packaging text, and experimenting with perceptual hashing across candidates to deduplicate when the same scene gets selected twice from a slow pan.

If you are building something similar, the biggest unlock is treating FFmpeg as a coprocessor with a strict contract, not a black box you shell out to. Bound it, parse its stderr, kill it on timeout, and keep the scoring layer in a process that does not pay startup costs per frame.