Building a Video Thumbnail Generator Service with Go and FFmpeg

#go #ffmpeg #webdev #performance

Every viral video we ingest at ViralVidVault arrives without a usable preview image. YouTube and TikTok embeds give us one default poster frame, and it is almost always the worst possible choice: a black fade-in, a sponsor card, or a blurry first frame before the camera stabilizes. When you are running a European viral-video discovery feed where the thumbnail is the single biggest driver of click-through, shipping a bad frame is shipping a dead row in the grid.

We used to lean on the platform-provided poster. Click-through on the discovery grid sat around 3.1%. After we built an internal service that extracts several candidate frames per video, scores them, and serves a cropped WebP, that number moved to 5.4%. This article is the build log for that service: a small Go HTTP daemon that wraps FFmpeg, picks a good frame, and hands the result back to our PHP 8.4 / SQLite WAL stack for caching. No magic, just FFmpeg invoked carefully and a scoring heuristic that beats "grab the frame at 50%."

Why a Separate Go Service Instead of PHP

The rest of ViralVidVault runs on PHP 8.4 behind LiteSpeed, with SQLite in WAL mode as the primary store and Cloudflare Workers handling edge caching and GDPR-aware request shaping. PHP is great for the request/response cycle, but thumbnail generation is the wrong job for it:

It is CPU-bound and bursty. When a new trend breaks we might queue 400 videos in a few minutes. We do not want those FFmpeg processes competing with page renders inside the LiteSpeed worker pool.
It needs real concurrency. Go's goroutines and a bounded worker pool let us cap how many FFmpeg processes run at once, which matters because each one happily eats a full core.
Process supervision is cleaner. Shelling out to FFmpeg from PHP via proc_open works, but timeouts, zombie reaping, and stderr capture are fiddly. Go's os/exec with context.Context gives us cancellation for free.

So the architecture is: PHP enqueues a job (video URL or local path) over HTTP; the Go service downloads or reads the file, extracts candidate frames, scores them, encodes the winner to WebP, and returns the stored file path plus metadata. PHP stores the result and lets Cloudflare cache the delivered image.

Extracting Candidate Frames with FFmpeg

The naive approach is -ss 00:00:05 -frames:v 1. That gives you one frame at a fixed offset, and for a 12-second clip that lands you in dead air half the time. Instead we extract N evenly spaced candidates across the duration, skipping the first and last 10% where intros and outros live.

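Two FFmpeg details matter here. First, put -ss before -i for fast input seeking: FFmpeg jumps to a nearby keyframe through the container index instead of decoding every frame up to the timestamp. Second, add -noaccurate_seek when an approximate position is enough. It skips the decode-and-discard pass between that keyframe and the requested timestamp, so it is noticeably faster, and the frame you get can sit up to one keyframe interval before the offset, which is irrelevant for a thumbnail.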

Here is the core extraction in Go. We first probe duration with ffprobe, then fan out frame grabs:

package thumb

import (
    "context"
    "encoding/json"
    "fmt"
    "os/exec"
    "path/filepath"
    "strconv"
)

type probeResult struct {
    Format struct {
        Duration string `json:"duration"`
    } `json:"format"`
}

// probeDuration returns the clip length in seconds.
func probeDuration(ctx context.Context, path string) (float64, error) {
    cmd := exec.CommandContext(ctx, "ffprobe",
        "-v", "error",
        "-show_entries", "format=duration",
        "-of", "json", path)
    out, err := cmd.Output()
    if err != nil {
        return 0, fmt.Errorf("ffprobe: %w", err)
    }
    var pr probeResult
    if err := json.Unmarshal(out, &pr); err != nil {
        return 0, err
    }
    return strconv.ParseFloat(pr.Format.Duration, 64)
}

// extractFrame grabs a single JPEG at offset seconds into dir.
func extractFrame(ctx context.Context, src, dir string, offset float64, idx int) (string, error) {
    dst := filepath.Join(dir, fmt.Sprintf("cand_%02d.jpg", idx))
    cmd := exec.CommandContext(ctx, "ffmpeg",
        "-noaccurate_seek",
        "-ss", strconv.FormatFloat(offset, 'f', 3, 64),
        "-i", src,
        "-frames:v", "1",
        "-q:v", "3",
        "-y", dst)
    if err := cmd.Run(); err != nil {
        return "", fmt.Errorf("ffmpeg frame %d: %w", idx, err)
    }
    return dst, nil
}

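// CandidateFrames extracts n evenly spaced frames, skipping head/tail.
func CandidateFrames(ctx context.Context, src, dir string, n int) ([]string, error) {
    if n < 1 {
        return nil, fmt.Errorf("need at least one candidate, got %d", n)
    }
    dur, err := probeDuration(ctx, src)
    if err != nil {
        return nil, err
    }
    if dur <= 0 {
        return nil, fmt.Errorf("non-positive duration %.2f", dur)
    }
    start, end := dur*0.10, dur*0.90
    // Guard n == 1: dividing by n-1 would yield +Inf and a NaN offset.
    step := 0.0
    if n > 1 {
        step = (end - start) / float64(n-1)
    } else {
        start = dur * 0.50 // a single candidate comes from the midpoint
    }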

    frames := make([]string, 0, n)
    for i := 0; i < n; i++ {
        offset := start + step*float64(i)
        select {
        case <-ctx.Done():
            return frames, ctx.Err()
        default:
        }
        path, err := extractFrame(ctx, src, dir, offset, i)
        if err != nil {
            continue // skip a bad frame, keep the rest
        }
        frames = append(frames, path)
    }
    if len(frames) == 0 {
        return nil, fmt.Errorf("no frames extracted")
    }
    return frames, nil
}

A few production notes that cost us a day each to learn:

Always set a timeout via context.WithTimeout. A corrupt file can make FFmpeg hang forever. We cap each job at 30 seconds.
-q:v 3 for the intermediate JPEG keeps the candidates small but high enough quality to score reliably. We re-encode the winner to WebP later, so this is throwaway quality.
Skip, don't fail, on a single bad frame. Seeking near the end of a variable-frame-rate file occasionally returns nothing; you still want the other candidates.

Scoring Frames So We Don't Ship Black

Evenly spaced candidates solve "don't always grab 50%," but we still need to rank them. The two failure modes that kill click-through are dark frames (fades, night shots that read as black thumbnails) and flat frames (a static title card with no visual interest). We score each candidate on two cheap metrics:

Brightness — mean luminance. Penalize anything too dark or blown-out white.
Detail / contrast — standard deviation of luminance. A higher spread means more visual structure (faces, motion, scenery) versus a flat gradient.

We do not need ML for this. A weighted combination of luminance mean and variance, computed in pure Go over the decoded JPEG, beats the platform default reliably. Here is the scorer:

package thumb

import (
    "image"
    _ "image/jpeg"
    "math"
    "os"
)

type Score struct {
    Path       string
    Brightness float64 // 0..255 mean luma
    Detail     float64 // luma std-dev
    Total      float64
}

// scoreFrame computes brightness and detail for one candidate.
func scoreFrame(path string) (Score, error) {
    f, err := os.Open(path)
    if err != nil {
        return Score{}, err
    }
    defer f.Close()

    img, _, err := image.Decode(f)
    if err != nil {
        return Score{}, err
    }
    b := img.Bounds()

    // Sample on a grid; full-pixel scans are wasteful at thumbnail scale.
    const stride = 4
    var sum, sumSq, count float64
    for y := b.Min.Y; y < b.Max.Y; y += stride {
        for x := b.Min.X; x < b.Max.X; x += stride {
            r, g, bl, _ := img.At(x, y).RGBA()
            // ITU-R BT.601 luma, RGBA is 16-bit so shift to 8-bit.
            luma := 0.299*float64(r>>8) + 0.587*float64(g>>8) + 0.114*float64(bl>>8)
            sum += luma
            sumSq += luma * luma
            count++
        }
    }
    mean := sum / count
    variance := sumSq/count - mean*mean
    stdDev := math.Sqrt(math.Max(0, variance))

    // Brightness reward peaks around 120 (mid-tone) and falls off toward
    // pure black or pure white.
    brightScore := 1 - math.Abs(mean-120)/120
    if brightScore < 0 {
        brightScore = 0
    }
    // Detail reward, normalized; 70+ std-dev is plenty of structure.
    detailScore := math.Min(stdDev/70, 1)

    total := 0.45*brightScore + 0.55*detailScore
    return Score{Path: path, Brightness: mean, Detail: stdDev, Total: total}, nil
}

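// scoreError is a string-backed error type, so the sentinel below needs no
// extra imports.
type scoreError string

func (e scoreError) Error() string { return string(e) }

// errNoScorableFrames is returned when every candidate failed to decode.
const errNoScorableFrames = scoreError("no scorable frames")

// BestFrame returns the highest-scoring candidate path.
func BestFrame(paths []string) (Score, error) {
    var best Score
    best.Total = -1
    for _, p := range paths {
        s, err := scoreFrame(p)
        if err != nil {
            continue // unreadable candidate: skip it, keep the rest
        }
        if s.Total > best.Total {
            best = s
        }
    }
    if best.Total < 0 {
        return Score{}, errNoScorableFrames
    }
    return best, nil
}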

The grid sampling with stride = 4 matters at volume: scanning every pixel of even a 640px JPEG is wasteful when one-sixteenth of the pixels (every fourth pixel on each axis) gives the same ranking. We weight detail slightly higher than brightness (0.55 vs 0.45) because in our A/B testing a slightly dark but information-rich frame outperformed a perfectly lit title card almost every time.

Encoding the Winner to WebP

Once we have the best candidate, we crop it to our grid aspect ratio (16:9) and re-encode to WebP. WebP cut our thumbnail bytes by roughly 30% versus JPEG at visually identical quality, which matters when Cloudflare is billing egress and European mobile users on metered data are a big chunk of our traffic.

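We go back to FFmpeg for the crop-and-encode in one pass. The min(iw,ih*16/9) and min(ih,iw*9/16) expressions pick the largest 16:9 region that fits inside the source, and because the crop filter centers the region when no x/y offset is given, the result is centered regardless of source aspect ratio: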

// EncodeWebP crops to 16:9, scales to width, and writes WebP at quality q.
func EncodeWebP(ctx context.Context, src, dst string, width, q int) error {
    vf := fmt.Sprintf(
        "crop='min(iw,ih*16/9)':'min(ih,iw*9/16)',scale=%d:-1",
        width)
    cmd := exec.CommandContext(ctx, "ffmpeg",
        "-i", src,
        "-vf", vf,
        "-c:v", "libwebp",
        "-quality", strconv.Itoa(q),
        "-frames:v", "1",
        "-y", dst)
    out, err := cmd.CombinedOutput()
    if err != nil {
        return fmt.Errorf("webp encode: %w: %s", err, out)
    }
    return nil
}

We serve -quality 80 for the grid and a separate -quality 90 larger render for the video detail page. The bounded worker pool that drives all of this is the standard Go pattern — a buffered channel as a semaphore so we never run more FFmpeg processes than we have cores to spare:

// Pool caps concurrent FFmpeg jobs at size.
type Pool struct{ sem chan struct{} }

func NewPool(size int) *Pool { return &Pool{sem: make(chan struct{}, size)} }

func (p *Pool) Run(ctx context.Context, job func() error) error {
    select {
    case p.sem <- struct{}{}:
    case <-ctx.Done():
        return ctx.Err()
    }
    defer func() { <-p.sem }()
    return job()
}

We size the pool at runtime.NumCPU() - 1 so the box stays responsive. On a 4-core VPS that means three concurrent extractions, which clears a 400-video burst in a couple of minutes.

Wiring It Into the PHP and Cloudflare Stack

The Go service exposes one endpoint, POST /thumb, taking a video reference and returning JSON metadata plus a stored file path. PHP calls it, records the result in SQLite, and lets Cloudflare do the heavy lifting on delivery. Because SQLite runs in WAL mode, the thumbnail-metadata writes from the ingest worker do not block the read traffic serving the discovery grid — readers and the single writer coexist without lock contention.

Here is the PHP 8.4 side that requests a thumbnail and caches the metadata:

<?php
declare(strict_types=1);

final class ThumbnailClient
{
    public function __construct(
        private string $serviceUrl,
        private \PDO $db,
    ) {}

    public function ensure(string $videoId, string $sourcePath): ?string
    {
        $stmt = $this->db->prepare(
            'SELECT webp_path FROM thumbnails WHERE video_id = ?'
        );
        $stmt->execute([$videoId]);
        if ($path = $stmt->fetchColumn()) {
            return $path; // already generated
        }

        $ch = curl_init($this->serviceUrl . '/thumb');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 35,
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_POSTFIELDS     => json_encode([
                'video_id' => $videoId,
                'source'   => $sourcePath,
                'width'    => 640,
            ], JSON_THROW_ON_ERROR),
        ]);
        $raw = curl_exec($ch);
        if ($raw === false || curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
            curl_close($ch);
            return null; // fall back to platform poster
        }
        curl_close($ch);

        $data = json_decode($raw, true, 512, JSON_THROW_ON_ERROR);
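        // OR IGNORE: if a concurrent worker inserted first, the unique index
        // on video_id rejects this row instead of raising an exception.
        $ins = $this->db->prepare(
            'INSERT OR IGNORE INTO thumbnails (video_id, webp_path, brightness, detail)
             VALUES (?, ?, ?, ?)'
        );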
        $ins->execute([
            $videoId,
            $data['webp_path'],
            $data['brightness'],
            $data['detail'],
        ]);
        return $data['webp_path'];
    }
}

The delivery path is deliberately dumb: the generated WebP files sit behind a Cloudflare Worker that sets a long-lived Cache-Control header, and each file gets an immutable name keyed by video_id plus a content hash. Once a thumbnail is generated it is effectively free to serve: the origin (LiteSpeed) sees the request only on the first miss per edge location. From a GDPR standpoint this is clean: thumbnails carry no user data, the Worker strips any analytics cookies on the image route, and nothing about a visitor is logged to serve a static frame.

A few operational guardrails we added after running this in production:

Idempotency by video_id. The PHP ensure() check means a re-ingested video does not regenerate, and the INSERT is guarded by a unique index so concurrent ingest workers cannot double-write.
Graceful fallback. If the Go service is down or times out, ensure() returns null and the grid falls back to the platform poster. A degraded thumbnail beats a 500.
Disk hygiene. The candidate JPEGs go into a per-job temp directory that the Go service removes with a deferred os.RemoveAll. Only the final WebP survives. We learned this after a forgotten temp dir filled a disk overnight.
Reprocess hook. Because the score is stored, we can later requery for thumbnails with low detail and regenerate them with a wider candidate set — useful when we tune the scoring weights.

What Moved the Numbers

The single biggest win was not the WebP savings or the concurrency — it was the scoring heuristic refusing to ship dark and flat frames. Before scoring, roughly one in six auto-generated thumbnails was effectively black on the grid. After scoring, that dropped to near zero, and the discovery-grid click-through climbed from 3.1% to 5.4% over three weeks. The WebP re-encode then trimmed about 30% off image bytes, which on a mobile-heavy European audience showed up as a measurable drop in Largest Contentful Paint on the discovery page.

If you are building something similar, the order of impact is clear: get the frame selection right first, then optimize bytes and concurrency. FFmpeg does the hard part; the value you add is in not blindly trusting the first frame it hands you. A 90-line Go scorer beat every off-the-shelf default we tried, and it runs in single-digit milliseconds per candidate. Start there, measure click-through, and only reach for anything heavier if the cheap heuristic stops paying its way.