Code: github.com/USER/per-title-ladder (replace before publishing).
TL;DR
We'll take any source video, run a probe pass at six bitrate-resolution points, score each with VMAF, build a per-title ladder from the convex hull, and compare the bandwidth bill against a static 2019 ladder. End-to-end, ~80 lines of Python orchestrating FFmpeg 7.0 and libvmaf.
What we're building
A small CLI that you can run as:
python3 ladder.py samples/talking-head.mp4
…and that produces:
- A per-title ladder, tuned for that specific file.
- A side-by-side bandwidth comparison vs a static ladder.
- A `ladder.json` you can feed to your HLS packaging step.
You'll need FFmpeg 7.0+ built with libvmaf, Python 3.11+, and a source MP4. We're not training a perceptual model; VMAF's stock model is fine for a first cut.
1. Why one ladder doesn't fit everything
A static ladder picks the same bitrate-resolution rungs for every video. For a talking-head with no motion, those rungs are overkill at the top and oddly placed in the middle. For a sports clip, they're too sparse at the high end.
The idea behind per-title is: probe the content first, then place the rungs where they actually buy you visible quality. Same workflow Fraunhofer's "video-dev" group documents, same one Streaming Learning Center's recent piece on the evolving ladder walks through. The math gets fancy in production; for one file, you just need to score a handful of points and pick the best ones.
2. Install the toolchain
# Install FFmpeg with libvmaf on macOS (Homebrew's build includes libvmaf):
brew install ffmpeg
# Verify
ffmpeg -version | head -1
# ffmpeg version 7.0 ...
ffmpeg -filters | grep vmaf
# libvmaf VV->V Calculate the VMAF score
If your distro ships an older FFmpeg, get a 7.0+ static build from the official site or codexffmpeg on GitHub. The release notes for 7.0 ("Dijkstra") matter for this kind of work: the ffmpeg CLI is fully multithreaded as of 7.0, with demuxers, decoders, filters, encoders, and muxers all running in parallel, so probe passes complete materially faster on multi-core machines.
pip install --break-system-packages click
mkdir samples renditions
3. Define the test grid
Pick a small grid that spans the resolution/bitrate space:
# ladder.py
import json
import subprocess
from pathlib import Path
from dataclasses import dataclass
@dataclass(frozen=True)
class TestPoint:
    width: int
    height: int
    bitrate_kbps: int

    @property
    def label(self) -> str:
        return f"{self.height}p_{self.bitrate_kbps}k"

# A small grid: six points across resolution and bitrate.
TEST_GRID = [
    TestPoint(640, 360, 400),
    TestPoint(640, 360, 700),
    TestPoint(1280, 720, 1500),
    TestPoint(1280, 720, 2500),
    TestPoint(1920, 1080, 3500),
    TestPoint(1920, 1080, 5500),
]
For a production system you'd want more points (12–20 is typical). For a first pass, six is enough to demonstrate the technique. See the sketch below for one way to densify the grid.
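If you do want a denser probe, one low-effort approach is to cross a resolution list with a few bitrate multipliers. A sketch (the anchor bitrates and multipliers here are made-up starting points, not recommendations):

# Hypothetical denser grid: 3 resolutions x 4 multipliers = 12 points.
ANCHORS = {(640, 360): 550, (1280, 720): 2000, (1920, 1080): 4500}
MULTIPLIERS = [0.5, 0.75, 1.0, 1.5]

DENSE_GRID = [
    TestPoint(w, h, int(anchor * m))
    for (w, h), anchor in ANCHORS.items()
    for m in MULTIPLIERS
]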
4. Encode each test point
The encode is plain H.264 with a fixed maxrate/bufsize, which is what most production HLS ladders use today:
def encode_point(src: Path, point: TestPoint, out_dir: Path) -> Path:
    out = out_dir / f"{point.label}.mp4"
    if out.exists():
        return out
    subprocess.run([
        "ffmpeg", "-y", "-i", str(src),
        "-vf", f"scale={point.width}:{point.height}",
        "-c:v", "libx264",
        "-preset", "medium",
        "-b:v", f"{point.bitrate_kbps}k",
        "-maxrate", f"{point.bitrate_kbps}k",
        "-bufsize", f"{point.bitrate_kbps * 2}k",
        "-c:a", "aac", "-b:a", "96k",
        str(out),
    ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return out
Tip: `-preset medium` is the right balance for a probe pass. Going slower buys you a small quality gain but multiplies probe time. Going faster muddles the VMAF readings.
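If you want to sanity-check that tradeoff on your own hardware, a quick timing loop over presets does it. A sketch (the preset list and output naming are arbitrary; encode_point hard-codes medium, so this shells out directly):

import subprocess
import time
from pathlib import Path

def time_presets(src: Path, out_dir: Path) -> None:
    # Encode one mid-grid point (720p @ 1500k) at three presets and time each.
    for preset in ("fast", "medium", "slow"):
        out = out_dir / f"preset_{preset}.mp4"
        start = time.monotonic()
        subprocess.run([
            "ffmpeg", "-y", "-i", str(src),
            "-vf", "scale=1280:720",
            "-c:v", "libx264", "-preset", preset,
            "-b:v", "1500k", "-maxrate", "1500k", "-bufsize", "3000k",
            "-an", str(out),
        ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print(f"{preset}: {time.monotonic() - start:.1f}s")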
5. Score with VMAF
VMAF compares each encoded point against the source and gives you a single number from 0 to 100:
def vmaf_score(src: Path, encoded: Path) -> float:
    # libvmaf in FFmpeg 7.x: emit JSON via the `log_path` parameter.
    log_path = encoded.with_suffix(".vmaf.json")
    cmd = [
        "ffmpeg", "-i", str(encoded), "-i", str(src),
        "-lavfi",
        f"[0:v]scale=1920:1080:flags=bicubic[main];"
        f"[1:v]scale=1920:1080:flags=bicubic[ref];"
        f"[main][ref]libvmaf=log_path={log_path}:log_fmt=json",
        "-f", "null", "-",
    ]
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    data = json.loads(log_path.read_text())
    return float(data["pooled_metrics"]["vmaf"]["mean"])
Two notes here. First, we upscale both inputs to 1080p before comparing: VMAF compares same-size frames, so the encoded rendition gets scaled back up. Second, the JSON shape is the standard libvmaf output; the `pooled_metrics.vmaf.mean` field is the per-clip average.
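For reference, the slice of that JSON we parse looks roughly like this (abridged; the real file also carries per-frame scores and min/max/harmonic_mean pooling, and the number here is a placeholder):

{
  "pooled_metrics": {
    "vmaf": {
      "mean": 91.7
    }
  }
}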
Note: VMAF scores aren't directly comparable across very different content types. Animation vs sports vs talking-head produce different score distributions for the same bitrate. The relative ranking within a single source video is what matters here.
6. Pick the convex hull
The convex hull is the set of "best quality per bit" points across the bitrate-VMAF plane. Anything dominated by a point at lower bitrate gets dropped, and so does any rung whose extra bits barely move VMAF:
def convex_hull(points: list[tuple[int, float]], min_gain: float = 0.5) -> list[tuple[int, float]]:
    """Walk points in ascending bitrate order; keep a rung only if it
    beats the last kept rung by at least `min_gain` VMAF points."""
    hull = []
    for bitrate, vmaf in sorted(points):  # by bitrate ascending
        # Skip points dominated by a cheaper rung, and points whose
        # extra bits buy less than min_gain VMAF over the last kept rung.
        if hull and vmaf < hull[-1][1] + min_gain:
            continue
        hull.append((bitrate, vmaf))
    return hull
A real implementation would also enforce a quality floor (don't ship a 35-VMAF rung even if it's "on the hull") and resolution sanity (don't put two 360p rungs back to back). For the tutorial version, the dominance check plus the min_gain threshold above is enough.
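As a taste of what those constraints look like, here's a sketch that applies a VMAF floor and caps the rung count (the floor of 70 and cap of 5 are arbitrary defaults; a production system would tune both per catalog):

def constrain(hull: list[tuple[int, float]], floor: float = 70.0, max_rungs: int = 5) -> list[tuple[int, float]]:
    # Drop rungs below the quality floor.
    kept = [(b, v) for b, v in hull if v >= floor]
    # If too many rungs survive, keep the cheapest, the best, and
    # evenly spaced rungs in between.
    if len(kept) > max_rungs:
        step = (len(kept) - 1) / (max_rungs - 1)
        kept = [kept[round(i * step)] for i in range(max_rungs)]
    return kept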
7. Build the final ladder
Now wire it all together:
import click

@click.command()
@click.argument("src", type=click.Path(exists=True, path_type=Path))
def main(src: Path):
    out_dir = Path("renditions") / src.stem
    out_dir.mkdir(parents=True, exist_ok=True)

    scored = []
    for point in TEST_GRID:
        print(f"encoding {point.label}...")
        encoded = encode_point(src, point, out_dir)
        vmaf = vmaf_score(src, encoded)
        scored.append((point, vmaf))
        print(f"  VMAF: {vmaf:.1f}")

    hull_input = [(p.bitrate_kbps, v) for p, v in scored]
    hull = convex_hull(hull_input)

    ladder = []
    for bitrate, vmaf in hull:
        chosen = next(p for p in TEST_GRID if p.bitrate_kbps == bitrate)
        ladder.append({
            "height": chosen.height,
            "width": chosen.width,
            "bitrate_kbps": chosen.bitrate_kbps,
            "vmaf": round(vmaf, 1),
        })

    out_json = src.with_suffix(".ladder.json")
    out_json.write_text(json.dumps({"src": str(src), "ladder": ladder}, indent=2))
    print(f"\nwrote {out_json}")
    for rung in ladder:
        print(f"  {rung['height']}p @ {rung['bitrate_kbps']}k → VMAF {rung['vmaf']}")

if __name__ == "__main__":
    main()
Run it on a talking-head video:
python3 ladder.py samples/talking-head.mp4
Expect output like:
encoding 360p_400k...
VMAF: 71.4
encoding 360p_700k...
VMAF: 82.1
encoding 720p_1500k...
VMAF: 91.7
encoding 720p_2500k...
VMAF: 94.2
encoding 1080p_3500k...
VMAF: 95.1
encoding 1080p_5500k...
VMAF: 95.3
wrote samples/talking-head.ladder.json
  360p @ 400k → VMAF 71.4
  360p @ 700k → VMAF 82.1
  720p @ 1500k → VMAF 91.7
  720p @ 2500k → VMAF 94.2
  1080p @ 3500k → VMAF 95.1
Notice the 5500k rung dropped off the hull: it cost more bits but barely moved VMAF. That's the savings.
8. Compare against a static ladder
The "Apple-style" static ladder for 1080p sources typically looks like:
| Rung | Resolution | Bitrate |
|---|---|---|
| 1 | 416×234 | 145k |
| 2 | 640×360 | 365k |
| 3 | 768×432 | 730k |
| 4 | 960×540 | 2000k |
| 5 | 1280×720 | 3000k |
| 6 | 1920×1080 | 4500k |
| 7 | 1920×1080 | 6000k |
Take the static ladder's top-rung bitrate, subtract your per-title top rung, multiply the difference by minutes streamed, and the egress delta falls out. For talking-head content you'll routinely see the top rung drop from 6000k to 3500k, a roughly 40% savings at the same perceived quality.
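To make that concrete, a back-of-envelope version (the viewing volume is hypothetical, and it assumes every viewer pulls the top rung, which overstates real-world savings):

# Top-rung delta: 6000 kbps static vs 3500 kbps per-title.
delta_kbps = 6000 - 3500                    # 2500 kbps saved
bytes_per_min = delta_kbps * 1000 / 8 * 60  # 18.75 MB per streamed minute
minutes = 1_000_000                         # hypothetical monthly minutes
tb_saved = bytes_per_min * minutes / 1e12   # ~18.75 TB of egress per month
print(f"{tb_saved:.1f} TB/month saved")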
For high-motion content, the savings shrink and sometimes invert. Per-title isn't a magic bullet; it's a tool that pays off most for uniform low-motion content and long-tail mixed catalogs. Run it on three different titles and you'll feel the variance.
9. Where to go from here
A few honest follow-ups:
- Add more test points. Six is enough to show the technique. Twelve to twenty is closer to production.
- Constrain the ladder. Add a minimum VMAF (e.g., 70) and a maximum number of rungs (e.g., 5).
- Move to live. The Christian Doppler ATHENA group has a patent (US20230388511A1) on perceptually-aware online per-title for live streaming; the technique generalizes, with engineering.
- Skip the build, buy the result. Several managed video APIs now do context-aware encoding by default. If you're not in the "tens of thousands of minutes per month" range, just use one of those; the engineering math doesn't work in your favor below that volume.
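And to close the loop on the ladder.json from step 7: feeding it to a packaging step is mechanical. A minimal sketch using FFmpeg's hls muxer (one variant playlist per rung; the segment naming and six-second target duration are arbitrary choices):

import json
import subprocess
from pathlib import Path

def package_hls(src: Path, ladder_json: Path, out_dir: Path) -> None:
    ladder = json.loads(ladder_json.read_text())["ladder"]
    out_dir.mkdir(parents=True, exist_ok=True)
    for rung in ladder:
        name = f"{rung['height']}p_{rung['bitrate_kbps']}k"
        subprocess.run([
            "ffmpeg", "-y", "-i", str(src),
            "-vf", f"scale={rung['width']}:{rung['height']}",
            "-c:v", "libx264", "-b:v", f"{rung['bitrate_kbps']}k",
            "-c:a", "aac", "-b:a", "96k",
            "-hls_time", "6", "-hls_playlist_type", "vod",
            "-hls_segment_filename", str(out_dir / f"{name}_%03d.ts"),
            str(out_dir / f"{name}.m3u8"),
        ], check=True)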
The point of the tutorial isn't to convince you to build a production per-title system. It's to give you the numbers to argue about. Once you can run the probe on five titles from your own library and see the bandwidth delta, you'll know whether the engineering investment is worth it.
Wrapping up
Per-title encoding is one of those topics where the marketing copy makes it sound magical and the implementation is honestly pretty boring. Probe, score, pick the hull, ship. The hardest part is convincing the team that the static ladder they copy-pasted in 2019 is silently costing them money in 2026.
If you run this on your own catalog and find that it doesn't move the needle β great, that's also a result. Static ladders work fine for uniform content. The point is that you'll have data instead of vibes.