Mason K

Posted on May 21

Wiring VMAF (and PSNR) into your encoder CI with FFmpeg 8.1 and ffmpeg-quality-metrics

#video #ffmpeg #tutorial #devops

📦 Code: github.com/USER/encoder-qa-ci (replace before publishing)

TL;DR

We are going to wire a perceptual-quality gate into a CI workflow using FFmpeg 8.1.1, libvmaf, and the ffmpeg-quality-metrics Python wrapper. The job runs PSNR, SSIM, and VMAF against a fixed reference ladder and fails the merge if VMAF drops below threshold. Works on CPU; the same setup runs roughly 6x faster on GPU with VMAF-CUDA.

PSNR has been the default "is my encoder okay" metric in CI pipelines for a decade, and it is starting to show its age. It cannot tell when an encoder buys lower raw pixel error at the cost of perceptual quality, and that is exactly the failure mode per-title and ML-driven encoders walk into. Let's wire up something better.

We will build this in four steps:

Stand up a tiny test bench with a reference ladder.
Run PSNR + SSIM + VMAF with ffmpeg-quality-metrics.
Wrap the whole thing in a CI script that fails on regression.
Run that script from a GitHub Actions job on every pull request.

🛠️ 1. The test bench

You need three things: a reference master, a set of encoded renditions ("distorted" in metric-speak), and a fixed VMAF model file.

# bash
mkdir encoder-qa && cd encoder-qa
mkdir reference distorted models

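# A short, representative reference. Pick a clip that matches your worst-case content:
# face-heavy, motion, fine detail. The "TearsOfSteel" or "BigBuckBunny" clips work for demos.
# Note: this demo file is an already-compressed H.264 MP4 saved under a .mov name. FFmpeg probes
# the container from the content, so it works, but use a real mezzanine master in production.
curl -L -o reference/master.mov \
  https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_5MB.mp4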

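Drop the VMAF model into models/. The vmaf_v0.6.1.json model is the most widely cited Netflix VMAF model; there is also vmaf_4k_v0.6.1.json for 4K viewing, and a phone mode which, as far as I know, current libvmaf applies as a transform on the default model rather than shipping as a separate file. Pin the one you use, exactly: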

# bash (vendor the model into the repo so CI doesn't fetch it over the network)
git clone --depth 1 https://github.com/Netflix/vmaf.git /tmp/vmaf
cp /tmp/vmaf/model/vmaf_v0.6.1.json models/

💡 Tip: treat the model file like a lockfile. Different VMAF model versions produce different scores. If you upgrade, you re-baseline every threshold.
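One way to enforce the lockfile idea is to record the model's SHA-256 alongside it and have CI refuse to run if the file changes. A minimal sketch: the function names are mine, and where you store the expected digest (a config file, a constant in the gate script) is up to you; none of this is part of libvmaf or the wrapper.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def model_is_pinned(path: Path, expected_sha256: str) -> bool:
    """True only if the model file exists and matches the recorded digest."""
    return path.is_file() and sha256_of(path) == expected_sha256.lower()
```

Call model_is_pinned(Path("models/vmaf_v0.6.1.json"), EXPECTED) at the top of the gate and exit non-zero when it returns False, so a silent model swap fails loudly instead of shifting every score.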

Now generate a reference ladder. We use FFmpeg 8.1.1; older versions are fine but you want at least 6.1 for VMAF-CUDA, and 8.x for the cleanest webvtt/libsvtav1 integration.

# bash (produce three renditions for the test ladder)
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 23 -vf scale=1280:720  distorted/720p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 26 -vf scale=854:480   distorted/480p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 28 -vf scale=640:360   distorted/360p.mp4
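If you would rather keep the ladder definition in one place, the three commands above can be generated from a table. This is a sketch: LADDER and ladder_commands are illustrative names, the flags simply mirror the commands above, and the added -y (overwrite without prompting) is my assumption about what you want in a non-interactive CI run.

```python
# (name, width, height, crf) for each rung; values mirror the shell commands.
LADDER = [
    ("720p", 1280, 720, 23),
    ("480p", 854, 480, 26),
    ("360p", 640, 360, 28),
]

def ladder_commands(reference: str, out_dir: str = "distorted") -> list[list[str]]:
    """Build one ffmpeg argv list per rung without running anything."""
    commands = []
    for name, width, height, crf in LADDER:
        commands.append([
            "ffmpeg", "-y", "-i", reference,
            "-c:v", "libx264", "-preset", "medium", "-crf", str(crf),
            "-vf", f"scale={width}:{height}",
            f"{out_dir}/{name}.mp4",
        ])
    return commands
```

Each list can be handed to subprocess.run(cmd, check=True). Keeping the table next to the thresholds makes it harder for the ladder and the gate to drift apart.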

Verify the FFmpeg version. The libvmaf and libsvtav1 wiring tightened up across the 8.0 → 8.1 line:

# bash
$ ffmpeg -version | head -1
ffmpeg version 8.1.1 ...
$ ffmpeg -filters | grep -E "libvmaf|psnr|ssim"
 .. libvmaf            VV->V       Calculate the VMAF between two video streams.
 .. psnr               VV->V       Calculate the PSNR between two video streams.
 .. ssim               VV->V       Calculate the SSIM between two video streams.

If libvmaf is not listed, your FFmpeg was built without --enable-libvmaf. Not every distro package enables it, so check rather than assume; static builds from BtbN/FFmpeg-Builds include it by default.
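The same check is easy to automate so the CI job fails early with a clear message instead of halfway through a metrics run. A small sketch that parses the text printed by ffmpeg -filters; the helper name and the REQUIRED_FILTERS tuple are my own.

```python
REQUIRED_FILTERS = ("libvmaf", "psnr", "ssim")

def missing_filters(filters_output: str, required=REQUIRED_FILTERS) -> list[str]:
    """Return the required filter names absent from `ffmpeg -filters` output."""
    available = set()
    for line in filters_output.splitlines():
        parts = line.split()
        # Filter rows look like: "<flags> <name> <inputs->outputs> <description...>"
        if len(parts) >= 3 and "->" in parts[2]:
            available.add(parts[1])
    return [name for name in required if name not in available]
```

Feed it the stdout of subprocess.run(["ffmpeg", "-hide_banner", "-filters"], capture_output=True, text=True) and abort if the returned list is non-empty.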

📐 2. Running the metrics

You can call ffmpeg -lavfi libvmaf... directly, but the output format is awkward to parse, and you end up writing the same Python wrapper everyone else has. Use ffmpeg-quality-metrics (slhck/ffmpeg-quality-metrics, still actively maintained). It runs all three metrics in one pass, emits JSON, and handles the model-path plumbing for you.
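For comparison, this is roughly what the direct route involves. A sketch, assuming an FFmpeg built against libvmaf 2.x, where the filter accepts model=path=... and can write a JSON log containing a pooled_metrics section; the helper names are mine, and the model path must not contain characters that need escaping in a filter graph.

```python
import json

def libvmaf_command(distorted: str, reference: str, model: str, log_path: str) -> list[str]:
    """Build an ffmpeg argv that scores `distorted` against `reference`.

    Both inputs must already share resolution and frame rate; the libvmaf
    filter takes the distorted stream first and the reference second.
    """
    graph = (
        f"[0:v][1:v]libvmaf=model=path={model}"
        f":log_fmt=json:log_path={log_path}"
    )
    return ["ffmpeg", "-i", distorted, "-i", reference,
            "-lavfi", graph, "-f", "null", "-"]

def pooled_vmaf(log_text: str) -> dict:
    """Pull the pooled VMAF statistics out of a libvmaf JSON log."""
    return json.loads(log_text)["pooled_metrics"]["vmaf"]
```

That covers VMAF only. PSNR, SSIM, scaling, and frame alignment are each more plumbing on top, which is the argument for using the wrapper.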

# bash
pip install ffmpeg-quality-metrics

A single rendition through the gate:

# bash
ffmpeg-quality-metrics distorted/720p.mp4 reference/master.mov \
  --metrics psnr ssim vmaf \
  --vmaf-model-path models/vmaf_v0.6.1.json \
  --output-format json > metrics_720p.json

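The output looks like this (truncated and flattened for readability; depending on the wrapper version, each statistic may be nested as an object with average/min/max fields, so inspect the JSON your installed version emits before hard-coding keys):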

{
  "global": {
    "psnr": { "psnr_avg": 41.82, "psnr_min": 36.41, "psnr_max": 45.97 },
    "ssim": { "ssim_avg": 0.978 },
    "vmaf": { "vmaf_avg": 88.7, "vmaf_min": 71.2, "vmaf_max": 96.4 }
  },
  "input_file_dist": "distorted/720p.mp4",
  "input_file_ref": "reference/master.mov"
}
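The exact nesting of global is worth verifying against your installed version, and a small normalizer keeps the gate from depending on it. This is a sketch only. It accepts the flat shape shown above, and also a nested shape in which each statistic is an object keyed by average/min/max. That nested layout is an assumption on my part, not something confirmed here, so compare it against real output first.

```python
SUFFIXES = {"average": "avg", "min": "min", "max": "max"}

def flatten_global(global_block: dict) -> dict:
    """Flatten metric families into a single {"vmaf_avg": 88.7, ...} dict."""
    flat = {}
    for family in global_block.values():
        for key, value in family.items():
            if isinstance(value, dict):
                # Nested form (assumed): {"vmaf": {"average": ..., "min": ...}}
                base = key[:-4] if key.endswith("_avg") else key
                for stat, suffix in SUFFIXES.items():
                    if stat in value:
                        flat[f"{base}_{suffix}"] = value[stat]
            else:
                # Flat form, as in the sample above: {"vmaf_avg": 88.7}
                flat[key] = value
    return flat
```

With this in place, a gate can look up flat["vmaf_avg"] or flat["ssim_avg"] regardless of which of the two shapes the JSON used.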

global is the aggregate. The wrapper also reports per-frame scores (check ffmpeg-quality-metrics --help for how your version exposes them in the JSON), and that is the data you want when an encoder regresses and you need to find which 200 frames lost 8 points.
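Finding "which 200 frames" is a sliding-window minimum over the per-frame scores. A minimal sketch, assuming you have already pulled the per-frame VMAF values into a plain list of floats (how you extract them depends on the JSON layout your wrapper version produces); worst_window is a hypothetical helper name.

```python
def worst_window(scores: list[float], size: int) -> tuple[int, float]:
    """Return (start_index, mean) of the lowest-scoring run of `size` frames."""
    if size <= 0 or size > len(scores):
        raise ValueError("size must be between 1 and len(scores)")
    window_sum = sum(scores[:size])
    best_start, best_sum = 0, window_sum
    for start in range(1, len(scores) - size + 1):
        # Slide the window one frame: add the new frame, drop the old one.
        window_sum += scores[start + size - 1] - scores[start - 1]
        if window_sum < best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / size
```

Run it on the baseline and on the regressed build with size=200; if both report the same start index with a lower mean on the regressed build, you have the segment to inspect.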

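⚠️ Note: VMAF on CPU is workable on short clips but slow on long ones. If you have a CUDA-capable NVIDIA GPU, VMAF-CUDA is the way out, but FFmpeg exposes it as a separate libvmaf_cuda filter, and it needs FFmpeg and libvmaf builds with CUDA enabled. As far as I can tell, the wrapper's --vmaf-features option selects extra feature extractors rather than a GPU backend, so expect to call ffmpeg directly and decode through h264_cuvid / hevc_cuvid. The roughly 6x speedup NVIDIA documents lines up with what I see in practice when I keep frames on the GPU end-to-end.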
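A GPU-resident run means calling ffmpeg directly with the libvmaf_cuda filter. The sketch below only builds the command line. It assumes an FFmpeg build with both the cuvid decoders and libvmaf_cuda enabled, inputs that already share resolution, and a filter graph modelled on the pattern NVIDIA has published. Treat the exact graph as a starting point to verify on your own build, not a confirmed recipe.

```python
def vmaf_cuda_command(distorted: str, reference: str,
                      decoder: str = "h264_cuvid") -> list[str]:
    """Build an ffmpeg argv that decodes both inputs on the GPU and runs libvmaf_cuda."""
    # Per-input options: decode on the GPU and keep decoded frames in GPU memory.
    gpu_input = ["-hwaccel", "cuda", "-hwaccel_output_format", "cuda", "-c:v", decoder]
    graph = (
        "[0:v]scale_cuda=format=yuv420p[dis];"
        "[1:v]scale_cuda=format=yuv420p[ref];"
        "[dis][ref]libvmaf_cuda"
    )
    return ["ffmpeg", *gpu_input, "-i", distorted, *gpu_input, "-i", reference,
            "-filter_complex", graph, "-f", "null", "-"]
```

Because CPU and CUDA scores can differ slightly, baseline your thresholds on whichever path the CI job actually uses.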

🚦 3. Turning it into a CI gate

The script: run the metrics on every rendition, compare the aggregate against a per-rendition threshold, exit non-zero on regression.

# scripts/qa_gate.py
import json
import subprocess
import sys
from pathlib import Path

# Per-rendition floors for VMAF and SSIM. Tune to your content.
# Tighter on top renditions, looser on the bottom rung.
THRESHOLDS = {
    "720p": {"vmaf_avg": 85, "vmaf_min": 65, "ssim_avg": 0.96},
    "480p": {"vmaf_avg": 78, "vmaf_min": 55, "ssim_avg": 0.94},
    "360p": {"vmaf_avg": 70, "vmaf_min": 45, "ssim_avg": 0.92},
}

REFERENCE = "reference/master.mov"
MODEL = "models/vmaf_v0.6.1.json"

def run_metrics(distorted: Path) -> dict:
    out = subprocess.check_output([
        "ffmpeg-quality-metrics", str(distorted), REFERENCE,
        "--metrics", "psnr", "ssim", "vmaf",
        "--vmaf-model-path", MODEL,
        "--output-format", "json",
    ])
    return json.loads(out)["global"]

def main() -> int:
    failures = []
    for name, floor in THRESHOLDS.items():
        rendition = Path(f"distorted/{name}.mp4")
        if not rendition.exists():
            failures.append(f"{name}: rendition missing")
            continue
        metrics = run_metrics(rendition)
        for metric, minimum in floor.items():
            value = metrics.get(metric.split("_")[0], {}).get(metric)
            if value is None or value < minimum:
                failures.append(f"{name} {metric}: got {value}, want >= {minimum}")
        print(f"{name}: vmaf {metrics['vmaf']['vmaf_avg']:.1f}, "
              f"ssim {metrics['ssim']['ssim_avg']:.3f}, "
              f"psnr {metrics['psnr']['psnr_avg']:.1f}")
    if failures:
        print("\nFAILURES:")
        for f in failures:
            print(f"  - {f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())

Run it locally first:

# bash
$ python scripts/qa_gate.py
720p: vmaf 88.7, ssim 0.978, psnr 41.8
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1

The output you actually want is when somebody breaks the encoder:

# bash
$ python scripts/qa_gate.py
720p: vmaf 79.1, ssim 0.962, psnr 42.4
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1

FAILURES:
  - 720p vmaf_avg: got 79.1, want >= 85
$ echo $?
1

Look at what PSNR did here: it actually went up (42.4 vs the previous 41.8). The encoder lowered raw pixel error at the expense of perceptual quality, and a PSNR-only gate would have shipped it. The VMAF gate caught it.
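For reference, PSNR is just a logarithmic rescaling of mean squared error, which is why it cannot see anything that pixel-wise error does not. A small sketch for 8-bit video (peak value 255); the function name is mine.

```python
import math

def psnr_from_mse(mse: float, peak: float = 255.0) -> float:
    """PSNR in dB for a given mean squared error; infinite when frames are identical."""
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)
```

Any change that lowers the average squared error raises this number, no matter where in the frame the remaining error sits or how visible it is. That is the gap VMAF is meant to cover.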

🤖 4. The GitHub Actions job

# .github/workflows/encoder-qa.yml
name: encoder-qa
on:
  pull_request:
    paths:
      - "encoder/**"
      - "scripts/qa_gate.py"

jobs:
  qa:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4

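      - name: Install FFmpeg (BtbN static build)
        run: |
          # NOTE: this "latest" asset tracks FFmpeg master, not a fixed 8.1.1.
          # For reproducible scores, pin a specific release asset and record its version.
          curl -L -o ffmpeg.tar.xz \
            https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
          sudo tar xf ffmpeg.tar.xz --strip-components=2 -C /usr/local/bin --wildcards '*/bin/ffmpeg' '*/bin/ffprobe'
          ffmpeg -version | head -1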

      - name: Install metrics wrapper
        run: pip install ffmpeg-quality-metrics

      - name: Build the test ladder
        run: ./scripts/build_ladder.sh

      - name: Run quality gate
        run: python scripts/qa_gate.py

A few notes on this in production:

Cache the reference master in an S3 bucket or a Git LFS object. Re-fetching from a public CDN on every run is a recipe for flaky builds.
Treat the model file as a build input. Bump it deliberately, and re-baseline thresholds when you do.
Keep the per-frame JSON as a build artifact for at least 30 days. When an encoder regresses, the headline number tells you it broke; the per-frame data tells you where.
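Re-baselining is less error-prone if the floors are derived from a known-good run instead of edited by hand. A sketch under my own assumptions: the function name, the idea of a per-metric margin, and the margin values in the usage note are placeholders to tune, not recommendations.

```python
def derive_floors(baseline: dict, margin: dict) -> dict:
    """Turn baseline scores into gate floors by subtracting a per-metric margin."""
    floors = {}
    for metric, score in baseline.items():
        # Metrics without an explicit margin keep the baseline as their floor.
        floors[metric] = round(score - margin.get(metric, 0.0), 3)
    return floors
```

For example, derive_floors({"vmaf_avg": 88.5, "ssim_avg": 0.98}, {"vmaf_avg": 3.5}) gives a VMAF floor of 85.0 and leaves the SSIM floor at the baseline. Commit the result next to the model file so both change in the same pull request.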

⚠️ Things that bite you in real workflows

A short list of things I have had to fix on actual CI pipelines:

Different frame counts between reference and distorted. Trim with ffmpeg -ss/-to on both sides, or VMAF will silently truncate to the shorter one and the score will surprise you.
Color space mismatches. A BT.709 source compared against a BT.601 encode will produce noisy VMAF scores. Normalize with zscale=transfer=bt709:matrix=bt709.
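Resolution mismatches. The libvmaf filter itself expects both inputs at the same resolution; ffmpeg-quality-metrics handles this by scaling the distorted to match the reference. If you really want to test "how this looks on a 720p screen", scale the reference down to 720p first, then compare.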
CPU vs CUDA scores drift slightly. Same model, same files, different paths through the math. Pick one for CI and stick to it.
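Most of these can be caught before any metric runs by comparing what ffprobe reports for the two files. A sketch of the comparison step only. It assumes each argument is the first entry of "streams" from ffprobe -v error -select_streams v:0 -show_streams -of json FILE. Some containers omit nb_frames, in which case both sides read as None and that check passes silently.

```python
def preflight_problems(ref: dict, dist: dict) -> list[str]:
    """Compare two ffprobe video-stream dicts and describe any mismatches."""
    problems = []
    for field in ("nb_frames", "r_frame_rate", "color_space", "color_transfer"):
        if ref.get(field) != dist.get(field):
            problems.append(
                f"{field}: reference={ref.get(field)} distorted={dist.get(field)}"
            )
    if (ref.get("width"), ref.get("height")) != (dist.get("width"), dist.get("height")):
        problems.append("resolution differs; decide which side gets scaled")
    return problems
```

For the lower rungs a resolution difference is expected, so you may want to log that entry as a warning and fail only on frame-count, frame-rate, or color mismatches.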

What's next

A few directions worth exploring:

Per-title encoding QA. The same gate, but a thresholds-per-content-class table. Talking heads, sports, animation, screen content; each gets its own floor.
Frame-level alerting. Plot the per-frame VMAF in your CI artifact viewer. A two-second dip below 60 is often more telling than the global average.
GPU-resident pipelines. If you encode with NVENC and want faster-than-realtime QA on long content, run h264_cuvid decoders into a libvmaf_cuda filter graph that stays on the GPU. The 6x speedup is real; the plumbing is fiddly enough to be its own post.
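The frame-level alerting idea reduces to finding runs of consecutive frames below a floor that last long enough to matter. A minimal sketch, again assuming the per-frame VMAF scores are already in a list; the function name and defaults (a floor of 60 held for two seconds, taken from the example above) are illustrative.

```python
def sustained_dips(scores: list[float], fps: float, floor: float = 60.0,
                   min_seconds: float = 2.0) -> list[tuple[int, int]]:
    """Return (first_frame, last_frame) for each run below `floor` lasting at least `min_seconds`."""
    min_frames = max(1, round(fps * min_seconds))
    dips, run_start = [], None
    for index, score in enumerate(scores):
        if score < floor:
            if run_start is None:
                run_start = index
        elif run_start is not None:
            if index - run_start >= min_frames:
                dips.append((run_start, index - 1))
            run_start = None
    # Close a run that extends to the final frame.
    if run_start is not None and len(scores) - run_start >= min_frames:
        dips.append((run_start, len(scores) - 1))
    return dips
```

A non-empty result is a reasonable second gate condition alongside the aggregate floors, since a short collapse can hide inside a healthy average.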

Wire it up, get the gate green, and the first time a PR turns it red you will be glad you stopped trusting PSNR as the whole story.
