DEV Community

Mason K
Mason K

Posted on

Wiring VMAF (and PSNR) into your encoder CI with FFmpeg 8.1 and ffmpeg-quality-metrics

📦 Code: github.com/USER/encoder-qa-ci (replace before publishing)

TL;DR

We are going to wire a perceptual-quality gate into a CI workflow using FFmpeg 8.1.1, libvmaf, and the ffmpeg-quality-metrics Python wrapper. The job runs PSNR, SSIM, and VMAF against a fixed reference ladder and fails the merge if VMAF drops below threshold. Works on CPU; the same setup runs roughly 6x faster on GPU with VMAF-CUDA.

PSNR has been the default "is my encoder okay" metric in CI pipelines for a decade, and it is starting to show its age. It cannot tell when the encoder traded perceptual quality for raw pixel error, and that is exactly the failure mode per-title and ML-driven encoders walk into. Let's wire up something better.

We will build this in three steps:

  1. Stand up a tiny test bench with a reference ladder.
  2. Run PSNR + SSIM + VMAF with ffmpeg-quality-metrics.
  3. Wrap the whole thing in a CI script that fails on regression.

🛠️ 1. The test bench

You need three things: a reference master, a set of encoded renditions ("distorted" in metric-speak), and a fixed VMAF model file.

# bash
mkdir encoder-qa && cd encoder-qa
mkdir reference distorted models

# A short, representative reference. Pick a clip that matches your worst-case content:
# face-heavy, motion, fine detail. The "TearsOfSteel" or "BigBuckBunny" clips work for demos.
curl -L -o reference/master.mov \
  https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_5MB.mp4
Enter fullscreen mode Exit fullscreen mode

Drop the VMAF model into models/. The vmaf_v0.6.1.json model is the most widely cited Netflix VMAF model; there are also vmaf_4k_v0.6.1.json for 4K viewing and a phone variant. Pin the one you use, exactly:

# bash (clone the model into the repo so CI doesn't fetch over the network)
git clone --depth 1 https://github.com/Netflix/vmaf.git /tmp/vmaf
cp /tmp/vmaf/model/vmaf_v0.6.1.json models/
Enter fullscreen mode Exit fullscreen mode

💡 Tip: treat the model file like a lockfile. Different VMAF model versions produce different scores. If you upgrade, you re-baseline every threshold.

Now generate a reference ladder. We use FFmpeg 8.1.1; older versions are fine but you want at least 6.1 for VMAF-CUDA, and 8.x for the cleanest webvtt/libsvtav1 integration.

# bash (produce three renditions for the test ladder)
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 23 -vf scale=1280:720  distorted/720p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 26 -vf scale=854:480   distorted/480p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 28 -vf scale=640:360   distorted/360p.mp4
Enter fullscreen mode Exit fullscreen mode

Verify the FFmpeg version. The libvmaf and libsvtav1 wiring tightened up across the 8.0 → 8.1 line:

# bash
$ ffmpeg -version | head -1
ffmpeg version 8.1.1 ...
$ ffmpeg -filters | grep -E "libvmaf|psnr|ssim"
 .. libvmaf            VV->V       Calculate the VMAF between two video streams.
 .. psnr               VV->V       Calculate the PSNR between two video streams.
 .. ssim               VV->V       Calculate the SSIM between two video streams.
Enter fullscreen mode Exit fullscreen mode

If libvmaf is not listed, your FFmpeg was built without --enable-libvmaf. Most distro builds ship it; static builds from BtbN/FFmpeg-Builds include it by default.

📐 2. Running the metrics

You can call ffmpeg -lavfi libvmaf... directly, but the output format is awkward to parse, and you end up writing the same Python wrapper everyone else has. Use ffmpeg-quality-metrics (slhck/ffmpeg-quality-metrics, still actively maintained). It runs all three metrics in one pass, emits JSON, and handles the model-path plumbing for you.

# bash
pip install ffmpeg-quality-metrics
Enter fullscreen mode Exit fullscreen mode

A single rendition through the gate:

# bash
ffmpeg-quality-metrics distorted/720p.mp4 reference/master.mov \
  --metrics psnr ssim vmaf \
  --vmaf-model-path models/vmaf_v0.6.1.json \
  --output-format json > metrics_720p.json
Enter fullscreen mode Exit fullscreen mode

The output looks like this (truncated):

{
  "global": {
    "psnr": { "psnr_avg": 41.82, "psnr_min": 36.41, "psnr_max": 45.97 },
    "ssim": { "ssim_avg": 0.978 },
    "vmaf": { "vmaf_avg": 88.7, "vmaf_min": 71.2, "vmaf_max": 96.4 }
  },
  "input_file_dist": "distorted/720p.mp4",
  "input_file_ref": "reference/master.mov"
}
Enter fullscreen mode Exit fullscreen mode

global is the aggregate. The per-frame data is also in the JSON if you ask for --output-format json with the frame-level flag, and that is the file you want when an encoder regresses and you need to find which 200 frames lost 8 points.

⚠️ Note: VMAF on CPU is workable on short clips but slow on long ones. If you have NVENC-capable hardware, add --vmaf-features cuda (the wrapper passes it through to libvmaf) and decode through h264_cuvid / hevc_cuvid. The roughly 6x speedup NVIDIA documents lines up with what I see in practice when I keep frames on the GPU end-to-end.

🚦 3. Turning it into a CI gate

The script: run the metrics on every rendition, compare the aggregate against a per-rendition threshold, exit non-zero on regression.

# scripts/qa_gate.py
import json
import subprocess
import sys
from pathlib import Path

# Per-rendition VMAF floor. Tune to your content.
# Tighter on top renditions, looser on the bottom rung.
THRESHOLDS = {
    "720p": {"vmaf_avg": 85, "vmaf_min": 65, "ssim_avg": 0.96},
    "480p": {"vmaf_avg": 78, "vmaf_min": 55, "ssim_avg": 0.94},
    "360p": {"vmaf_avg": 70, "vmaf_min": 45, "ssim_avg": 0.92},
}

REFERENCE = "reference/master.mov"
MODEL = "models/vmaf_v0.6.1.json"

def run_metrics(distorted: Path) -> dict:
    out = subprocess.check_output([
        "ffmpeg-quality-metrics", str(distorted), REFERENCE,
        "--metrics", "psnr", "ssim", "vmaf",
        "--vmaf-model-path", MODEL,
        "--output-format", "json",
    ])
    return json.loads(out)["global"]

def main() -> int:
    failures = []
    for name, floor in THRESHOLDS.items():
        rendition = Path(f"distorted/{name}.mp4")
        if not rendition.exists():
            failures.append(f"{name}: rendition missing")
            continue
        metrics = run_metrics(rendition)
        for metric, minimum in floor.items():
            value = metrics.get(metric.split("_")[0], {}).get(metric)
            if value is None or value < minimum:
                failures.append(f"{name} {metric}: got {value}, want >= {minimum}")
        print(f"{name}: vmaf {metrics['vmaf']['vmaf_avg']:.1f}, "
              f"ssim {metrics['ssim']['ssim_avg']:.3f}, "
              f"psnr {metrics['psnr']['psnr_avg']:.1f}")
    if failures:
        print("\nFAILURES:")
        for f in failures:
            print(f"  - {f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
Enter fullscreen mode Exit fullscreen mode

Run it locally first:

# bash
$ python scripts/qa_gate.py
720p: vmaf 88.7, ssim 0.978, psnr 41.8
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1
Enter fullscreen mode Exit fullscreen mode

The output you actually want is when somebody breaks the encoder:

# bash
$ python scripts/qa_gate.py
720p: vmaf 79.1, ssim 0.962, psnr 42.4
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1

FAILURES:
  - 720p vmaf_avg: got 79.1, want >= 85
exit code: 1
Enter fullscreen mode Exit fullscreen mode

Look at what PSNR did here: it actually went up (42.4 vs the previous 41.8). The encoder traded perceptual quality for raw pixel error and a PSNR-only gate would have shipped it. The VMAF gate caught it.

🤖 4. The GitHub Actions job

# .github/workflows/encoder-qa.yml
name: encoder-qa
on:
  pull_request:
    paths:
      - "encoder/**"
      - "scripts/qa_gate.py"

jobs:
  qa:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4

      - name: Install FFmpeg 8.1.1
        run: |
          curl -L -o ffmpeg.tar.xz \
            https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
          tar xf ffmpeg.tar.xz --strip-components=2 -C /usr/local/bin --wildcards '*/bin/ffmpeg' '*/bin/ffprobe'
          ffmpeg -version | head -1

      - name: Install metrics wrapper
        run: pip install ffmpeg-quality-metrics

      - name: Build the test ladder
        run: ./scripts/build_ladder.sh

      - name: Run quality gate
        run: python scripts/qa_gate.py
Enter fullscreen mode Exit fullscreen mode

A few notes on this in production:

  • Cache the reference master in an S3 bucket or a Git LFS object. Re-fetching from a public CDN on every run is a recipe for flaky builds.
  • Treat the model file as a build input. Bump it deliberately, and re-baseline thresholds when you do.
  • Keep the per-frame JSON as a build artifact for at least 30 days. When an encoder regresses, the headline number tells you it broke; the per-frame data tells you where.

⚠️ Things that bite you in real workflows

A short list of things I have had to fix on actual CI pipelines:

  • Different frame counts between reference and distorted. Trim with ffmpeg -ss/-to on both sides, or VMAF will silently truncate to the shorter one and the score will surprise you.
  • Color space mismatches. A BT.709 source compared against a BT.601 encode will produce noisy VMAF scores. Normalize with zscale=transfer=bt709:matrix=bt709.
  • Resolution mismatches. libvmaf scales the distorted to match the reference; if you really want to test "how this looks on a 720p screen", scale the reference down to 720p first, then compare.
  • CPU vs CUDA scores drift slightly. Same model, same files, different paths through the math. Pick one for CI and stick to it.

What's next

A few directions worth exploring:

  • Per-title encoding QA. The same gate, but a thresholds-per-content-class table. Talking heads, sports, animation, screen content; each gets its own floor.
  • Frame-level alerting. Plot the per-frame VMAF in your CI artifact viewer. A two-second dip below 60 is often more telling than the global average.
  • GPU-resident pipelines. If you encode with NVENC and want sub-realtime QA on long content, run h264_cuvid decoders into a libvmaf filter graph that stays on the GPU. The 6x speedup is real; the plumbing is fiddly enough to be its own post.

Wire it up, get the gate green, and the first time a PR turns it red you will be glad you stopped trusting PSNR as the whole story.

video #ffmpeg #tutorial #devops

Top comments (0)