📦 Code: github.com/USER/encoder-qa-ci (replace before publishing)
TL;DR
We are going to wire a perceptual-quality gate into a CI workflow using FFmpeg 8.1.1, libvmaf, and the ffmpeg-quality-metrics Python wrapper. The job runs PSNR, SSIM, and VMAF against a fixed reference ladder and fails the merge if VMAF drops below threshold. Works on CPU; the same setup runs roughly 6x faster on GPU with VMAF-CUDA.
PSNR has been the default "is my encoder okay" metric in CI pipelines for a decade, and it is starting to show its age. It cannot tell when the encoder bought lower raw pixel error at the cost of perceptual quality, and that is exactly the failure mode per-title and ML-driven encoders walk into. Let's wire up something better.
We will build this in three steps:
- Stand up a tiny test bench with a reference ladder.
- Run PSNR + SSIM + VMAF with ffmpeg-quality-metrics.
- Wrap the whole thing in a CI script that fails on regression.
🛠️ 1. The test bench
You need three things: a reference master, a set of encoded renditions ("distorted" in metric-speak), and a fixed VMAF model file.
# bash
mkdir encoder-qa && cd encoder-qa
mkdir reference distorted models
# A short, representative reference. Pick a clip that matches your worst-case content:
# face-heavy, motion, fine detail. The "TearsOfSteel" or "BigBuckBunny" clips work for demos.
curl -L -o reference/master.mov \
  https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_5MB.mp4
Drop the VMAF model into models/. The vmaf_v0.6.1.json model is the most widely cited Netflix VMAF model; there are also vmaf_4k_v0.6.1.json for 4K viewing and a phone variant. Pin the one you use, exactly:
# bash (clone the model into the repo so CI doesn't fetch over the network)
git clone --depth 1 https://github.com/Netflix/vmaf.git /tmp/vmaf
cp /tmp/vmaf/model/vmaf_v0.6.1.json models/
💡 Tip: treat the model file like a lockfile. Different VMAF model versions produce different scores. If you upgrade, you re-baseline every threshold.
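One way to enforce the lockfile idea is to record the model's SHA-256 next to the thresholds and refuse to run the gate when the file on disk no longer matches. This is only a sketch: the helper names `sha256_of` and `model_matches_pin` are hypothetical, and the pinned digest is whatever you record when you baseline.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_matches_pin(path: Path, pinned_digest: str) -> bool:
    """True when the model file is the one the thresholds were baselined against."""
    return sha256_of(path) == pinned_digest.lower()
```

Calling `model_matches_pin(Path("models/vmaf_v0.6.1.json"), PINNED)` at the top of the gate, with `PINNED` being a constant you commit yourself, turns a silent model swap into an explicit failure.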
Now generate a reference ladder. We use FFmpeg 8.1.1; older versions are fine but you want at least 6.1 for VMAF-CUDA, and 8.x for the cleanest webvtt/libsvtav1 integration.
# bash (produce three renditions for the test ladder)
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 23 -vf scale=1280:720 distorted/720p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 26 -vf scale=854:480 distorted/480p.mp4
ffmpeg -i reference/master.mov -c:v libx264 -preset medium -crf 28 -vf scale=640:360 distorted/360p.mp4
Verify the FFmpeg version. The libvmaf and libsvtav1 wiring tightened up across the 8.0 → 8.1 line:
# bash
$ ffmpeg -version | head -1
ffmpeg version 8.1.1 ...
$ ffmpeg -filters | grep -E "libvmaf|psnr|ssim"
.. libvmaf VV->V Calculate the VMAF between two video streams.
.. psnr VV->V Calculate the PSNR between two video streams.
.. ssim VV->V Calculate the SSIM between two video streams.
If libvmaf is not listed, your FFmpeg was built without --enable-libvmaf. Most distro builds ship it; static builds from BtbN/FFmpeg-Builds include it by default.
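If you would rather have CI fail fast on a build without the filters than discover it halfway through a run, the same check can be scripted. A minimal sketch, assuming the `ffmpeg -filters` row layout shown above (flags, name, I/O signature, description); `missing_filters` and `check_local_ffmpeg` are hypothetical names, not part of any library.

```python
import subprocess

REQUIRED_FILTERS = ("libvmaf", "psnr", "ssim")


def missing_filters(filters_output: str, required=REQUIRED_FILTERS) -> list:
    """Return the required filter names absent from `ffmpeg -filters` output."""
    available = set()
    for line in filters_output.splitlines():
        parts = line.split()
        # Filter rows look like: "<flags> <name> <io> <description...>".
        # The "->" in the third column separates them from the legend lines.
        if len(parts) >= 3 and "->" in parts[2]:
            available.add(parts[1])
    return [name for name in required if name not in available]


def check_local_ffmpeg() -> list:
    """Query the ffmpeg binary on PATH and report which required filters are missing."""
    output = subprocess.run(
        ["ffmpeg", "-hide_banner", "-filters"],
        capture_output=True, text=True, check=True,
    ).stdout
    return missing_filters(output)
```

An empty list from `check_local_ffmpeg()` means the build has everything the gate needs.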
📐 2. Running the metrics
You can call ffmpeg -lavfi libvmaf... directly, but the output format is awkward to parse, and you end up writing the same Python wrapper everyone else has. Use ffmpeg-quality-metrics (slhck/ffmpeg-quality-metrics, still actively maintained). It runs all three metrics in one pass, emits JSON, and handles the model-path plumbing for you.
# bash
pip install ffmpeg-quality-metrics
A single rendition through the gate:
# bash
ffmpeg-quality-metrics distorted/720p.mp4 reference/master.mov \
  --metrics psnr ssim vmaf \
  --vmaf-model-path models/vmaf_v0.6.1.json \
  --output-format json > metrics_720p.json
The output looks like this (truncated):
{
  "global": {
    "psnr": { "psnr_avg": 41.82, "psnr_min": 36.41, "psnr_max": 45.97 },
    "ssim": { "ssim_avg": 0.978 },
    "vmaf": { "vmaf_avg": 88.7, "vmaf_min": 71.2, "vmaf_max": 96.4 }
  },
  "input_file_dist": "distorted/720p.mp4",
  "input_file_ref": "reference/master.mov"
}
global is the aggregate. The per-frame data is also in the JSON if you ask for --output-format json with the frame-level flag, and that is the file you want when an encoder regresses and you need to find which 200 frames lost 8 points.
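Finding "which 200 frames" is a sliding-window minimum over the per-frame scores. The sketch below assumes you have already pulled the per-frame VMAF values out of the JSON into a plain list of numbers (the per-frame layout is not shown here, so the extraction is left to you); `worst_window` is a hypothetical helper.

```python
def worst_window(scores, window=200):
    """Return (start_index, mean_score) of the lowest-scoring run of `window` frames."""
    if not scores:
        raise ValueError("scores is empty")
    window = min(window, len(scores))
    running = sum(scores[:window])
    best_start, best_sum = 0, running
    for start in range(1, len(scores) - window + 1):
        # Slide the window by one frame: drop the oldest score, add the newest.
        running += scores[start + window - 1] - scores[start - 1]
        if running < best_sum:
            best_start, best_sum = start, running
    return best_start, best_sum / window
```

Dividing the returned start index by the frame rate gives you a timestamp to seek to in a player.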
⚠️ Note: VMAF on CPU is workable on short clips but slow on long ones. If you have NVENC-capable hardware, add --vmaf-features cuda (the wrapper passes it through to libvmaf) and decode through h264_cuvid/hevc_cuvid. The roughly 6x speedup NVIDIA documents lines up with what I see in practice when I keep frames on the GPU end-to-end.
🚦 3. Turning it into a CI gate
The script: run the metrics on every rendition, compare the aggregate against a per-rendition threshold, exit non-zero on regression.
# scripts/qa_gate.py
import json
import subprocess
import sys
from pathlib import Path

# Per-rendition VMAF floor. Tune to your content.
# Tighter on top renditions, looser on the bottom rung.
THRESHOLDS = {
    "720p": {"vmaf_avg": 85, "vmaf_min": 65, "ssim_avg": 0.96},
    "480p": {"vmaf_avg": 78, "vmaf_min": 55, "ssim_avg": 0.94},
    "360p": {"vmaf_avg": 70, "vmaf_min": 45, "ssim_avg": 0.92},
}
REFERENCE = "reference/master.mov"
MODEL = "models/vmaf_v0.6.1.json"


def run_metrics(distorted: Path) -> dict:
    out = subprocess.check_output([
        "ffmpeg-quality-metrics", str(distorted), REFERENCE,
        "--metrics", "psnr", "ssim", "vmaf",
        "--vmaf-model-path", MODEL,
        "--output-format", "json",
    ])
    return json.loads(out)["global"]


def main() -> int:
    failures = []
    for name, floor in THRESHOLDS.items():
        rendition = Path(f"distorted/{name}.mp4")
        if not rendition.exists():
            failures.append(f"{name}: rendition missing")
            continue
        metrics = run_metrics(rendition)
        for metric, minimum in floor.items():
            value = metrics.get(metric.split("_")[0], {}).get(metric)
            if value is None or value < minimum:
                failures.append(f"{name} {metric}: got {value}, want >= {minimum}")
        print(f"{name}: vmaf {metrics['vmaf']['vmaf_avg']:.1f}, "
              f"ssim {metrics['ssim']['ssim_avg']:.3f}, "
              f"psnr {metrics['psnr']['psnr_avg']:.1f}")
    if failures:
        print("\nFAILURES:")
        for f in failures:
            print(f" - {f}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
Run it locally first:
# bash
$ python scripts/qa_gate.py
720p: vmaf 88.7, ssim 0.978, psnr 41.8
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1
The output you actually want is when somebody breaks the encoder:
# bash
$ python scripts/qa_gate.py
720p: vmaf 79.1, ssim 0.962, psnr 42.4
480p: vmaf 81.2, ssim 0.961, psnr 38.9
360p: vmaf 72.4, ssim 0.943, psnr 36.1
FAILURES:
- 720p vmaf_avg: got 79.1, want >= 85
exit code: 1
Look at what PSNR did here: it actually went up (42.4 vs the previous 41.8). The encoder bought lower raw pixel error at the cost of perceptual quality, and a PSNR-only gate would have shipped it. The VMAF gate caught it.
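If you keep the previous run's numbers around, you can also flag this specific pattern (VMAF down, PSNR flat or up) before it ever reaches the absolute floor. A rough sketch, not part of the gate script: `diverging_metrics`, the flat dict layout, and the 2-point tolerance are all assumptions to tune.

```python
def diverging_metrics(baseline, current, vmaf_drop=2.0):
    """Flag renditions where VMAF fell by more than `vmaf_drop` while PSNR did not fall.

    Both arguments map rendition name -> {"vmaf_avg": float, "psnr_avg": float}.
    """
    flagged = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            continue  # a missing rendition is already reported by the gate itself
        vmaf_delta = now["vmaf_avg"] - base["vmaf_avg"]
        psnr_delta = now["psnr_avg"] - base["psnr_avg"]
        if vmaf_delta < -vmaf_drop and psnr_delta >= 0:
            flagged.append((name, round(vmaf_delta, 2), round(psnr_delta, 2)))
    return flagged


baseline = {"720p": {"vmaf_avg": 88.7, "psnr_avg": 41.8},
            "480p": {"vmaf_avg": 81.2, "psnr_avg": 38.9}}
current = {"720p": {"vmaf_avg": 79.1, "psnr_avg": 42.4},
           "480p": {"vmaf_avg": 81.2, "psnr_avg": 38.9}}
print(diverging_metrics(baseline, current))  # [('720p', -9.6, 0.6)]
```

The sample numbers are the two runs shown above; only the 720p rendition is flagged.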
🤖 4. The GitHub Actions job
# .github/workflows/encoder-qa.yml
name: encoder-qa
on:
  pull_request:
    paths:
      - "encoder/**"
      - "scripts/qa_gate.py"
jobs:
  qa:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Install FFmpeg (static build)
        run: |
          # This URL tracks BtbN's latest master build, not a fixed 8.1.1 release;
          # pin a specific release asset if you need reproducible scores.
          curl -L -o ffmpeg.tar.xz \
            https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
          # /usr/local/bin is root-owned on hosted runners, hence sudo.
          sudo tar xf ffmpeg.tar.xz --strip-components=2 -C /usr/local/bin --wildcards '*/bin/ffmpeg' '*/bin/ffprobe'
          ffmpeg -version | head -1
      - name: Install metrics wrapper
        run: pip install ffmpeg-quality-metrics
      - name: Build the test ladder
        run: ./scripts/build_ladder.sh
      - name: Run quality gate
        run: python scripts/qa_gate.py
A few notes on this in production:
- Cache the reference master in an S3 bucket or a Git LFS object. Re-fetching from a public CDN on every run is a recipe for flaky builds.
- Treat the model file as a build input. Bump it deliberately, and re-baseline thresholds when you do.
- Keep the per-frame JSON as a build artifact for at least 30 days. When an encoder regresses, the headline number tells you it broke; the per-frame data tells you where.
⚠️ Things that bite you in real workflows
A short list of things I have had to fix on actual CI pipelines:
- Different frame counts between reference and distorted. Trim with ffmpeg -ss/-to on both sides, or VMAF will silently truncate to the shorter one and the score will surprise you.
- Color space mismatches. A BT.709 source compared against a BT.601 encode will produce noisy VMAF scores. Normalize with zscale=transfer=bt709:matrix=bt709.
- Resolution mismatches. The wrapper scales the distorted stream to match the reference (the libvmaf filter itself expects both inputs at the same resolution); if you really want to test "how this looks on a 720p screen", scale the reference down to 720p first, then compare.
- CPU vs CUDA scores drift slightly. Same model, same files, different paths through the math. Pick one for CI and stick to it.
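The frame-count mismatch is the easiest of these to catch automatically: count decoded frames on both sides with ffprobe before running any metric. A minimal sketch, assuming ffprobe is on PATH; the helper names are hypothetical, and `-count_frames` decodes the whole stream, so it is not free on long clips.

```python
import json
import subprocess


def ffprobe_frame_count_cmd(path):
    """Build an ffprobe command that decodes the first video stream and counts its frames."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-count_frames",
        "-show_entries", "stream=nb_read_frames",
        "-of", "json",
        str(path),
    ]


def parse_frame_count(ffprobe_json):
    """Extract nb_read_frames from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        raise ValueError("no video stream found")
    return int(streams[0]["nb_read_frames"])


def frame_count(path):
    """Count decoded frames in a file (runs ffprobe)."""
    out = subprocess.check_output(ffprobe_frame_count_cmd(path), text=True)
    return parse_frame_count(out)
```

In the gate, comparing `frame_count(REFERENCE)` with `frame_count(rendition)` and recording a failure when they differ keeps a truncated comparison from passing quietly.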
What's next
A few directions worth exploring:
- Per-title encoding QA. The same gate, but a thresholds-per-content-class table. Talking heads, sports, animation, screen content; each gets its own floor.
- Frame-level alerting. Plot the per-frame VMAF in your CI artifact viewer. A two-second dip below 60 is often more telling than the global average.
- GPU-resident pipelines. If you encode with NVENC and want sub-realtime QA on long content, run h264_cuvid decoders into a libvmaf filter graph that stays on the GPU. The 6x speedup is real; the plumbing is fiddly enough to be its own post.
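For the per-title direction, the change to the gate is mostly a lookup: key the floors by content class instead of (or in addition to) rendition. A sketch under loud assumptions: the class names and every number below are placeholders, not measured floors, and `floor_for` / `check` are hypothetical helpers.

```python
DEFAULT_CLASS = "general"

# Placeholder floors for illustration only; baseline these on your own content.
CLASS_THRESHOLDS = {
    "general": {"vmaf_avg": 85, "vmaf_min": 65},
    "talking_heads": {"vmaf_avg": 90, "vmaf_min": 75},
    "sports": {"vmaf_avg": 82, "vmaf_min": 60},
}


def floor_for(content_class):
    """Return the floor table for a class, falling back to the default class."""
    return CLASS_THRESHOLDS.get(content_class, CLASS_THRESHOLDS[DEFAULT_CLASS])


def check(content_class, metrics):
    """Return failure strings for flat metrics such as {"vmaf_avg": 88.7, "vmaf_min": 71.2}."""
    failures = []
    for metric, minimum in floor_for(content_class).items():
        value = metrics.get(metric)
        if value is None or value < minimum:
            failures.append(f"{content_class} {metric}: got {value}, want >= {minimum}")
    return failures
```

Falling back to a default class means an untagged clip still gets gated rather than skipped.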
Wire it up, get the gate green, and the first time a PR turns it red you will be glad you stopped trusting PSNR as the whole story.