The problem with pixel diff
When comparing two video frames, the obvious approach is pixel
subtraction. Take frame A, take frame B, subtract.
The problem: it's too sensitive. Compression artifacts, a one-pixel
difference between encodes, or a slight brightness variation all
produce noisy results that don't reflect what a human would actually
notice.
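To make that sensitivity concrete, here is a minimal sketch in plain Python. The tiny 4x4 grayscale frames and the helper names are made up for illustration, not taken from DiffALL. A uniform +2 brightness shift, which no viewer would notice, still marks every pixel as changed.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute per-pixel difference between two equal-sized grayscale frames."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def changed_fraction(frame_a, frame_b):
    """Fraction of pixels whose value differs at all."""
    flags = [a != b
             for row_a, row_b in zip(frame_a, frame_b)
             for a, b in zip(row_a, row_b)]
    return sum(flags) / len(flags)


# Hypothetical 4x4 frame: a simple diagonal gradient (values 0..60).
frame_a = [[10 * (x + y) for x in range(4)] for y in range(4)]
# Same frame, uniformly 2 levels brighter (e.g. a slightly different encode).
frame_b = [[value + 2 for value in row] for row in frame_a]

print(mean_abs_diff(frame_a, frame_b))     # average error per pixel
print(changed_fraction(frame_a, frame_b))  # share of pixels flagged as different
```

A naive "how many pixels changed" metric reports a total mismatch here, even though the two frames look identical.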
SSIM (Structural Similarity Index) handles this much better. It
compares luminance, contrast, and structure over small local windows,
producing a score that correlates much better with how humans
perceive visual quality than raw pixel error does.
The result: because the score is computed per window, you get a
heatmap that shows where frames differ, not just a single number.
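A rough sketch of the idea in plain Python, under some loud assumptions: it uses non-overlapping 8x8 windows with uniform weights instead of the sliding Gaussian window of the usual SSIM formulation, and the 16x16 frames are synthetic. It is not DiffALL's code, only an illustration of how a per-window score becomes a heatmap.

```python
# Standard SSIM stabilizing constants for 8-bit images (K1=0.01, K2=0.03, L=255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def window_ssim(a, b):
    """SSIM of two equal-length lists of pixel values (one window)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + C1) * (2 * cov + C2)
    denominator = (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2)
    return numerator / denominator


def ssim_heatmap(frame_a, frame_b, win=8):
    """One SSIM score per non-overlapping win x win block (a simplification)."""
    height, width = len(frame_a), len(frame_a[0])
    heatmap = []
    for top in range(0, height - win + 1, win):
        row = []
        for left in range(0, width - win + 1, win):
            coords = [(y, x) for y in range(top, top + win)
                      for x in range(left, left + win)]
            a = [frame_a[y][x] for y, x in coords]
            b = [frame_b[y][x] for y, x in coords]
            row.append(window_ssim(a, b))
        heatmap.append(row)
    return heatmap


# Synthetic 16x16 gradient frame (values 0..180).
frame_a = [[8 * x + 4 * y for x in range(16)] for y in range(16)]
# Frame B: 2 levels brighter everywhere, plus an inverted bottom-right block.
frame_b = [[v + 2 for v in row] for row in frame_a]
for y in range(8, 16):
    for x in range(8, 16):
        frame_b[y][x] = 255 - frame_a[y][x]

heatmap = ssim_heatmap(frame_a, frame_b)
for row in heatmap:
    print(["%.3f" % score for score in row])
```

The three blocks that differ only by the brightness shift score close to 1, while the inverted block scores below zero, so the map points at where the real change is.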
Why MFCC for audio
For audio comparison, waveform correlation was my first attempt.
It's too sensitive to timing offsets — even a 10ms shift between
identical audio files produces a misleadingly low score.
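The effect is easy to reproduce with a small stdlib-only sketch. The assumptions: one second of seeded white noise at 8 kHz stands in for a broadband audio clip, and plain Pearson correlation stands in for "waveform correlation". Real music or speech is more self-similar than noise, so the drop is usually less extreme than here.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sample lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


SAMPLE_RATE = 8000                     # assumed sample rate
rng = random.Random(0)                 # seeded so the run is repeatable
signal = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

shift = SAMPLE_RATE * 10 // 1000       # 10 ms = 80 samples
aligned = pearson(signal[:-shift], signal[:-shift])
shifted = pearson(signal[:-shift], signal[shift:])

print("aligned:", round(aligned, 3))
print("shifted by 10 ms:", round(shifted, 3))
```

The content is identical in both comparisons; only the alignment differs, yet the shifted score collapses toward zero.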
MFCCs (Mel-Frequency Cepstral Coefficients) capture the shape of
the sound spectrum rather than the raw signal. The same features are
a staple of speech recognition: they are robust to minor timing
differences and perceptually meaningful.
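For the curious, here is a from-scratch sketch of the MFCC pipeline for a single frame: Hann window, power spectrum, triangular mel filterbank, log, DCT-II. Everything in it is an illustrative assumption (8 kHz sample rate, 256-sample frames, 20 filters, 13 coefficients, a naive DFT, synthetic harmonic tones); a real project would normally lean on an audio library rather than hand-rolled code like this.

```python
import cmath
import math

SAMPLE_RATE = 8000   # assumed
FRAME_LEN = 256      # assumed frame size (32 ms)


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hann-window the frame, then |DFT|^2 for bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))) ** 2
            for k in range(n // 2 + 1)]


def mel_energies(power, n_filters=20):
    """Sum the power spectrum under triangular filters spaced evenly in mel."""
    top_mel = hz_to_mel(SAMPLE_RATE / 2)
    edges = [mel_to_hz(top_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (SAMPLE_RATE / 2) / (len(power) - 1)
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            if lo < f <= mid:
                total += p * (f - lo) / (mid - lo)   # rising slope
            elif mid < f < hi:
                total += p * (hi - f) / (hi - mid)   # falling slope
        energies.append(total)
    return energies


def mfcc(frame, n_coeffs=13):
    """Unnormalized DCT-II of the log mel energies; low orders describe the envelope."""
    log_e = [math.log(e + 1e-10) for e in mel_energies(power_spectrum(frame))]
    m = len(log_e)
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / m) for j, v in enumerate(log_e))
            for c in range(n_coeffs)]


def harmonic_tone(n_samples, f0, tilt):
    """18 harmonics of f0 with amplitude 1 / k**tilt (crude stand-in for a voiced sound)."""
    return [sum(math.sin(2 * math.pi * k * f0 * t / SAMPLE_RATE) / k ** tilt
                for k in range(1, 19))
            for t in range(n_samples)]


SHIFT = SAMPLE_RATE * 10 // 1000                         # 10 ms = 80 samples
tone = harmonic_tone(FRAME_LEN + SHIFT, f0=210, tilt=1.0)
other = harmonic_tone(FRAME_LEN, f0=210, tilt=0.0)       # same pitch, brighter timbre

ref = mfcc(tone[:FRAME_LEN])
d_shift = math.dist(ref, mfcc(tone[SHIFT:SHIFT + FRAME_LEN]))
d_other = math.dist(ref, mfcc(other))
print("same sound, shifted 10 ms:", d_shift)
print("different timbre:", d_other)
```

The frame taken 10 ms later contains different raw samples, but its spectral envelope is nearly the same, so its MFCC distance to the reference stays small compared with the distance to a genuinely different timbre.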
The memory problem on a free server
Running frame-by-frame SSIM on a 60-second video at full resolution
would run a 512MB server out of memory almost immediately.
The solution: resize frames before analysis and sample at intervals
rather than every frame. You lose very little that is perceptually
relevant, and the per-second similarity graph still catches exactly
the kind of transient drop that means one scene rendered differently.
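Some back-of-the-envelope arithmetic shows why this matters. Only the 60-second duration comes from the text above; the numbers in this sketch (1920x1080 source, 30 fps, a 320-pixel analysis width, one sample per second) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def frame_bytes(width, height, channels=3):
    """Size of one decoded 8-bit frame in bytes."""
    return width * height * channels


def scaled_size(width, height, target_width):
    """Downscale to target_width, keeping the aspect ratio."""
    return target_width, round(height * target_width / width)


def sample_indices(duration_s, fps, samples_per_second=1):
    """Frame indices to analyze when sampling at a fixed interval."""
    step = fps // samples_per_second
    return list(range(0, duration_s * fps, step))


# Assumed source: 60 s of 1920x1080 video at 30 fps.
DURATION_S, FPS, WIDTH, HEIGHT = 60, 30, 1920, 1080

# Worst case: every decoded frame held at full resolution.
full_total = DURATION_S * FPS * frame_bytes(WIDTH, HEIGHT)

# Sampled and resized: one frame per second at 320 pixels wide.
indices = sample_indices(DURATION_S, FPS)
small_w, small_h = scaled_size(WIDTH, HEIGHT, 320)
sampled_total = len(indices) * frame_bytes(small_w, small_h)

print("full resolution, all frames: %.1f GB" % (full_total / 1e9))
print("sampled and resized: %.1f MB" % (sampled_total / 1e6))
```

Under these assumptions the full-resolution total is far beyond 512MB, while the sampled and resized set fits comfortably. A streaming implementation never holds every frame at once, so treat the first figure as an upper bound on the naive approach rather than a measurement.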
What I ended up building
DiffALL — drop any two files, get a structured diff in seconds.
Supports video (SSIM + PSNR + live heatmap player), images
(pixel diff + flexible mode for different angles), audio (MFCC),
subtitles (WER + timing drift), and text/JSON/CSV/YAML.
diffall.onrender.com: free, no install.
Happy to answer questions about the approach.