Perceptual Image Diff Without NumPy — Building a Pillow-Only CLI
imdiff compares two images and tells you how different they are — with three complementary metrics, a highlighted diff image, and a tiny Docker image. No NumPy, no OpenCV, no subprocess-ing ImageMagick. Just Pillow's C-accelerated operations and about 300 lines of Python.
📦 GitHub: https://github.com/sen-ltd/imdiff
Visual regression testing is the kind of problem where the existing tools are all almost what you want. ImageMagick's compare is fine if you want to shell out to a 120 MB install. Resemble.js is great — in a browser. Pixelmatch is a solid Node library if your CI already has Node. But when you just want "tell me how different these two PNGs are, from a Python script, in a small container," you end up writing the thing yourself.
That's what imdiff is. This article walks through the design, the dHash algorithm, and the specific tradeoffs you hit when you refuse to depend on NumPy.
The problem
Real-world reasons to diff two images:
- Visual regression tests. Did my CSS change silently break the login page? Snapshot the page, diff against the last good snapshot, fail CI if the score drops.
- Image optimizer validation. I just ran every PNG through oxipng/pngquant/a WebP converter. Did any of them break visibly?
- Dedup and near-match detection. Are these two screenshots the same asset at different resolutions? Is this user upload a near-duplicate of something we already have?
- CMS / design QA. Design shipped a new header; does the staging screenshot still match the Figma export within noise?
All four want a number (to gate things on) and a picture (for humans to eyeball). And all four care about different definitions of "the same" — sometimes you want pixel-exact, sometimes you want resize-invariant.
imdiff reports three metrics in one call, which turns out to cover all four cases without needing mode flags:
```shell
$ imdiff baseline.png candidate.png --format json
{
  "a": "baseline.png",
  "b": "candidate.png",
  "changed_pixel_ratio": 0.210033,
  "dhash_similarity": 0.953125,
  "hash_a": "9865e4d4d4e46598",
  "hash_b": "9865e4d6d2e46598",
  "height": 320,
  "identical": false,
  "mean_pixel_error": 0.011610,
  "width": 480
}
```
Exit code 0 if identical, 1 if different, 2 on bad input. Optional --out diff.png writes a highlighted diff. That's the whole surface.
Three metrics, three questions
Every perceptual diff tool has to pick a point on the sensitivity curve. Too strict and you fail CI on every JPEG re-encoding; too loose and you miss real bugs. I don't think there's one correct point, so imdiff reports three numbers and lets you pick which one to gate on.
dHash similarity. A difference hash is a 64-bit fingerprint that survives resizing, mild blurring, and minor color shifts. Two perceptually identical images have identical hashes. A totally different scene has a Hamming distance somewhere near 32 (half the bits). I'll walk through the algorithm below — it's short enough to explain completely.
Mean pixel error. The arithmetic mean of |luminance(a) - luminance(b)| across all pixels, normalized to [0, 1]. This is the opposite of dHash: it's exact, it's sensitive to single-pixel anti-aliasing, and it reports zero only when the images are byte-for-byte identical in grayscale.
Changed pixel ratio. The fraction of pixels whose luminance delta exceeds --threshold (default 10 on a 0–255 scale — so about 4%). This is the "how much of the image actually moved" number. It models human eyeballs better than the global mean, because a single bright red box of changed pixels and a general 4% darkening report very different changed-pixel ratios but similar mean errors.
Using them together:
- `dhash_similarity == 1.0 && mean_pixel_error == 0 && changed_pixel_ratio == 0` → identical, exit 0.
- `dhash_similarity == 1.0 && mean_pixel_error > 0` → same composition, slightly different pixels (JPEG re-encoding, mild color tweak).
- `dhash_similarity < 0.9 && changed_pixel_ratio > 0.1` → actually different, large regions moved.
- `dhash_similarity < 0.5` → different scene entirely.
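For CI you usually want that decision table as one function. Here is a minimal sketch; `classify` is a hypothetical helper operating on imdiff's JSON output, not part of the tool itself:

```python
def classify(m: dict) -> str:
    """Map imdiff's three metrics to a rough verdict (hypothetical helper)."""
    if (m["dhash_similarity"] == 1.0 and m["mean_pixel_error"] == 0
            and m["changed_pixel_ratio"] == 0):
        return "identical"
    if m["dhash_similarity"] == 1.0:
        return "same composition, slightly different pixels"
    if m["dhash_similarity"] < 0.5:
        return "different scene"
    if m["dhash_similarity"] < 0.9 and m["changed_pixel_ratio"] > 0.1:
        return "large regions changed"
    return "minor differences"

# The JSON output from the introduction lands in the fallthrough bucket:
verdict = classify({"dhash_similarity": 0.953125,
                    "mean_pixel_error": 0.011610,
                    "changed_pixel_ratio": 0.210033})
# → "minor differences"
```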
The dHash algorithm in ~15 lines
Difference hashing was popularized by Neal Krawetz as a simpler, faster alternative to average hashing. The whole thing is:
```python
from PIL import Image

def compute_dhash(image: Image.Image, size: int = 8) -> int:
    gray = image.convert("L")
    resized = gray.resize((size + 1, size), Image.Resampling.LANCZOS)
    pixels = resized.load()
    bits = 0
    for y in range(size):
        for x in range(size):
            left = pixels[x, y]
            right = pixels[x + 1, y]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits
```
Four steps:
- Collapse to grayscale. `convert("L")` drops to one channel using ITU-R 601-2 luma weights. We only care about luminance gradients.
- Downscale to `(size + 1, size)`. The extra column exists so that adjacent-pixel comparisons in each row produce exactly `size` bits. For the default `size=8` that's a 9×8 image and a 64-bit hash.
- Compare horizontally. For every pair of adjacent pixels, emit 1 if the left is brighter, 0 otherwise. This is the "difference" in difference hash — we never look at absolute values, only local gradients.
- Pack the bits into an integer.
The reason this works is almost the same reason convolutional filters work: local gradients are robust features. A resized copy of the same image has the same gradients. A JPEG-recompressed copy has nearly the same gradients. A totally different image has completely unrelated gradients.
Why difference hash and not average hash?
There's an older trick called average hash (aHash): compute the mean brightness of the downscaled image, then emit 1 for every pixel above the mean and 0 for every pixel below. It works — barely. It's famously unstable around high-contrast edges because a single noisy pixel can flip many bits.
dHash is more stable because each bit depends on only two neighbouring pixels, not on a global mean. Noise in one pixel can flip at most two adjacent bits. And because dHash ignores absolute brightness, it's invariant to global contrast/brightness changes.
The tradeoff is that dHash is directional — it only looks at horizontal gradients. You can (and some libraries do) also compute a vertical dHash and concatenate to get a 128-bit hash. For imdiff's use cases, 64 bits with one direction has been plenty.
Similarity, not distance
The raw hash output is a 64-bit integer; comparison is Hamming distance ((a ^ b).bit_count() in modern Python, which is a single POPCNT instruction on x86). To make it comparable to the other metrics I normalize:
```python
def dhash_similarity(a: int, b: int, bits: int = 64) -> float:
    # Hamming distance is the popcount of the XOR; normalize to [0, 1].
    return 1.0 - (a ^ b).bit_count() / bits
```
1.0 is identical, ~0.5 is completely random, exact 0.0 is "the hashes are literally inverses of each other" which is vanishingly unlikely for real images.
Metrics in a single Pillow pass
Here's the interesting constraint: I want to compute the mean pixel error and the changed pixel ratio without looping over individual pixels in Python. At 1920×1080 that's 2 million pixels, and a Python for loop over that many items takes multiple seconds. So every hot path has to stay inside Pillow's C code.
```python
from PIL import Image, ImageChops, ImageStat

def compute_metrics(a: Image.Image, b: Image.Image, threshold: int = 10):
    lum_a = a.convert("L")
    lum_b = b.convert("L")

    # C-level per-pixel abs difference.
    delta = ImageChops.difference(lum_a, lum_b)

    # C-level mean of the delta image.
    mean_pixel_error = ImageStat.Stat(delta).mean[0] / 255.0

    # C-level LUT thresholding: 0 or 255 for every pixel.
    lut = [0] * (threshold + 1) + [255] * (255 - threshold)
    mask = delta.point(lut, mode="L")

    # Cheap Python sum over a flat byte sequence.
    total = a.size[0] * a.size[1]
    changed = sum(mask.getdata()) // 255
    changed_pixel_ratio = changed / total

    return {"mean_pixel_error": mean_pixel_error,
            "changed_pixel_ratio": changed_pixel_ratio}
```
ImageChops.difference, ImageStat.Stat, and Image.point are all implemented in Pillow's C extension. The only Python-level operation is sum(mask.getdata()), which iterates over a buffer that is already a flat sequence of bytes — CPython's built-in sum on a byte sequence is about as fast as a Python-level loop gets, and the per-pixel work is a single integer add.
On my MacBook this path processes a 1920×1080 pair in about 70 ms, including file I/O. That's comparable to what a NumPy version would do, because both approaches are bottlenecked on the same C code underneath.
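A quick sanity check of the threshold edge case, repeating `compute_metrics` from above so the snippet runs standalone (the solid test images are made up for illustration):

```python
from PIL import Image, ImageChops, ImageStat

def compute_metrics(a, b, threshold=10):
    lum_a, lum_b = a.convert("L"), b.convert("L")
    delta = ImageChops.difference(lum_a, lum_b)
    mean_pixel_error = ImageStat.Stat(delta).mean[0] / 255.0
    lut = [0] * (threshold + 1) + [255] * (255 - threshold)
    mask = delta.point(lut, mode="L")
    changed = sum(mask.getdata()) // 255
    return {"mean_pixel_error": mean_pixel_error,
            "changed_pixel_ratio": changed / (a.size[0] * a.size[1])}

gray_100 = Image.new("L", (10, 10), 100)

# Every pixel differs by 20, well above the default threshold of 10:
hot = compute_metrics(gray_100, Image.new("L", (10, 10), 120))
# → mean_pixel_error == 20/255, changed_pixel_ratio == 1.0

# A delta of exactly `threshold` is NOT counted as changed (the LUT maps 0..10 to 0):
edge = compute_metrics(gray_100, Image.new("L", (10, 10), 110))
# → mean_pixel_error == 10/255, changed_pixel_ratio == 0.0
```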
Rendering the diff image
The diff image is what you paste into a GitHub comment. The recipe is standard:
```python
def make_diff_image(a, b, threshold=10, base_dim=0.4):
    lum_a = a.convert("L")
    lum_b = b.convert("L")

    # Dimmed grayscale base so the red pops.
    base_gray = lum_a.point(lambda p: int(p * base_dim))
    base_rgb = Image.merge("RGB", (base_gray, base_gray, base_gray))

    # Binary mask of changed pixels.
    delta = ImageChops.difference(lum_a, lum_b)
    lut = [0] * (threshold + 1) + [255] * (255 - threshold)
    mask = delta.point(lut, mode="L")

    # Paste solid red through the mask.
    red_layer = Image.new("RGB", a.size, (255, 0, 0))
    base_rgb.paste(red_layer, (0, 0), mask)
    return base_rgb
```
Image.paste with a mask argument uses the mask as alpha: where the mask is 255, the red layer is written; where it's 0, the base is preserved. That one call replaces what would otherwise be a manual per-pixel composite.
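Putting it together on a synthetic pair, repeating `make_diff_image` from above so the snippet runs standalone:

```python
from PIL import Image, ImageChops

def make_diff_image(a, b, threshold=10, base_dim=0.4):
    lum_a, lum_b = a.convert("L"), b.convert("L")
    base_gray = lum_a.point(lambda p: int(p * base_dim))
    base_rgb = Image.merge("RGB", (base_gray, base_gray, base_gray))
    delta = ImageChops.difference(lum_a, lum_b)
    lut = [0] * (threshold + 1) + [255] * (255 - threshold)
    mask = delta.point(lut, mode="L")
    red_layer = Image.new("RGB", a.size, (255, 0, 0))
    base_rgb.paste(red_layer, (0, 0), mask)
    return base_rgb

# White canvas vs. the same canvas with a black square pasted over it.
a = Image.new("RGB", (100, 100), "white")
b = a.copy()
b.paste(Image.new("RGB", (40, 40), "black"), (30, 30))

diff = make_diff_image(a, b)
# Inside the square: solid red. Outside: dimmed gray, int(255 * 0.4) == 102.
```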
Tradeoffs I'm making on purpose
This section is the honest part.
Not SSIM. The "correct" academic answer to perceptual image similarity is SSIM or one of its derivatives (MS-SSIM, LPIPS). They model the human visual system better than dHash does, and they disagree with "naive" metrics in ways that usually match your intuition. I didn't implement SSIM because it requires NumPy in practice, and the goal here was a deliberately small tool. For visual regression CI, dHash + a threshold is enough 90% of the time, and SSIM is what you reach for when it isn't.
Luminance only, not color. Every metric collapses to grayscale first. A change from red to green at the same luminance produces zero mean pixel error and an identical dHash. This is a known gap: it's fine for screenshot diffs (most layout regressions change luminance somewhere), and it's wrong for color-critical work (logo validation, brand-color checks, medical imaging). The fix would be to compute metrics per channel and aggregate — maybe in a future version.
Anti-aliasing noise. A single pixel of anti-aliasing moving between a and b contributes to mean_pixel_error and can trip changed_pixel_ratio. The --threshold flag is specifically there to filter this out. The default of 10 is empirically good for screenshot diffs: it ignores sub-4% deltas, which is below the JPEG quality 90 re-encoding noise floor.
Resize strategy affects metrics. When the two images have different sizes, imdiff resizes the second to match the first using LANCZOS. This means that the resize itself contributes mean pixel error and changed pixel ratio, because LANCZOS is not a pixel-exact transform. --no-resize is provided so CI can fail loudly instead. In practice if you're diffing screenshots from the same browser at the same viewport, this doesn't matter; if you're diffing against a different viewport, you should be doing something else anyway.
Single-bit color red overlay. The diff image paints pure red on changed pixels. You can't tell how different each changed pixel is, only whether it's above threshold. This is intentional: I tried a luminance-proportional overlay and it was much harder to read at a glance. Binary "this moved" is easier than "how much this moved".
Try it in 30 seconds
Everything ships as a tiny Alpine container (80 MB):
docker build -t imdiff https://github.com/sen-ltd/imdiff.git
mkdir -p /tmp/idtest
docker run --rm -v /tmp/idtest:/work --entrypoint python imdiff -c "
from PIL import Image, ImageDraw
img = Image.new('RGB', (200, 200), 'white')
d = ImageDraw.Draw(img)
d.rectangle([50, 50, 150, 150], fill='red')
img.save('/work/a.png')
d.rectangle([50, 50, 150, 150], fill='blue')
img.save('/work/b.png')
"
docker run --rm -v /tmp/idtest:/work imdiff a.png b.png --out diff.png --format json
# → metrics + /tmp/idtest/diff.png highlighted
docker run --rm -v /tmp/idtest:/work imdiff a.png a.png; echo "exit=$?"
# → exit=0, all metrics zero
Or locally:
```shell
pip install imdiff
imdiff a.png b.png --out diff.png
```
A fun gotcha I hit
When I first tested the red-to-blue square diff, dhash_similarity reported 1.0 — the hashes were identical. At first I thought my dHash was broken. It's not. Pure red (255, 0, 0) and pure blue (0, 0, 255) have very similar luminance values under the ITU-R 601-2 weights (0.299 R + 0.587 G + 0.114 B → red = 76, blue = 29). Different enough that mean pixel error catches it, but the gradients inside the square are identical — a solid-color square has zero internal gradient regardless of color — so the dHash fingerprint is the same.
This is exactly the "luminance only" tradeoff from the previous section, caught by the tests. The lesson: three metrics aren't redundant. When dHash says 1.0 but mean_pixel_error > 0, that's a color-change signal, and the test suite asserts they can disagree that way.
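The luminance collapse is easy to confirm directly in Pillow:

```python
from PIL import Image

red_l = Image.new("RGB", (1, 1), (255, 0, 0)).convert("L").getpixel((0, 0))
blue_l = Image.new("RGB", (1, 1), (0, 0, 255)).convert("L").getpixel((0, 0))
# red_l == 76, blue_l == 29: far apart for mean pixel error,
# but a solid square of either has zero internal gradient, so dHash agrees.
```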
Closing
Entry #129 in a 100+ portfolio series by SEN LLC. Small tools, written end-to-end, with the design choices explained instead of hidden.
If you want a diff tool that reads like one file and ships in an 80 MB container, give imdiff a try. If you want SSIM, use scikit-image. If you want browser-based, use Resemble.js. Pick the one that fits your pipeline — that's the whole point of having options.
Feedback welcome.
