ahmet gedik

Posted on Jun 26

Perceptual Hashing for Video Deduplication in Python at Aggregator Scale

#python #ffmpeg #video #hashing

The same clip, eleven times, eleven titles

Last quarter our ingest pipeline pulled a 47-second clip of a Seoul night-market performance into the trending feed eleven separate times. Same footage, eleven different rows in the database. One was the original from a Korean creator. One was a Taiwanese re-upload with hard-coded Traditional Chinese subtitles. One was letterboxed to 1:1 for a vertical feed. Two had a watermark slapped in the corner. The titles were in Korean, Japanese, Traditional Chinese, English, and one was just a string of emoji. To our SQLite FTS5 index with its CJK tokenizer, these were eleven unrelated videos. To a human, they were obviously the same thing.

At TopVideoHub we aggregate trending video across the whole Asia-Pacific region, and cross-region re-uploads are the single biggest source of feed pollution we deal with. When you pull trending lists from JP, KR, TW, HK, and SG within the same hour, you get the same viral clip bouncing between markets with every possible cosmetic mutation. Metadata can't save you here. The only thing that survives re-encoding, re-titling, and re-cropping is the visual content itself. This article walks through how we built a perceptual-hashing dedup layer in Python that catches those duplicates before they ever reach the feed.

Why metadata and file hashes are useless here

The naive instinct is to hash the file. That fails immediately:

MD5/SHA of the bytes changes the moment anyone re-encodes, trims a frame, or changes the bitrate. Two byte-identical uploads of a trending clip basically never happen across platforms.
Duration matching is too coarse. Thousands of unrelated clips are 30 or 60 seconds long, and re-uploads frequently trim intros or append outros, shifting duration by a few seconds.
Title / metadata similarity is hopeless in a multi-language context. A Korean title and its Traditional Chinese re-upload share zero tokens even after CJK segmentation, and machine-translated titles drift unpredictably.
Thumbnail comparison alone is fragile because uploaders pick different thumbnail frames.

What we actually want is a fingerprint of what the video looks like, one that stays stable under compression, scaling, minor cropping, watermarks, and color shifts, but changes when the underlying content is genuinely different. That is exactly what a perceptual hash gives you. Unlike a cryptographic hash, where one flipped bit changes everything, a perceptual hash changes proportionally to how much the image changed. Two visually similar frames produce two hashes that are close in Hamming distance.

Sampling keyframes from the video

A perceptual hash works on images, so the first job is turning a video into a small, representative set of frames. We don't hash every frame; for a trending clip that would be tens of thousands of frames for almost no extra signal. Instead we sample a fixed number of frames spaced evenly across the runtime. We use ffmpeg and ffprobe directly through subprocess because they are already on every ingest worker and nothing in Python matches their decode speed.

import os
import subprocess
import tempfile
from PIL import Image


def video_duration(path: str) -> float:
    out = subprocess.check_output([
        'ffprobe', '-v', 'error',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        path,
    ])
    return float(out.strip())


def extract_keyframes(path: str, n: int = 8) -> list:
    '''Grab n evenly spaced frames as PIL images.'''
    duration = video_duration(path)
    frames = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(n):
            # sample at the midpoint of each of n segments
            ts = duration * (i + 0.5) / n
            out_path = os.path.join(tmp, 'frame_' + str(i) + '.jpg')
            subprocess.run([
                'ffmpeg', '-ss', format(ts, '.2f'), '-i', path,
                '-frames:v', '1', '-q:v', '3', '-y', out_path,
            ], check=True, capture_output=True)
            frames.append(Image.open(out_path).copy())
    return frames

A few decisions worth calling out:

Seek before decode. Putting -ss before -i makes ffmpeg seek to the timestamp first and decode only one frame, instead of decoding the whole file up to that point. For an 8-frame sample of a 10-minute video this is the difference between milliseconds and tens of seconds.
Eight frames is enough. We tested 4, 8, and 16. Eight is the sweet spot for short trending clips; below that, intro/outro variation dominates, and above it you pay decode cost for marginal recall.
Midpoint sampling beats uniform endpoints. Sampling at the center of each segment avoids the black title cards and fade-ins that cluster at the very start and end of re-uploads.

Computing the perceptual hash

The workhorse is the DCT-based pHash. The idea: shrink the frame to grayscale, run a 2D discrete cosine transform, keep only the low-frequency coefficients (which capture overall structure, not noise), and threshold them against their median to produce a 64-bit value. Low-frequency content is exactly what survives JPEG/H.264 compression, so the hash is stable under re-encoding.

Here is a dependency-light implementation using only NumPy and Pillow. The DCT is built from a basis matrix so you can read exactly what is happening.

import numpy as np
from PIL import Image


def _dct2(matrix: np.ndarray) -> np.ndarray:
    '''Unnormalized 2D DCT-II via a basis matrix (fine for hashing).'''
    m = matrix.shape[0]
    n = np.arange(m)
    k = n.reshape((m, 1))
    basis = np.cos(np.pi * (2 * n + 1) * k / (2 * m))
    return basis @ matrix @ basis.T


def phash(image: Image.Image, hash_size: int = 8, highfreq_factor: int = 4) -> int:
    size = hash_size * highfreq_factor
    img = image.convert('L').resize((size, size), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    dct = _dct2(pixels)
    # keep the top-left low-frequency block
    low = dct[:hash_size, :hash_size]
    med = np.median(low)
    bits = (low > med).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count('1')

With hash_size=8 you get a 64-bit hash. The highfreq_factor controls how much detail feeds the DCT before you discard the high frequencies; 4 (so a 32x32 working image) is the standard choice. The hamming function is the entire comparison primitive: it counts differing bits between two hashes. Two frames from the same source video typically sit at a Hamming distance of 0 to 8 out of 64; genuinely different frames are usually 25 or more apart. That gap is what makes the whole system work.

A quick sanity check you can run locally:

Hash a frame, then hash the same frame after saving it as a low-quality JPEG. Distance should be 0 to 4.
Hash a frame, then hash a horizontally cropped version (10% off one side). Distance usually stays under 12.
Hash two unrelated frames. Distance should be comfortably above 20.

If your numbers look like that, your pipeline is wired correctly.

From frames to a video signature

One frame hash isn't a video fingerprint. Re-uploads trim and pad, so frame i of the original may align with frame i+1 of the re-upload. We solve this with a small, order-tolerant matching scheme: a video's signature is just its list of frame hashes, and two signatures match if enough of one video's frame hashes find a close partner anywhere in the other's.

def video_signature(path: str, n: int = 8) -> list:
    return [phash(frame) for frame in extract_keyframes(path, n)]


def signatures_match(sig_a, sig_b, frame_threshold: int = 10,
                     min_matches: int = 5) -> bool:
    matches = 0
    for ha in sig_a:
        if any(hamming(ha, hb) <= frame_threshold for hb in sig_b):
            matches += 1
    return matches >= min_matches


if __name__ == '__main__':
    a = video_signature('original_kr.mp4')
    b = video_signature('reupload_tw.mp4')
    print('duplicate' if signatures_match(a, b) else 'distinct')

The two tunables, frame_threshold and min_matches, are the levers that control your false-positive versus false-negative trade-off:

frame_threshold is how close two individual frames must be (in bits) to count as a match. Lower is stricter. We run 10.
min_matches is how many of the eight frames must find a partner. We require 5 of 8. Requiring all 8 misses re-uploads that trimmed an intro; requiring only 2 starts flagging unrelated clips that happen to share a black frame or a common stock intro.

This any-to-any comparison is O(n²) in frames, but with n=8 that's 64 cheap integer XORs per video pair, which is nothing. The real scaling problem is not comparing two videos; it's finding which of your millions of stored videos to compare against. That's where storage strategy matters.

Storing and querying without a full scan

We can't run a Hamming comparison against every video in the database for each new ingest. Our backend is PHP 8.4 on LiteSpeed with SQLite, and SQLite has no native Hamming-distance index. So we use a coarse bucket as a candidate prefilter: store the high 16 bits of each frame hash as an indexed integer, look up only videos that share a bucket with the incoming frames, then verify those few candidates with full Hamming distance in PHP.

<?php
declare(strict_types=1);

final class VideoFingerprintStore
{
    public function __construct(private \PDO $db)
    {
        $this->db->exec(
            'CREATE TABLE IF NOT EXISTS frame_hash (
                video_id  INTEGER NOT NULL,
                frame_idx INTEGER NOT NULL,
                phash     INTEGER NOT NULL,
                bucket    INTEGER NOT NULL,
                PRIMARY KEY (video_id, frame_idx)
            )'
        );
        $this->db->exec('CREATE INDEX IF NOT EXISTS idx_bucket ON frame_hash(bucket)');
    }

    public function store(int $videoId, array $hashes): void
    {
        $stmt = $this->db->prepare(
            'INSERT OR REPLACE INTO frame_hash (video_id, frame_idx, phash, bucket)
             VALUES (?, ?, ?, ?)'
        );
        foreach ($hashes as $idx => $hash) {
            $bucket = ($hash >> 48) & 0xFFFF; // top 16 bits, masked sign-safe
            $stmt->execute([$videoId, $idx, $hash, $bucket]);
        }
    }

    public static function hamming(int $a, int $b): int
    {
        $x = $a ^ $b;
        $count = 0;
        while ($x !== 0) {
            $count += $x & 1;
            $x = ($x >> 1) & 0x7FFFFFFFFFFFFFFF; // logical shift past the sign bit
        }
        return $count;
    }

    public function findDuplicate(array $hashes, int $threshold = 10, int $minMatches = 5): ?int
    {
        $buckets = array_values(array_unique(array_map(
            static fn(int $h): int => ($h >> 48) & 0xFFFF,
            $hashes
        )));
        $in = implode(',', array_fill(0, count($buckets), '?'));
        $stmt = $this->db->prepare(
            'SELECT video_id, phash FROM frame_hash WHERE bucket IN (' . $in . ')'
        );
        $stmt->execute($buckets);

        $matchesByVideo = [];
        foreach ($stmt->fetchAll(\PDO::FETCH_ASSOC) as $row) {
            foreach ($hashes as $h) {
                if (self::hamming($h, (int) $row['phash']) <= $threshold) {
                    $matchesByVideo[$row['video_id']] ??= 0;
                    $matchesByVideo[$row['video_id']]++;
                    break;
                }
            }
        }
        foreach ($matchesByVideo as $videoId => $count) {
            if ($count >= $minMatches) {
                return (int) $videoId;
            }
        }
        return null;
    }
}

Two PHP gotchas that bit us and might bite you:

PHP integers are signed 64-bit. A pHash with its top bit set arrives as a negative number. The bucket mask & 0xFFFF is fine, but the right shift in hamming() must be masked with 0x7FFFFFFFFFFFFFFF after each step, otherwise PHP's arithmetic shift drags the sign bit and the loop never terminates. SQLite stores the signed 64-bit value perfectly either way.
Bucketing trades recall for speed. Two near-duplicate frames can differ in their top 16 bits and land in different buckets, so a pure top-bits bucket misses some duplicates. In production we store three buckets per frame (top, middle, and a rotated slice) and union the candidates, which recovers most of the lost recall. For a stricter guarantee, a BK-tree or a multi-index hashing scheme gives exact threshold search; the bucket prefilter is just the cheapest thing that works well at our volume.

Going parallel for batch backfill

The Python path is great for live ingest, one video at a time. But when we first turned this on we had a backlog of roughly two million stored videos to fingerprint. For that batch job we wanted a tight, allocation-free matcher, so the cross-comparison stage runs in Go. math/bits.OnesCount64 compiles down to a single POPCNT instruction, which makes Hamming distance essentially free, and goroutines let us saturate every core.

package main

import (
    "fmt"
    "math/bits"
)

type Signature struct {
    VideoID int
    Frames  []uint64
}

func hamming(a, b uint64) int {
    return bits.OnesCount64(a ^ b)
}

func framesMatch(a, b Signature, frameThreshold, minMatches int) bool {
    matches := 0
    for _, ha := range a.Frames {
        for _, hb := range b.Frames {
            if hamming(ha, hb) <= frameThreshold {
                matches++
                break
            }
        }
    }
    return matches >= minMatches
}

func main() {
    a := Signature{VideoID: 1, Frames: []uint64{0xA1B2C3D4E5F60718}}
    b := Signature{VideoID: 2, Frames: []uint64{0xA1B2C3D4E5F6071A}}
    if framesMatch(a, b, 10, 1) {
        fmt.Println("duplicate found")
    }
}

One cross-language detail that cost us an afternoon: keep your hash type consistent. Python ints are unbounded, Go uses uint64, PHP is signed 64-bit. As long as every layer treats the value as exactly 64 bits and you compare unsigned bit patterns, the Hamming distances line up. The moment one layer sign-extends or truncates, your distances quietly become wrong and you start either merging unrelated videos or missing obvious dupes.

Tuning, false positives, and the things that fool it

Perceptual hashing is not magic, and pretending otherwise leads to merging videos that shouldn't be merged. The failure modes we actually see:

Shared stock intros and outros. A channel's branded 3-second intro produces near-identical frame hashes across all their videos. The min_matches threshold is your defense; one or two matching frames is not a duplicate.
Static-screen content. Lyric videos, podcasts with a single background image, and slideshow compilations have very low visual variance, so unrelated ones can look similar. We flag low-variance signatures (where a video's own frame hashes are all near each other) and fall back to stricter thresholds plus duration checks for those.
Hard cuts and montages. A heavily re-edited compilation that reuses one clip from the original is not a full duplicate, and our 5-of-8 rule correctly treats it as distinct.
Heavy cropping or rotation. pHash tolerates roughly 10 to 15 percent cropping. A 1:1 crop of a 16:9 video can fall outside that, which is the one case where we accept a miss rather than loosen thresholds and start over-merging.

We arrived at threshold 10 and 5-of-8 by labeling a few hundred real cross-region pairs from our own feed and sweeping the parameters for the best precision at acceptable recall. Do the same with your content. A platform of music videos and a platform of gameplay clips will land on different numbers, because their baseline visual variance is different.

Conclusion

Deduplicating re-uploaded video across a dozen Asia-Pacific markets came down to one principle: hash what the human eye sees, not the bytes or the titles. Sample a handful of keyframes with ffmpeg, reduce each to a 64-bit DCT perceptual hash, compare with Hamming distance, and require several frames to agree before you call two videos the same. Wrap that with a cheap bucket prefilter so you never scan the whole table, push the batch backfill onto a popcount-friendly Go worker, and tune your two thresholds against your own labeled pairs.

The payoff is concrete: that eleven-way Seoul clip now collapses into a single canonical entry the moment the second copy hits ingest, and the trending feed shows eleven different videos instead of one video eleven times. For an aggregator, that is the whole game. Start with the Python snippets above, run the three sanity checks on your own footage, and only reach for BK-trees or multi-index hashing once a bucket prefilter actually becomes your bottleneck.

DEV Community