Building a Perceptual Hashing Pipeline for Video Deduplication in Python

#python #video #algorithms #backend

Across eight regions our crawlers pull the same trending clip dozens of times a day. A music video that charts in the US shows up again in GB with a different thumbnail, again in BR re-uploaded by a mirror channel, and a fourth time as a 480p rip with a watermark slapped on the corner. By the time the discovery feed renders, the same content is competing with itself for shelf space, and our users see four near-identical cards in a row. Exact-match dedup with a SHA-256 of the file bytes catches none of this, because re-encodes, crops, and re-uploads produce completely different byte streams. At TrendVidStream the fix that actually moved the needle was perceptual hashing: fingerprinting what a video looks like rather than what its bytes are. This post walks through the pipeline we run, the math behind it, and the storage decisions that keep lookups fast at scale.

Why Cryptographic Hashes Are the Wrong Tool

The instinct of every backend engineer is to reach for a hash function. The problem is that cryptographic hashes are designed to do the opposite of what we need. A good hash like SHA-256 exhibits the avalanche effect: flip a single bit of input and roughly half the output bits flip. That is exactly what you want for integrity checking and exactly what you do not want for similarity detection.

Consider two files of the same 30-second clip:

original.mp4 — 1080p, H.264, CRF 18
reupload.mp4 — 720p, H.264, CRF 23, with a 2-pixel black border added

To a human these are the same video. To SHA-256 they are two unrelated 256-bit numbers with no exploitable relationship. There is no "distance" you can compute between two cryptographic hashes that tells you anything about the similarity of the inputs.

Perceptual hashing inverts the design goal. We want small visual changes to produce small changes in the hash, so that two similar inputs land close together in hash space. "Close" is measured by Hamming distance — the number of differing bits — and a threshold on that distance becomes our similarity test.

The pHash Algorithm, Step by Step

There are several perceptual hash families. The average hash (aHash) is the simplest: downscale, compute the mean luminance, and emit one bit per pixel based on whether it is above or below the mean. It is fast but fragile — a single bright region drags the mean around. The DCT-based hash (pHash) is the workhorse for production because it keys on the low-frequency structure of the image, which is precisely the part that survives re-encoding, scaling, and mild compression.

Here is the algorithm for a single frame:

Grayscale. Color is mostly noise for this purpose; we care about structure.
Downscale to 32×32. This throws away high-frequency detail and normalizes resolution.
Apply a 2D Discrete Cosine Transform. The DCT concentrates the image's energy into the top-left coefficients — the low frequencies.
Keep the top-left 8×8 block (excluding the DC term, the very first coefficient, which just encodes overall brightness).
Compute the median of those 64 coefficients.
Emit 64 bits: 1 if the coefficient is above the median, 0 otherwise.

The result is a 64-bit fingerprint where every bit reflects a low-frequency structural property of the frame. Two visually similar frames will differ in only a handful of those bits.

You do not need to implement the DCT yourself — imagehash wraps Pillow and scipy cleanly — but it is worth seeing the mechanics once so the threshold tuning later makes sense:

import numpy as np
from scipy.fftpack import dct
from PIL import Image


def phash_frame(image: Image.Image, hash_size: int = 8, highfreq_factor: int = 4) -> int:
    """Compute a 64-bit DCT perceptual hash for a single PIL image."""
    img_size = hash_size * highfreq_factor  # 32x32
    img = image.convert("L").resize((img_size, img_size), Image.Resampling.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)

    # 2D DCT: transform rows, then columns
    coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
    low_freq = coeffs[:hash_size, :hash_size]  # top-left 8x8 block

    # Exclude the DC term from the median so overall brightness doesn't dominate
    median = np.median(low_freq.flatten()[1:])
    bits = low_freq > median

    # Pack 64 booleans into a single integer
    value = 0
    for bit in bits.flatten():
        value = (value << 1) | int(bit)
    return value


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

The hamming_distance helper is the entire similarity engine. XOR two hashes, count the set bits, and you have a distance between 0 (identical) and 64 (maximally different). In practice anything under 10 is a confident match for re-encodes, and 5 or below catches near-exact duplicates even with watermarks.

From Frames to Videos

A single frame hash is not enough — videos are temporal. Two different videos can share a near-identical opening frame (think of every gaming channel that starts on the same loading screen), and a single video's frames vary wildly across its runtime. The standard approach is to sample frames at fixed intervals and build a sequence of hashes, then compare sequences rather than single values.

We sample one frame per second using PyAV, which gives us direct access to the decoder without shelling out to ffmpeg for every file. Sampling by timestamp rather than by frame index makes the fingerprint robust to frame-rate differences — a 24fps original and a 30fps re-upload still land on the same wall-clock seconds.

import av
import imagehash
from PIL import Image


def video_fingerprint(path: str, sample_hz: float = 1.0, max_frames: int = 60) -> list[int]:
    """Sample frames at sample_hz and return a list of 64-bit perceptual hashes."""
    container = av.open(path)
    stream = container.streams.video[0]
    stream.thread_type = "AUTO"

    duration = float(stream.duration * stream.time_base) if stream.duration else 0.0
    interval = 1.0 / sample_hz
    targets = [interval * i for i in range(max_frames) if interval * i < duration]

    hashes: list[int] = []
    for ts in targets:
        # Seek to the nearest keyframe before the target timestamp
        offset = int(ts / stream.time_base)
        container.seek(offset, stream=stream, any_frame=False, backward=True)
        for frame in container.decode(stream):
            if float(frame.pts * stream.time_base) >= ts:
                img = frame.to_image()
                hashes.append(int(str(imagehash.phash(img)), 16))
                break

    container.close()
    return hashes

A few things matter in production here. Seeking to keyframes (backward=True) is dramatically faster than decoding linearly — for a 10-minute video you decode 60 short segments instead of every frame. Capping max_frames bounds the cost of a pathologically long file; a 4-hour livestream VOD does not need a 14,400-element fingerprint to be deduplicated against its own clips. And thread_type = "AUTO" lets PyAV use multi-threaded decoding, which on our crawler boxes roughly halves wall time.

Comparing Sequences

With two lists of frame hashes, similarity is a question of how many positions match within tolerance. The simplest robust metric: for each frame in the shorter sequence, find its best Hamming match in the longer sequence within a small temporal window, then report the fraction of frames that matched under the bit threshold.

def sequence_similarity(
    seq_a: list[int],
    seq_b: list[int],
    bit_threshold: int = 10,
    window: int = 2,
) -> float:
    """Fraction of frames in the shorter sequence that match the longer one."""
    if not seq_a or not seq_b:
        return 0.0
    short, long = (seq_a, seq_b) if len(seq_a) <= len(seq_b) else (seq_b, seq_a)

    matches = 0
    for i, h in enumerate(short):
        lo = max(0, i - window)
        hi = min(len(long), i + window + 1)
        best = min((bin(h ^ long[j]).count("1") for j in range(lo, hi)), default=64)
        if best <= bit_threshold:
            matches += 1
    return matches / len(short)

The window parameter absorbs small temporal misalignments — a re-upload that trims a 1.5-second intro will be offset by one or two sampled frames, and the window lets those frames still find their match. We treat a similarity above 0.85 as a duplicate and route it into our merge logic. Below 0.6 we treat as distinct. The grey zone between gets flagged for the editorial review queue rather than auto-merged, because a false merge that hides a legitimately different video is worse for discovery than a missed duplicate.

Storing and Searching 64-Bit Hashes at Scale

Computing fingerprints is the easy half. The hard half is: given a new video's hashes, find every existing video within Hamming distance k, fast, across a catalog of millions of frame hashes. A linear scan that XORs against every stored hash is O(n) per lookup and dies the moment the catalog grows.

Our stack is PHP 8.4 in front of SQLite with FTS5 for text discovery, and the fingerprint store lives in the same SQLite database the rest of the catalog uses. SQLite is not the obvious choice for nearest-neighbor search, but two techniques make it viable without bolting on a vector database.

Technique one: hash prefix banding. Split each 64-bit hash into four 16-bit bands. By the pigeonhole principle, if two hashes differ by at most 3 bits total, at least one of the four 16-bit bands must be identical. So we store four indexed band columns and query for an exact match on any band as a candidate generator, then verify the full Hamming distance only on that small candidate set. This turns a million-row scan into an indexed lookup over a few hundred candidates.

import sqlite3


def bands(h: int) -> tuple[int, int, int, int]:
    return (
        (h >> 48) & 0xFFFF,
        (h >> 32) & 0xFFFF,
        (h >> 16) & 0xFFFF,
        h & 0xFFFF,
    )


def init_store(db: sqlite3.Connection) -> None:
    db.execute(
        """CREATE TABLE IF NOT EXISTS frame_hashes (
            video_id TEXT NOT NULL,
            frame_idx INTEGER NOT NULL,
            hash INTEGER NOT NULL,
            b0 INTEGER NOT NULL, b1 INTEGER NOT NULL,
            b2 INTEGER NOT NULL, b3 INTEGER NOT NULL
        )"""
    )
    for col in ("b0", "b1", "b2", "b3"):
        db.execute(f"CREATE INDEX IF NOT EXISTS idx_{col} ON frame_hashes({col})")


def insert_fingerprint(db: sqlite3.Connection, video_id: str, hashes: list[int]) -> None:
    rows = []
    for idx, h in enumerate(hashes):
        b0, b1, b2, b3 = bands(h)
        rows.append((video_id, idx, h, b0, b1, b2, b3))
    db.executemany(
        "INSERT INTO frame_hashes VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    db.commit()


def find_near_duplicates(db: sqlite3.Connection, h: int, max_dist: int = 8) -> list[str]:
    b0, b1, b2, b3 = bands(h)
    cur = db.execute(
        """SELECT DISTINCT video_id, hash FROM frame_hashes
           WHERE b0 = ? OR b1 = ? OR b2 = ? OR b3 = ?""",
        (b0, b1, b2, b3),
    )
    hits = []
    for video_id, stored in cur.fetchall():
        if bin(h ^ stored).count("1") <= max_dist:
            hits.append(video_id)
    return hits

The banding trick is exact for distances up to 3 and a strong heuristic up to 8 or so — you may miss a small fraction of matches whose differing bits spread across all four bands, but in our experience those are rare enough that the second pass of full-frame sequence comparison catches them. The win is enormous: lookups stay in the single-digit-millisecond range even past a million stored frame hashes, all inside the SQLite file that ships with the rest of the catalog over our FTP-based deploy.

Technique two: store the full hash, verify in the loop. Notice we always re-check the true Hamming distance after the index narrows the field. The bands are a filter, never the final answer. This keeps false positives out of the merge pipeline regardless of how aggressive the candidate generation is.

Wiring It Into a Multi-Region Crawler

The fingerprinting step slots into the ingestion cron between download and catalog insert. Each region's crawler runs on its own schedule, and they all write to the same fingerprint store, which means a clip first seen in the US region is already deduplicated against when the BR crawler trips over a mirror of it six hours later.

A few operational lessons from running this across eight regions:

Hash on ingest, not on read. Fingerprinting is the expensive part. Do it once, when the video first enters the pipeline, and never on the request path. The discovery feed only ever reads the precomputed dedup groups.
Make the threshold per-content-type. Music videos and movie trailers tolerate a tighter threshold because their visual structure is distinctive. Gameplay and screen-recording content needs a looser one, or near-identical UI chrome triggers false merges. We key the threshold off the category the catalog already assigns.
Keep a quarantine band. Anything in the 0.6–0.85 similarity zone goes to a review table instead of auto-merging. Auto-merging an ambiguous pair is the one mistake that visibly degrades discovery, so we bias toward keeping both and letting a human break the tie.
Version your hash function. When you change hash_size or the sampling rate, old fingerprints are no longer comparable to new ones. Store a hash_version column and re-fingerprint lazily as videos are touched, rather than blocking a deploy on a full re-index.
Watch the seek cost. The single biggest performance cliff is decoding long videos linearly. Always seek to keyframes; the backward=True flag in the PyAV example is not optional at scale.

One more subtlety worth calling out: black frames and solid-color intros hash to nearly the same value for every video. If you do not filter them, every fade-to-black title card in your catalog matches every other one and your candidate sets explode. We drop frames whose hash has fewer than 8 bits set or more than 56 — those are the near-uniform frames carrying no structural signal.

Conclusion

Perceptual hashing is one of those techniques that looks like overkill until you see four copies of the same clip fighting for the top of a feed, and then it looks like the obvious answer. The core idea is small — fingerprint the low-frequency structure of sampled frames, compare with Hamming distance, and use prefix banding to keep lookups fast — but the operational details are where it lives or dies: seek to keyframes, filter blank frames, version your hashes, and never auto-merge an ambiguous pair. Built this way, the whole thing fits inside a plain SQLite file and a Python ingest cron, no vector database or GPU required. For a multi-region discovery product the payoff is direct: a cleaner feed, less duplicate inventory to rank, and crawlers in eight regions that quietly agree on what is actually the same video. Start with a single 64-bit pHash and a Hamming threshold of 10, measure your false-merge rate, and tighten from there — the rest is tuning.