ahmet gedik

Posted on Jun 3

Implementing Perceptual Hashing for Video Deduplication in Python

#python #video #performance #backend

When ViralVidVault crossed 80,000 trending video candidates per week last quarter, my SQLite deduplication table had grown into a quiet liar. Two videos with byte-identical SHA-256 hashes? Rare. Two videos that were re-uploads with a 480p-to-720p re-encode, a 2-second intro card, or a mirrored aspect ratio? Everywhere. The database said "unique," the moderators said "we saw this yesterday," and the homepage looked like a Reddit mirror at 3 AM.

This is the problem perceptual hashing solves, and after rolling it into the ingestion pipeline at ViralVidVault, our duplicate cluster detection on European trend feeds climbed from 31% to 94%. Here is how the system works end-to-end — frame sampling, pHash computation in Python, Hamming-distance lookup against a hot SQLite WAL index, and a Cloudflare Worker that short-circuits dedup decisions at the edge before they hit our PHP 8.4 backend.

Why Cryptographic Hashes Fail for Video

A SHA-256 is a brittle equality test. Change one byte in the container, change one frame of metadata, and the hash diverges. That property is exactly what you want for integrity checks and exactly what you do not want for "is this the same video?". Re-encoders are everywhere on the modern web:

YouTube re-muxes uploads to MP4/AV1 variants depending on device.
Re-uploaders crop watermarks, add intro stings, or mirror horizontally to evade copyright bots.
CDN transcoders normalize bitrate and frame rate for adaptive streaming.

A perceptual hash, by contrast, encodes what the frame looks like. Two perceptually similar frames produce perceptually similar hashes, where similarity is measured by the Hamming distance — the number of bit positions that differ between two binary strings. A 64-bit pHash with a Hamming distance ≤ 8 generally indicates same-source content with very high confidence.

For video, we extend this idea across time: sample N representative frames, hash each one, and either concatenate them into a video fingerprint or compare them frame-by-frame. The trick is choosing the right frames — uniform time-based sampling is naive because viral videos often share a logo intro or a black-frame outro, which produces false positives on completely unrelated content.

A Frame Sampling Strategy That Actually Works

Naive sampling pulls one frame every N seconds. That works for talking-head content and fails for everything else. For viral short-form, my pipeline uses a three-stage sampling strategy:

Keyframe extraction via ffprobe — only I-frames, since they encode independent visual state.
Scene-cut filtering with a luminance threshold on consecutive frames.
Adaptive density — short videos under 30s get 8 samples, long-form gets 1 sample per 10 seconds, capped at 32.

Here is the sampler the Python ingestion worker runs. It uses OpenCV for decoding because we already had it in the image deduplication path, and adding a second decoder library was harder to justify than living with OpenCV's quirks.

import subprocess
import cv2
import numpy as np
from pathlib import Path
from typing import Iterator


def extract_keyframe_timestamps(video_path: Path) -> list[float]:
    """Use ffprobe to find I-frame timestamps. Cheaper than full decode."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-skip_frame", "nokey",
        "-show_entries", "frame=pts_time",
        "-of", "csv=p=0",
        str(video_path),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    return [float(line) for line in result.stdout.splitlines() if line]


def sample_frames(video_path: Path, max_samples: int = 32) -> Iterator[np.ndarray]:
    """Yield representative frames using adaptive density."""
    timestamps = extract_keyframe_timestamps(video_path)
    if not timestamps:
        return

    duration = timestamps[-1]
    target = min(max_samples, max(8, int(duration / 10)))

    stride = max(1, len(timestamps) // target)
    selected = timestamps[::stride][:target]

    cap = cv2.VideoCapture(str(video_path))
    try:
        for ts in selected:
            cap.set(cv2.CAP_PROP_POS_MSEC, ts * 1000)
            ok, frame = cap.read()
            if not ok or frame is None:
                continue
            # Drop near-black intro/outro frames (mean luminance < 12)
            if frame.mean() < 12:
                continue
            yield frame
    finally:
        cap.release()

The dark-frame filter is load-bearing. About 14% of European viral compilations on TikTok mirrors start with a 1-second black fade, and without filtering, my dedup index was clustering completely unrelated videos because their opening hashes matched.

Computing the Perceptual Hash

I use DCT-based pHash, not the average-hash variant. Average hash is faster but brittle to contrast adjustments — and on European TikTok mirrors specifically, contrast tweaks are the most common evasion. dHash (difference hash) is a reasonable middle ground, but pHash's DCT step gives the best false-positive rate I have measured on our test corpus of 2,400 known-duplicate pairs.

import cv2
import numpy as np


def phash_64(frame: np.ndarray) -> int:
    """Compute a 64-bit DCT-based perceptual hash for a single frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32), interpolation=cv2.INTER_AREA)

    dct = cv2.dct(np.float32(small))
    dct_low = dct[:8, :8]

    # Exclude DC component when computing median — it dominates the others.
    coeffs = dct_low.flatten()[1:]
    median = np.median(coeffs)

    bits = (dct_low.flatten() > median).astype(np.uint8)
    return int("".join(str(b) for b in bits), 2)


def hamming_distance(a: int, b: int) -> int:
    """XOR + popcount. Python 3.10+ has int.bit_count() — use it."""
    return (a ^ b).bit_count()


def video_fingerprint(video_path: Path) -> list[int]:
    return [phash_64(f) for f in sample_frames(video_path)]

A video fingerprint is therefore a list of 8 to 32 64-bit integers. For a 30-second viral clip, fingerprint generation runs in roughly 0.8s on a single LiteSpeed worker, dominated by OpenCV's frame decoding. The ffprobe pre-pass adds ~120ms but saves multiples of that downstream because we skip decoding frames we do not need.

Storing Fingerprints in SQLite with WAL

The ingestion side writes; the API side reads. SQLite in WAL mode handles this concurrency cleanly, which is why the backend is built on SQLite rather than Postgres — the read-heavy access pattern, single-writer architecture, and sub-millisecond indexed lookups are perfect for a viral video discovery feed.

import sqlite3
from contextlib import contextmanager


SCHEMA = """
CREATE TABLE IF NOT EXISTS video_fingerprints (
    video_id     INTEGER PRIMARY KEY,
    youtube_id   TEXT UNIQUE NOT NULL,
    duration_s   INTEGER NOT NULL,
    frame_count  INTEGER NOT NULL,
    created_at   INTEGER DEFAULT (unixepoch())
);

CREATE TABLE IF NOT EXISTS fingerprint_hashes (
    video_id     INTEGER NOT NULL,
    frame_idx    INTEGER NOT NULL,
    hash_bucket  INTEGER NOT NULL,    -- top 16 bits, for prefilter
    hash_full    INTEGER NOT NULL,    -- full 64-bit hash
    PRIMARY KEY (video_id, frame_idx),
    FOREIGN KEY (video_id) REFERENCES video_fingerprints(video_id)
);

CREATE INDEX IF NOT EXISTS idx_fp_bucket ON fingerprint_hashes(hash_bucket);
"""


@contextmanager
def db(path: str):
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA cache_size = -64000")  # 64 MB page cache
    try:
        yield conn
    finally:
        conn.close()


def store_fingerprint(conn, youtube_id: str, duration: int, hashes: list[int]):
    cur = conn.execute(
        "INSERT INTO video_fingerprints(youtube_id, duration_s, frame_count) "
        "VALUES (?, ?, ?) RETURNING video_id",
        (youtube_id, duration, len(hashes)),
    )
    video_id = cur.fetchone()[0]
    conn.executemany(
        "INSERT INTO fingerprint_hashes VALUES (?, ?, ?, ?)",
        [
            (video_id, i, (h >> 48) & 0xFFFF, h)
            for i, h in enumerate(hashes)
        ],
    )

The hash_bucket column is the prefilter. It stores the top 16 bits of each 64-bit hash. When you want to find candidates similar to a query hash, you do not compare against every fingerprint in the database — you first filter to rows where hash_bucket matches one of a small set of near buckets, then compute Hamming distance only on that subset.

This bucketing trick is what makes the query practical at 1M+ stored fingerprints. Without it you are popcounting your way through every row.

The Lookup Query

Here is the Hamming-distance lookup. SQLite does not natively expose popcount, but you can register it as a Python function at connection time. For very hot paths I have a Go service that loads the index into memory and serves nearest-neighbour queries over a Unix socket — but the SQLite path is what runs by default, because operational simplicity beats clever 95% of the time.

def register_hamming(conn: sqlite3.Connection):
    conn.create_function(
        "hamming",
        2,
        lambda a, b: (a ^ b).bit_count(),
        deterministic=True,
    )


def find_duplicates(conn, query_hashes: list[int], max_dist: int = 8):
    """Return video_ids whose fingerprint overlaps query in >= 3 frames within dist."""
    register_hamming(conn)

    buckets = {(h >> 48) & 0xFFFF for h in query_hashes}
    placeholders = ",".join("?" * len(buckets))
    candidates = conn.execute(
        f"SELECT DISTINCT video_id FROM fingerprint_hashes "
        f"WHERE hash_bucket IN ({placeholders})",
        list(buckets),
    ).fetchall()

    if not candidates:
        return []

    matches = []
    for (vid,) in candidates:
        stored = [
            r[0] for r in conn.execute(
                "SELECT hash_full FROM fingerprint_hashes WHERE video_id = ?",
                (vid,),
            ).fetchall()
        ]
        hits = sum(
            1 for qh in query_hashes
            if any((qh ^ sh).bit_count() <= max_dist for sh in stored)
        )
        if hits >= 3:
            matches.append((vid, hits))

    return sorted(matches, key=lambda x: -x[1])

The "at least 3 matching frames within Hamming 8" threshold came from manual tuning on a labelled corpus. Anything looser produced false positives on visually similar but distinct content (e.g. two different aerial drone shots of the same Alps region). Anything tighter missed re-encodes with substantial cropping.

Pushing the Decision to the Edge

For the European market, latency from origin to user matters even more than usual — our viral-feed JSON endpoint is hit hard from cold regions like Northern Scandinavia and the Iberian peninsula. We use Cloudflare Workers in front of LiteSpeed to short-circuit the obvious cases: if the incoming video ID is already in the worker's KV-backed dedup map, we serve the canonical ID immediately without touching origin.

This is the worker. It is intentionally minimal — the heavy lifting still happens in the Python pipeline. The worker just propagates known-duplicate decisions:

// Cloudflare Worker: short-circuit dedup decisions at the edge
export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const ytId = url.searchParams.get("yt");

    if (!ytId || !/^[A-Za-z0-9_-]{11}$/.test(ytId)) {
      return new Response("bad request", { status: 400 });
    }

    // KV map: youtube_id -> canonical_video_id, written by the Python
    // pipeline via the Cloudflare API after each dedup decision.
    const canonical = await env.DEDUP.get(ytId);

    if (canonical) {
      // GDPR-safe: no user data in the redirect, no fingerprinting cookie
      return Response.redirect(
        `https://viralvidvault.com/watch?v=${canonical}`,
        302,
      );
    }

    // Pass through to origin (LiteSpeed) for first-time IDs
    return fetch(request);
  },
};

The Python pipeline writes to KV when a new dedup decision is made — typically when ingestion processes a freshly fetched YouTube video and finds it matches an existing fingerprint cluster. This keeps the edge cache eventually consistent without us ever needing to invalidate it; the canonical mapping is monotonic for the lifetime of a video.

The GDPR Bit Most People Skip

Here is something I had to learn the hard way during our DPO review: a perceptual hash of a video that contains identifiable people is, under a strict reading of GDPR Article 4(1), arguably personal data. Not because the hash itself is reversible — pHash is not — but because the hash can be used to single out a content item that contains personal data, and the linkage between users who view that content and the hash becomes the privacy surface.

What this meant in practice:

Fingerprints are stored only against youtube_id, never against viewer identifiers.
The dedup decision log retains hashes for 90 days, then truncates the hash column while keeping the cluster ID.
Cluster membership is exposed in our admin tools but never returned in public APIs.
The Cloudflare Worker access log is sampled at 0.1% with IPs truncated to /24 (IPv4) or /48 (IPv6), per our DPA template.

None of this affects accuracy. It just means the system is auditable when a user exercises an Article 15 access request.

Operational Gotchas

A few things that bit me in the first month of running this in production:

OpenCV's cv2.CAP_PROP_POS_MSEC lies on some MP4s. For VFR (variable frame rate) videos common on Instagram exports, the actual frame returned can be several seconds off from the requested timestamp. Workaround: validate by reading the actual returned timestamp via cap.get(cv2.CAP_PROP_POS_MSEC) after each read and reject samples that drift more than 500ms.
SQLite WAL files grow unbounded if there is an idle long-running reader. The Python worker that batch-processes fingerprints used to hold a read transaction for the entire batch. Fix: batch in chunks of 1,000 hashes and commit/release between chunks, otherwise the WAL hits multi-GB and checkpoint thrashes.
Bucket collisions on aesthetically similar videos. The 16-bit prefilter buckets are densely populated for content categories with strong visual conventions (e.g. cooking videos with overhead shots on white surfaces). For such categories the candidate set per query can balloon to 50k+ rows. Mitigation: a second-level bucket on the median-frame hash mid-bits.
OpenCV decoder leaks file descriptors on truncated MP4s. This crashed the worker after ~4,000 ingestion attempts the first week. Wrap VideoCapture in a context manager and explicitly check cap.isOpened() before reading.

Measuring What You Built

Do not ship perceptual hashing without a labelled evaluation set. Mine has 2,400 pairs split into:

800 known duplicates (re-encodes, watermark crops, intro stings, mirrored)
800 visually similar but distinct (same creator, same scene type)
800 unrelated (random pairs from the corpus)

The metrics I track:

True positive rate at the production threshold (≥3 frames, Hamming ≤ 8): currently 94.2%.
False positive rate on the "similar but distinct" set: 0.6%.
Median lookup latency against a 1.1M-row fingerprint table: 18ms p50, 41ms p95.
Fingerprint generation throughput per worker: 1.2 videos/second on an 8-core LiteSpeed box.

When I changed the Hamming threshold from 6 to 8 last month, the TPR climbed from 87% to 94% but the FPR climbed from 0.2% to 0.6%. That was a deliberate tradeoff — false positives on viral feeds get caught by the moderator review queue, but false negatives ship straight to the homepage and embarrass us.

Where Perceptual Hashing Stops Working

Be honest about the limits. pHash struggles with:

Heavy stylization — same content with a cartoon filter or color inversion typically diverges far enough to break.
Time-shifted re-edits — if a re-uploader cuts the first 30 seconds and adds 30 seconds of intro, the frame-overlap count drops below the 3-match threshold.
Content montages — when one viral video is composed of clips from five others, frame matching catches only the segments, not the composite identity.

For these we layer additional signals — audio fingerprinting via Chromaprint, OCR on overlay text, channel-relationship graph features — but those are separate pipelines. The pHash stage is just the cheap, high-precision first filter.

Closing Thoughts

Perceptual hashing is not a magic dedup button, but as the cheap first stage in a video pipeline it earns its place quickly. The math is straightforward (DCT plus median threshold), the data structure is boring (a SQLite table with one prefilter index), and the operational surface is small enough that a single engineer can own the entire path from frame decode to edge cache propagation. On the European viral feed this single change cut moderator review load by 38% and improved homepage freshness because we stopped showing the same TikTok mirror three times in a row.

If you are building anything that ingests user-supplied or scraped video, build this first. Audio fingerprinting and ML-based content embeddings are great, but they are the second and third filters, not the first. Get pHash right, store it in something simple like SQLite WAL, and put a Cloudflare Worker in front of it to absorb the obvious cases. The rest of your pipeline gets to be smarter because this layer is dumber.

DEV Community