ahmet gedik

Posted on May 26

Building a Video Deduplication Pipeline with Perceptual Hashing in Python

#python #video #computervision #sqlite

Last quarter our TopVideoHub aggregator pulled 47,000 trending videos from YouTube across nine APAC regions, and roughly 11% of them were duplicates we couldn't catch with content IDs. A Japanese cooking clip would surface in JP trending with one title, then resurface in TW trending as a reupload with Mandarin subtitles burned in, then again in HK with the channel watermark cropped. Same video, three different YouTube IDs, three different thumbnails, three different titles. SHA-256 of the thumbnail says they're all different files. Eye-balling them says they're obviously the same content.

That's the gap perceptual hashing fills. Instead of comparing bits, it compares structure — a video's perceptual fingerprint stays roughly the same even after re-encoding, mild cropping, watermarking, or color shifts. Two videos hash to two 64-bit integers; if their Hamming distance is under some threshold, they're the same content.

This article walks through the production pipeline we actually ship: extracting frames, computing pHash + dHash, indexing for fast lookup, and integrating with our SQLite metadata store. Everything below runs against real YouTube thumbnails from our videos table.

Why Not MD5 or SHA

Cryptographic hashes are useless for this. They're designed so flipping one bit produces a completely different output — that's the avalanche property. For deduplication that's the opposite of what we want. A reuploaded video with one different pixel in the encoder output gets a totally different SHA. The hash doesn't degrade gracefully.

Perceptual hashes degrade gracefully. The two main families:

pHash (DCT-based) — Discrete Cosine Transform on a downscaled grayscale image, threshold the DCT coefficients against their median. Robust to compression, mild rotation, watermarks.
dHash (gradient-based) — Compute brightness differences between adjacent pixels in a downscaled image. Fast, robust to gamma and brightness shifts, weaker against horizontal flips.

We run both. pHash catches re-encodings and watermarked reuploads, dHash catches the cases where pHash gets confused by aggressive overlays (subtitle burn-ins are the worst offender across our TW and KR regions).

For video, we don't hash the whole stream — we sample frames at fixed timestamps and hash each. A 10-minute video becomes a sequence of ~10 perceptual hashes. Two videos match if enough of their frame hashes are close.

Setting Up the Frame Extractor

We use ffmpeg via subprocess for frame extraction and imagehash for the actual perceptual hashing. The dependency list:

pip install ffmpeg-python imagehash Pillow numpy faiss-cpu

Frame extraction at fixed intervals. We grab ten evenly-spaced frames across the duration — enough to cover the structure without ballooning storage:

import subprocess
import tempfile
import os
from PIL import Image
import imagehash

def extract_frames(video_path: str, num_frames: int = 10) -> list[Image.Image]:
    """Extract num_frames evenly-spaced frames from a video file."""
    probe = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_entries', 'format=duration',
         '-of', 'default=noprint_wrappers=1:nokey=1', video_path],
        capture_output=True, text=True, check=True,
    )
    duration = float(probe.stdout.strip())

    frames = []
    with tempfile.TemporaryDirectory() as tmpdir:
        for i in range(num_frames):
            timestamp = duration * (i + 0.5) / num_frames
            output = os.path.join(tmpdir, f'frame_{i:03d}.jpg')
            subprocess.run(
                ['ffmpeg', '-ss', f'{timestamp:.3f}', '-i', video_path,
                 '-vframes', '1', '-q:v', '2', '-y', output],
                capture_output=True, check=True,
            )
            frames.append(Image.open(output).copy())
    return frames


def compute_video_hashes(video_path: str) -> dict:
    frames = extract_frames(video_path)
    return {
        'phash': [str(imagehash.phash(f, hash_size=16)) for f in frames],
        'dhash': [str(imagehash.dhash(f, hash_size=16)) for f in frames],
        'frame_count': len(frames),
    }

A few things worth calling out:

hash_size=16 gives us 256-bit hashes (a 16×16 grid). Default is 8 (64-bit). The bigger hash is more discriminative — false positive rate drops, but storage and comparison cost go up. For our scale (low hundreds of thousands of videos) 256-bit is comfortable.
The -ss flag before -i is a fast seek. Putting it after -i makes ffmpeg decode every frame up to the timestamp. With the flag before -i, ffmpeg seeks to the nearest keyframe — slightly less accurate but ~100x faster for long videos.
We probe duration before extraction. Some YouTube downloads have wonky metadata where nb_frames is wrong; reading format=duration is reliable.

Hashing Just the Thumbnail When That's All You Have

For a YouTube aggregator we rarely have the actual video file — we have the thumbnail URL and metadata. The thumbnail alone is surprisingly effective. Roughly 78% of our duplicate detections come from thumbnail-only hashing, because reuploaders usually keep the same custom thumbnail or use the same auto-generated frame.

import requests
import numpy as np
from io import BytesIO
from PIL import Image
import imagehash

def trim_letterbox(img: Image.Image, threshold: int = 16) -> Image.Image:
    """Crop solid black or white bars from the edges."""
    arr = np.array(img.convert('L'))
    h, w = arr.shape
    row_active = arr.std(axis=1) > threshold
    col_active = arr.std(axis=0) > threshold
    if not (row_active.any() and col_active.any()):
        return img
    top = int(row_active.argmax())
    bot = h - int(row_active[::-1].argmax())
    left = int(col_active.argmax())
    right = w - int(col_active[::-1].argmax())
    if (bot - top) > h * 0.3 and (right - left) > w * 0.3:
        return img.crop((left, top, right, bot))
    return img


def hash_thumbnail(url: str) -> tuple[str, str] | None:
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        img = Image.open(BytesIO(resp.content))
        img = trim_letterbox(img)
        return (
            str(imagehash.phash(img, hash_size=16)),
            str(imagehash.dhash(img, hash_size=16)),
        )
    except Exception as e:
        print(f'hash failed for {url}: {e}')
        return None

The letterbox trim matters more than you'd expect. A bunch of Korean variety show clips get reuploaded to TW channels with thick black bars added top and bottom for vertical aspect display. Without the trim, pHash gives them a Hamming distance of ~24 — over our match threshold. With the trim, distance drops to ~4. Same content, correctly clustered.

Storing Hashes in SQLite

Our metadata store is SQLite with FTS5 for CJK search. Hashes live in a sidecar table keyed by video_id:

CREATE TABLE IF NOT EXISTS video_phash (
    video_id TEXT PRIMARY KEY,
    phash_thumb BLOB NOT NULL,
    dhash_thumb BLOB NOT NULL,
    phash_frames BLOB,
    dhash_frames BLOB,
    hashed_at INTEGER NOT NULL,
    FOREIGN KEY (video_id) REFERENCES videos(video_id) ON DELETE CASCADE
);

CREATE INDEX IF NOT EXISTS idx_phash_hashed_at ON video_phash(hashed_at);

Storing hashes as raw BLOBs instead of hex strings cuts table size in half and makes XOR-based Hamming distance computation cheaper. The imagehash library returns ImageHash objects whose .hash is a numpy boolean array — pack to bytes:

import numpy as np
import imagehash

def hash_to_bytes(h: imagehash.ImageHash) -> bytes:
    return np.packbits(h.hash.flatten()).tobytes()


def bytes_to_hash(b: bytes, hash_size: int = 16) -> imagehash.ImageHash:
    bits = np.unpackbits(np.frombuffer(b, dtype=np.uint8))
    grid = bits[:hash_size * hash_size].reshape(hash_size, hash_size).astype(bool)
    return imagehash.ImageHash(grid)

For pHash with hash_size=16 that's exactly 32 bytes per hash. A whole video's frame sequence (10 frames × 32 bytes) fits in 320 bytes. At 500,000 videos that's 160 MB of hash data — comfortable inside a single SQLite file.

Fast Lookup with FAISS

Naive comparison is O(N) per query. For 500k videos that's 500k XOR ops plus popcount — fast enough in pure C at maybe 50ms, but you're calling from Python and doing it for every new fetch. We push it to FAISS, which does binary-hash similarity search in C++ with SIMD:

import faiss
import numpy as np
import sqlite3

class PHashIndex:
    def __init__(self, hash_bits: int = 256):
        self.hash_bits = hash_bits
        self.index = faiss.IndexBinaryFlat(hash_bits)
        self.id_map: list[str] = []

    def build_from_db(self, db_path: str):
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            'SELECT video_id, phash_thumb FROM video_phash'
        ).fetchall()
        conn.close()
        if not rows:
            return
        vectors = np.vstack([
            np.frombuffer(blob, dtype=np.uint8) for _, blob in rows
        ])
        self.index.add(vectors)
        self.id_map = [vid for vid, _ in rows]

    def query(self, hash_bytes: bytes, k: int = 5, max_distance: int = 12):
        query_vec = np.frombuffer(hash_bytes, dtype=np.uint8).reshape(1, -1)
        distances, indices = self.index.search(query_vec, k)
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1 or dist > max_distance:
                continue
            results.append((self.id_map[idx], int(dist)))
        return results

IndexBinaryFlat does exact (not approximate) search using SIMD-accelerated XOR + popcount. On our box it processes 500k comparisons in ~3ms. For 5M+ videos we'd switch to IndexBinaryHNSW and trade exactness for sub-millisecond approximate search.

The threshold matters. For 256-bit hashes:

distance ≤ 4 — essentially the same image (re-encoded, mild quality loss)
distance ≤ 12 — same content with watermark, crop, or overlay differences
distance ≤ 20 — similar content but too loose, false positives dominate
distance ≥ 32 — unrelated; random pairs sit around 128 on average

We use 12 as the production threshold and accept the rare false positive — manual spot-checks show we'd lose more real duplicates if we tightened it.

Integrating with the Fetch Pipeline

Our YouTube fetcher pulls trending lists per region and writes to videos. We extended it to compute thumbnail hashes during the same pass and check against the index before marking a row visible:

import time
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def process_new_videos(api_results: list[dict], index: PHashIndex, db_path: str):
    def hash_one(item):
        thumb_url = item['snippet']['thumbnails']['high']['url']
        hashes = hash_thumbnail(thumb_url)
        if not hashes:
            return None
        phash_bytes = hash_to_bytes(imagehash.hex_to_hash(hashes[0]))
        dhash_bytes = hash_to_bytes(imagehash.hex_to_hash(hashes[1]))
        matches = index.query(phash_bytes, k=3, max_distance=12)
        return {
            'video_id': item['id'],
            'phash': phash_bytes,
            'dhash': dhash_bytes,
            'duplicate_of': matches[0][0] if matches else None,
            'duplicate_distance': matches[0][1] if matches else None,
        }

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(hash_one, api_results))

    conn = sqlite3.connect(db_path)
    conn.execute('PRAGMA journal_mode=WAL')
    conn.execute('PRAGMA busy_timeout=5000')
    now = int(time.time())
    for r in results:
        if r is None:
            continue
        if r['duplicate_of']:
            conn.execute(
                'UPDATE videos SET duplicate_of=?, dedup_distance=? WHERE video_id=?',
                (r['duplicate_of'], r['duplicate_distance'], r['video_id'])
            )
        else:
            conn.execute(
                '''INSERT OR REPLACE INTO video_phash
                   (video_id, phash_thumb, dhash_thumb, hashed_at)
                   VALUES (?, ?, ?, ?)''',
                (r['video_id'], r['phash'], r['dhash'], now)
            )
            vec = np.frombuffer(r['phash'], dtype=np.uint8).reshape(1, -1)
            index.index.add(vec)
            index.id_map.append(r['video_id'])
    conn.commit()
    conn.close()

The duplicate_of column lets us keep duplicate rows in the database for analytics — we just exclude them from the user-facing trending list with a WHERE duplicate_of IS NULL clause in the PHP renderer. That preserves history without polluting the feed, and the LiteSpeed page cache layer in front of it never sees the hidden rows.

A Note on Concurrency and SQLite

We hash thumbnails in a thread pool (max_workers=8) because the work is I/O-bound — pulling thumbnail JPEGs from YouTube's CDN. The hashing itself is fast (~5ms per thumbnail on a 4-vCPU box). But concurrent SQLite writes can deadlock with the default journal mode. WAL fixes it:

journal_mode=WAL allows one writer plus many readers without blocking the whole file
synchronous=NORMAL is safe under WAL and roughly 3x faster than FULL
busy_timeout=5000 makes writers retry instead of erroring under brief contention

With our pipeline doing a write every ~50ms during a fetch burst, this is the difference between completing the run and deadlocking after thirty seconds.

What the Numbers Look Like

Our trending fetch hits 9 APAC regions every 4 hours and pulls roughly 450 videos per region — 4,050 candidates per cycle. Before pHash, after dedupe by video_id alone we still had:

~11% effective duplicates (same content, different YouTube IDs) — caught now
~3% near-duplicates (clip from a longer video, recompressed shorts) — caught at distance ≤ 8
1-2% false positives at distance ≤ 12 (legitimately different videos with very similar thumbnails — usually news segments from the same broadcaster)

The hash table is currently 142 MB across 478k rows. Average FAISS query is 2.4ms. Total per-fetch overhead from the dedup pass is around 14 seconds for 4,050 candidates — dominated by HTTP fetch of thumbnails, not hashing.

A couple of unexpected wins we got for free:

Spam channel detection. A channel re-uploading the same 30 videos under different titles every week lights up as a cluster of high-similarity entries. We auto-flag channels with more than five duplicate uploads to a single content cluster.
Cross-language matching. A Japanese vlog with a Japanese title gets reuploaded as a Mandarin-titled mirror. Our SQLite FTS5 with CJK tokenizer treats them as unrelated text, but pHash bridges them. We now surface "Also available in:" cross-language links by treating dedup clusters as a single canonical video with language variants — exactly the kind of feature an APAC aggregator should ship.

Edge Cases to Watch

A few things that bit us before we settled on the current pipeline:

Animated YouTube thumbnails. YouTube's auto-generated thumbnails sometimes pull a different frame on each fetch. The pHash of the same video can drift by 6-8 Hamming across fetches. Our threshold of 12 absorbs this, but a tighter threshold would create a self-collision problem.
Compilation videos. A 10-hour compilation includes clips that match individual videos perfectly. We don't want to mark the compilation as a duplicate of a single clip. The fix: also compare aspect ratio and duration from metadata, and only mark as duplicate when both hash distance and duration ratio are within bounds.
Black frames. Some uploaders pad the start of a video with 3-5 seconds of black. Frame-based hashing on the first frame returns mostly-black hashes that collide with thousands of other black-padded videos. We skip the first and last 5% of duration when sampling frames.
Index persistence. Building the FAISS index from scratch on every fetch is wasteful. We persist with faiss.write_index_binary after each batch and reload on startup. Cold-start build for 500k vectors is ~8 seconds; reload is 200ms.

Conclusion

Perceptual hashing isn't novel, but the integration story around it — frame sampling strategy, BLOB-packed SQLite storage, FAISS for the hot lookup path, threshold tuning that matches your actual content distribution — is where the engineering effort sits. For a video aggregator working across multiple Asian markets where the same content gets republished with different language overlays, this is the single highest-leverage piece of code I've shipped this year. Eleven percent fewer duplicates in trending means eleven percent more genuinely distinct content surfaced to users, and the entire dedup pass adds only fourteen seconds to a fetch that already takes minutes.

If you're building something similar, start with thumbnail-only hashing — it's 80% of the value at 5% of the complexity. Add frame sampling later when you have an actual case where thumbnail-only fails. Don't tune the threshold by eye; pick 50 known-duplicate pairs and 50 known-different pairs from your data, plot the distance distributions, and pick the threshold where the two histograms stop overlapping.

DEV Community