DEV Community

Julien L for WiScale


Find duplicate photos on your machine in 0.3ms - no cloud, no GPU

You have 10,000 photos on your laptop.

Some are duplicates. Some are resized. Some are cropped versions of the same scene.

Your brain instantly knows they're related.

Your computer? It thinks they're completely different files.

That's the problem we're fixing today.

The usual approach is wrong

Most image search systems do this:

"Just use embeddings + cosine similarity"

It works... until it doesn't.

Because you're mixing two completely different problems:

  • "Is this the same image?" (pixel-level)
  • "Is this the same scene?" (semantic-level)

Those are not the same question.
And one metric can't answer both well.

The fix: stop using one metric

I split the problem in two.

A two-pass pipeline:

  • ⚡ Pass 1 - fast filter (Hamming distance)
  • 🧠 Pass 2 - semantic re-rank (CLIP + Euclidean)

Think of it like this:

The Bouncer and the Detective

Imagine a nightclub with very strict entry rules.

๐Ÿ•ถ๏ธ The Bouncer (Hamming distance)

He looks at you for one second. Decides: "close enough" or "no way." Super fast, but shallow. He catches obvious fakes instantly, but a good disguise fools him.

This is perceptual hashing (dHash): turns images into 256-bit fingerprints and compares them by counting differing bits. Runs in microseconds.

๐Ÿ” The Detective (CLIP + Euclidean)

For the few people the bouncer let through, the detective runs a thorough investigation. He understands what's really going on. Slower, but almost never wrong.

This is CLIP: maps images into a 512-dimensional semantic space and compares meaning using Euclidean distance.

Query image
     |
     v
[ Bouncer: Hamming ]  ← 0.04ms
     |
 shortlist (top 8)
     |
     v
[ Detective: CLIP ]   ← 0.27ms
     |
     v
Final ranking          Total: 0.31ms

Why this works (and single-metric systems fail)

Take a beach photo. Create two variants: a resize and a crop.

| Variant | Hamming (pixels) | CLIP (meaning) | Reality |
| --- | --- | --- | --- |
| Resized copy | 3 (nearly identical) | 0.44 (noticeable gap) | "Same image, just smaller" |
| Square crop | 71 (very different) | 0.15 (nearly identical) | "Same beach, different framing" |

What happens?

  • Hamming says: resized = same ✅ / cropped = different ❌
  • CLIP says: resized = kinda close / cropped = almost identical ✅

👉 Alone, both are wrong.
👉 Together, they're right.

What's actually happening under the hood

The Bouncer's trick: turning photos into barcodes

dHash turns any image into a binary barcode in three steps:

  1. Shrink the image to a 17x16 grayscale grid (one extra column, so every cell has a right-hand neighbor)
  2. Compare each cell to its right neighbor: "Is the neighbor brighter?" Yes = 1, No = 0
  3. Read off the answers: [1, 0, 1, 1, 0, 0, 1, ...] (16x16 = 256 bits)

Why is this brilliant? When you resize a photo, the dark squares stay darker than their neighbors. The barcode barely changes. Even JPEG compression or brightness tweaks leave it almost identical.
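The three steps above can be sketched by hand with Pillow and NumPy. This is a simplified stand-in for what `imagehash.dhash` does internally, not its exact implementation:

```python
import numpy as np
from PIL import Image

def dhash_bits(img: Image.Image, hash_size: int = 16) -> np.ndarray:
    """Simplified dHash: one gradient-direction bit per grid cell."""
    # One extra column so every cell has a right-hand neighbor to compare.
    g = img.convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
    px = np.asarray(g, dtype=np.int16)
    # Bit = "is the right neighbor brighter?" -> hash_size**2 bits
    return (px[:, 1:] > px[:, :-1]).flatten()

# Resizing barely changes the barcode: the gradient directions survive.
photo = Image.fromarray(
    np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1))  # synthetic gradient
)
bits_full = dhash_bits(photo)
bits_small = dhash_bits(photo.resize((32, 32)))
print(len(bits_full), int((bits_full != bits_small).sum()))  # 256 bits, tiny distance
```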

Comparing two barcodes = counting differing bits = Hamming distance.

This is insanely fast because CPUs have a dedicated instruction for it: popcount.

  • Score 0 → identical image
  • Score ~10 → near-duplicate
  • Score ~128 → random / unrelated

The Detective's map: CLIP as an "Idea Machine"

CLIP maps images into a semantic space. Think of it as a geographic map of meaning:

  • Photos of beaches cluster in the North
  • Photos of cities cluster in the South
  • Photos of dogs cluster in the East

Two photos close on the map "mean" the same thing, even if one is a drawing and the other a photograph.

We normalize vectors and use Euclidean distance.

Yes, cosine would work too. But Euclidean is: (1) easier to interpret (0 to 2 scale), and (2) SIMD-friendly via fma instructions, making it slightly faster in VelesDB®'s HNSW index.

  • Score 0 → identical meaning
  • Score < 0.5 → strong semantic match
  • Score > 1.0 → different content
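On unit-normalized vectors the two metrics carry the same information: squared Euclidean distance is a monotone function of cosine similarity, which is also why the scores live on a 0-to-2 scale. A quick check:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)  # unit-normalize, as the CLIP features are here
b /= np.linalg.norm(b)

cos_sim = float(a @ b)
dist = float(np.linalg.norm(a - b))

# For unit vectors: ||a - b||^2 == 2 * (1 - cos), so dist ranges from 0 to 2
assert abs(dist**2 - 2 * (1 - cos_sim)) < 1e-9
print(round(dist, 3), round(cos_sim, 3))
```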

Setup (local-first, no cloud)

pip install velesdb imagehash open-clip-torch Pillow

No Docker. No API keys. No external calls.

pip install velesdb downloads a pre-built wheel with the compiled Rust binary included. Nothing else to install.

Everything runs locally. Your photos never leave your machine.

Step 1: give the Bouncer his barcodes

import os
import time
import imagehash
from PIL import Image
import velesdb

PHOTO_DIR = "./my_photos"
HASH_SIZE = 16  # 16x16 grid = 256-bit barcode

db = velesdb.Database("./image_search_db")

def compute_barcode(img_path, hash_size=HASH_SIZE):
    """Turn an image into a binary barcode for the Bouncer."""
    img = Image.open(img_path)
    h = imagehash.dhash(img, hash_size=hash_size)
    return [float(b) for b in h.hash.flatten()]

bouncer = db.get_or_create_collection(
    "perceptual_hashes",
    dimension=HASH_SIZE * HASH_SIZE,  # 256
    metric="hamming"
)

photos = sorted(
    f for f in os.listdir(PHOTO_DIR)
    if f.lower().endswith((".jpg", ".jpeg", ".png", ".webp"))
)

t0 = time.time()
for i, filename in enumerate(photos):
    path = os.path.join(PHOTO_DIR, filename)
    bouncer.upsert(
        i + 1,
        vector=compute_barcode(path),
        payload={"filename": filename, "path": path}
    )
print(f"Indexed {len(photos)} barcodes in {(time.time()-t0)*1000:.1f}ms")

The key line: metric="hamming".

This tells VelesDB the vectors are binary and should be compared with popcount, accelerated by SIMD instructions (AVX-512, AVX2, or NEON depending on your CPU).

Step 2: give the Detective his map

import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

def compute_meaning(img_path):
    """Place an image on the Detective's map of meaning."""
    img = Image.open(img_path).convert("RGB")
    with torch.no_grad():
        tensor = preprocess(img).unsqueeze(0)
        features = model.encode_image(tensor)
        features /= features.norm(dim=-1, keepdim=True)
        return features.squeeze().numpy().tolist()

detective = db.get_or_create_collection(
    "clip_features",
    dimension=512,
    metric="euclidean"
)

t0 = time.time()
for i, filename in enumerate(photos):
    path = os.path.join(PHOTO_DIR, filename)
    detective.upsert(
        i + 1,
        vector=compute_meaning(path),
        payload={"filename": filename, "path": path}
    )
print(f"Indexed {len(photos)} meanings in {time.time()-t0:.1f}s")

Same database, different metric, different index. VelesDB handles the routing internally.

CLIP indexing is slower (~91ms/image on CPU) because it runs a vision transformer. But this is a one-time cost. Once indexed, searches are sub-millisecond.

Step 3: the two-pass search

This is the part most people get wrong.

Most implementations do this ❌:

  • Run CLIP on the entire library, or
  • Run one CLIP query per candidate (O(n) searches)

That kills performance.

The correct pattern ✅: one CLIP query, then join the scores.

def find_similar(query_path, shortlist_k=8, final_k=5):
    """Two-pass search: Bouncer filters, Detective re-ranks."""

    # --- Pass 1: The Bouncer (instant) ---
    query_barcode = compute_barcode(query_path)
    t0 = time.time()
    fast_candidates = bouncer.search(vector=query_barcode, top_k=shortlist_k)
    bouncer_ms = (time.time() - t0) * 1000

    # --- Pass 2: The Detective (thorough) ---
    # ONE single CLIP query. Not N. This is what makes it scale.
    query_meaning = compute_meaning(query_path)
    t0 = time.time()
    all_meanings = detective.search(
        vector=query_meaning, top_k=shortlist_k * 2
    )
    meaning_scores = {r["id"]: r["score"] for r in all_meanings}

    # Re-rank the Bouncer's shortlist with the Detective's scores
    reranked = []
    for c in fast_candidates:
        reranked.append({
            "filename": c["payload"]["filename"],
            "bouncer": c["score"],
            "detective": meaning_scores.get(c["id"], float("inf")),
        })
    reranked.sort(key=lambda x: x["detective"])
    detective_ms = (time.time() - t0) * 1000

    print(f"Bouncer: {bouncer_ms:.2f}ms | Detective: {detective_ms:.2f}ms | "
          f"Total: {bouncer_ms + detective_ms:.2f}ms")
    return reranked[:final_k]

results = find_similar("./my_photos/beach_1.jpg")

Output:

Bouncer: 0.04ms | Detective: 0.27ms | Total: 0.31ms

Reading the scores

| Image | Bouncer | Detective | Human verdict |
| --- | --- | --- | --- |
| beach_1.jpg | 0 | 0.0000 | "That IS the original." |
| beach_1_square.jpg | 71 | 0.1459 | "Same beach, different crop." |
| beach_1_small.jpg | 3 | 0.4418 | "Same photo, just shrunk." |
| city_2.jpg | 113 | 0.5636 | "Different photo entirely." |
| flowers_1.jpg | 106 | 0.9495 | "Not even the same subject." |

Look at the re-ordering. With the Bouncer alone, beach_1_small would be #2 (only 3 bits different). But the Detective promotes beach_1_square to #2, because semantically, that cropped beach is much closer to the original.

The Bouncer catches the easy cases. The Detective catches the hard ones.

Performance at scale

Indexing has two very different costs:

| Library size | Bouncer (dHash) | Detective (CLIP, CPU) | Detective (CLIP, GPU) | Search |
| --- | --- | --- | --- | --- |
| 1,000 images | < 1s | ~90s | ~5s | < 1ms |
| 10,000 images | ~3s | ~15min | ~50s | ~1ms |
| 100,000 images | ~30s | ~2.5h | ~8min | ~2ms |

The Bouncer is nearly free. The Detective (CLIP ViT-B-32) is the bottleneck: ~91ms/image on CPU, ~5ms/image on a mid-range GPU.

Searches stay sub-millisecond regardless of library size, thanks to VelesDB's HNSW index with SIMD acceleration.

Pro tip: You don't have to index everything with CLIP upfront. Index only barcodes (instant), search with the Bouncer (< 0.1ms), then compute CLIP only for the 8-10 shortlisted candidates. Total cost: one neural network call instead of N. This is how you scale to millions of images without a GPU.

The real insight: this applies everywhere

This is not just about images.

It's a general pattern: fast approximate filter + slow precise ranking.

You can reuse it everywhere:

  • Code - MinHash (Jaccard) + embeddings
  • Documents - SimHash + semantic search
  • Audio - fingerprints + audio embeddings
  • Recommendations - rules + dense vectors

VelesDB supports hamming, jaccard, euclidean, cosine, and dot as distance metrics. Mix them in the same database.
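The pattern itself fits in a few lines. A toy sketch, with illustrative stand-in scoring functions rather than dHash or CLIP:

```python
def two_pass_search(query, items, cheap, expensive, shortlist_k=8, final_k=5):
    """Generic filter-then-rerank: cheap() prunes, expensive() decides."""
    shortlist = sorted(items, key=lambda x: cheap(query, x))[:shortlist_k]
    return sorted(shortlist, key=lambda x: expensive(query, x))[:final_k]

def jaccard_distance(q, d):
    """Expensive stand-in: 1 minus token-set overlap."""
    a, b = set(q.split()), set(d.split())
    return 1 - len(a & b) / len(a | b)

docs = ["the beach at dawn", "a beach at dusk", "city skyline at night"]
hits = two_pass_search(
    "the beach at dusk", docs,
    cheap=lambda q, d: abs(len(q) - len(d)),  # crude, fast length filter
    expensive=jaccard_distance,
    final_k=2,
)
print(hits)  # the two beach sentences outrank the city one
```

The shape is always the same: only the `cheap` and `expensive` callables change per domain.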

Privacy: nothing leaves your machine

This is not a cloud service. VelesDB runs as a ~3MB local binary. Your photos are indexed locally, searched locally, and never uploaded anywhere. No API keys, no external calls, no tracking.

Your photo library is nobody's business but yours.

Getting started

  1. VelesDB® on GitHub - source-available under Elastic License 2.0
  2. Code from the article

A star on the repo helps other developers find the project.

TL;DR

  • Hamming = fast, pixel-level
  • CLIP = slower, semantic
  • Combine both = best of both worlds
  • 0.3ms search, local-first, better results than cosine alone

👉 Stop asking: "What's the best similarity metric?"
👉 Start asking: "What combination of metrics solves my problem?"


What combination would you try? Hamming + Cosine for text? Jaccard + Euclidean for recommendations? Drop a comment below.
