The surprisingly simple math behind image search, reverse image lookup, and duplicate detection.
Modern vision models can represent an image as a vector of numbers (often 512 dimensions) that captures its semantic meaning.
This is the principle behind visual search and duplicate detection in real-world systems.
## The Trick in 3 Steps

### 1. Turn images into fingerprints

A neural network (CLIP) looks at an image and outputs a vector of numbers (512 of them for common variants). Same image → nearly identical embedding, even after a resize or light edit.
```
# Input:  cat.jpg
# Output: [0.023, -0.156, 0.891, ... 512 numbers]
```
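Here is a minimal sketch of step 1 with the open_clip library, assuming the ViT-B-32 weights (whose image embeddings are 512-dimensional; the full snippet at the end of this post uses RN50 instead):

```python
import open_clip
import torch
from PIL import Image

# ViT-B-32 is a CLIP variant whose image embeddings have 512 dimensions.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model.eval()

# "cat.jpg" is a placeholder path; point it at any image you have.
tensor = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # shape: (1, 3, 224, 224)
with torch.inference_mode():
    emb = model.encode_image(tensor)
print(emb.shape)  # torch.Size([1, 512]): the image's 512-number fingerprint
```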
### 2. Compare fingerprints with a dot product
Two images are similar if their embedding vectors point in the same direction. We measure this with cosine similarity.
After normalizing the vectors, cosine similarity reduces to a simple dot product.
```
cat1.jpg → [0.1, 0.9, -0.3] ──┐
                              ├──→ 0.94 (94% similar) ✓
cat2.jpg → [0.2, 0.8, -0.2] ──┘

cat1.jpg → [0.1, 0.9, -0.3] ──┐
                              ├──→ 0.12 (12% similar) ✗
dog.jpg  → [-0.8, 0.1, 0.5] ──┘
```
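In code, step 2 is a couple of NumPy lines. A minimal sketch, reusing the toy 3-dimensional vectors from the diagram (real CLIP embeddings have hundreds of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Normalize each vector to unit length; the dot product of two
    # unit vectors is the cosine of the angle between them.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

cat1 = np.array([0.1, 0.9, -0.3])
cat2 = np.array([0.2, 0.8, -0.2])
dog = np.array([-0.8, 0.1, 0.5])

print(cosine_similarity(cat1, cat2))  # high: vectors point the same way
print(cosine_similarity(cat1, dog))   # low/negative: different directions
```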
### 3. Set a threshold
If similarity > 0.90: it's a match (the same photo, perhaps slightly cropped, resized, or filtered).
If similarity < 0.50: completely different content.

Exact thresholds depend on your dataset and model, but these are common starting points.
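As a sketch, the decision step is just two comparisons. The cutoffs below are the starting points above, and the label strings are made up for illustration:

```python
def classify(similarity: float) -> str:
    # Illustrative thresholds; tune them on your own data and model.
    if similarity > 0.90:
        return "match"      # same photo: slight crop, resize, or filter
    if similarity < 0.50:
        return "different"  # completely different content
    return "related"        # gray zone: similar subject, not the same image
```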
🚀 See it in action: https://image-similarity.balewgize.app/
## Why This Works
CLIP (Contrastive Language-Image Pre-training) was trained on 400 million image-text pairs. It learned that a photo of a "golden retriever on beach" should have similar numbers to "dog running on sand", even if the pixels don't match.
Traditional image comparison looks at pixels. Embeddings capture meaning.
This lets you search for images based on what they actually show, instead of relying on filenames or alt-text.
Embedding size depends on the model: common CLIP variants output 512 numbers, while larger ones use 768 or 1024 dimensions to capture finer semantic detail. (The RN50 model in the code below actually outputs 1024-dimensional embeddings.)
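Because CLIP embeds images and text into the same space, the same dot-product trick powers text-to-image search. A minimal sketch, assuming the ViT-B-32 weights and a few placeholder image paths:

```python
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Placeholder filenames; swap in your own images.
paths = ["beach_dog.jpg", "city_cat.jpg", "mountain.jpg"]

with torch.inference_mode():
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    img_emb = model.encode_image(batch)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    tokens = tokenizer(["dog running on sand"])
    txt_emb = model.encode_text(tokens)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

scores = (img_emb @ txt_emb.T).squeeze(1)  # one cosine similarity per image
print(paths[int(scores.argmax())])         # ideally the beach dog photo
```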
## What You Can Build
| Use Case | How |
|---|---|
| Duplicate detection | Find photos that are 95%+ similar |
| Reverse image search | Find visually similar images in a database |
| Content moderation | Detect near-duplicates of flagged images |
| Image clustering | Group photos by visual similarity |
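All four use cases reduce to the same primitive: a matrix of normalized embeddings and a dot product. A minimal reverse-image-search sketch, assuming `database` holds embeddings produced by a function like `_embed` from the snippet below (names here are illustrative):

```python
import numpy as np

def top_k_similar(query: np.ndarray, database: np.ndarray, k: int = 5):
    """Return indices and scores of the k most similar database rows.

    Assumes `query` (shape (d,)) and `database` (shape (n, d)) are
    L2-normalized, so the dot product equals cosine similarity.
    """
    scores = database @ query            # shape: (n,)
    idx = np.argsort(scores)[::-1][:k]   # highest similarity first
    return idx, scores[idx]

# Duplicate detection is the same search plus a threshold:
# flag every hit with a score of 0.95 or higher.
```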
## The Full Picture
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Image 1   │────→│  Embedding  │────→│             │
│             │     │ (512 dims)  │     │   Cosine    │
└─────────────┘     └─────────────┘     │ Similarity  │──→ 0.94 → MATCH
                                        │  (-1 to 1)  │
┌─────────────┐     ┌─────────────┐     │             │
│   Image 2   │────→│  Embedding  │────→│             │
│             │     │ (512 dims)  │     └─────────────┘
└─────────────┘     └─────────────┘
```
That’s it conceptually. The hard work is already baked into the model.
💻 Source Code: GitHub Repo Link
🌐 Live Demo: Demo Link
If you’re exploring similar problems, let’s connect.
## Code Snippet

Install: `pip install open-clip-torch opencv-python-headless pillow`
```python
import cv2
import numpy as np
import torch
import open_clip
from PIL import Image
from typing import Any, Optional, Tuple

_model: Optional[Any] = None
_preprocess: Optional[Any] = None
_device: Optional[str] = None


def _load_model() -> Tuple[Any, Any, str]:
    """Load the OpenCLIP model and preprocessing transform on CPU/GPU, caching both."""
    global _model, _preprocess, _device
    if _model is not None:
        if _preprocess is None or _device is None:
            raise RuntimeError("Model cache is incomplete")
        return _model, _preprocess, _device
    _device = "cuda" if torch.cuda.is_available() else "cpu"
    # RN50 is fast and lightweight (1024-dim embeddings);
    # ViT-based models give better quality at higher cost.
    _model, _, _preprocess = open_clip.create_model_and_transforms(
        "RN50", pretrained="openai"
    )
    # Defensive: handle versions where the transform slot is a (train, val) pair.
    if isinstance(_preprocess, (list, tuple)):
        _preprocess = _preprocess[-1]
    _model = _model.to(_device).eval()
    if _preprocess is None or _device is None:
        raise RuntimeError("Model initialization failed")
    return _model, _preprocess, _device


def _load_image(path: str) -> np.ndarray:
    """Read an image from disk as a BGR array (OpenCV's default channel order)."""
    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    return img


def _embed(img: np.ndarray) -> np.ndarray:
    """Compute an L2-normalized CLIP embedding for one image."""
    model, preprocess, device = _load_model()
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    tensor = preprocess(Image.fromarray(rgb)).unsqueeze(0).to(device)
    with torch.inference_mode():
        emb = model.encode_image(tensor)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy().flatten()


def compare(img1: str, img2: str) -> float:
    """Return the cosine similarity of two image files given their paths."""
    e1 = _embed(_load_image(img1))
    e2 = _embed(_load_image(img2))
    # Both embeddings are unit-length, so the dot product is cosine similarity.
    return float(np.dot(e1, e2))


if __name__ == "__main__":
    image1 = "path/to/image1.jpg"
    image2 = "path/to/image2.jpg"
    score = compare(image1, image2)
    print("Similarity score:", score)
```