Alemnew Marie
Your Photos Are Just 512 Numbers

The surprisingly simple math behind image search, reverse image lookup, and duplicate detection.

Modern vision models can represent an image as a vector of numbers (often 512 dimensions) that captures its semantic meaning.

This is the principle behind visual search and duplicate detection in real-world systems.

The Trick in 3 Steps

1. Turn images into fingerprints

A neural network (CLIP) looks at an image and outputs 512 numbers. The same image, even after a slight crop or resize, produces a nearly identical embedding.

# Input: cat.jpg
# Output: [0.023, -0.156, 0.891, ... 512 numbers]
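
Here is a minimal sketch of this step using open_clip, the same library as the full snippet at the end of this post (cat.jpg is a placeholder path):

import torch
import open_clip
from PIL import Image

# ViT-B-32 produces the 512-dimensional embeddings described above.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
model.eval()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # shape: [1, 3, 224, 224]
with torch.inference_mode():
    embedding = model.encode_image(image)

print(embedding.shape)  # torch.Size([1, 512])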

2. Compare fingerprints with dot product

Two images are similar if their embedding vectors point in the same direction. We measure this with cosine similarity.

After normalizing the vectors, cosine similarity reduces to a simple dot product.

cat1.jpg  →  [0.1, 0.9, -0.3] ──┐
                                ├──→ 0.99 → similar ✓
cat2.jpg  →  [0.2, 0.8, -0.2] ──┘

cat1.jpg  →  [0.1, 0.9, -0.3] ──┐
                                ├──→ -0.15 → different ✗
dog.jpg   →  [-0.8, 0.1, 0.5] ──┘

(Toy 3-dimensional vectors for illustration; real embeddings have 512 components.)
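
You can check the idea with plain numpy, using the toy vectors from the diagram:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalize both vectors; cosine similarity is then just a dot product."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

cat1 = np.array([0.1, 0.9, -0.3])
cat2 = np.array([0.2, 0.8, -0.2])
dog = np.array([-0.8, 0.1, 0.5])

print(cosine_similarity(cat1, cat2))  # ≈ 0.99: pointing the same way
print(cosine_similarity(cat1, dog))   # ≈ -0.15: pointing apart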

3. Set a threshold

If similarity > 0.90: it's a match. The same photo, perhaps with a slight crop, resize, or filter.

If similarity < 0.50: completely different content.

Exact thresholds depend on your dataset and model, but these are common starting points.
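
In code, the decision rule is just a couple of comparisons. A toy sketch, using the starting-point cutoffs above (not universal constants):

def classify(similarity: float) -> str:
    """Toy decision rule; tune the cutoffs on your own data."""
    if similarity > 0.90:
        return "match: same photo, slight crop, resize, or filter"
    if similarity < 0.50:
        return "different content"
    return "related, but not the same image"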

🚀 See it in action: https://image-similarity.balewgize.app/

Why This Works

CLIP (Contrastive Language-Image Pre-training) was trained on 400 million image-text pairs. It learned that a photo of a "golden retriever on beach" should have similar numbers to "dog running on sand" - even if pixels don't match.

Traditional image comparison looks at pixels. Embeddings capture meaning.

This lets you search for images based on what they actually show, instead of relying on filenames or alt-text.

While this model uses 512 numbers, larger models may use 768 or 1024 dimensions to capture finer semantic detail.
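
Because CLIP embeds text and images into the same space, the same trick gives you text-to-image search. A minimal sketch (the query string is just an example):

import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

with torch.inference_mode():
    tokens = tokenizer(["dog running on sand"])
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Dot this against your normalized image embeddings, exactly as in step 2.
# The highest-scoring images are your search results.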

What You Can Build

Use case               How
Duplicate detection    Find photos that are 95%+ similar (see the sketch below)
Reverse image search   Find visually similar images in a database
Content moderation     Detect near-duplicates of flagged images
Image clustering       Group photos by visual similarity
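
As an example, here is a sketch of the duplicate-detection row. It reuses the _embed and _load_image helpers from the full code snippet at the end of this post; the folder layout and threshold are just assumptions:

from itertools import combinations
from pathlib import Path

import numpy as np

def find_duplicates(folder: str, threshold: float = 0.95):
    """Pairwise duplicate scan; O(n^2), fine for small collections."""
    paths = sorted(Path(folder).glob("*.jpg"))
    # _embed / _load_image are defined in the snippet below.
    embeddings = {p: _embed(_load_image(str(p))) for p in paths}
    for a, b in combinations(paths, 2):
        score = float(np.dot(embeddings[a], embeddings[b]))
        if score >= threshold:
            yield a.name, b.name, score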

The Full Picture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Image 1   │────→│  Embedding  │────→│             │
│             │     │ (512 dims)  │     │   Cosine    │
└─────────────┘     └─────────────┘     │  Similarity │──→ 0.94 → MATCH
                                        │  (-1 to 1)  │
┌─────────────┐     ┌─────────────┐     │             │
│   Image 2   │────→│  Embedding  │────→│             │
│             │     │ (512 dims)  │     └─────────────┘
└─────────────┘     └─────────────┘

That’s it conceptually. The hard work is already baked into the model.

💻 Source Code: GitHub Repo Link

🌐 Live Demo: Demo Link

If you’re exploring similar problems, let’s connect.


Code snippet

Install: pip install open-clip-torch opencv-python-headless pillow

import cv2
import numpy as np
import torch
import open_clip
from PIL import Image
from typing import Any, Optional, Tuple

_model: Optional[Any] = None
_preprocess: Optional[Any] = None
_device: Optional[str] = None


def _load_model() -> Tuple[Any, Any, str]:
    """Load OpenCLIP model and preprocessing on CPU/GPU."""
    global _model, _preprocess, _device
    if _model is not None:
        if _preprocess is None or _device is None:
            raise RuntimeError("Model cache is incomplete")
        return _model, _preprocess, _device

    _device = "cuda" if torch.cuda.is_available() else "cpu"
    # ViT-B-32 produces the 512-dim embeddings described above;
    # RN50 is a lighter alternative, but note it outputs 1024-dim embeddings.
    _model, _, _preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai"
    )
    if isinstance(_preprocess, (list, tuple)):
        _preprocess = _preprocess[-1]
    _model = _model.to(_device).eval()
    if _preprocess is None or _device is None:
        raise RuntimeError("Model initialization failed")
    return _model, _preprocess, _device


def _load_image(path: str) -> np.ndarray:
    """Read an image from disk as BGR array."""
    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    return img


def _embed(img: np.ndarray) -> np.ndarray:
    """Compute a normalized CLIP embedding."""
    model, preprocess, device = _load_model()
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    tensor = preprocess(Image.fromarray(rgb)).unsqueeze(0).to(device)
    with torch.inference_mode():
        emb = model.encode_image(tensor)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy().flatten()


def compare(img1: str, img2: str) -> float:
    """Return cosine similarity for two image paths."""
    e1 = _embed(_load_image(img1))
    e2 = _embed(_load_image(img2))
    return float(np.dot(e1, e2))


if __name__ == "__main__":
    image1 = "path/to/image1.jpg"
    image2 = "path/to/image2.jpg"
    score = compare(image1, image2)
    print("Similarity score:", score)
