DEV Community

Julien L for WiScale

Stop Using Cosine for Everything: 5 Distance Metrics That Unlock Hidden Powers in Your Vector Database

Everyone uses cosine similarity. Tutorials use it. Frameworks default to it. If you ask "which distance metric should I use?", the answer is always "cosine, probably."

But here is the thing: your vector database supports other metrics. And those metrics unlock use cases that cosine literally cannot handle.

This is not a math lecture. This is a practical guide. Five metrics, five real-world problems, working code you can run in two minutes. By the end, you will look at your vector database differently.

A quick mental model (no math degree required)

Before we dive in, let's build some intuition. Imagine you have two arrows on a piece of paper:

Cosine asks: "Do these arrows point in the same direction?" It does not care how long they are.

Euclidean asks: "How far apart are the tips of these arrows?" It cares about both direction and length.

Dot Product asks: "Do they point the same way, AND are they both strong signals?" Direction plus intensity.

Hamming asks: "How many switches are flipped differently between these two?" It works on binary on/off data.

Jaccard asks: "How much overlap is there between these two sets?" It works on yes/no memberships.

That is it. Five different questions, five different superpowers. Let's see them in action.

Setup

pip install velesdb

One line. No Docker. No API keys. VelesDB is an embedded database written in Rust (~6MB binary). It runs in your process.

import velesdb

db = velesdb.Database("./my_data")

1. Cosine: the one you already know (but maybe not why)

The question it answers: "Are these two things about the same topic?"

Cosine measures the angle between two vectors. If two documents are about machine learning, their embedding vectors will point in roughly the same direction, regardless of document length. A 280-character tweet and a 10-page paper about the same topic will score high.
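You can verify that length invariance without any database at all. Here is a pure-Python sketch (the `cosine` helper below is just the textbook formula, not VelesDB's implementation):

```python
import math

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

tweet = [0.9, 0.8, 0.0, 0.0]           # short doc about ML
paper = [v * 10 for v in tweet]        # 10-page doc, same topic: 10x the magnitude

print(round(cosine(tweet, paper), 4))  # 1.0 -- magnitude is ignored entirely
```

Scaling a vector changes its length but not its angle, so the score stays at 1.0.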

collection = db.create_collection("articles", dimension=4, metric="cosine")

# Simulated embeddings: [tech, science, cooking, sports]
collection.upsert([
    {"id": 1, "vector": [0.9, 0.8, 0.0, 0.0], "payload": {"title": "Introduction to Machine Learning"}},
    {"id": 2, "vector": [0.8, 0.9, 0.0, 0.0], "payload": {"title": "Neural Networks Explained"}},
    {"id": 3, "vector": [0.0, 0.1, 0.9, 0.8], "payload": {"title": "Best Pasta Recipes"}},
    {"id": 4, "vector": [0.1, 0.0, 0.0, 0.9], "payload": {"title": "World Cup 2026 Preview"}},
    {"id": 5, "vector": [0.7, 0.6, 0.0, 0.1], "payload": {"title": "Deep Learning for Beginners"}},
])

results = collection.search(vector=[0.85, 0.75, 0.0, 0.0], top_k=3)
score=1.000  Introduction to Machine Learning
score=0.994  Deep Learning for Beginners
score=0.993  Neural Networks Explained

All three AI articles cluster together. Pasta and football? Nowhere close. This is cosine's sweet spot: semantic similarity where you care about topic, not intensity.

When to use cosine: semantic search, document similarity, FAQ matching, any text embedding comparison.

2. Euclidean: the anomaly hunter

The question it answers: "How far is this point from what's normal?"

Here is where it gets interesting. Imagine IoT sensors on a factory floor, each sending readings every minute: temperature, pressure, humidity, vibration. Normal readings cluster in a tight neighborhood. An anomaly is a point that is physically far from the cluster.

Cosine would miss this. Why? Because a reading of [78C, 985hPa, 12%, 4.8g] could point in a similar direction to [22C, 1013hPa, 45%, 0.3g]. Same general "shape" of data, vastly different magnitudes. Cosine says "similar." Euclidean says "these are 70 units apart, something is on fire."
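You can check that intuition in pure Python with the two readings above. The `cosine` and `euclidean` helpers here are just the standard formulas, not the database's implementation:

```python
import math

normal  = [22.0, 1013.0, 45.0, 0.3]  # [temp_C, pressure_hPa, humidity_%, vibration_g]
anomaly = [78.0,  985.0, 12.0, 4.8]  # something is on fire

def euclidean(a, b):
    # straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(f"cosine:    {cosine(normal, anomaly):.4f}")    # ~0.998 -- "looks similar"
print(f"euclidean: {euclidean(normal, anomaly):.2f}") # ~70.92 -- clearly anomalous
```

The pressure reading dominates both vectors, so their directions are almost identical and cosine scores them as near-twins. Euclidean looks at the actual gap and raises the alarm.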

collection = db.create_collection("sensors", dimension=4, metric="euclidean")

# Sensor: [temperature_C, pressure_hPa, humidity_%, vibration_g]
collection.upsert([
    {"id": 1, "vector": [22.0, 1013.0, 45.0, 0.3], "payload": {"status": "normal", "time": "08:00"}},
    {"id": 2, "vector": [22.5, 1012.5, 47.0, 0.3], "payload": {"status": "normal", "time": "09:00"}},
    {"id": 3, "vector": [21.8, 1013.5, 44.0, 0.4], "payload": {"status": "normal", "time": "10:00"}},
    {"id": 4, "vector": [23.0, 1012.0, 46.0, 0.3], "payload": {"status": "normal", "time": "11:00"}},
    {"id": 5, "vector": [78.0, 985.0, 12.0, 4.8],  "payload": {"status": "ANOMALY", "time": "11:15"}},
])

results = collection.search(vector=[22.0, 1013.0, 45.0, 0.3], top_k=5)
distance=  0.00  08:00 - normal
distance=  1.14  10:00 - normal
distance=  1.73  11:00 - normal
distance=  2.12  09:00 - normal
distance= 70.92  11:15 - ANOMALY  *** ALERT ***

Normal readings are all within distance 0-2 of each other. The anomaly is at distance 70. That is not a subtle difference. That is a fire alarm.

When to use euclidean: anomaly detection, IoT monitoring, fraud detection, anything where the absolute values matter, not just the direction.

3. Dot Product: the smart recommender

The question it answers: "Is this relevant to me, and how confident are you?"

Dot product is cosine's bigger sibling. Cosine only looks at direction. Dot product looks at direction AND magnitude. Think of it this way: two movies can both be sci-fi (same direction), but a critically acclaimed blockbuster has a "louder" embedding signal than a forgettable B-movie.

With cosine, they would score equally. With dot product, quality rises to the top.
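The relationship is simple: dot(a, b) = cos(theta) x |a| x |b|. To isolate the magnitude effect, this sketch models the B-movie as a scaled-down copy of the blockbuster's embedding (an illustrative assumption, not real embedding data):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot(a, b) / (na * nb)

query       = [0.9, 0.8, 0.1, 0.0]          # user likes sci-fi action
blockbuster = [0.95, 0.80, 0.30, 0.05]
b_movie     = [v * 0.45 for v in blockbuster]  # same direction, weaker signal

# Cosine cannot tell them apart (scaling does not change the angle)...
print(round(cosine(query, blockbuster), 4) == round(cosine(query, b_movie), 4))  # True

# ...but dot product ranks the stronger embedding higher
print(dot(query, blockbuster) > dot(query, b_movie))  # True
```

Same angle, different length: cosine ties, dot product breaks the tie in favor of the louder signal.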

collection = db.create_collection("movies", dimension=4, metric="dotproduct")

# Dimensions: [sci_fi, action, drama, comedy]
# Higher magnitude = stronger signal / higher quality
collection.upsert([
    {"id": 1, "vector": [0.95, 0.80, 0.30, 0.05], "payload": {"title": "Interstellar", "rating": 8.7}},
    {"id": 2, "vector": [0.40, 0.35, 0.10, 0.02], "payload": {"title": "Low-Budget Sci-Fi B-Movie", "rating": 3.2}},
    {"id": 3, "vector": [0.85, 0.90, 0.20, 0.10], "payload": {"title": "The Matrix", "rating": 8.7}},
    {"id": 4, "vector": [0.05, 0.05, 0.10, 0.95], "payload": {"title": "Comedy Special", "rating": 7.0}},
    {"id": 5, "vector": [0.70, 0.60, 0.50, 0.05], "payload": {"title": "Blade Runner 2049", "rating": 8.0}},
])

results = collection.search(vector=[0.9, 0.8, 0.1, 0.0], top_k=5)
score=1.525  Interstellar (rating: 8.7)
score=1.505  The Matrix (rating: 8.7)
score=1.160  Blade Runner 2049 (rating: 8.0)
score=0.650  Low-Budget Sci-Fi B-Movie (rating: 3.2)
score=0.095  Comedy Special (rating: 7.0)

Notice: the B-movie is sci-fi (same direction as Interstellar), but it ranks way below because its embedding magnitude is weaker. Dot product naturally surfaces quality. This is why recommendation systems at scale often prefer it over cosine.

When to use dot product: recommendation engines, search ranking where content quality matters, any case where you want relevance weighted by confidence.

4. Hamming: the duplicate detective

The question it answers: "How many bits are different between these two?"

This one is completely different from the previous three. Hamming works on binary vectors (0s and 1s) and simply counts how many positions differ. It is lightning fast and perfect for one thing: comparing hashes.

Real-world scenario: you run a content platform. Users upload images. You want to detect reposts, even if someone cropped the image, added a filter, or recompressed it. The approach: compute a perceptual hash (pHash) of each image, which produces a binary fingerprint. Near-duplicate images have fingerprints that differ by just a few bits.
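In practice, perceptual hashes are usually stored as integers, and the Hamming distance is just the number of set bits in their XOR. A minimal sketch (the hash values are made up for illustration, mirroring the 16-bit vectors below):

```python
def hamming(a: int, b: int) -> int:
    # XOR leaves a 1 wherever the two hashes disagree; count those bits
    return bin(a ^ b).count("1")

original  = 0b1011001011010011  # illustrative 16-bit pHash
cropped   = 0b1011001011010010  # one bit flipped by the crop
cat_photo = 0b0100110100101100  # unrelated image

print(hamming(original, cropped))    # 1  -> near-duplicate
print(hamming(original, cat_photo))  # 16 -> different image
```

On Python 3.10+ you can replace `bin(a ^ b).count("1")` with `(a ^ b).bit_count()`, which is faster.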

collection = db.create_collection("image_hashes", dimension=16, metric="hamming")

# Simulated 16-bit perceptual hashes
collection.upsert([
    {"id": 1, "vector": [1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,1],
     "payload": {"file": "sunset_original.jpg", "source": "photographer"}},
    {"id": 2, "vector": [1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,0],
     "payload": {"file": "sunset_cropped.jpg", "source": "instagram repost"}},
    {"id": 3, "vector": [1,0,1,1,0,0,1,0,1,1,0,1,0,1,1,1],
     "payload": {"file": "sunset_filtered.jpg", "source": "pinterest"}},
    {"id": 4, "vector": [0,1,0,0,1,1,0,1,0,0,1,0,1,1,0,0],
     "payload": {"file": "cat_photo.jpg", "source": "original"}},
    {"id": 5, "vector": [1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1],
     "payload": {"file": "sunset_watermarked.jpg", "source": "stock site"}},
])

new_upload = [1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,1]  # same as original
results = collection.search(vector=new_upload, top_k=5)
hamming= 0 bits  [DUPLICATE      ]  sunset_original.jpg (photographer)
hamming= 1 bits  [NEAR-DUPLICATE ]  sunset_cropped.jpg (instagram repost)
hamming= 1 bits  [NEAR-DUPLICATE ]  sunset_filtered.jpg (pinterest)
hamming= 1 bits  [NEAR-DUPLICATE ]  sunset_watermarked.jpg (stock site)
hamming=16 bits  [different image]  cat_photo.jpg (original)

A difference of just a bit or two? That is the same image with minor modifications. A content moderation bot can flag these in real-time. No ML model needed. No GPU. Just binary comparison at database speed.

When to use hamming: image deduplication, audio fingerprinting, DNA sequence comparison, binary feature matching, plagiarism detection with locality-sensitive hashing.

5. Jaccard: the taste matcher

The question it answers: "How much overlap is there between these two sets?"

Jaccard is beautifully simple: take two sets, divide the size of their intersection by the size of their union. If you and I both like 3 of the same genres out of 4 total unique genres between us, that is 75% Jaccard similarity.

No embeddings. No ML model. No neural network. Just set math.
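The whole metric fits in a few lines of pure Python (the genre sets mirror the profiles used below):

```python
def jaccard(a: set, b: set) -> float:
    # |A intersection B| / |A union B|
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

you   = {"action", "sci-fi", "drama"}
alice = {"action", "sci-fi", "drama", "thriller"}
bob   = {"comedy", "romance", "musical"}

print(jaccard(you, alice))  # 0.75 -- 3 shared genres out of 4 total
print(jaccard(you, bob))    # 0.0  -- no overlap at all
```

The binary vectors in the collection below encode exactly these sets: a 1 at position i means "this genre is in the set."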

collection = db.create_collection("user_profiles", dimension=10, metric="jaccard")

# Genres: [action, comedy, sci-fi, horror, drama, romance, thriller, documentary, anime, musical]
collection.upsert([
    {"id": 1, "vector": [1,0,1,0,1,0,1,0,0,0],
     "payload": {"user": "Alice", "likes": "action, sci-fi, drama, thriller"}},
    {"id": 2, "vector": [0,1,0,0,0,1,0,0,0,1],
     "payload": {"user": "Bob", "likes": "comedy, romance, musical"}},
    {"id": 3, "vector": [1,0,1,1,0,0,1,0,1,0],
     "payload": {"user": "Charlie", "likes": "action, sci-fi, horror, thriller, anime"}},
    {"id": 4, "vector": [1,1,1,1,1,1,1,1,1,1],
     "payload": {"user": "Dave", "likes": "literally everything"}},
    {"id": 5, "vector": [1,0,1,0,0,0,0,1,0,0],
     "payload": {"user": "Eve", "likes": "action, sci-fi, documentary"}},
])

results = collection.search(vector=[1,0,1,0,1,0,0,0,0,0], top_k=5)
75.0% match  Alice (action, sci-fi, drama, thriller)
50.0% match  Eve (action, sci-fi, documentary)
33.3% match  Charlie (action, sci-fi, horror, thriller, anime)
30.0% match  Dave (literally everything)
 0.0% match  Bob (comedy, romance, musical)

Alice shares 3 out of 4 unique genres with you. Dave likes everything, but his union is 10 genres while the overlap is only 3, so Jaccard penalizes him. Bob has zero overlap. The math is transparent, explainable, and instant.

When to use jaccard: user matching, product tagging, skill matching in recruiting, collaborative filtering, any comparison of categorical membership.

The cheat sheet

Metric       Best for           Score meaning          Think of it as
Cosine       Semantic search    1.0 = same topic       "Same direction?"
Euclidean    Anomaly detection  Lower = closer         "How far apart?"
Dot Product  Recommendations    Higher = better match  "Same direction + strong signal?"
Hamming      Hash comparison    Lower = more similar   "How many bits differ?"
Jaccard      Set overlap        1.0 = identical sets   "How much in common?"

The full picture

Choosing the right distance metric is like choosing the right tool. You can hammer a screw into wood, but a screwdriver works better.

Most developers never think about this because most tutorials only show cosine. But the moment you realize that euclidean catches anomalies that cosine misses, or that Jaccard gives you a recommendation engine without any ML, your vector database becomes something much more versatile than a "semantic search box."

All the code in this article runs as-is. You can grab the complete script from GitHub and try it yourself.

pip install velesdb
python distance_metrics_demo.py

VelesDB is a source-available embedded database (Elastic License 2.0) written in Rust. ~6MB binary, no Docker, no server process, no API keys. Just pip install and go.


Full docs: velesdb.com/en
GitHub: github.com/cyberlife-coder/VelesDB
Python SDK: pypi.org/project/velesdb

What distance metric surprised you the most? Have you used Hamming or Jaccard in a real project? I'd love to hear about it in the comments.
