
Parth Sarthi Sharma

Vector Dimensions, Cosine Similarity, Dot Product — and Why Your Distance Metric Silently Ruins Relevance

Diagram showing difference between Vector Dimensions, Cosine Similarity and Dot Product

If dense vectors power semantic search, then distance metrics decide what “relevant” actually means.

Most RAG systems don’t fail because of bad embeddings.
They fail because of a wrong similarity metric that no one notices.

Let’s fix that.

What Is a Vector Dimension (Really)?

When you generate embeddings, you get vectors like:

  • 384 dimensions
  • 768 dimensions
  • 1024 or 1536 dimensions

Each dimension is not a word.
It’s a latent signal learned by the model to represent meaning.

Think of it like this:

A vector is not a point.
It’s a direction in meaning space.

Higher dimensions:

  • Capture more nuance
  • Cost more memory
  • Are harder to reason about

Lower dimensions:

  • Faster
  • Cheaper
  • Lose subtle distinctions

📌 Bigger is not automatically better.
Matching dimension size to your use case matters more than chasing numbers.
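
To make this concrete, here’s a minimal sketch using the sentence-transformers library (all-MiniLM-L6-v2 is just one common 384-dimensional model; swap in whichever model you actually use):

```python
# Minimal sketch: inspecting embedding dimensionality.
# Assumes sentence-transformers is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a 384-dimensional model
vec = model.encode("A vector is a direction in meaning space.")

print(vec.shape)  # (384,): 384 latent signals, not 384 words
```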

The Three Distance Metrics That Matter

Most vector databases support multiple similarity functions — but they do not behave the same.

1️⃣ Cosine Similarity (Most Common)

Cosine similarity measures the angle between vectors, not their magnitude.

In plain terms:

“Are these two vectors pointing in the same direction?”

This makes cosine similarity:

  • Scale-invariant
  • Stable across varying text lengths
  • Excellent for semantic similarity

✅ Best for:

  • Natural language queries
  • Chatbots
  • Knowledge search
  • RAG systems

This is why most embedding models are trained assuming cosine similarity.
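
In code, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A quick sketch with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle between vectors: direction matters, magnitude does not.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))  # 1.0, because magnitude cancels out
```

Because magnitude cancels out, a short query and a long document that point the same way in meaning space score the same.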

2️⃣ Dot Product (Often Misused)

Dot product considers:

  • Direction
  • Magnitude

This means:

  • Longer documents often score higher
  • Repeated tokens inflate relevance
  • Results can look “relevant” but feel off

Dot product behaves like cosine similarity only when embeddings are L2-normalized
(or when the model was explicitly trained to produce raw dot-product scores).

❌ Common mistake:

Switching from cosine to dot product without normalization

This silently degrades relevance.
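
Here’s the failure in miniature: the same two vectors, scored with raw dot product and then after L2-normalization (a toy example, not real embeddings):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, bigger magnitude

print(np.dot(a, b))  # 28.0: magnitude inflates the score

# After L2-normalization, dot product and cosine similarity coincide:
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.dot(a_n, b_n))  # ~1.0: identical to cosine similarity
```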

3️⃣ Euclidean Distance (Usually the Wrong Choice)

Euclidean distance measures straight-line distance between vectors.

Sounds reasonable — but in high-dimensional spaces:

  • Distances become less meaningful
  • Nearest neighbors cluster unnaturally
  • Small noise causes large rank changes

❌ Euclidean distance is rarely ideal for text embeddings.

It’s better suited for:

  • Low-dimensional numeric data
  • Vision embeddings (sometimes)
  • Geometry-based problems
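
You can watch Euclidean distance lose its discriminating power as dimensionality grows. This is a synthetic demo with random Gaussian vectors (not real embeddings), but the concentration effect is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 384, 1536):
    points = rng.normal(size=(1000, dim))  # 1000 random "documents"
    query = rng.normal(size=dim)           # one random "query"
    dists = np.linalg.norm(points - query, axis=1)

    # As dim grows, the nearest and farthest points converge,
    # so the ranking carries less and less signal.
    print(dim, round(dists.min() / dists.max(), 3))
```

The min/max distance ratio climbs toward 1.0 as dimensions increase: everything becomes roughly equidistant, and small noise can reshuffle the top results.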

The Silent Failure Mode No One Talks About

Here’s the dangerous part:

Your system will still “work” with the wrong distance metric.

You’ll still get:

  • Answers
  • Retrieved chunks
  • Confident LLM responses

But relevance slowly decays:

  • Slightly worse answers
  • Subtle hallucinations
  • “Feels off” responses

And because nothing crashes — no one notices.

A Simple Rule of Thumb (Use This)

If you remember nothing else, remember this:

  • Cosine similarity → default for semantic search
  • Dot product → only if embeddings are normalized or model-trained for it
  • Euclidean distance → avoid for text unless you know why

Also:

  • Don’t mix metrics between indexing and querying (see the sketch after this list)
  • Don’t change metrics mid-experiment
  • Always document which metric you used
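
One concrete way to enforce this is to let the index type carry the metric. Here’s a hedged sketch with FAISS, using random placeholder vectors; the point is that the metric is fixed when the index is created, and normalization is applied on both the indexing and querying sides:

```python
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(100, dim).astype("float32")

# Inner-product index: only valid because we L2-normalize first.
faiss.normalize_L2(vectors)       # in-place normalization
index = faiss.IndexFlatIP(dim)    # metric locked here, at creation
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)         # normalize queries the same way
scores, ids = index.search(query, 5)
```

With normalized vectors, the inner-product scores are cosine similarities, and there’s no way to accidentally query with a different metric than you indexed with.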

What to Use in Production

From real RAG systems:

  • Embedding model → defines the metric
  • Metric → locked at index creation
  • Cosine similarity → 90% of cases
  • Dot product → only with explicit normalization
  • Always validate relevance manually with real queries

What’s Coming Next

In the next article, we’ll go deeper into LangChain internals:

Document loaders, text splitters, chunk sizes —
and how bad chunking destroys even perfect embeddings.
