
Parth Sarthi Sharma

Vector Dimensions, Cosine Similarity, Dot Product — and Why Your Distance Metric Silently Ruins Relevance

Diagram showing difference between Vector Dimensions, Cosine Similarity and Dot Product

If dense vectors power semantic search, then distance metrics decide what “relevant” actually means.

Most RAG systems don’t fail because of bad embeddings.
They fail because of a wrong similarity metric that no one notices.

Let’s fix that.

What Is a Vector Dimension (Really)?

When you generate embeddings, you get vectors like:

  • 384 dimensions
  • 768 dimensions
  • 1024 or 1536 dimensions

Each dimension is not a word.
It’s a latent signal learned by the model to represent meaning.

Think of it like this:

A vector is not a point.
It’s a direction in meaning space.

Higher dimensions:

  • Capture more nuance
  • Cost more memory
  • Are harder to reason about

Lower dimensions:

  • Faster
  • Cheaper
  • Lose subtle distinctions

📌 Bigger is not automatically better.
Matching dimension size to your use case matters more than chasing numbers.
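
To make this concrete, here’s a minimal sketch using the sentence-transformers library (all-MiniLM-L6-v2 is just one common 384-dimensional model; swap in whichever model you actually use):

```python
# Minimal sketch: inspecting embedding dimensionality.
# Assumes sentence-transformers is installed (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a 384-dimensional model
vec = model.encode("A vector is a direction in meaning space.")

print(vec.shape)  # (384,): 384 latent signals, not 384 words
```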

The Three Distance Metrics That Matter

Most vector databases support multiple similarity functions — but they do not behave the same.

1️⃣ Cosine Similarity (Most Common)

Cosine similarity measures the angle between vectors, not their magnitude.

In plain terms:

“Are these two vectors pointing in the same direction?”

This makes cosine similarity:

  • Scale-invariant
  • Stable across varying text lengths
  • Excellent for semantic similarity

✅ Best for:

  • Natural language queries
  • Chatbots
  • Knowledge search
  • RAG systems

This is why most embedding models are trained assuming cosine similarity.
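
In code, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A quick sketch with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle between vectors: direction matters, magnitude does not.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))  # 1.0, because magnitude cancels out
```

Because magnitude cancels out, a short query and a long document that point the same way in meaning space score the same.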

2️⃣ Dot Product (Often Misused)

Dot product considers:

  • Direction
  • Magnitude

This means:

  • Longer documents often score higher
  • Repeated tokens inflate relevance
  • Results can look “relevant” but feel off

Dot product behaves like cosine similarity only when embeddings are L2-normalized
(or when the model was explicitly trained to produce raw dot-product scores).

❌ Common mistake:

Switching from cosine to dot product without normalization

This silently degrades relevance.
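
Here’s the failure in miniature: the same two vectors, scored with raw dot product and then after L2-normalization (a toy example, not real embeddings):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, bigger magnitude

print(np.dot(a, b))  # 28.0: magnitude inflates the score

# After L2-normalization, dot product and cosine similarity coincide:
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

print(np.dot(a_n, b_n))  # ~1.0: identical to cosine similarity
```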

3️⃣ Euclidean Distance (Usually the Wrong Choice)

Euclidean distance measures straight-line distance between vectors.

Sounds reasonable — but in high-dimensional spaces:

  • Distances become less meaningful
  • Nearest neighbors cluster unnaturally
  • Small noise causes large rank changes

❌ Euclidean distance is rarely ideal for text embeddings.

It’s better suited for:

  • Low-dimensional numeric data
  • Vision embeddings (sometimes)
  • Geometry-based problems
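
You can watch Euclidean distance lose its discriminating power as dimensionality grows. This is a synthetic demo with random Gaussian vectors (not real embeddings), but the concentration effect is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 384, 1536):
    points = rng.normal(size=(1000, dim))  # 1000 random "documents"
    query = rng.normal(size=dim)           # one random "query"
    dists = np.linalg.norm(points - query, axis=1)

    # As dim grows, the nearest and farthest points converge,
    # so the ranking carries less and less signal.
    print(dim, round(dists.min() / dists.max(), 3))
```

The min/max distance ratio climbs toward 1.0 as dimensions increase: everything becomes roughly equidistant, and small noise can reshuffle the top results.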

The Silent Failure Mode No One Talks About

Here’s the dangerous part:

Your system will still “work” with the wrong distance metric.

You’ll still get:

  • Answers
  • Retrieved chunks
  • Confident LLM responses

But relevance slowly decays:

  • Slightly worse answers
  • Subtle hallucinations
  • “Feels off” responses

And because nothing crashes — no one notices.

A Simple Rule of Thumb (Use This)

If you remember nothing else, remember this:

  • Cosine similarity → default for semantic search
  • Dot product → only if embeddings are normalized or model-trained for it
  • Euclidean distance → avoid for text unless you know why

Also:

  • Don’t mix metrics between indexing and querying (see the sketch after this list)
  • Don’t change metrics mid-experiment
  • Always document which metric you used
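
One concrete way to enforce this is to let the index type carry the metric. Here’s a hedged sketch with FAISS, using random placeholder vectors; the point is that the metric is fixed when the index is created, and normalization is applied on both the indexing and querying sides:

```python
import faiss
import numpy as np

dim = 384
vectors = np.random.rand(100, dim).astype("float32")

# Inner-product index: only valid because we L2-normalize first.
faiss.normalize_L2(vectors)       # in-place normalization
index = faiss.IndexFlatIP(dim)    # metric locked here, at creation
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)         # normalize queries the same way
scores, ids = index.search(query, 5)
```

With normalized vectors, the inner-product scores are cosine similarities, and there’s no way to accidentally query with a different metric than you indexed with.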

What to Use in Production

From real RAG systems:

  • Embedding model → defines the metric
  • Metric → locked at index creation
  • Cosine similarity → 90% of cases
  • Dot product → only with explicit normalization
  • Always validate relevance manually with real queries

What’s Coming Next

In the next article, we’ll go deeper into LangChain internals:

Document loaders, text splitters, chunk sizes —
and how bad chunking destroys even perfect embeddings.
