- Book: RAG Pocket Guide
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You open the eval notebook on a Friday. Your retriever returned the wrong chunk for a query that should have been a layup. You print the cosine similarities of the top-50 candidates: 0.847, 0.842, 0.839, 0.838, 0.836, 0.835, 0.834. The right document is at rank 38 with a score of 0.819. 0.028 of a similarity point separates the right answer from thirty-seven distractors.
The threshold tuning code you wrote three weeks ago picks anything above 0.83. It returns thirty chunks. The reranker behind it eats half a second per query. Everyone is mad.
The issue is not your embedder. It is that cosine similarity, on the embedding spaces every common encoder produces, lies to you about how different two vectors actually are. The whole point cloud sits on a narrow cone. Angles stay small, cosines stay high, and the small differences that do carry signal get drowned in the constant offset.
This is the anisotropy problem, and that 0.028 of a unit is what separates you from a working retriever.
What anisotropy actually means
Kawin Ethayarajh's 2019 paper, How Contextual are Contextualized Word Representations?, made the property concrete. He looked at BERT, ELMo, and GPT-2 layer by layer and measured the average cosine similarity between random pairs of tokens. In an isotropic space (vectors spread uniformly on the unit sphere) that number should be close to zero. In BERT's upper layers it sat around 0.30 to 0.55 depending on layer; in GPT-2's last layer it was close to 0.99. The vectors were crammed into a narrow cone.
What that means for retrieval: every cosine score you compute is the sum of a baseline that captures the shared cone offset between any two vectors from the encoder, and a small residual that captures actual semantic distance. The first term is junk; the second is signal. Vanilla cosine sums them and hands you the total.
You see this on production sentence encoders too. Su et al., 2021 showed that BERT sentence embeddings have a strongly non-zero mean and highly correlated dimensions. The mean alone explains why your random query and a random chunk from your corpus already score in the 0.6-0.8 band on cosine before either of them has anything to do with the other.
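You can reproduce the mechanism with a toy simulation, no encoder involved: give every vector one large shared component plus a small independent one, and random pairs land in a high, narrow band. The dimensions and magnitudes below are arbitrary, picked only to make the effect visible.
import numpy as np

rng = np.random.default_rng(0)
d = 768
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)          # the common "cone" direction

# toy vectors: a unit-length shared offset plus independent noise of norm ~0.4
X = shared + 0.4 * rng.normal(size=(1000, d)) / np.sqrt(d)
X /= np.linalg.norm(X, axis=1, keepdims=True)

i = rng.integers(0, 1000, size=5000)
j = rng.integers(0, 1000, size=5000)
sims = (X[i] * X[j]).sum(axis=1)
print(f"random-pair cosine: mean={sims.mean():.2f}")  # well above zero, ~0.86 with these settings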
Modern OpenAI embeddings (text-embedding-3-small, text-embedding-3-large) are better calibrated than 2019-era BERT, but they are not isotropic. On a domain corpus you will still see the histogram pile up in a narrow band. Same shape, the band has just moved.
Make the histogram
Before you fix anything, look at it. Take 10,000 random pairs from your corpus, compute cosine, plot the distribution.
import numpy as np
import matplotlib.pyplot as plt
# embeddings: (N, d) float32, unit-normalized
N = embeddings.shape[0]
rng = np.random.default_rng(0)
i = rng.integers(0, N, size=10_000)
j = rng.integers(0, N, size=10_000)
mask = i != j  # drop accidental self-pairs
i, j = i[mask], j[mask]
# rows are unit-normalized, so the dot product is the cosine
sims = (embeddings[i] * embeddings[j]).sum(axis=1)
print(f"mean={sims.mean():.3f} std={sims.std():.3f}")
print(f"p05={np.percentile(sims, 5):.3f}")
print(f"p95={np.percentile(sims, 95):.3f}")
plt.hist(sims, bins=80)
plt.xlabel("cosine similarity")
plt.ylabel("count")
plt.title("Random pairs, raw cosine")
plt.show()
On a typical support-doc corpus with text-embedding-3-small, the shape is unmistakable. Mean somewhere around 0.30 to 0.45, standard deviation around 0.05 to 0.08, the whole mass squeezed into a band 0.20 wide. There is your cone. The exact numbers are corpus-dependent, so run it on yours before quoting any of this.
Now the same histogram for known related pairs (a query and the chunk a human marked as the answer). On most corpora it sits 0.05 to 0.15 to the right of the random distribution. The two distributions overlap heavily. Retrieval errors live in that overlap.
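A sketch of that second histogram, assuming you have a small labeled eval set; qa_pairs and query_embs are placeholders for your own data, and sims, embeddings, and plt carry over from the block above.
# qa_pairs: list of (query_idx, chunk_idx) a human marked as relevant
qi = np.array([q for q, _ in qa_pairs])
ci = np.array([c for _, c in qa_pairs])
# query_embs: (Q, d) unit-normalized query embeddings
rel_sims = (query_embs[qi] * embeddings[ci]).sum(axis=1)
plt.hist(sims, bins=80, alpha=0.5, label="random pairs")
plt.hist(rel_sims, bins=80, alpha=0.5, label="labeled related pairs")
plt.legend()
plt.xlabel("cosine similarity")
plt.show()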
Fix one: mean-centering the corpus
The cheapest fix is to subtract the corpus mean from every embedding before you index or query.
mu = embeddings.mean(axis=0)  # corpus mean, shape (d,)
centered = embeddings - mu
centered /= np.linalg.norm(centered, axis=1, keepdims=True)
# at query time: subtract the same corpus mean, then renormalize
q_centered = q - mu
q_centered /= np.linalg.norm(q_centered)
That single subtraction kills most of the cone offset. The mean cosine of random pairs drops toward zero. Useful similarity differences that were buried in the third decimal place show up in the first. Your 0.847 vs 0.819 gap can become something like 0.31 vs 0.18: same ranking, different signal-to-noise.
Two production gotchas. The mean has to be computed on a representative sample of your corpus, not on the queries (queries are usually shorter and have a different mean). And you have to recompute the mean when you re-embed at scale, or store it alongside the model version. A mean from one embedder applied to vectors from another is worse than no mean at all.
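A sketch of the bookkeeping for that second gotcha; the file name, corpus_sample, and embedder_name are placeholders, not a convention from numpy or anything else.
# fit the mean on a representative corpus sample, never on queries
mu = corpus_sample.mean(axis=0)
np.savez("centering_stats.npz", mu=mu, model="text-embedding-3-small")

# at load time, refuse to apply a mean fitted on a different embedder
stats = np.load("centering_stats.npz")
assert str(stats["model"]) == embedder_name, "stale or mismatched mean"
mu = stats["mu"]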
Fix two: whitening
Mean-centering removes the offset. Whitening removes the correlation between dimensions. It is what makes the post-centering distribution actually round instead of an oblong puddle stretched along the principal directions of the original cone.
The recipe from Su et al., 2021, short enough to fit on one screen:
def whiten_fit(X):
    # fit on a representative sample; mu is (1, d), W is (d, d)
    mu = X.mean(axis=0, keepdims=True)
    cov = np.cov((X - mu).T)
    u, s, _ = np.linalg.svd(cov)
    W = u @ np.diag(1.0 / np.sqrt(s + 1e-6))
    return mu, W

def whiten_apply(X, mu, W):
    # decorrelate and rescale, then renormalize to unit length
    Y = (X - mu) @ W
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    return Y

mu, W = whiten_fit(embeddings_sample)
indexed = whiten_apply(embeddings, mu, W)
q_white = whiten_apply(q[None, :], mu, W)[0]
Fit mu and W on a sample (50k-100k vectors is plenty). Store them next to your index. Apply them to every embedding you index and every query you search. Cosine similarity in the whitened space behaves like cosine similarity should: the random-pairs histogram is centered near zero with a clean spread, and related pairs sit visibly to the right.
According to Su et al., whitened embeddings hit STS-benchmark numbers competitive with BERT-flow at a fraction of the complexity, while also letting you reduce dimensions (drop the smallest singular values) for faster ANN search.
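The dimension reduction is a one-line change to whiten_fit: keep only the top-k singular directions (the whitening-k variant from the paper). The k=256 below is just an example value.
def whiten_fit_k(X, k=256):
    # same as whiten_fit, but keep only the k largest singular directions
    mu = X.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(np.cov((X - mu).T))
    W = u[:, :k] @ np.diag(1.0 / np.sqrt(s[:k] + 1e-6))
    return mu, W  # whiten_apply works unchanged; output vectors are k-dim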
The trade-off: whitening is a learned transform. If your corpus distribution drifts, your mu and W go stale. Re-fit them on a schedule, or whenever you swap the embedder.
Fix three: z-score the cosines, not the vectors
Sometimes you cannot rewrite the index. The vendor returns cosine scores. The pipeline is fixed. You still want to know whether 0.847 is "good" for this query.
Z-scoring fixes that without touching the embedding space. For each query, retrieve top-K (say K=200), compute the mean and std of those scores, and report (score - mean) / std instead of the raw score.
def zscore_scores(scores):
    # normalize one query's candidate scores to zero mean, unit std
    s = np.asarray(scores)
    return (s - s.mean()) / (s.std() + 1e-6)
A z-score of 3 means the document is three standard deviations above the rest of the candidate pool for this query, a strong signal. A z-score of 0.4 means it is barely distinguishable. The threshold you tune is now in z-space, which is comparable across queries even if the raw cosines drift.
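In use it is a couple of lines per query; the cutoff of 2.0 and the candidates list are illustrative placeholders, not a recommendation.
# scores: raw cosines for the top-K candidates of a single query
z = zscore_scores(scores)
keep = [doc for doc, zs in zip(candidates, z) if zs > 2.0]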
This is what you reach for when the retrieval layer is a managed service and you only get scores out of it. It does not undo anisotropy; it sidesteps the question by normalizing per-query.
Fix four: drop normalization, use MIPS
Cosine similarity is a normalized inner product. The normalization throws away vector magnitude. On many encoders that magnitude carries signal: frequent or generic chunks have larger norms, rare or specific ones smaller. If you index with maximum inner product search instead of cosine, you keep the magnitude.
In pgvector, that is the difference between the <=> operator (cosine distance) and <#> (negative inner product). HNSW supports both. Switch the operator class on the index:
CREATE INDEX chunks_emb_idx
ON chunks USING hnsw (embedding vector_ip_ops);
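A hypothetical query against that index; $1 is the query embedding bound as a pgvector vector literal, and any column beyond embedding is an assumption about your schema.
-- <#> returns the negative inner product, so ascending order = largest dot product first
SELECT id
FROM chunks
ORDER BY embedding <#> $1
LIMIT 10;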
MIPS does not solve anisotropy on its own. It changes which information you keep. Pair it with mean-centering and you usually get a small lift over normalized cosine, especially on corpora where chunk-length variance is real.
Fix five: the calibrated reranker
If your retrieval is non-negotiable and the upstream cosines are what you have, put a cross-encoder reranker behind it. Rerank top-50 with bge-reranker-v2-m3 or Cohere Rerank 3. The reranker is trained on (query, passage) pairs end-to-end and does not inherit the cone. Its output is a calibrated relevance score rather than a cosine.
This is the most expensive option (roughly 50-150 ms per query, depending on hardware and batch size) and the most reliable. It does not fix your embedder; it puts a smarter judge after it. Most production RAG stacks end up here regardless.
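A minimal sketch using sentence-transformers' CrossEncoder to load the bge model named above; top_50, chunk.text, and the 512-token max length are placeholders for whatever your retriever hands back, and loading this model through its own library instead is equally fine.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

# score every (query, passage) pair jointly; no embedding space, no cone
pairs = [(query, chunk.text) for chunk in top_50]
scores = reranker.predict(pairs)
reranked = [c for _, c in sorted(zip(scores, top_50), key=lambda t: -t[0])]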
What to actually do on Monday
In rough order of cost-to-benefit:
- Plot the random-pairs histogram on your corpus. Confirm the cone exists. If your random-pair mean is below 0.1, you are already isotropic enough — stop reading.
- Mean-center the corpus and queries. Re-run your eval. If recall@10 moves up, ship it; the change is ten lines.
- Fit a whitening transform on a sample, re-index. Compare recall@10 again. Whitening usually helps when the encoder is older or the corpus is narrow (legal, medical, code).
- If you cannot re-index, z-score the scores per query before thresholding.
- If retrieval is mission-critical, add a reranker on top of whichever of the above you landed on.
The sequence that keeps showing up in production is mean-center + reranker. Whitening is the next dial when that is not enough. Threshold-tuning is the wrong knob to turn when the right answer scores 0.819; the space itself is the thing to fix.
If this was useful
The RAG Pocket Guide walks through the embedding-space hygiene work the tutorials skip — when to whiten, when to z-score, how to pick the threshold once you stop trusting raw cosine, and how to wire the reranker step in without doubling your latency. If your retrieval feels like it is working but the eval numbers will not stop bouncing, this is the part of the stack the book spends the most time on.
