Today was supposed to be a routine day. I was reviewing some logs for a multi-modal retrieval pipeline we’ve been running—camera images, lidar frames, and a few NLP tags all go into a vector store for downstream search. Pretty standard setup, right?
But then the recall dropped. Quietly. No errors, no crashes, just… worse results.
Turns out, this whole thing was caused by a seemingly small detail: inconsistent embedding norms from different modalities. It sent me down a 3-hour rabbit hole involving cosine distances, vector scaling, and my own past assumptions about database behavior. Here’s what I learned (again).
Context: The Setup
We’re storing multi-modal embeddings into a vector database—specifically, lidar-to-text retrieval for a roadside perception system. Each data point looks roughly like this:
- `image_embedding`: 512-dim vision encoder output
- `lidar_embedding`: 256-dim learned BEV encoder output
- `text_embedding`: 768-dim from a BERT variant
- Metadata: GPS, weather, scenario tags, etc.
The system uses Milvus (v2.3) with HNSW for approximate search. Each modality goes into its own collection, but the RAG pipeline combines results at query time via re-ranking.
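For concreteness, the per-modality layout looks roughly like the sketch below. This is a minimal reconstruction, not our production code; the field names, HNSW parameters, and the COSINE metric choice are just illustrative.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

# Hypothetical sketch of the per-modality setup, not the production schema.
connections.connect(host="localhost", port="19530")

def make_collection(name: str, dim: int) -> Collection:
    """One collection per modality, with an HNSW index on the vector field."""
    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=dim),
        FieldSchema("location_tag", DataType.VARCHAR, max_length=64),
    ]
    coll = Collection(name, CollectionSchema(fields))
    coll.create_index(
        field_name="embedding",
        index_params={
            "index_type": "HNSW",
            "metric_type": "COSINE",  # supported natively in Milvus 2.3+
            "params": {"M": 16, "efConstruction": 200},
        },
    )
    return coll

image_coll = make_collection("image_embeddings", 512)
lidar_coll = make_collection("lidar_embeddings", 256)
text_coll = make_collection("text_embeddings", 768)
```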
The Problem: Recall Drift
We noticed that queries with natural language inputs (e.g. "car parked under bridge in fog") were retrieving fewer relevant lidar segments than expected. Visual embeddings still worked well, but lidar retrieval became noticeably noisier.
The embeddings were going in, indexes were fine, metadata filters were working. So what changed?
The culprit: vector magnitude variance.
Some of our lidar embeddings had significantly lower norms (around 0.5–1.2), while the text embeddings were tightly clustered around 7–9.
Cosine similarity, which we used for all retrievals, is theoretically scale-invariant—but in practice, index-level normalization matters, especially when mixed with filtered + hybrid queries.
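The mismatch was obvious the moment we actually printed the norms. A quick diagnostic along these lines does it (the random arrays are stand-ins for real batches pulled from the ingestion path):

```python
import numpy as np

# Stand-in batches: real code would pull these from the ingestion path.
rng = np.random.default_rng(0)
lidar_vecs = rng.normal(scale=0.05, size=(1000, 256))  # placeholder lidar embeddings
text_vecs = rng.normal(scale=0.3, size=(1000, 768))    # placeholder text embeddings

def norm_stats(name: str, vecs: np.ndarray) -> None:
    """Print the min/mean/max L2 norm for a batch of embeddings."""
    norms = np.linalg.norm(vecs, axis=1)
    print(f"{name}: min={norms.min():.2f} mean={norms.mean():.2f} max={norms.max():.2f}")

norm_stats("lidar", lidar_vecs)
norm_stats("text", text_vecs)
```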
Lessons Learned (or Re-Learned)
1. Always normalize before insert. Always.
I had assumed that the downstream ingestion code was already L2-normalizing the embeddings. It wasn't. And even though cosine distance is supposed to ignore magnitude, many ANN libraries (including Faiss and Milvus's HNSW) use raw dot products internally and normalize at query time only.
Result? Insert-time magnitude variance = weird scoring behavior.
Fix: added `embedding = embedding / np.linalg.norm(embedding)` before inserts. Immediately improved recall by ~15%.
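For batched inserts, the same fix looks roughly like this; the epsilon guard is my own addition to avoid dividing by zero on degenerate vectors:

```python
import numpy as np

def l2_normalize(batch: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize each row of a (num_vectors, dim) batch before insert."""
    norms = np.linalg.norm(batch, axis=1, keepdims=True)
    return batch / np.maximum(norms, eps)  # eps guard for degenerate all-zero vectors
```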
2. Vector DBs don’t protect you from messy upstream models
No matter how good your vector database is, it doesn’t validate the statistical properties of your data. If your embedding distribution drifts (like ours did after a model retrain), the index won’t scream at you. It’ll just… get worse.
In this case, the new lidar encoder was producing vectors on a much smaller scale. Nothing broke, but everything degraded.
Takeaway: embedding stats should be part of CI. Track means, norms, sparsity, drift. It’s cheap and saves hours later.
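A minimal version of that check might look like the sketch below. The tracked stats, tolerance, and baseline file are assumptions on my part, not something we've standardized yet.

```python
import json
import numpy as np

def check_embedding_drift(vecs: np.ndarray, baseline_path: str, tol: float = 0.2) -> None:
    """Compare a fresh embedding batch against stored baseline stats; fail loudly on drift."""
    norms = np.linalg.norm(vecs, axis=1)
    stats = {
        "mean_norm": float(norms.mean()),
        "std_norm": float(norms.std()),
        "mean_value": float(vecs.mean()),
        "sparsity": float((np.abs(vecs) < 1e-6).mean()),
    }
    with open(baseline_path) as f:
        baseline = json.load(f)
    for key, value in stats.items():
        ref = baseline[key]
        # Relative tolerance check; threshold is an arbitrary placeholder.
        if abs(value - ref) > tol * max(abs(ref), 1e-6):
            raise AssertionError(f"Embedding drift on {key}: {value:.4f} vs baseline {ref:.4f}")
```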
3. Metadata filters can mask retrieval bugs
When recall dropped, our re-ranking + metadata filtering kept returning "reasonable" results, which made debugging harder. The top-3 looked OK—until we noticed they were all from the same location tag.
Moral: if you're using metadata filters (which you should), test recall both with and without filters. Otherwise, you’re debugging the wrong component.
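Here's roughly how I'd structure that comparison against Milvus. The query vector, ground-truth IDs, and the location_tag expression are placeholders for a real labeled test set, and `lidar_coll` refers back to the setup sketch above (already loaded into memory via `coll.load()`).

```python
# Placeholder query and labels; a real test would use annotated scenario data.
query_vec = [0.0] * 256
ground_truth_ids = [101, 102, 103]

search_params = {"metric_type": "COSINE", "params": {"ef": 128}}

def recall_at_k(coll, query_vec, ground_truth_ids, k=10, expr=None):
    """Fraction of ground-truth IDs returned in the top-k, with an optional metadata filter."""
    res = coll.search(
        data=[query_vec], anns_field="embedding",
        param=search_params, limit=k, expr=expr,
    )
    hit_ids = {hit.id for hit in res[0]}
    return len(hit_ids & set(ground_truth_ids)) / len(ground_truth_ids)

unfiltered = recall_at_k(lidar_coll, query_vec, ground_truth_ids)
filtered = recall_at_k(lidar_coll, query_vec, ground_truth_ids,
                       expr='location_tag == "bridge_07"')
print(f"recall@10 unfiltered={unfiltered:.2f} filtered={filtered:.2f}")
```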
Final Notes
No, this wasn’t a massive failure. It was one of those slow, silent bugs that creep into production pipelines when different teams train models, build retrievers, and wire up search logic. Nothing crashed—but the user experience got worse.
I’m sharing this mostly to remind myself (and maybe you) that ANN infrastructure is only as good as the vectors you feed it. And the most boring parts—like normalization—still bite you the hardest.
If you’ve run into similar issues with mixed-modality embeddings or have better ways to track embedding drift, I’m all ears. Thinking of adding some lightweight checksums or vector histograms to our monitoring pipeline next.
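If I end up going the histogram route, something as simple as this might already be enough (the bin edges are arbitrary placeholders):

```python
import numpy as np

def norm_histogram(vecs: np.ndarray, bins=(0, 0.5, 1, 2, 5, 10, 20)) -> dict:
    """Bucket per-batch vector norms so a post-retrain scale shift shows up in monitoring."""
    counts, edges = np.histogram(np.linalg.norm(vecs, axis=1), bins=bins)
    return {f"{lo}-{hi}": int(c) for lo, hi, c in zip(edges[:-1], edges[1:], counts)}
```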