PSBigBig

Posted on Aug 28

Day 7 — FAISS empty vectors, metric mismatch, and recall collapse (ProblemMap No.8)

#rag #vectordatabase #programming #ai

tl;dr

Ingestion success does not prove retrieval health. If your neighbors look the same for unrelated queries, check for zero vectors and NaNs, confirm metric policy matches the index, rebuild from clean embeddings, and smoke test with neighbor overlap and recall.

why recall dies even though ingestion looks fine

Start with root causes. Fixes are easier when the failure mode is named.

Zero vectors inserted. Empty spans or batch bugs produced all zeros. FAISS accepts them and your index looks valid.
Metric mismatch. Embeddings assumed cosine but the index used L2, or the reverse. Distances no longer reflect meaning.
Normalization drift. Query side normalized, corpus side not. Or some shards normalized and others not.
Dimension or dtype errors. A wrong d when saving or loading silently truncated vectors. A device or dtype cast wrote zeros to disk.
Codebook reuse. IVF or PQ codebooks reused after whitening or a model swap. Geometry changed but centroids did not.
ID collisions and silent overwrite. You overwrote rows while index.ntotal still looked correct.
Boot order. You trained IVF before dedup or before boilerplate masking, then re-embedded later.

Label this family as No.8. If the boot sequence contributed, also mark No.14. If you tested against an empty or mixed store, mark No.16.

60 second health check

Do this first. It catches most silent failures.

1) sample 5k rows from your corpus embeddings

2) compute row norms and count zeros, NaN, and Inf

3) verify your metric and normalization policy match the index type

4) run ten random queries and measure neighbor overlap at k=20

rules of thumb

zero vector rate must be 0.0%
NaN or Inf count must be 0
cosine retrieval requires L2 normalized vectors on both sides
if the average top-20 neighbor overlap between pairs of the ten unrelated queries is > 0.35, geometry or ingestion is wrong
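
The first two steps of the health check can be sketched in a few lines of NumPy. This is a minimal sketch, not the author's tooling: the function name `health_report`, the `sample_size` default of 5000, and the returned keys are all assumptions made for illustration.

```python
import numpy as np

def health_report(embs, sample_size=5000, seed=0):
    """Summarize zero, NaN, and Inf rows in a random sample of embeddings."""
    embs = np.asarray(embs, dtype=np.float32)
    rng = np.random.default_rng(seed)
    n = embs.shape[0]
    # sample without replacement; use every row if the corpus is small
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = embs[idx]
    # a row is non finite if any component is NaN or Inf
    finite_rows = np.isfinite(sample).all(axis=1)
    norms = np.linalg.norm(sample[finite_rows], axis=1)
    zero_rows = int((norms == 0).sum())
    return {
        "rows_checked": int(sample.shape[0]),
        "zero_rate": zero_rows / sample.shape[0],
        "nonfinite_rows": int((~finite_rows).sum()),
        "min_norm": float(norms.min()) if norms.size else float("nan"),
        "max_norm": float(norms.max()) if norms.size else float("nan"),
    }
```

If the store is meant to hold unit vectors, `min_norm` and `max_norm` should both sit very close to 1.0. A wide spread is a hint of normalization drift.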

minimal fix that usually restores recall

Goal. Make retrieval trustworthy again with the smallest change set.

Reject bad rows before add. Fail loudly if zeros or non-finite vectors exist.

   import numpy as np

   def reject_bad(embs, d_expected):
       assert embs.ndim == 2 and embs.shape[1] == d_expected, f"dim mismatch {embs.shape}"
       norms = np.linalg.norm(embs, axis=1)
       z = np.where(norms == 0)[0]
       nf = np.where(~np.isfinite(norms))[0]
       if len(z) or len(nf):
           raise RuntimeError(f"zero:{len(z)} naninf:{len(nf)}")
       return norms

Align metric with vector state
Cosine retrieval: normalize both corpus and queries, then use L2 or IP consistently. On unit vectors the two give the same ranking.
Inner product retrieval: avoid renormalizing twice and control norms explicitly.
Rebuild from clean embeddings
Do not patch mixed shards. Trash and rebuild. Retrain IVF or PQ if geometry changed.
Run a five question smoke test
Fixed questions with known spans. If recall stays low, stop and recheck geometry.
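
Why "L2 or IP consistently" works for cosine: on unit vectors, squared L2 distance equals 2 minus twice the inner product, so both metrics rank neighbors the same way. The sketch below checks that identity on random data. The helper `l2_normalize` and the synthetic `corpus` and `query` arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(200, 16))   # synthetic stand-in for corpus embeddings
query = rng.normal(size=(16,))        # synthetic stand-in for one query embedding

def l2_normalize(x):
    """Scale each vector to unit length; raise on zero vectors instead of hiding them."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("zero vector cannot be normalized")
    return x / norms

c = l2_normalize(corpus)
q = l2_normalize(query)

ip = c @ q                            # inner product equals cosine on unit vectors
l2_sq = ((c - q) ** 2).sum(axis=1)    # squared L2 distance

# identity for unit vectors: ||c - q||^2 = 2 - 2 * cos(c, q)
assert np.allclose(l2_sq, 2.0 - 2.0 * ip)

top_by_ip = np.argsort(-ip)[:10]      # largest inner product first
top_by_l2 = np.argsort(l2_sq)[:10]    # smallest distance first
```

The identity only holds when both sides are normalized. If one side skips normalization, the two rankings can diverge, which is the mismatch this step is meant to rule out.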

tiny scripts you can paste

neighbor overlap sanity

```python
def overlap_k(a_ids, b_ids, k=20):
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)
```

run across 10 random queries and average. healthy spaces are well below 0.35
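
One way to do the averaging, as a sketch: take the id matrix a search call returns for your random queries and average the top-k overlap over every pair of rows. The function name `mean_pairwise_overlap` is hypothetical, and comparing every pair (rather than, say, consecutive queries) is an assumption about what "average" means here.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_overlap(neighbor_ids, k=20):
    """Average top-k overlap over every pair of queries.

    neighbor_ids: int array of shape (n_queries, >= k), one row of result ids per query.
    """
    ids = np.asarray(neighbor_ids)[:, :k]
    # FAISS pads missing results with -1; drop it so padding never counts as overlap
    sets = [set(row.tolist()) - {-1} for row in ids]
    pairs = list(combinations(range(len(sets)), 2))
    if not pairs:
        raise ValueError("need at least two queries")
    return sum(len(sets[i] & sets[j]) / float(k) for i, j in pairs) / len(pairs)
```

With a FAISS index you would typically pass the second array returned by `index.search(queries, 20)`.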

FAISS rebuild for cosine via L2

```python
import faiss, numpy as np

Z = np.load("embeddings.npy").astype("float32")
Z = np.ascontiguousarray(Z)  # faiss expects contiguous float32
faiss.normalize_L2(Z)  # in place; cosine requires L2 normalized vectors

d = Z.shape[1]
index = faiss.IndexHNSWFlat(d, 32)  # default metric is L2, which ranks like cosine on unit vectors
index.hnsw.efConstruction = 200
index.add(Z)
faiss.write_index(index, "hnsw_cosine.faiss")
```

IVF PQ retrain sketch

```python
import faiss, numpy as np

Z = np.load("embeddings_clean.npy").astype("float32")
d = Z.shape[1]
nlist, m, nbits = 4096, 64, 8
assert d % m == 0, f"PQ needs d divisible by m, got d={d} m={m}"

quant = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quant, d, nlist, m, nbits)
ivfpq.train(Z)  # train on a large clean sample
ivfpq.add(Z)
ivfpq.nprobe = 32
faiss.write_index(ivfpq, "ivfpq_l2.faiss")
```

acceptance criteria before you call it fixed

zero vector and NaN rates are 0.0%
metric and normalization policy are documented and match the index type
after whitening for cosine, PC1 explained variance is in a healthy band. no single axis dominates
neighbor overlap across 20 random queries is ≤ 0.35 at k=20
recall at k on the heldout set rises and top k lists actually change with the query
index.ntotal equals the number of valid rows you ingested. no silent drops
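
To put a number on the recall criterion, here is a minimal sketch of recall at k for a fixed question set. It assumes each question has exactly one known gold id. The name `recall_at_k` and the input layout are illustrative choices, not something the series defines.

```python
import numpy as np

def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of questions whose gold id appears in the top-k retrieved ids."""
    retrieved = np.asarray(retrieved_ids)[:, :k]
    gold = np.asarray(gold_ids).reshape(-1, 1)
    if retrieved.shape[0] != gold.shape[0]:
        raise ValueError("one gold id per question is required")
    # compare each row of results against that question's gold id
    hits = (retrieved == gold).any(axis=1)
    return float(hits.mean())
```

Run it on the same heldout questions before and after the rebuild so the two numbers are comparable.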

when minimal is not enough

retrain IVF or PQ codebooks on a large clean deduped sample
corpus hygiene. dedup near duplicates and mask boilerplate before embedding
single policy per store. one metric, one normalization, one whitening state
dimension contract. assert embs.shape[1] == d at every hop and log it
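
The dimension contract can be a single helper called at every hop (after embedding, after load, before add). This is a sketch under assumptions: `check_contract`, the logger name, and the float32 requirement are illustrative, the last chosen because FAISS works on float32 arrays.

```python
import logging
import numpy as np

log = logging.getLogger("embedding_contract")

def check_contract(embs, d_expected, stage):
    """Check shape and dtype at one pipeline hop and log what was seen."""
    embs = np.asarray(embs)
    log.info("stage=%s shape=%s dtype=%s", stage, embs.shape, embs.dtype)
    if embs.ndim != 2 or embs.shape[1] != d_expected:
        raise ValueError(f"{stage}: expected (*, {d_expected}), got {embs.shape}")
    if embs.dtype != np.float32:
        raise TypeError(f"{stage}: expected float32, got {embs.dtype}")
    return embs
```

Raising instead of casting is deliberate here: a silent cast is one of the ways zeros or truncated vectors reach disk.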

series index

This article is part of the Semantic Clinic series. Sixteen reproducible failure modes with minimal fixes.
All posts in one place: Problem Map Articles
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md

DEV Community