tl;dr
Ingestion success does not prove retrieval health. If your neighbors look the same for unrelated queries, check for zero vectors and NaNs, confirm metric policy matches the index, rebuild from clean embeddings, and smoke test with neighbor overlap and recall.
why recall dies even though ingestion looks fine
Start with root causes. Fixes are easier when the failure mode is named.
- Zero vectors inserted. Empty spans or batch bugs produced all zeros. FAISS accepts them and your index looks valid.
- Metric mismatch. Embeddings assumed cosine but the index used L2, or the reverse. Distances no longer reflect meaning.
- Normalization drift. Query side normalized, corpus side not. Or some shards normalized and others not.
- Dimension or dtype errors. Wrong `d` when saving or loading silently truncated vectors. Device or dtype cast wrote zeros to disk.
- Codebook reuse. IVF or PQ codebooks reused after whitening or a model swap. Geometry changed but centroids did not.
- ID collisions and silent overwrite. You overwrote rows while keeping `index.ntotal` correct.
- Boot order. You trained IVF before dedup or before boilerplate masking, then you re-embedded later.
Label this family as No.8. If the boot sequence contributed, also mark No.14. If you tested against an empty or mixed store, mark No.16.
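ID collisions are cheap to rule out before ingestion. A minimal sketch, assuming you keep your row IDs in an array next to the embeddings; `find_id_collisions` is a hypothetical helper name, not a FAISS call.

```python
import numpy as np

def find_id_collisions(ids):
    """Return the IDs that appear more than once (hypothetical helper)."""
    ids = np.asarray(ids)
    uniq, counts = np.unique(ids, return_counts=True)
    return uniq[counts > 1]

# usage: refuse to ingest when any ID repeats
dupes = find_id_collisions([1, 2, 2, 3, 3, 3])
clean = find_id_collisions([10, 11, 12])
```

If `dupes` is non-empty, stop and fix the ID assignment before adding anything to the index.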
60 second health check
Do this first. It catches most silent failures.
1) sample 5k rows from your corpus embeddings
2) compute row norms and count zeros, NaN, and Inf
3) verify your metric and normalization policy match the index type
4) run ten random queries and measure neighbor overlap at k=20
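Steps 1 and 2 can be sketched with NumPy alone. This assumes the corpus embeddings fit in memory as a 2D array; `health_report` and its field names are illustrative, not part of any library.

```python
import numpy as np

def health_report(embs, sample_size=5000, seed=0):
    """Count zero, NaN, and Inf rows in a random sample of embeddings."""
    rng = np.random.default_rng(seed)
    n = embs.shape[0]
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    sample = np.asarray(embs[idx], dtype=np.float32)

    has_nan = np.isnan(sample).any(axis=1)
    has_inf = np.isinf(sample).any(axis=1)
    finite = ~(has_nan | has_inf)

    # norms are only meaningful for rows without NaN or Inf
    norms = np.linalg.norm(sample[finite], axis=1)
    zero_rows = int((norms == 0).sum())

    return {
        "sampled": int(len(idx)),
        "zero_rows": zero_rows,
        "nan_rows": int(has_nan.sum()),
        "inf_rows": int(has_inf.sum()),
        "zero_rate": zero_rows / len(idx),
    }

# usage on a tiny synthetic array with one bad row of each kind
demo = np.ones((10, 4), dtype=np.float32)
demo[2] = 0.0
demo[5, 1] = np.nan
demo[7, 0] = np.inf
report = health_report(demo)
```

Any non-zero count in `zero_rows`, `nan_rows`, or `inf_rows` means ingestion is broken, whatever the index says.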
rules of thumb
- zero vector rate must be 0.0%
- NaN or Inf count must be 0
- cosine retrieval requires L2 normalized vectors on both sides
- if average neighbor overlap across ten queries is > 0.35, geometry or ingestion is wrong
minimal fix that usually restores recall
Goal. Make retrieval trustworthy again with the smallest change set.
- Reject bad rows before add. Fail loudly if zeros or non-finite vectors exist.

```python
import numpy as np

def reject_bad(embs, d_expected):
    # shape contract: 2D array with the expected dimension
    assert embs.ndim == 2 and embs.shape[1] == d_expected, f"dim mismatch {embs.shape}"
    norms = np.linalg.norm(embs, axis=1)
    z = np.where(norms == 0)[0]               # all-zero rows
    nf = np.where(~np.isfinite(norms))[0]     # rows containing NaN or Inf
    if len(z) or len(nf):
        raise RuntimeError(f"zero:{len(z)} naninf:{len(nf)}")
    return norms
```
- Align metric with vector state. Cosine retrieval means normalize corpus and queries, then use L2 or IP consistently. Inner product retrieval means avoid renormalizing twice and control norms explicitly.
- Rebuild from clean embeddings. Do not patch mixed shards. Trash and rebuild. Retrain IVF or PQ if geometry changed.
- Run a five question smoke test. Fixed questions with known spans. If recall stays low, stop and re-check geometry.
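Why L2 and IP are interchangeable for cosine once both sides are normalized: for unit vectors, squared L2 distance equals 2 - 2 * cosine, so both orderings agree. A small NumPy check on synthetic data; the random vectors and the `l2_normalize` helper are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Row-wise L2 normalized copy. Assumes no zero rows."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
corpus = l2_normalize(rng.normal(size=(200, 16)))
query = l2_normalize(rng.normal(size=16))

cos = corpus @ query                           # inner product on unit vectors is cosine
sq_l2 = ((corpus - query) ** 2).sum(axis=1)    # squared L2 distance

# identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * cos)

top_by_ip = np.argsort(-cos)[:10]   # largest inner product first
top_by_l2 = np.argsort(sq_l2)[:10]  # smallest distance first
same_ranking = np.array_equal(top_by_ip, top_by_l2)
```

The identity fails as soon as one side is left unnormalized, which is exactly the normalization drift case.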
tiny scripts you can paste
neighbor overlap sanity
```python
def overlap_k(a_ids, b_ids, k=20):
    # fraction of shared ids between two top-k neighbor lists
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)

# run across 10 random queries and average. healthy spaces are well below 0.35
```
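One way to run the averaging step, sketched with a brute-force NumPy search standing in for your real index. `mean_pairwise_overlap` and `brute_force_topk` are hypothetical helpers and the data is synthetic; `overlap_k` is repeated so the block runs on its own.

```python
from itertools import combinations
import numpy as np

def overlap_k(a_ids, b_ids, k=20):
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)

def mean_pairwise_overlap(top_ids, k=20):
    """Average overlap_k over every pair of query result lists."""
    pairs = list(combinations(range(len(top_ids)), 2))
    return sum(overlap_k(top_ids[i], top_ids[j], k) for i, j in pairs) / len(pairs)

def brute_force_topk(corpus, queries, k=20):
    """Exact inner product search. Stand-in for index.search in this sketch."""
    scores = queries @ corpus.T
    # stable sort so ties resolve by row index
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(2000, 32))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = rng.normal(size=(10, 32))

# healthy store: unrelated queries get mostly different neighbors
healthy = mean_pairwise_overlap(brute_force_topk(corpus, queries))

# collapsed store: all-zero corpus, every query gets the same neighbors
collapsed = mean_pairwise_overlap(brute_force_topk(np.zeros_like(corpus), queries))
```

With a real index, replace `brute_force_topk` with the id matrix your search call returns.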
FAISS rebuild for cosine via L2
```python
import faiss, numpy as np
from sklearn.preprocessing import normalize

Z = np.load("embeddings.npy").astype("float32")
Z = normalize(Z, axis=1)                       # cosine requires L2 normalized vectors
Z = np.ascontiguousarray(Z, dtype="float32")   # FAISS expects contiguous float32
faiss.normalize_L2(Z)                          # belt and suspenders, in place

d = Z.shape[1]
index = faiss.IndexHNSWFlat(d, 32)   # default metric is L2. on unit vectors it ranks like cosine
index.hnsw.efConstruction = 200
index.add(Z)
faiss.write_index(index, "hnsw_cosine.faiss")
```
IVF PQ retrain sketch
```python
import faiss, numpy as np

Z = np.load("embeddings_clean.npy").astype("float32")
d = Z.shape[1]
nlist, m, nbits = 4096, 64, 8
assert d % m == 0, "PQ needs d divisible by m"

quant = faiss.IndexFlatL2(d)                          # coarse quantizer
ivfpq = faiss.IndexIVFPQ(quant, d, nlist, m, nbits)
ivfpq.train(Z)      # train on a large clean sample
ivfpq.add(Z)
ivfpq.nprobe = 32   # inverted lists probed per query
faiss.write_index(ivfpq, "ivfpq_l2.faiss")
```
acceptance criteria before you call it fixed
- zero vector and NaN rates are 0.0%
- metric and normalization policy are documented and match the index type
- after whitening for cosine, PC1 explained variance is in a healthy band. no single axis dominates
- neighbor overlap across 20 random queries is ≤ 0.35 at k=20
- recall at k on the heldout set rises and top k lists actually change with the query
- `index.ntotal` equals the number of valid rows you ingested. no silent drops
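Two of the criteria above are easy to measure directly. A minimal sketch, assuming embeddings fit in memory and that your heldout set gives a list of relevant ids per query; both function names are illustrative, and recall is taken here as the fraction of relevant ids found in the top k, averaged over queries. The healthy band for PC1 is left to your own baseline.

```python
import numpy as np

def pc1_explained_variance(embs):
    """Share of total variance carried by the first principal component."""
    X = np.asarray(embs, dtype=np.float64)
    X = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)   # singular values only
    var = s ** 2
    return float(var[0] / var.sum())

def recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Mean fraction of relevant ids that appear in each top-k list."""
    per_query = [
        len(set(got[:k]) & set(want)) / len(want)
        for got, want in zip(retrieved_ids, relevant_ids)
    ]
    return sum(per_query) / len(per_query)

# a space where one axis carries everything vs. an isotropic one
one_axis = np.zeros((100, 3))
one_axis[:, 0] = np.arange(100)
dominated = pc1_explained_variance(one_axis)

rng = np.random.default_rng(0)
isotropic = pc1_explained_variance(rng.normal(size=(5000, 8)))

recall = recall_at_k([[1, 2, 3], [4, 5, 6]], [[3, 7], [4]], k=3)
```

Run `pc1_explained_variance` on a sample rather than the full corpus if the SVD is too slow.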
when minimal is not enough
- retrain IVF or PQ codebooks on a large clean deduped sample
- corpus hygiene. dedup near duplicates and mask boilerplate before embedding
- single policy per store. one metric, one normalization, one whitening state
- dimension contract. assert `embs.shape[1] == d` at every hop and log it
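The dimension contract can be one small function called at each hop (after embedding, after load, before add). A sketch under assumptions: `check_dim`, the stage labels, and the float32 requirement are illustrative choices, the last one because FAISS works on float32 arrays.

```python
import logging
import numpy as np

log = logging.getLogger("dim_contract")

def check_dim(embs, d, stage):
    """Fail loudly on shape or dtype drift, and log what passed."""
    embs = np.asarray(embs)
    if embs.ndim != 2 or embs.shape[1] != d:
        raise ValueError(f"[{stage}] expected (*, {d}), got {embs.shape}")
    if embs.dtype != np.float32:
        raise TypeError(f"[{stage}] expected float32, got {embs.dtype}")
    log.info("%s: shape=%s dtype=%s", stage, embs.shape, embs.dtype)
    return embs

# usage: same call at every hop, with a stage label for the log
ok = check_dim(np.zeros((3, 8), dtype=np.float32), 8, "after_load")

try:
    check_dim(np.zeros((3, 7), dtype=np.float32), 8, "before_add")
    raised = False
except ValueError:
    raised = True
```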
series index
This article is part of the Semantic Clinic series. Sixteen reproducible failure modes with minimal fixes.
All posts in one place: Problem Map Articles
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md