Why Your Vector Index Returns Five Copies of the Same Doc

#rag #ai #embeddings #python

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ask the retriever for the top five chunks. It hands you back five copies of the same paragraph. Different IDs, slightly different scores, the same text underneath. The LLM reads them, decides the answer is whatever that one paragraph says, and ignores the four other paragraphs in your corpus that would have nuanced it.

This is the failure mode you do not catch in unit tests. Recall@5 looks fine. The single relevant chunk is in there. It is in there five times over. Your context window is now five paragraphs of redundant content and zero paragraphs of useful contrast.

The fix is two ideas glued together: hash-dedup before you rank, then run MMR over what is left. Two call sites in your retrieval path, about forty lines of Python behind them, and the problem goes away.

Why the same document keeps winning

Three causes show up over and over, and they stack. That is why the symptom is so loud.

Cause one: duplicated content in your corpus. Your crawler grabbed the same article from example.com/posts/42 and example.com/blog/2024/foo-bar. Your CMS exports the printable view and the regular view as separate documents. A vendor's API returns the same FAQ entry under three category paths. The bodies are byte-identical. The embeddings are byte-identical. They all match the query equally well, and they all show up in top-K together.

Cause two: chunk-level overlap. You chunked with a sliding window of 512 tokens and a stride of 128. That is a sensible config. It means every sentence near a chunk boundary lives in two or three chunks. A well-phrased sentence that matches the query lights up every chunk it sits inside. You did not duplicate the document. You duplicated the high-signal sentences inside the document.

Cause three: there is no diversity in nearest-neighbor search. Top-K cosine returns the K closest points to the query in embedding space. Closeness is the only criterion. If two passages live in the same neighborhood, the index returns both, even if they say the same thing. Vector indexes do not know what "saying the same thing" means. They know distance.

Corpus hygiene helps with the duplication and overlap, and you should chip away at it, but that cleanup work never finishes and the duplicates will be back next quarter. The algorithmic blindness you can patch at retrieval time without re-indexing anything.

Step one: dedup by content hash

Before MMR, kill the byte-identical duplicates. They contribute nothing and they confuse the diversity step.

import hashlib

def normalize(text: str) -> str:
    return " ".join(text.split()).strip().lower()

def content_hash(text: str) -> str:
    return hashlib.sha1(
        normalize(text).encode("utf-8")
    ).hexdigest()

def dedup(hits):
    seen = set()
    out = []
    for h in hits:
        key = content_hash(h["text"])
        if key in seen:
            continue
        seen.add(key)
        out.append(h)
    return out

normalize collapses whitespace and lowercases. That catches the printable-view-vs-regular-view case where the only difference is markup the chunker stripped differently. If you want to be stricter on near-duplicates, swap the SHA-1 for a SimHash or MinHash and threshold the distance. For most corpora, exact-after-normalize is enough.

Run this before MMR. It is cheap and it removes a class of bug that MMR cannot fix on its own (MMR will happily pick two identical-text chunks if their vectors differ slightly due to chunk metadata).

Step two: Maximal Marginal Relevance

The diversity question was answered in 1998 by Carbonell and Goldstein in The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries: rank candidates by relevance to the query, but penalize each candidate for being similar to what you already picked. The score is

MMR(d) = lambda * sim(d, q)
       - (1 - lambda) * max(sim(d, s)
                            for s in selected)

Pick the candidate with the highest MMR. Add it to selected. Repeat until you have K items. lambda is the relevance-vs-diversity dial: 1.0 is pure relevance (vanilla top-K), and 0.0 is pure diversity, which gives you the most spread-out points in your candidate pool regardless of the query. The useful range is 0.5 to 0.8. Start at 0.7.

Forty lines of Python over OpenAI's text-embedding-3-small:

import numpy as np
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"

def embed(texts):
    resp = client.embeddings.create(
        model=MODEL, input=texts
    )
    return np.array(
        [d.embedding for d in resp.data],
        dtype=np.float32,
    )

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    selected = []
    candidates = list(range(len(doc_vecs)))
    sim_q = doc_vecs @ query_vec
    sim_dd = doc_vecs @ doc_vecs.T

    while candidates and len(selected) < k:
        best_i, best_score = None, -1e9
        for i in candidates:
            if not selected:
                penalty = 0.0
            else:
                penalty = max(
                    sim_dd[i, j] for j in selected
                )
            score = lam * sim_q[i] - (1 - lam) * penalty
            if score > best_score:
                best_score, best_i = score, i
        selected.append(best_i)
        candidates.remove(best_i)
    return selected

The vectors must be unit-normalized for the dot products to behave as cosine similarity. OpenAI's text-embedding-3-small returns unit vectors by default, so this code is safe over its output. If you use a different embedder, normalize your vectors first or swap the dot products for explicit cosine, otherwise the diversity penalty silently goes wrong. The sim_dd matrix is N x N where N is your candidate-pool size; that is fine for N=50 and painful for N=10000, so run MMR over the top-K-from-vector-DB results, not your whole corpus. After the matrix is built, the loop is O(N * K), which is the right complexity for the diversity step at typical K like 5 or 10.

Wiring it into the retrieval call

The full pipeline is: vector search for top-50, hash-dedup, MMR down to top-K.

def retrieve(query, store, k=5, pool=50):
    q_vec = embed([query])[0]
    hits = store.search(q_vec, top_k=pool)
    hits = dedup(hits)
    if len(hits) <= k:
        return hits

    doc_vecs = np.array(
        [h["vector"] for h in hits],
        dtype=np.float32,
    )
    picks = mmr(q_vec, doc_vecs, k=k, lam=0.7)
    return [hits[i] for i in picks]

store.search is a stand-in for whichever vector DB you use. The shape is the same in pgvector, Pinecone, Qdrant, Weaviate, and the duct-taped numpy-array index your prototype is still running on. What changes is how you fetch the stored vectors back so MMR can compare candidates against each other. Most clients return them in the result payload; if yours does not, fetch them in a second batch by ID.

Pool size of 50 is a reasonable default. Bigger pools cost a bit more compute in exchange for a better chance that MMR finds genuinely different content; tune from there if your candidates still come back near-identical. If your recall@50 is poor, no amount of MMR will save you, and the problem is upstream.

What to expect after you ship it

Your prompt template stops repeating itself. A query that used to return five copies of the "what is X" chunk now returns the "what is X" chunk, the "how X is configured" chunk, the "common X failure modes" chunk, and two siblings. The LLM has more to work with, and the answers stop being weirdly confident about the one thing it saw five times.

The harder-to-see win is in your eval rig: it stops being so volatile across embedder upgrades. A new embedder that nudges all the duplicates from rank 1, 2, 3, 4, 5 to rank 1, 3, 5, 7, 9 would have looked like a regression to a vanilla top-K eval. With dedup plus MMR, the regression collapses to "same answer, plus four other useful ones" and your eval numbers move smoothly.

MMR with lambda=0.7 does sometimes drop the second-most-relevant chunk for a third-most-relevant-but-different one. On corpora with very narrow query intents, that can be the wrong call. Tune lambda per use case if you can; default to 0.7 if you cannot.

What to do tomorrow

Three steps, in order. Run the hash-dedup pass over a recent batch of retrieval results. Count how many duplicates per query you find. If the answer is "lots", you have corpus hygiene work to do regardless of MMR. Add the MMR step on top of dedup with lambda=0.7 and pool=50. Re-run your eval and look at the qualitative change in answer quality, not just recall@K.

The two-line fix is dedup(hits) then mmr(q, vecs, k=5, lam=0.7).

If this was useful

The RAG Pocket Guide covers the retrieval-time hygiene work end-to-end: when to dedup, where to put MMR in the pipeline, how to pick lambda for your corpus, and the diversity-versus-recall trade-off the tutorials skip. If your top-K is still full of near-duplicates after the two-line fix, the next chapters in there are where to go.