Santanu Mohanta

Posted on Jun 23

I added a reranker to my RAG pipeline — it broke everything, then I fixed it

#rag #python #ai #fastapi

In v2 I added hybrid retrieval (FAISS + BM25) to fix keyword blind spots. All 19 test questions passed. The next item on my list was a cross-encoder reranker for better precision.

The idea is standard: over-fetch candidates, rerank with a smarter model, keep the top-k. Every RAG tutorial recommends it. It took me 20 minutes to implement and immediately broke 2 of my 19 tests.

Here's what went wrong and the strategy I landed on.

What a cross-encoder does (and why it's better)

In v2, retrieval uses bi-encoders — the query and each chunk are embedded independently, then compared by cosine similarity. Fast, but the model never sees query and chunk together.

A cross-encoder is different. It takes the (query, chunk) pair as a single input and outputs a relevance score, so it can attend to both at once: word-level interactions, negation, paraphrasing. That usually makes it more accurate, but it is too slow for first-stage retrieval: nothing can be precomputed, so every query would need a model forward pass for every chunk in the index.

The standard two-stage pattern:

Stage 1: cheap retrieval (FAISS + BM25) → broad candidate set
Stage 2: cross-encoder reranks candidates → precise top-k → LLM
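The two-stage pattern above can be sketched without any model at all. This is a minimal illustration, not the code from the repo: `two_stage_retrieve`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers only stand in for "cheap first stage" and "precise second stage".

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    chunks: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int,
    top_k: int,
) -> List[Tuple[str, float]]:
    """Stage 1 narrows the corpus cheaply; stage 2 decides the final order."""
    # Stage 1: score every chunk with the cheap function, keep a broad candidate set.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: the expensive scorer only runs on the candidates, not the whole corpus.
    rescored = [(c, precise_score(query, c)) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Toy stand-ins (hypothetical): shared-word count for stage 1,
# shared words divided by chunk length for stage 2.
def word_overlap(query: str, chunk: str) -> float:
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def overlap_ratio(query: str, chunk: str) -> float:
    words = chunk.lower().split()
    return word_overlap(query, chunk) / len(words) if words else 0.0


docs = [
    "the ceo of the company is iris kallas",
    "the company builds warehouse robots",
    "ceo",
    "unrelated text about weather",
]
top = two_stage_retrieve(
    "who is the ceo", docs, word_overlap, overlap_ratio, n_candidates=3, top_k=2
)
for chunk, score in top:
    print(f"{score:.3f}  {chunk}")
```

Note that the second stage is free to reorder whatever the first stage hands it. In this toy run it promotes the one-word chunk above the chunk that actually contains the answer, which is the same kind of override the rest of this post is about.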

The implementation (the easy part)

New file — app/reranker.py:

from sentence_transformers import CrossEncoder

RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"

_reranker = None

def get_reranker():
    global _reranker
    if _reranker is None:
        _reranker = CrossEncoder(RERANKER_MODEL_NAME)
    return _reranker

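def rerank(query, retrievals, top_k):
    """Score each (query, chunk) pair with the cross-encoder and keep the top_k."""
    if not retrievals:
        return []
    model = get_reranker()
    pairs = [[query, r.chunk.text] for r in retrievals]
    scores = model.predict(pairs)
    # Overwrite each first-stage score with the cross-encoder score.
    for r, score in zip(retrievals, scores):
        r.score = float(score)
    ranked = sorted(retrievals, key=lambda r: r.score, reverse=True)
    return ranked[:top_k]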

And in main.py, over-fetch then rerank:

# Before (v2): retrieve top_k directly
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)

# After (v3): over-fetch, then rerank
candidates = store.search(query_vec, top_k=req.top_k * 2, query_text=req.question)
retrieved = rerank(req.question, candidates, top_k=req.top_k)

No new dependency: cross-encoder/ms-marco-MiniLM-L-6-v2 loads through sentence-transformers, which was already installed. The model is ~80MB and runs on CPU.

I ran the eval. Two tests broke.

What broke

Question:  Who is the CEO of Zentara Robotics?
Expected:  ['Iris Kallas']
Got:       I couldn't find that in the document.

Question:  How many employees does Zentara have?
Expected:  ['287']
Got:       I couldn't find that in the document.

The exact same two questions that failed in v1 with pure FAISS. Hybrid retrieval fixed them. The reranker un-fixed them.
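For context, a check like the one that produced the output above can be very small. The sketch below is an assumption about the shape of such a harness, not the repo's actual eval code: `run_eval` and `fake_answer` are hypothetical names, and a case passes when every expected string appears in the answer.

```python
from typing import Callable, Dict, List


def run_eval(cases: List[Dict], answer_fn: Callable[[str], str]) -> List[Dict]:
    """Return the cases whose answer is missing any expected substring."""
    failures = []
    for case in cases:
        answer = answer_fn(case["question"])
        missing = [e for e in case["expected"] if e.lower() not in answer.lower()]
        if missing:
            failures.append(
                {"question": case["question"], "missing": missing, "got": answer}
            )
    return failures


cases = [
    {"question": "Who is the CEO of Zentara Robotics?", "expected": ["Iris Kallas"]},
    {"question": "How many employees does Zentara have?", "expected": ["287"]},
]


# Stand-in for the real pipeline call: answers one question, misses the other.
def fake_answer(question: str) -> str:
    if "CEO" in question:
        return "The CEO is Iris Kallas."
    return "I couldn't find that in the document."


failures = run_eval(cases, fake_answer)
print(f"{len(cases) - len(failures)}/{len(cases)} passing")
for failure in failures:
    print("FAILED:", failure["question"], "missing", failure["missing"])
```

Substring matching is crude, but it is cheap enough to run after every retrieval change, which is what makes a regression like this one visible.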

Why the cross-encoder hates tables

The CEO chunk looks like this:

Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287 | Founded: 2018 ...

Dense. Tabular. Eight facts crammed together.

The cross-encoder (ms-marco-MiniLM-L-6-v2) was trained on MS MARCO — a web search dataset where passages are natural language paragraphs. When it sees a fact-packed table row as a "passage" for the query "Who is the CEO?", it scores it low. It doesn't look like a good answer, even though it contains the answer.

Meanwhile, hybrid retrieval ranked this chunk #1: BM25 matched "CEO" exactly and reciprocal rank fusion (RRF) boosted it. The cross-encoder then threw it away.
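To make the RRF boost concrete, here is a minimal sketch of reciprocal rank fusion in its common form, where each ranked list contributes 1 / (k + rank) per chunk. The function name `rrf_fuse`, the chunk ids, and the two rankings are made up for illustration; the repo's fusion code may differ.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk ids: each list adds 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical ids: the table chunk is second for dense search but first for BM25.
dense_ranking = ["intro", "table_row", "history"]
bm25_ranking = ["table_row", "history", "intro"]

fused = rrf_fuse([dense_ranking, bm25_ranking])
print(fused)  # ['table_row', 'intro', 'history']
```

The table chunk wins the fused ranking because it places well in both lists, even though dense search alone did not put it first.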

What I tried (and why it failed)

I went through 7 approaches before finding one that worked. Here's the progression (CE = cross-encoder):

#	Approach	Result
1	Pure CE rerank	CE buries table chunks
2	Bigger candidate pool (15)	More candidates = more competition
3	Score blending (0.7 CE + 0.3 RRF)	CE score is so negative it still dominates
4	Score blending (0.5 + 0.5)	Still not enough
5	RRF fusion of CE + first-stage rankings	K=60 makes all rank contributions ~equal, CE rank wins
6	Weighted RRF (2x first-stage)	Still too flat with K=60
7	Smaller pool (top_k * 2)	CE still pushes table chunks out

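The core issue: the cross-encoder's score for table chunks is so negative that neither score blending nor rank fusion can compensate. In a blend, the CE score swamps the first-stage score. In rank fusion, the CE puts the chunk near the bottom of its list, and with K=60 a first-place finish in the first stage is worth too little to make up for it. It's not a "this chunk ranks slightly lower" problem. It's a "the model actively rejects this format" problem.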
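A bit of arithmetic shows why rows 3 to 6 of the table could not rescue the chunk. Every number below is illustrative, not a measurement from this project, and the sketch assumes the blend used raw, unnormalized scores (the post does not say either way). `blend` and `rrf` are hypothetical helper names.

```python
from typing import List, Tuple


def blend(ce_score: float, first_stage_score: float, ce_weight: float) -> float:
    """Weighted sum of a cross-encoder score and a first-stage (RRF) score."""
    return ce_weight * ce_score + (1.0 - ce_weight) * first_stage_score


def rrf(ranks_and_weights: List[Tuple[int, float]], k: int = 60) -> float:
    """Weighted reciprocal rank fusion score for a single chunk."""
    return sum(weight / (k + rank) for rank, weight in ranks_and_weights)


# Illustrative values only: a table chunk the CE rejects vs. a prose chunk it likes.
table = {"ce": -10.0, "rrf": 0.0325, "first_rank": 1, "ce_rank": 10}
prose = {"ce": 2.0, "rrf": 0.0300, "first_rank": 4, "ce_rank": 1}

# Score blending: RRF scores are tiny next to CE scores, so the CE term dominates.
for ce_weight in (0.7, 0.5):
    t = blend(table["ce"], table["rrf"], ce_weight)
    p = blend(prose["ce"], prose["rrf"], ce_weight)
    print(f"blend ce_weight={ce_weight}: table={t:.3f} prose={p:.3f}")

# Rank fusion: with k=60, rank 1 vs. rank 4 differs far less than rank 1 vs. rank 10.
for first_weight in (1.0, 2.0):
    t = rrf([(table["first_rank"], first_weight), (table["ce_rank"], 1.0)])
    p = rrf([(prose["first_rank"], first_weight), (prose["ce_rank"], 1.0)])
    print(f"rrf first_stage_weight={first_weight}: table={t:.5f} prose={p:.5f}")
```

With these numbers the prose chunk outscores the table chunk in all four configurations, including the one that doubles the first-stage weight.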

What actually worked: guaranteed slots

The insight: the first-stage results are already good. Hybrid retrieval passed all 19 tests. The reranker should improve those results, not override them.

The strategy:

top_k = 3:  guaranteed slots = 2 (from first-stage)  +  1 CE pick
top_k = 5:  guaranteed slots = 4 (from first-stage)  +  1 CE pick

The top first-stage results are preserved. The cross-encoder only gets to fill the last slot from the remaining candidates. Here's the final implementation:

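def rerank(query, retrievals, top_k):
    # Nothing to rerank if every candidate already fits in top_k.
    if not retrievals or top_k >= len(retrievals):
        return retrievals
    # Guard: a non-positive top_k would make the slices below misbehave.
    if top_k <= 0:
        return []

    # Keep the top first-stage hits untouched; the CE fills only the last slot.
    n_guaranteed = top_k - 1
    n_ce_slots = 1

    guaranteed = retrievals[:n_guaranteed]
    remaining = retrievals[n_guaranteed:]

    if remaining:
        # Score only the non-guaranteed candidates with the cross-encoder.
        model = get_reranker()
        pairs = [[query, r.chunk.text] for r in remaining]
        scores = model.predict(pairs)
        # Guaranteed hits keep their first-stage score, so the returned
        # list mixes two score scales.
        for r, score in zip(remaining, scores):
            r.score = round(float(score), 4)
        remaining.sort(key=lambda r: r.score, reverse=True)

    return guaranteed + remaining[:n_ce_slots]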

The CEO chunk (first-stage #1) is always guaranteed. The employee chunk (~rank 3-4 at top_k=5) is also preserved. The CE still adds value by selecting the most relevant candidate for the final slot.

Result: 19/19 passing.
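The behaviour is easy to check in isolation by swapping the model for a stub. The sketch below is a self-contained variant, not the repo code: `Hit`, `rerank_with_guaranteed_slots`, and `stub_ce` are hypothetical names, the sample chunks are made up apart from the table row, and the stub simply mimics a scorer that dislikes pipe-delimited rows.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    text: str
    score: float


def rerank_with_guaranteed_slots(
    query: str,
    hits: List[Hit],
    top_k: int,
    score_fn: Callable[[str, str], float],
) -> List[Hit]:
    """Keep the first top_k - 1 hits as-is; let score_fn pick the last one."""
    if top_k <= 0:
        return []
    if top_k >= len(hits):
        return list(hits)
    guaranteed = hits[: top_k - 1]
    remaining = hits[top_k - 1:]
    best = max(remaining, key=lambda h: score_fn(query, h.text))
    return guaranteed + [best]


# Stub scorer (hypothetical): rejects table rows, otherwise prefers longer prose.
def stub_ce(query: str, text: str) -> float:
    return -10.0 if "|" in text else float(len(text.split()))


hits = [
    Hit("Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287", 0.033),
    Hit("Zentara was founded in 2018.", 0.031),
    Hit("Short note.", 0.030),
    Hit("The leadership team has guided the company since its founding.", 0.029),
]

result = rerank_with_guaranteed_slots("Who is the CEO?", hits, top_k=3, score_fn=stub_ce)
for hit in result:
    print(hit.text)
```

The table row survives in first place no matter how badly the stub scores it, and the scorer only gets to choose between the third and fourth candidates for the final slot.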

The pipeline now

PDF ─► extract text ─► chunk ─► embed (MiniLM-L6-v2)
                                        │
                                        ▼
question ─► FAISS + BM25 (2× top_k candidates, RRF fused)
         ─► cross-encoder reranks remaining candidates
         ─► guaranteed first-stage slots + 1 CE-picked slot
         ─► top_k chunks ─► LLM ─► answer + sources

Three retrieval components now: vector search, keyword search, and a cross-encoder. Each catches something the others miss.

What I learned

Rerankers aren't drop-in improvements. Every RAG tutorial shows "add a cross-encoder, get better results." In practice, cross-encoders trained on natural language passages can actively hurt retrieval quality on structured or tabular content.
Your eval set is your safety net. Without the 19-question eval harness, I would've shipped this and had no idea I'd regressed on 2 questions. The eval caught it in seconds.
Guaranteed slots > score blending. I tried 7 variations of reranking, score blending, and rank fusion. None worked because the CE's score for table chunks was so negative it dominated every combination. The fix wasn't mathematical. It was structural: protect what's already working, let the CE improve the margins.
The retriever still matters most. v1 → v2 (adding BM25) was the biggest accuracy jump. v2 → v3 (adding the reranker) was a precision refinement that nearly caused regressions. Invest in your first-stage retrieval before reaching for rerankers.

What's next

Streaming responses
Conversation memory
Possibly a Streamlit UI

Try it yourself

v3 (reranker): github.com/santanu2908/chat-with-pdf-rag
v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag/tree/v2
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the sample PDF, and try "Who is the CEO?" — it still works, even with the reranker.

If you've hit similar issues with cross-encoders on structured content, I'd love to hear your approach.

I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.

Top comments (3)

Max Quimby • Jun 23

The guaranteed-slots fix is genuinely clever — "protect what already works, let the reranker fight for the margin" is a much saner default than trusting a score blend you can't tune. Your root-cause diagnosis (an MS MARCO-trained cross-encoder rejecting a table row because it doesn't look like a prose passage) points at a second lever worth naming: fix the chunk, not the ranker. If you normalize that pipe-delimited row into a sentence at ingestion — "The CEO of Zentara Robotics is Iris Kallas; it has 287 employees…" — the chunk now matches the CE's training distribution, so it scores fairly and you get to keep the reranker active on every slot. The trade-off is ingestion-time complexity vs. a retrieval-time guard. I think your structural fix is right when tables are a minority of the corpus, but it'd degrade if most chunks were tabular — then you're protecting almost everything and the CE never gets to do its job. If tables were 70% of your docs, which way would you lean?

Santanu Mohanta • Jun 24

Great observation — you've nailed the exact trade-off I was weighing.

On chunk normalization at ingestion: I considered this. Converting table rows into natural sentences ("The CEO of Zentara Robotics is Iris Kallas; it has 287 employees…") would absolutely help the CE score them fairly. The reason I didn't go that route is that it introduces a lossy transformation — you're betting that your sentence template captures what a future question will ask about. If someone asks "how many employees?" and the normalized sentence buried that detail mid-sentence, retrieval might still struggle. The raw table row at least preserves the original structure for BM25 to keyword-match against.

On the 70% tabular corpus question: Honestly, at that point I'd lean toward fixing the chunks rather than protecting them. If most of your corpus is tabular, the guaranteed-slots strategy degrades exactly as you described — you're shielding almost everything and the CE becomes decorative. I'd probably do both: normalize tables into sentences at ingestion and switch the CE to one fine-tuned on structured data (or at least evaluated on table-heavy benchmarks). The structural guard was the right call for a 5-page doc with one dense table — but it's a patch, not a scalable strategy.

The honest answer is that for a production system with table-heavy PDFs, you probably want a dedicated table extraction step (something like table detection → structured parse → per-cell indexing) rather than treating tables as text chunks at all.

Aly • Jun 24

Your experience with the reranker in your RAG pipeline highlights a common challenge in ensuring accurate retrieval. One way to enhance the reliability of your results is by implementing a document provenance system. Using evidence bundles can provide a tamper-evident way to capture the documents' origins and changes. This could be particularly useful in debugging and ensuring compliance with data usage policies. If you're interested, you might want to explore how the MCP tool can help in this regard, as it allows for capturing and verifying document interactions seamlessly. More details can be found at docimprint.com/mcp.