In v2 I added hybrid retrieval (FAISS + BM25) to fix keyword blindspots. All 19 test questions passed. The next item on my list was a cross-encoder reranker for better precision.
The idea is standard: over-fetch candidates, rerank with a smarter model, keep the top-k. Every RAG tutorial recommends it. It took me 20 minutes to implement and immediately broke 2 of my 19 tests.
Here's what went wrong and the strategy I landed on.
What a cross-encoder does (and why it's better)
In v2, retrieval uses bi-encoders — the query and each chunk are embedded independently, then compared by cosine similarity. Fast, but the model never sees query and chunk together.
A cross-encoder is different. It takes the (query, chunk) pair as a single input and outputs a relevance score. Because it attends to both at once, it can pick up word-level interactions, negation, and paraphrasing. That usually makes it more accurate, but it is too slow for first-stage retrieval because you'd need to run the model on every chunk in the index for each query.
The standard two-stage pattern:
Stage 1: cheap retrieval (FAISS + BM25) → broad candidate set
Stage 2: cross-encoder reranks candidates → precise top-k → LLM
The implementation (the easy part)
New file — app/reranker.py:
from sentence_transformers import CrossEncoder

RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
_reranker = None


def get_reranker():
    # Load the model once and reuse it across requests.
    global _reranker
    if _reranker is None:
        _reranker = CrossEncoder(RERANKER_MODEL_NAME)
    return _reranker


def rerank(query, retrievals, top_k):
    model = get_reranker()
    # Score each (query, chunk) pair jointly.
    pairs = [[query, r.chunk.text] for r in retrievals]
    scores = model.predict(pairs)
    # Overwrite the first-stage score with the cross-encoder score.
    for r, score in zip(retrievals, scores):
        r.score = float(score)
    ranked = sorted(retrievals, key=lambda r: r.score, reverse=True)
    return ranked[:top_k]
And in main.py, over-fetch then rerank:
# Before (v2): retrieve top_k directly
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)
# After (v3): over-fetch, then rerank
candidates = store.search(query_vec, top_k=req.top_k * 2, query_text=req.question)
retrieved = rerank(req.question, candidates, top_k=req.top_k)
No new dependency: cross-encoder/ms-marco-MiniLM-L-6-v2 loads through sentence-transformers, which was already installed. The model is ~80 MB and runs on CPU.
I ran the eval. Two tests broke.
What broke
Question: Who is the CEO of Zentara Robotics?
Expected: ['Iris Kallas']
Got: I couldn't find that in the document.
Question: How many employees does Zentara have?
Expected: ['287']
Got: I couldn't find that in the document.
The exact same two questions that failed in v1 with pure FAISS. Hybrid retrieval fixed them. The reranker un-fixed them.
Why the cross-encoder hates tables
The CEO chunk looks like this:
Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287 | Founded: 2018 ...
Dense. Tabular. Eight facts crammed together.
The cross-encoder (ms-marco-MiniLM-L-6-v2) was trained on MS MARCO — a web search dataset where passages are natural language paragraphs. When it sees a fact-packed table row as a "passage" for the query "Who is the CEO?", it scores it low. It doesn't look like a good answer, even though it contains the answer.
Meanwhile, hybrid retrieval ranked this chunk #1 — BM25 matched "CEO" exactly and RRF boosted it. The cross-encoder then threw it away.
What I tried (and why it failed)
I went through 7 approaches before finding one that worked. Here's the progression:
| # | Approach | Result |
|---|---|---|
| 1 | Pure CE rerank | CE buries table chunks |
| 2 | Bigger candidate pool (15) | More candidates = more competition |
| 3 | Score blending (0.7 CE + 0.3 RRF) | CE score is so negative it still dominates |
| 4 | Score blending (0.5 + 0.5) | Still not enough |
| 5 | RRF fusion of CE + first-stage rankings | K=60 makes all rank contributions ~equal, CE rank wins |
| 6 | Weighted RRF (2x first-stage) | Still too flat with K=60 |
| 7 | Smaller pool (top_k * 2) | CE still pushes table chunks out |
The core issue: the cross-encoder scores table chunks so low, and ranks them so far down, that none of the score-blending or rank-fusion variants I tried could compensate. It's not a "this chunk ranks slightly lower" problem. It's a "the model actively rejects this format" problem.
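To make the scale problem concrete, here's a small sketch with made-up numbers. The ranks and scores below are hypothetical and not taken from my eval run. The sketch assumes the cross-encoder returns raw, unbounded scores (which is what the strongly negative values suggest) and that RRF uses the usual 1 / (K + rank) form with K=60. The names `rrf`, `blend`, and `fuse` are illustrative helpers, not functions from the repo.

```python
# Hypothetical numbers for illustration only; not measured from the eval run.

def rrf(rank, k=60, weight=1.0):
    """Reciprocal rank fusion contribution for a 1-based rank."""
    return weight / (k + rank)

# Chunk A: the table row. First-stage rank 1, but the cross-encoder ranks it last of 10.
# Chunk B: a prose paragraph. First-stage rank 5, cross-encoder rank 1.
a = {"first_rank": 1, "ce_rank": 10, "ce_score": -9.0}
b = {"first_rank": 5, "ce_rank": 1, "ce_score": 2.0}

def blend(chunk, w_ce):
    """Weighted sum of the raw CE score and the first-stage RRF score."""
    return w_ce * chunk["ce_score"] + (1 - w_ce) * rrf(chunk["first_rank"])

def fuse(chunk, first_weight=1.0):
    """RRF over both rankings, optionally up-weighting the first stage."""
    return rrf(chunk["first_rank"], weight=first_weight) + rrf(chunk["ce_rank"])

for w in (0.7, 0.5):
    print(f"blend w_ce={w}: A={blend(a, w):.4f}  B={blend(b, w):.4f}")
for fw in (1.0, 2.0):
    print(f"rrf first_weight={fw}: A={fuse(a, fw):.5f}  B={fuse(b, fw):.5f}")
```

With these particular numbers, B beats A in all four cases. In the blends, the first-stage term is around 0.005 while the CE term is several whole units, so the weights barely matter. In the RRF variants, rank 1 contributes 1/61 and rank 10 contributes 1/70, which is too small a gap for the first-stage lead to survive, even at 2x weight.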
What actually worked: guaranteed slots
The insight: the first-stage results are already good. Hybrid retrieval passed all 19 tests. The reranker should improve those results, not override them.
The strategy:
top_k = 3: 2 guaranteed slots (from first-stage) + 1 CE pick
top_k = 5: 4 guaranteed slots (from first-stage) + 1 CE pick
The top first-stage results are preserved. The cross-encoder only gets to fill the last slot from the remaining candidates. Here's the final implementation:
def rerank(query, retrievals, top_k):
    # Nothing to trim: return the first-stage results unchanged.
    if not retrievals or top_k >= len(retrievals):
        return retrievals

    n_guaranteed = top_k - 1
    n_ce_slots = 1

    # First-stage order is preserved for the guaranteed slots.
    guaranteed = retrievals[:n_guaranteed]
    remaining = retrievals[n_guaranteed:]

    if remaining:
        # The cross-encoder only scores the candidates competing for the last slot.
        model = get_reranker()
        pairs = [[query, r.chunk.text] for r in remaining]
        scores = model.predict(pairs)
        for r, score in zip(remaining, scores):
            r.score = round(float(score), 4)
        remaining.sort(key=lambda r: r.score, reverse=True)

    return guaranteed + remaining[:n_ce_slots]
The CEO chunk (first-stage #1) is always guaranteed. For the employee question, the relevant chunk sits around first-stage rank 3-4, so at top_k=5 it also lands in a guaranteed slot. The CE still adds value by selecting the most relevant candidate for the final slot.
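If you want to see the effect without downloading the model, here's a toy version of the same idea. Everything in it is a stand-in: `Hit` is a hypothetical result type, `fake_ce` is a deliberately crude scorer that penalizes pipe-delimited rows to mimic the failure mode, and the placeholder chunks are invented. Only the table row comes from the example above. It also assumes top_k is at least 1.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical stand-in for a retrieval result."""
    text: str

def fake_ce(query, text):
    """Toy scorer: word overlap, minus a heavy penalty for table-like rows."""
    overlap = len(set(query.lower().split()) & set(text.lower().split()))
    return overlap - 5 * text.count("|")

def rerank_with_slots(query, hits, top_k, score_fn):
    """Keep the first top_k - 1 hits as-is; let score_fn pick the last one."""
    if top_k <= 0:
        return []
    if not hits or top_k >= len(hits):
        return hits
    guaranteed, remaining = hits[: top_k - 1], hits[top_k - 1 :]
    best = max(remaining, key=lambda h: score_fn(query, h.text))
    return guaranteed + [best]

query = "Who is the CEO of Zentara Robotics?"
hits = [  # assumed first-stage order, best first
    Hit("Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287 | Founded: 2018"),
    Hit("placeholder prose chunk two"),
    Hit("placeholder prose chunk three"),
    Hit("placeholder prose that mentions who the ceo reports to"),
]

# Pure rerank: sort everything by the toy scorer and keep the top 3.
pure = sorted(hits, key=lambda h: fake_ce(query, h.text), reverse=True)[:3]
# Guaranteed slots: first two hits are protected, the scorer fills the third.
kept = rerank_with_slots(query, hits, top_k=3, score_fn=fake_ce)

print("table row survives pure rerank:", hits[0] in pure)
print("table row survives guaranteed slots:", hits[0] in kept)
```

The pure rerank drops the table row because the toy scorer dislikes it, while the guaranteed-slot version keeps it in position 1 and still lets the scorer choose the last chunk.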
Result: 19/19 passing.
The pipeline now
PDF ─► extract text ─► chunk ─► embed (MiniLM-L6-v2)
                                       │
                                       ▼
question ─► FAISS + BM25 (2× top_k candidates, RRF fused)
         ─► cross-encoder reranks remaining candidates
         ─► guaranteed first-stage slots + 1 CE-picked slot
         ─► top_k chunks ─► LLM ─► answer + sources
Three retrieval components now: vector search, keyword search, cross-encoder. Each catches something the others miss.
What I learned
Rerankers aren't drop-in improvements. Every RAG tutorial shows "add a cross-encoder, get better results." In practice, cross-encoders trained on natural language passages can actively hurt retrieval quality on structured or tabular content.
Your eval set is your safety net. Without the 19-question eval harness, I would've shipped this and had no idea I'd regressed on 2 questions. The eval caught it in seconds.
Guaranteed slots > score blending. I tried 7 variations on pool size, score blending, and rank fusion. None worked because the CE scored table chunks so low that it dominated every combination. The fix wasn't mathematical. It was structural: protect what's already working, let the CE improve the margins.
The retriever still matters most. v1 → v2 (adding BM25) was the biggest accuracy jump. v2 → v3 (adding the reranker) was a precision refinement that nearly caused regressions. Invest in your first-stage retrieval before reaching for rerankers.
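On the eval point: a harness like this doesn't need to be fancy. Here's a minimal sketch, assuming each test case is a question plus a list of substrings the answer must contain (that matches the Expected/Got output shown earlier, but the matching rule is my assumption here). `run_eval` and `stub_answer` are hypothetical names, and `stub_answer` stands in for the real pipeline call.

```python
def run_eval(cases, answer_fn):
    """Return the cases whose answer is missing any expected substring."""
    failures = []
    for question, expected in cases:
        got = answer_fn(question)
        if not all(e.lower() in got.lower() for e in expected):
            failures.append((question, expected, got))
    return failures

CASES = [
    ("Who is the CEO of Zentara Robotics?", ["Iris Kallas"]),
    ("How many employees does Zentara have?", ["287"]),
]

def stub_answer(question):
    """Stand-in for the real pipeline: answers the first question, misses the second."""
    if "CEO" in question:
        return "The CEO is Iris Kallas."
    return "I couldn't find that in the document."

failures = run_eval(CASES, stub_answer)
for question, expected, got in failures:
    print(f"Question: {question}\nExpected: {expected}\nGot: {got}")
print(f"{len(CASES) - len(failures)}/{len(CASES)} passing")
```

Running something like this after every retrieval change is what turns a silent regression into a visible one.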
What's next
- Streaming responses
- Conversation memory
- Possibly a Streamlit UI
Try it yourself
- v3 (reranker): github.com/santanu2908/chat-with-pdf-rag
- v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag/tree/v2
- v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1
uv sync
cp .env.example .env # set your API key
uv run uvicorn app.main:app --reload
Open http://localhost:8000/docs, upload the sample PDF, and try "Who is the CEO?" — it still works, even with the reranker.
If you've hit similar issues with cross-encoders on structured content, I'd love to hear your approach.
I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.