In my last post, I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS. It scored 17/19 on my test set. But two questions failed:
- "Who is the CEO?" — couldn't find it
- "How many employees does Zentara have?" — couldn't find it
Both answers were right there on page 1. So what went wrong, and how did I fix it?
Why pure vector search failed
The problem was a dense "Company snapshot" table on page 1: CEO, CTO, HQ, employee count, and revenue, all packed into one chunk. The embedding for that chunk became a muddy average of 8+ topics, so when I asked "Who is the CEO?", the chunk didn't score high enough against the query to make the top-k.
This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search would find it instantly. But vector search relies on semantic similarity, and a short query doesn't produce a strong enough match against a chunk that's mostly about other things.
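To make the "muddy average" intuition concrete, here's a toy sketch. It assumes each topic is a one-hot vector and that a multi-topic chunk embeds as the plain mean of its topics. Real embedding models don't work exactly like this, so treat it as an illustration of the dilution effect, not a model of the actual embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def one_hot(i: int, dim: int) -> list[float]:
    return [1.0 if j == i else 0.0 for j in range(dim)]


NUM_TOPICS = 8  # toy stand-in for CEO, CTO, HQ, employee count, revenue, ...
topic_vectors = [one_hot(i, NUM_TOPICS) for i in range(NUM_TOPICS)]

# A chunk covering all 8 topics, modelled (as an assumption) as their mean.
dense_chunk = [sum(col) / NUM_TOPICS for col in zip(*topic_vectors)]
# A chunk about a single topic, and a query about that same topic.
focused_chunk = topic_vectors[0]
query = topic_vectors[0]

dense_similarity = cosine(query, dense_chunk)
focused_similarity = cosine(query, focused_chunk)
print(f"dense chunk:   {dense_similarity:.3f}")
print(f"focused chunk: {focused_similarity:.3f}")
```

In this toy setup the dense chunk's similarity drops to 1/sqrt(8), roughly 0.354, while a focused chunk scores 1.0. Any other chunk that is even moderately on-topic can outrank the dense one.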
The fix: hybrid retrieval
The solution is to run two searches over the same chunks and combine the results:
- FAISS (dense) — semantic similarity, good at "What's the charging time?" style questions
- BM25 (sparse) — keyword matching, good at "Who is the CEO?" style questions
Then merge them using Reciprocal Rank Fusion (RRF) — a standard algorithm that combines ranked lists from different sources.
question ─► embed ─► FAISS search ──┐
                                    ├─► RRF fusion ─► top-k chunks ─► LLM ─► answer
question ─► tokenize ─► BM25 search ┘
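To show what the sparse branch contributes, here's a minimal pure-Python BM25 sketch. It is not the `rank_bm25` implementation: it uses the non-negative IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))`, and the chunks, names, and numbers are invented placeholders, not the contents of the sample PDF.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every doc against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = tokenize(query)

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Invented placeholder chunks for illustration only.
chunks = [
    "Company snapshot for the robotics firm: CEO Alice Example, CTO Bob Example, "
    "HQ Exampleville, employees 100",
    "Charging the battery takes two hours with a fast charger",
    "The warranty covers the robot arm for three years",
]
scores = bm25_scores("Who is the CEO?", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

"the" appears in every chunk, so its IDF is tiny. "ceo" appears in only one, so it dominates the score and the snapshot chunk ranks first even though most of that chunk is about other things.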
How RRF works
RRF is simple. For each chunk that appears in either ranked list, compute:
rrf_score = 1/(k + rank_in_faiss) + 1/(k + rank_in_bm25)
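Here k = 60, the constant commonly used for RRF, and ranks start at 1. A chunk that is missing from one list simply contributes nothing for that list, so a chunk that ranks well in both searches scores higher than one that ranks #1 in only one.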
Example: chunk 5 is ranked #1 by BM25, #4 by FAISS:
From FAISS: 1/(60 + 4) = 0.0156
From BM25: 1/(60 + 1) = 0.0164
RRF score: 0.0320 ← beats a FAISS-only #1 (0.0164)
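The arithmetic above can be checked with a few lines of Python. This is a standalone sketch, separate from the project code. The function name `rrf_fuse` and every chunk id other than 5 are made up for the example.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """Sum 1 / (k + rank) over every ranked list a chunk appears in (ranks start at 1)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


# Chunk 5 is #4 in the FAISS list and #1 in the BM25 list; other ids are arbitrary.
faiss_ranking = [2, 7, 9, 5]
bm25_ranking = [5, 3, 8]

fused = rrf_fuse([faiss_ranking, bm25_ranking])
top = sorted(fused, key=fused.get, reverse=True)

print(f"chunk 5: {fused[5]:.4f}")  # appears in both lists
print(f"chunk 2: {fused[2]:.4f}")  # FAISS-only #1
print("fused order starts with:", top[0])
```

Chunk 5 comes out on top with about 0.0320, ahead of chunk 2, which was #1 in FAISS but absent from BM25 and so only gets 0.0164.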
The implementation
Only 3 files changed. Here's the core — the updated store.py:
import re

import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# EMBED_DIM and Retrieval come from the rest of the module (not shown in this excerpt).
RRF_K = 60  # RRF constant


def _tokenize(text: str) -> list[str]:
    # Lowercase, then keep runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())


class VectorStore:
    def __init__(self):
        self.index = faiss.IndexFlatIP(EMBED_DIM)
        self.chunks = []
        self.bm25 = None

    def add(self, vectors, chunks):
        self.index.add(vectors)
        self.chunks.extend(chunks)
        # Rebuild BM25 over all chunks so its positions line up with the FAISS ids
        tokenized = [_tokenize(c.text) for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query_vector, top_k=3, query_text=""):
        if self.index.ntotal == 0:
            return []  # nothing ingested yet, so self.bm25 is still None
        # Over-fetch from each retriever so fusion has candidates to promote
        top_k_fetch = min(top_k * 3, self.index.ntotal)

        # Dense search
        _, faiss_indices = self.index.search(query_vector.reshape(1, -1), top_k_fetch)
        faiss_ranking = [int(i) for i in faiss_indices[0] if i != -1]

        # Sparse search; drop zero-score chunks so non-matches don't pad the ranking
        bm25_scores = self.bm25.get_scores(_tokenize(query_text))
        bm25_order = np.argsort(bm25_scores)[::-1][:top_k_fetch]
        bm25_ranking = [int(i) for i in bm25_order if bm25_scores[i] > 0]

        # Reciprocal Rank Fusion (enumerate is 0-based, hence the +1)
        rrf_scores = {}
        for rank, idx in enumerate(faiss_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)
        for rank, idx in enumerate(bm25_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)

        sorted_indices = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
        return [Retrieval(chunk=self.chunks[i], score=rrf_scores[i]) for i in sorted_indices]
The only change in main.py — one extra parameter:
# Before (v1)
retrieved = store.search(query_vec, top_k=req.top_k)
# After (v2)
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)
That's it. No changes to chunking, embedding, PDF extraction, or LLM logic.
Results: before and after
| Question | v1 (FAISS only) | v2 (hybrid) |
|---|---|---|
| Who is the CEO of Zentara Robotics? | Failed | Correct |
| How many employees does Zentara have? | Failed | Correct (top_k=5) |
| All other 17 questions | Correct | Correct |
The CEO question now works at default top_k=3 — BM25 matches "CEO" directly and RRF promotes it.
The employee count question works at top_k=5. The chunk still ranks lower because it's packed with many facts, but hybrid retrieval brings it within reach. A reranker (cross-encoder) would likely fix this at top_k=3 — that's next on the list.
What I learned
Pure vector search has a keyword blind spot. If a term appears once in a dense chunk, semantic similarity alone won't reliably surface it. BM25 catches these instantly.
RRF is elegant. No score normalization needed, no tuning of weights between the two retrievers. Just ranks and a constant. It works out of the box.
The retriever matters more than the LLM. Both failures in v1 were retrieval failures, not LLM failures. The LLM never even saw the right chunk. Improving retrieval quality is where RAG gets better — not by switching to a fancier model.
Hybrid didn't fully solve dense chunks. The employee count still needs top_k=5. The real fix is either better chunking (split dense tables into smaller pieces) or a reranker that can re-score candidates more precisely.
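As a rough idea of what "split dense tables into smaller pieces" could look like, here's a minimal sketch. It assumes the table has already been parsed into (field, value) rows, which the current pipeline does not do. The helper name `split_table_chunk` and the row values are placeholders, not data from the real PDF.

```python
def split_table_chunk(title: str, rows: list[tuple[str, str]]) -> list[str]:
    """Turn one dense table into one small chunk per row, each carrying the table title."""
    return [f"{title}: {field} is {value}" for field, value in rows]


# Placeholder rows for illustration only.
snapshot_rows = [
    ("CEO", "Alice Example"),
    ("CTO", "Bob Example"),
    ("Employees", "100"),
]

small_chunks = split_table_chunk("Company snapshot", snapshot_rows)
for chunk in small_chunks:
    print(chunk)
```

Each row becomes its own chunk with a single fact in it, so its embedding is no longer an average over the whole table. The title is repeated in every chunk so the context isn't lost.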
What's next
- Reranker (cross-encoder) — re-score the top-k for better precision
- Evaluation harness — automate the 19-question test set instead of testing manually
- Streaming — better UX for longer answers
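For the evaluation harness, a first version could be as small as the sketch below. Everything here is an assumption: `ask` is any callable that takes a question and returns the answer text (in practice it would call the API), pass/fail is a crude case-insensitive substring check, and the canned answer and expected values are placeholders.

```python
from typing import Callable


def run_eval(ask: Callable[[str], str], cases: list[tuple[str, str]]) -> tuple[int, list[str]]:
    """Return (number passed, list of failed questions) for (question, expected substring) cases."""
    failures = []
    for question, expected in cases:
        answer = ask(question)
        if expected.lower() not in answer.lower():
            failures.append(question)
    return len(cases) - len(failures), failures


def fake_ask(question: str) -> str:
    # Stub standing in for the real pipeline; answers are placeholders.
    canned = {"Who is the CEO?": "The CEO is Alice Example."}
    return canned.get(question, "I couldn't find that in the document.")


cases = [
    ("Who is the CEO?", "Alice Example"),
    ("How many employees does Zentara have?", "100"),
]
passed, failures = run_eval(fake_ask, cases)
print(f"{passed}/{len(cases)} passed")
for question in failures:
    print("FAILED:", question)
```

Substring matching will miss paraphrased answers, so it's only a starting point, but it's enough to catch regressions like the two retrieval failures from v1.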
Try it yourself
- v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag
- v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1
uv sync
cp .env.example .env # set your API key
uv run uvicorn app.main:app --reload
Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and try "Who is the CEO?" — it works now.
If you've implemented hybrid retrieval or have experience with rerankers, I'd love to hear what worked for you.
I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.