Santanu Mohanta

Posted on Jun 3

My RAG pipeline couldn't find the CEO — here's how I fixed it with hybrid retrieval

#rag #python #ai #fastapi

In my last post, I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS. It scored 17/19 on my test set. But two questions failed:

"Who is the CEO?" — couldn't find it
"How many employees does Zentara have?" — couldn't find it

Both answers were right there on page 1. So what went wrong, and how did I fix it?

Why pure vector search failed

The problem was a dense "Company snapshot" table on page 1: CEO, CTO, HQ, employee count, and revenue, all packed into one chunk. The embedding for that chunk became a muddy average of 8+ topics, so it never scored as a strong match for any single question, including "Who is the CEO?".

This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search would find it instantly. But vector search relies on semantic similarity, and a short query doesn't produce a strong enough match against a chunk that's mostly about other things.
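A toy model makes the "muddy average" effect concrete. The sketch below assumes, purely for illustration, that every topic is its own orthogonal unit vector and that a chunk's embedding is the plain mean of its topic vectors. Real embedding models are not this simple, and the helper names (`topic_vector`, `mean_vector`, `cosine`) are made up for this example.

```python
import math


def topic_vector(i: int, dim: int = 8) -> list[float]:
    """Unit vector standing in for one topic (toy assumption: topics are orthogonal)."""
    return [1.0 if j == i else 0.0 for j in range(dim)]


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean, standing in for a chunk that mixes several topics."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = topic_vector(0)                                         # stands in for "Who is the CEO?"
focused_chunk = topic_vector(0)                                 # a chunk about one topic only
dense_chunk = mean_vector([topic_vector(i) for i in range(8)])  # 8 topics in one chunk

focused_sim = cosine(query, focused_chunk)
dense_sim = cosine(query, dense_chunk)
print(f"focused: {focused_sim:.3f}, dense: {dense_sim:.3f}")  # focused: 1.000, dense: 0.354
```

In this toy setup the 8-topic chunk keeps only about a third of the similarity a focused chunk would get, which is enough for other chunks to outrank it.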

The fix: hybrid retrieval

The solution is to run two independent searches over the same chunks and combine the results:

FAISS (dense) — semantic similarity, good at "What's the charging time?" style questions
BM25 (sparse) — keyword matching, good at "Who is the CEO?" style questions

Then merge them using Reciprocal Rank Fusion (RRF) — a standard algorithm that combines ranked lists from different sources.

question ─► embed ─► FAISS search ──┐
                                    ├─► RRF fusion ─► top-k chunks ─► LLM ─► answer
question ─► tokenize ─► BM25 search ┘
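To show why the sparse branch finds a term that appears only once, here is a minimal hand-rolled BM25 scorer. It is a sketch, not the rank_bm25 implementation: it uses the common non-negative idf variant, default-style parameters `k1=1.5` and `b=0.75` chosen for illustration, and three invented sample chunks that are not from the actual PDF.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every doc against the query with a textbook Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of docs containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no keyword overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented sample chunks, for illustration only
sample_chunks = [
    "Company snapshot: CEO, CTO, HQ, employees, revenue",
    "Charging time: two hours using fast charger",
    "Warranty: parts and labour covered for two years",
]
scores = bm25_scores("Who is the CEO?", sample_chunks)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Only the first chunk contains the token "ceo", so it is the only one with a non-zero score, no matter how many other facts share the chunk.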

How RRF works

RRF is simple. For each chunk that appears in either ranked list, compute:

rrf_score = 1/(k + rank_in_faiss) + 1/(k + rank_in_bm25)

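Here k = 60, the constant commonly used for RRF, and ranks start at 1. A chunk that appears in only one list gets just that list's term. As a result, a chunk that ranks well in both searches scores higher than one that ranks #1 in only one.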

Example: chunk 5 is ranked #1 by BM25, #4 by FAISS:

From FAISS:  1/(60 + 4) = 0.0156
From BM25:   1/(60 + 1) = 0.0164
RRF score:                0.0320  ← beats a FAISS-only #1 (0.0164)
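The arithmetic above can be reproduced with a few lines of standalone Python. This is a minimal sketch of the fusion step only; the function name `rrf_fuse` and every chunk id other than 5 are invented for the example.

```python
def rrf_fuse(rankings: list[list[int]], k: int = 60) -> dict[int, float]:
    """Sum 1 / (k + rank) over every ranked list a chunk appears in (ranks start at 1)."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return scores


faiss_ranking = [9, 2, 7, 5]  # chunk 5 is #4 in the dense results
bm25_ranking = [5, 3]         # chunk 5 is #1 in the sparse results

fused = rrf_fuse([faiss_ranking, bm25_ranking])
top = sorted(fused, key=fused.get, reverse=True)

print(round(fused[5], 4))  # 0.032  (1/64 + 1/61)
print(round(fused[9], 4))  # 0.0164 (FAISS-only #1)
print(top[0])              # 5
```

Because only ranks enter the formula, the raw FAISS similarities and BM25 scores never need to be put on a common scale.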

The implementation

Only 3 files changed. Here's the core — the updated store.py:

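import re

import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# EMBED_DIM and Retrieval are not shown here; they are assumed to be defined as in v1.
RRF_K = 60  # RRF constant


def _tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric runs, so "CEO?" and "ceo" give the same token
    return re.findall(r"[a-z0-9]+", text.lower())


class VectorStore:
    def __init__(self):
        self.index = faiss.IndexFlatIP(EMBED_DIM)
        self.chunks = []
        self.bm25 = None

    def add(self, vectors, chunks):
        self.index.add(vectors)
        self.chunks.extend(chunks)
        # Rebuild BM25 over all chunks so its positions line up with the FAISS ids
        tokenized = [_tokenize(c.text) for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query_vector, top_k=3, query_text=""):
        if self.index.ntotal == 0:
            return []  # nothing indexed yet
        top_k_fetch = min(top_k * 3, self.index.ntotal)

        # Dense search
        _, faiss_indices = self.index.search(query_vector.reshape(1, -1), top_k_fetch)
        faiss_ranking = [int(i) for i in faiss_indices[0] if i != -1]

        # Sparse search. Skipped when there is no query text (FAISS-only fallback).
        # Chunks with a BM25 score of 0 have no keyword overlap, so they are dropped
        # instead of earning RRF credit from an arbitrary argsort position.
        bm25_ranking = []
        query_tokens = _tokenize(query_text)
        if self.bm25 is not None and query_tokens:
            bm25_scores = self.bm25.get_scores(query_tokens)
            order = np.argsort(bm25_scores)[::-1][:top_k_fetch]
            bm25_ranking = [int(i) for i in order if bm25_scores[i] > 0]

        # Reciprocal Rank Fusion (enumerate is 0-based, hence rank + 1)
        rrf_scores = {}
        for rank, idx in enumerate(faiss_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)
        for rank, idx in enumerate(bm25_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)

        sorted_indices = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
        return [Retrieval(chunk=self.chunks[i], score=rrf_scores[i]) for i in sorted_indices]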

The only change in main.py — one extra parameter:

# Before (v1)
retrieved = store.search(query_vec, top_k=req.top_k)

# After (v2)
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)

That's it. No changes to chunking, embedding, PDF extraction, or LLM logic.

Results: before and after

Question	v1 (FAISS only)	v2 (hybrid)
Who is the CEO of Zentara Robotics?	Failed	Correct
How many employees does Zentara have?	Failed	Correct (top_k=5)
All other 17 questions	Correct	Correct

The CEO question now works at default top_k=3 — BM25 matches "CEO" directly and RRF promotes it.

The employee count question works at top_k=5. The chunk still ranks lower because it's packed with many facts, but hybrid retrieval brings it within reach. A reranker (cross-encoder) would likely fix this at top_k=3 — that's next on the list.

What I learned

Pure vector search has a keyword blindspot. If a term appears once in a dense chunk, semantic similarity alone won't reliably surface it. BM25 catches these instantly.
RRF is elegant. No score normalization needed, no tuning of weights between the two retrievers. Just ranks and a constant. It works out of the box.
The retriever matters more than the LLM. Both failures in v1 were retrieval failures, not LLM failures: the LLM never even saw the right chunk. When that happens, improving retrieval is what fixes the answer, not switching to a fancier model.
Hybrid didn't fully solve dense chunks. The employee count still needs top_k=5. The real fix is either better chunking (split dense tables into smaller pieces) or a reranker that can re-score candidates more precisely.
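The "split dense tables into smaller pieces" idea could look roughly like the sketch below. It assumes the table has already been parsed into (label, value) rows, which is not how the current pipeline works; the function name `split_table` is hypothetical and the values are placeholders, not data from the PDF.

```python
def split_table(title: str, rows: list[tuple[str, str]]) -> list[str]:
    """Emit one small chunk per table row, prefixed with the table title for context."""
    return [f"{title}. {label}: {value}" for label, value in rows]


# Placeholder rows; real values would come from the parsed table
rows = [("CEO", "<name>"), ("CTO", "<name>"), ("Employees", "<count>")]
chunks = split_table("Company snapshot", rows)

for chunk in chunks:
    print(chunk)
# Company snapshot. CEO: <name>
# Company snapshot. CTO: <name>
# Company snapshot. Employees: <count>
```

Each chunk now covers a single fact, so its embedding is no longer an average over unrelated topics, while the title prefix keeps enough context to stay readable on its own.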

What's next

Reranker (cross-encoder) — re-score the top-k for better precision
Evaluation harness — automate the 19-question test set instead of testing manually
Streaming — better UX for longer answers
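As a starting point for the evaluation harness, a retrieval-only check might look like the sketch below. Everything here is an assumption: the `hit_rate` function, the substring-based notion of a "hit", and the stub retriever with its fixed ranking are invented for illustration and do not use the real store or test set.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]


def hit_rate(cases: list[tuple[str, str]], retrieve: Retriever, top_k: int) -> float:
    """Fraction of cases where at least one retrieved chunk contains the expected text."""
    hits = 0
    for question, expected in cases:
        chunks = retrieve(question, top_k)
        if any(expected.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases)


# Stub retriever with a fixed, made-up ranking; the relevant chunk sits at position 5
RANKED = [
    "warranty terms",
    "charging time",
    "pricing",
    "shipping",
    "company snapshot: employees <count>",
]


def stub_retrieve(question: str, top_k: int) -> list[str]:
    return RANKED[:top_k]


cases = [("How many employees does Zentara have?", "employees")]
rate_at_3 = hit_rate(cases, stub_retrieve, top_k=3)
rate_at_5 = hit_rate(cases, stub_retrieve, top_k=5)
print(rate_at_3, rate_at_5)  # 0.0 1.0
```

Running the same cases at several top_k values separates "the retriever never finds it" from "the retriever finds it, just not in the first three", which is the situation with the employee count question.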

Try it yourself

v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and try "Who is the CEO?" — it works now.

If you've implemented hybrid retrieval or have experience with rerankers, I'd love to hear what worked for you.

I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.

Top comments (1)

Tae Kim • Jun 4

Hit this same wall on the Graph RAG side. Pulled the work back to ingest: extract (Company, role, Person) tuples at parse time and use Splink to collapse 'Zentara' and 'Zentara Inc.' into one canonical node, so 'Who is the CEO?' becomes a graph edge lookup instead of a retrieval gamble. BM25+RRF is the right call when ingest is locked, but if you control parsing the dense-table problem disappears entirely.