Everyone talks about the LLM. GPT‑4, Claude, Gemini – that’s the celebrity. But after building my first real RAG pipeline, I learned something humbling: the LLM is the interchangeable part. The retrieval system is the actual worker.
Let me show you what I mean.
## The 4‑Step Pipeline We All Copy
You’ve seen the tutorial code a hundred times:
- Ingest – chunk your documents
- Embed – turn chunks into vectors
- Retrieve – find top‑k similar chunks
- Generate – LLM answers with that context
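Those four steps can be sketched end to end. This is a toy version with stand-in pieces I've chosen for illustration: whitespace chunking, bag-of-words "embeddings", cosine similarity, and a `generate()` stub that just formats the prompt — a real pipeline would swap in an embedding model and an LLM.

```python
import math
from collections import Counter

def chunk(text, size=12):
    # 1. Ingest: fixed-size word chunks (toy stand-in for a real chunker)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # 2. Embed: bag-of-words counts as a stand-in for an embedding model
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # 3. Retrieve: top-k chunks by similarity to the query
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(query, context):
    # 4. Generate: a real pipeline would call an LLM with this prompt
    return f"Answer '{query}' using only:\n" + "\n".join(context)

doc = ("Refunds are accepted within 30 days for physical items. "
      "Digital products are non-refundable after download. "
      "Contact support for defective digital items.")
chunks = chunk(doc, size=8)
print(generate("digital product refund", retrieve("digital product refund", chunks)))
```

Every production RAG failure I've hit maps back to one of these four stages — usually steps 1–3, not step 4.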
It works. My bot could answer company policy questions with citations. I felt smart.
Then I asked: “Can I get a refund for a digital product?”
The LLM gave a beautiful, confident answer that was completely wrong. My retrieval returned a chunk about physical returns (30 days, original packaging) and entirely missed the digital-product exception sitting two paragraphs away.
The LLM did its job perfectly. The retrieval failed.
## Why Retrieval Is the Real Model
Here’s what I learned the hard way:
| What you think matters | What actually matters |
|---|---|
| Which LLM you use | How you chunk documents |
| Prompt engineering | Embedding quality |
| System prompts | Re‑ranking after retrieval |
The LLM just formats the answer. Retrieval decides whether the answer is true.
## The Code That Fixed My Pipeline
Semantic search alone misses exact phrases like “non‑refundable after download”. Keyword search alone misses meaning. Hybrid search combines both. Here’s the core (using FAISS + BM25):
```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
from rank_bm25 import BM25Okapi

# 1. Load documents and embed
docs = ["Refund within 30 days, physical items only.",
        "Digital products: non-refundable after download.",
        "Contact support for defective digital items."]
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings, dtype='float32'))

# 2. BM25 keyword index (tokenized)
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

# 3. Hybrid search function
def hybrid_search(query, top_k=2, alpha=0.5):
    # Semantic score (L2 distance -> similarity in (0, 1])
    query_vec = np.array(model.encode([query]), dtype='float32')
    distances, indices = index.search(query_vec, top_k)
    semantic_scores = 1 / (1 + distances[0])

    # Keyword score
    query_tokens = query.lower().split()
    bm25_scores = bm25.get_scores(query_tokens)
    top_bm25_idx = np.argsort(bm25_scores)[-top_k:][::-1]
    max_bm25 = max(bm25_scores[top_bm25_idx[0]], 1e-9)  # guard against all-zero scores

    # Combine (BM25 normalized to [0, 1])
    combined = {}
    for i, idx in enumerate(indices[0]):
        combined[idx] = alpha * semantic_scores[i]
    for idx in top_bm25_idx:
        combined[idx] = combined.get(idx, 0) + (1 - alpha) * (bm25_scores[idx] / max_bm25)
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k]

# 4. Test
query = "Can I get my money back for a digital product?"
results = hybrid_search(query)
for idx, score in results:
    print(f"Score: {score:.2f} | {docs[idx]}")
# Output: Score: 0.92 | Digital products: non-refundable after download.
```
That `alpha=0.5` balances meaning and exact wording. Without hybrid search, the digital-product chunk ranked #3 and was ignored; with it, it ranks #1.
## Three Changes That 10x’ed My Pipeline
- Chunk size is not a default – Moved to overlapping chunks (200 tokens with 50 overlap).
- Semantic search alone lies – Added BM25 hybrid search (see code above).
- Re‑ranking changes everything – A small cross‑encoder re‑scored top‑10 chunks, lifting accuracy from 72% to 91%.
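The overlapping-chunk change from point 1 can be sketched as a sliding window. This version counts whitespace tokens as a rough proxy for model tokens, which is an approximation — in practice you'd use your embedding model's tokenizer.

```python
def overlapping_chunks(text, chunk_size=200, overlap=50):
    """Sliding-window chunker: each chunk shares `overlap` tokens with the
    previous one, so a sentence cut by a chunk boundary still appears whole
    in at least one chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

text = " ".join(f"tok{i}" for i in range(500))
chunks = overlapping_chunks(text, chunk_size=200, overlap=50)
print(len(chunks))  # → 3 (windows start at tokens 0, 150, 300)
```

The overlap is what saved my refund-policy bot: the digital-product exception that used to be orphaned at a chunk boundary now appears intact in two adjacent chunks.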
## The Mistake Most People Make
We treat RAG as an LLM problem. So we tweak prompts, swap models, add system instructions.
But the LLM is forced to use whatever context you give it. If you feed it the wrong chunk, it will hallucinate confidently. If you feed it the right chunk, even a small model answers correctly.
The bottleneck is almost never the LLM. It’s the retriever.
## What I Do Differently Now
Before I write a single line of agent code, I ask three questions:
- “If I searched my vector database by hand, would I find the exact sentence that answers this?”
- “Does my retrieval work for synonyms AND exact keywords?” → if no, hybrid search.
- “Is the top‑1 retrieved chunk actually the best?” → if no, add a re‑ranker.
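Adding a re-ranker boils down to a second, stronger scoring pass over the retriever's top candidates. The sketch below uses a token-overlap stub as the scorer so it stays self-contained; in production, `score_fn` would be a cross-encoder that reads query and chunk together. The function name and scoring stub here are illustrative, not from my original pipeline.

```python
import re

def _tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def rerank(query, candidates, top_n=3, score_fn=None):
    """Re-score retrieved chunks with a stronger scorer and keep the best.
    score_fn(query, chunk) -> float; defaults to a token-overlap stub
    standing in for a real cross-encoder."""
    if score_fn is None:
        def score_fn(q, c):
            q_toks = _tokens(q)
            return len(q_toks & _tokens(c)) / len(q_toks) if q_toks else 0.0
    scored = [(c, score_fn(query, c)) for c in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

candidates = [
    "Refund within 30 days, physical items only.",
    "Digital products: non-refundable after download.",
    "Contact support for defective digital items.",
]
for chunk, score in rerank("refund for digital products", candidates, top_n=2):
    print(f"{score:.2f} | {chunk}")
```

The pattern matters more than the scorer: retrieve generously (top-10), then let the expensive model pick the few chunks that actually enter the context window.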
## The Bottom Line
The AI industry sells you on the model. But in production RAG systems, the model is the cheapest, most replaceable component. The hard part – the part that separates working bots from demoware – is getting the right information into the context window.
The LLM is the pen. Retrieval is the memory. And memory is what makes a system useful.
So next time your RAG bot fails, don’t blame GPT. Look at what you retrieved. I promise that’s where the real problem lives.