Everyone talks about the LLM. GPT‑4, Claude, Gemini – that’s the celebrity. But after building my first real RAG pipeline, I learned something humbling: the LLM is the interchangeable part. The retrieval system is the actual worker.
Let me show you what I mean.
## The 4‑Step Pipeline We All Copy
You’ve seen the tutorial code a hundred times:
- Ingest – chunk your documents
- Embed – turn chunks into vectors
- Retrieve – find top‑k similar chunks
- Generate – LLM answers with that context
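Those four steps can be sketched end to end. This is a toy version with stand-in pieces I've chosen for illustration: whitespace chunking, bag-of-words "embeddings", cosine similarity, and a `generate()` stub that just formats the prompt — a real pipeline would swap in an embedding model and an LLM.

```python
import math
from collections import Counter

def chunk(text, size=12):
    # 1. Ingest: fixed-size word chunks (toy stand-in for a real chunker)
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # 2. Embed: bag-of-words counts as a stand-in for an embedding model
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # 3. Retrieve: top-k chunks by similarity to the query
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(query, context):
    # 4. Generate: a real pipeline would call an LLM with this prompt
    return f"Answer '{query}' using only:\n" + "\n".join(context)

doc = ("Refunds are accepted within 30 days for physical items. "
      "Digital products are non-refundable after download. "
      "Contact support for defective digital items.")
chunks = chunk(doc, size=8)
print(generate("digital product refund", retrieve("digital product refund", chunks)))
```

Every production RAG failure I've hit maps back to one of these four stages — usually steps 1–3, not step 4.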
It works. My bot could answer company policy questions with citations. I felt smart.
Then I asked: “Can I get a refund for a digital product?”
The LLM gave a beautiful, confident answer that was completely wrong. My retrieval returned a chunk about physical returns (30 days, original packaging) and entirely missed the digital-product exception sitting two paragraphs away.
The LLM did its job perfectly. The retrieval failed.
## Why Retrieval Is the Real Model
Here’s what I learned the hard way:
| What you think matters | What actually matters |
|---|---|
| Which LLM you use | How you chunk documents |
| Prompt engineering | Embedding quality |
| System prompts | Re‑ranking after retrieval |
The LLM just formats the answer. Retrieval decides whether the answer is true.
## The Code That Fixed My Pipeline
Semantic search alone misses exact phrases like “non‑refundable after download”. Keyword search alone misses meaning. Hybrid search combines both. Here’s the core (using FAISS + BM25):
```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
from rank_bm25 import BM25Okapi

# 1. Load documents and embed
docs = ["Refund within 30 days, physical items only.",
        "Digital products: non-refundable after download.",
        "Contact support for defective digital items."]
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings, dtype='float32'))

# 2. BM25 keyword index (tokenized)
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

# 3. Hybrid search function
def hybrid_search(query, top_k=2, alpha=0.5):
    # Semantic score (L2 distance -> similarity in (0, 1])
    query_vec = np.array(model.encode([query]), dtype='float32')
    distances, indices = index.search(query_vec, top_k)
    semantic_scores = 1 / (1 + distances[0])

    # Keyword score
    query_tokens = query.lower().split()
    bm25_scores = bm25.get_scores(query_tokens)
    top_bm25_idx = np.argsort(bm25_scores)[-top_k:][::-1]
    max_bm25 = max(bm25_scores[top_bm25_idx[0]], 1e-9)  # guard against all-zero scores

    # Combine (BM25 normalized to [0, 1])
    combined = {}
    for i, idx in enumerate(indices[0]):
        combined[idx] = alpha * semantic_scores[i]
    for idx in top_bm25_idx:
        combined[idx] = combined.get(idx, 0) + (1 - alpha) * (bm25_scores[idx] / max_bm25)
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k]

# 4. Test
query = "Can I get my money back for a digital product?"
results = hybrid_search(query)
for idx, score in results:
    print(f"Score: {score:.2f} | {docs[idx]}")
# Output: Score: 0.92 | Digital products: non-refundable after download.
```
That `alpha=0.5` balances meaning and exact wording. Without hybrid search, the digital-product chunk ranked #3 and was ignored; with it, it ranks #1.
## Three Changes That 10x’ed My Pipeline
- Chunk size is not a default – Moved to overlapping chunks (200 tokens with 50 overlap).
- Semantic search alone lies – Added BM25 hybrid search (see code above).
- Re‑ranking changes everything – A small cross‑encoder re‑scored top‑10 chunks, lifting accuracy from 72% to 91%.
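The overlapping-chunk change from point 1 can be sketched as a sliding window. This version counts whitespace tokens as a rough proxy for model tokens, which is an approximation — in practice you'd use your embedding model's tokenizer.

```python
def overlapping_chunks(text, chunk_size=200, overlap=50):
    """Sliding-window chunker: each chunk shares `overlap` tokens with the
    previous one, so a sentence cut by a chunk boundary still appears whole
    in at least one chunk."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

text = " ".join(f"tok{i}" for i in range(500))
chunks = overlapping_chunks(text, chunk_size=200, overlap=50)
print(len(chunks))  # → 3 (windows start at tokens 0, 150, 300)
```

The overlap is what saved my refund-policy bot: the digital-product exception that used to be orphaned at a chunk boundary now appears intact in two adjacent chunks.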
## The Mistake Most People Make
We treat RAG as an LLM problem. So we tweak prompts, swap models, add system instructions.
But the LLM is forced to use whatever context you give it. If you feed it the wrong chunk, it will hallucinate confidently. If you feed it the right chunk, even a small model answers correctly.
The bottleneck is almost never the LLM. It’s the retriever.
## What I Do Differently Now
Before I write a single line of agent code, I ask three questions:
- “If I searched my vector database by hand, would I find the exact sentence that answers this?”
- “Does my retrieval work for synonyms AND exact keywords?” → if no, hybrid search.
- “Is the top‑1 retrieved chunk actually the best?” → if no, add a re‑ranker.
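Adding a re-ranker boils down to a second, stronger scoring pass over the retriever's top candidates. The sketch below uses a token-overlap stub as the scorer so it stays self-contained; in production, `score_fn` would be a cross-encoder that reads query and chunk together. The function name and scoring stub here are illustrative, not from my original pipeline.

```python
import re

def _tokens(s):
    return set(re.findall(r"\w+", s.lower()))

def rerank(query, candidates, top_n=3, score_fn=None):
    """Re-score retrieved chunks with a stronger scorer and keep the best.
    score_fn(query, chunk) -> float; defaults to a token-overlap stub
    standing in for a real cross-encoder."""
    if score_fn is None:
        def score_fn(q, c):
            q_toks = _tokens(q)
            return len(q_toks & _tokens(c)) / len(q_toks) if q_toks else 0.0
    scored = [(c, score_fn(query, c)) for c in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

candidates = [
    "Refund within 30 days, physical items only.",
    "Digital products: non-refundable after download.",
    "Contact support for defective digital items.",
]
for chunk, score in rerank("refund for digital products", candidates, top_n=2):
    print(f"{score:.2f} | {chunk}")
```

The pattern matters more than the scorer: retrieve generously (top-10), then let the expensive model pick the few chunks that actually enter the context window.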
## The Bottom Line
The AI industry sells you on the model. But in production RAG systems, the model is the cheapest, most replaceable component. The hard part – the part that separates working bots from demoware – is getting the right information into the context window.
The LLM is the pen. Retrieval is the memory. And memory is what makes a system useful.
So next time your RAG bot fails, don’t blame GPT. Look at what you retrieved. I promise that’s where the real problem lives.