How to make LLMs feel "grounded" in your data, without turning your app into a prompt factory.
Large Language Models are incredible at language, but they still have two awkward traits in production:
- They don't know your private data by default (docs, tickets, code, policies).
- They can sound confident even when they're guessing.
Retrieval-Augmented Generation (RAG) is the most reliable pattern I've used to fix both, by giving the model just-in-time access to relevant context at the moment it answers.
This post is a practical, medium-depth tour of RAG: the core architecture, the failure modes, and the "advanced knobs" that actually move quality (reranking, routing, query strategies, and better indexing). I'll also point you to a great open-source reference implementation that I've been using as a sanity check.
The Core Idea: Don't Train, Retrieve
Think of RAG as two systems working together:
- Retriever: finds the best supporting context for a question.
- Generator (LLM): writes the final answer using the retrieved context.
Instead of trying to cram your entire knowledge base into model weights, you keep your knowledge in stores that are good at search (vector DBs, relational DBs, graph DBs), retrieve the best bits, and then let the LLM do what it does best: compose a response.
A good mental model:
RAG = Search + Reasoning
Search brings facts. Reasoning provides coherence.
A Clean RAG Architecture (What Actually Matters)
Most RAG diagrams look complex because they include every optional component. Here's a simple backbone that scales:
- Ingest documents (PDFs, web pages, internal wikis, tickets)
- Chunk them into retrievable units
- Embed chunks into vectors
- Index vectors in a vector store
- Retrieve top-$k$ chunks for a question
- Generate an answer with citations / grounded context
In code, the minimal version feels like:
question -> embed(question) -> similarity_search -> context -> LLM(prompt + context)
If you only build that, you'll get something working quickly, but you'll also quickly hit the real-world issues:
- Retrieval returns "nearby" chunks that don't actually answer the question
- The best chunk is buried at rank 17
- A single query phrasing misses the right terminology
- Some questions should query SQL or a graph, not embeddings
That's where the next layers matter.
Retrieval Isn't Only Vectors: Pick the Right Store
A mature RAG system doesn't have to be "vector-only". Depending on the question, retrieval can come from:
- Vector stores: semantic search over unstructured text (docs, emails, transcripts)
- Relational DBs: exact structured facts (orders, users, pricing, logs)
- Graph DBs: relationships and traversals (org charts, dependency graphs, knowledge graphs)
In practice, you often end up with a hybrid:
| Data type | Best retrieval style | Example question |
|---|---|---|
| Policies / long docs | Vector search | "What's our parental leave policy?" |
| Metrics / records | SQL | "What was churn last quarter in EMEA?" |
| Relationships | Cypher / graph queries | "Who owns service X and what depends on it?" |
This is why modern RAG stacks include things like Text-to-SQL, Text-to-Cypher, and self-query retrievers (where the model generates a structured search query and metadata filters).
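To make "structured search query" concrete, here's a minimal self-query-style sketch: the LLM turns a natural-language question into a search string plus metadata filters. This assumes an OpenAI chat model, and the DocSearch fields (department, year) are hypothetical filters, not part of any real schema.
from typing import Optional
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class DocSearch(BaseModel):
    """A structured search request the retriever can execute."""
    query: str = Field(description="semantic search string")
    department: Optional[str] = Field(default=None, description="metadata filter, e.g. 'HR'")
    year: Optional[int] = Field(default=None, description="metadata filter on document year")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
query_builder = llm.with_structured_output(DocSearch)

structured = query_builder.invoke(
    "What did the 2023 HR handbook say about parental leave?"
)
# e.g. DocSearch(query='parental leave policy', department='HR', year=2023)  (illustrative output)
The structured object can then be translated into a SQL WHERE clause, a metadata filter on the vector store, or a graph query.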
Routing: The "Secret Sauce" for Multi-Source RAG
If you only have one data source, retrieval is straightforward. But the moment you add a relational database, a vector store, and maybe a graph, your first big design decision becomes:
How do I route a user's question to the right retriever?
Two patterns show up repeatedly:
1) Logical routing
Simple rules or a lightweight classifier.
- "If the question mentions revenue, query SQL."
- "If the question mentions 'policy', use the handbook index."
2) Semantic routing
Use embeddings (or a small LLM prompt) to decide which tool to call.
This reduces "tool spam" and usually improves relevance because you retrieve from the right store first.
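Here's a minimal semantic-routing sketch: embed a one-line description of each store and send the question to whichever description it lands closest to. It assumes OpenAI embeddings, and the route names and descriptions are illustrative.
import numpy as np
from langchain_openai import OpenAIEmbeddings

# One-line description per retriever; the router picks the closest match.
ROUTES = {
    "sql": "Metrics, revenue, churn, orders, and other structured records.",
    "vectorstore": "Policies, handbooks, documentation, and other long-form text.",
    "graph": "Ownership, dependencies, and relationships between services or people.",
}

emb = OpenAIEmbeddings(model="text-embedding-3-small")
route_vecs = {name: np.array(emb.embed_query(desc)) for name, desc in ROUTES.items()}

def route_question(question: str) -> str:
    """Return the name of the route whose description is closest to the question."""
    q = np.array(emb.embed_query(question))
    def cosine(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(route_vecs, key=lambda name: cosine(route_vecs[name]))

route_question("What was churn last quarter in EMEA?")  # likely "sql"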
Query Strategies That Increase Recall (Without Overfetching)
Most weak RAG answers are not generation problems; they're retrieval problems.
A single user question is often ambiguous. Strong pipelines expand the query space before retrieving.
Here are query strategies I've seen consistently help:
Multi-query
Generate multiple paraphrases of the question and retrieve for each (see the sketch below).
Why it works: different phrasing hits different vocabulary.
Step-back questions
Ask a higher-level sub-question first ("What concept is this about?"), then use that to retrieve.
Why it works: reduces lexical mismatch and anchors retrieval.
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer document, embed that, and retrieve based on it (see the sketch below).
Why it works: the hypothetical answer contains domain language the user may not use.
RAG-Fusion
Retrieve multiple lists (from multi-query, HyDE, etc.) and then fuse rankings (often using Reciprocal Rank Fusion).
Why it works: you get strong recall without blindly increasing $k$.
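Multi-query and HyDE are easy to sketch in a few lines (fusion gets real code in the walkthrough below). This is a rough sketch, assuming an llm and a retriever like the ones defined in that walkthrough:
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Multi-query: ask the LLM for paraphrases, retrieve once per variant.
expand_prompt = ChatPromptTemplate.from_template(
    "Rewrite the question below in 4 different ways, one per line.\n\nQuestion: {question}"
)
expand = expand_prompt | llm | StrOutputParser()

def multi_query_retrieve(question: str):
    variants = [question] + [q.strip() for q in expand.invoke({"question": question}).split("\n") if q.strip()]
    return [retriever.invoke(v) for v in variants]  # one ranked list per variant, ready for fusion

# HyDE: retrieve with a hypothetical answer instead of the raw question.
hyde_prompt = ChatPromptTemplate.from_template(
    "Write a short passage that plausibly answers this question:\n{question}"
)
hyde = hyde_prompt | llm | StrOutputParser()

def hyde_retrieve(question: str):
    return retriever.invoke(hyde.invoke({"question": question}))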
Reranking: Fix "The Answer Was in the Context, But…"
If you've built a basic RAG system, you've likely seen this failure mode:
- The right chunk is retrieved
- But it's ranked too low
- The LLM focuses on the wrong chunk
Reranking is the clean fix.
A common pipeline looks like:
- Retrieve the top 20–50 chunks cheaply (vector similarity)
- Rerank top candidates with a stronger model (cross-encoder, LLM-based ranker, or a reranker API)
- Feed the top 3–8 chunks to the generator
You'll see reranking approaches referenced as:
- Cross-encoder rerankers
- LLM ranking (sometimes called RankGPT-style ranking)
- RRF (Reciprocal Rank Fusion) when merging multiple retrieval lists
This is one of the highest ROI upgrades in RAG.
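Here's what the middle step can look like with an open-source cross-encoder. This is a sketch that assumes sentence-transformers is installed; the checkpoint name is a common public reranker, swap in whichever model or reranker API you prefer.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, docs, top_n: int = 5):
    # Score every (question, chunk) pair jointly, then keep the best few.
    pairs = [(question, d.page_content) for d in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]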
Filter & Compress: The Missing Piece for Long Context
Even if retrieval is good, the final prompt can still be noisy:
- repeated information
- irrelevant paragraphs
- chunks that overlap heavily
That's where contextual compression comes in: after retrieval, you summarize, extract, or filter down to only what matters.
This is especially important as your data grows and you start using larger $k$ values.
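In LangChain terms, this can be as simple as wrapping your retriever. A minimal sketch, assuming the llm and retriever from the walkthrough below:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Each retrieved chunk is trimmed to the passages relevant to the query
# before it ever reaches the final prompt.
compressor = LLMChainExtractor.from_llm(llm)
compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

compressed_docs = compressed_retriever.invoke("What's our parental leave policy?")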
Indexing: Where Most Teams Underinvest
Indexing decisions quietly determine your ceiling.
Here are indexing techniques worth knowing (and testing):
Chunk optimization
Chunk size is not a constant. Different document types want different chunking.
- Too small → context fragments
- Too large → retrieval becomes "blurry"
Semantic splitting
Split on meaning (headings, sections), not arbitrary character counts.
Parent-document retrieval
Store embeddings for child chunks but return a larger "parent" span when answering (see the sketch at the end of this section).
Multi-representation indexing
Index both:
- fine-grained chunks for precision
- summaries for recall
Specialized embeddings / fine-tuning
If your domain has unique language (legal, medicine, internal code), embeddings matter.
Hierarchical indexing (RAPTOR-like)
Build a tree of summaries from leaves → root so retrieval can happen at multiple abstraction levels.
Token-level retrieval (ColBERT-style)
A stronger retrieval approach when semantics are subtle and bag-of-vector similarity struggles.
You don't need all of these. But the point is: RAG quality is frequently an indexing problem disguised as an LLM problem.
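As one concrete example, here's a parent-document retrieval sketch using LangChain's built-in retriever. It assumes a vector store and the docs loaded in the walkthrough below (ideally a fresh, empty index); the chunk sizes are illustrative.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small child chunks are embedded for precise matching; the larger parent
# spans are what actually get returned to the LLM.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,      # ideally a fresh index dedicated to child chunks
    docstore=InMemoryStore(),     # holds the parent spans
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
parent_retriever.add_documents(docs)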
Active Retrieval (and Why It's the Future)
Some questions require the system to do extra work:
- ask clarifying questions
- reformulate queries mid-flight
- retry retrieval when evidence is weak
You'll sometimes see this category described as active retrieval (including approaches like CRAG / self-correcting retrieval patterns).
The takeaway: the best RAG systems aren't one-shot. They behave more like a careful researcher.
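A toy version of that behavior is a retrieve-grade-retry loop. Here's a rough, CRAG-flavored sketch, assuming the retriever, llm, join_docs, and rag_chain defined in the walkthrough below:
def answer_with_retry(question: str, max_attempts: int = 2) -> str:
    query = question
    for _ in range(max_attempts):
        docs = retriever.invoke(query)
        verdict = llm.invoke(
            f"Question: {question}\n\nContext:\n{join_docs(docs)}\n\n"
            "Is there enough evidence in the context to answer? Reply yes or no."
        ).content.strip().lower()
        if verdict.startswith("yes"):
            return rag_chain.invoke(question)
        # Evidence looks weak: reformulate the query and try again.
        query = llm.invoke(
            f"Rewrite this search query so it finds better supporting documents: {question}"
        ).content
    return "I couldn't find enough supporting context to answer confidently."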
A Hands-On Reference: bRAG-langchain
If you want something concrete to learn from (and compare against your own implementation), I recommend checking out the open-source bRAG-langchain project: https://github.com/bRAGAI/bRAG-langchain/
What I like about it:
- It walks from baseline RAG → multi-query → routing → advanced indexing → reranking
- It's notebook-driven, so you can test ideas quickly
- It keeps the focus on practical patterns (not just theory)
A suggested learning path mirrors the notebook sequence:
- Baseline RAG setup
- Multi-query improvements
- Routing + query construction
- Advanced indexing
- Retrieval + reranking + fusion
Use it like a "cookbook": borrow the ideas, not the exact words.
Code Walkthrough (Inspired by bRAG-langchain)
Below are two rewritten snippets inspired by the project's notebooks (especially full_basic_rag.ipynb). The goal is to show the shape of a clean RAG pipeline, without dumping an entire notebook into a blog post.
Attribution: the reference implementation that inspired these patterns is bRAG AI: https://github.com/bRAGAI/bRAG-langchain/
1) A minimal LangChain RAG chain (loader → chunks → vectors → retriever → chain)
This is the "boring baseline" that should work before you touch reranking, routing, or fancy indexing.
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
load_dotenv()  # expects OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME, etc.
def join_docs(docs) -> str:
    return "\n\n".join(d.page_content for d in docs)
# 1) Load
docs = PyPDFLoader("path/to/your.pdf").load()
# 2) Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=150)
chunks = splitter.split_documents(docs)
# 3) Embed + index
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    index_name=os.environ["PINECONE_INDEX_NAME"],
)
# 4) Retrieve
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 5) Generate
prompt = ChatPromptTemplate.from_template(
"""You are a grounded assistant. Use ONLY the context to answer.
Context:
{context}
Question: {question}
If the answer is not in the context, say you don't know.
"""
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
rag_chain = (
    {"context": retriever | join_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What is this document about?"))
Why this pattern is nice: retrieval is a pure function of the question, and prompt+LLM are pure functions of {context, question}. That separation makes it easy to add routing, reranking, eval, caching, etc.
2) Multi-query + fusion (high recall without blindly increasing k)
The repo's later notebooks explore multi-query / fusion and reranking. The key mental model is:
- generate multiple query variants
- retrieve for each
- fuse the ranked lists (so strong hits bubble up)
- optionally rerank the merged set
Here's a compact sketch using Reciprocal Rank Fusion (RRF):
from collections import defaultdict
def rrf_fuse(ranked_lists, *, k: int = 60, top_n: int = 10):
    """Fuse multiple ranked lists using Reciprocal Rank Fusion.

    ranked_lists: list[list[Document]]
    """
    scores = defaultdict(float)
    by_id = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs):
            # Prefer a stable ID if you have one; fall back to a content hash
            doc_id = doc.metadata.get("id") or hash(doc.page_content)
            by_id[doc_id] = doc
            scores[doc_id] += 1.0 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return [by_id[i] for i in fused[:top_n]]
def generate_queries(question: str) -> list[str]:
    # In practice: use an LLM prompt to produce 3-8 diverse rewrites.
    return [
        question,
        f"Explain {question} with concrete examples",
        f"What are the key concepts behind: {question}?",
    ]
question = "How does RAG reduce hallucinations?"
queries = generate_queries(question)
ranked_lists = [retriever.get_relevant_documents(q) for q in queries]
fused_docs = rrf_fuse(ranked_lists, top_n=6)
answer = rag_chain.invoke(question) # or rebuild chain to use fused_docs
print(answer)
In production you'd typically rebuild the chain so the "context" comes from fused_docs (and then optionally apply a learned reranker like Cohere Rerank on that smaller candidate set).
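For completeness, one minimal way to do that with the objects already defined above (a sketch, not the repo's exact wiring):
# Reuse the same prompt and LLM, but inject the fused documents as context.
grounded_answer = (prompt | llm | StrOutputParser()).invoke(
    {"context": join_docs(fused_docs), "question": question}
)
print(grounded_answer)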
A Production Checklist (Short, but Useful)
Before you ship RAG to real users, make sure you can answer:
- Evaluation: How will you measure grounded correctness (not just fluency)?
- Citations: Can you show which sources supported the answer?
- Fallbacks: What happens when retrieval confidence is low?
- Security: Are you filtering sensitive docs by user permissions before retrieval?
- Freshness: How often is the index updated? (and can you delete data reliably?)
- Latency: Can you keep response time acceptable with reranking and multi-query?
Conclusion
RAG isn't a single technique; it's a toolbox:
- retrieval across the right stores
- routing to the right tool
- smarter query generation (multi-query, step-back, HyDE)
- reranking and fusion
- compression for long context
- indexing strategies that scale
If you get retrieval right, generation becomes the easy part.
Resources
- bRAG LangChain project (hands-on notebooks): https://github.com/bRAGAI/bRAG-langchain/
- RAG architecture diagram source material: see RAG_Consolidated.jpg
About the Author
Suraj Khaitan – Gen AI Architect | Building the next generation of AI-powered development tools
Connect on LinkedIn | Follow for more AI and software engineering insights
Tags: #AI #RAG #LLM #LangChain #VectorDatabases #InformationRetrieval #GenerativeAI