Most RAG tutorials teach you to stuff documents into a vector store and call it a day. Then your users ask a question and get back completely wrong answers because the retriever pulled the wrong chunks.
Retrieval-Augmented Generation (RAG) is the most common pattern in production AI systems. It lets an LLM answer questions using your own data — internal docs, codebases, knowledge bases — without fine-tuning. The concept is straightforward: retrieve relevant documents, feed them to the model, get grounded answers.
The implementation is where teams struggle. Bad chunking produces fragments that lose context. Naive retrieval returns semantically similar but factually irrelevant results. And most tutorials stop before showing you how to evaluate whether your pipeline actually works.
This guide walks through four patterns that make RAG pipelines reliable. Every code example uses LangChain (v0.3+, as of March 2026), runs on Python 3.10+, and is verified against the official documentation.
What You Need
Install the dependencies:
pip install langchain-openai langchain-chroma langchain-community \
langchain-text-splitters chromadb beautifulsoup4
Set your OpenAI API key:
export OPENAI_API_KEY="your-key-here"
All examples below use OpenAI embeddings and models. You can swap in any LangChain-compatible provider (Anthropic, Ollama, Cohere) by changing the import and model name.
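For example, swapping to local models served by Ollama would look roughly like this. This is an untested sketch: it assumes the langchain-ollama package is installed and an Ollama server is running with these models pulled.

```python
# pip install langchain-ollama
from langchain_ollama import OllamaEmbeddings, ChatOllama

# Drop-in replacements for OpenAIEmbeddings / ChatOpenAI in the examples below.
# Model names are examples; use whatever you have pulled locally.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = ChatOllama(model="llama3.1", temperature=0)
```

The rest of the pipeline code stays identical because LangChain's embedding and chat interfaces are shared across providers.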
Pattern 1: Document Loading and Chunking That Preserves Context
The first failure point in most RAG pipelines is chunking. Split too small and you lose context. Split too large and you dilute relevance. The key is overlap: every chunk shares some text with its neighbors, so the retriever can find relevant passages even when the answer spans a chunk boundary.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import bs4
# Load a web page, extracting only the content you need
loader = WebBaseLoader(
web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
bs_kwargs={
"parse_only": bs4.SoupStrainer(
class_=("post-title", "post-header", "post-content")
)
},
)
docs = loader.load()
# RecursiveCharacterTextSplitter tries paragraph breaks first,
# then sentences, then words. This preserves natural boundaries.
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
add_start_index=True, # tracks where each chunk came from
)
splits = text_splitter.split_documents(docs)
print(f"Loaded {len(docs)} documents, split into {len(splits)} chunks")
Three things matter here:
- chunk_size=1000 keeps chunks large enough to contain complete thoughts. A 200-token chunk rarely contains enough context to answer a question on its own.
- chunk_overlap=200 means adjacent chunks share 200 characters. When an answer spans two chunks, both show up in retrieval results.
- add_start_index=True records the character offset where each chunk starts in the original document. This lets you trace any retrieved chunk back to its source position, which is critical for debugging retrieval quality.
RecursiveCharacterTextSplitter is the default choice for most use cases. It splits on paragraph breaks (\n\n) first, then sentence breaks (\n, .), then words. This hierarchy preserves the most natural reading boundaries.
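To make the overlap mechanics concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap. This is an illustration only, not the library's implementation; RecursiveCharacterTextSplitter additionally backs off through paragraph, sentence, and word boundaries before cutting mid-word.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character chunks. Each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbors share chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

# 1500 characters of synthetic text
text = "A" * 500 + "B" * 500 + "C" * 500
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=200)

# The last 200 chars of chunk 0 are the first 200 chars of chunk 1,
# so a passage straddling the boundary survives intact in one of them.
print(len(chunks), chunks[0][-200:] == chunks[1][:200])  # → 2 True
```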
Pattern 2: Embeddings and Vector Store Setup
Once your documents are chunked, you need to convert them to vectors and store them for retrieval. ChromaDB is the simplest vector store for local development — no external services, no Docker containers, just pip install.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
# OpenAI's text-embedding-3-small is fast and cheap
# For higher accuracy, use text-embedding-3-large
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create the vector store from your document chunks
vectorstore = Chroma.from_documents(
documents=splits,
embedding=embeddings,
persist_directory="./chroma_db", # saves to disk
)
# Turn it into a retriever
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4}, # return top 4 matches
)
# Test it
results = retriever.invoke("What is task decomposition?")
for doc in results:
    print(f"[Chunk from start_index {doc.metadata.get('start_index', '?')}]")
print(doc.page_content[:200])
print("---")
The persist_directory parameter saves your vectors to disk. Without it, ChromaDB stores everything in memory and you re-embed on every restart. For a knowledge base with thousands of documents, re-embedding costs real money.
Choosing k: Start with k=4. Too few results and you miss relevant context. Too many and you flood the LLM's context window with noise. Measure retrieval precision (are the returned chunks actually relevant?) and adjust.
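One way to measure retrieval precision: hand-label which chunk IDs are relevant for a few test questions, then score the top-k results. A generic sketch with hypothetical IDs; the relevance labels come from your own annotation, not from LangChain.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=4):
    """Fraction of the top-k retrieved chunks that are in the relevant set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)

# Example: the retriever returned chunks 7, 2, 9, 4;
# an annotator marked chunks 2, 4, and 5 as relevant to the question.
print(precision_at_k([7, 2, 9, 4], {2, 4, 5}, k=4))  # → 0.5
```

If precision stays low as you raise k, the problem is usually upstream in chunking or embeddings, not in the value of k itself.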
When to use a different vector store: ChromaDB works for local development and small datasets (under 1 million chunks). For production with larger datasets, consider Pinecone, Weaviate, or PostgreSQL with pgvector. The LangChain API is the same — swap the import, change the constructor, keep your retrieval code.
Pattern 3: The RAG Chain
Here is where retrieval meets generation. You build a chain that takes a question, retrieves relevant chunks, formats them into a prompt, and passes everything to the LLM.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# The prompt template grounds the LLM in your retrieved context
prompt = ChatPromptTemplate.from_template(
"""Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have
enough information to answer that."
Context:
{context}
Question: {question}
Answer:"""
)
def format_docs(docs):
"""Join retrieved documents into a single string."""
return "\n\n".join(doc.page_content for doc in docs)
# Build the RAG chain using LCEL (LangChain Expression Language)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Run it
answer = rag_chain.invoke("What is task decomposition?")
print(answer)
Two design decisions in this prompt matter:
"Based only on the following context" prevents the LLM from using its training data. Without this constraint, the model mixes retrieved facts with memorized (potentially outdated) information.
The fallback instruction ("say I don't have enough information") stops the model from hallucinating when the retriever returns irrelevant chunks. Most RAG failures happen here: the retriever returns something vaguely related, and the model confidently generates a wrong answer from it.
The chain itself uses LangChain Expression Language (LCEL). The | pipe operator connects components: retriever feeds into format_docs, which feeds into the prompt template, which feeds into the LLM, which feeds into the output parser.
RunnablePassthrough() passes the user's question through unchanged. The retriever receives the same question string to perform the similarity search.
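If the LCEL syntax feels opaque, this plain-Python sketch shows the same data flow with ordinary functions. It is illustrative only; the real Runnable machinery adds streaming, batching, and async on top of this.

```python
def run_rag_chain(question, retrieve, format_docs, build_prompt, call_llm):
    # The dict in the LCEL chain: run both values against the same input
    inputs = {
        "context": format_docs(retrieve(question)),  # retriever | format_docs
        "question": question,                        # RunnablePassthrough()
    }
    prompt_text = build_prompt(inputs)  # | prompt
    return call_llm(prompt_text)        # | llm | StrOutputParser()

# Toy stand-ins so the flow is visible end to end (no real model involved)
answer = run_rag_chain(
    "What is X?",
    retrieve=lambda q: ["X is a thing.", "X appears twice."],
    format_docs=lambda docs: "\n\n".join(docs),
    build_prompt=lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['question']}",
    call_llm=lambda prompt: prompt,  # echo the prompt to inspect what the LLM sees
)
print(answer)
```

Running this prints the fully assembled prompt, which is exactly what the LLM receives in the real chain: the formatted context followed by the untouched question.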
Pattern 4: Evaluate Whether Your Pipeline Actually Works
This is the pattern most tutorials skip. You built a RAG pipeline. How do you know it returns correct answers? You need a test set of questions with known answers, and a systematic way to check retrieval quality.
# Simple evaluation: does the retriever find relevant chunks?
test_questions = [
{
"question": "What is task decomposition?",
"expected_keywords": ["subgoal", "decompose", "smaller"],
},
{
"question": "What are the types of agent memory?",
"expected_keywords": ["short-term", "long-term", "sensory"],
},
]
def evaluate_retrieval(retriever, test_cases):
"""Check if retrieved chunks contain expected keywords."""
results = []
for case in test_cases:
docs = retriever.invoke(case["question"])
retrieved_text = " ".join(d.page_content for d in docs).lower()
found = [
kw for kw in case["expected_keywords"]
if kw.lower() in retrieved_text
]
missing = [
kw for kw in case["expected_keywords"]
if kw.lower() not in retrieved_text
]
score = len(found) / len(case["expected_keywords"])
results.append({
"question": case["question"],
"score": score,
"found": found,
"missing": missing,
})
status = "PASS" if score >= 0.5 else "FAIL"
print(f"[{status}] {case['question']} — {score:.0%}")
if missing:
print(f" Missing: {missing}")
avg = sum(r["score"] for r in results) / len(results)
print(f"\nAverage retrieval score: {avg:.0%}")
return results
evaluate_retrieval(retriever, test_questions)
This is a minimal evaluation. It checks whether the retriever pulls back chunks that contain the right concepts. A score below 50% means your chunking strategy is wrong — go back to Pattern 1 and adjust chunk_size and chunk_overlap.
For production evaluation, add these layers:
- Answer correctness: Compare generated answers against ground truth using an LLM-as-judge (ask a model to score the answer's factual accuracy against a reference answer).
- Faithfulness: Check whether the answer is grounded in the retrieved context. If the answer contains claims not present in any retrieved chunk, the model is hallucinating.
- Retrieval relevance: For each retrieved chunk, score whether it is actually relevant to the question. Low relevance scores mean your embeddings or chunking need work.
Frameworks like DeepEval and RAGAS automate these checks. But start with the keyword-based evaluation above. It catches the obvious failures — wrong chunks, empty retrievals, missing concepts — before you invest in a full evaluation pipeline.
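As a middle step before adopting a framework, a crude lexical faithfulness check can flag answer sentences that share almost no vocabulary with the retrieved context. This is a rough heuristic sketch only; an LLM-as-judge catches the paraphrases this misses.

```python
import re

def unsupported_sentences(answer, context, min_overlap=0.3):
    """Return answer sentences whose content words barely appear in the context.
    A crude lexical proxy for faithfulness, not a substitute for an LLM judge."""
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(1 for w in words if w in context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "Task decomposition breaks a hard task into smaller subgoals."
answer = "Task decomposition breaks work into smaller subgoals. It was invented by NASA in 1962."
print(unsupported_sentences(answer, context))
# → ['It was invented by NASA in 1962.']
```

The fabricated second sentence gets flagged because none of its content words appear anywhere in the retrieved context.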
Putting It All Together
Here is the complete pipeline in one script:
"""Complete RAG pipeline — load, chunk, embed, retrieve, generate, evaluate."""
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# 1. Load
loader = WebBaseLoader(
web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
bs_kwargs={
"parse_only": bs4.SoupStrainer(
class_=("post-title", "post-header", "post-content")
)
},
)
docs = loader.load()
# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = splitter.split_documents(docs)
# 3. Embed + Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 4. Generate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
"""Answer based only on this context. If unsure, say so.
Context: {context}
Question: {question}
Answer:"""
)
rag_chain = (
{"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
"question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# 5. Run
question = "What is task decomposition?"
print(rag_chain.invoke(question))
38 lines from raw documents to grounded answers.
What to Do Next
Three improvements that matter most after your first pipeline works:
- Add metadata filtering. Tag your documents with source, date, and category. Use search_kwargs={"filter": {"source": "docs"}} to restrict retrieval to specific document sets.
- Try hybrid search. Vector similarity misses exact keyword matches. Many vector stores support combining vector search with keyword (BM25) search, and LangChain's EnsembleRetriever can merge a BM25 retriever with your vector retriever. This catches queries where the user uses the exact terminology from the documents.
- Monitor retrieval quality. Log every query, the chunks retrieved, and the generated answer. Review the logs weekly. The queries your pipeline answers badly tell you exactly which documents to add or how to adjust your chunking.
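For hybrid search, reciprocal rank fusion (RRF) is a common way to merge a vector ranking with a keyword ranking. Here is a generic sketch with made-up document IDs; many production stores implement fusion for you.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs. Each doc scores sum(1 / (k + rank))
    across the lists; k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]   # semantic ranking
keyword_hits = ["doc1", "doc9", "doc3"]  # BM25-style ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# → ['doc1', 'doc3', 'doc9', 'doc7']
```

doc1 wins because it ranks highly in both lists, which is exactly the behavior you want from hybrid retrieval.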
RAG is not a one-time setup. It is a system that improves as you add documents, adjust chunking, and measure what works.
Follow @klement_gunndu for more AI engineering content. We're building in public.