Every few months, someone declares RAG (Retrieval-Augmented Generation) dead. "Just use a million-token context window," they say. "Fine-tune instead," others suggest.
They're wrong. RAG isn't dead — naive RAG is dead. The pattern of "chunk documents → embed → cosine similarity → stuff into prompt" was always a prototype, not a production system. In 2026, production RAG looks radically different.
This article covers the patterns that separate toy demos from systems that actually work.
## Why Naive RAG Fails
The classic RAG pipeline has predictable failure modes:
- Chunking destroys context — Splitting at 512 tokens breaks paragraphs, separates questions from answers, and loses document structure
- Embedding similarity ≠ relevance — "How do I reset my password?" and "Password reset policy" have high similarity but serve different intents
- Top-K retrieval is crude — The 5 most similar chunks aren't necessarily the 5 most useful
- No query understanding — The raw user query goes straight to vector search with no transformation
Let's fix each of these.
## Pattern 1: Semantic Chunking
Instead of fixed-size chunks, split at semantic boundaries:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

chunks = chunker.split_text(document_text)
```
The semantic chunker computes embeddings for each sentence, then splits where the cosine distance between consecutive sentences exceeds a threshold. Sentences about the same topic stay together.
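The split logic itself fits in a few lines. This is a simplified sketch, not the library's actual implementation — the `embed` function and the fixed threshold stand in for a real embedding model and the percentile-based breakpoint:

```python
from math import sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def semantic_split(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk whenever the
    cosine distance between neighboring sentences exceeds the threshold."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With toy 2-d embeddings, two sentences about pets land in one chunk and an unrelated sentence starts a new one.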
### Contextual Retrieval (Anthropic's Approach)
Prepend each chunk with context about where it fits in the document:
```python
def add_context(chunk: str, full_document: str) -> str:
    """Use an LLM to generate context for each chunk."""
    prompt = f"""Given this document:
{full_document[:2000]}

And this specific chunk:
{chunk}

Write a 2-3 sentence context that explains where this chunk
fits within the overall document. Be specific."""
    # Assumes `llm` is a LangChain chat model; .content extracts the text
    context = llm.invoke(prompt).content
    return f"CONTEXT: {context}\n\n{chunk}"
```
This costs more upfront but dramatically improves retrieval accuracy — Anthropic reported up to 49% fewer retrieval failures when contextual embeddings are combined with contextual BM25.
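The upfront cost is worth estimating before you commit. A back-of-envelope sketch — the per-token prices and token counts below are illustrative assumptions, not current rates:

```python
# Illustrative prices (USD per 1M tokens) -- check your provider's pricing.
PRICE_PER_1M_INPUT_TOKENS = 0.25
PRICE_PER_1M_OUTPUT_TOKENS = 1.25

def contextualization_cost(num_chunks, doc_prefix_tokens=2000,
                           chunk_tokens=300, context_tokens=60):
    """One LLM call per chunk: document prefix + chunk in,
    a 2-3 sentence context out."""
    input_tokens = num_chunks * (doc_prefix_tokens + chunk_tokens)
    output_tokens = num_chunks * context_tokens
    return (input_tokens / 1e6 * PRICE_PER_1M_INPUT_TOKENS
            + output_tokens / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS)

print(f"${contextualization_cost(10_000):.2f} for 10k chunks")
```

Under these assumptions, contextualizing an entire 10k-chunk corpus is a one-time cost of a few dollars — cheap relative to the ongoing cost of bad retrieval.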
## Pattern 2: Hybrid Search
Vector search alone misses exact matches. BM25 alone misses semantic similarity. Combine them:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# Vector retriever
vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 10})

# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Favor semantic for most use cases
)
```
The EnsembleRetriever uses Reciprocal Rank Fusion (RRF) to merge results. A document ranking #1 in vector search and #3 in BM25 scores higher than one ranking #2 in both.
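RRF itself is simple enough to sketch. In this toy version the document IDs are made up and `k=60` is the customary smoothing constant; it shows why agreement across retrievers beats a single top rank:

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Merge ranked lists of document IDs: each document scores
    weight * 1 / (k + rank), summed across all lists it appears in."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], weights=[0.6, 0.4])
```

Here `doc_a` and `doc_b` both appear in both lists, so they outrank `doc_c` and `doc_d`, which each appear in only one — exactly the behavior you want from fusion.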
### Adding Knowledge Graphs
For structured relationships (org charts, product hierarchies, dependency trees), add a graph layer:
```python
from neo4j import GraphDatabase

def graph_enhanced_retrieval(query: str, vector_results: list) -> list:
    """Enrich vector results with graph context."""
    enriched = []
    # Context manager closes the driver; add auth=(...) if your DB needs it
    with GraphDatabase.driver("bolt://localhost:7687") as driver:
        for doc in vector_results:
            # Find entities related to this document in the graph
            with driver.session() as session:
                result = session.run("""
                    MATCH (n)-[r]-(related)
                    WHERE n.name = $entity
                    RETURN related.name, type(r), related.description
                    LIMIT 5
                """, entity=doc.metadata.get("entity"))
                context = [f"{r['type(r)']}: {r['related.name']}" for r in result]
            if context:
                doc.page_content += f"\n\nRelated: {', '.join(context)}"
            enriched.append(doc)
    return enriched
```
## Pattern 3: Re-Ranking
Initial retrieval casts a wide net. Re-ranking uses a cross-encoder to score each (query, document) pair more accurately:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Retrieve 20 candidates, re-rank to top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,  # From hybrid search above
)

results = compression_retriever.invoke("How do I configure SSO?")
```
Cross-encoders process the query and document together (unlike bi-encoders which encode them separately), enabling much more nuanced relevance scoring. The tradeoff is speed — which is why we re-rank a pre-filtered set rather than the entire corpus.
### ColBERT v2: Best of Both Worlds
ColBERT stores per-token embeddings and uses late interaction for scoring, giving near cross-encoder accuracy at near bi-encoder speed:
```python
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

rag.index(
    collection=[doc.page_content for doc in documents],
    document_metadatas=[doc.metadata for doc in documents],
    index_name="my_index",
)

results = rag.search(query="SSO configuration", k=5)
```
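"Late interaction" sounds exotic but reduces to a max-over-dot-products, often called MaxSim. A toy sketch with hand-made 2-d token embeddings (real ColBERT vectors are 128-d and normalized):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding,
    take its best-matching document token, then sum those maxima."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_vecs)
        for q_vec in query_vecs
    )

query = [[1.0, 0.0], [0.0, 1.0]]     # two query token embeddings
doc_good = [[0.9, 0.1], [0.1, 0.9]]  # has a close match for both tokens
doc_weak = [[0.9, 0.1], [0.8, 0.2]]  # only matches the first token well
```

Because each query token is scored independently, a document must cover *all* aspects of the query to score highly — which is what gives ColBERT its near cross-encoder accuracy without running a full joint forward pass per pair.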
## Pattern 4: Query Transformation
Don't send the raw user query to retrieval. Transform it first.
### HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then search for documents similar to that answer:
```python
def hyde_retrieval(query: str) -> list:
    # Generate a hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that would answer: {query}"
    ).content
    # Search using the hypothetical document's embedding;
    # it often matches real answers better than the question does
    return vector_store.similarity_search(hypothetical, k=5)
```
### Multi-Query Expansion
Generate multiple perspectives on the same question:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
)

# Internally generates 3+ query variants and deduplicates results
results = multi_retriever.invoke("Why is our API slow?")
# Generates variants like: "API performance issues",
# "latency root causes", "slow response time debugging"
```
### Step-Back Prompting
For specific questions, first ask a broader question:
```python
def step_back_retrieval(query: str) -> list:
    # Generate a more general question
    broader = llm.invoke(
        f"Given this specific question: '{query}'\n"
        f"What is a more general question that would help answer it?"
    ).content
    # Retrieve for both the specific and the general query
    specific_docs = vector_store.similarity_search(query, k=3)
    general_docs = vector_store.similarity_search(broader, k=3)
    # Deduplicate by content, preserving order
    seen, unique = set(), []
    for doc in specific_docs + general_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique
```
## Pattern 5: Agentic RAG
The biggest evolution: let the LLM decide what to retrieve and when.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    question: str
    documents: list
    answer: str
    needs_more_info: bool

def retrieve(state: AgentState) -> dict:
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def grade_documents(state: AgentState) -> dict:
    """Let the LLM decide if the retrieved docs are sufficient."""
    grade = llm.invoke(
        f"Question: {state['question']}\n"
        f"Documents: {state['documents']}\n"
        f"Are these documents sufficient to answer the question? "
        f"Reply with exactly YES or NO."
    ).content
    return {"needs_more_info": grade.strip().upper().startswith("NO")}

def generate(state: AgentState) -> dict:
    answer = llm.invoke(
        f"Answer based on these documents:\n"
        f"{state['documents']}\n\n"
        f"Question: {state['question']}"
    ).content
    return {"answer": answer}

def rewrite_query(state: AgentState) -> dict:
    better_query = llm.invoke(
        f"The following question didn't get good search results: "
        f"'{state['question']}'. Rewrite it for better retrieval."
    ).content
    return {"question": better_query}

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("rewrite", rewrite_query)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges(
    "grade",
    lambda s: "rewrite" if s["needs_more_info"] else "generate",
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()
```
This agent retrieves, evaluates quality, rewrites the query if needed, and only generates when it has sufficient context.
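One caveat: a grade/rewrite/retrieve cycle with no exit condition can loop forever on a question the corpus simply can't answer. The control flow with a rewrite cap can be sketched in plain Python — the `retrieve`, `grade`, `rewrite`, and `generate` callables here are stand-ins for the graph nodes above:

```python
MAX_REWRITES = 2

def answer_with_retries(question, retrieve, grade, rewrite, generate):
    """Retrieve, grade, and rewrite at most MAX_REWRITES times;
    on the final attempt, answer with whatever was retrieved."""
    for attempt in range(MAX_REWRITES + 1):
        docs = retrieve(question)
        if grade(question, docs) or attempt == MAX_REWRITES:
            return generate(question, docs)
        question = rewrite(question)
```

In LangGraph proper you would achieve the same thing by tracking an attempt counter in `AgentState` and routing to `generate` once it is exhausted.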
## Pattern 6: Evaluation with RAGAS
You can't improve what you don't measure:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_dataset,  # Questions + ground truth answers
    metrics=[
        faithfulness,       # Is the answer grounded in retrieved context?
        answer_relevancy,   # Does the answer address the question?
        context_precision,  # Are the retrieved docs relevant?
        context_recall,     # Did we retrieve all necessary info?
    ],
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}
```
Track these metrics over time. When you change chunking strategy, embedding model, or retrieval pipeline, you'll know immediately if it helped.
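That tracking can start as a few lines of plumbing. A minimal sketch — the file path, JSONL format, and 0.02 tolerance are arbitrary choices, and in practice you would wire this into CI:

```python
import json
from pathlib import Path

HISTORY = Path("ragas_history.jsonl")

def record_and_compare(run_name, metrics, tolerance=0.02):
    """Append this evaluation run to a history file and return the
    names of any metrics that dropped past the tolerance vs. last run."""
    previous = None
    if HISTORY.exists():
        lines = HISTORY.read_text().splitlines()
        if lines:
            previous = json.loads(lines[-1])["metrics"]
    with HISTORY.open("a") as f:
        f.write(json.dumps({"run": run_name, "metrics": metrics}) + "\n")
    if previous is None:
        return []
    return [name for name, value in metrics.items()
            if value < previous.get(name, 0.0) - tolerance]
```

Fail the build when the returned list is non-empty, and a chunking change that silently tanks `context_recall` never reaches production.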
## RAG vs. Fine-Tuning vs. Long Context
When to use each:
| Approach | Best For | Limitations |
|---|---|---|
| RAG | Dynamic data, source attribution, cost control | Retrieval quality ceiling |
| Fine-tuning | Teaching style/format, specialized domains | Stale data, no source citation |
| Long context | Small corpora (<100 docs), one-shot analysis | Cost at scale, attention degradation |
The sweet spot for most production systems: RAG + selective fine-tuning. Fine-tune for domain language and response style. Use RAG for up-to-date facts and source attribution.
## Production Tips
- Cache aggressively — Cache embeddings, cache LLM re-ranking calls, cache final answers for repeated queries
- Stream the answer — Start generating as soon as retrieval completes; don't wait for re-ranking if latency matters
- Monitor retrieval quality — Log which chunks were retrieved and whether users found answers helpful
- Use metadata filters — Filter by date, department, document type before vector search to reduce noise
- Implement fallback — If RAG confidence is low, fall back to a direct LLM response with a disclaimer
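The first tip can start as small as a hash-keyed in-memory cache around the embedding call. A sketch — `embed_fn` is a stand-in for your real embedding client, and the dict would become Redis or similar in production:

```python
import hashlib

class CachedEmbedder:
    """Memoize embeddings by content hash so re-indexing
    unchanged chunks costs no additional API calls."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0  # counts actual calls to the underlying embedder

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]
```

Hashing the content (rather than, say, a chunk ID) means a chunk whose text changes is automatically re-embedded, while a re-ordered or re-chunked document reuses every chunk that survived intact.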
RAG in 2026 is a pipeline engineering challenge, not a simple API call. But get it right, and you have a system that's accurate, attributable, and cost-effective at scale.