RAG (Retrieval-Augmented Generation) has evolved dramatically. In 2023 it was "embed and retrieve." In 2026, it's a multi-stage, agentic pipeline with evaluation loops. Here's the complete picture.
Why RAG Still Matters in 2026
Even with 1M+ token context windows, RAG remains essential:
| Problem | Symptom | RAG Solution |
|---|---|---|
| Knowledge cutoff | LLM can't answer about recent events | Real-time retrieval |
| Hallucination | Confident but wrong answers | Ground answers in source documents |
| Private data | LLM doesn't know your internal docs | Inject proprietary knowledge |
| Cost | 1M tokens per query = expensive | Retrieve only what's needed |
The RAG Evolution Arc
Naive RAG (2023)
Question → Embed → Vector Search → Retrieve chunks → LLM → Answer
Simple. Worked. Hit a precision ceiling around 70%.
Advanced RAG (2024)
Question → Query expansion → Hybrid search → Rerank → LLM → Answer
HyDE, query decomposition, MMR, and cross-encoder reranking pushed precision to 85%+.
Agentic RAG (2025–2026)
Question → Agent plans strategy
→ Parallel multi-source retrieval
→ Synthesis + verification
→ Self-critique loop (retry if insufficient)
→ Final answer with citations
The agent decides when to search, what to search for, and whether the result is good enough.
Building a Production RAG Pipeline
Step 1: Document Loading and Chunking
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("technical_docs.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200, # Overlap preserves context at boundaries
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
Chunking strategy matters more than most people think. Rough starting points by content type (a configuration sketch follows the list):
- Technical docs: 500–1000 chars
- Conversational logs: 200–500 chars
- Legal/contracts: 1000–2000 chars (longer context needed)
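A sketch of what those ranges look like as splitter presets; the exact numbers are illustrative starting points within the ranges above, not LangChain defaults, so tune them against your own eval set:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative presets; overlap is roughly 20-25% of chunk size
CHUNKING_PRESETS = {
    "technical_docs": {"chunk_size": 800, "chunk_overlap": 200},
    "chat_logs": {"chunk_size": 400, "chunk_overlap": 80},
    "legal": {"chunk_size": 1500, "chunk_overlap": 300},
}

def make_splitter(doc_type: str) -> RecursiveCharacterTextSplitter:
    preset = CHUNKING_PRESETS[doc_type]
    return RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        **preset,
    )

legal_chunks = make_splitter("legal").split_documents(documents)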
Step 2: Vector Store Setup
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db",
collection_name="knowledge_base"
)
Embedding model comparison (2026); a drop-in example for the open-source option follows the table:
| Model | Dimensions | Cost (per 1M tokens) | Notes |
|-------|-----------|------|-------|
| text-embedding-3-large | 3072 | $0.13 | Best quality |
| text-embedding-3-small | 1536 | $0.02 | ~6x cheaper, good for most use cases |
| BAAI/bge-m3 | 1024 | Free (self-hosted) | Best open-source option |
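If you want the free, open-source route, one way to slot bge-m3 in (assuming the langchain-huggingface and sentence-transformers packages are installed) is:

from langchain_huggingface import HuggingFaceEmbeddings

# Runs locally via sentence-transformers; no API cost, 1024-dim vectors
embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    encode_kwargs={"normalize_embeddings": True},  # cosine-friendly vectors
)

The rest of the pipeline is unchanged; only the embedding object is swapped. Note that switching models means re-embedding the corpus, since the vector dimensions differ (1024 vs 3072).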
Step 3: Hybrid Search + Reranking
The biggest quality jump comes from combining vector search (semantic) with BM25 (keyword):
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Semantic retriever (MMR for diversity)
vector_retriever = vectorstore.as_retriever(
search_type="mmr",
search_kwargs={"k": 10, "fetch_k": 30}
)
# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10
# Hybrid: 60% semantic + 40% keyword
hybrid_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.6, 0.4]
)
# Rerank top results with a cross-encoder
reranker = CrossEncoderReranker(
model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
top_n=5
)
final_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=hybrid_retriever
)
Step 4: The RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template("""
You are a precise technical assistant. Answer based ONLY on the provided documents.
If the answer isn't in the documents, say "I couldn't find this information in the provided documents."
Documents:
{context}
Question: {question}
Answer (cite your sources):
""")
def format_docs(docs):
    # Prefix each chunk with its source so the LLM can cite it
    return "\n\n---\n\n".join([
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    ])
rag_chain = (
{"context": final_retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
answer = rag_chain.invoke("What are the main RAG hallucination mitigation strategies?")
Agentic RAG with LangGraph
The key difference: the agent decides the retrieval strategy dynamically.
from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional
import json
import re

class RAGState(TypedDict):
    question: str
    search_queries: List[str]
    retrieved_docs: List[str]
    answer: Optional[str]
    needs_more_search: bool
    iteration: int
def query_decomposer(state: RAGState) -> RAGState:
    """Break complex questions into targeted sub-queries."""
    response = llm.invoke(
        f"Decompose this into 2-4 specific search queries (JSON array of strings):\n{state['question']}"
    )
    try:
        queries = json.loads(response.content)
    except (json.JSONDecodeError, TypeError):
        queries = [state['question']]  # fall back to the original question
    return {"search_queries": queries}
def parallel_retriever(state: RAGState) -> RAGState:
    """Retrieve for each sub-query (sequential here; fan out with async for true parallelism)."""
    all_docs = []
    for query in state['search_queries']:
        docs = final_retriever.invoke(query)
        all_docs.extend([d.page_content for d in docs])
    return {"retrieved_docs": list(dict.fromkeys(all_docs))[:10]}  # dedupe, keep first 10
def answer_and_evaluate(state: RAGState) -> RAGState:
    context = "\n\n".join(state['retrieved_docs'])
    response = llm.invoke(
        f"Documents:\n{context}\n\nQuestion: {state['question']}\n\n"
        f"Answer, then on a new line output JSON: {{\"sufficient\": true/false}}"
    )
    # Parse the trailing JSON verdict; treat an unparseable verdict as sufficient
    # so a formatting slip doesn't trigger pointless retries
    sufficient = True
    match = re.search(r'\{[^{}]*"sufficient"[^{}]*\}', response.content)
    if match:
        try:
            sufficient = bool(json.loads(match.group(0)).get("sufficient", True))
        except json.JSONDecodeError:
            pass
    return {
        "answer": response.content,
        "needs_more_search": not sufficient,
        "iteration": state.get('iteration', 0) + 1
    }
def should_retry(state: RAGState) -> str:
    if state['needs_more_search'] and state['iteration'] < 3:
        return "retry"
    return "end"
graph = StateGraph(RAGState)
graph.add_node("decompose", query_decomposer)
graph.add_node("retrieve", parallel_retriever)
graph.add_node("generate", answer_and_evaluate)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_retry, {"retry": "retrieve", "end": END})
agentic_rag = graph.compile()
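Invoking the compiled graph is a single call; an initial state with just the question is enough, since the nodes fill in the remaining fields (the example question is illustrative):

result = agentic_rag.invoke({"question": "How does our retry policy interact with rate limits?"})
print(result["answer"])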
Evaluation: You Can't Improve What You Don't Measure
The top RAG evaluation stack in 2026:
Ragas (RAG-specific)
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset
# test_questions / reference_answers: your hand-labeled eval set of questions and reference answers
eval_dataset = Dataset.from_dict({
"question": test_questions,
"answer": [rag_chain.invoke(q) for q in test_questions],
"contexts": [[d.page_content for d in final_retriever.invoke(q)] for q in test_questions],
"ground_truth": reference_answers
})
scores = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(scores.to_pandas())
Target scores for production (a CI gate sketch follows the table):
| Metric | Minimum | Target |
|--------|---------|--------|
| Faithfulness | 0.85 | > 0.92 |
| Answer Relevancy | 0.80 | > 0.88 |
| Context Recall | 0.75 | > 0.85 |
| Context Precision | 0.70 | > 0.80 |
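One way to wire these thresholds into CI is to average each metric over the eval set and fail the build when a minimum is missed. This sketch assumes the DataFrame from scores.to_pandas() has one column per metric, named as in the table:

MINIMUMS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.70,
}

df = scores.to_pandas()
means = {metric: df[metric].mean() for metric in MINIMUMS}
failures = {m: v for m, v in means.items() if v < MINIMUMS[m]}
assert not failures, f"RAG eval below minimum thresholds: {failures}"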
The 5 Most Common RAG Failures (and Fixes)
1. Chunk boundary cuts critical information
→ Increase chunk_overlap to 20-30% of chunk size
2. Vocabulary mismatch between query and document
→ Use HyDE (generate a hypothetical answer, embed that for search; see the sketch after this list)
→ Use hybrid search (BM25 catches exact keyword matches)
3. Irrelevant chunks pass vector similarity threshold
→ Add cross-encoder reranking as a second filter
4. Stale data in the index
→ Add date metadata, filter by recency in retriever kwargs
5. LLM ignores the retrieved context
→ Restructure the prompt — put documents BEFORE the question, not after
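For failure #2, a minimal HyDE sketch looks like this, reusing the llm and vectorstore objects from earlier (the prompt wording and example question are illustrative):

def hyde_search(question: str, k: int = 5):
    # Generate a hypothetical answer, then search with *its* embedding,
    # which usually matches document vocabulary better than the raw question
    hypothetical = llm.invoke(
        f"Write a short, plausible passage that answers: {question}"
    ).content
    return vectorstore.similarity_search(hypothetical, k=k)

docs = hyde_search("What does clause 7.2 say about data retention?")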
2026 Trends to Watch
- GraphRAG: Microsoft's approach — extract knowledge graph from docs, traverse relationships for multi-hop reasoning
- Multi-modal RAG: Retrieve images, charts, tables alongside text
- Adaptive RAG: Route simple queries to fast/cheap path, complex ones to agentic path
- Caching layers: Cache embeddings + frequent query results (Redis/Upstash) to cut costs 60-80%
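On the caching point, LangChain's CacheBackedEmbeddings covers the embedding side in a few lines; this sketch uses a local file store as a simple stand-in for a Redis or Upstash backend:

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

store = LocalFileStore("./embedding_cache")  # swap for a Redis-backed store in production
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    store,
    namespace="text-embedding-3-large",  # keep caches from different models separate
)
# Use cached_embeddings anywhere the pipeline used `embeddings`;
# re-ingesting the same documents then skips the embedding API calls

Caching frequent query results, the other half of the savings, is usually just a key-to-answer lookup in front of the chain.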
RAG is mature technology now. The differentiator isn't whether you use it — it's how well you evaluate and iterate on it. Add Ragas to your CI/CD pipeline and treat retrieval quality as a first-class metric.
Explore 460+ AI agent tools including RAG infrastructure at AgDex.ai