RAG in 2026: From Naive Retrieval to Agentic RAG — A Complete Implementation Guide

RAG (Retrieval-Augmented Generation) has evolved dramatically. In 2023, it was "embed and retrieve." In 2026, it's a multi-stage, agentic pipeline with evaluation loops. Here's the complete picture.


Why RAG Still Matters in 2026

Even with 1M+ token context windows, RAG remains essential:

| Problem | Symptom | RAG Solution |
|---------|---------|--------------|
| Knowledge cutoff | LLM can't answer about recent events | Real-time retrieval |
| Hallucination | Confident but wrong answers | Ground answers in source documents |
| Private data | LLM doesn't know your internal docs | Inject proprietary knowledge |
| Cost | 1M tokens per query = expensive | Retrieve only what's needed |

The RAG Evolution Arc

Naive RAG (2023)

Question → Embed → Vector Search → Retrieve chunks → LLM → Answer

Simple, and it worked, but it hit a precision ceiling around 70%.

Advanced RAG (2024)

Question → Query expansion → Hybrid search → Rerank → LLM → Answer

HyDE, query decomposition, MMR, and cross-encoder reranking pushed precision to 85%+.

Agentic RAG (2025–2026)

Question → Agent plans strategy
         → Parallel multi-source retrieval
         → Synthesis + verification
         → Self-critique loop (retry if insufficient)
         → Final answer with citations

The agent decides when to search, what to search for, and whether the result is good enough.


Building a Production RAG Pipeline

Step 1: Document Loading and Chunking

from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("technical_docs.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,   # Overlap preserves context at boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

Chunking strategy matters more than most people think:

  • Technical docs: 500–1000 chars
  • Conversational logs: 200–500 chars
  • Legal/contracts: 1000–2000 chars (longer context needed)
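
One way to encode those defaults as a reusable helper (the preset names and exact numbers here are illustrative, not a standard):

# Hypothetical per-document-type presets following the guidelines above
CHUNK_PRESETS = {
    "technical": {"chunk_size": 1000, "chunk_overlap": 200},
    "conversational": {"chunk_size": 400, "chunk_overlap": 100},
    "legal": {"chunk_size": 2000, "chunk_overlap": 400},
}

def make_splitter(doc_type: str) -> RecursiveCharacterTextSplitter:
    # Fall back to the technical preset for unknown document types
    preset = CHUNK_PRESETS.get(doc_type, CHUNK_PRESETS["technical"])
    return RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""], **preset
    )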

Step 2: Vector Store Setup

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="knowledge_base"
)
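
A quick sanity check on the index before moving on (the query string is just an example):

# Lower scores = closer matches (Chroma returns distances)
results = vectorstore.similarity_search_with_score("chunk overlap strategy", k=3)
for doc, score in results:
    print(f"{score:.3f}  {doc.page_content[:80]}")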

Embedding model comparison (2026):
| Model | Dimensions | Cost | Notes |
|-------|-----------|------|-------|
| text-embedding-3-large | 3072 | $0.13/1M | Best quality |
| text-embedding-3-small | 1536 | $0.02/1M | 6.5x cheaper, good for most |
| BAAI/bge-m3 | 1024 | Free | Best open-source option |
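
If you go the open-source route, bge-m3 drops in with one changed line. A sketch, assuming the langchain-huggingface package is installed:

from langchain_huggingface import HuggingFaceEmbeddings

# Runs locally; no per-token cost
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")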

Step 3: Hybrid Search + Reranking

The biggest quality jump comes from combining vector search (semantic) with BM25 (keyword):

from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Semantic retriever (MMR for diversity)
vector_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 10, "fetch_k": 30}
)

# Keyword retriever
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10

# Hybrid: 60% semantic + 40% keyword
hybrid_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)

# Rerank top results with a cross-encoder
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3"),
    top_n=5
)

final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever
)
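
To see the whole hybrid-plus-rerank path end to end, invoke the final retriever directly (the query is illustrative):

docs = final_retriever.invoke("How should I size chunk overlap?")
for d in docs:
    # Five documents survive the cross-encoder (top_n=5)
    print(d.metadata.get("source", "unknown"), "-", d.page_content[:60])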

Step 4: The RAG Chain

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
You are a precise technical assistant. Answer based ONLY on the provided documents.
If the answer isn't in the documents, say "I couldn't find this information in the provided documents."

Documents:
{context}

Question: {question}

Answer (cite your sources):
""")

def format_docs(docs):
    return "\n\n---\n\n".join([
        f"[Source: {doc.metadata.get('source', 'unknown')}]\n{doc.page_content}"
        for doc in docs
    ])

rag_chain = (
    {"context": final_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What are the main RAG hallucination mitigation strategies?")
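
Because every stage is an LCEL runnable, streaming comes for free: stream() yields tokens as the LLM produces them.

for token in rag_chain.stream("What are the main RAG hallucination mitigation strategies?"):
    print(token, end="", flush=True)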

Agentic RAG with LangGraph

The key difference: the agent decides the retrieval strategy dynamically.

import json

from langgraph.graph import StateGraph, END
from typing import TypedDict, List, Optional

class RAGState(TypedDict):
    question: str
    search_queries: List[str]
    retrieved_docs: List[str]
    answer: Optional[str]
    needs_more_search: bool
    iteration: int

def query_decomposer(state: RAGState) -> RAGState:
    """Break complex questions into targeted sub-queries"""
    response = llm.invoke(
        f"Decompose this into 2-4 specific search queries (JSON array only):\n{state['question']}"
    )
    try:
        queries = json.loads(response.content)
    except (json.JSONDecodeError, TypeError):
        queries = [state['question']]  # Fall back to the original question
    return {"search_queries": queries}

def parallel_retriever(state: RAGState) -> RAGState:
    # Sequential loop for clarity; fan out with asyncio/abatch in production
    all_docs = []
    for query in state['search_queries']:
        docs = final_retriever.invoke(query)
        all_docs.extend([d.page_content for d in docs])
    return {"retrieved_docs": list(dict.fromkeys(all_docs))[:10]}  # Dedup, keep top 10

def answer_and_evaluate(state: RAGState) -> RAGState:
    context = "\n\n".join(state['retrieved_docs'])
    response = llm.invoke(
        f"Documents:\n{context}\n\nQuestion: {state['question']}\n\n"
        f"Answer, then on a new line output JSON: {{\"sufficient\": true/false}}"
    )
    # Split the trailing JSON verdict off the answer text
    text = response.content.strip()
    answer, _, verdict_line = text.rpartition("\n")
    try:
        sufficient = json.loads(verdict_line).get("sufficient", True)
    except (json.JSONDecodeError, AttributeError):
        answer, sufficient = text, True  # No parseable verdict: keep full text
    return {
        "answer": answer or text,
        "needs_more_search": not sufficient,
        "iteration": state.get('iteration', 0) + 1
    }

def should_retry(state: RAGState) -> str:
    if state['needs_more_search'] and state['iteration'] < 3:
        return "retry"
    return "end"

graph = StateGraph(RAGState)
graph.add_node("decompose", query_decomposer)
graph.add_node("retrieve", parallel_retriever)
graph.add_node("generate", answer_and_evaluate)
graph.set_entry_point("decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", should_retry, {"retry": "retrieve", "end": END})

agentic_rag = graph.compile()
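
A minimal invocation of the compiled graph; passing every state key explicitly avoids KeyErrors in nodes that read them (the question is an example):

result = agentic_rag.invoke({
    "question": "How do HyDE and query decomposition differ?",
    "search_queries": [],
    "retrieved_docs": [],
    "answer": None,
    "needs_more_search": False,
    "iteration": 0,
})
print(result["answer"])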

Evaluation: You Can't Improve What You Don't Measure

The top RAG evaluation stack in 2026:

Ragas (RAG-specific)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
from datasets import Dataset

test_questions = [...]       # Your evaluation questions
reference_answers = [...]    # Gold reference answers, one per question

eval_dataset = Dataset.from_dict({
    "question": test_questions,
    "answer": [rag_chain.invoke(q) for q in test_questions],
    "contexts": [[d.page_content for d in final_retriever.invoke(q)] for q in test_questions],
    "ground_truth": reference_answers
})

scores = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision]
)
print(scores.to_pandas())

Target scores for production:
| Metric | Minimum | Target |
|--------|---------|--------|
| Faithfulness | 0.85 | > 0.92 |
| Answer Relevancy | 0.80 | > 0.88 |
| Context Recall | 0.75 | > 0.85 |
| Context Precision | 0.70 | > 0.80 |
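
Wiring those minimums into CI takes a few lines. A sketch assuming scores is the result object from the Ragas run above, whose to_pandas() columns carry the metric names:

# Thresholds mirror the "Minimum" column in the table above
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.70,
}

df = scores.to_pandas()
for metric, minimum in THRESHOLDS.items():
    mean = df[metric].mean()
    assert mean >= minimum, f"{metric} regression: {mean:.3f} < {minimum}"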


The 5 Most Common RAG Failures (and Fixes)

1. Chunk boundary cuts critical information
→ Increase chunk_overlap to 20–30% of chunk size

2. Vocabulary mismatch between query and document
→ Use HyDE (generate a hypothetical answer, embed that for search; see the sketch after this list)
→ Use hybrid search (BM25 catches exact keyword matches)

3. Irrelevant chunks pass vector similarity threshold
→ Add cross-encoder reranking as a second filter

4. Stale data in the index
→ Add date metadata, filter by recency in retriever kwargs

5. LLM ignores the retrieved context
→ Restructure the prompt — put documents BEFORE the question, not after
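
The HyDE fix from #2, as a minimal sketch built on the llm and vectorstore defined earlier:

def hyde_retrieve(question: str, k: int = 5):
    # Embed a hypothetical *answer* instead of the question itself, so the
    # search vector lives in answer-space, closer to the documents
    hypothetical = llm.invoke(
        f"Write a short, plausible answer to this question:\n{question}"
    ).content
    return vectorstore.similarity_search(hypothetical, k=k)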


2026 Trends to Watch

  • GraphRAG: Microsoft's approach — extract knowledge graph from docs, traverse relationships for multi-hop reasoning
  • Multi-modal RAG: Retrieve images, charts, tables alongside text
  • Adaptive RAG: Route simple queries to a fast/cheap path and complex ones to the agentic path (see the sketch after this list)
  • Caching layers: Cache embeddings + frequent query results (Redis/Upstash) to cut costs 60–80%
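
Adaptive RAG is easy to prototype from pieces already defined above; a cheap triage call routes each question (the prompt wording and one-word labels are illustrative):

def adaptive_answer(question: str) -> str:
    # Triage: one-word verdict on query complexity
    verdict = llm.invoke(
        "Reply with exactly one word, simple or complex: does answering this "
        f"question require multi-hop reasoning?\n{question}"
    ).content.strip().lower()
    if "complex" in verdict:
        state = agentic_rag.invoke({
            "question": question, "search_queries": [], "retrieved_docs": [],
            "answer": None, "needs_more_search": False, "iteration": 0,
        })
        return state["answer"]
    return rag_chain.invoke(question)  # Fast, single-shot path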

RAG is mature technology now. The differentiator isn't whether you use it — it's how well you evaluate and iterate on it. Add Ragas to your CI/CD pipeline and treat retrieval quality as a first-class metric.


Explore 460+ AI agent tools including RAG infrastructure at AgDex.ai
