DEV Community

Young Gao

RAG Is Not Dead: Advanced Retrieval Patterns That Actually Work in 2026

Every few months, someone declares RAG (Retrieval-Augmented Generation) dead. "Just use a million-token context window," they say. "Fine-tune instead," others suggest.

They're wrong. RAG isn't dead — naive RAG is dead. The pattern of "chunk documents → embed → cosine similarity → stuff into prompt" was always a prototype, not a production system. In 2026, production RAG looks radically different.

This article covers the patterns that separate toy demos from systems that actually work.

Why Naive RAG Fails

The classic RAG pipeline has predictable failure modes:

  1. Chunking destroys context — Splitting at 512 tokens breaks paragraphs, separates questions from answers, and loses document structure
  2. Embedding similarity ≠ relevance — "How do I reset my password?" and "Password reset policy" have high similarity but serve different intents
  3. Top-K retrieval is crude — The 5 most similar chunks aren't necessarily the 5 most useful
  4. No query understanding — The raw user query goes straight to vector search with no transformation

Let's fix each of these.

Pattern 1: Semantic Chunking

Instead of fixed-size chunks, split at semantic boundaries:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

chunks = chunker.split_text(document_text)

The semantic chunker computes embeddings for each sentence, then splits where the cosine distance between consecutive sentences exceeds a threshold. Sentences about the same topic stay together.
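The splitting logic itself is simple. Here's a stripped-down sketch; the `embed` function is a toy stand-in (a real implementation would call a sentence-embedding model), but the threshold logic mirrors the idea:

```python
import math

def embed(sentence: str) -> list[float]:
    # Toy stand-in: character-frequency vector. A real system would
    # call a sentence-embedding model here.
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def semantic_split(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Start a new chunk wherever consecutive sentences drift apart."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        cur = embed(sent)
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sent])  # topic shift: start a new chunk
        else:
            chunks[-1].append(sent)
        prev = cur
    return chunks
```

The percentile-based threshold in SemanticChunker does the same thing adaptively: instead of a fixed cutoff, it splits at the distances that fall in the top 10% for that document.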

Contextual Retrieval (Anthropic's Approach)

Prepend each chunk with context about where it fits in the document:

def add_context(chunk: str, full_document: str) -> str:
    """Use an LLM to generate context for each chunk."""
    prompt = f"""Given this document:
{full_document[:2000]}

And this specific chunk:
{chunk}

Write a 2-3 sentence context that explains where this chunk 
fits within the overall document. Be specific."""

    context = llm.invoke(prompt).content  # assumes a LangChain chat model; .content extracts the text
    return f"CONTEXT: {context}\n\n{chunk}"

This costs more upfront but dramatically improves retrieval accuracy — Anthropic reported 49% fewer retrieval failures.

Pattern 2: Hybrid Search

Vector search alone misses exact matches. BM25 alone misses semantic similarity. Combine them:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# Vector retriever
vector_store = Qdrant.from_documents(
    documents, embeddings,
    url="http://localhost:6333",
    collection_name="docs"
)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 10})

# BM25 retriever  
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Favor semantic for most use cases
)

The EnsembleRetriever uses Reciprocal Rank Fusion (RRF) to merge results. A document ranking #1 in vector search and #3 in BM25 scores higher than one ranking #2 in both.
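RRF itself is only a few lines. A minimal sketch (the constant 60 comes from the original RRF paper; LangChain's version additionally applies the per-retriever weights):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc IDs. Each doc scores
    sum(1 / (k + rank)) over every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

With rankings `[["A", "B"], ["C", "B", "A"]]`, doc A (ranks 1 and 3) edges out doc B (rank 2 in both), matching the intuition above.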

Adding Knowledge Graphs

For structured relationships (org charts, product hierarchies, dependency trees), add a graph layer:

from neo4j import GraphDatabase

def graph_enhanced_retrieval(query: str, vector_results: list) -> list:
    """Enrich vector results with graph context."""
    # Adjust the URI and credentials for your deployment
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    enriched = []
    with driver.session() as session:
        for doc in vector_results:
            entity = doc.metadata.get("entity")
            if entity is None:
                enriched.append(doc)
                continue

            # Find entities one hop away in the graph
            result = session.run("""
                MATCH (n)-[r]-(related)
                WHERE n.name = $entity
                RETURN related.name, type(r), related.description
                LIMIT 5
            """, entity=entity)

            context = [f"{r['type(r)']}: {r['related.name']}" for r in result]
            doc.page_content += f"\n\nRelated: {', '.join(context)}"
            enriched.append(doc)

    driver.close()
    return enriched

Pattern 3: Re-Ranking

Initial retrieval casts a wide net. Re-ranking uses a cross-encoder to score each (query, document) pair more accurately:

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Retrieve 20 candidates, re-rank to top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,  # From hybrid search above
)

results = compression_retriever.invoke("How do I configure SSO?")

Cross-encoders process the query and document together (unlike bi-encoders which encode them separately), enabling much more nuanced relevance scoring. The tradeoff is speed — which is why we re-rank a pre-filtered set rather than the entire corpus.

ColBERT v2: Best of Both Worlds

ColBERT stores per-token embeddings and uses late interaction for scoring, giving near cross-encoder accuracy at near bi-encoder speed:

from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
rag.index(
    collection=[doc.page_content for doc in documents],
    document_metadatas=[doc.metadata for doc in documents],
    index_name="my_index",
)

results = rag.search(query="SSO configuration", k=5)
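The "late interaction" scoring step (often called MaxSim) is easy to see in isolation: each query token embedding is matched against its best document token embedding, and those maxima are summed. A toy sketch with plain lists standing in for token embeddings:

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: for each query token vector,
    take its max dot product over all document token vectors, then sum."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Because document token embeddings are precomputed at index time, query time only needs these cheap max-dot-product lookups, which is where the near bi-encoder speed comes from.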

Pattern 4: Query Transformation

Don't send the raw user query to retrieval. Transform it first.

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, then search for documents similar to that answer:

def hyde_retrieval(query: str) -> list:
    # Generate hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that would answer: {query}"
    ).content  # extract the text from the chat model's message

    # Search using the hypothetical document's embedding
    # This often matches better than the question embedding
    return vector_store.similarity_search(hypothetical, k=5)

Multi-Query Expansion

Generate multiple perspectives on the same question:

from langchain.retrievers.multi_query import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
)

# Internally generates 3+ query variants and deduplicates results
results = multi_retriever.invoke("Why is our API slow?")
# Generates: "API performance issues", "latency root causes", 
# "slow response time debugging"

Step-Back Prompting

For specific questions, first ask a broader question:

def step_back_retrieval(query: str) -> list:
    # Generate a more general question
    broader = llm.invoke(
        f"Given this specific question: '{query}'\n"
        f"What is a more general question that would help answer it?"
    ).content  # extract the text from the chat model's message

    # Retrieve for both specific and general queries
    specific_docs = vector_store.similarity_search(query, k=3)
    general_docs = vector_store.similarity_search(broader, k=3)

    return deduplicate(specific_docs + general_docs)
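Note that `deduplicate` above is left undefined; a minimal version keyed on document content might look like:

```python
def deduplicate(docs: list) -> list:
    """Drop duplicate documents, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for doc in docs:
        # LangChain documents expose page_content; fall back to str() otherwise
        key = doc.page_content if hasattr(doc, "page_content") else str(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```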

Pattern 5: Agentic RAG

The biggest evolution: let the LLM decide what to retrieve and when.

from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    question: str
    documents: list
    answer: str
    needs_more_info: bool

def retrieve(state: AgentState) -> AgentState:
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def grade_documents(state: AgentState) -> AgentState:
    """Let the LLM decide if retrieved docs are sufficient."""
    grade = llm.invoke(
        f"Question: {state['question']}\n"
        f"Documents: {state['documents']}\n"
        f"Are these documents sufficient to answer the question? "
        f"Reply YES or NO with a brief reason."
    ).content
    # startswith avoids false positives from "no" appearing in the reason text
    return {"needs_more_info": grade.upper().startswith("NO")}

def generate(state: AgentState) -> AgentState:
    answer = llm.invoke(
        f"Answer based on these documents:\n"
        f"{state['documents']}\n\n"
        f"Question: {state['question']}"
    ).content
    return {"answer": answer}

def rewrite_query(state: AgentState) -> AgentState:
    better_query = llm.invoke(
        f"The following question didn't get good search results: "
        f"'{state['question']}'. Rewrite it for better retrieval."
    ).content
    return {"question": better_query}

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("rewrite", rewrite_query)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges(
    "grade",
    lambda s: "rewrite" if s["needs_more_info"] else "generate",
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()

This agent retrieves, evaluates quality, rewrites the query if needed, and only generates when it has sufficient context.

Pattern 6: Evaluation with RAGAS

You can't improve what you don't measure:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_dataset,  # Questions + ground truth answers
    metrics=[
        faithfulness,       # Is the answer grounded in retrieved context?
        answer_relevancy,   # Does the answer address the question?
        context_precision,  # Are the retrieved docs relevant?
        context_recall,     # Did we retrieve all necessary info?
    ],
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 
#  'context_precision': 0.85, 'context_recall': 0.79}

Track these metrics over time. When you change chunking strategy, embedding model, or retrieval pipeline, you'll know immediately if it helped.

RAG vs. Fine-Tuning vs. Long Context

When to use each:

| Approach | Best For | Limitations |
| --- | --- | --- |
| RAG | Dynamic data, source attribution, cost control | Retrieval quality ceiling |
| Fine-tuning | Teaching style/format, specialized domains | Stale data, no source citation |
| Long context | Small corpora (<100 docs), one-shot analysis | Cost at scale, attention degradation |

The sweet spot for most production systems: RAG + selective fine-tuning. Fine-tune for domain language and response style. Use RAG for up-to-date facts and source attribution.

Production Tips

  1. Cache aggressively — Cache embeddings, cache LLM re-ranking calls, cache final answers for repeated queries
  2. Stream the answer — Start generating as soon as retrieval completes; don't wait for re-ranking if latency matters
  3. Monitor retrieval quality — Log which chunks were retrieved and whether users found answers helpful
  4. Use metadata filters — Filter by date, department, document type before vector search to reduce noise
  5. Implement fallback — If RAG confidence is low, fall back to a direct LLM response with a disclaimer

RAG in 2026 is a pipeline engineering challenge, not a simple API call. But get it right, and you have a system that's accurate, attributable, and cost-effective at scale.


If this article helped you, consider buying me a coffee on Ko-fi! Follow me for more AI engineering content.

