Every few months, someone declares RAG (Retrieval-Augmented Generation) dead. "Just use a million-token context window," they say. "Fine-tune instead," others suggest.
They're wrong. RAG isn't dead — naive RAG is dead. The pattern of "chunk documents → embed → cosine similarity → stuff into prompt" was always a prototype, not a production system. In 2026, production RAG looks radically different.
This article covers the patterns that separate toy demos from systems that actually work.
## Why Naive RAG Fails
The classic RAG pipeline has predictable failure modes:
- Chunking destroys context — Splitting at 512 tokens breaks paragraphs, separates questions from answers, and loses document structure
- Embedding similarity ≠ relevance — "How do I reset my password?" and "Password reset policy" have high similarity but serve different intents
- Top-K retrieval is crude — The 5 most similar chunks aren't necessarily the 5 most useful
- No query understanding — The raw user query goes straight to vector search with no transformation
Let's fix each of these.
## Pattern 1: Semantic Chunking
Instead of fixed-size chunks, split at semantic boundaries:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

chunks = chunker.split_text(document_text)
```
The semantic chunker computes embeddings for each sentence, then splits where the cosine distance between consecutive sentences exceeds a threshold. Sentences about the same topic stay together.
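The split logic itself fits in a few lines. This is a simplified sketch, not the library's actual implementation — the `embed` function and the fixed threshold stand in for a real embedding model and the percentile-based breakpoint:

```python
from math import sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def semantic_split(sentences, embed, threshold=0.5):
    """Group consecutive sentences; start a new chunk whenever the
    cosine distance between neighboring sentences exceeds the threshold."""
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With toy 2-d embeddings, two sentences about pets land in one chunk and an unrelated sentence starts a new one.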
### Contextual Retrieval (Anthropic's Approach)
Prepend each chunk with context about where it fits in the document:
```python
def add_context(chunk: str, full_document: str) -> str:
    """Use an LLM to generate context for each chunk."""
    prompt = f"""Given this document:
{full_document[:2000]}

And this specific chunk:
{chunk}

Write a 2-3 sentence context that explains where this chunk
fits within the overall document. Be specific."""
    # Assumes `llm` is a LangChain chat model; .content extracts the text
    context = llm.invoke(prompt).content
    return f"CONTEXT: {context}\n\n{chunk}"
```
This costs more upfront but dramatically improves retrieval accuracy — Anthropic reported up to 49% fewer retrieval failures when contextual embeddings are combined with contextual BM25.
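The upfront cost is worth estimating before you commit. A back-of-envelope sketch — the per-token prices and token counts below are illustrative assumptions, not current rates:

```python
# Illustrative prices (USD per 1M tokens) -- check your provider's pricing.
PRICE_PER_1M_INPUT_TOKENS = 0.25
PRICE_PER_1M_OUTPUT_TOKENS = 1.25

def contextualization_cost(num_chunks, doc_prefix_tokens=2000,
                           chunk_tokens=300, context_tokens=60):
    """One LLM call per chunk: document prefix + chunk in,
    a 2-3 sentence context out."""
    input_tokens = num_chunks * (doc_prefix_tokens + chunk_tokens)
    output_tokens = num_chunks * context_tokens
    return (input_tokens / 1e6 * PRICE_PER_1M_INPUT_TOKENS
            + output_tokens / 1e6 * PRICE_PER_1M_OUTPUT_TOKENS)

print(f"${contextualization_cost(10_000):.2f} for 10k chunks")
```

Under these assumptions, contextualizing an entire 10k-chunk corpus is a one-time cost of a few dollars — cheap relative to the ongoing cost of bad retrieval.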
## Pattern 2: Hybrid Search
Vector search alone misses exact matches. BM25 alone misses semantic similarity. Combine them:
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Qdrant

# Vector retriever
vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    url="http://localhost:6333",
    collection_name="docs",
)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 10})

# BM25 retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Combine with Reciprocal Rank Fusion
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Favor semantic for most use cases
)
```
The EnsembleRetriever uses Reciprocal Rank Fusion (RRF) to merge results. A document ranking #1 in vector search and #3 in BM25 scores higher than one ranking #2 in both.
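RRF itself is simple enough to sketch. In this toy version the document IDs are made up and `k=60` is the customary smoothing constant; it shows why agreement across retrievers beats a single top rank:

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Merge ranked lists of document IDs: each document scores
    weight * 1 / (k + rank), summed across all lists it appears in."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking], weights=[0.6, 0.4])
```

Here `doc_a` and `doc_b` both appear in both lists, so they outrank `doc_c` and `doc_d`, which each appear in only one — exactly the behavior you want from fusion.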
### Adding Knowledge Graphs
For structured relationships (org charts, product hierarchies, dependency trees), add a graph layer:
```python
from neo4j import GraphDatabase

def graph_enhanced_retrieval(query: str, vector_results: list) -> list:
    """Enrich vector results with graph context."""
    enriched = []
    # Context manager closes the driver; add auth=(...) if your DB needs it
    with GraphDatabase.driver("bolt://localhost:7687") as driver:
        for doc in vector_results:
            # Find entities related to this document in the graph
            with driver.session() as session:
                result = session.run("""
                    MATCH (n)-[r]-(related)
                    WHERE n.name = $entity
                    RETURN related.name, type(r), related.description
                    LIMIT 5
                """, entity=doc.metadata.get("entity"))
                context = [f"{r['type(r)']}: {r['related.name']}" for r in result]
            if context:
                doc.page_content += f"\n\nRelated: {', '.join(context)}"
            enriched.append(doc)
    return enriched
```
## Pattern 3: Re-Ranking
Initial retrieval casts a wide net. Re-ranking uses a cross-encoder to score each (query, document) pair more accurately:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Retrieve 20 candidates, re-rank to top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble,  # From hybrid search above
)

results = compression_retriever.invoke("How do I configure SSO?")
```
Cross-encoders process the query and document together (unlike bi-encoders which encode them separately), enabling much more nuanced relevance scoring. The tradeoff is speed — which is why we re-rank a pre-filtered set rather than the entire corpus.
### ColBERT v2: Best of Both Worlds
ColBERT stores per-token embeddings and uses late interaction for scoring, giving near cross-encoder accuracy at near bi-encoder speed:
```python
from ragatouille import RAGPretrainedModel

rag = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

rag.index(
    collection=[doc.page_content for doc in documents],
    document_metadatas=[doc.metadata for doc in documents],
    index_name="my_index",
)

results = rag.search(query="SSO configuration", k=5)
```
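"Late interaction" sounds exotic but reduces to a max-over-dot-products, often called MaxSim. A toy sketch with hand-made 2-d token embeddings (real ColBERT vectors are 128-d and normalized):

```python
def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for each query token embedding,
    take its best-matching document token, then sum those maxima."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_vecs)
        for q_vec in query_vecs
    )

query = [[1.0, 0.0], [0.0, 1.0]]     # two query token embeddings
doc_good = [[0.9, 0.1], [0.1, 0.9]]  # has a close match for both tokens
doc_weak = [[0.9, 0.1], [0.8, 0.2]]  # only matches the first token well
```

Because each query token is scored independently, a document must cover *all* aspects of the query to score highly — which is what gives ColBERT its near cross-encoder accuracy without running a full joint forward pass per pair.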
## Pattern 4: Query Transformation
Don't send the raw user query to retrieval. Transform it first.
### HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, then search for documents similar to that answer:
```python
def hyde_retrieval(query: str) -> list:
    # Generate a hypothetical answer
    hypothetical = llm.invoke(
        f"Write a short paragraph that would answer: {query}"
    ).content
    # Search using the hypothetical document's embedding;
    # it often matches real answers better than the question does
    return vector_store.similarity_search(hypothetical, k=5)
```
### Multi-Query Expansion
Generate multiple perspectives on the same question:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(),
    llm=llm,
)

# Internally generates 3+ query variants and deduplicates results
results = multi_retriever.invoke("Why is our API slow?")
# Generates variants like: "API performance issues",
# "latency root causes", "slow response time debugging"
```
### Step-Back Prompting
For specific questions, first ask a broader question:
```python
def step_back_retrieval(query: str) -> list:
    # Generate a more general question
    broader = llm.invoke(
        f"Given this specific question: '{query}'\n"
        f"What is a more general question that would help answer it?"
    ).content
    # Retrieve for both the specific and the general query
    specific_docs = vector_store.similarity_search(query, k=3)
    general_docs = vector_store.similarity_search(broader, k=3)
    # Deduplicate by content, preserving order
    seen, unique = set(), []
    for doc in specific_docs + general_docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique
```
## Pattern 5: Agentic RAG
The biggest evolution: let the LLM decide what to retrieve and when.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict

class AgentState(TypedDict):
    question: str
    documents: list
    answer: str
    needs_more_info: bool

def retrieve(state: AgentState) -> dict:
    docs = retriever.invoke(state["question"])
    return {"documents": docs}

def grade_documents(state: AgentState) -> dict:
    """Let the LLM decide if the retrieved docs are sufficient."""
    grade = llm.invoke(
        f"Question: {state['question']}\n"
        f"Documents: {state['documents']}\n"
        f"Are these documents sufficient to answer the question? "
        f"Reply with exactly YES or NO."
    ).content
    return {"needs_more_info": grade.strip().upper().startswith("NO")}

def generate(state: AgentState) -> dict:
    answer = llm.invoke(
        f"Answer based on these documents:\n"
        f"{state['documents']}\n\n"
        f"Question: {state['question']}"
    ).content
    return {"answer": answer}

def rewrite_query(state: AgentState) -> dict:
    better_query = llm.invoke(
        f"The following question didn't get good search results: "
        f"'{state['question']}'. Rewrite it for better retrieval."
    ).content
    return {"question": better_query}

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("rewrite", rewrite_query)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges(
    "grade",
    lambda s: "rewrite" if s["needs_more_info"] else "generate",
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()
```
This agent retrieves, evaluates quality, rewrites the query if needed, and only generates when it has sufficient context.
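One caveat: a grade/rewrite/retrieve cycle with no exit condition can loop forever on a question the corpus simply can't answer. The control flow with a rewrite cap can be sketched in plain Python — the `retrieve`, `grade`, `rewrite`, and `generate` callables here are stand-ins for the graph nodes above:

```python
MAX_REWRITES = 2

def answer_with_retries(question, retrieve, grade, rewrite, generate):
    """Retrieve, grade, and rewrite at most MAX_REWRITES times;
    on the final attempt, answer with whatever was retrieved."""
    for attempt in range(MAX_REWRITES + 1):
        docs = retrieve(question)
        if grade(question, docs) or attempt == MAX_REWRITES:
            return generate(question, docs)
        question = rewrite(question)
```

In LangGraph proper you would achieve the same thing by tracking an attempt counter in `AgentState` and routing to `generate` once it is exhausted.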
## Pattern 6: Evaluation with RAGAS
You can't improve what you don't measure:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_dataset,  # Questions + ground truth answers
    metrics=[
        faithfulness,       # Is the answer grounded in retrieved context?
        answer_relevancy,   # Does the answer address the question?
        context_precision,  # Are the retrieved docs relevant?
        context_recall,     # Did we retrieve all necessary info?
    ],
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.79}
```
Track these metrics over time. When you change chunking strategy, embedding model, or retrieval pipeline, you'll know immediately if it helped.
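That tracking can start as a few lines of plumbing. A minimal sketch — the file path, JSONL format, and 0.02 tolerance are arbitrary choices, and in practice you would wire this into CI:

```python
import json
from pathlib import Path

HISTORY = Path("ragas_history.jsonl")

def record_and_compare(run_name, metrics, tolerance=0.02):
    """Append this evaluation run to a history file and return the
    names of any metrics that dropped past the tolerance vs. last run."""
    previous = None
    if HISTORY.exists():
        lines = HISTORY.read_text().splitlines()
        if lines:
            previous = json.loads(lines[-1])["metrics"]
    with HISTORY.open("a") as f:
        f.write(json.dumps({"run": run_name, "metrics": metrics}) + "\n")
    if previous is None:
        return []
    return [name for name, value in metrics.items()
            if value < previous.get(name, 0.0) - tolerance]
```

Fail the build when the returned list is non-empty, and a chunking change that silently tanks `context_recall` never reaches production.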
## RAG vs. Fine-Tuning vs. Long Context
When to use each:
| Approach | Best For | Limitations |
|---|---|---|
| RAG | Dynamic data, source attribution, cost control | Retrieval quality ceiling |
| Fine-tuning | Teaching style/format, specialized domains | Stale data, no source citation |
| Long context | Small corpora (<100 docs), one-shot analysis | Cost at scale, attention degradation |
The sweet spot for most production systems: RAG + selective fine-tuning. Fine-tune for domain language and response style. Use RAG for up-to-date facts and source attribution.
## Production Tips
- Cache aggressively — Cache embeddings, cache LLM re-ranking calls, cache final answers for repeated queries
- Stream the answer — Start generating as soon as retrieval completes; don't wait for re-ranking if latency matters
- Monitor retrieval quality — Log which chunks were retrieved and whether users found answers helpful
- Use metadata filters — Filter by date, department, document type before vector search to reduce noise
- Implement fallback — If RAG confidence is low, fall back to a direct LLM response with a disclaimer
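The first tip can start as small as a hash-keyed in-memory cache around the embedding call. A sketch — `embed_fn` is a stand-in for your real embedding client, and the dict would become Redis or similar in production:

```python
import hashlib

class CachedEmbedder:
    """Memoize embeddings by content hash so re-indexing
    unchanged chunks costs no additional API calls."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0  # counts actual calls to the underlying embedder

    def embed(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]
```

Hashing the content (rather than, say, a chunk ID) means a chunk whose text changes is automatically re-embedded, while a re-ordered or re-chunked document reuses every chunk that survived intact.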
RAG in 2026 is a pipeline engineering challenge, not a simple API call. But get it right, and you have a system that's accurate, attributable, and cost-effective at scale.