- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You built a RAG pipeline in 2023. It shipped. Somebody got a bonus for it. In 2026 it is embarrassing.
Nothing is broken. The embeddings still come back. The vector DB still returns five neighbors. The model still generates a paragraph with a citation at the end. The latency is fine. The cost is fine. The roadmap has simply moved on to a different problem.
The bar moved. That is the whole story.
The pipeline you drew on a whiteboard at the peak of the first hype wave (chunk, embed, retrieve, augment, generate) was a demo that happened to clear the bar because the bar was on the floor. Put it next to what a serious team ships today and it looks like a flip phone next to a satellite uplink. The stages are nominally the same. Every single one has been rebuilt.
This post is a diff. The old stack on the left. The new stack on the right. Opinionated picks for what actually matters.
The 2023 stack in one picture
Draw it from memory. You have drawn it a hundred times.
```
query → embed → ANN search → top_k chunks → prompt → LLM → answer
```
Five stages. One embedding model. One similarity metric. One prompt template. One shot at the vector DB. If the retrieval came back wrong, the model hallucinated on top of it and called the hallucination a citation. You shipped this. Everyone shipped this.
The failure modes were the ones you already know. Queries that did not match the embedding space of the corpus. Chunk boundaries that split a table in half. Synonyms the embedding model had never seen. The model confidently quoting a passage it invented because the retrieval returned nothing useful and nobody told it to stop.
The 2026 stack keeps the five-word outline. Everything between the arrows changed.
Stage 1: Query rewriting, not raw query
The user types "why is billing broken". You embed those four words and you get a vector that lives in the same neighborhood as every other rant about billing in your corpus. Useless.
The new stack does not embed the user's query. It embeds a rewritten query, or several.
```python
REWRITE_PROMPT = """Rewrite the user question into 3 retrieval
queries optimized for semantic search. Prefer specific nouns,
product names, error codes. Drop filler. Output one JSON array
of strings.

User question: {q}"""

rewrites = await llm.structured(
    REWRITE_PROMPT.format(q=user_query),
    schema=list[str],
)
# ["invoice generation failure enterprise plan",
#  "payment method charge declined 3DS",
#  "billing webhook retry exponential backoff"]
```
Three queries, three retrievals, one merged candidate set. The technique is old (HyDE, query expansion, step-back prompting). In 2023 most production pipelines skipped it because the latency budget was tight and the rewrites were done by a 175B model that cost real money per call. In 2026 you do the rewrite with a Haiku-class or Flash-class model that returns in 200 ms for a fraction of a cent, and you keep it on the critical path because every measurement you run says recall goes up more than latency does.
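The merge itself is unglamorous. A minimal sketch, assuming each retrieval returns a ranked list of `(doc_id, text)` tuples (hypothetical shapes; your client's result objects will differ): round-robin across the lists so each rewrite contributes its best hits first, and dedupe by document id.

```python
def merge_candidates(result_lists, limit=50):
    """Merge per-rewrite retrieval results into one candidate set.

    Round-robin across the ranked lists so every rewrite contributes
    its top hits before any list's tail; dedupe by document id.
    """
    seen, merged = set(), []
    for rank in range(max((len(r) for r in result_lists), default=0)):
        for results in result_lists:
            if rank < len(results):
                doc_id, text = results[rank]
                if doc_id not in seen:
                    seen.add(doc_id)
                    merged.append((doc_id, text))
    return merged[:limit]
```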
If you are running a single raw-query retrieval in front of your LLM in 2026, you are leaving recall on the floor.
Stage 2: Hybrid search, not pure dense
Pure dense vector search lost. It is still in the stack. It is not the whole stack anymore.
The problem with dense-only retrieval was always the same: embeddings are great at fuzzy semantic match and terrible at exact-token match. A query for error code EPERM-4471 retrieves documents that talk about "permission errors" in general and misses the runbook that contains the literal string. A query for product name Velocity Pro retrieves documents about "fast professional" anything and misses the page about the actual product.
The 2026 default is hybrid: BM25 (or BM25F, or SPLADE) running in parallel with a dense retriever, results merged by Reciprocal Rank Fusion. Both Weaviate and Qdrant ship this as a first-class API. Weaviate's hybrid query takes a single alpha parameter that slides between keyword-only and vector-only. Qdrant exposes it as a fusion query over a sparse and a dense vector.
```python
# Weaviate hybrid search, alpha=0.5 splits keyword/vector 50/50
results = (
    client.collections.get("Docs")
    .query.hybrid(
        query=rewritten_query,
        alpha=0.5,
        limit=50,
    )
)
```
Weaviate's own 2026 benchmarks put hybrid at 93.2% recall on a 1M-vector corpus with sub-100ms latency, and the numbers from production teams running A/B tests are flatly one-directional: hybrid beats dense-only on nearly every corpus that contains code, product names, IDs, error codes, or proper nouns. Which is every corpus that is not an art history textbook.
The Weaviate team's own write-up on why hybrid wins puts the mechanism in one line: sparse retrieval covers exact-match recall, dense retrieval covers semantic recall, and RRF merges the two without needing either ranker to be calibrated to the other.
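RRF itself fits in a few lines. A sketch, assuming doc ids are hashable and `k=60` (the constant from the original RRF paper, and the common default):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    score(d) = sum over lists of 1 / (k + rank_of_d_in_list).
    Only ranks matter, never raw scores, which is why the sparse and
    dense rankers need no calibration against each other.
    """
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```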
You pull 50 candidates here, not 5. The next stage thins them down.
Stage 3: Reranking is non-optional
This is the biggest delta between 2023 and 2026. In 2023, reranking was an optimization. In 2026, it is the load-bearing stage that separates a pipeline that hallucinates from a pipeline that doesn't.
The logic is straightforward. Bi-encoders are the models producing your dense embeddings. They are fast because they embed the query and the document independently and compare with a dot product. That is exactly why they are imprecise: the model never gets to look at the query and the document together. A cross-encoder scores query-document pairs jointly, which is slower per pair and radically more accurate, which is why you use it on 50 candidates instead of 50 million.
Two picks worth knowing.
Cohere Rerank 3.5. Announced December 2024, still the commercial default. 4096-token context, state-of-the-art on BEIR. Cohere's own numbers have it 23.4% better than hybrid search and 30.8% better than BM25 on financial-domain retrieval. Cohere has since shipped Rerank 4.0, but 3.5 is the one most production systems actually run in April 2026 because it is in every managed vector DB integration.
BAAI/bge-reranker-v2-m3. The open-weights option. Multilingual, small enough to self-host on a single A10, produces a similarity score you can sigmoid into [0, 1]. This is what you run when the Cohere API bill is not the answer your CFO wants to hear.
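A minimal self-hosted sketch, with the scorer injected so the shape of the stage is visible without a model download. In production `score_fn` would wrap the bge-reranker forward pass via whatever cross-encoder library you host it with; that wiring is an assumption, so check the model card for exact usage.

```python
import math

def rerank_local(query, texts, score_fn, top_n=5):
    """Rerank candidate texts with any cross-encoder scoring function.

    score_fn(query, text) returns a raw logit; the sigmoid squashes it
    into [0, 1] so downstream thresholds stay comparable. In production
    score_fn would wrap bge-reranker-v2-m3 (assumed wiring).
    """
    scored = [(1.0 / (1.0 + math.exp(-score_fn(query, t))), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:top_n]]
```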
The code is dull. That is the point.
```python
from cohere import Client

co = Client()
reranked = co.rerank(
    model="rerank-v3.5",
    query=rewritten_query,
    documents=[c.text for c in candidates],  # 50 from hybrid
    top_n=5,
)
# reranked.results is a list of {index, relevance_score}
top = [candidates[r.index] for r in reranked.results]
```
Fifty candidates in, five out. The five that go into the prompt are the five a cross-encoder judged to actually answer the query, not the five that happened to land closest in a 1536-dimensional vector space where half the axes were trained on Reddit.
Skip reranking in 2026 and your product will be outcompeted by a product that didn't.
Stage 4: Agentic RAG, not one-shot retrieval
Here is where the architecture itself changes shape.
The 2023 pipeline was a function. Query in, answer out. Retrieval happened exactly once, exactly at the start, whether the query needed it or not. A user typing "thanks, that worked" hit the same five-stage pipeline as a user typing a complex technical question. The vector DB got pounded for nothing. Half the calls added noise to the context.
The 2026 pipeline is an agent. The LLM decides.
```python
SYSTEM = """You answer user questions. You have a tool:
  retrieve(query: str) -> list[Document]
Use it when you don't already know the answer from the
conversation. You can call it multiple times with different
queries. You can choose not to call it at all. When you have
enough to answer, write the answer with citations to the docs
you retrieved. If you cannot answer, say so."""

async def agentic_rag(user_msg, history):
    msgs = history + [{"role": "user", "content": user_msg}]
    while True:
        resp = await llm.chat(system=SYSTEM, messages=msgs,
                              tools=[retrieve_tool])
        if not resp.tool_calls:
            return resp.content
        # the assistant turn that requested the tools must land in the
        # transcript before the tool results that answer it
        msgs.append(resp.to_message())
        for call in resp.tool_calls:
            docs = await retrieve(call.args["query"])
            msgs.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": format_docs(docs)})
```
Retrieval is a tool call, not a stage. The model decides whether to call it, what to call it with, and when to stop. Typical sessions call retrieve zero times, one time, or four times depending on the question. A "hi" query costs one LLM call. A multi-hop technical question triggers a retrieve on the main topic, a retrieve on an entity surfaced by the first result, and a retrieve on a follow-up the model reasons into on its own.
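The `retrieve_tool` the loop passes to the model is nothing more than a declared schema. A sketch in the JSON-schema shape most chat tool-use APIs accept (the `input_schema` key mirrors Anthropic's convention; other providers name the fields differently):

```python
# hypothetical tool declaration for the retrieve() function used above
retrieve_tool = {
    "name": "retrieve",
    "description": ("Search the document corpus with a focused query. "
                    "Returns the most relevant documents."),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Specific nouns, product names, error codes.",
            },
        },
        "required": ["query"],
    },
}
```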
The benchmark numbers on this are harder to trust than the others. Every framework vendor has a chart showing their agentic RAG doubles accuracy on MultiHopQA. Still, the direction is consistent. Traditional pipelines plateau at roughly a third on multi-step queries. Agent-controlled retrieval with routing and self-correction gets into the 70s. Microsoft's own agentic-RAG guide and the ByteByteGo writeup are both worth reading for the mechanism.
The operational consequence is that your trace tree now looks different. You had a linear five-span chain in 2023. You have a span tree with a variable-depth tool-call loop in 2026. If your observability is not built for that, you are back to the green-trace problem that has been the story of LLM ops since GPT-4 shipped.
Stage 5: Context engineering, not chunking
In 2023, chunking was the whole conversation. Fixed size? Semantic? Recursive splitter? Sentence-aware? Whole blog posts were written about whether to overlap 10% or 20%.
The 2026 word is context engineering. The question moved from "how do I split documents" to "given a 200K-token budget, what goes into the prompt and in what order, and what structure helps the model reason over it?"
Three things changed to make this the real question.
First, context windows grew. Claude Sonnet 4.5 ships a 1M-token window. Gemini 2.5 Pro shipped in March 2025 with 1M and a 2M expansion that has yet to reach general availability. The bottleneck stopped being "how do I fit enough into 4096 tokens" and became "when I have room for a whole book, what should I actually put there?"
Second, long-context models exposed a failure mode the short-context era never had to worry about: position sensitivity. The infamous lost-in-the-middle paper replicated across every frontier model. You cannot dump 800 chunks into a million-token prompt and expect the model to find the one that matters. You can dump 12 chunks, ordered from most to least relevant, with explicit section headers, and get a noticeably better answer than 800 unordered chunks.
Third, the economics changed. A million-token Gemini call is not free. The retrieval cost is small, the LLM call is large, and context you did not need is money you set on fire. Dumping more is no longer a shortcut. It is the expensive path.
So the new stage looks like this:
```python
def build_context(reranked_docs, max_tokens=32_000):
    # ranking by reranker score is already done;
    # attach provenance and section headers
    blocks = []
    used = 0
    for i, doc in enumerate(reranked_docs):
        header = (f"[doc_{i}] source={doc.source} "
                  f"section={doc.section} score={doc.score:.2f}")
        block = f"{header}\n{doc.text}"
        tokens = count_tokens(block)
        if used + tokens > max_tokens:
            break
        blocks.append(block)
        used += tokens
    # put the two highest-scoring blocks first and last (position sensitivity)
    if len(blocks) > 3:
        blocks = [blocks[0]] + blocks[2:] + [blocks[1]]
    return "\n\n---\n\n".join(blocks)
```
Ordering matters. Provenance matters. Staying inside a soft context budget (well under the hard context-window limit) matters. As for when to keep RAG at all in a long-context world: pure long-context has been consistently worse than RAG plus long-context on every public benchmark that measured answer quality rather than recall, which is why hybrid retrieval + reranker + 32K of ordered context is the default over "just paste everything."
The diff in one table
| Stage | 2023 | 2026 |
|---|---|---|
| Query | raw user text | LLM-rewritten, 1–3 variants |
| Retrieval | dense-only, top-5 | hybrid (BM25 + dense), top-50, RRF |
| Rerank | skipped | Cohere Rerank 3.5 or bge-reranker-v2-m3, top-5 |
| Control | one-shot pipeline | agent with a retrieve tool |
| Context | chunk-and-stuff | ordered, headered, budgeted |
| Tracing | none or ad-hoc | OTel GenAI semconv, span tree per retrieval |
Every row is a concrete engineering change. Every row changes what a trace looks like when something goes wrong.
What actually matters
If you can only afford one upgrade from the 2023 stack, it is the reranker. Not the agent, not the long context, not the hybrid retrieval. A reranker in front of a dumb dense retriever will improve the top-5 quality of your retrieved chunks more than any other single change you can make. It is the cheapest precision lever you have.
If you can afford two, add hybrid search. BM25 is decades old and it still catches things your 1536-dimensional transformer embeddings miss, and the engineering effort to bolt it on is one config flag in Weaviate or Qdrant.
Agentic RAG is third on the list for a reason. The ceiling is the highest and the floor is the lowest. You need good tool-use instrumentation and real evals before you turn an agent loose on your vector DB in production; otherwise you are one bad prompt away from a retrieval storm that looks like a successful trace and costs you a thousand dollars an hour. The $47K LangChain loop is the reference incident. If you are not ready to operate an agent, ship everything else first.
Context engineering is the one that silently separates good teams from great teams. It does not show up on a benchmark. It shows up when the same query gets a better answer from your product than from your competitor's because you ordered the chunks right, attached provenance, and budgeted tokens like you budget memory.
RAG is dead. Long live RAG.
The five-stage arrow diagram from 2023 is not wrong. It is a sketch of a car from a patent filing. The car that rolled off the line in 2026 has the same silhouette and none of the internals.
You should rebuild yours.
If this was useful
A RAG pipeline is nothing but a distributed system where every stage can lie quietly and every trace can come back green while the answer is wrong. That is the problem the book is about.
- Book: Observability for LLM Applications — Chapter 7 is 18 pages on tracing RAG: the four-span shape, the attributes the OTel GenAI semantic conventions give you for retrieval and reranking, and the eval signals you need to catch retrieval drift before your users do. Paperback and hardcover now · ebook from April 22.
- Hermes IDE: hermes-ide.com — if you ship with Claude Code and other AI coding tools, this is what I am building for you. GitHub here.
- Me: xgabriel.com · github.com/gabrielanhaia.