Gabriel Anhaia

Posted on May 23

Long-Context Models Killed RAG. Except for the 6 Cases Where They Made It Worse.

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Your PM saw the Gemini 2M context demo. They came back asking the obvious question: "why are we still chunking documents when the whole thing fits in the prompt?" You don't have a clean answer, because half the answer is "cost" and they think cost will fall, and half the answer is "accuracy" and you haven't measured it.

This post is the accuracy half. The needle-in-a-haystack benchmark that vendors stopped putting on their landing pages is public. Accuracy on retrieval-flavored queries degrades past roughly 60k tokens on every model that claims a million-token window. There are six query shapes where stuffing the whole corpus into context loses on quality before it loses on price. There are also three where long-context genuinely wins, and pretending otherwise makes you look like the engineer who still uses XML for new APIs.

The two numbers that ended the debate

Take a corpus of 500k tokens: a mid-sized SaaS product's help docs, or the last two years of a single team's design docs. A query needs to consult roughly 4k tokens of it.

Here's the math on a single query, using April 2026 list prices for three frontier models and counting input tokens only. Long-context means the full 500k corpus in the prompt. Retrieval means a vector search returning ~4k tokens of relevant chunks.

Model	Input price ($/MTok)	Long-context cost per query	Retrieval cost per query	Multiplier
GPT-4.1	$2.00	$1.00	$0.008	125x
Claude Sonnet 4.5	$3.00	$1.50	$0.012	125x
Gemini 2.5 Pro (>200k)	$2.50	$1.25	$0.010	125x

A factor of 125 per query. At 10k queries a day on the Gemini row, the long-context bill is $12,500 a day and the retrieval bill is $100. Prompt caching narrows this by maybe 5-10x in the best case where every query hits the same prefix, which most production traffic doesn't.
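To rerun the table against your own corpus size and traffic, the arithmetic fits in a few lines. This is a minimal sketch: the prices are copied from the table above, it counts input tokens only, and it ignores output tokens, embedding costs, and vector-store hosting. The constant and function names are mine, not part of any SDK.

```python
# Input prices in $ per million tokens, copied from the table above.
PRICES_PER_MTOK = {
    "GPT-4.1": 2.00,
    "Claude Sonnet 4.5": 3.00,
    "Gemini 2.5 Pro (>200k)": 2.50,
}

CORPUS_TOKENS = 500_000     # full corpus stuffed into the prompt
RETRIEVED_TOKENS = 4_000    # chunks returned by the retriever
QUERIES_PER_DAY = 10_000


def input_cost(tokens: int, price_per_mtok: float) -> float:
    """Input-side cost in dollars for one prompt of `tokens` tokens."""
    return tokens / 1_000_000 * price_per_mtok


for model, price in PRICES_PER_MTOK.items():
    long_ctx = input_cost(CORPUS_TOKENS, price)
    retrieval = input_cost(RETRIEVED_TOKENS, price)
    print(
        f"{model}: ${long_ctx:.2f} vs ${retrieval:.3f} per query "
        f"({long_ctx / retrieval:.0f}x), "
        f"${long_ctx * QUERIES_PER_DAY:,.0f} vs "
        f"${retrieval * QUERIES_PER_DAY:,.0f} per day"
    )
```

The multiplier is just the ratio of token counts, so it stays at 125x whatever the per-token price does. Falling prices shrink both bills, not the gap.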

Tail latency is the second number, and it's the one nobody puts in the slide deck. TTFT (time-to-first-token) on a 500k-token prompt sits between 8 and 25 seconds on most frontier endpoints. Retrieval with a hot vector index returns in 50-150ms and the LLM call on 4k tokens streams first token in under a second. Your p95 latency budget for a chat surface is 2 seconds. Long-context blows that budget before the model has finished reading the prompt.

So the answer to "what does long-context cost" is: 125x more money, 10-25x worse latency, and we haven't talked about quality yet.

The needle-in-a-haystack chart vendors stopped showing

In 2024, every long-context release shipped with a green-and-red heatmap: needle-in-a-haystack across context depth and position. The heatmaps were almost entirely green. By late 2025, the heatmaps stopped appearing in release posts. The benchmarks still exist. They just don't look as good once the task gets harder than "find this one sentence I planted."

The public RULER benchmark from NVIDIA (and successors like LongBench v2) tests retrieval-style tasks at varying context lengths: multi-key needle retrieval, variable tracking, aggregation, multi-hop reasoning. Across every tested model with a 1M-token window, accuracy on multi-hop and aggregation tasks drops sharply somewhere between 32k and 128k tokens, depending on the model. Single-needle stays high; the harder you make the task, the earlier the cliff.

The shape of the cliff is the point. Single fact at known position: stays near 100% out past 500k. Multi-fact reasoning where the model has to compose evidence from three locations: drops below 60% by 64k on most models tested. That's the regime your real RAG queries live in.

Case 1: Multi-hop reasoning across distant chunks

A user asks: "Did the contract amendment from March change the SLA penalty for the API tier we're on?" To answer, the model has to find the original SLA section, find the amendment, identify the API tier the user is on (probably from a separate doc), and reason about whether the amendment touched that tier.

Three retrievals, one reasoning step. With retrieval, you fetch the three relevant chunks (maybe with multi-query expansion) and the model sees them adjacent. With long-context, the model has to find all three by attention alone, in the middle of 200 other contract sections that look superficially similar. The model picks the most prominent one, the one closest to where its attention happens to land, and gives a confident wrong answer.

The fix isn't a smarter prompt. The fix is putting the three relevant chunks in front of the model and nothing else.
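One way to picture "three retrievals, one reasoning step" is a retriever that runs once per sub-question and merges the results. The sketch below is illustrative only: `keyword_retrieve` is a toy word-overlap scorer standing in for a real vector or hybrid search, the sub-questions are written by hand (in practice an LLM or a template would produce them), and the corpus is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


def keyword_retrieve(query: str, corpus: list[Chunk], k: int = 1) -> list[Chunk]:
    """Toy retriever: rank chunks by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(len(words & set(c.text.lower().split())), c) for c in corpus]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def multi_query_retrieve(
    sub_queries: list[str], corpus: list[Chunk], k: int = 1
) -> list[Chunk]:
    """Run one retrieval per sub-question and merge, dropping duplicates."""
    seen: set[str] = set()
    merged: list[Chunk] = []
    for sub_query in sub_queries:
        for chunk in keyword_retrieve(sub_query, corpus, k):
            if chunk.chunk_id not in seen:
                seen.add(chunk.chunk_id)
                merged.append(chunk)
    return merged


# Invented corpus: three relevant chunks plus one distractor.
corpus = [
    Chunk("sla-4.2", "original sla penalty section for each api tier"),
    Chunk("amend-march", "march contract amendment revising penalty terms"),
    Chunk("account-17", "customer account plan is the api tier named scale"),
    Chunk("misc-9", "office holiday schedule"),
]
sub_queries = [
    "original sla penalty section",
    "march contract amendment",
    "customer account plan",
]
context = multi_query_retrieve(sub_queries, corpus)
print([c.chunk_id for c in context])
```

The model then sees three adjacent chunks and one reasoning task, instead of hunting for them among 200 lookalike sections.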

Case 2: Contradictory sources in the corpus

Your knowledge base has the new policy doc from Q4 and the old policy doc from Q1, both still present because nobody cleaned up. A user asks about the policy. Long-context blends. The model reads both, and the response is a hedge: "the policy is X, although in some cases it may be Y." That's a hallucination disguised as nuance.

Retrieval lets you do something long-context can't: rank by freshness, dedupe by source, and pass the model one canonical chunk. The retriever takes a side. The model then writes a clean answer because it only sees one version. If you want to surface the conflict to the user, you do it as a separate "see also" panel, not as a smeared sentence inside a confident-sounding paragraph.
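A minimal sketch of the retriever "taking a side", assuming each chunk carries a `source` and an `updated_at` field (both names are hypothetical, as are the file names and dates). It keeps the newest chunk per source, hands the freshest one to the model, and returns the rest separately so the UI can show them in a "see also" panel.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyChunk:
    source: str        # originating document, e.g. a file name
    updated_at: date
    text: str


def pick_canonical(
    candidates: list[PolicyChunk],
) -> tuple[PolicyChunk, list[PolicyChunk]]:
    """Return (chunk for the model, older chunks for a 'see also' panel)."""
    if not candidates:
        raise ValueError("no candidate chunks to choose from")
    # Dedupe by source, keeping the most recently updated chunk of each.
    newest_per_source: dict[str, PolicyChunk] = {}
    for chunk in candidates:
        kept = newest_per_source.get(chunk.source)
        if kept is None or chunk.updated_at > kept.updated_at:
            newest_per_source[chunk.source] = chunk
    # Rank by freshness; the first entry is the only one the model sees.
    ranked = sorted(
        newest_per_source.values(), key=lambda c: c.updated_at, reverse=True
    )
    return ranked[0], ranked[1:]


candidates = [
    PolicyChunk("policy-q1.pdf", date(2025, 2, 10), "The policy is Y."),
    PolicyChunk("policy-q1.pdf", date(2025, 2, 10), "The policy is Y."),
    PolicyChunk("policy-q4.pdf", date(2025, 11, 3), "The policy is X."),
]
canonical, see_also = pick_canonical(candidates)
print(canonical.text, [c.source for c in see_also])
```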

Case 3: Recency-sensitive answers

"What's our current refund policy?" The corpus has every version of the policy from the last four years. Long-context reads all of them and weights by frequency or position, neither of which correlates with truth. The 2022 version is mentioned twice as often because it stuck around longest. The model leans on it.

Metadata filtering is a retrieval primitive. WHERE doc_type = 'policy' AND status = 'current' ORDER BY updated_at DESC LIMIT 1. There's no equivalent inside a single transformer forward pass. The model can't filter its own context; you have to filter for it. Long-context advocates will tell you "just put the date in the system prompt." That works for one document. With a corpus, you're back to retrieval, you just spelled it weirdly.
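The filter above runs as written against any SQL store. Here is a self-contained sketch using Python's built-in `sqlite3`, with an invented `chunks` table and made-up policy text; a production setup would more likely apply the same predicate as a metadata filter in its vector store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (doc_type TEXT, status TEXT, updated_at TEXT, body TEXT)"
)
# Invented rows: three policy versions plus an unrelated FAQ entry.
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("policy", "archived", "2022-03-01", "Refunds within 14 days."),
        ("policy", "archived", "2023-06-15", "Refunds within 21 days."),
        ("policy", "current", "2025-01-10", "Refunds within 30 days."),
        ("faq", "current", "2025-02-01", "How to contact support."),
    ],
)

# ISO-8601 date strings sort chronologically, so ORDER BY works on TEXT.
row = conn.execute(
    """
    SELECT body FROM chunks
    WHERE doc_type = 'policy' AND status = 'current'
    ORDER BY updated_at DESC
    LIMIT 1
    """
).fetchone()
current_policy = row[0]
print(current_policy)
```

Only the one surviving row reaches the prompt, so the 2022 version can't outvote the current one by sheer repetition.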

Case 4: Needle past the 60k mark

This is where the RULER chart bites. Take the same question that gets 95% accuracy at 4k tokens and ask it at 256k tokens. Accuracy drops to 60-70% on the best frontier models. At 1M, the floor is lower still. The needle is the same. The haystack got bigger and the model got less reliable.

A team I talked to recently ran their own version of this with internal docs. They put a single line into a 400k-token tech-spec dump: "The retry budget for the billing webhook is 5 attempts." They asked the model 50 different paraphrases of "how many retries does the billing webhook get." On a fresh prompt each time, accuracy was 42%. The same question against the same chunk pulled by a retriever scored 96%. The model wasn't smarter on the smaller prompt. It just stopped having to find the line.
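If you want to run the same experiment on your own docs, the harness is small. This is a sketch of the scoring loop only: `ask` is a hypothetical callable you would wire to your model client, and `fake_ask` below is a deterministic stand-in that exists to show the wiring. It does not reproduce the degradation described above.

```python
from typing import Callable


def needle_accuracy(
    ask: Callable[[str, str], str],
    context: str,
    paraphrases: list[str],
    expected: str,
) -> float:
    """Fraction of paraphrased questions whose answer contains `expected`."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    hits = sum(expected in ask(context, question) for question in paraphrases)
    return hits / len(paraphrases)


def fake_ask(context: str, question: str) -> str:
    """Stand-in for a model call; replace with a real client."""
    return "5 attempts" if "5 attempts" in context else "unknown"


needle = "The retry budget for the billing webhook is 5 attempts."
paraphrases = [
    "How many retries does the billing webhook get?",
    "What is the retry budget for the billing webhook?",
]

# Run the same questions against the full dump and against the retrieved chunk.
full_dump = "unrelated spec text. " * 1000 + needle
retrieved_chunk = needle
print(needle_accuracy(fake_ask, full_dump, paraphrases, "5 attempts"))
print(needle_accuracy(fake_ask, retrieved_chunk, paraphrases, "5 attempts"))
```

Substring matching is a crude grader. It is good enough for a planted numeric fact, but anything fuzzier needs a stricter check.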

Case 5: Span-grounded citation requirements

Compliance asks: "show me which sentence from which document the model based that claim on." Long-context can quote. It cannot reliably tell you where it quoted from. Ask "which doc and which page" and the model confabulates plausible-looking page numbers because nothing in the input forced it to track provenance.

Retrieval gives you provenance for free. Every chunk has doc_id, page, offset, bbox. The model writes its answer, your post-processing layer attaches the source span to each cited claim, and the audit log has something a regulator can trace. With long-context, you're either re-running retrieval on the model's output to back-fill citations (you'd have built RAG to begin with) or you're shipping an "AI tool with citations we made up." Don't ship that.
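A sketch of that post-processing layer, under one assumption the post doesn't spell out: the model is prompted to tag each claim with a bracketed index like `[2]` pointing at the n-th chunk it was given. The `SourceSpan` fields mirror the metadata listed above (minus `bbox`, left out for brevity); the document ids and text are invented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    page: int
    offset: int   # character offset of the chunk within the page
    text: str


CITATION = re.compile(r"\[(\d+)\]")


def attach_citations(answer: str, retrieved: list[SourceSpan]) -> list[dict]:
    """Map each [n] marker in the answer to the n-th retrieved span (1-based)."""
    citations = []
    for match in CITATION.finditer(answer):
        index = int(match.group(1))
        if not 1 <= index <= len(retrieved):
            # The model cited a chunk it was never shown: fail loudly.
            raise ValueError(f"answer cites unknown chunk [{index}]")
        span = retrieved[index - 1]
        citations.append(
            {
                "marker": match.group(0),
                "doc_id": span.doc_id,
                "page": span.page,
                "offset": span.offset,
            }
        )
    return citations


retrieved = [
    SourceSpan("msa-2023", 12, 340, "Penalties apply per calendar month."),
    SourceSpan("amendment-03", 2, 88, "The API tier penalty is capped at 5%."),
]
answer = "The penalty for the API tier is capped at 5% [2]."
audit_entries = attach_citations(answer, retrieved)
print(audit_entries)
```

The page and offset come from ingestion metadata, never from the model, which is what makes them safe to put in an audit log.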

Case 6: Structured table lookups inside PDFs

A user asks: "what was net revenue in Q3 2024?" The answer is one cell in a financial statement PDF. Long-context reads the PDF as a token stream and the table layout gets mangled. Numbers in adjacent columns get attributed to the wrong row. The model returns a plausible figure that's off by one column.

The fix is layout-aware extraction during ingestion: detect the table, convert each row to a structured record, embed the records with their column headers as context, retrieve the right row. The retrieval step doesn't even need to be vector-based. A SQL filter on extracted records beats both vector search and long-context for this query shape.
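To make the last point concrete, here is a sketch of the lookup once extraction has happened. The rows stand in for what a layout-aware extractor might emit; the schema and the figures are invented for illustration, and the extraction step itself is out of scope.

```python
import sqlite3

# Records as a table extractor might emit them (made-up figures, in $M).
extracted_rows = [
    {"metric": "net revenue", "period": "Q2 2024", "value_musd": 41.2},
    {"metric": "net revenue", "period": "Q3 2024", "value_musd": 44.7},
    {"metric": "gross revenue", "period": "Q3 2024", "value_musd": 52.9},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE financials (metric TEXT, period TEXT, value_musd REAL)"
)
# Named placeholders bind each dict's keys to the matching column.
conn.executemany(
    "INSERT INTO financials VALUES (:metric, :period, :value_musd)",
    extracted_rows,
)

row = conn.execute(
    "SELECT value_musd FROM financials WHERE metric = ? AND period = ?",
    ("net revenue", "Q3 2024"),
).fetchone()
net_revenue_q3 = row[0]
print(net_revenue_q3)
```

Because every value is keyed by its row and column header, the adjacent-column mix-up can't happen at query time. Any remaining error lives in the extractor, where you can test for it.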

The 3 cases where long-context actually wins

Retrieval isn't the answer to everything. Three shapes flip the other way.

Single-document summarization. "Summarize this 300-page contract." There's no retrieval to do. The user has already retrieved by handing you one document. Chunking and reassembling a summary loses cross-section coherence that the model gets for free when it sees the whole thing.

Conversational follow-ups over a fixed working set. User uploads three documents and asks 20 questions over the next 10 minutes. Re-running retrieval per turn adds latency and risks fetching different chunks on similar questions (inconsistency the user notices). Pin the working set in context, cache the prefix, answer fast and consistent.

Exploratory "read this whole thing and tell me what's surprising." The query has no concrete target. Retrieval needs a query vector; "what's interesting" doesn't make one. Long-context, even with degraded mid-context recall, will surface more than a retriever guessing at relevance.

A decision rule you can paste into your design doc

For each query type your product handles, walk this in order. Stop at the first yes.

Does the query specify a target like a fact, a number, a person, or a section? If yes, retrieve. The model finds targets faster when you've already narrowed the haystack.

Does it require composing evidence from multiple distant places? If yes, retrieve, and consider multi-query or decomposition before retrieval. Long-context smears multi-hop reasoning past 60k.

Does the corpus have duplicates, contradictions, or versioning? If yes, retrieve and dedupe at the retriever. The model can't take sides; the retriever can.

Does the answer need provenance, a citation a regulator or auditor will check? If yes, retrieve. Provenance is a retrieval artifact, not a generation artifact.

Is the working set a document, or a small fixed set of documents, the user already chose, and they're asking many questions over a short session? If yes, long-context with prefix caching.

Is the request exploratory with no concrete target? If yes, long-context. Retrieval needs a query.

Default: retrieve. Long-context is the special case, not the new normal. The 125x cost gap and the mid-context accuracy cliff aren't going away just because the window keeps growing.
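The rule above can be encoded as a first-match router. This is a sketch, not a classifier: the `QueryShape` flags are hypothetical names for the six questions, and deciding how to set them for a real query (by hand per query type, or with a cheap model) is the part you still have to build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryShape:
    has_concrete_target: bool = False
    needs_multi_hop: bool = False
    corpus_has_versions_or_conflicts: bool = False
    needs_provenance: bool = False
    fixed_working_set_session: bool = False
    exploratory: bool = False


def route(shape: QueryShape) -> str:
    """Walk the questions in order and stop at the first yes."""
    if shape.has_concrete_target:
        return "retrieve"
    if shape.needs_multi_hop:
        return "retrieve with multi-query or decomposition"
    if shape.corpus_has_versions_or_conflicts:
        return "retrieve and dedupe at the retriever"
    if shape.needs_provenance:
        return "retrieve"
    if shape.fixed_working_set_session:
        return "long-context with prefix caching"
    if shape.exploratory:
        return "long-context"
    return "retrieve"  # the default


print(route(QueryShape(needs_provenance=True)))
print(route(QueryShape(exploratory=True)))
print(route(QueryShape()))
```

Order matters: a query with a concrete target over a pinned working set still routes to retrieval, because the target check comes first.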

Which of the six cases hit you hardest in production, and did you fix it with better retrieval or a different model? Drop the war story in the comments.

If this was useful

This is the kind of mental model the RAG Pocket Guide sits on: query shape first, retrieval pattern second. The book walks through chunking, hybrid search, reranking, and the eval methodology you'd need to make the table at the top of this post for your own corpus instead of trusting mine. If your team is having the "do we still need RAG" conversation, the chapters on query routing and recall-vs-precision tradeoffs are the ones to read first.