
Jeff Reese

Originally published at purecontext.dev

Long Context vs RAG: When to Load the Whole Book

AI in Practice, No Fluff — Day 9/10

I have a project where every conversation and decision gets saved as a journal entry. Hundreds of entries, accumulated over weeks. When I need context from a previous session, I have two options: load every single entry into the AI's context window and ask my question, or use the embedding-based search from yesterday's post to retrieve just the relevant entries and pass only those in.

Both work, and each has its tradeoffs. The choice between them is one of the most important architectural decisions in AI applications right now.

In the first series, we covered context windows (there is always a limit) and RAG (retrieve relevant information before generating a response). Today is where those two concepts collide. Context windows have gotten dramatically larger since that series. The question is no longer "can the AI hold all of this?" It often can. The question is whether it should.

The context window got big

Less than a year ago, 200,000 tokens was considered large. That has changed. As of early 2026, Claude offers a 1 million token context window. Gemini 2.5 Pro supports 2 million tokens. GPT-4.1 handles 1 million.

To put that in perspective, 1 million tokens is roughly 750,000 words. That is longer than the entire Lord of the Rings trilogy. You could paste the whole thing in and ask the AI to find every scene where Gandalf loses his temper.

This changes the calculus completely. For many use cases, the "just load everything" approach is now physically possible where it was not before. The question shifts from "does it fit?" to "is fitting it all in the best approach?"

The cost math matters

Loading a large context is not free. Let me walk through what this actually costs.

Say you have a 500-page internal knowledge base. That is roughly 250,000 tokens. You want an AI assistant that answers questions about it.

The long context approach: You load the entire knowledge base into every API call. Using Claude Sonnet at $3 per million input tokens, each question costs about $0.75 just for the input context. If your team asks 100 questions a day, that is $75 per day, or roughly $2,250 per month. Just for input tokens, before counting the responses.

The RAG approach: You embed the knowledge base once (a few cents for the whole thing), store the vectors, and retrieve the 10 most relevant chunks per question. That is maybe 2,000 tokens of retrieved context per query. At the same $3 per million rate, each question costs $0.006 for input. One hundred questions a day is $0.60. Per day. The monthly cost is under $20.

The difference is over 100x. At scale, this is the difference between a feature that is economically viable and one that is not.

Prompt caching changes this math significantly. Claude's cached input rate drops to $0.30 per million tokens, a 90% reduction. If your knowledge base does not change between calls, caching can bring that $2,250 monthly cost down to around $225. That is much more reasonable, but still 10x what RAG costs.
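If you want to sanity-check those numbers, here is the arithmetic as a few lines of Python. The prices are the ones quoted above and will drift over time, so treat this as a template rather than a price sheet.

```python
# Back-of-the-envelope monthly input cost for the three approaches in this post.
KB_TOKENS = 250_000                 # ~500-page knowledge base
RAG_TOKENS = 2_000                  # ~10 retrieved chunks per question
QUESTIONS_PER_DAY = 100
INPUT_RATE = 3.00 / 1_000_000       # $ per input token (Claude Sonnet, as quoted above)
CACHED_RATE = 0.30 / 1_000_000      # $ per cached input token (90% reduction)

def monthly_cost(tokens_per_call: int, rate: float, days: int = 30) -> float:
    """Input-token cost per month, ignoring output tokens."""
    return tokens_per_call * rate * QUESTIONS_PER_DAY * days

print(f"long context:         ${monthly_cost(KB_TOKENS, INPUT_RATE):>8.2f}/month")
print(f"long context + cache: ${monthly_cost(KB_TOKENS, CACHED_RATE):>8.2f}/month")
print(f"RAG:                  ${monthly_cost(RAG_TOKENS, INPUT_RATE):>8.2f}/month")
# long context:         $ 2250.00/month
# long context + cache: $  225.00/month
# RAG:                  $   18.00/month
```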

The attention problem that is easy to miss

Here is the part that matters more than cost. Even when you can fit everything into the context window, the AI does not treat all of it equally.

Research from Stanford and others documented what they call the "lost in the middle" problem. When you give an AI a large amount of context, it pays the most attention to the beginning and the end. Information in the middle gets significantly less attention. In one study, accuracy dropped by over 30% when the relevant information was placed in the middle of 20 documents compared to being placed first.

This is not a minor edge case. It is a structural property of how transformer models work. Each token in the input can only attend to tokens that came before it, so tokens at the beginning accumulate attention from every subsequent token. Tokens near the end sit closest to the question, so they benefit from recency. The middle gets neither advantage. The result is a U-shaped attention curve: strong at the start, strong at the end, weaker in the middle.

Models have improved significantly, but the U-shaped attention pattern has not disappeared entirely. Techniques like placing important instructions in the system prompt help. If you dump 500 pages into the context and your answer is on page 247, the AI might miss it. Not because it cannot see it. Because it is not paying enough attention to that region.
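You can see the effect for yourself with a crude position test: bury one fact at different depths in a pile of filler and ask about it. A minimal sketch, assuming the Anthropic Python client; the model name, fact, and filler are placeholders, and with only a couple hundred filler lines any modern model will ace it, so scale the filler up toward the context limit to watch the middle positions start to slip.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FACT = "The deployment freeze for Project Kestrel begins on March 3."
QUESTION = "When does the deployment freeze for Project Kestrel begin?"
FILLER = [f"Routine note {i}: nothing unusual to report today." for i in range(200)]

def probe(depth: float) -> str:
    """Bury the fact at a relative depth (0.0 = start, 1.0 = end) and ask about it."""
    docs = FILLER.copy()
    docs.insert(int(depth * len(docs)), FACT)
    response = client.messages.create(
        model="claude-sonnet-4-5",   # assumption: substitute whatever model you use
        max_tokens=100,
        messages=[{"role": "user", "content": "\n".join(docs) + "\n\n" + QUESTION}],
    )
    return response.content[0].text

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(depth, probe(depth))
```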

RAG sidesteps this entirely. When you retrieve 5 relevant chunks and pass only those to the model, everything in the context is relevant. There is no middle to get lost in.

When long context wins

Long context is not just a brute-force option. There are cases where it is genuinely the better choice.

One-off analysis of a focused document. If someone hands you a 100-page contract and says "summarize the key obligations," loading the whole thing makes sense. You need the full context to understand how clauses reference each other. There is no retrieval step because you need all of it.

Cross-referencing across a full document. Questions like "are there any contradictions between section 3 and section 7?" require the model to see both sections simultaneously. RAG might retrieve one section but miss the other, because the query does not match both. Long context lets the model find connections you did not think to ask about.

Codebases and structured documents. When the material has internal references (code that calls other code, a specification where section 4 depends on definitions in section 2), long context preserves those relationships. Chunking for RAG can break them.

Prototyping and exploration. When you are not sure what questions you will ask, loading everything lets you explore freely. RAG requires you to know what you are looking for, at least well enough to write a query.
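For reference, the "load the whole thing" approach is barely any code at all. A minimal sketch of the contract example, again assuming the Anthropic client, a plain-text version of the document, and a placeholder model name:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The entire document goes into the prompt; "contract.txt" is a placeholder path.
contract = open("contract.txt", encoding="utf-8").read()

response = client.messages.create(
    model="claude-sonnet-4-5",   # assumption: substitute whatever model you use
    max_tokens=1000,
    messages=[{
        "role": "user",
        "content": f"{contract}\n\nSummarize the key obligations in this contract.",
    }],
)
print(response.content[0].text)
```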

When RAG wins

RAG is the right choice more often than people expect, especially in production systems.

Large and growing knowledge bases. If your data is more than a few hundred pages, or if it grows over time, RAG scales where long context does not. My journal has hundreds of entries. Loading all of them every time I need to recall a single decision would be wasteful and would hit the attention problem hard.

Repeated queries at scale. If you are building a customer support bot that handles thousands of questions a day, RAG is close to mandatory. The cost math from earlier makes this clear. Long context at that volume would be prohibitively expensive even with caching.

Citation and traceability. RAG systems can tell you exactly which source documents contributed to an answer. The retrieval step creates a natural audit trail. With long context, the model might synthesize an answer from page 12 and page 340, but it will not always tell you that clearly. If your use case requires citations (legal, medical, compliance), RAG gives you this for free.

Frequently updated data. When your knowledge base changes daily, re-embedding the changed documents is trivial. Re-loading the entire thing into every API call is not.
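To make the contrast concrete, here is a minimal retrieval sketch: embed the chunks once, store the vectors, and pull the top matches per question with cosine similarity. I am using OpenAI's embeddings endpoint as a stand-in; the chunking, model choice, and journal entries are placeholder assumptions, not my actual setup.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = [
    "2026-01-14: Decided to store journal entries as markdown files.",
    "2026-01-20: Switched embedding model after recall problems.",
    "2026-02-02: Agreed to cap retrieval at 10 chunks per query.",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(chunks)  # embed once, store the vectors

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed([question])[0]
    # Cosine similarity against every stored vector, then keep the top k chunks.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

print(retrieve("How many chunks do we retrieve per query?"))
```

The indices returned by the retrieval step are also what give you citations: you know exactly which chunks went into the answer.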

The hybrid approach

Realistically, I use both.

My journal system uses embeddings and semantic search to find the most relevant entries for a given question. That is RAG. When I start a new session and need to orient myself, I load a curated set of core context directly into the window. That is long context.

The pattern is simple. Use long context for the stable foundation that the AI always needs. Use RAG for the large, searchable pool that gets pulled in on demand. This is not a compromise. It is usually the best architecture.

Production AI systems that work well usually do some version of this. The system prompt and key instructions go in the context directly. The knowledge base gets searched and the top results get injected alongside the user's question. You get the reliability of focused context with the breadth of a large knowledge base.
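Sketched out, the hybrid pattern is just the two pieces glued together: a stable system prompt that always rides along, plus whatever the retrieval step pulls in for this question. The retrieve() stub below stands in for the embedding search from the previous sketch, and the model name is an assumption.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are the assistant for my journal project. The stable core context goes "
    "here: project goals, conventions, and standing decisions."
)

def retrieve(question: str, k: int = 5) -> list[str]:
    # Stand-in for the embedding search in the retrieval sketch above.
    return ["2026-02-02: Agreed to cap retrieval at 10 chunks per query."]

def answer(question: str) -> str:
    entries = "\n\n".join(retrieve(question))     # RAG: the searchable pool
    response = client.messages.create(
        model="claude-sonnet-4-5",                # assumption: use whatever model you prefer
        max_tokens=500,
        system=SYSTEM_PROMPT,                     # long context: the stable foundation
        messages=[{
            "role": "user",
            "content": f"Relevant journal entries:\n{entries}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text

print(answer("How many chunks do we retrieve per query?"))
```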

The decision framework

If you are trying to decide between the two, start with these questions:

How much data are you working with? Under 100 pages of focused content, try long context first. It is simpler and you avoid the complexity of building a retrieval pipeline. Over 100 pages or growing, build RAG.

How often will the data be queried? One-off analysis favors long context. Repeated queries favor RAG, because you are paying that input cost every single time.

Does the task require seeing everything at once? Cross-referencing, summarization, and contradiction-finding need full visibility. Question answering against a large corpus does not.

Do you need citations? If yes, RAG. Full stop.

Is latency a constraint? Long context calls with 500,000 tokens take noticeably longer to process. RAG queries with 2,000 tokens of retrieved context are fast.
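If it helps, here is the same framework condensed into a rough heuristic. The page threshold is the one from this post; the query-volume cutoff is a guess on my part, not a hard rule.

```python
def choose_approach(pages: int, queries_per_day: int,
                    needs_full_visibility: bool, needs_citations: bool) -> str:
    """Rough heuristic; real systems often end up hybrid."""
    if needs_citations:
        return "rag"            # citations need the retrieval audit trail
    if needs_full_visibility:
        return "long context"   # cross-referencing, contradiction-finding
    if pages > 100 or queries_per_day > 50:   # 50/day is an assumed cutoff
        return "rag"            # big or growing corpus, repeated queries
    return "long context"       # small, focused, one-off analysis

print(choose_approach(pages=100, queries_per_day=1,
                      needs_full_visibility=True, needs_citations=False))   # long context
print(choose_approach(pages=500, queries_per_day=100,
                      needs_full_visibility=False, needs_citations=False))  # rag
```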

You are already doing this

If you have ever uploaded a PDF to Claude and asked it a question, you chose the long context approach. If you have ever used a tool that searched your company's docs before answering, you benefited from RAG. You have been making this architectural decision already. The difference is whether you are making it on purpose.

That 100-page contract? Load the whole thing. The entire Confluence wiki for your 500-person company? Build a retrieval pipeline. The buzzwords are not as scary as they sound. You just did not know the names for the things you were already doing.

If there is anything I left out or could have explained better, tell me in the comments.
