A Question Worth Taking Seriously
Gemini 1.5 Pro supports a 1-million-token context window. Claude 3.5 Sonnet handles 200K tokens, and GPT-4 Turbo handles 128K. A small novel fits in a single prompt. So people ask: is RAG still necessary?
The question deserves a real answer, because it hides a genuine engineering decision: for a production system, should I use RAG or long context?
The Numbers
Large language model context windows (2024–2025):
| Model | Context Window | Approximate text |
|---|---|---|
| Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words, ~1500 pages |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words, ~300 pages |
| GPT-4 Turbo | 128,000 tokens | ~96,000 words, ~190 pages |
| GPT-4o | 128,000 tokens | ~96,000 words, ~190 pages |
This looks like a lot. But how much content does a real knowledge base have?
- A mid-sized company's internal documentation: thousands of documents, millions of words
- A large codebase: tens of thousands of files, often hundreds of millions of tokens
- A news or research database: millions of articles
All of these exceed any model's context window. That is the hard ceiling on long context.
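A quick way to check whether a corpus could fit at all is to convert its word count to tokens and compare against the window sizes in the table. The sketch below assumes roughly 0.75 English words per token (the ratio the table implies); the corpus sizes are made-up examples, and real tokenizers will differ by language and content.

```python
# Context window sizes (tokens) copied from the table above.
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 1_000_000,
    "Claude 3.5 Sonnet": 200_000,
    "GPT-4 Turbo": 128_000,
    "GPT-4o": 128_000,
}

# Assumption: ~0.75 English words per token, the ratio implied by the table.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count."""
    return round(word_count / WORDS_PER_TOKEN)


def models_that_fit(word_count: int) -> list[str]:
    """Return the models whose window could hold the whole corpus at once."""
    tokens = estimate_tokens(word_count)
    return [name for name, window in CONTEXT_WINDOWS.items() if tokens <= window]


# Hypothetical corpus: 2,000 documents of 1,500 words each (3 million words).
corpus_words = 2_000 * 1_500
print(estimate_tokens(corpus_words))  # 4000000
print(models_that_fit(corpus_words))  # []
print(models_that_fit(90_000))        # a ~90,000-word novel fits in all four
```

Even a modest 2,000-document knowledge base lands at about 4M tokens under this estimate, four times the largest window listed.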
The Real Cost of Long Context
"Bigger window" doesn't mean "free." Every request processes every token, and the cost is real.
Cost 1: Money
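Rough estimates at late 2024 pricing (input tokens only). Of the three, only Gemini 1.5 Pro accepts 1M tokens in a single request; for the others, read the last column as the cost of processing 1M input tokens in total: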
| Model | Price per 1M tokens | 1M token request |
|---|---|---|
| Gemini 1.5 Pro | $1.25 | $1.25 |
| Claude 3.5 Sonnet | $3.00 | $3.00 |
| GPT-4 Turbo | $10.00 | $10.00 |
Compare to RAG:
- Retrieval phase: Embedding API only (< $0.001)
- Generation phase: 2,000–5,000 tokens of retrieved context + question (< $0.05)
Even at that $0.05 upper bound, RAG costs roughly 25–200× less than processing 1M input tokens for the same question.
At 1,000 user queries per day against an enterprise knowledge base:
- Long context (1M tokens): ~$1,250/day
- RAG (3K token context): ~$3–15/day
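The daily figures are simple arithmetic, sketched below so you can plug in your own numbers. It uses the $1.25 per 1M input tokens price from the table and counts input tokens only; embedding calls and output tokens are ignored, which is why the RAG result sits at the low end of the range above.

```python
def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     price_per_million: float) -> float:
    """Input-token cost per day in dollars."""
    return tokens_per_request * requests_per_day * price_per_million / 1_000_000


QUERIES_PER_DAY = 1_000
PRICE = 1.25  # dollars per 1M input tokens, the Gemini 1.5 Pro row above

long_context = daily_input_cost(1_000_000, QUERIES_PER_DAY, PRICE)
rag = daily_input_cost(3_000, QUERIES_PER_DAY, PRICE)

print(f"long context: ${long_context:,.2f}/day")  # long context: $1,250.00/day
print(f"RAG:          ${rag:,.2f}/day")           # RAG:          $3.75/day
print(f"ratio:        {long_context / rag:.0f}x")
```

The ratio is driven almost entirely by tokens per request, so it shrinks as the retrieved context grows and widens as the knowledge base does.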
Cost 2: Latency
More tokens = slower response. Time to first token (TTFT) grows roughly linearly with input length:
100K token input → TTFT ~2–5 seconds
1M token input → TTFT ~15–30 seconds (varies by model and infrastructure)
A conversational application where the user waits 30 seconds before any output is largely unusable.
Cost 3: Lost in the Middle
A 2023 Stanford paper "Lost in the Middle" (Liu et al.) found that when relevant information appears in the middle of a long context, LLM recall drops significantly. Information at the beginning or end performs best; information in the middle performs worst.
Position vs. recall (approximate trend):
Beginning (0–10%) ████████████████ high
Middle (40–60%) ██████ low
End (90–100%) ████████████ higher
Stuffing 100 documents into context does not guarantee the model finds the one at position 50.
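One commonly used mitigation, not something the paper guarantees will work, is to reorder ranked documents so the most relevant ones sit at the start and end of the prompt and the weakest land in the middle. A minimal sketch, assuming the input list is already sorted most-relevant first (the function name is illustrative):

```python
def reorder_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Place the most relevant items at the edges of the prompt.

    Input is sorted most-relevant first. Items are dealt alternately to the
    front and the back, so the least relevant end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 = most relevant
print(reorder_for_long_context(ranked))
# ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

This only helps when you have a relevance ranking to begin with, which in practice means running some form of retrieval first.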
The Real Cost of RAG
RAG isn't free either.
Cost 1: Imperfect Retrieval
Vector search is approximate matching — it makes mistakes:
- False negatives: relevant documents not retrieved. The user's question is semantically distant from the relevant passage; it falls outside the top-k.
- False positives: irrelevant documents retrieved. The LLM receives noise, which can cause confusion or hallucination.
This is exactly the problem that earlier articles in this series addressed: hybrid retrieval, reranking, and HyDE are all patches for retrieval imperfection.
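Both failure modes can be measured once you have a small labeled set of questions and their relevant documents. The sketch below computes recall@k (which drops with false negatives) and precision@k (which drops with false positives); the document IDs are made up, and retrieved IDs are assumed to be unique.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


retrieved = ["d7", "d2", "d9", "d4", "d1"]  # ranked output of a retriever
relevant = {"d2", "d4", "d8"}               # ground-truth labels for one query

print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, k=5))  # 2 of 5 results are relevant
```

Here "d8" is a false negative (never retrieved) and "d7", "d9", "d1" are false positives (noise passed to the LLM).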
Cost 2: Chunking Breaks Context
Chunking splits documents into fragments. Related information can end up in different chunks. A 10-page research report whose conclusion depends on an assumption from page 3 may be split such that only the conclusion chunk is retrieved — the LLM gets the conclusion without the premise.
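The effect is easy to reproduce with a naive fixed-size splitter. In this sketch the report text is invented for illustration: an assumption at the top, filler in between, and a conclusion at the end. The two end up in different chunks, so retrieving the conclusion chunk alone drops the premise.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Invented example document: premise first, conclusion last.
report = (
    "Assumption: revenue figures exclude the APAC region. "
    + "Filler paragraph about methodology. " * 20
    + "Conclusion: revenue grew 12% year over year."
)

CHUNK_SIZE = 200
chunks = split_fixed(report, CHUNK_SIZE)

# Index of the chunk in which each marker starts.
premise_chunk = report.index("Assumption:") // CHUNK_SIZE
conclusion_chunk = report.index("Conclusion:") // CHUNK_SIZE

print(premise_chunk, conclusion_chunk)
print("Assumption:" in chunks[conclusion_chunk])  # False: the premise is missing
```

Chunk overlap and larger chunks reduce how often this happens, but neither removes the problem for dependencies that span many pages.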
Cost 3: System Complexity
RAG is an engineering system: vector store + embedding model + retrieval pipeline + update mechanism + evaluation framework. Compared to "send the document to the LLM," it has significantly higher maintenance cost.
Seven-Dimension Comparison
| Dimension | Long Context | RAG |
|---|---|---|
| Document volume ceiling | ~10–100 docs (limited by window and cost) | Unlimited (vector store scales) |
| Cost | High (all tokens billed every request) | Low (only relevant fragments) |
| Latency | High (large inputs are slow) | Low (small inputs are fast) |
| Recall completeness | High in principle (everything is present), but degrades for mid-context information | Incomplete (depends on retrieval quality) |
| Knowledge updates | Requires resending all content | Only update changed documents |
| Engineering complexity | Low (direct API call) | High (retrieval pipeline to maintain) |
| Deep document understanding | Strong (reasoning across passages and documents) | Weaker (affected by chunking) |
Neither approach wins on all dimensions.
Decision Framework: Which One?
Four dimensions to locate your scenario:
Dimension 1: Document Volume
< 50 docs, total < 100K tokens → consider long context
50–1000 docs → evaluate cost, decide
> 1000 docs, or total > 1M tokens → RAG
Dimension 2: Update Frequency
Static content (monthly updates or less) → long context acceptable
Dynamic content (daily/hourly updates) → RAG (incremental indexing is cheap)
Real-time data → RAG (or direct API integration)
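Incremental indexing is cheap because only changed documents need re-embedding. One way to detect changes, sketched here under the assumption that you store a content hash per document at index time (the function and variable names are illustrative, not from any particular vector store):

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(current_docs: dict[str, str],
                      indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents against the hashes stored at last index time."""
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return {"upsert": sorted(to_upsert), "delete": sorted(to_delete)}


# State recorded at the previous indexing run.
indexed = {"a": content_hash("v1"), "b": content_hash("old"), "c": content_hash("gone")}
# Documents as they exist now: "a" unchanged, "b" edited, "c" removed, "d" new.
current = {"a": "v1", "b": "new", "d": "fresh"}

print(plan_index_update(current, indexed))
# {'upsert': ['b', 'd'], 'delete': ['c']}
```

With long context there is no equivalent shortcut: every request resends the full, current content.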
Dimension 3: Query Volume
One-time analysis (research, report generation) → long context
Low-frequency queries (< 100/day) → either works
High-frequency queries (> 1000/day) → RAG (cost differences compound)
Dimension 4: Latency Requirements
Interactive Q&A (< 3 second response) → RAG
Report generation, offline analysis → long context acceptable
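The four dimensions can be folded into a small helper. This is a sketch of the thresholds listed above, not a definitive rule: the `Scenario` fields and the precedence (any single RAG signal wins) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    doc_count: int
    total_tokens: int
    update_frequency: str  # "static", "dynamic", or "realtime"
    queries_per_day: int
    interactive: bool      # needs a response within about 3 seconds


def recommend(s: Scenario) -> str:
    """Apply the four dimensions; assumption: any single RAG signal wins."""
    rag_signals = [
        s.doc_count > 1000 or s.total_tokens > 1_000_000,  # Dimension 1
        s.update_frequency in ("dynamic", "realtime"),     # Dimension 2
        s.queries_per_day > 1000,                          # Dimension 3
        s.interactive,                                     # Dimension 4
    ]
    if any(rag_signals):
        return "RAG"
    if s.doc_count < 50 and s.total_tokens < 100_000:
        return "long context"
    return "evaluate cost"


contract_review = Scenario(1, 30_000, "static", 1, interactive=False)
enterprise_kb = Scenario(20_000, 50_000_000, "dynamic", 5_000, interactive=True)
mid_sized = Scenario(300, 600_000, "static", 50, interactive=False)

print(recommend(contract_review))  # long context
print(recommend(enterprise_kb))    # RAG
print(recommend(mid_sized))        # evaluate cost
```

Real decisions are rarely this binary; treat the output as a starting point for the cost and latency estimates, not a verdict.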
Summary Decision Table
| Use case | Docs | Updates | Queries | Recommendation |
|---|---|---|---|---|
| Legal contract review (single) | small | none | once | Long context |
| Enterprise knowledge base Q&A | large | frequent | high | RAG |
| PDF financial report analysis | medium | none | once | Long context |
| Product documentation chatbot | large | moderate | high | RAG |
| Codebase understanding | huge | frequent | high | RAG |
| Meeting notes summary (single) | small | none | once | Long context |
Hybrid Strategy: Use Both
Long context and RAG are not mutually exclusive. Sometimes the best choice is a combination.
Strategy 1: RAG selects documents, long context reads in full
```python
# Step 1: use RAG to find the most relevant chunks (retriever configured with k=3)
relevant_chunks = retriever.invoke(query)

# Step 2: load each source document in full, once, even if several chunks share it
# (load_full_doc is a helper you supply; it returns a Document)
sources = list(dict.fromkeys(chunk.metadata["source"] for chunk in relevant_chunks))
full_docs = [load_full_doc(source) for source in sources]
full_context = "\n\n".join(doc.page_content for doc in full_docs)

# Step 3: the LLM answers based on complete documents
answer = llm.invoke(f"Answer based on the following documents:\n{full_context}\n\nQuestion: {query}")
```
Good fit for: document sets too large to send in full, where each document still requires complex cross-passage reasoning.
Strategy 2: Coarse-grained RAG with large chunks
Traditional RAG uses 512–1024 token chunks. With larger windows, you can use 3,000–10,000 token chunks — preserving much more context while still doing retrieval filtering.
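```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split with larger chunks (preserve more context).
# from_tiktoken_encoder measures chunk_size in tokens; the plain constructor counts characters.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000,  # traditional 512 → now 4000 is reasonable
    chunk_overlap=400,
)

# Retrieve fewer chunks since each is larger
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 3 × 4000 = 12,000 tokens: precise and context-rich
```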
Strategy 3: Summary cache + precise retrieval
For large document libraries, use the LLM to generate a structured summary for each document; retrieve summaries; load the original passage on demand.
```python
from langchain_core.documents import Document

# Pre-processing: generate summaries (one-time)
for doc in all_documents:
    # .content assumes a chat model, whose invoke() returns a message object
    summary = llm.invoke(
        f"Summarize this document's key points in 3 sentences:\n{doc.page_content}"
    ).content
    summary_doc = Document(page_content=summary, metadata={
        "source": doc.metadata["source"],
        "original": doc.page_content,
    })
    summary_vectorstore.add_documents([summary_doc])


# Query time: retrieve summaries, load original passages
# (extract_relevant_passage and build_prompt are helpers you supply)
def query_with_summary(question):
    summaries = summary_vectorstore.similarity_search(question, k=5)
    relevant_chunks = [
        extract_relevant_passage(s.metadata["original"], question)
        for s in summaries
    ]
    return llm.invoke(build_prompt(question, relevant_chunks))
```
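The `extract_relevant_passage` helper is left undefined above. One deliberately simple way to fill it in, assuming paragraphs are separated by blank lines and that word overlap with the question is a good enough signal, is shown below. A production version would more likely use embeddings or a reranker; the `max_paragraphs` parameter is an addition for illustration.

```python
import re


def extract_relevant_passage(document: str, question: str,
                             max_paragraphs: int = 2) -> str:
    """Return the paragraphs that share the most words with the question."""
    question_words = set(re.findall(r"\w+", question.lower()))
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]

    def overlap(index: int) -> int:
        words = set(re.findall(r"\w+", paragraphs[index].lower()))
        return len(question_words & words)

    # Rank paragraph indices by overlap; the sort is stable, so ties keep document order.
    ranked = sorted(range(len(paragraphs)), key=overlap, reverse=True)
    # Restore document order so the extracted passage reads naturally.
    keep = sorted(ranked[:max_paragraphs])
    return "\n\n".join(paragraphs[i] for i in keep)


doc = (
    "Refunds are processed within 14 days.\n\n"
    "Our office is in Berlin.\n\n"
    "To request a refund, email support."
)
print(extract_relevant_passage(doc, "How do I request a refund?", max_paragraphs=1))
# To request a refund, email support.
```

Note that exact word matching misses "Refunds" for "refund", which is the kind of gap that stemming or embedding-based scoring would close.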
What Actually Changed
The rise of large context windows genuinely shifted some decisions:
Scenarios where RAG was once necessary but now may not be:
- Understanding documents under 50 pages (just stuff it in — simpler)
- One-time document analysis tasks (not worth building a RAG system)
- Prototype validation (fast idea testing, no need for production-grade RAG)
Scenarios where RAG is still necessary (most production systems):
- Knowledge bases with > 1,000 documents
- Frequently updated content
- High concurrency, cost-sensitive deployments
- Attribution requirements (RAG natively knows which document an answer came from)
Large context windows made "skip RAG for simple cases" a reasonable choice. They didn't make RAG obsolete — they made RAG's use case clearer: when document volume, update frequency, or cost makes "full context" impractical, RAG is irreplaceable.
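The attribution point is worth a concrete look, since it falls out of the retrieval step almost for free. In the sketch below each retrieved chunk carries its source, the prompt numbers the chunks, and the caller gets back a list that maps citation numbers to documents. The `Chunk` type, file paths, and prompt wording are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file path or URL of the originating document


def build_prompt_with_sources(question: str,
                              chunks: list[Chunk]) -> tuple[str, list[str]]:
    """Number each chunk so the model can cite it, and return the source list."""
    sources: list[str] = []
    blocks: list[str] = []
    for n, chunk in enumerate(chunks, start=1):
        blocks.append(f"[{n}] {chunk.text}")
        sources.append(chunk.source)
    prompt = (
        "Answer using only the numbered passages and cite them like [1].\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )
    return prompt, sources


retrieved_chunks = [
    Chunk("Refunds take 14 days.", "policies/refunds.md"),
    Chunk("Support hours are 9-5.", "policies/support.md"),
]
prompt, sources = build_prompt_with_sources("How long do refunds take?", retrieved_chunks)
print(prompt)
print(sources)  # sources[0] backs citation [1], sources[1] backs [2]
```

With a single long-context prompt you can ask the model to cite, but there is no retrieval record to check its citations against.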
Summary
| | Long Context | RAG |
|---|---|---|
| Core strength | Complete context, cross-document reasoning | Scalable, low cost, real-time updates |
| Core limitation | High cost, high latency, hard document ceiling | Imperfect retrieval, engineering complexity |
| Best for | Small-scale, one-time deep analysis | Large-scale production systems |
| Trend | Windows keep growing, costs keep falling | Retrieval quality keeps improving |
These are not competitors — they're complementary tools. Understanding the true cost of each, and choosing the right one, is engineering judgment.