WonderLab

Posted on May 19

RAG Series (22): Long Context vs RAG — Do We Even Need RAG?

#ai #llm #rag #opensource

A Question Worth Taking Seriously

Gemini 1.5 Pro supports 1 million token context. Claude 3.5 handles 200K tokens. GPT-4 Turbo handles 128K. A small novel fits in context. Some people ask: is RAG still necessary?

The question deserves a real answer, because it hides a genuine engineering decision: for a production system, should I use RAG or long context?

The Numbers

Large language model context windows (2024–2025):

Model	Context Window	Approximate text
Gemini 1.5 Pro	1,000,000 tokens	~750,000 words, ~1500 pages
Claude 3.5 Sonnet	200,000 tokens	~150,000 words, ~300 pages
GPT-4 Turbo	128,000 tokens	~96,000 words, ~190 pages
GPT-4o	128,000 tokens	~96,000 words, ~190 pages

This looks like a lot. But how much content does a real knowledge base have?

A mid-sized company's internal documentation: thousands of documents, millions of words
A large codebase: tens of thousands of files, tens to hundreds of millions of tokens
A news or research database: millions of articles

All of these exceed any model's context window. That is the hard ceiling on long context.
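A quick way to sanity-check this for your own corpus is to convert a word count to tokens and compare it with each window. The sketch below assumes the same rough ratio the table uses (about 0.75 words per token); the function names are illustrative, and a real tokenizer will give somewhat different counts.

```python
# Heuristic taken from the table above: roughly 0.75 words per token.
WORDS_PER_TOKEN = 0.75

# Context windows in tokens, copied from the table above.
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 1_000_000,
    "Claude 3.5 Sonnet": 200_000,
    "GPT-4 Turbo": 128_000,
    "GPT-4o": 128_000,
}


def estimate_tokens(word_count: int) -> int:
    """Estimate a token count from a word count (heuristic, not a tokenizer)."""
    return round(word_count / WORDS_PER_TOKEN)


def models_that_fit(word_count: int) -> list[str]:
    """Return the models whose window can hold the whole corpus in one request."""
    tokens = estimate_tokens(word_count)
    return [name for name, window in CONTEXT_WINDOWS.items() if tokens <= window]


print(models_that_fit(90_000))     # a short novel
print(models_that_fit(5_000_000))  # a mid-sized documentation set
```

The second call returns an empty list: a few million words of internal documentation does not fit in any of the listed windows, whatever the price.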

The Real Cost of Long Context

"Bigger window" doesn't mean "free." Every request processes every token, and the cost is real.

Cost 1: Money

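Rough estimates at late 2024 pricing (input tokens). Of the models listed, only Gemini 1.5 Pro accepts 1M tokens in a single request; for the others, the last column is the cost of 1M input tokens spread across requests:

Model	Price per 1M input tokens	Cost of 1M input tokens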
Gemini 1.5 Pro	$1.25	$1.25
Claude 3.5 Sonnet	$3.00	$3.00
GPT-4 Turbo	$10.00	$10.00

Compare to RAG:

Retrieval phase: Embedding API only (< $0.001)
Generation phase: 2,000–5,000 tokens of retrieved context + question (< $0.05)

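RAG can cost 20–200× less than long context for the same question: at the same per-token price, a 5,000-token RAG prompt is 20× cheaper than a 100K-token prompt and 200× cheaper than a 1M-token one.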

At 1,000 user queries per day against an enterprise knowledge base:

Long context (1M tokens per request, at $1.25 per 1M tokens): ~$1,250/day
RAG (3K token context): ~$3–15/day
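The arithmetic behind these figures is easy to reproduce. The sketch below uses the input prices from the table above and counts input tokens only, so it ignores output tokens, embedding calls, and vector store hosting; the function name is illustrative. The 1M-token rows for Claude 3.5 Sonnet and GPT-4 Turbo are hypothetical, since those models cannot accept that much in one request.

```python
# Input prices in USD per 1M tokens, copied from the table above (late 2024).
PRICE_PER_MILLION = {
    "Gemini 1.5 Pro": 1.25,
    "Claude 3.5 Sonnet": 3.00,
    "GPT-4 Turbo": 10.00,
}


def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     price_per_million: float) -> float:
    """Daily cost of input tokens only, in USD."""
    return tokens_per_request * requests_per_day * price_per_million / 1_000_000


for model, price in PRICE_PER_MILLION.items():
    long_ctx = daily_input_cost(1_000_000, 1_000, price)
    rag = daily_input_cost(3_000, 1_000, price)
    print(f"{model}: long context ${long_ctx:,.2f}/day, RAG ${rag:,.2f}/day")
```

Because both sides use the same price, the ratio depends only on prompt size: 1,000,000 / 3,000 is roughly 333×. These are input-only numbers, so they will not match the rounded estimates above exactly.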

Cost 2: Latency

More tokens = slower response. Time to first token (TTFT) grows roughly linearly with input length:

100K token input → TTFT ~2–5 seconds
1M token input   → TTFT ~15–30 seconds (varies by model and infrastructure)

A conversational application where the user waits 30 seconds before any output is largely unusable.
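The TTFT figures above vary by model and infrastructure, so it is worth measuring on your own setup. Here is a minimal sketch that times the first chunk of any streaming response; `measure_ttft` and `fake_stream` are illustrative names, and `fake_stream` stands in for whatever streaming call your LLM client provides.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, Iterator[str]]:
    """Return (seconds until the first chunk arrived, iterator over all chunks)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first chunk is produced
    ttft = time.perf_counter() - start

    def replay() -> Iterator[str]:
        # Hand back the first chunk too, so the caller loses no output.
        yield first
        yield from iterator

    return ttft, replay()


def fake_stream(delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming LLM call: waits, then yields tokens."""
    time.sleep(delay_s)
    yield from ["Hello", ",", " world"]


ttft, chunks = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft:.3f}s, text: {''.join(chunks)}")
```

Running this against the same model with a 3K-token prompt and a 100K-token prompt shows how much of the 3-second interactive budget the prefill alone consumes.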

Cost 3: Lost in the Middle

A 2023 Stanford paper "Lost in the Middle" (Liu et al.) found that when relevant information appears in the middle of a long context, LLM recall drops significantly. Information at the beginning or end performs best; information in the middle performs worst.

Position vs. recall (approximate trend):
Beginning (0–10%)    ████████████████ high
Middle (40–60%)      ██████           low
End (90–100%)        ████████████     fairly high

Stuffing 100 documents into context does not guarantee the model finds the one at position 50.
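You can probe this effect on your own model with a simple position test: hide one known fact among filler documents at different depths and check whether the model still finds it. The sketch below only builds the prompts; the helper names and the filler text are made up for illustration, and sending each prompt to a model and scoring the answer is left to you.

```python
def place_needle(filler_docs: list[str], needle: str, depth: float) -> list[str]:
    """Insert `needle` at a relative depth: 0.0 = first, 0.5 = middle, 1.0 = last."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_docs))
    return filler_docs[:position] + [needle] + filler_docs[position:]


def build_probe_prompt(docs: list[str], question: str) -> str:
    """Join numbered documents and append the question."""
    numbered = [f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs)]
    return "\n\n".join(numbered) + f"\n\nQuestion: {question}"


filler = [f"Filler passage number {i}." for i in range(99)]
needle = "The access code for the archive room is 7341."

for depth in (0.0, 0.5, 1.0):
    docs = place_needle(filler, needle, depth)
    prompt = build_probe_prompt(docs, "What is the access code for the archive room?")
    print(f"depth={depth}: needle is document {docs.index(needle) + 1} of {len(docs)}")
```

With 99 filler passages, a depth of 0.5 puts the needle at document 50 of 100, which is the case described above.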

The Real Cost of RAG

RAG isn't free either.

Cost 1: Imperfect Retrieval

Vector search is approximate matching — it makes mistakes:

False negatives: relevant documents not retrieved. The user's question is semantically distant from the relevant passage; it falls outside the top-k.
False positives: irrelevant documents retrieved. The LLM receives noise, which can cause confusion or hallucination.

This is exactly the problem that earlier articles in this series addressed: hybrid retrieval, reranking, and HyDE are all patches for retrieval imperfection.
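Both error types can be counted once you have a handful of queries with hand-labelled relevant documents. A minimal sketch, assuming document IDs are plain strings; the function name and the sample IDs are illustrative.

```python
def retrieval_report(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Compare one query's top-k results against hand-labelled relevant documents."""
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    return {
        "false_negatives": sorted(relevant_ids - retrieved),  # relevant but missed
        "false_positives": sorted(retrieved - relevant_ids),  # retrieved but irrelevant
        "precision": len(hits) / len(retrieved) if retrieved else 0.0,
        "recall": len(hits) / len(relevant_ids) if relevant_ids else 0.0,
    }


report = retrieval_report(
    retrieved_ids=["doc-3", "doc-7", "doc-9", "doc-12"],
    relevant_ids={"doc-3", "doc-9", "doc-21"},
)
print(report)
```

Here `doc-21` is a false negative and `doc-7` and `doc-12` are false positives, giving a precision of 0.5 and a recall of 2/3 for this query.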

Cost 2: Chunking Breaks Context

Chunking splits documents into fragments. Related information can end up in different chunks. A 10-page research report whose conclusion depends on an assumption from page 3 may be split such that only the conclusion chunk is retrieved — the LLM gets the conclusion without the premise.
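The effect is easy to reproduce with a toy splitter. The sketch below is not how any particular library chunks text; it is a plain fixed-size word splitter with no overlap, applied to a made-up report, to show a premise and its conclusion landing in different chunks.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words, no overlap."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


report = (
    "Assumption: demand grows 5 percent per year. "
    + "Background detail. " * 20
    + "Conclusion: the new plant pays for itself in four years."
)

chunks = chunk_words(report, chunk_size=15)
premise_chunk = next(i for i, c in enumerate(chunks) if "Assumption" in c)
conclusion_chunk = next(i for i, c in enumerate(chunks) if "Conclusion" in c)
print(f"{len(chunks)} chunks; premise in #{premise_chunk}, conclusion in #{conclusion_chunk}")
```

A query about the payback period will most likely match only the conclusion chunk, so the growth assumption never reaches the LLM unless the chunks are larger, overlap more, or the full document is loaded.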

Cost 3: System Complexity

RAG is an engineering system: vector store + embedding model + retrieval pipeline + update mechanism + evaluation framework. Compared to "send the document to the LLM," it has significantly higher maintenance cost.

Seven-Dimension Comparison

Dimension	Long Context	RAG
Document volume ceiling	~10–100 docs (limited by window and cost)	Unlimited (vector store scales)
Cost	High (all tokens billed every request)	Low (only relevant fragments)
Latency	High (large inputs are slow)	Low (small inputs are fast)
Recall completeness	Complete input (everything is present, though recall can drop mid-context)	Incomplete (depends on retrieval quality)
Knowledge updates	Requires resending all content	Only update changed documents
Engineering complexity	Low (direct API call)	High (retrieval pipeline to maintain)
Single-document understanding	Strong (full text available for cross-passage reasoning)	Weaker (affected by chunking)

Neither approach wins on all dimensions.

Decision Framework: Which One?

Four dimensions to locate your scenario:

Dimension 1: Document Volume

< 50 docs, total < 100K tokens     → consider long context
50–1000 docs                       → evaluate cost, decide
> 1000 docs, or total > 1M tokens  → RAG

Dimension 2: Update Frequency

Static content (monthly updates or less)   → long context acceptable
Dynamic content (daily/hourly updates)     → RAG (incremental indexing is cheap)
Real-time data                             → RAG (or direct API integration)

Dimension 3: Query Volume

One-time analysis (research, report generation)   → long context
Low-frequency queries (< 100/day)                 → either works
High-frequency queries (> 1000/day)               → RAG (cost differences compound)

Dimension 4: Latency Requirements

Interactive Q&A (< 3 second response)   → RAG
Report generation, offline analysis     → long context acceptable

Summary Decision Table

Use case                       Docs    Updates   Queries   Recommendation
──────────────────────────────────────────────────────────────────────────
Legal contract review (single) small   none      once      Long context
Enterprise knowledge base Q&A  large   frequent  high      RAG
PDF financial report analysis  medium  none      once      Long context
Product documentation chatbot  large   moderate  high      RAG
Codebase understanding         huge    frequent  high      RAG
Meeting notes summary (single) small   none      once      Long context
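The four dimensions can be folded into a small helper for a first-pass recommendation. This is one possible encoding of the thresholds above, not a definitive rule: the `Scenario` fields and the "any single RAG signal wins" policy are assumptions made for this sketch, and borderline cases return "evaluate" rather than a verdict.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    num_docs: int
    total_tokens: int
    updates_per_day: float   # 0 for static content
    queries_per_day: float   # 0 for a one-time analysis
    interactive: bool        # True if users expect a reply within ~3 seconds


def recommend(s: Scenario) -> str:
    """Return "RAG", "long context", or "evaluate" using the thresholds above."""
    rag_signals = [
        s.num_docs > 1000 or s.total_tokens > 1_000_000,  # Dimension 1: volume
        s.updates_per_day >= 1,                           # Dimension 2: daily or faster
        s.queries_per_day > 1000,                         # Dimension 3: high frequency
        s.interactive,                                    # Dimension 4: latency
    ]
    if any(rag_signals):
        return "RAG"
    if s.num_docs < 50 and s.total_tokens < 100_000 and s.queries_per_day < 100:
        return "long context"
    return "evaluate"


print(recommend(Scenario(1, 30_000, 0, 0, False)))               # single contract review
print(recommend(Scenario(20_000, 50_000_000, 5, 5_000, True)))   # enterprise knowledge base
print(recommend(Scenario(300, 600_000, 0, 50, False)))           # mid-sized static archive
```

The three calls return "long context", "RAG", and "evaluate" respectively; the last case is where the per-request cost estimate decides.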

Hybrid Strategy: Use Both

Long context and RAG are not mutually exclusive. Sometimes the best choice is a combination.

Strategy 1: RAG selects documents, long context reads in full

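# Step 1: use RAG to find the most relevant chunks (assumes the retriever is configured with k=3)
relevant_chunks = retriever.invoke(query)

# Step 2: map chunks back to their source documents, dropping duplicates
# (several chunks can come from the same file)
sources = list(dict.fromkeys(chunk.metadata["source"] for chunk in relevant_chunks))

# Step 3: load full documents (not chunks); load_full_doc is a helper you provide
# that returns a Document for a given source
full_docs = [load_full_doc(source) for source in sources]
full_context = "\n\n".join(doc.page_content for doc in full_docs)

# Step 4: LLM answers based on complete documents
answer = llm.invoke(f"Answer based on the following documents:\n{full_context}\n\nQuestion: {query}")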

Good fit for: large document sets (can't send all), but each document requires complex cross-passage reasoning.

Strategy 2: Coarse-grained RAG with large chunks

Traditional RAG uses 512–1024 token chunks. With larger windows, you can use 3,000–10,000 token chunks — preserving much more context while still doing retrieval filtering.

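from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split with larger chunks (preserve more context).
# chunk_size counts characters by default; from_tiktoken_encoder (requires the
# tiktoken package) measures it in tokens instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000,    # traditional 512 → now 4000 is reasonable
    chunk_overlap=400,
)

# Retrieve fewer chunks since each is larger
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 3 × 4000 = up to 12,000 tokens: precise and context-rich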

Strategy 3: Summary cache + precise retrieval

For large document libraries, use the LLM to generate a structured summary for each document; retrieve summaries; load the original passage on demand.

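from langchain_core.documents import Document

# Pre-processing: generate summaries (one-time)
for doc in all_documents:
    # Assumes llm is a chat model: invoke() returns a message, so take .content for the text
    summary = llm.invoke(f"Summarize this document's key points in 3 sentences:\n{doc.page_content}").content
    summary_doc = Document(page_content=summary, metadata={
        "source": doc.metadata["source"],
        # Fine for a demo; many vector stores cap metadata size, so for large
        # documents keep only "source" and reload the original at query time
        "original": doc.page_content,
    })
    summary_vectorstore.add_documents([summary_doc])

# Query time: retrieve summaries, load original passages
# (extract_relevant_passage and build_prompt are helpers you provide)
def query_with_summary(question):
    summaries = summary_vectorstore.similarity_search(question, k=5)
    relevant_chunks = [
        extract_relevant_passage(s.metadata["original"], question)
        for s in summaries
    ]
    return llm.invoke(build_prompt(question, relevant_chunks))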
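The snippet above leaves `extract_relevant_passage` and `build_prompt` undefined. Here is one hypothetical way to fill them in, assuming paragraphs are separated by blank lines and using plain word overlap as the relevance score; a production system would more likely use embeddings or a reranker for this step.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased word set used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_relevant_passage(original: str, question: str) -> str:
    """Return the paragraph that shares the most words with the question."""
    paragraphs = [p.strip() for p in original.split("\n\n") if p.strip()]
    if not paragraphs:
        return ""
    question_terms = _terms(question)
    return max(paragraphs, key=lambda p: len(_terms(p) & question_terms))


def build_prompt(question: str, passages: list[str]) -> str:
    """Number the non-empty passages and append the question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages) if p)
    return f"Answer using only the passages below.\n\n{context}\n\nQuestion: {question}"


doc = (
    "Our refund policy lasts 30 days.\n\n"
    "Shipping takes 5 business days.\n\n"
    "Support is open on weekdays."
)
passage = extract_relevant_passage(doc, "How long does shipping take?")
print(build_prompt("How long does shipping take?", [passage]))
```

Word overlap picks the shipping paragraph here, but it misses paraphrases ("delivery time"), which is why this helper is the first thing to replace once the prototype works.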

What Actually Changed

The rise of large context windows genuinely shifted some decisions:

Scenarios where RAG was once necessary but now may not be:

Understanding documents under 50 pages (just stuff it in — simpler)
One-time document analysis tasks (not worth building a RAG system)
Prototype validation (fast idea testing, no need for production-grade RAG)

Scenarios where RAG is still necessary (most production systems):

Knowledge bases with > 1,000 documents
Frequently updated content
High concurrency, cost-sensitive deployments
Attribution requirements (RAG records which documents were retrieved for each answer)

Large context windows made "skip RAG for simple cases" a reasonable choice. They didn't make RAG obsolete — they made RAG's use case clearer: when document volume, update frequency, or cost makes "full context" impractical, RAG is irreplaceable.

Summary

	Long Context	RAG
Core strength	Complete context, cross-document reasoning	Scalable, low cost, real-time updates
Core limitation	High cost, high latency, hard document ceiling	Imperfect retrieval, engineering complexity
Best for	Small-scale, one-time deep analysis	Large-scale production systems
Trend	Windows keep growing, costs keep falling	Retrieval quality keeps improving

These are not competitors — they're complementary tools. Understanding the true cost of each, and choosing the right one, is engineering judgment.
