Your LLM has a 128K-token context window.
Your document has 150K words, which is roughly 200K tokens.
Something has to give. What do you do?
A) Chunk the document into fixed-size pieces and embed each one — retrieve the top-k at query time.
B) Use a sliding window — process the document in overlapping chunks, stitch the outputs together.
C) Summarize each section progressively — feed the running summary forward as context.
D) Truncate to the most recent tokens and hope the answer is near the end.
Three of these are real strategies teams ship to production. One of them will silently give you wrong answers on a predictable class of questions.
Pick one — and tell me which you'd actually use on a 200-page legal contract where the answer can be anywhere.
I'll drop the full breakdown in the comments — including the failure mode most engineers don't see until they're in production.
Drop your answer 👇
Why A wins (RAG with chunked embeddings):
You split the document into overlapping chunks (~500 tokens, ~10–20% overlap), embed each one, store in a vector DB (Pinecone, pgvector, Qdrant), and retrieve top-k at query time.
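Here's a minimal sketch of the overlapping split described above. Whitespace splitting stands in for a real tokenizer, and `chunk_tokens` and its parameters are names made up for illustration, not any library's API.

```python
def chunk_tokens(tokens, chunk_size=500, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens, chunk_size=500, overlap_ratio=0.2)
print([len(c) for c in chunks])  # [500, 500, 400]
```

With a 20% overlap, each chunk repeats the last 100 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.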
This scales to any document size. Your LLM context at inference only sees the retrieved chunks — staying well inside the context window. Latency stays low because retrieval is a fast ANN lookup, not a sequential scan.
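To make the retrieval step concrete, here's a toy sketch of top-k lookup by cosine similarity. Two loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and the search is brute force rather than the ANN index a vector DB would use. The sample chunks are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "The tenant shall pay rent on the first day of each month.",
    "Either party may terminate this agreement with ninety days notice.",
    "The landlord is responsible for structural repairs.",
]
top = retrieve_top_k("when may a party terminate the agreement", chunks, k=1)
print(top[0])  # Either party may terminate this agreement with ninety days notice.
```

Only the chunks returned by `retrieve_top_k` would go into the prompt, which is why the prompt size stays flat no matter how long the source document is.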
The catch: chunking boundaries destroy semantic coherence. If a clause starts on page 12 and resolves on page 13, a hard chunk boundary between them means neither chunk retrieves well for that question. This is why overlap and chunk size tuning matter so much. Most teams underestimate this.
Production fix: chunk at sentence or paragraph boundaries, not character counts. Hybrid retrieval (keyword + semantic) catches the exact-match queries, such as clause numbers or defined terms, that embeddings alone tend to miss.
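A rough sketch of boundary-aware chunking, assuming a naive regex sentence splitter. Real legal text would need a proper segmenter, since abbreviations like "Inc." or "No." would trip this one. The function name and the `max_words` budget are illustrative.

```python
import re


def chunk_by_sentence(text, max_words=120):
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized
    chunk rather than being cut in the middle.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # flush before overflowing
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The term begins on signing. The tenant pays rent monthly. "
    "Late payment incurs a fee. The fee is five percent."
)
for chunk in chunk_by_sentence(text, max_words=10):
    print(chunk)
```

Every chunk ends on a sentence boundary, so no clause is split between two chunks unless a single sentence exceeds the budget on its own.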
Why B is the trap answer (sliding window):
Sliding window looks elegant: process the document in overlapping passes and accumulate the answer. But it doesn't actually fix the context window problem. You still have to process every window sequentially, and an answer that depends on both chunk 1 and chunk 47 gets lost because no single pass sees both. You're doing O(n) LLM calls per query, where n is the number of windows. At 150K words (roughly 200K tokens) and a 3s/call LLM, even generous 4K-token windows mean 50+ calls, which is minutes per query. Very few teams ship this for QA on large documents.
It works fine for summarization tasks where the output from chunk N feeds into chunk N+1. For retrieval? Wrong tool.
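The "minutes per query" figure can be sanity-checked with back-of-the-envelope arithmetic. The tokens-per-word ratio, window size, and overlap below are assumptions chosen for illustration. Only the 150K words and 3s/call come from the scenario above.

```python
import math


def sliding_window_cost(n_words, tokens_per_word=1.33, window_tokens=4000,
                        overlap_tokens=400, seconds_per_call=3.0):
    """Estimate sequential LLM calls and wall-clock time for one query."""
    total_tokens = math.ceil(n_words * tokens_per_word)
    step = window_tokens - overlap_tokens
    if step <= 0:
        raise ValueError("overlap_tokens must be smaller than window_tokens")
    # One call covers the first window; each further call advances by `step`.
    calls = 1 + max(0, math.ceil((total_tokens - window_tokens) / step))
    return calls, calls * seconds_per_call


calls, seconds = sliding_window_cost(150_000)
print(f"{calls} calls, about {seconds / 60:.1f} minutes per query")
# 56 calls, about 2.8 minutes per query
```

Shrinking the window to the ~500 tokens used for RAG chunks pushes the call count into the hundreds, and the calls cannot be parallelized if each one depends on the previous output.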
Why C sometimes works (progressive summarization):
For summarizing a 200-page doc? Solid. You either carry a running summary forward section by section, or map-reduce the document: summarize sections, then summarize the summaries. Current models such as GPT-4o handle this well.
For answering a specific question about clause 38(b) on page 97? It doesn't. The intermediate summary will have compressed away the specific detail you needed. Lossy compression applied to information retrieval is a category error.
Use progressive summarization when you want a high-level understanding. Don't use it when specific facts need to survive to the end of the pipeline.
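Here's a small sketch of the map-reduce variant, with the LLM call abstracted as a `summarize` callable. That callable, `first_sentence`, and the sample sections are all hypothetical stand-ins. In practice `summarize` would wrap a model call.

```python
from typing import Callable, List


def map_reduce_summary(sections: List[str],
                       summarize: Callable[[str], str],
                       group_size: int = 4) -> str:
    """Summarize each section, then repeatedly summarize groups of summaries."""
    if not sections:
        return ""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    # Map step: one summary per section.
    summaries = [summarize(s) for s in sections]
    # Reduce step: merge groups of summaries until a single one remains.
    while len(summaries) > 1:
        summaries = [
            summarize("\n".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]
    return summaries[0]


def first_sentence(text: str) -> str:
    """Stand-in summarizer (assumption): keeps only the first sentence."""
    return text.split(".")[0].strip() + "."


sections = [
    "Rent is due monthly. Clause 38(b) caps late fees at 5%.",
    "The lease runs two years. Renewal requires written notice.",
]
result = map_reduce_summary(sections, first_sentence)
print(result)  # Rent is due monthly.
```

The stand-in summarizer is crude, but it shows the failure mode: the clause 38(b) detail is gone after the first pass, and nothing downstream can bring it back.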
Why D is obviously wrong (and also surprisingly common):
Truncating to the last N tokens is what some LLM wrappers and chat-memory utilities do when you exceed the context limit and don't handle it explicitly. Others reject the request with an error, which is at least loud. Many teams ship the silent version by accident.
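One way to avoid shipping this by accident is to check the budget yourself and fail loudly. This is a sketch under assumptions: `count_tokens` uses whitespace words in place of the model's real tokenizer, and the limit and output reserve are illustrative numbers.

```python
class ContextLimitExceeded(Exception):
    """Raised when a prompt would not fit in the model's context window."""


def count_tokens(text):
    """Rough token count (assumption): whitespace words, not a real tokenizer."""
    return len(text.split())


def check_fits(prompt, context_limit=128_000, reserved_for_output=4_000):
    """Return the prompt's token count, or raise instead of truncating silently."""
    used = count_tokens(prompt)
    budget = context_limit - reserved_for_output
    if used > budget:
        raise ContextLimitExceeded(
            f"prompt uses {used} tokens but only {budget} are available"
        )
    return used


try:
    check_fits("word " * 200_000)
except ContextLimitExceeded as err:
    print(f"refusing to send: {err}")
```

Raising forces the caller to pick a strategy (retrieve, summarize, or split) instead of letting a wrapper decide which part of the document disappears.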
The problem beyond the obvious data loss: LLMs have a "lost in the middle" failure mode (Liu et al., 2023). Performance on retrieval tasks is highest when the answer is at the start or end of the context window — and degrades significantly when it's buried in the middle. So even if you fit everything in context, position matters.
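Truncation compounds this: you're not just losing data, you're losing it silently. Every question whose answer sits in the dropped portion gets a confident response built from whatever is left.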
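On the position effect: one mitigation sometimes applied to retrieved chunks is to reorder them so the strongest matches sit at the edges of the prompt and the weakest in the middle. This isn't something the breakdown above prescribes. Treat it as an illustrative sketch, with `reorder_for_edges` a hypothetical helper name.

```python
def reorder_for_edges(chunks_best_first):
    """Place the highest-ranked chunks at the start and end of the context.

    Input is ordered best match first. Even-ranked items fill the front,
    odd-ranked items fill the back in reverse, so the weakest matches
    end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(ranked))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```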