Joud Awad

Posted on Jun 13

38/60 Days System Design Questions

#abotwrotethis #systemdesign #ai #rag

Your LLM has a 128K-token context window.

Your document has 150K words.

Something has to give. What do you do?

A) Chunk the document into fixed-size pieces and embed each one — retrieve the top-k at query time.
B) Use a sliding window — process the document in overlapping chunks, stitch the outputs together.
C) Summarize each section progressively — feed the running summary forward as context.
D) Truncate to the most recent tokens and hope the answer is near the end.

Three of these are real strategies teams ship to production. One of them will silently give you wrong answers on a predictable class of questions.

Pick one — and tell me which you'd actually use on a 200-page legal contract where the answer can be anywhere.

I'll drop the full breakdown in the comments — including the failure mode most engineers don't see until they're in production.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #AI #LLM

Top comments (7)

c0d3l0v3r • Jun 14

Great Article

My answer was A, before looking at the comments. 😄

By the way, following you here too. 😀

I once built a RAG system where we chunked documents by semantic paragraphs rather than by pages or fixed-size boundaries. My reasoning was that smaller, semantically coherent chunks would reduce the amount of irrelevant context retrieved and, in turn, reduce the likelihood of the model hallucinating around unrelated content.

For something like a 50-page policy, I'd still lean toward A. Retrieval allows the model to focus on the most relevant sections instead of forcing the entire document through a summarization pipeline, where important details can be lost, or through truncation, where they are lost by definition.

Curious to see the failure mode you have in mind. 🙂

Joud Awad • Jun 14

I really appreciate your support, man. It means a lot to me!
Honestly, that's the right instinct: semantic paragraph chunking is what I'd reach for too. Coherent units, less junk in the window. You're not wrong.

But the failure mode I had in mind doesn't really care how good your chunks are.
It shows up when the question needs the whole document instead of a slice. Top-k is built for "what does clause X say": point a query at it, get the closest chunks back, done. Two question shapes quietly break that:

1) "List every termination condition in this contract." Your k is fixed. If there are 9 and you pull 5, the model answers with 5 and sounds totally sure. A wrong answer that reads like a right one.
2) "Is there an indemnification clause?" If there isn't one, retrieval still hands back the 5 nearest chunks, and now the model is reasoning over the things that look most like indemnification without actually being it. Absence is invisible to a similarity search.

That's the silent part. The model never knows what it didn't retrieve. No error, no empty result, just a confident answer built on half the evidence.
On a 50-page policy you're mostly fine, because most questions are local — which is exactly why A is still the right default. It's the 200-page contract with "do any clauses conflict with section 12" that falls over, clean chunks or not.
So the real fix isn't better chunking. It's spotting which questions are local (RAG is great here) versus global (you need aggregation, multi-hop, or a map-reduce pass over the full doc). Same input, completely different problem underneath.
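
To make the "map-reduce pass over the full doc" idea concrete, here is a minimal sketch. It assumes the document is already chunked, and `toy_extract` is a hypothetical stand-in for whatever per-chunk LLM call you would actually make. The point is only that every chunk gets visited, so the result isn't capped at k and an empty result really does mean "nothing found".

```python
from typing import Callable, Iterable


def map_reduce_extract(
    chunks: Iterable[str],
    extract: Callable[[str], list[str]],
) -> list[str]:
    """Run `extract` on every chunk (map), then merge and dedupe (reduce)."""
    seen: set[str] = set()
    merged: list[str] = []
    for chunk in chunks:  # map: every chunk is visited, not just the top-k
        for item in extract(chunk):
            key = item.strip().lower()
            if key and key not in seen:  # reduce: drop duplicates, keep first-seen order
                seen.add(key)
                merged.append(item.strip())
    return merged


def toy_extract(chunk: str) -> list[str]:
    """Hypothetical stand-in for an LLM call: sentences that mention termination."""
    return [s for s in chunk.split(". ") if "terminat" in s.lower()]


chunks = [
    "Either party may terminate on 30 days notice. Fees are due monthly",
    "Payment terms are net 45",
    "The agreement terminates on insolvency. Either party may terminate on 30 days notice",
]
conditions = map_reduce_extract(chunks, toy_extract)
print(conditions)
```

The trade-off is cost: this is one call per chunk, so it only makes sense for the global questions, with plain top-k retrieval kept for the local ones.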

c0d3l0v3r • Jun 15

Ohh, that makes sense now. Thanks.

I was focused on improving retrieval quality, but the real issue is that top-k retrieval can never know what it didn't retrieve. For local questions, RAG works great. For global questions or proving absence, you need aggregation over the whole document, not just better chunking.

I agree with your point of view.

Really useful distinction. 👍

Joud Awad • Jun 13

Why B is the trap answer (sliding window):

Sliding window looks elegant: process the document in overlapping passes and accumulate the answer. But it doesn't actually fix the context window problem. You still have to process every chunk sequentially, and the answer to a question spanning chunk 1 and chunk 47 gets lost because no single pass sees both. You're doing O(n) LLM calls per query. At 150K words and a 3s/call LLM, that's minutes per query once the document splits into a few dozen windows. Almost nobody ships this in production for QA on large documents.
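
A rough back-of-the-envelope for that latency claim, as a sketch. The 3 s per call comes from the comment above; the token count (about 1.3 tokens per word), the 4,000-token window, and the 400-token overlap are assumptions picked for illustration, so plug in your own numbers.

```python
import math


def sliding_window_cost(
    total_tokens: int,
    window_tokens: int,
    overlap_tokens: int,
    seconds_per_call: float,
) -> tuple[int, float]:
    """Return (number of sequential LLM calls, total seconds) for one query."""
    stride = window_tokens - overlap_tokens
    if stride <= 0:
        raise ValueError("overlap must be smaller than the window")
    if total_tokens <= window_tokens:
        calls = 1
    else:
        # One call for the first window, then one per stride until the end is covered.
        calls = 1 + math.ceil((total_tokens - window_tokens) / stride)
    return calls, calls * seconds_per_call


# Assumption: 150K words at roughly 1.3 tokens per word is about 195,000 tokens.
calls, seconds = sliding_window_cost(
    total_tokens=195_000,
    window_tokens=4_000,   # assumed window size
    overlap_tokens=400,    # assumed 10% overlap
    seconds_per_call=3,    # figure from the comment above
)
print(calls, seconds / 60)  # 55 calls, 2.75 minutes
```

Larger windows cut the call count, but each call then gets slower and more expensive, so the per-query cost stays tied to document length either way.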

It works fine for summarization tasks where the output from chunk N feeds into chunk N+1. For retrieval? Wrong tool.

Joud Awad • Jun 13

Why C sometimes works (progressive summarization):

For summarizing a 200-page doc? Solid. You map-reduce the document — summarize sections, then summarize the summaries. GPT-4o does this well.

For answering a specific question about clause 38(b) on page 97? It doesn't. The intermediate summary will have compressed away the specific detail you needed. Lossy compression applied to information retrieval is a category error.

Use progressive summarization when you want a high-level understanding. Don't use it when specific facts need to survive to the end of the pipeline.

Joud Awad • Jun 13

Why A wins (RAG with chunked embeddings):

You split the document into overlapping chunks (~500 tokens, ~10–20% overlap), embed each one, store in a vector DB (Pinecone, pgvector, Qdrant), and retrieve top-k at query time.
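
The splitting step can be sketched in a few lines. This is a minimal illustration, not a production chunker: it assumes the text is already tokenized into a list (here plain integers stand in for tokens), and the defaults of 500 tokens with 75 tokens of overlap (15%) are just one point inside the range mentioned above.

```python
def chunk_tokens(tokens: list, size: int = 500, overlap: int = 75) -> list[list]:
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this chunk already reaches the end
            break
    return chunks


tokens = list(range(1200))           # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])      # [500, 500, 350]
```

Each chunk would then be embedded and written to the vector store along with its position, so a retrieved chunk can be traced back to the source page.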

This scales to any document size. Your LLM context at inference only sees the retrieved chunks — staying well inside the context window. Latency stays low because retrieval is a fast ANN lookup, not a sequential scan.

The catch: chunk boundaries can break semantic coherence. If a clause starts on page 12 and resolves on page 13, a hard chunk boundary between them means neither chunk retrieves well for that question. This is why overlap and chunk size tuning matter so much. Most teams underestimate this.

Production fix: chunk at sentence or paragraph boundaries, not character counts. Hybrid retrieval (keyword + semantic) handles the edge cases.
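
For the hybrid part, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). That choice is my assumption for illustration, not something the setup above requires, and the chunk IDs are made up. `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so chunks that rank well
    in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


keyword_hits = ["c12", "c07", "c31"]   # hypothetical keyword (e.g. BM25) results
semantic_hits = ["c07", "c44", "c12"]  # hypothetical vector-search results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['c07', 'c12', 'c44', 'c31']
```

Because RRF only looks at ranks, it sidesteps the problem that keyword scores and cosine similarities live on different scales.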

Joud Awad • Jun 13

Why D is obviously wrong (and also surprisingly common):

Truncating to the last N tokens is what some LLM wrappers and chat-history utilities do when you exceed the context limit and don't handle it explicitly (others reject the request with an error instead). Many teams ship this by accident.

The problem beyond the obvious data loss: LLMs have a "lost in the middle" failure mode (Liu et al., 2023). Performance on retrieval tasks is highest when the answer is at the start or end of the context window — and degrades significantly when it's buried in the middle. So even if you fit everything in context, position matters.

Truncation compounds this: you're not just losing data, you're losing it without knowing whether the answer was in the part that got cut.
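
If you'd rather fail loudly than truncate by accident, a guard like this is one option. It's a sketch under assumptions: `build_prompt` and `ContextOverflowError` are hypothetical names, and the default counter just splits on whitespace, so swap in your model's real tokenizer before trusting the numbers.

```python
from typing import Callable


class ContextOverflowError(ValueError):
    """Raised when a prompt would not fit in the model's context window."""


def build_prompt(
    question: str,
    passages: list[str],
    limit_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),  # crude estimate
) -> str:
    """Assemble the prompt, refusing to continue if it exceeds the limit."""
    prompt = "\n\n".join([*passages, question])
    used = count_tokens(prompt)
    if used > limit_tokens:
        # Surface the problem instead of letting something downstream cut tokens.
        raise ContextOverflowError(
            f"prompt needs about {used} tokens, limit is {limit_tokens}"
        )
    return prompt


prompt = build_prompt("What does clause 12 say?", ["Clause 12: ..."], limit_tokens=100)
```

Catching the error is where the real decision happens: fall back to retrieval, shrink k, or tell the user the document is too large.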