If you’ve built a RAG system, you’ve probably seen this:
- The retriever finds the right document
- The chunk clearly contains the answer
- Yet the LLM responds as if it never saw it
This isn’t a vector problem.
It’s not an embedding issue.
And it’s usually not your prompt.
It’s a context window problem — commonly called “Lost in the Middle.”
What “Lost in the Middle” Actually Means
Large Language Models do not treat all tokens equally.
When processing long prompts, models tend to:
- Pay more attention to tokens at the beginning
- Pay more attention to tokens at the end
- Pay less attention to tokens in the middle
This positional bias isn't speculation: it was measured directly in the "Lost in the Middle" study (Liu et al., 2023), and it tends to worsen as prompts approach the context window limit.
So even if the correct chunk is retrieved, placing it in the middle of a long prompt makes it statistically easier for the model to ignore.
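You can probe this bias yourself: build the same long prompt with a key fact (the "needle") placed at different depths and compare the model's answers. A minimal sketch of the prompt construction, with illustrative names (`build_probe_prompt`, `depth`); the actual LLM call is left out:

```python
def build_probe_prompt(filler_chunks, needle, depth):
    """Insert `needle` among filler chunks at a relative depth:
    0.0 = very start of the context, 1.0 = very end."""
    idx = round(depth * len(filler_chunks))
    chunks = filler_chunks[:idx] + [needle] + filler_chunks[idx:]
    return "\n\n".join(chunks)

filler = [f"Background paragraph {i}." for i in range(20)]
needle = "The access code is 7401."

start_prompt = build_probe_prompt(filler, needle, 0.0)
middle_prompt = build_probe_prompt(filler, needle, 0.5)
end_prompt = build_probe_prompt(filler, needle, 1.0)
# Send each prompt plus the question "What is the access code?"
# to the same model; mid-depth placements typically recall worst.
```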
Why This Hits RAG Systems Especially Hard
RAG pipelines usually look like this:
- User asks a question
- Retriever fetches top-K chunks
- Chunks are concatenated into context
- Prompt is sent to the LLM
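The last two steps usually reduce to naive string concatenation. A minimal sketch (function and parameter names are illustrative, not from any specific framework):

```python
def build_rag_prompt(system, question, retrieved_chunks, answer_instruction):
    """Naive assembly: chunks are stacked in retrieval (relevance) order,
    which pushes most of them into the middle of the prompt."""
    parts = [system, question, *retrieved_chunks, answer_instruction]
    return "\n\n".join(parts)
```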
The problem?
Most systems:
- Append retrieved chunks after instructions
- Stack chunks in relevance order
- Push critical information into the middle of the prompt
Typical RAG prompt layout:
[System Instructions]
[User Question]
[Retrieved Chunk 1]
[Retrieved Chunk 2]
[Retrieved Chunk 3]
[Retrieved Chunk 4]
[Answer Instruction]
Where do most chunks land?
👉 Right in the middle
Result:
Relevance ≠ Visibility
Even though:
- The chunks are semantically correct
- They were retrieved via similarity
the model can still skim past them. The retriever did its job, but the model never fully used the information.
Why “Better Embeddings” Don’t Fix This
This is the trap many teams fall into:
- Switching from OpenAI → Cohere → BGE
- Tweaking vector dimensions
- Changing similarity metrics
But embeddings only decide what gets retrieved.
They don’t control what gets attended to.
You can have perfect embeddings and still get poor answers if:
- Context is too long
- Chunks are poorly ordered
- Important facts sit in the middle
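The first of those failure modes is often fixable before you touch embeddings at all: cap the context at a token budget and keep only the best chunks. A minimal sketch, assuming `chunks` arrive sorted by relevance and using a crude whitespace token count (a real tokenizer would be more accurate):

```python
def fit_to_budget(chunks, budget_tokens, count_tokens=lambda c: len(c.split())):
    """Keep relevance-sorted chunks until the token budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```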
How “Lost in the Middle” Shows Up in Production
Common symptoms:
- Model answers partially correct
- Hallucinations despite relevant context
- Correct answers during testing, failures at scale
- “It works for short queries, not long ones”
These are not random failures — they’re structural.
Practical Ways to Mitigate It
You don’t eliminate “Lost in the Middle” — you design around it.
Effective strategies include:
- Putting critical chunks at the beginning or end
- Query-aware chunk re-ordering
- Context compression / summarisation
- Smaller, intent-focused context windows
- Multi-step prompting instead of one giant prompt
The goal isn't more context; it's better-positioned context.
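The first two strategies above can be combined in a few lines: given chunks sorted by relevance, alternate them toward the two ends of the context so the strongest material sits at the start and end, and the weakest in the middle. This is the same idea behind LangChain's `LongContextReorder`, though this standalone sketch is not its actual implementation:

```python
def reorder_for_attention(chunks_by_relevance):
    """Route the most relevant chunks to the edges of the context
    and the least relevant to the middle, matching positional bias."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

reorder_for_attention(["r1", "r2", "r3", "r4", "r5"])
# → ["r1", "r3", "r5", "r4", "r2"]: the top two chunks end up
#   first and last; the weakest chunk lands in the middle.
```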
Final Takeaway
RAG doesn’t fail because retrieval is wrong.
It fails because attention is finite.
If you don’t design for how models actually consume context, they’ll ignore the very information you worked hard to retrieve.
What’s Next
In the next article, we’ll go deeper into:
Chunking, Batching & Indexing — the Hidden Costs of RAG Systems
Because once attention is understood, scale, latency, and cost become the real problems.