Retrieval accuracy can drop sharply when the answer sits in the middle of a long context window instead of at the edges. Liu et al. (2023) measured this across multiple transformer models in their "Lost in the Middle" study, reporting drops of more than 20% on multi-document question answering for some models. The U-shaped performance curve is consistent. Models nail facts at the start and end of a prompt, but they degrade sharply in the center.
The attention mechanism is not a uniform search index. Each token's attention weights come from a softmax over every token it can see, so the weights always sum to one and any single token's share tends to shrink as the sequence grows. A plausible reading of the U-shape is that early tokens act as anchors, recent tokens benefit from recency bias in the attention scores, and middle tokens compete for a shrinking slice of probability mass. There is no explicit indexing happening inside the forward pass. It behaves like position-dependent attention decay, not a database lookup.
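To make the dilution part concrete, here is a toy sketch in plain Python. It assumes one "target" token with a fixed score advantage over identical distractors; the scores (2.0 and 0.0) and the `target_weight` helper are made up for illustration and are not measured from any model. The toy has no notion of position, so it only shows how a single token's softmax share shrinks with sequence length, not why the middle suffers more than the edges.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def target_weight(seq_len, target_score=2.0, other_score=0.0):
    """Softmax weight on one target token among seq_len - 1 distractors.

    The scores are illustrative constants, not values from a real model.
    """
    scores = [other_score] * seq_len
    mid = seq_len // 2
    scores[mid] = target_score  # place the target in the middle
    return softmax(scores)[mid]

# The target's share falls as the sequence gets longer.
for n in (8, 64, 512, 4096):
    print(n, round(target_weight(n), 4))
```

With a fixed score gap, the target's weight is `e^2 / (e^2 + n - 1)`, so it falls roughly as 1/n once the sequence is long.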
I saw this in a RAG pipeline last quarter. We chunked legal contracts and fed the top 8 chunks into a 32k-context model. The target clause was in chunk 4, buried in the middle of the assembled prompt. The model hallucinated terms rather than retrieving the exact language. We reordered the same chunks to place the high-signal chunk at the end of the context. Retrieval accuracy recovered without changing a single parameter. Same tokens, different order, different result.
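If you want to check for this in your own pipeline, one option is to sweep the target chunk across every position and score the answer each time. The sketch below is a minimal version under stated assumptions: `build_prompt`, `position_sweep`, and `accuracy_by_position` are hypothetical helpers, the prompt layout is invented, and `ask_model` and `is_correct` are callables you would supply for your own model and grading logic.

```python
def build_prompt(chunks, question):
    """Join numbered chunks into one prompt, with the question last."""
    body = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"{body}\n\nQuestion: {question}"

def position_sweep(target, distractors, question):
    """Yield (position, prompt) with the target chunk in every possible slot."""
    for pos in range(len(distractors) + 1):
        chunks = distractors[:pos] + [target] + distractors[pos:]
        yield pos, build_prompt(chunks, question)

def accuracy_by_position(target, distractors, question, ask_model, is_correct):
    """Run the sweep and record whether each answer was judged correct.

    ask_model(prompt) -> str and is_correct(answer) -> bool are supplied
    by the caller; nothing here talks to a real model.
    """
    return {
        pos: is_correct(ask_model(prompt))
        for pos, prompt in position_sweep(target, distractors, question)
    }
```

A single question gives you one pass/fail per position, so you would run this over a batch of questions and average before reading anything into the curve.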
If you are building with long context today, treat the middle of your prompt like a cache eviction zone. Place grounding facts, citations, and instructions at the top or bottom. Keep the middle for low-stakes padding or redundant context. The pattern is edge-loading your critical context.
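Edge-loading can be sketched in a few lines, assuming your retriever hands back chunks sorted best-first. The `edge_load` function below is a hypothetical helper, not part of any library: it alternates chunks between the front and the back so the weakest ones land in the middle. It is one reasonable ordering, not the only one; in the contract pipeline above we simply moved the best chunk to the end.

```python
def edge_load(ranked_chunks):
    """Reorder chunks (best first) so the strongest sit at the edges.

    Even-ranked chunks fill the front in order; odd-ranked chunks fill
    the back in reverse, which leaves the weakest chunks in the middle.
    """
    front = ranked_chunks[0::2]
    back = ranked_chunks[1::2]
    return front + back[::-1]

ranked = [1, 2, 3, 4, 5, 6, 7, 8]  # 1 = most relevant
print(edge_load(ranked))  # → [1, 3, 5, 7, 8, 6, 4, 2]
```

The top two chunks end up first and last, and ranks 7 and 8 sit in the center where a miss costs the least.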