- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- Also by me: Thinking in Go (2-book series): Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub. An IDE for developers who ship with Claude Code and other AI coding tools.
- Me: xgabriel.com | GitHub
You stuffed the whole onboarding handbook into the prompt because the model has a million-token window now. The answer comes back confident and wrong. You re-read your prompt. The fact the model missed is right there, in paragraph 47 of section 12, exactly where you put it. So why did the model invent a different answer?
You hit the same wall a Stanford team documented in 2023, when Liu et al. published Lost in the Middle: How Language Models Use Long Contexts. Their finding still holds: facts at the start or end of a prompt come back accurately, while accuracy on facts in the middle often drops by 20 or more percentage points (see Figure 5 of the paper). The recall curve is a U.
Three years and several model generations later, the U-shape is still the dominant pattern. Bigger context windows did not fix it. They just gave you more middle to lose things in.
What "lost in the middle" actually means
The original setup was simple. Build a prompt that contains a question, a small set of documents, and one document that has the answer. Vary the position of the answer document (first, second, third, and so on). Measure accuracy at each position.
The 2023 paper ran this on GPT-3.5, Claude 1.3, and a few open-source models. Accuracy was high when the gold document sat at position 1 or position N. It dipped by 20+ percentage points for positions in the middle. The paper's title became a benchmark tradition the moment the next long-context model shipped.
Other teams reproduced it on every long-context model that came out:
- Anthropic's Claude 2.1 long-context prompting write-up documented the broader long-context recall problem on a 200K-token needle eval and showed a prompt-engineering mitigation (priming the model with "Here is the most relevant sentence in the context:") that lifted accuracy from 27% to 98%.
- Greg Kamradt's widely cited Needle In A Haystack repo became the de facto third-party harness. Its public results across GPT-4, GPT-4-Turbo, and the Claude 2.x family showed the same U-shape on long inputs.
- Google's Gemini 1.5 technical report leaned hard on near-perfect needle recall. Follow-up work like RULER (Hsieh et al., 2024) tested multi-needle and reasoning-over-context tasks and found effective context length is much shorter than advertised, with consistent recall degradation as context grows.
- More recent multi-document benchmarks like LongBench v2 and HELMET continue to surface long-context degradation patterns consistent with the lost-in-the-middle effect, even on 2025-era frontier models.
RULER alone tested 17 long-context models. All 17 showed the recall degradation as input length grew. Bigger windows moved the middle further out. They did not get rid of it.
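The assistant-priming mitigation from the Claude 2.1 write-up is also easy to try yourself with the current Messages API: prefill the assistant turn so the model starts by quoting the most relevant sentence before it answers. A minimal sketch, assuming a recent Claude model; the function name and prompt framing are mine, not Anthropic's.
from anthropic import Anthropic

client = Anthropic()

def ask_with_prefill(context: str, question: str) -> str:
    # Prefill the assistant turn; the model continues from this text,
    # which pushes it to locate the relevant sentence before answering.
    prefill = "Here is the most relevant sentence in the context:"
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": (
                    f"<passage>\n{context}\n</passage>\n\n"
                    f"Question: {question}"
                ),
            },
            # A trailing assistant message is treated as the start of the reply.
            {"role": "assistant", "content": prefill},
        ],
    )
    return prefill + msg.content[0].text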
A 40-line eval you can run today
Here is a small harness that reproduces the effect against any Anthropic model. It builds a synthetic context out of distractor paragraphs, drops a unique fact at a configurable position, asks one question, and grades the answer.
from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()
MODEL = "claude-sonnet-4-5"

# Filler paragraph, repeated to pad the context.
DISTRACTOR = (
    "The harbor district was quiet for most of "
    "the year. Fishermen left at dawn and "
    "returned at dusk, and the cafes by the "
    "water filled up only on weekends."
)

# The one fact the model has to recall, and the string we grade on.
NEEDLE = (
    "The municipal archive code for the 1973 "
    "harbor renovation is K-7421-Q."
)
QUESTION = (
    "What is the municipal archive code for "
    "the 1973 harbor renovation?"
)
GOLD = "K-7421-Q"
Build the context with the needle at a chosen position:
def build_context(num_chunks: int, needle_pos: int) -> str:
    # Every chunk is the same distractor except one, which carries the fact.
    chunks = [DISTRACTOR] * num_chunks
    chunks[needle_pos] = NEEDLE
    return "\n\n".join(chunks)
def ask(context: str, question: str) -> str:
    # Single user turn: the passage wrapped in tags, then the question.
    msg = client.messages.create(
        model=MODEL,
        max_tokens=128,
        messages=[
            {
                "role": "user",
                "content": (
                    f"Read this passage and answer "
                    f"the question.\n\n"
                    f"<passage>\n{context}\n</passage>\n\n"
                    f"Question: {question}"
                ),
            }
        ],
    )
    return msg.content[0].text
Sweep the needle across positions and measure recall:
def run_sweep(num_chunks: int = 200, trials: int = 3):
    # Needle at the start, the quarter points, the middle, and the end.
    positions = [0, num_chunks // 4,
                 num_chunks // 2,
                 3 * num_chunks // 4,
                 num_chunks - 1]
    results = {}
    for pos in positions:
        hits = 0
        for _ in range(trials):
            ctx = build_context(num_chunks, pos)
            answer = ask(ctx, QUESTION)
            if GOLD in answer:
                hits += 1
        # Fraction of trials where the exact code came back.
        results[pos] = hits / trials
    return results


if __name__ == "__main__":
    print(run_sweep())
Run it against any frontier long-context model and you will see the same U-shape the published reproductions report. On Claude Sonnet 4.5 at 200 chunks, expect roughly 90%+ recall at the edges and a clear dip in the middle, with the exact gap depending on chunk size and model snapshot.
A few notes on the harness so you do not draw the wrong conclusion:
- One needle per run. Real RAG retrieves several relevant chunks, and the RULER work shows multi-needle is meaningfully harder than single-needle.
- The needle is a string match. Real questions need reasoning, which is where mid-context loss compounds.
- Distractor diversity matters. Repeating the same paragraph is the friendliest possible setup. Swap in real documents and recall drops further.
Treat the script as a starting point. It is enough to win the argument with your skeptical teammate that the effect is real and reproducible.
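If you want to see the harder multi-needle behavior the first note describes, the change to the harness is small: drop several distinct facts in and grade on all of them. A sketch reusing DISTRACTOR from above; the extra facts and codes are invented for the eval, just like the original needle.
# Hypothetical extra facts, invented for the eval like NEEDLE above.
EXTRA_NEEDLES = [
    "The municipal archive code for the 1973 harbor renovation is K-7421-Q.",
    "The dock extension permit issued in 1981 carries the number P-3305-B.",
    "The 1996 seabed survey is filed under reference S-0112-M.",
]
EXTRA_GOLD = ["K-7421-Q", "P-3305-B", "S-0112-M"]

def build_multi_context(num_chunks: int, positions: list[int]) -> str:
    # Same distractor padding, but one fact per requested position.
    chunks = [DISTRACTOR] * num_chunks
    for pos, fact in zip(positions, EXTRA_NEEDLES):
        chunks[pos] = fact
    return "\n\n".join(chunks)

def multi_recall(answer: str) -> float:
    # Fraction of codes the model returned verbatim.
    return sum(code in answer for code in EXTRA_GOLD) / len(EXTRA_GOLD)
Sweep positions the same way run_sweep does, and score with multi_recall instead of the single substring check.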
Bigger windows aren't the fix
When the 2023 paper landed, the natural reaction was: "we will solve this with longer context." Three years later, that bet did not pay off. Context windows are dramatically larger, the U-shape persists, and dumping everything into the prompt is more expensive per call and slower per response than retrieval-based approaches.
Three patterns that have held up across model generations:
1. Retrieve and rerank, then place the top results at the edges. Vector search gives you 50 candidates, a reranker scores them, and you pass the top 5 to 10 in. Put the highest-scoring chunk first. Put the second-highest at the very end of your context. Spread the rest in between. That puts your highest-signal chunks where the model actually pays attention (see the assembly sketch after this list).
2. Anchor the question to the recent end. The end of your prompt is where the model is about to start generating. Repeat the user's question (or a tightened version of it) right before the answer slot. Something as small as "Question: what is the municipal archive code for the 1973 harbor renovation?" right above the assistant turn keeps the task in the model's strongest attention window.
3. Chunk-and-summarize before you stuff. Long source documents get a per-chunk summary pass first. The summaries become the prompt. The originals stay one fetch away if the model needs to drill in via a tool call. A cheap summarization pass usually beats a long answer pass on recall (a sketch follows below).
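For the first two patterns, the assembly step is only a few lines of list surgery before the API call. A sketch, assuming your reranker already returns chunks sorted best-first; the function name and tags are mine.
def assemble_prompt(question: str, ranked_chunks: list[str], top_k: int = 8) -> str:
    top = ranked_chunks[:top_k]
    if len(top) >= 2:
        # Best chunk first, second-best last, everything else in the middle.
        ordered = [top[0]] + top[2:] + [top[1]]
    else:
        ordered = top
    passage = "\n\n".join(ordered)
    # Repeat the question at the very end, right before the answer slot,
    # so it sits in the model's strongest attention window.
    return (
        f"<passage>\n{passage}\n</passage>\n\n"
        f"Question: {question}"
    )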
Think of attention as a spotlight that's always brightest at the wings of the stage. If a fact is load-bearing, do not bury it in the chorus.
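The third pattern is just as mechanical. A minimal per-chunk summarization pass, assuming the same client and MODEL as the harness above; the prompt wording is mine.
def summarize_chunks(chunks: list[str], question: str) -> list[str]:
    # One cheap, short call per chunk; only the summaries go into the
    # final prompt, and the originals stay one tool call away.
    summaries = []
    for chunk in chunks:
        msg = client.messages.create(
            model=MODEL,
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize the passage below in two sentences, "
                    f"keeping any detail relevant to this question: "
                    f"{question}\n\n{chunk}"
                ),
            }],
        )
        summaries.append(msg.content[0].text)
    return summaries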
What to take away
Long-context models did not solve retrieval. They changed where the failures live.
If your application is sliding from "answers seem fine" to "answers are subtly wrong," check the prompt before you blame the model. Find the position where the answer is supposed to come from. Then check whether anything in your retrieval and ordering logic is pushing it into the middle.
Run the sweep on your retrieval pipeline this afternoon. The U is waiting.
If this post saved you a debugging session
Retrieval-augmented systems live or die on chunking, ranking, and ordering. My RAG Pocket Guide walks through the production patterns that hold up: chunking strategies, hybrid retrieval, reranking, prompt assembly, and the eval harnesses you need to keep regressions from shipping. If you build with LLMs at work, it's the part of the stack where small wins compound fastest.
I also build Hermes IDE, an IDE for developers who ship with Claude Code and other AI coding tools. If you spend most of your days writing code with an LLM in the loop, it might be worth a look.
