Ana Julia Bittencourt

Originally published at blog.memoclaw.com

The lost-in-the-middle problem and why retrieval beats stuffing

Your agent has a 200K token context window. So you dump everything in there — MEMORY.md, daily logs, project notes, old conversations — and figure the model will sort it out. It won't.

The research says your middle context is a dead zone

In 2023, researchers from Stanford, UC Berkeley, and Samaya AI published "Lost in the Middle: How Language Models Use Long Contexts." They tested models on tasks where the relevant information was placed at different positions in the input. The results were consistent: models performed best when key information appeared at the very beginning or the very end of the context. Information in the middle got ignored.

This wasn't a fluke finding. Nelson Liu and the team tested across multiple model families and context lengths. Performance degraded significantly — sometimes by 20% or more — when the answer was buried in the middle third of the input.

Google DeepMind followed up with similar findings. So did Anthropic's own internal research on Claude's attention patterns. The pattern holds: long context doesn't mean good context.

What this means for your agent

If you're loading 50KB of MEMORY.md into every session, here's what actually happens:

  1. The model reads the first few thousand tokens carefully
  2. Attention drops off through the middle
  3. It picks back up near the end, where your actual conversation starts

That preference you stored six months ago about using TypeScript? It's sitting in paragraph 47 of your memory file. The model probably won't notice it when it matters.

The math makes it worse. A 50KB MEMORY.md is roughly 12,500 tokens. At $3 per million input tokens (Claude Sonnet pricing), that's about $0.04 per session just to load memories your agent might not even use. Run 50 sessions a day and you're spending $2/day on context that's partially invisible to the model.
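If you want to sanity-check that math, it fits in a few lines of TypeScript. The 4-bytes-per-token ratio is a rough rule of thumb, not an exact tokenizer count:

```typescript
// Back-of-the-envelope cost of stuffing a 50KB MEMORY.md into every session.
// Assumes ~4 bytes per token and $3 per million input tokens (Claude Sonnet).
const fileBytes = 50_000;
const tokensPerSession = fileBytes / 4;              // ~12,500 tokens
const costPerSession = (tokensPerSession / 1e6) * 3; // ~$0.0375
const costPerDay = costPerSession * 50;              // ~$1.88 at 50 sessions/day
console.log({ tokensPerSession, costPerSession, costPerDay });
```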

Stuffing vs. retrieval: a real comparison

Stuffing approach (MEMORY.md):

  • Load everything every session: ~12,500 tokens
  • Model sees all memories but attends unevenly to them
  • Cost: $0.04 per session regardless of relevance
  • Old memories compete with new ones for attention

Retrieval approach (MemoClaw recall):

  • Query for relevant memories: 5-10 results, ~500-1,000 tokens
  • Model sees only what's relevant to the current conversation
  • Cost: $0.005 per recall + ~$0.003 in input tokens
  • Important memories surface when they're actually needed

The retrieval approach uses 4-8% of the tokens and puts them where the model actually pays attention: right before the conversation starts.

Why "just use a bigger context window" doesn't fix this

Every few months, someone announces a longer context window. Gemini hit 1M tokens. Claude went to 200K. GPT-4 Turbo did 128K. And every time, people assume the memory problem is solved.

It isn't. Longer windows don't change the attention distribution. They make the middle-zone problem worse because there's more middle to lose things in. A 1M token context with your answer at position 500K is worse than a 4K context with your answer at position 2K.

The lost-in-the-middle researchers tested this explicitly. Extending context length didn't improve retrieval from the middle. It just gave models more text to skim past.

What actually works

The fix isn't bigger contexts. It's smaller, targeted contexts with the right information.

With MemoClaw, instead of loading everything, you recall what's relevant:

```bash
memoclaw recall "user's TypeScript preferences"
```

You get back 5-10 semantically matched memories. You inject those at the start of your prompt. The model sees exactly what it needs, right where it pays the most attention.

For an OpenClaw agent, this looks like:

  1. Session starts
  2. Agent calls recall with a query about the current task
  3. Gets back relevant memories (preferences, past decisions, corrections)
  4. Those go into the system prompt, before the conversation
  5. Agent works with full context on what matters, zero noise from six months of irrelevant notes

The token cost drops from ~12,500 to ~800. The relevant information moves from "somewhere in the middle" to "right at the top." The model stops missing things.
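Here's a minimal sketch of steps 2-4, assuming you shell out to the memoclaw CLI and that it prints one memory per line. Both are assumptions, so check the actual output format of your installation:

```typescript
// Sketch: recall at session start, inject at the top of the system prompt.
// Shelling out to the CLI and the one-memory-per-line output format are
// assumptions, not the documented MemoClaw API.
import { execFileSync } from "node:child_process";

function recall(query: string): string[] {
  const out = execFileSync("memoclaw", ["recall", query], { encoding: "utf8" });
  return out.split("\n").filter(Boolean);
}

function buildSystemPrompt(task: string, basePrompt: string): string {
  const memories = recall(task); // 5-10 relevant memories, ~800 tokens
  return [
    "Relevant memories:",
    ...memories.map((m) => `- ${m}`),
    "",
    basePrompt, // the conversation follows, so memories sit at the very top
  ].join("\n");
}
```

Prepending rather than appending is the point: the primacy position is where the attention research says the model actually looks.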

The numbers

Here's a side-by-side for an agent running 30 sessions per day over a month:

| | MEMORY.md stuffing | MemoClaw retrieval |
|---|---|---|
| Tokens loaded per session | ~12,500 | ~800 |
| Monthly input token cost | ~$33.75 | ~$2.16 |
| MemoClaw API cost | $0 | ~$4.50 (30 recalls/day) |
| Total monthly cost | ~$33.75 | ~$6.66 |
| Relevant info position | Scattered | Top of context |
| Missed memories | Common (middle zone) | Rare (semantic match) |

You save about $27/month per agent and your agent actually remembers the things that matter.
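The table reduces to a few lines if you want to verify it, under the same assumptions as above ($3 per million input tokens, $0.005 per recall):

```typescript
// Monthly totals for 30 sessions/day over 30 days.
const sessions = 30 * 30;                           // 900 sessions
const stuffingCost = (sessions * 12_500 / 1e6) * 3; // ~$33.75
const retrievalTokens = (sessions * 800 / 1e6) * 3; // ~$2.16
const recallFees = sessions * 0.005;                // ~$4.50
console.log({ stuffingCost, retrievalCost: retrievalTokens + recallFees }); // ~$6.66
```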

Start with the expensive memories first

You don't have to migrate everything at once. Start with the memories your agent keeps forgetting:

  • User corrections ("I prefer tabs over spaces" stored with importance 0.9)
  • Project-specific context that only matters for one workspace
  • Preferences that were set months ago and keep getting lost in the file

Move those to MemoClaw, keep the rest in MEMORY.md for now, and see if your agent starts getting things right more often. If you've got an OpenClaw agent running, install the skill and run a migration:

```bash
memoclaw migrate ~/path/to/MEMORY.md --namespace my-project
```

Your context window is expensive real estate. Stop filling it with things the model won't read.


References:

  • Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (2023). arXiv:2307.03172
  • Pricing based on Anthropic Claude 3.5 Sonnet rates as of early 2026.

Top comments (2)

Matthew Hou

I ran into this exact problem a month ago — dumped everything into context, agent performed worse than when I gave it less information. Counterintuitive until you read the Stanford paper.

The practical solution I landed on: a three-layer system. Layer 1 is a short scratchpad that gets overwritten every cycle (current state only). Layer 2 is structured knowledge files organized by topic. Layer 3 is semantic search that pulls only what's relevant per query. The agent never sees the full corpus — it searches first, reads specific sections, then acts.

The biggest win was stopping the agent from reading its own full memory file at startup. Sounds obvious in retrospect, but "load everything on boot" is the default pattern everyone starts with. The retrieval-first approach cut my context usage by maybe 70% and the output quality actually went up.
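A skeletal version of that three-layer setup, with placeholder paths and a naive keyword-overlap search standing in for real semantic search:

```typescript
// Sketch of the three layers described above; file layout and the
// keyword-overlap "search" are placeholders, not a real library.
import * as fs from "node:fs";
import * as path from "node:path";

// Layer 1: scratchpad, overwritten every cycle (current state only).
const writeScratchpad = (state: string) =>
  fs.writeFileSync("memory/scratchpad.md", state);

// Layer 2: structured knowledge files organized by topic.
const topicsDir = "memory/topics";

// Layer 3: search first, read specific sections, then act. A real setup
// would use embeddings; keyword overlap stands in here to stay runnable.
function search(query: string, k: number): string[] {
  const terms = query.toLowerCase().split(/\s+/);
  return fs.readdirSync(topicsDir)
    .map((f) => fs.readFileSync(path.join(topicsDir, f), "utf8"))
    .flatMap((doc) => doc.split("\n\n")) // score per section, not per file
    .map((sec) => ({
      sec,
      score: terms.filter((t) => sec.toLowerCase().includes(t)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.sec);
}

// The agent never sees the full corpus: scratchpad + top-k sections only.
const buildContext = (task: string) =>
  [fs.readFileSync("memory/scratchpad.md", "utf8"), ...search(task, 5)].join("\n\n");
```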

signalstack

The multi-turn problem compounds this in a way that doesn't get enough attention. Even if you do careful top-loading of retrieved context at session start, every exchange adds tokens that push that context deeper into the middle. After 15-20 turns, your well-placed retrieved memories are sitting in the dead zone you were trying to avoid.

One approach I've found useful: re-retrieval at decision points, not just at session start. When the agent is about to make a significant architectural or implementation choice, trigger a fresh recall pass specifically for that decision type. It costs a bit more per session but keeps the relevant context in the primacy position when it matters.
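In sketch form, with recall as a stub for whatever retrieval call you actually use:

```typescript
// Sketch: fresh recall pass scoped to the decision type, triggered right
// before the decision rather than only at session start.
type DecisionKind = "architecture" | "implementation" | "dependency";

// Stub: swap in your actual retrieval call (CLI, vector DB, etc.).
const recall = (query: string): string[] => [];

function contextForDecision(kind: DecisionKind, task: string): string {
  const memories = recall(`past ${kind} decisions relevant to: ${task}`);
  // Prepend to the next prompt so the memories land in the primacy
  // position instead of 15-20 turns deep in the middle.
  return ["Relevant memories:", ...memories.map((m) => `- ${m}`)].join("\n");
}
```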

The token cost table is a good reality check. People focus on "is my agent getting good results" without tracking what they're spending to load context that isn't helping. $27/month per agent is real money at scale, and that's before you factor in the quality degradation from middle-zone attention loss.