DEV Community

pickuma
pickuma

Posted on • Originally published at pickuma.com

Why AI Agents Forget: Memory Decay and Context Contamination Explained

You give your coding agent a clear objective — refactor an authentication module, add pagination to three endpoints, fix a flaky test suite. Forty tool calls and twenty minutes later it produces something that technically compiles, except it has silently forgotten a constraint you specified in turn two, contradicts a design decision from turn eight, and calls a helper function that it deleted three steps ago. The agent did not hallucinate in the usual sense. It ran out of usable memory.

This is not a rare edge case. It is a structural property of how transformer-based agents manage state, and it gets worse the longer an agent runs. Understanding the mechanics helps you design around the failure rather than being surprised by it.

The context window is RAM, not a database

Every token an LLM processes — your instructions, tool outputs, intermediate reasoning, code snippets — lives inside a fixed-size context window. Current frontier models vary widely: GPT-4o, Claude Sonnet, and Gemini each support context windows measured in hundreds of thousands of tokens, and some configurations extend into millions. That sounds enormous until you account for how fast a coding agent burns through it.

A realistic 50-step workflow, where each step involves a tool call with a moderately verbose output, can consume well over a million tokens in aggregate. Those tokens do not accumulate neatly; at each step the model must fit everything — prior turns, current state, and the new output — inside a single window. When the window fills, something has to give: either the agent truncates early context, or it hits a hard limit and fails.

The deeper problem is that the context window behaves more like RAM than persistent storage. Information is volatile. It degrades under load. And — unlike a database — you cannot efficiently index or retrieve specific entries from it. You put things in and hope the model attends to the right ones.

Three failure modes worth naming

Memory decay from positional bias

When a fact is stated early in a long context and many tokens accumulate after it, the model's attention to that fact decreases. This is not a hypothesis; it is a measurable phenomenon tied to transformer architecture. Researchers first characterized what they called the "lost in the middle" effect in a 2023 paper from Stanford and UC Berkeley, which showed that retrieval accuracy follows a U-shaped curve: information at the very beginning and the very end of a context window gets attended to most reliably, while content in the middle is substantially deprioritized.

A 2026 study by Yeran Gamage quantified the downstream effect on agent behavior across 4,416 trials at six conversation depths: constraint compliance dropped from 73% at turn 5 to 33% at turn 16 without memory mitigation. That is not a subtle degradation — it means an agent halfway through a complex task is violating its own instructions roughly two times as often as it was at the start. Critically, the agent does not know this is happening. It keeps running, confidently producing output that ignores earlier constraints.

Context contamination

Even when information stays within the window, it can poison future reasoning. Two mechanisms drive this.

The first is stale data. If a tool returns a code listing that the agent then modifies, the old listing is still sitting in context. Subsequent reasoning now has two versions of the same function — the original and the revised one — competing for the model's attention. The agent may reference the old version inadvertently, producing edits that target code that no longer exists.

The second is noise accumulation. Long agent traces include failed attempts, corrective feedback loops, half-completed thoughts, and tool errors. Anthropic's engineering team describes this as a "context rot" dynamic: as traces grow, the ratio of useful signal to accumulated noise falls, and model performance degrades even when the window limit has not been reached. The architecture's O(n²) attention scaling means every new token must interact with every prior token, and low-quality prior tokens drag on the computation.

Compaction hallucinations

When agents summarize prior context to make room for new work — a technique called compaction — they introduce a new failure mode. Summarization is lossy by design. If the model mis-summarizes even a small detail — a variable name, a constraint, an API signature — that error propagates forward as if it were ground truth. Minor inaccuracies in compaction output can contaminate the entire remainder of the session.

Compaction is not a free lunch. When an agent automatically summarizes its conversation history, any factual error in that summary gets treated as authoritative context going forward. If you are working on a long-running task and the agent suddenly seems to have forgotten a constraint or misidentified a function, compaction may have introduced a subtle error earlier in the trace. Checking the compacted summary — if your tool exposes it — is worth the time.

What mitigation actually looks like

There is no single fix, and any solution involves tradeoffs between context cost, latency, and the risk of information loss. The patterns below are not mutually exclusive — production agent systems typically combine several.

Just-in-time retrieval instead of pre-loading

Rather than dumping all relevant code, documentation, and prior state into the initial prompt, agents can use lightweight identifiers — file paths, function names, schema names — and retrieve the actual content only when they need it. This keeps the context window lean and pushes retrieval cost to tool calls, which are cheap relative to context-window real estate. Anthropic calls this pattern "just-in-time context retrieval" in their context engineering guidance.

The tradeoff: retrieval requires good tooling. The agent must know what to retrieve and when. If retrieval granularity is too coarse, you just re-introduce the noise problem via tool output instead of pre-loaded context.

Scoped sub-agents with narrow objectives

Rather than running one agent through an entire multi-hour task, you decompose the work into focused sub-tasks and spawn separate agents for each. Each sub-agent gets a clean context window, a narrow objective, and access only to the tools it needs. The parent agent receives only the final output — typically a condensed summary of 1,000–2,000 tokens — rather than the full trace.

This architecture prevents context explosion and makes individual steps easier to reason about. Claude Code's sub-agent pattern works exactly this way: each spawned agent's intermediate tool calls and reasoning stay isolated in its own window and never pollute the parent's context. The isolation can go further — agents can be forked into separate git worktrees so their file edits do not interfere with the main checkout.

The tradeoff: decomposing a task correctly requires upfront design. Sub-tasks that are not truly independent will need to share state somehow, which reintroduces the coordination problem at a different layer.

External memory with selective promotion

Instead of relying on context window contents as the sole source of agent memory, you can maintain a persistent external store — a vector database, a structured key-value store, or even a flat file — and give the agent tools to read and write it. Important decisions, constraints, and intermediate results get written to external memory explicitly. At the start of each turn, the agent retrieves only what is relevant to the current step.

The "selective promotion" part matters. If every interaction gets stored, retrieval quality degrades as the store grows stale. Effective systems let the model evaluate each interaction for salience before persisting it, then periodically prune outdated facts and merge duplicates.

If you are building an agent that runs multi-hour tasks, treat constraint pinning as a first-class concern. Hard constraints — API rate limits, forbidden patterns, required reviewers, output format rules — should be re-injected into the context on every call at the system-prompt level, not stated once at the beginning of the conversation. Positional decay will erode constraint compliance if you rely on the model remembering something it saw 100,000 tokens ago.

Structured compaction over naive truncation

When compaction is unavoidable, the quality of the summary prompt matters enormously. Good compaction separates what must be preserved verbatim (active constraints, unresolved errors, API signatures the agent is currently working with) from what can be summarized lossy (completed sub-tasks, exploratory paths that were abandoned). Some systems use hierarchical summarization: long traces are broken into chunks, each chunk is summarized independently, and then the chunk summaries are summarized into a top-level digest.

The goal is maximizing recall of critical facts while minimizing token cost. If your compaction prompt is generic ("summarize the conversation"), you will get generic output that drops edge cases. If it is specific ("list every constraint stated, every open error, and every file modified"), you preserve what the agent actually needs.

What this means for how you use agents today

The practical upshot is that long-running agents are not just a UX problem — they are an architecture problem. Treating a coding agent as a stateful assistant that remembers everything from session start sets you up for failures that are hard to diagnose because the agent does not announce what it has forgotten.

Short, scoped tasks with explicit constraints repeated at relevant points outperform long open-ended sessions on nearly every quality metric. When you do need multi-step workflows, check whether your tool exposes compaction summaries or memory state, and treat those as artifacts worth inspecting — not plumbing to ignore. The agent's confidence at turn 30 is not evidence that it has an accurate model of turn 3.

Context engineering — the practice of deliberately curating what goes into the window, when, and in what form — is becoming as important as prompt engineering. The window is finite. Everything you put in it costs attention, and everything you omit is a retrieval problem waiting to happen.


Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.

Top comments (0)