Tisha

Posted on Jun 7

Why Agents Forget

#ai #llm #machinelearning #agents

Your coding agent is better than it was a year ago, and it still forgets.

The assistants added a memory layer in the meantime. GitHub Copilot now carries your conventions across sessions, and ChatGPT keeps a running profile of what you tell it. But those features remember preferences, not work: they hold "use camelCase," not "we ruled out Redis for the cache last sprint because of the eviction policy." The model underneath still starts every call from nothing.

You see it the first time you leave an agent running across a day. It boots up, runs the tests, watches them fail, and works out from scratch, again, that the database has to be seeded first. Yesterday's run already learned that. Today's run has no way to know it.

From first principles, the reason is not mysterious. A language model is a pure function: its output depends only on its weights and the text in front of it. Nothing carries from one call to the next, because there is nowhere for it to go. The weights freeze after training, and the context window clears when the call ends. Everything we call memory is a workaround for that one fact: we store text outside the model and feed it back in next time. The 2020 RAG paper named the underlying problem and filed "updating world knowledge" under open. Five years on, it still is.
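A minimal sketch of that workaround, under loud assumptions: `call_model` is a hypothetical stand-in for an LLM call (not a real API), and the "store" is a plain Python list. The point is only that the second run behaves differently because the prompt changed, not because the function did.

```python
from typing import List

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: the output depends only on the prompt."""
    if "seed the database" in prompt:
        return "seed, then run tests"
    return "run tests"  # without the note it has to rediscover the seeding step

def run_session(task: str, notes: List[str]) -> str:
    # "Memory" here is just stored text prepended to the next prompt.
    prompt = "\n".join(notes + [task])
    return call_model(prompt)

notes: List[str] = []
first = run_session("make the test suite pass", notes)
notes.append("lesson: seed the database before the tests")  # written outside the model
second = run_session("make the test suite pass", notes)
```

`first` and `second` differ only because `notes` grew between the two calls; `call_model` itself never changed.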

This was a footnote when agents answered one question and stopped. It is the main event now. We give agents work that runs for hours across many sessions, and an agent that carries nothing forward cannot improve. It can only get faster at starting over.

So the question this post takes seriously: how does an agent learn from what it already did? Every claim links to a primary source, because the field is loud and the details are where the truth is.

The north star

Where are we actually trying to go? Separate the two things people call memory, because they are not the same operation.

	Persistence	Learning
What changes	the input (the prompt)	the function (the weights)
How	store facts, read them back	update the model from experience
Example	Copilot recalls your style	a model sharper at the task than last month
Where we are	shipping today	open frontier

Almost everything shipping today is the left column. Copilot recalls that you prefer camelCase. ChatGPT recalls your kid's name. Real and useful, and not the same as a model that is more capable in month six than in month one.

The north star is the right column: an agent that detects it keeps making the same mistake and stops, that internalizes the structure of your codebase and the failure modes of your stack. Anthropic framed memory in a recent talk as the primitive they believe turns agents into systems that improve from their own experience.

Are we there? No, and it is worth being exact about why. What we call learning today changes the text we feed the model, not the model. The weights are identical before and after. The competence is fixed; only the context improves. Most of the confusion in this space comes from not naming which column a system is in.

Where we are

How does an agent remember today? Reduce it to the two operations that matter and the rest is detail. On the way in, retrieval: what to pull into the context window. On the way out, writing: what to persist. Memory is the engineering around those two decisions, plus an offline pass that cleans up between runs.

Start with the constraint that forces the whole design. The context window is the only channel for new information at inference, and it is finite. The MemGPT paper frames this as an operating system managing scarce memory: the window is RAM, the external store is disk, the system pages between them. That is the one analogy worth keeping, because it is the architecture, not decoration.
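To make the RAM-and-disk analogy concrete, here is a toy sketch. It is not MemGPT's implementation: the `PagedContext` class, the whitespace token count, and the keyword recall are all simplifying assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, List

class PagedContext:
    """Toy RAM/disk split: a bounded window that pages its oldest items out."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.window: Deque[str] = deque()   # "RAM": what the model would see
        self.archive: List[str] = []        # "disk": the external store

    @staticmethod
    def tokens(text: str) -> int:
        return len(text.split())            # crude whitespace count (assumption)

    def used(self) -> int:
        return sum(self.tokens(m) for m in self.window)

    def add(self, message: str) -> None:
        self.window.append(message)
        # Evict the oldest messages until the window fits the budget again.
        while self.used() > self.max_tokens and len(self.window) > 1:
            self.archive.append(self.window.popleft())

    def recall(self, keyword: str) -> List[str]:
        # Naive lookup: archived messages that mention the keyword.
        return [m for m in self.archive if keyword in m]

ctx = PagedContext(max_tokens=8)
ctx.add("seed the database before tests")   # 5 tokens
ctx.add("service runs on Postgres 14")      # 5 more: over budget, oldest is paged out
```

After the second `add`, the seeding note lives only in `ctx.archive`, and the agent sees it again only if something calls `ctx.recall("seed")` and puts the result back in the prompt.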

What are we storing, exactly? CoALA names four kinds, and each maps to a different store and a different update rhythm.

Memory tier	Cognitive type	Infrastructure	Updated
In-context	Working	Context window (RAM)	Every model turn
Session log	Episodic	Temporal graph / append-only log	Live, during the session
Knowledge base	Semantic	Vector DB / document store	Read-heavy, slow writes
Runbooks & rules	Procedural	File system (markdown / wiki)	Offline consolidation
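
One way to write the table above down as a data structure, so that a write path can ask where a given kind of memory belongs. The names `MemoryKind`, `Tier`, and `tier_for` are made up for this sketch; only the row contents come from the table.

```python
from dataclasses import dataclass
from enum import Enum

class MemoryKind(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"

@dataclass(frozen=True)
class Tier:
    name: str
    store: str
    updated: str

# Rows copied from the table above.
TIERS = {
    MemoryKind.WORKING: Tier("In-context", "context window", "every model turn"),
    MemoryKind.EPISODIC: Tier("Session log", "temporal graph / append-only log", "live, during the session"),
    MemoryKind.SEMANTIC: Tier("Knowledge base", "vector DB / document store", "read-heavy, slow writes"),
    MemoryKind.PROCEDURAL: Tier("Runbooks & rules", "file system (markdown / wiki)", "offline consolidation"),
}

def tier_for(kind: MemoryKind) -> Tier:
    # Look up which store and update rhythm a kind of memory maps to.
    return TIERS[kind]
```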

"Seed the database before the tests" is procedural. "This service runs on Postgres 14" is semantic. "The March migration broke staging" is episodic.

How does the right memory reach the window? The base case is RAG: split knowledge into the parametric memory in the weights and a non-parametric store you can search and update without retraining. In practice that store is a vector index. Embed everything, and at query time pull the passages nearest the query in vector space.
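A minimal sketch of that nearest-passage lookup. The big assumption: `embed` here is just a bag of word counts so the example runs without dependencies, whereas a real system would use a learned embedding model and a vector index.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    # Stand-in "embedding": word counts. Real systems use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str, passages: List[str], k: int = 1) -> List[Tuple[float, str]]:
    # Score every passage against the query and keep the k nearest.
    q = embed(query)
    scored = [(cosine(q, embed(p)), p) for p in passages]
    return sorted(scored, reverse=True)[:k]

passages = [
    "seed the database before running the tests",
    "the service runs on postgres 14",
    "the march migration broke staging",
]
top = search("which postgres version does the service run", passages)
```

Here the Postgres passage wins because it shares the most words with the query. Nothing in the score knows whether that passage is still true, which is where the trouble starts.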

The trap is treating nearest as most useful. Similarity is not relevance, and one store cannot serve two different questions. A real retrieval step routes across stores in parallel, to stay inside a latency budget:

import asyncio

# graph, vectors, conflicts, reconcile, and assemble are placeholders
# for your own stores and helpers; they are not defined here.
async def build_context(query, session_id):
    # episodic and semantic live in different stores; fetch concurrently
    episodic, semantic = await asyncio.gather(
        graph.recent_events(session_id, limit=5),            # what just happened, in order
        vectors.search(query, filter={"status": "active"}),  # facts, minus the retired ones
    )
    if conflicts(semantic, episodic):     # cheap guard against the staleness failure below
        semantic = reconcile(semantic, episodic)
    return assemble(working=query, episodic=episodic, semantic=semantic)

Episodic state from a temporal store, semantic facts from a vector store, a metadata filter so retired facts never surface. Zep productizes exactly this pairing of a temporal graph with semantic search.

Where does memory come from? Hand-written notes cover what never changes and nothing else. The interesting systems generate it. Copilot proposes memories from your sessions for you to approve. A-MEM links each new note to related ones and lets a new write revise the old ones, so the store reorganizes as it grows. And memory has to forget, or it fills with noise; MemoryBank applies an Ebbinghaus forgetting curve so entries decay with age and strengthen with use.
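A sketch of decay-with-age, strengthen-with-use, assuming the common forgetting-curve form R = exp(-t / S). The `Entry` class, the "add 1 to S on each recall" rule, and the 0.05 pruning threshold are illustrative choices, not MemoryBank's exact procedure.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Entry:
    text: str
    strength: float = 1.0       # S: grows each time the entry is used
    last_used_day: float = 0.0

    def retention(self, today: float) -> float:
        # Forgetting-curve form R = exp(-t / S), with t in days since last use.
        return math.exp(-(today - self.last_used_day) / self.strength)

    def recall(self, today: float) -> None:
        # Using an entry resets its clock and makes it decay more slowly.
        self.strength += 1.0
        self.last_used_day = today

def prune(entries: List[Entry], today: float, threshold: float = 0.05) -> List[Entry]:
    # Keep only entries whose retention is still above the threshold.
    return [e for e in entries if e.retention(today) >= threshold]

used = Entry("seed the database before the tests")
unused = Entry("temporary debug flag was enabled")
for day in (1, 2, 3):
    used.recall(day)            # recalled three times, so strength becomes 4
kept = prune([used, unused], today=10)
```

By day 10 the note that kept getting used survives and the one nobody touched has decayed below the threshold.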

When does the cleanup run? Increasingly offline. Generative Agents introduced reflection in 2023, periodically synthesizing raw observations into higher-level conclusions. Letta turned it into sleep-time compute, doing the thinking while no request is waiting, which cut the query-time compute needed for the same accuracy by roughly 5x on their benchmarks. Anthropic ships a version it calls dreaming that mines recent sessions and curates shared memory between runs; the specifics come from a talk, so treat them as preliminary. The principle is constant: move the expensive remembering off the critical path.
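A deliberately simple sketch of an offline pass, to show the shape rather than any vendor's method. It assumes each session already produced a list of short lesson strings, and it promotes a lesson to the runbook once it has shown up in at least two sessions. Real systems typically use a model to synthesize and rewrite; the `consolidate` function and its threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List

def consolidate(session_logs: Dict[str, List[str]], min_sessions: int = 2) -> List[str]:
    """Promote lessons that recur across sessions into a runbook (run between sessions)."""
    seen_in: Counter = Counter()
    for lessons in session_logs.values():
        for lesson in set(lessons):          # count each lesson once per session
            seen_in[lesson] += 1
    return sorted(lesson for lesson, n in seen_in.items() if n >= min_sessions)

logs = {
    "monday":  ["tests fail until the database is seeded", "lint step is slow"],
    "tuesday": ["tests fail until the database is seeded"],
}
runbook = consolidate(logs)
```

Because this runs between sessions, none of the counting sits on the path of a live request.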

The problems

So why is this not solved? Each layer has a failure mode the demos skip. There are six worth knowing.

Failure	Why it happens	The fix
Staleness	similarity has no clock	recency decay + a temporal store
Bloat	save everything, bury the signal	importance scoring, active forgetting
Evaluation	quality was self-reported	your own eval, from day one
Cost	every call re-bills the whole context (prefill)	retrieve less, cache, curate harder
Context rot	attention thins as the window grows	minimal high-signal context
Fleet conflicts	many agents, one store	optimistic concurrency + versioning

Three of these deserve more than a row.

Staleness is the one that bites in production. The agent learned Postgres 13. You moved to 14. The old note is still the nearest match, so the agent retrieves it with confidence. Vector similarity has no concept of time. Generative Agents already showed the fix: score memories not by similarity alone but by combining relevance with recency and importance. Keeping only the recency term and reducing it to one line, the score is

score(m) = similarity(query, m) · e^(−λ · Δt)

where Δt is time since the memory was last verified and λ is a decay constant you tune per fact type: high for infra versions, near zero for a person's name.

Add the metadata filter from the router and a temporal store that records when each fact was true, and staleness goes from silent to detectable. The exponential-decay form is the same idea MemoryBank borrowed from the forgetting curve.
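The formula above, worked through with made-up numbers. The similarities, ages, and decay constants are hypothetical; the function is a direct transcription of score(m) = similarity(query, m) · e^(−λ · Δt).

```python
import math

def score(similarity: float, days_since_verified: float, decay: float) -> float:
    # score(m) = similarity(query, m) * exp(-lambda * delta_t)
    return similarity * math.exp(-decay * days_since_verified)

# Hypothetical numbers: the stale note is the closer match but was verified long ago.
stale = score(similarity=0.90, days_since_verified=180, decay=0.01)  # "runs on Postgres 13"
fresh = score(similarity=0.80, days_since_verified=5, decay=0.01)    # "runs on Postgres 14"
name = score(similarity=0.90, days_since_verified=180, decay=0.0)    # a fact that should not decay
```

With these inputs the stale note drops to about 0.15 while the fresh one stays near 0.76, so the current fact outranks the closer but older match. With λ = 0 the score is just the similarity, which is what you want for a person's name.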

Context rot is the subtle one. It is not a metaphor; it is the term Anthropic uses in its context engineering guidance: as the number of tokens in the context window increases, the model's ability to accurately recall from that context decreases. So stuffing in more retrieved memory past a point makes answers worse, not better. More tokens are not more intelligence. Selection is the whole job.

Cost is the one that quietly kills long-running agents. Every call re-bills the entire context as input tokens, and processing those input tokens, the prefill, is most of what you pay for on a long history. Prompt caching softens the repeated prefill but does not erase it. An agent leaning on a bloated memory pays that tax on every turn, and past a point the workflow is not slow, it is financially non-viable. The discipline is the same one context rot demands: feed the model the smallest set of high-signal tokens, not everything you retrieved.
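A back-of-the-envelope sketch of why that tax compounds. It counts input tokens only, with hypothetical sizes (a 2,000-token base prompt, 1,000 new tokens of history per turn, 100 turns), and it ignores prompt caching and pricing entirely.

```python
def total_input_tokens(turns: int, base: int, per_turn: int) -> int:
    # Turn i resends the base context plus everything accumulated so far.
    return sum(base + i * per_turn for i in range(turns))

# Agent that resends its full, growing history on every turn.
full_history = total_input_tokens(turns=100, base=2_000, per_turn=1_000)

# Agent that sends a fixed-size context: the base plus a curated 3,000-token selection.
curated = 100 * (2_000 + 3_000)
```

Under these assumptions the full-history agent sends 5,150,000 input tokens over the run against 500,000 for the curated one. The first total grows with the square of the number of turns; the second grows linearly.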

Under all six sits the real one. None of this changes the model. We are refining the text we hand a fixed function, because changing the function means updating weights from live experience, and nobody has shown how to do that safely, cheaply, and without the model drifting. Every technique above is a way to avoid that wall.

How we get there

So how do we close the gap? Three horizons.

Horizon	What it looks like	Status
Now	hybrid retrieval, timestamps, consolidation, your own evals	available today
Medium	memory as infrastructure: permissions, versioning, audit	emerging
Long	the model learns into its own weights at runtime	open frontier

If you're building one today:

Retrieve with vectors and keywords, so exact names and terms aren't lost to fuzzy similarity.
Timestamp every fact and decay it at retrieval, so a stale note can't outrank the current one.
Add an offline pass between sessions, so the agent stops relearning the same lessons every run.
Write your own eval first, because a leaderboard score is not your workload.
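
For the first item in that list, one common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion. That technique is my choice for the sketch, not something the sources above prescribe. The `vector_ranking` list is an assumed output of a vector search, and `keyword_rank` is a toy exact-term matcher.

```python
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

def keyword_rank(query: str, docs: List[str]) -> List[str]:
    # Exact-term matching: rank by how many query words appear verbatim.
    terms = query.lower().split()
    hits = {d: sum(t in d.lower().split() for t in terms) for d in docs}
    return [d for d in sorted(docs, key=lambda d: -hits[d]) if hits[d] > 0]

docs = [
    "rotate the PGBOUNCER_AUTH secret monthly",
    "database credentials policy",
    "cache eviction notes",
]
# Assumed output of a vector search: the fuzzy match ranks a related doc first.
vector_ranking = ["database credentials policy", "rotate the PGBOUNCER_AUTH secret monthly"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_rank("PGBOUNCER_AUTH", docs)])
```

The document containing the exact identifier ends up first after fusion, even though the vector ranking alone put it second.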

Medium term, memory becomes infrastructure. Permissions, so an agent can read the runbook but not corrupt it. Versioning and audit logs, so you can see what it stored, when, and why, and roll it back when it is wrong. Memory stops being a text blob and starts being a system with history.
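
A small sketch of what "a system with history" could mean, and of the optimistic concurrency fix from the failure table. Everything here is hypothetical (the `VersionedMemory` class, its in-memory dict, the agent names), and permissions are left out to keep it short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

class VersionConflict(Exception):
    pass

@dataclass
class VersionedMemory:
    values: Dict[str, Tuple[int, str]] = field(default_factory=dict)      # key -> (version, text)
    audit: List[Tuple[str, int, str, str]] = field(default_factory=list)  # (key, version, text, agent)

    def read(self, key: str) -> Tuple[int, str]:
        return self.values.get(key, (0, ""))

    def write(self, key: str, text: str, expected_version: int, agent: str) -> int:
        current, _ = self.read(key)
        if current != expected_version:          # someone else wrote since we read
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current}")
        self.values[key] = (current + 1, text)
        self.audit.append((key, current + 1, text, agent))
        return current + 1

    def rollback(self, key: str, to_version: int, agent: str) -> int:
        # Restore an earlier text as a new version, so history stays append-only.
        old = next(t for k, v, t, _ in self.audit if k == key and v == to_version)
        return self.write(key, old, self.read(key)[0], agent)

store = VersionedMemory()
v1 = store.write("db.version", "Postgres 13", expected_version=0, agent="agent-a")
v2 = store.write("db.version", "Postgres 14", expected_version=v1, agent="agent-b")
try:
    store.write("db.version", "Postgres 12", expected_version=v1, agent="agent-c")  # stale read
    conflict = False
except VersionConflict:
    conflict = True
```

The third write is rejected because `agent-c` based it on a version that is no longer current, and the audit list records who wrote what, in order, so a bad entry can be rolled back rather than silently overwritten.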

Long term is the open frontier. To close the loop, the agent has to learn into the model, not into a file beside it. Google's Titans is an early move: a neural memory that updates as it runs, using a surprise signal to decide what to commit, attention serving as short-term memory and the module as long-term. But writing experience back into weights without the model drifting or degrading is unsolved, and anyone claiming otherwise is selling something.

That is the map. Persistence is largely solved. Consolidation is arriving. Learning into the weights is the frontier, and it is where "self-learning agent" either earns the phrase or stays a slogan.

The point

We did not make agents forgetful on purpose. It falls out of how the models work, and nearly everything we have built compensates from the outside. It works well now. A modern agent on a real memory system can answer as if it knows you.

Knowing you is not the same as improving. The day an agent stops needing the note, because the lesson is actually in the model, is the day this stops being a workaround and becomes memory. We are not there. Now you know the exact shape of the gap, and why closing it is the whole project.

Sources: MemGPT · RAG · CoALA · Generative Agents · Titans · Zep · A-MEM · MemoryBank · Sleep-time Compute · Anthropic: context engineering · Anthropic: Memory and dreaming (talk; preliminary) · OpenAI: ChatGPT memory · GitHub Copilot Memory