Notes from building a memory layer that forgets on purpose.
Most "memory-enabled" agents don't remember anything. They re-read.
Every turn, the whole conversation gets pasted back into the prompt, and we call that memory because the model can answer questions about earlier turns. It's a good trick. I used it for months. It also falls apart the moment real people start using the thing, and it falls apart in three separate ways.
The first is the one everyone notices: it's expensive and noisy. You re-send every prior turn on every request. The single line you actually care about - "I'm allergic to peanuts" - is buried under a thousand lines of small talk, and you pay for all of it, every time.
The second is quieter and worse. Transcript-stuffing has no idea what stale means. If someone told your agent "I'm vegetarian" in March and "I eat fish now" in May, you've just handed the model both facts with equal weight. Now it has to guess which one is current. Sometimes it guesses wrong, and there's nothing in the system that even thinks that's a problem.
The third one is the reason I stopped treating this as a side quest. When you finally add summarization to control the cost from problem one, the summarizer is free to drop whatever it wants to save tokens. Including the allergy. I spent years around fintech, where the wrong record surviving (or the right one quietly vanishing) is how people get hurt, so this landed hard: forgetting an allergy to save 40 tokens isn't a cost bug. It's a safety bug wearing a cost bug's clothes.
So the question I actually wanted to answer wasn't "how do I make my agent remember more." It was: how do I build something where acting on a fact the user already retracted and silently dropping a fact that must survive are impossible by construction, not just unlikely if the prompt is good that day.
Everyone has already solved one third of this
The encouraging part is that you don't have to invent much. The discouraging part is that every existing system is excellent at one axis and weak at exactly the one the problem needs most, which is forgetting.
I read through a pile of them, and the pattern was almost funny:
- MemPalace never throws anything away. Beautiful instinct for a safety net, terrible instinct for a context window - it keeps everything in the prompt.
- Zep does fact-validity properly: facts have a valid_until, so superseding is a first-class idea. It also wants you to run a graph database, which is a lot of operational weight for a hackathon, or honestly for most teams.
- Obsidian and the GSD planning style nail the portable, human-readable, git-versioned knowledge layer. Their flaw is making that the only layer, with no semantic recall and no decay.
- Mem0 does compact, typed extraction, which is the right shape - but on the hot path, so writes get expensive.
- Letta / MemGPT treat the context window like an OS treats RAM, paging memory in and out. Smart. But the paging logic lives inside the agent's reasoning, which is a strange place for it.
The lesson I took from staring at all five: be deeply unoriginal in your parts and opinionated in your assembly. Take Zep's valid_until but skip the graph DB. Take MemPalace's "never destroy data" but demote it to cold storage instead of the prompt. Take Obsidian's markdown layer but make it the canonical tier, not the working one. None of the pieces are mine. The arrangement is the whole contribution.
The shape that fell out
What I ended up with is three tiers and a buffer, two phases, and one small set of swappable adapters. I'll spare you the full architecture and give you the mental model, because the mental model is the part that's actually portable to whatever you build.
Three tiers. There's a raw, append-only log of every turn that never goes into the prompt - it's the cold safety net and the audit trail. There's a working-memory tier of typed records (fact, preference, event, procedure) that is the only thing recall ever touches. And there's a canonical, human-readable markdown tier for the stable stuff, the long-term user model you'd actually want to read with your own eyes.
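To make the three tiers concrete, here is a minimal sketch of how they might be laid out. The class and field names (Record, MemoryStore, recall_candidates) are my own illustration and not the engine's actual schema; the only things taken from the description above are the four record types and the rule that recall reads working memory and nothing else.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    EVENT = "event"
    PROCEDURE = "procedure"


@dataclass
class Record:
    """One typed working-memory record (hypothetical shape)."""
    id: int
    kind: Kind
    text: str
    protected: bool = False
    valid_until: Optional[float] = None  # None means the record is still live


@dataclass
class MemoryStore:
    raw_log: list = field(default_factory=list)  # tier 1: append-only, never prompted
    working: list = field(default_factory=list)  # tier 2: the only tier recall reads
    canonical_md: str = ""                       # tier 3: human-readable markdown

    def append_turn(self, role: str, text: str) -> None:
        # The raw log only ever grows; it is the audit trail.
        self.raw_log.append((role, text))

    def recall_candidates(self) -> list:
        # Recall touches working memory only, and only live records.
        return [r for r in self.working if r.valid_until is None]
```

The point of the sketch is the boundary: nothing in recall_candidates can reach the raw log or the markdown tier, so neither can leak into the prompt by accident.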
Two phases, and this is the bit I'd tattoo on a new engineer. The fast write path, the thing that runs on every turn, does almost nothing: it stores the raw turn and makes at most one embedding call. No reasoning model. The slow path runs offline, on a cheap "flash" model, and does all the hard work - extracting clean records, resolving entities, retiring contradictions, running decay. Cheap model curates, expensive model only reasons. That single split is what lets the agent feel instant and still cost almost nothing.
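A rough sketch of that split, under heavy assumptions: the class name TwoPhaseMemory, the pending queue, and toy_extract are all hypothetical, the code is synchronous for brevity, and the embedding call is left out entirely. The toy extractor is a string rule standing in for the cheap offline model.

```python
from collections import deque
from typing import Callable, List


class TwoPhaseMemory:
    def __init__(self, extract: Callable[[str], List[str]]):
        self.raw_log: List[str] = []   # append-only log of every turn
        self.pending: deque = deque()  # turns awaiting consolidation
        self.records: List[str] = []   # searchable working memory
        self.extract = extract         # stand-in for the cheap offline model
        self.model_calls = 0

    def remember(self, turn: str) -> None:
        # Fast path: store and queue only. No model is called here.
        self.raw_log.append(turn)
        self.pending.append(turn)

    def consolidate(self) -> int:
        # Slow path: drain the queue and let the cheap model extract records.
        added = 0
        while self.pending:
            turn = self.pending.popleft()
            self.model_calls += 1
            for rec in self.extract(turn):
                if rec not in self.records:
                    self.records.append(rec)
                    added += 1
        return added


def toy_extract(turn: str) -> List[str]:
    # Toy rule standing in for an LLM: keep sentences that mention "allergic".
    return [s.strip() for s in turn.split(".") if "allergic" in s.lower()]
```

Note what the sketch also reproduces: between remember and consolidate, records is empty, which is exactly the gap the next section is about.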
The mistake that taught me the most
Here's where I'll be honest, because a blog that only lists the things that worked is just marketing with footnotes.
I was so pleased with "the write path does almost nothing" that I made the write path do too little. In the first version, a turn went into the raw log and got queued for offline consolidation, and that was it. Clean. Cheap. Broken.
Because if you tell the agent something and then ask about it one second later, consolidation hasn't run yet. The record doesn't exist in the searchable tier. Recall comes back empty. To a user, that doesn't read as "eventual consistency," it reads as "the app is broken and doesn't listen to me."
The fix wasn't clever, which is the point. I added a recent-session buffer - just the last handful of turns the agent is already holding - and unioned it into recall. Zero extra cost, and it covers the overwhelming majority of "I just said that" moments. For the rarer case where you say something in one session and ask in the next before the offline job runs, I let durable-sounding claims write one provisional record immediately (one embedding, still no reasoning model), which consolidation reconciles later.
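Here is a small sketch of the buffer-union idea only; it does not cover the provisional-record path. RecallWithBuffer and its keyword matching are assumptions made for illustration: real recall would be semantic search over embeddings, and the buffer size of six is an arbitrary pick.

```python
from collections import deque
from typing import List


class RecallWithBuffer:
    def __init__(self, buffer_size: int = 6):
        self.consolidated: List[str] = []        # records the offline job has produced
        self.recent = deque(maxlen=buffer_size)  # last few raw turns of this session

    def remember(self, turn: str) -> None:
        # Old turns fall off the left end once the buffer is full.
        self.recent.append(turn)

    def recall(self, query: str) -> List[str]:
        words = {w.strip("?.,!").lower() for w in query.split()}
        words = {w for w in words if len(w) > 3}  # toy keyword match, not semantic search

        def matches(text: str) -> bool:
            return any(w in text.lower() for w in words)

        hits = [r for r in self.consolidated if matches(r)]
        # Union in the session buffer, skipping anything already returned.
        hits += [t for t in self.recent if matches(t) and t not in hits]
        return hits
```

Because the buffer is just turns the agent already holds in memory, the union adds no model or embedding calls to the read path.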
The lesson generalizes well beyond memory: a cost optimization that quietly degrades the felt experience is not an optimization. It's a bug you're proud of.
Make the guarantee structural, not best-effort
The other thing I changed my mind about is how you keep a promise like "never forget an allergy."
The tempting version is best-effort: score every memory by importance, and tune the eviction threshold so that important things survive. That's how you end up dropping the allergy on a bad day. There's a branch in the code where a critical fact can fall through, and "can" eventually means "will."
So I removed the branch. Safety facts get marked protected, and the eviction pass skips protected records before any scoring math runs at all. There is no comparison they can lose. A property-based test asserts that for any randomly generated pile of records, no protected one is ever evicted - and the thing that decides "protected" is a content rule in the code, not the LLM judging its own output. I don't trust the model to tell me what's important. I trust a rule I can read.
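A sketch of what "removing the branch" can look like, with the caveat that everything here is illustrative: the SAFETY_TERMS tuple, the Record shape, and the evict signature are my assumptions, and the randomized check uses the standard library's random module as a stand-in for a real property-based testing library such as Hypothesis.

```python
import random
from dataclasses import dataclass
from typing import List

SAFETY_TERMS = ("allergic", "allergy", "anaphylaxis")  # illustrative rule, not the real list


@dataclass
class Record:
    text: str
    importance: float
    protected: bool = False


def mark_protected(record: Record) -> Record:
    # Content rule in plain code: no model decides what counts as protected.
    record.protected = any(term in record.text.lower() for term in SAFETY_TERMS)
    return record


def evict(records: List[Record], keep: int) -> List[Record]:
    # Protected records are set aside before any score is compared.
    protected = [r for r in records if r.protected]
    candidates = [r for r in records if not r.protected]
    candidates.sort(key=lambda r: r.importance, reverse=True)
    room = max(keep - len(protected), 0)
    return protected + candidates[:room]


def check_protected_survive(trials: int = 200, seed: int = 0) -> bool:
    # Randomized check: for many random piles, every protected record survives.
    rng = random.Random(seed)
    texts = ["I'm allergic to peanuts", "nice weather", "I like jazz"]
    for _ in range(trials):
        records = [
            mark_protected(Record(text=rng.choice(texts), importance=rng.random()))
            for _ in range(rng.randint(0, 30))
        ]
        survivors = evict(records, keep=rng.randint(0, 10))
        if sum(r.protected for r in survivors) != sum(r.protected for r in records):
            return False
    return True
```

One consequence worth naming: when there are more protected records than the budget allows, this sketch returns more than keep records rather than dropping one. That is the trade the guarantee implies.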
Same spirit for supersession. When a contradicting claim shows up, the old record gets valid_until set and its vector pulled from the index, and recall only ever queries live records. The retired fact isn't deleted - it's still there, with a pointer to what replaced it, fully auditable - it just can't resurface. You can watch it happen:
```python
await alice.remember("Actually I stopped keto, I eat balanced now", session_id="s2")
await engine.consolidate(user_id="alice")
diet = await alice.recall("what's my current diet?", budget=200)
# surfaces "balanced"; the keto record is retired and never returned
```
That five-second demo - state a change, watch the old preference stop influencing answers - turned out to be far more convincing than any number I could put on a slide. Which leads to the last thing I got wrong.
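For readers who want the supersession mechanics behind that demo spelled out, here is a minimal in-memory sketch. It assumes the contradiction has already been detected (that is the offline model's job and is not shown), and the names Store, supersede, and superseded_by are hypothetical; a plain dict stands in for the vector index.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    id: int
    text: str
    valid_until: Optional[float] = None  # set when the record is retired
    superseded_by: Optional[int] = None  # pointer to the replacement record


class Store:
    def __init__(self):
        self.records: Dict[int, Record] = {}  # nothing is ever deleted from here
        self.index: Dict[int, str] = {}       # stand-in for the vector index (live only)
        self._next_id = 1

    def add(self, text: str) -> Record:
        rec = Record(id=self._next_id, text=text)
        self._next_id += 1
        self.records[rec.id] = rec
        self.index[rec.id] = text
        return rec

    def supersede(self, old_id: int, new_text: str) -> Record:
        new = self.add(new_text)
        old = self.records[old_id]
        old.valid_until = time.time()
        old.superseded_by = new.id
        self.index.pop(old_id, None)  # retired records leave the index
        return new

    def recall(self) -> List[str]:
        # Recall reads the index only, so retired records cannot come back.
        return [self.index[i] for i in sorted(self.index)]
```

The retired record stays in records with its timestamp and pointer, so the audit question "what did we believe, and what replaced it?" still has an answer.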
I almost built a library nobody would feel
For a good while, this was a memory engine and nothing else. A clean SDK, good tests, real guarantees - and absolutely no reason for anyone to care, because a memory engine on its own demos as an abstraction talking to itself.
Memory only becomes legible when it's wrapped in something where memory is the product: a domain where preferences genuinely change over time, where history piles up, and where being wrong is concretely bad. I picked a nutrition coach. Diets change constantly ("I went vegetarian, now I eat fish"), the history accumulates fast, and recommending a peanut dish to someone allergic is a failure anyone can see and feel in an instant. The allergy-as-protected-fact story stopped being a paragraph in a design doc and became the thing you watch survive while a month of unrelated chatter gets packed out of the context window.
If you take one workflow note from all this: build the smallest possible thing that lets a stranger feel your guarantee, and build it early. I wasted time polishing internals before I had a way to show what they were for.
What I'd tell you if you're starting this
A few things, plainly:
Don't reach for a benchmark first. I almost let wiring up the harnesses for the big public memory benchmarks eat my whole timeline. A suite of fifteen scripted tests you write yourself, mapped one-to-one onto the behaviors you actually promise, will teach you more in week one and double as your demo script. Add the famous benchmark later, if you have the room.
Be skeptical of headline accuracy numbers, including the ones that look amazing. Some of the impressive figures floating around these systems are tuned against the specific questions they were failing. Report your own numbers, with your own method, even when they're modest. Mine are modest: a small reference eval where the naive transcript baseline gets two of three probes right and mine gets all three, at roughly a third fewer context tokens. The probe the baseline misses is the one where it acts on a diet the user already retracted. That single failure is the entire reason the project exists, and it's worth more than a flashy aggregate.
And keep the whole thing provider-agnostic if you possibly can. The reasoning model, the embedding model, and the storage are three independent choices, and pretending otherwise is how you get locked in. Mine runs fully local on SQLite and the filesystem for development, and swaps to a cloud stack with a single config object - same engine code, different adapters, every adapter gated by the same conformance suite that asserts the guarantees still hold. A new backend that breaks "never forget a protected fact" fails the test suite by construction. That's not a safety net I bolted on. It's the design doing its job.
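As a sketch of what "gated by the same conformance suite" might mean in practice, here is one guarantee expressed as a check that any backend must pass. The StorageAdapter protocol, both adapter classes, and the conforms function are invented for this example and are not the engine's real interfaces; ForgetfulAdapter is deliberately broken to show the check failing.

```python
from typing import Dict, List, Protocol


class StorageAdapter(Protocol):
    def put(self, key: str, text: str, protected: bool) -> None: ...
    def evict(self, keep: int) -> None: ...
    def keys(self) -> List[str]: ...


class InMemoryAdapter:
    def __init__(self):
        self._rows: Dict[str, tuple] = {}

    def put(self, key, text, protected):
        self._rows[key] = (text, protected)

    def evict(self, keep):
        # Protected rows are kept first; the remaining budget goes to the rest.
        protected = [k for k, (_, p) in self._rows.items() if p]
        others = [k for k, (_, p) in self._rows.items() if not p]
        survivors = protected + others[: max(keep - len(protected), 0)]
        self._rows = {k: self._rows[k] for k in survivors}

    def keys(self):
        return list(self._rows)


class ForgetfulAdapter(InMemoryAdapter):
    # Deliberately broken backend: it ignores the protected flag.
    def evict(self, keep):
        self._rows = dict(list(self._rows.items())[-keep:] if keep else [])


def conforms(adapter: StorageAdapter) -> bool:
    # One guarantee, stated once, run against every backend.
    adapter.put("allergy", "allergic to peanuts", True)
    for i in range(20):
        adapter.put(f"chat-{i}", "small talk", False)
    adapter.evict(keep=5)
    return "allergy" in adapter.keys()
```

Running conforms over every registered adapter in CI is one way to make a new backend prove the guarantee before it ships, rather than trusting that it was implemented with the rule in mind.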
None of this is finished, and some of it I'm still unsure about - the decay tuning in particular is more art than I'd like. But the core bet feels right: an agent's memory shouldn't be a transcript it re-reads. It should be a small, opinionated thing that stores cheaply, curates quietly in the background, forgets on purpose, and refuses - structurally, not hopefully - to drop the one fact that matters.
If you're building in this space, I'd genuinely like to hear how you're handling forgetting, because that's the part nobody seems to fully agree on yet, and it's the part that matters most.