Every team trying to give AI agents memory is solving the same three problems badly. After running production agent memory for several months across two codebases, here are the failure modes I keep hitting and the one pattern that actually works.
Failure 1: Embed everything as vectors and call it memory
The instinct is reasonable. You have a vector database, you have embeddings, you have a retrieval API. Memory looks like "stuff a conversation in, get relevant chunks out." So you dump every session's transcript, every decision, every code review into the same embedding store and retrieve by similarity.
This breaks because facts and conversations have different retrieval shapes.
Ask the agent "what did we decide about JWT vs opaque session tokens?" and the embedding store returns five things kind-of-about-tokens by vector similarity. Three of them are old debate snippets. One is a tangential comment from a different feature. The actual decision record is in there somewhere, ranked alongside the noise.
The agent then synthesizes an answer from "five tokenish memories," which gives you a confident summary of the team's thoughts on tokens. What you actually wanted was the single decision record that says "use opaque session tokens, decided 2025-04-12, still active."
The fix isn't to abandon vectors. It's to separate the layers. Structured decision records get an id, a claim, a source, an active_at, and (when relevant) a supersedes_id. Conversations and exploratory reasoning stay in the embedding store. Queries hit both, merge, and prioritize structured records over vector neighbors when the structured record exists for the same topic.
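A minimal sketch of that two-layer lookup in Python. The `DecisionRecord` fields mirror the schema above; the `topic` tag and the `vector_search` callable are illustrative stand-ins, not a prescribed API:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass
class DecisionRecord:
    id: str
    claim: str
    source: str                        # link, transcript ref, or doc
    active_at: date
    topic: str                         # hypothetical coarse tag used for lookup
    supersedes_id: Optional[str] = None

def retrieve(query: str, topic: str,
             decisions: list[DecisionRecord],
             vector_search: Callable[[str, int], list[str]]) -> dict:
    """Hit both layers; a structured record outranks vector neighbors on its topic."""
    neighbors = vector_search(query, 5)        # discussions, ranked by similarity
    on_topic = sorted((d for d in decisions if d.topic == topic),
                      key=lambda d: d.active_at, reverse=True)
    return {
        "decision": on_topic[0] if on_topic else None,  # the authoritative answer
        "context": neighbors,                           # supporting discussion only
    }
```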
Failure 2: Summarize and discard
The pattern: every session, the agent writes a summary of what happened. The raw events are discarded. The next session starts by loading the summary.
This breaks because summaries are lossy compressions, and each new summary compresses the previous summary, not the original events.
A real example I watched happen across four sessions on the same project:
- Session 1: "We agreed to enforce idempotency at the receiver before any side-effect fires. Webhook X currently doesn't and that's blocking the migration."
- Session 1 summary: "Decided idempotency must be enforced at the receiver. Webhook X needs updating."
- Session 2 summary (built from session 1 summary): "Idempotency is being enforced at the receiver. Webhook X update is in progress."
- Session 3 summary: "Idempotency is enforced at the receiver. Webhook X has been updated."
- Session 4: "The webhook layer enforces idempotency."
Webhook X was never updated. The agent now believes a thing that isn't true and will plan against that belief.
The fix is to keep events as the source of truth. Summaries reference event ids, not free text. When a summary goes weird, you can re-summarize from the events with a fresh model and recover the original signal. Without this, you're playing a game of telephone with your own state.
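The same idea as a sketch, assuming an LLM-backed `summarize` callable (a stand-in, not any specific API). The point is that a summary carries the ids of the events it compresses, so a drifted summary can always be rebuilt from the raw events:

```python
from dataclasses import dataclass

@dataclass
class Event:
    id: str
    text: str                # raw, append-only, never rewritten

@dataclass
class Summary:
    text: str
    event_ids: list[str]     # provenance: the events this summary compresses

def resummarize(summary: Summary, events_by_id: dict[str, Event],
                summarize) -> Summary:
    """Rebuild a drifted summary from the original events, never the prior summary."""
    originals = [events_by_id[eid].text for eid in summary.event_ids]
    return Summary(text=summarize(originals), event_ids=summary.event_ids)
```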
Failure 3: Append-only memory with no supersedes-relations
The pattern: every decision becomes a new record. Old records aren't deleted because deleting historical context feels wrong. Conflicts resolve at retrieval time by "newest wins" or by vibes.
This breaks because retrieval relevance and recency are different ranking signals. Similarity search can rank the stale record above its replacement, and "newest wins" never fires because the retriever doesn't know the two records describe the same decision.
Concrete: "We use JWT for service-to-service auth" gets recorded in week 1. In week 4 the team switches: "We replaced JWT with opaque session tokens for service-to-service. JWT is deprecated." Both records exist. Both are retrievable.
In week 8, the agent is asked about service-to-service auth. The query phrasing happens to match the JWT record more strongly (because the JWT record uses the exact phrase "service-to-service" while the replacement record uses "S2S"). The agent confidently retrieves "we use JWT" and starts building against it.
This isn't a hypothetical. I have seen it in production three separate times across two projects. The fix is supersedes-relations as a first-class concept. When a decision replaces another, the new record points at the old via supersedes_id. Retrieval filters out superseded records by default. The old record stays in the database for audit, but it's not surfaced unless explicitly queried.
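A sketch of the default filter, with hypothetical names. The superseded row never leaves storage; it just stops being the default answer:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    id: str
    claim: str
    supersedes_id: Optional[str] = None

def active_decisions(decisions: list[Decision]) -> list[Decision]:
    """Default retrieval view: hide any record that another record supersedes."""
    superseded = {d.supersedes_id for d in decisions if d.supersedes_id}
    return [d for d in decisions if d.id not in superseded]

# The week-1 and week-4 records from the example above:
jwt = Decision("d1", "We use JWT for service-to-service auth")
opaque = Decision("d2", "Opaque session tokens replace JWT for S2S",
                  supersedes_id="d1")
assert active_decisions([jwt, opaque]) == [opaque]   # d1 stays stored, audit-only
```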
The pattern that does work
The shape that has held up under load for me:
- Decisions are records, not sentences. Each one has an id, a textual claim, a source (link, transcript ref, doc), an active_at timestamp, and a supersedes_id field that's null unless this record replaces another.
- Provenance is mandatory. A record without a source is auto-flagged as low-trust. The agent can't ground an answer in a record it can't trace.
- Supersedes-relations are first-class. Replacements use the supersedes_id field, not deletion or "newest wins."
- Conversations stay in the embedding store, separately. Vector retrieval finds discussions. Structured retrieval finds decisions. Both run for each query, and structured wins when they conflict.
- Resummarization runs against events, never against the previous summary. Summaries are derived data, refreshed periodically. They never become the source of truth.
The pattern is tool-agnostic. You can implement it with sqlite and a few tables. You can implement it with a managed memory service. What matters isn't the storage layer; it's the discipline of treating decisions as records with provenance and supersedes-relations.
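For concreteness, a minimal sqlite sketch of the three tables using Python's built-in sqlite3 module. Table and column names are illustrative, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect("memory.db")   # hypothetical filename
conn.executescript("""
CREATE TABLE IF NOT EXISTS events (
    id   TEXT PRIMARY KEY,
    text TEXT NOT NULL                  -- raw, append-only source of truth
);
CREATE TABLE IF NOT EXISTS decisions (
    id            TEXT PRIMARY KEY,
    claim         TEXT NOT NULL,
    source        TEXT,                 -- NULL is auto-flagged low-trust
    active_at     TEXT NOT NULL,        -- ISO-8601 date
    supersedes_id TEXT REFERENCES decisions(id)
);
CREATE TABLE IF NOT EXISTS summaries (
    id        TEXT PRIMARY KEY,
    text      TEXT NOT NULL,            -- derived data, refreshed from events
    event_ids TEXT NOT NULL             -- JSON array of event ids (provenance)
);
""")

# Default retrieval view: active decisions only; superseded rows stay for audit.
active = conn.execute("""
    SELECT * FROM decisions
    WHERE id NOT IN (SELECT supersedes_id FROM decisions
                     WHERE supersedes_id IS NOT NULL)
""").fetchall()
```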
If you want to read further on why structure beats pure vector retrieval for this exact problem, I went deeper on the agent-memory-vs-vector-db decision tree here: https://memnode.dev/articles/agent-memory-vs-vector-db. And on why inspectable provenance beats opaque embeddings for trust here: https://memnode.dev/articles/lineage-and-provenance-in-agent-memory.
Curious what failure modes other people are hitting. The three above are the ones I keep seeing. There's probably a fourth I haven't caught yet.