Jonathanfarrow
# The 6 memory problems for agents

You ship an agent. It works well in the demo. Users start using it daily. After a week, someone asks: "Why did you suggest the same thing you suggested on Monday? I told you that didn't work."

Your agent has no answer because it has no memory of Monday. Or worse, it has a memory of Monday but no idea that Monday's approach failed.

This is the problem that shows up in every long-running agent system, and it is not a retrieval problem. Your vector search works fine. Your RAG pipeline returns relevant context. The problem is upstream of retrieval: your agent stores facts but does not learn from outcomes. It records what happened without recording whether it worked.

The research has a name for this gap. Hu et al.'s survey on agent memory identifies three functional categories: factual memory (what the agent knows), experiential memory (how the agent improves from past actions), and working memory (what the agent is thinking about right now). Most production agent systems implement factual memory. Almost none implement experiential memory. And experiential memory is where the leverage is for agents that operate over more than a single session.

## the six things that break

Having built agent memory systems and watched others build them, we see the same six failure modes show up again and again. They are not obvious at first. They emerge over weeks of usage.

### 1. the contradiction problem

Your agent stores "Alice lives in London" on day one. On day fourteen, it stores "Alice lives in Berlin." Now there are two facts in the memory store. Which one does retrieval return?

If you are lucky, the more recent one. If you are unlucky, whichever one has a higher similarity score to the current query. If Alice's London fact was stored with richer context and more entity connections, it might actually rank higher than the Berlin fact even though it is stale.

The standard fix is to add a created_at timestamp and bias retrieval toward recency. This helps. But it does not solve the underlying problem: your memory store has no concept of a fact replacing another fact. It has two independent entries that happen to be about the same thing. The relationship between them (one supersedes the other) exists only in the real world, not in your data.

What you actually need is for the system to know that "Alice lives in Berlin" invalidated "Alice lives in London." Not as a heuristic. As a structural property of how the data is stored. The old fact should carry a timestamp marking when it stopped being true. The new fact should be linked to what it replaced. Queries should return the currently valid fact by default, with the full history available when you ask for it.

This is what bi-temporal storage does. Every fact carries a valid_from and valid_until. When a new fact supersedes an old one, the old fact's valid_until gets stamped. The new fact opens with valid_until = None (still active). You do not build this logic in application code. The database handles succession.
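A minimal in-memory sketch of this succession behavior, in Python. This is an illustration of the idea, not MinnsDB's implementation; the class and field names are made up for the example.

```python
from datetime import datetime, timezone

def now():
    return datetime.now(timezone.utc)

class FactStore:
    """Sketch of valid-time succession for a single-valued (functional) predicate.

    Recording a new value stamps valid_until on the currently open fact
    and opens a new one with valid_until=None (still active).
    """
    def __init__(self):
        self.facts = []  # dicts: subject, predicate, value, valid_from, valid_until

    def assert_fact(self, subject, predicate, value):
        t = now()
        # Close the currently active fact for this (subject, predicate), if any.
        for f in self.facts:
            if (f["subject"], f["predicate"]) == (subject, predicate) and f["valid_until"] is None:
                f["valid_until"] = t
        self.facts.append({"subject": subject, "predicate": predicate,
                           "value": value, "valid_from": t, "valid_until": None})

    def current(self, subject, predicate):
        # Queries return the currently valid fact by default.
        for f in self.facts:
            if (f["subject"], f["predicate"]) == (subject, predicate) and f["valid_until"] is None:
                return f["value"]
        return None

store = FactStore()
store.assert_fact("Alice", "lives_in", "London")
store.assert_fact("Alice", "lives_in", "Berlin")
print(store.current("Alice", "lives_in"))  # Berlin
print([f["value"] for f in store.facts])   # ['London', 'Berlin'] — history survives
```

The point is that the London fact is never deleted or overwritten; it is closed. The full history stays queryable.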

### 2. the "what did we know when" problem

Your agent makes a recommendation on March 10th. On March 15th, new information arrives that changes the picture. A user asks: "Why did you recommend that on March 10th?"

To answer this, your agent needs to reconstruct what it believed on March 10th. Not what is true now. What the system knew then.

This is the difference between valid-time (when was the fact true in the real world?) and transaction-time (when did the database learn about it?). Most memory systems track neither properly. Some track when a fact was stored. Almost none track when a fact became true independently of when it was recorded.

A user might tell your agent on March 15th that they moved to Berlin on March 1st. The valid-time is March 1st. The transaction-time is March 15th. If someone queries the system as of March 10th, the correct answer depends on which clock you are asking about. "What was true on March 10th?" returns Berlin (valid-time). "What did we know on March 10th?" returns London (transaction-time, because the system had not learned about the move yet).

Audit trails, debugging agent decisions, and understanding why your agent behaved the way it did all require both clocks. One timestamp column does not cut it.
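The two clocks can be made concrete with a small Python sketch of the Alice example above. The data layout is an illustrative assumption (dates only, no valid_until, latest-fact-wins), kept deliberately minimal.

```python
from datetime import date

# Each entry: (value, valid_from, recorded_at).
# On March 15 the user tells us they moved to Berlin on March 1:
# valid-time is March 1, transaction-time is March 15.
history = [
    ("London", date(2024, 1, 1),  date(2024, 1, 1)),
    ("Berlin", date(2024, 3, 1),  date(2024, 3, 15)),
]

def true_on(day):
    """Valid-time: which fact was true in the real world on `day`?"""
    live = [f for f in history if f[1] <= day]
    return max(live, key=lambda f: f[1])[0] if live else None

def known_on(day):
    """Transaction-time: what did the database believe on `day`?
    Ignore anything recorded after that date, then apply valid-time."""
    seen = [f for f in history if f[2] <= day]
    live = [f for f in seen if f[1] <= day]
    return max(live, key=lambda f: f[1])[0] if live else None

q = date(2024, 3, 10)
print(true_on(q))   # Berlin — what was actually true on March 10
print(known_on(q))  # London — the move had not been recorded yet
```

Same question, same date, two different correct answers depending on which clock you ask about. That is why one timestamp column cannot express this.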

### 3. the "it happened at the same time" problem

Alice changed jobs and cities in the same month. Your agent knows both facts independently. But it has no way to connect them temporally. It cannot answer "did those changes happen at the same time?" because its memory has no concept of temporal intervals, let alone relationships between intervals.

This matters more than it might seem. Temporal co-occurrence is how humans detect patterns: "every time we deploy on Friday, something breaks over the weekend" is a temporal pattern. "The user's engagement dropped when we changed the pricing model" is a temporal correlation. Agents that operate in domains with temporal structure (which is most domains) need to reason about how time periods relate to each other.

Allen's Interval Algebra defines 13 possible relationships between two time intervals: before, meets, overlaps, starts, during, and finishes, plus their six inverses, plus equals. If your memory system stores facts with proper valid-time intervals, these relationships become queryable. "Find all cases where one relationship ended exactly as another began" is a query, not an application-level computation.
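A few of Allen's relations are easy to state as predicates over (start, end) pairs. Plain Python here, purely to show how mechanical the definitions are once intervals exist in the data:

```python
def before(a, b):
    # a ends strictly before b starts (a gap between them)
    return a[1] < b[0]

def meets(a, b):
    # a ends exactly as b begins
    return a[1] == b[0]

def overlaps(a, b):
    # a starts first, b starts inside a, a ends inside b
    return a[0] < b[0] < a[1] < b[1]

def during(a, b):
    # a lies strictly inside b
    return b[0] < a[0] and a[1] < b[1]

# Alice's London residency ended the day her Berlin chapter began:
london = (1, 100)
berlin = (100, 200)
print(meets(london, berlin))   # True
print(before(london, berlin))  # False — no gap, they touch
```

With bare facts and no intervals, none of these questions can even be asked.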

### 4. the "every session starts from zero" problem

Your agent debugged a tricky authentication issue last week. It tried three approaches, two failed, one worked. This week, a similar issue comes up. The agent starts from scratch.

This is the experiential memory gap. The agent stored facts about last week's session (what was tried, what error messages appeared). But it never synthesized those facts into a lesson: "For auth issues in this codebase, check the token expiry configuration first. Retry logic is usually a red herring."

The research describes experiential memory at four levels of abstraction:

Case-based: Store the raw trajectory. What happened, step by step. High fidelity but expensive to retrieve and hard to generalize from.

Strategy-based: Distill the trajectory into insights. "When X happens, do Y because Z." Transfers across tasks. This is where the leverage is.

Skill-based: Compile strategies into executable code. The agent writes reusable tools for itself.

Hybrid: Maintain multiple levels simultaneously.

Most systems that attempt experiential memory stop at case-based. They store conversation logs or event histories and hope that retrieval will surface the right past experience when a similar situation arises. This works sometimes. It fails when the current situation is similar in intent but different in surface details, because the embeddings do not match well enough.

What you need is a system that automatically compresses raw experiences into generalized insights. Not "here is what happened last Tuesday" but "here is what we have learned from the last ten Tuesdays."

### 5. the stale confidence problem

Your agent stored a strategy six months ago: "Use approach X for deployment issues." At the time, it had worked three times out of three. Confidence was high.

Since then, the codebase has changed. Approach X has failed twice. But the original memory still has high confidence because nothing in the system updates confidence based on new outcomes.

This is subtler than the contradiction problem. The fact is not wrong. It is not superseded by a new fact. Its reliability has just degraded over time as conditions changed. Your memory store has no way to express this unless you build outcome tracking into the memory entries themselves.

What you need is for memory strength to adjust based on evidence. When an approach succeeds, the associated memory should strengthen. When it fails, it should weaken. The scoring should handle both early uncertainty (few data points) and distribution shift (what used to work no longer does).
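One way to sketch such a scorer in Python: start from a Beta-prior mean (stable when there are few data points), and shift weight toward an exponential moving average (responsive to recent drift) as evidence accumulates. The blend schedule and parameter names here are illustrative assumptions, not MinnsDB's actual scoring.

```python
class OutcomeScore:
    """Confidence that is Bayesian-stable early and EMA-responsive late."""

    def __init__(self, prior_a=1.0, prior_b=1.0, ema_alpha=0.3, blend_n=10):
        self.a, self.b = prior_a, prior_b  # Beta prior pseudo-counts
        self.ema = None                    # recency-weighted success rate
        self.n = 0
        self.alpha = ema_alpha
        self.blend_n = blend_n             # how fast trust shifts to the EMA

    def record(self, success):
        self.n += 1
        x = 1.0 if success else 0.0
        self.a += x
        self.b += 1.0 - x
        self.ema = x if self.ema is None else self.alpha * x + (1 - self.alpha) * self.ema

    def score(self):
        bayes = self.a / (self.a + self.b)   # stable early estimate
        if self.ema is None:
            return bayes
        w = min(1.0, self.n / self.blend_n)  # trust the EMA more over time
        return (1 - w) * bayes + w * self.ema

s = OutcomeScore()
for _ in range(3):
    s.record(True)       # worked three out of three: confidence rises
early = s.score()
for _ in range(5):
    s.record(False)      # conditions changed: approach keeps failing
print(early > 0.7, s.score() < 0.4)  # True True
```

The key property: five recent failures drag the score down sharply even though the lifetime success rate is still close to fifty percent. A raw success ratio would not react fast enough.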

### 6. the multi-hop problem

A user asks your agent: "What API does our project use that's built by the company Steve used to work at?"

No single fact in the memory store answers this. The answer requires chaining through multiple entries: Steve's employment history leads to the company. The company's products lead to the API. The project's dependencies lead back to the same API. Three hops through connected facts.

Flat memory stores handle this terribly. Vector similarity retrieval looks for entries that match the question. But the question mentions Steve, a company, and an API in the same breath, and no single stored fact connects all three. You can try query expansion, generating multiple related queries and merging results. But expansion can only rephrase what is already in the question. It cannot discover connected entities that exist only in the memory store. If the user does not mention the company by name, query expansion cannot find it.

The research is clear on this: for genuine multi-hop queries where the connecting entities are not in the question, you have two reliable options. Let the agent loop (iterative retrieval, calling the search tool multiple times) or build structural connections (a knowledge graph with traversable edges).

Iterative retrieval works but is slow and unpredictable. The agent has to guess what to search for next. It might take two hops or twelve. It might go in circles. And the more hops required, the more likely the chain breaks.

A knowledge graph solves this structurally. The edges exist. The traversal follows them. "Start at Steve, follow worked_at edges to find the company, follow builds edges to find their products, check which product matches the project's dependencies." That is a graph pattern, not a search query:

```
MATCH (steve:Person {name: "Steve"})-[:worked_at]->(company)
      -[:builds]->(api)<-[:depends_on]-(project)
RETURN api.name, company.name
```

One query. Deterministic. No iterative guessing. And because edges in MinnsDB are temporal, you can ask the time-aware version: "What API does our project use that is built by the company Steve currently works at?" versus "...used to work at?" The WHEN clause distinguishes between active and historical edges.

This is why the topology of your memory matters. Flat (1D) memory with vector search handles single-hop factual lookups well. But the questions agents actually need to answer in practice are often multi-hop. "What do we know about the context around this error?" is multi-hop. "What changed in this user's situation recently?" is multi-hop. "Has this approach worked before in similar circumstances?" is multi-hop with a temporal dimension.

The research recommends starting flat and moving to graphs when you observe specific retrieval failures. In our experience, those failures show up faster than you expect. As soon as your agent operates over a domain with relationships between entities (which is most domains), multi-hop queries become the norm, not the exception.

## what these problems have in common

All six problems share a root cause: the memory system does not model time and structure as first-class dimensions.

The contradiction problem is a temporal succession problem. The audit problem requires two independent time axes. The co-occurrence problem requires temporal intervals with algebraic relationships. The experience problem requires temporal segmentation and consolidation over time. The confidence problem requires outcome tracking that evolves over time. The multi-hop problem requires structural connections between facts, with temporal awareness of which connections are currently valid.

A key-value store with vector search does not solve any of these. A relational database with a created_at column solves the first one partially and the rest not at all. A knowledge graph without temporal edges solves the multi-hop problem but leaves temporal reasoning to application code.

The insight we built MinnsDB around is that agent memory is fundamentally a temporal and structural problem. Not a storage problem. Not a retrieval problem. If your storage layer understands time and relationships, all six failure modes have clean solutions. If it does not, you end up building increasingly complex application logic to paper over the gaps.

## how temporal storage addresses each one

Contradictions. Every edge carries valid_from and valid_until. When "Alice lives in Berlin" is recorded, the system checks the ontology: location:lives_in is a functional property (single-valued). The existing London edge gets its valid_until stamped. The Berlin edge opens as active. No application code. The property type defines the succession behavior.

Audit and reconstruction. Every edge also carries created_at (transaction-time). The WHEN clause filters on valid-time. The AS OF clause filters on transaction-time. Compose them:

```
MATCH (a:Person)-[r:lives_in]->(city)
WHEN ALL AS OF "2024-03-10T00:00:00Z"
WHERE a.name = "Alice"
RETURN city.name, valid_from(r), valid_until(r)
```

This reconstructs the database's state as it was known on March 10th, showing all historical validity intervals. You can audit any past decision.

Temporal co-occurrence. Allen's Interval Algebra as query predicates:

```
MATCH (a)-[r1]->(b), (a)-[r2]->(c)
WHEN ALL
WHERE meets(r1, r2)
RETURN a.name, type(r1), type(r2)
```

"Find all cases where one fact about a person ended exactly as another began." No application logic. A query.

Experience consolidation. The episode detector segments the event stream into coherent experiences. The consolidation engine compresses episodic memories into semantic memories (generalized patterns) and then into schema memories (reusable mental models). Three deployments that all required a migration step become the insight: "deployments in this repo usually need a migration." This runs automatically with configurable thresholds.
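A toy sketch of the episodic-to-semantic step. The episodes and threshold below are invented for illustration, and real consolidation is much richer than frequency counting, but the shape of the compression is this:

```python
from collections import Counter

# Hypothetical episodic memories: the steps observed in three deployments.
episodes = [
    ["pull", "migrate", "restart"],
    ["pull", "migrate", "restart", "clear_cache"],
    ["pull", "migrate", "restart"],
]

def consolidate(episodes, threshold=0.8):
    """Steps recurring in at least `threshold` of episodes become a
    generalized pattern (a semantic memory distilled from cases)."""
    counts = Counter(step for ep in episodes for step in dict.fromkeys(ep))
    n = len(episodes)
    return [step for step, c in counts.items() if c / n >= threshold]

print(consolidate(episodes))  # ['pull', 'migrate', 'restart']
```

The one-off `clear_cache` stays episodic; the recurring migration pattern gets promoted to "deployments in this repo usually need these steps."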

Confidence evolution. Each memory tracks outcome counts. Scoring transitions from a Bayesian prior (stable with few data points) to an exponential moving average (responsive to distribution shift as evidence accumulates). Memory strength adjusts based on whether the associated approaches keep succeeding or start failing.

Multi-hop queries. The knowledge graph makes entity relationships traversable. Variable-length paths (-[*1..3]->) handle cases where you do not know the exact number of hops. Because edges carry temporal validity, you can traverse only currently active relationships, or include historical ones, depending on the query. The graph executor handles this with bounded BFS (capped at 10,000 visited nodes per source) and a 30-second query deadline. Multi-hop questions become graph patterns, not iterative search loops.
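The bounded traversal can be sketched in a few lines of Python over a toy edge list. The entities and relation names mirror the Steve example above, and the visit cap mirrors the 10,000-node budget mentioned; everything else is an illustrative assumption.

```python
# Toy adjacency map: (node, relation) -> targets. Hypothetical data.
edges = {
    ("Steve", "worked_at"): ["Acme"],
    ("Acme", "builds"): ["PayAPI", "MapAPI"],
    ("OurProject", "depends_on"): ["PayAPI", "LogLib"],
}

def follow(node, rel):
    return edges.get((node, rel), [])

def multi_hop(start, path, cap=10_000):
    """Bounded breadth-first traversal along a fixed relation path."""
    frontier, visited = {start}, set()
    for rel in path:
        nxt = set()
        for node in frontier:
            for target in follow(node, rel):
                if target not in visited:
                    visited.add(target)
                    nxt.add(target)
            if len(visited) > cap:
                raise RuntimeError("traversal budget exceeded")
        frontier = nxt
    return frontier

# Steve -> his former company -> what it builds, intersected with our deps:
apis = multi_hop("Steve", ["worked_at", "builds"])
deps = set(follow("OurProject", "depends_on"))
print(apis & deps)  # {'PayAPI'}
```

Two deterministic hops and an intersection, instead of an LLM guessing search queries until the chain happens to connect.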

## the developer experience

We think about this from the perspective of an engineer building an agent. You should not need to build temporal succession logic. You should not need to implement your own versioning system. You should not need to write consolidation pipelines or outcome trackers.

You should write events into the system and query it. The temporal modeling, the succession, the episode detection, the consolidation: those are the database's job.

Events go in through the API. State changes, transactions, conversations, observations. The pipeline extracts claims, builds graph edges, detects episodes, forms memories, and consolidates over time. You query with MinnsQL (graph patterns and table queries in one language) or natural language. You subscribe to live queries and get deltas pushed when things change.

```
-- What does the agent know right now?
MATCH (a:Person)-[r:lives_in]->(city)
WHERE a.name = "Alice"
RETURN city.name

-- What changed in the last week?
MATCH (a:Person)-[r]->(b)
WHERE CHANGED(r, ago("7d"), now())
RETURN a.name, type(r), b.name, change_type(r, ago("7d"), now())

-- What was true six months ago?
MATCH (a:Person)-[r:works_at]->(company)
WHEN "2024-09-15"
WHERE a.name = "Alice"
RETURN company.name

-- Subscribe to changes in real time
SUBSCRIBE MATCH (a:Agent)-[r:KNOWS]->(b:Agent)
RETURN a.name, b.name, r.confidence
```

The temporal clauses, the change detection, the subscriptions: these are not features you bolt on. They work because time is built into the storage layer from the ground up.

## what this is not

MinnsDB is not a general-purpose database. It does not replace Postgres for your application's primary data store. It is not trying to be a vector database or an LLM framework.

It is a database built specifically for the temporal memory problems that agents face. If your agent needs to store facts that change over time, learn from its own experiences, maintain audit trails of what it knew and when, and react to changes as they happen, those are the problems this solves.

If your agent just needs to store and retrieve static facts, a simpler solution works fine. The complexity of temporal storage pays off when your agent operates over time: when sessions span days and weeks, when facts change and the agent needs to know what changed and when, and when raw experience should be compressed into knowledge.

Most agents today do not operate this way. But the ones that will matter in six months do.

Try https://minns.ai, the agentic database built for temporal memory.

