TL;DR
Most agent memory failures look like hallucinations. They are not. The model reasons correctly over a stale fact that the memory layer fed it. That is a database failure, not a model failure.
Destructive updates create the State Confusion Problem. The seemingly obvious fix (have an LLM resolve facts at write time) breaks two ways: it silently purges history when the resolution model hallucinates equivalence, and it adds an LLM call to every ingested chunk.
The architecture that works borrows the shape of Git. Edges are append-only commits, each carrying both a transaction time and a valid time. Nothing is overwritten. The current state of any relationship is determined by the edge log.
On LongMemEval-s, this architecture scores 90.79% overall, +5.59 points absolute over the strongest published system, with strong results in Knowledge Update (97.43%) and Temporal Reasoning (90.97%): the categories where temporal versioning matters most.
The architecture is not free. Storage grows. Write-path enrichment costs cycles. Query plans get more involved. The paper argues these costs are paid efficiently and scale correctly with workload.
A user moved from New York to London in October 2024. The next month, an agent planning their weekend reads memory and suggests a Brooklyn dinner reservation, citing a 2022 conversation about a favorite spot in Williamsburg.
The model didn't hallucinate. It reasoned correctly. The memory layer handed it a stale fact and silently overwrote the correction.
This is a database failure, not a model failure.
In my previous piece, I argued that agent memory is a database problem: agents should be stateless, state should live in a purpose-built store with the primitives database engineers already know how to build. One of those primitives does the most work in production: a versioned, bitemporal graph for relationships and facts that change over time.
The shape HydraDB has ended up with is a composite architecture: it combines traditional vector search and B-tree indexes with a foundational structure that looks a lot like Git. Every state transition, whether a user's preference shifts, a customer's seat count drops, or an internal system's owner changes, is committed as an append-only edge with both a transaction time and a valid time. The current state of any relationship is a function over the full commit history of that edge: "all commits to this edge, ordered by time, filtered to t ≤ now." Reading state at any historical point is a function over the same edge log, not a custom replay scaffold built in app code. The "what changed and why" question has the same shape.
The architecture and the empirical numbers come from the HydraDB paper, Beyond Context Windows for Long-Term Agentic Memory.
The State Confusion Problem: when agent memory overwrites correct facts
Most memory layers handle a state change by overwriting the old value. A user's location, a customer's tier, an account's primary contact: the new value replaces the old one in storage. Standard graph databases overwrite by updating the edge in place. KV stores overwrite by writing to the same key. Vector stores can overwrite via explicit ID-based upserts, but more commonly they simply store the new chunk alongside the old one, completely unlinked, which leaves no deterministic way to fetch the "latest" entry without building custom application logic. Where a value is overwritten, the old value is gone unless someone wrote a separate audit log.
For most application workloads, this is the correct default. For agent memory, it is destructive.
Consider a user who in 2022 says, "I live in New York because I work at startup XYZ, headquartered in NYC." In 2024 the same user says, "I live in London because I switched to Meta. I moved to be closer to my parents."
In a destructive store, two things can happen. The system overwrites NYC with London, and the trail of why the move happened, when it happened, and what the world looked like at the earlier point disappears. Or the system stores both as separate facts with no temporal ordering, and at retrieval time the agent gets two competing values without knowing which is current. Either way, the agent loses the timeline, the reasoning, and the decision tree.
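A minimal sketch of the two outcomes, using a plain Python dict and set as stand-ins for a key-value store and an unlinked chunk store. The store shapes and key names are assumptions for illustration, not any particular product's API.

```python
# Stand-in for a key-value profile store: one key, one value.
kv_store = {}
kv_store["user:location"] = "New York"   # 2022 statement
kv_store["user:location"] = "London"     # 2024 statement overwrites in place

# Outcome 1: the old value, and the reason for the move, are gone.
history_available = "New York" in kv_store.values()

# Stand-in for a chunk store that keeps both facts, unlinked and unordered.
chunk_store = {
    "I live in London because I switched to Meta.",
    "I live in New York because I work at startup XYZ.",
}

# Outcome 2: both facts match a location query, and nothing in the store
# says which one is current.
candidates = sorted(c for c in chunk_store if "I live in" in c)
```

In the first case the agent can answer "where does the user live" but not "where did they live in 2022, and why did they leave". In the second it gets both answers with no way to rank them.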
The paper calls this the State Confusion Problem. It is not a corner case. Every long-running agent runs into it within weeks of production traffic.
The cost is concrete. A scheduling agent suggests dinner in a city the user moved away from. A renewal agent quotes the prior year's seat count after a downgrade has been committed. A health-tracking agent acts on a dietary preference the user reversed three months ago. The model reasoned correctly. The memory layer fed it the wrong premise.
Why the Iterative Resolution Loop breaks at scale
The first instinct most teams have is to add an LLM-mediated update step on the write path. For every incoming fact, vector-search for similar existing facts, then ask an LLM to decide whether the new fact updates, supersedes, or coexists with the old one. This pattern, called the Iterative Resolution Loop in the paper, shows up in nearly every agent memory codebase that has tried to handle changing state.
It breaks two ways.
The first failure mode is what the paper calls Instability via False Positives. Semantic similarity does not imply factual redundancy. For example, if a user says "I love the Next.js framework" and a year later says "I love Angular," it is incredibly difficult for an LLM to conclusively determine if the user has abandoned Next.js (requiring an overwrite) or simply loves both. Asking an LLM to "resolve" these at write time produces a False Positive Delete whenever the resolution model makes the wrong probabilistic guess and hallucinates a replacement: a destructive overwrite driven by a judgment call. The system silently purges history. There is no principled way to recover the lost prior state.
The second failure mode is the O(N) Latency Trap. Every ingested chunk triggers a retrieval-and-reasoning step. For a system ingesting tens of thousands of facts per day per tenant, that is tens of thousands of vector searches and tens of thousands of LLM calls on the write path. The cost and latency profile is incompatible with any production write throughput target.
Both failure modes share a root cause. The Iterative Resolution Loop tries to maintain a single canonical "current state" by reconciling at write time. Reconciliation requires inference. Inference at write time is expensive and unsafe. The cleaner move is to stop reconciling at write time entirely.
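Both failure modes can be seen in a toy version of the loop. The sketch below is an assumption-laden illustration: the vector search and the LLM judgment are stubs (`_similar` and `_llm_says_supersedes` are hypothetical names), and the stubbed judgment always answers "supersedes" to show what a wrong guess does.

```python
from dataclasses import dataclass, field

@dataclass
class ResolutionLoopStore:
    """Write path that reconciles every incoming fact against existing ones."""
    facts: list = field(default_factory=list)
    vector_searches: int = 0
    llm_calls: int = 0

    def _similar(self, fact):
        # Stub for a vector search: facts sharing the same first two words.
        self.vector_searches += 1
        key = fact.split()[:2]
        return [f for f in self.facts if f.split()[:2] == key]

    def _llm_says_supersedes(self, new, old):
        # Stub for an LLM judgment. It always answers "supersedes", the wrong
        # guess that produces a False Positive Delete.
        self.llm_calls += 1
        return True

    def write(self, fact):
        for old in self._similar(fact):
            if self._llm_says_supersedes(fact, old):
                self.facts.remove(old)   # destructive: prior state is unrecoverable
        self.facts.append(fact)

store = ResolutionLoopStore()
store.write("I love the Next.js framework")
store.write("I love Angular")
```

After the second write, the Next.js fact is gone and the store has no record that it ever existed. The counters show the other cost: one search per ingested fact, plus one model call per similar fact found.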
How HydraDB stores entity state as Git-style commits
Borrow the shape from version control. A Git repository does not store a single canonical state of the codebase that is overwritten on every change. It stores an append-only log of commits, each carrying a parent reference, a timestamp, and metadata describing the change. The current state of any file is a function over the commit log: take all commits touching this file, ordered by time, and resolve.
HydraDB takes the same shape for relationship and entity state. Edges are not single records updated in place. They are append-only, time-ordered sequences of commits. The paper formalizes this directly. For two entities u and v, the set of all edges between them is E(u,v). Each edge e_k is a tuple:
e_k = (r_k, t_commit, t_valid, C_meta)
Four fields, each load-bearing:
r_k is the semantic relation: LOCATED_IN, PREFERS, WORKS_AT, CAUSED_BY, BLOCKED_BY. The graph is typed.
t_commit is the ingestion timestamp. When the system learned this fact.
t_valid is the real-world validity time. When the fact actually became true in the world.
C_meta carries the contextual metadata: the source utterance, the sentiment, the reasoning, the alternatives considered, the situational factors. The "why" of the state change, not just the "what."
The split between t_commit and t_valid is the bitemporal axis. A user might tell the agent in March that they moved in October. The transaction-time history records that the commit happened in March. The valid-time history records that the move happened in October. Most memory layers conflate these. A bitemporal architecture answers two distinct questions cleanly: what did the agent know on March 12, and what was actually true on March 12. The first is an audit query. The second is a correctness query. Both matter, and they are not the same query.
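The March/October example can be sketched with two small query functions. This is a minimal illustration, not HydraDB's API: the `Edge` shape, the function names, and every specific date are assumptions chosen to match the story above.

```python
from datetime import date
from typing import NamedTuple

class Edge(NamedTuple):
    relation: str
    target: str
    t_commit: date   # when the system learned the fact
    t_valid: date    # when the fact became true in the world

log = [
    Edge("LOCATED_IN", "New York", t_commit=date(2022, 5, 1), t_valid=date(2022, 5, 1)),
    # Told on March 20, 2025 about a move that happened in October 2024.
    Edge("LOCATED_IN", "London", t_commit=date(2025, 3, 20), t_valid=date(2024, 10, 1)),
]

def known_at(log, day):
    """Audit query: the latest fact the system had committed as of `day`."""
    visible = [e for e in log if e.t_commit <= day and e.t_valid <= day]
    return max(visible, key=lambda e: e.t_valid).target if visible else None

def true_at(log, day):
    """Correctness query: what was true on `day`, using everything known today."""
    valid = [e for e in log if e.t_valid <= day]
    return max(valid, key=lambda e: e.t_valid).target if valid else None

march_12 = date(2025, 3, 12)
agent_believed = known_at(log, march_12)
actually_true = true_at(log, march_12)
```

On March 12 the agent still believed New York, because the London edge had not been committed yet, while the correctness query returns London. A store with a single timestamp cannot return both answers.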
When state changes, no edge is mutated. A new edge is appended to E(u,v) with a fresh t_commit and t_valid. The previous edge stays. The current state of the relationship is a function over the full edge history:
ΔState(u, v) = SortByTime(E(u, v)), filtered to t_valid ≤ t_now
Reading state at any historical point is the same function with a different t_now. The append-only log also answers the historical "blame" question: what facts did the agent commit between dates X and Y, and what conversation context drove them? Both fall out of the same data model.
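A minimal sketch of the resolution function and the "blame" query over one edge log. The class and method names (`EdgeLog`, `state_at`, `blame`) and all dates are hypothetical; only the four-field edge shape and the sort-and-filter rule come from the text above.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(frozen=True)
class Edge:
    relation: str                              # r_k
    value: str
    t_commit: date                             # ingestion time
    t_valid: date                              # real-world validity time
    meta: dict = field(default_factory=dict)   # C_meta

class EdgeLog:
    """Append-only history of edges between one pair of entities, E(u, v)."""

    def __init__(self):
        self._edges = []

    def append(self, edge):
        self._edges.append(edge)   # the only write operation; nothing is mutated

    def state_at(self, t_now):
        """Edges with t_valid <= t_now, sorted by valid time; the last is current."""
        return sorted((e for e in self._edges if e.t_valid <= t_now),
                      key=lambda e: e.t_valid)

    def blame(self, start, end):
        """Edges committed between start and end, with the context that drove them."""
        return [e for e in self._edges if start <= e.t_commit <= end]

log = EdgeLog()
log.append(Edge("WORKS_AT", "startup XYZ", date(2022, 3, 1), date(2022, 3, 1),
                {"source": "I work at startup XYZ, headquartered in NYC"}))
log.append(Edge("WORKS_AT", "Meta", date(2024, 11, 5), date(2024, 10, 1),
                {"source": "I switched to Meta"}))

current = log.state_at(date(2025, 1, 1))[-1].value    # latest valid edge
earlier = log.state_at(date(2023, 6, 1))[-1].value    # same function, earlier t_now
commits_2024 = log.blame(date(2024, 1, 1), date(2024, 12, 31))
```

The current read and the historical read are the same call with a different argument, and the blame query returns the edge together with its source utterance.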
This is a familiar shape. Bitemporal databases have been a research line since Richard Snodgrass formalized the model in the early 1990s. Datomic productionized transaction-time history with as-of, since, and history filters: a single-axis temporal model. XTDB v2 extended this to full bitemporality, exposing both system time and valid time as first-class queryable dimensions directly in SQL: SELECT * FROM customers FOR VALID_TIME AS OF '2024-10-15' is a single statement, not a replay scaffold built in app code. HydraDB applies the same bitemporal model at the edge level. It exposes the temporal axis through its recall API (with parameters like recency bias and valid-time filtering on graph traversal), not through a generic SQL-like temporal query language. The novelty in the agent memory layer is not the bitemporal model. It is applying that model to the substrate that LLMs read from. Vector indexes were not built for this. Standard graph stores were not built for this. The agent memory layer needs to be built for this.
What HydraDB's temporal graph enables: multi-hop reasoning, implicit preferences, and cross-session memory
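Three capabilities follow from a bitemporal append-only graph: multi-hop traversal over causally connected facts, inference of unstated preferences, and preference accumulation across sessions.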
Multi-hop graph traversal for causally connected facts
Vector retrieval treats each chunk as an independent point in embedding space. Two facts are retrievable together only if they are semantically close. In production, the facts an agent needs are often causally connected but semantically distant.
Consider the query: "Why is the authentication service behaving differently this month?" The relevant facts include the auth service, the user database it depends on, a migration that touched the user database, the engineer who authored the migration, and the schema-change ticket that justified it. Vector retrieval might surface recent logs that mention the auth service. It will not connect those logs to the migration, the engineer, or the ticket, because none of those entities sit close to the auth service in the embedding space.
The graph traversal is direct:
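A minimal sketch of such a traversal, assuming a tiny in-memory edge list. The relation names (`DEPENDS_ON`, `CHANGED_BY`, `AUTHORED_BY`, `JUSTIFIED_BY`), the entity IDs, and the dates are all hypothetical, and `causal_chain` is an illustrative helper rather than HydraDB's recall API.

```python
from collections import deque
from datetime import date

# (source, relation, destination, t_valid)
EDGES = [
    ("auth-service", "DEPENDS_ON", "user-db", date(2023, 1, 10)),
    ("user-db", "CHANGED_BY", "migration-A", date(2025, 6, 2)),
    ("migration-A", "AUTHORED_BY", "engineer-1", date(2025, 6, 2)),
    ("migration-A", "JUSTIFIED_BY", "ticket-1", date(2025, 5, 28)),
    ("user-db", "CHANGED_BY", "migration-B", date(2025, 8, 1)),  # not yet valid
]

def causal_chain(start, as_of, max_hops=3):
    """Breadth-first walk over typed edges whose t_valid is on or before as_of."""
    seen, path = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for src, rel, dst, t_valid in EDGES:
            if src == node and t_valid <= as_of and dst not in seen:
                seen.add(dst)
                path.append((src, rel, dst))
                queue.append((dst, depth + 1))
    return path

chain = causal_chain("auth-service", as_of=date(2025, 6, 15))
```

With `as_of` set to mid-June, the walk reaches the database, the first migration, its author, and its ticket, and skips the second migration because its valid time is later.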
Each hop is on a typed relation, with valid-time filters applied at each step. The agent recovers the full causal chain in a single retrieval pass. None of the intermediate hops were co-located in embedding space, and a flat vector index could not have surfaced them together.
Inferring unstated preferences from graph topology
Some of the most useful preferences a user has are never stated explicitly. They are visible in the topology of the user's decisions over time.
A user rejects two cloud vendors and accepts a third. The rejection conversations cite different reasons each time, but the accepted vendor shares one property the rejected ones lack: data residency in the user's home jurisdiction. The user has expressed a preference for data sovereignty, but never in those words. A flat retrieval system cannot surface this preference because no chunk contains the words "data sovereignty" attached to the user.
A versioned graph encodes the rejections and the acceptance as typed edges:
The graph then compares what cloud-vendor-C offers that A and B lack (in this case, data residency in the user's home jurisdiction), and synthesizes a higher-level preference edge.
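The typed edges and the comparison step can be sketched with set operations. Everything here is illustrative: the relation names, the vendor property labels, and the `infer_preferences` helper are assumptions, and a real system would need more evidence than three decisions before committing an inferred edge.

```python
# Hypothetical typed decision edges: (subject, relation, vendor)
decisions = [
    ("user", "REJECTED", "cloud-vendor-A"),
    ("user", "REJECTED", "cloud-vendor-B"),
    ("user", "ACCEPTED", "cloud-vendor-C"),
]

# Hypothetical vendor properties already present in the graph.
properties = {
    "cloud-vendor-A": {"low-cost", "managed-kubernetes"},
    "cloud-vendor-B": {"managed-kubernetes", "soc2"},
    "cloud-vendor-C": {"soc2", "low-cost", "home-jurisdiction-data-residency"},
}

def infer_preferences(decisions, properties):
    """Properties shared by every accepted vendor and absent from every rejected one."""
    accepted = [v for _, rel, v in decisions if rel == "ACCEPTED"]
    rejected = [v for _, rel, v in decisions if rel == "REJECTED"]
    if not accepted:
        return []
    shared_by_accepted = set.intersection(*(properties[v] for v in accepted))
    seen_in_rejected = set().union(*(properties[v] for v in rejected))
    distinguishing = shared_by_accepted - seen_in_rejected
    # Each result is a candidate higher-level edge, flagged as inferred.
    return [("user", "PREFERS", p, {"inferred": True}) for p in sorted(distinguishing)]

inferred = infer_preferences(decisions, properties)
```

Cost and compliance features appear on both sides of the decision, so they cancel out. Data residency is the only property left.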
The inferred preference is now retrievable across every future conversation that touches vendor selection, even if those conversations never use the words "data sovereignty." The graph grew smarter through use.
Preference accumulation across sessions
The third consequence falls out of the first two. When preferences are typed edges with provenance and outcome metadata, they accumulate across sessions instead of decaying with the prompt window. A user who repeatedly accepts open-source recommendations, declines SaaS suggestions, and expresses cost-sensitivity across unrelated sessions builds a preference subgraph:
Each edge has a count, a confidence weighted by recency, and outcome annotations: did the user act on the recommendation, did the plan succeed, was the decision later reversed. The memory layer is no longer a passive record of stated preferences. It is an active model of demonstrated preferences, retrievable as structured priors for downstream reasoning.
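One way to picture the accumulation is a fold over observations from many sessions. The sketch below is an assumption, not the paper's scoring: the half-life weighting, the field names, and the sample observations are all hypothetical.

```python
from collections import defaultdict
from datetime import date

observations = [
    # (preference, observed_on, user_acted_on_it)
    ("open-source", date(2025, 1, 10), True),
    ("open-source", date(2025, 6, 1), True),
    ("open-source", date(2025, 9, 20), True),
    ("saas", date(2025, 2, 3), False),
    ("cost-sensitive", date(2025, 9, 1), True),
]

def accumulate(observations, today, half_life_days=180):
    """Fold observations into one summary per preference."""
    summary = defaultdict(lambda: {"count": 0, "weight": 0.0, "acted": 0})
    for pref, day, acted in observations:
        age = (today - day).days
        s = summary[pref]
        s["count"] += 1
        s["weight"] += 0.5 ** (age / half_life_days)   # recent evidence counts more
        s["acted"] += int(acted)                       # outcome annotation
    return dict(summary)

prefs = accumulate(observations, today=date(2025, 10, 1))
```

Three recent open-source acceptances outweigh one old SaaS suggestion the user declined, which is the kind of structured prior a downstream planner can consume directly.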
When these preference subgraphs are shared across agents operating on behalf of the same organization, HydraDB exposes the result as Hive Memories: cross-agent shared learning where one agent's observed preference becomes available to the next agent that touches the same entity.
How a temporal graph handles a four-year preference change
The paper's Figure 1 illustrates the model with a dietary preference: omnivore in 2021 to vegan in 2025. The edge structure follows the paper; specific dates and metadata fields are illustrative.
In January 2021, the user mentions in passing that they enjoy cooking steak on weekends. The system commits an edge.
In March 2024, the user is diagnosed with high cholesterol and tells the agent they are cutting back on red meat. The system commits a new edge.
In November 2025, after a year of progressively cutting back, the user tells the agent they have decided to go fully vegan and have already stopped buying meat. The system commits a third edge.
Three edges, no overwrites.
"What is the user's current dietary preference?" Resolve E(user, cuisine) by sorting on t_valid and filtering to t_valid ≤ now. Returns e_3: vegan.
"What was the user's dietary preference in 2023?" The same resolution with t_valid ≤ '2023-12-31'. Returns e_1: omnivore. The agent suggesting a steakhouse for a 2023 anniversary dinner is now reasoning over the right premise.
"When did the user's dietary preference change?" Walk the edge sequence. Two transitions: 2024-03-08 (omnivore to reducing red meat) and 2025-10-15 (reducing to vegan).
"Why did the user change?" Read C_meta on e_2 and e_3. Medical advice from a cardiologist drove the first transition. The second was driven by both medical and ethical motivations, discussed with the user's physician and spouse. The agent now understands the state change well enough to navigate adjacent decisions. It should not push the user toward beef-heavy recipes or restaurants whose menus are concentrated around red meat, even if older interactions show enthusiasm for those cuisines.
"What did the agent know about the user's diet on 2024-04-01?" Resolve with t_commit ≤ '2024-04-01' rather than t_valid. Returns e_1 and e_2. The agent at that point knew the user was an omnivore reducing red meat. Any decisions it made then can be audited against that state.
Stale recommendations, contradictory profile fields, missing transition reasoning, and unauditable decisions all collapse into resolutions of the same edge sequence with different temporal filters.
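The whole walkthrough fits in a short sketch. The valid-time dates for e_2 and e_3 follow the transitions quoted above; the commit dates, the January 2021 day, the metadata strings, and the fixed "now" are illustrative assumptions, as are the helper names.

```python
from datetime import date
from typing import NamedTuple

class Edge(NamedTuple):
    name: str
    relation: str
    value: str
    t_commit: date
    t_valid: date
    meta: dict

E = [
    Edge("e_1", "PREFERS", "omnivore", date(2021, 1, 16), date(2021, 1, 16),
         {"source": "enjoys cooking steak on weekends"}),
    Edge("e_2", "PREFERS", "reducing red meat", date(2024, 3, 10), date(2024, 3, 8),
         {"reason": "high cholesterol diagnosis"}),
    Edge("e_3", "PREFERS", "vegan", date(2025, 11, 5), date(2025, 10, 15),
         {"reason": "medical and ethical motivations"}),
]

def resolve(edges, valid_as_of):
    """Latest edge by t_valid among those valid on or before the given day."""
    valid = sorted((e for e in edges if e.t_valid <= valid_as_of),
                   key=lambda e: e.t_valid)
    return valid[-1] if valid else None

def transitions(edges):
    """Valid-time dates at which the state changed, after the initial edge."""
    ordered = sorted(edges, key=lambda e: e.t_valid)
    return [e.t_valid.isoformat() for e in ordered[1:]]

def known_as_of(edges, commit_as_of):
    """Audit view: every edge the system had committed by the given day."""
    return [e.name for e in edges if e.t_commit <= commit_as_of]

now = date(2025, 12, 1)                           # fixed so the example is repeatable
current = resolve(E, now).value
in_2023 = resolve(E, date(2023, 12, 31)).value
changes = transitions(E)
reasons = [e.meta["reason"] for e in E[1:]]       # the "why" lives in C_meta
audit = known_as_of(E, date(2024, 4, 1))
```

The only thing that varies between the five questions is which timestamp is filtered and where the cutoff sits.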
Benchmark: HydraDB scores 90.79% on LongMemEval-s
The architecture's claims are testable. The HydraDB paper evaluates the system against LongMemEval-s, a 500-question benchmark for long-term agent memory introduced at ICLR 2025. Each question-conversation stack exceeds 115,000 tokens, simulating roughly 50 continuous user sessions. That is well past the point where naive context-stuffing or flat vector retrieval falls apart.
HydraDB scores 90.79% overall, +5.59 points absolute over the next strongest published system (Supermemory at 85.20%) and roughly +30 points over a GPT-4o full-context baseline (60.2%). The category breakdown:
| Category | HydraDB | Supermemory | Zep | Full-context (GPT-4o) | Mem0-oss |
|---|---|---|---|---|---|
| Single-session (User) | 100.00% | 98.57% | 92.9% | 81.4% | 38.71% |
| Single-session (Assistant) | 100.00% | 98.21% | 80.4% | 94.6% | 8.93% |
| Single-session (Preference) | 96.67% | 70.00% | 56.7% | 20.0% | 40.00% |
| Knowledge Update | 97.43% | 89.74% | 83.3% | 78.2% | 52.56% |
| Temporal Reasoning | 90.97% | 81.95% | 62.4% | 45.1% | 25.56% |
| Multi-session Reasoning | 76.69% | 76.69% | 57.9% | 44.3% | 20.30% |
| Overall | 90.79% | 85.20% | 71.2% | 60.2% | 29.07% |
The largest absolute gains land where the bitemporal versioned graph architecture should help most. Knowledge Update at 97.43% measures whether the system can correctly handle conflicting or evolving facts, the precise failure mode the State Confusion Problem causes. Temporal Reasoning at 90.97% measures queries that depend on knowing what was true when. Single-session Preference Extraction at 96.67% measures implicit preference modeling, exactly the topology-derived inference the architecture exposes.
Multi-session Reasoning tells a less flattering story. HydraDB and Supermemory both score 76.69%. This is a tie, not a win. It is the hardest category in the suite. The architectural advantages don't close this gap. Combining facts distributed across many sessions remains the open frontier for every published system, HydraDB included.
The architecture's wins also do not depend on the largest available model. Re-running the same evaluation with GPT-5 Mini lands at 85.80% overall, and GPT-5.2 at 84.73%, both on substantially smaller backbones. The first is above every non-HydraDB system in Table 2; the second trails only Supermemory (85.20%) and sits well above the rest. The paper's conclusion: long-term memory quality is governed primarily by preprocessing and representation design, not raw model capacity. Users can select backbone models based on operational constraints (cost, latency, throughput) without compromising memory reliability.
Costs of a bitemporal agent memory: storage, write-cost, query complexity
Append-only versioned graphs are not free.
Storage growth. Append-only is monotonically growing by design. Every edge committed sticks. For a long-running multi-tenant system, the storage profile under retention-free assumptions is unsustainable. The paper introduces a Bio-Mimetic Decay Engine that scores memory nodes by initial salience, exponential decay over chronological age, and reinforcement boost from successful retrievals. Memories with low retention scores migrate through tiered storage and eventually evict. The decay engine is experimental in the current paper. The general shape (retention policies for different data classes, plus a vacuum-equivalent that reclaims space without breaking replay) is a known pattern from event-sourcing literature. It is solvable. It is not free.
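The three ingredients of the retention score can be combined in many ways. The sketch below shows one plausible form, and it is an assumption throughout: the additive formula, the decay rate, the boost, and the tier thresholds are invented for illustration and are not the paper's Bio-Mimetic Decay Engine.

```python
import math

def retention_score(salience, age_days, successful_retrievals,
                    decay_rate=0.01, boost=0.1):
    """Illustrative score: exponentially decayed salience plus retrieval boosts."""
    return salience * math.exp(-decay_rate * age_days) + boost * successful_retrievals

def tier(score, warm_below=0.5, evict_below=0.1):
    """Map a score to a hypothetical storage tier."""
    if score < evict_below:
        return "evict"
    return "warm" if score < warm_below else "hot"

fresh = tier(retention_score(1.0, age_days=0, successful_retrievals=0))
old_unused = tier(retention_score(0.8, age_days=400, successful_retrievals=0))
old_but_used = tier(retention_score(0.8, age_days=400, successful_retrievals=3))
```

Under these made-up parameters, an old memory that nobody retrieves falls to eviction, while the same memory with a few successful retrievals stays in a warm tier.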
Write-path inference cost. Sliding-window enrichment, entity resolution, and preference mapping happen at ingestion time, not retrieval time. That ingestion overhead is the explicit price of having self-contained, structured memory chunks at retrieval time. In production, HydraDB sustains 2.5 million tokens per minute of ingestion with sub-500ms wait times at peak, and has handled bursts of 50 million tokens per hour under noisy-neighbor conditions. The cost is real, and it is paid efficiently at scale. The tradeoff is principled: write-time enrichment scales with the number of facts ingested; query-time enrichment scales with the number of queries multiplied by the recall window. For any system that serves roughly ten or more reads per write, paying the cost on the write path is the cheaper integration.
Query complexity. Multi-hop graph traversal with cross-encoder reranking and query expansion is more involved than nearest-neighbor vector lookup. The query path has more moving pieces: adaptive query expansion, hybrid semantic search, entity-anchored graph search, chunk-level graph expansion, triple-tier reranking. The complexity is the bug fix. The flat retrieval pipeline is simpler precisely because it is not solving the problems the multi-stage pipeline is built to solve.
The paper does not pitch the architecture as a free lunch. It pitches it as the correct shape for a problem the industry has been trying to solve with the wrong tools.
Agent memory is a database engineering problem
Agent memory is a database engineering problem. The shape that handles changing facts, multi-hop relational reasoning, preference inference, and auditability is a versioned, bitemporal, typed graph with structured context attached at the edge.
The Git-Style Versioned Temporal Graph is one pillar of HydraDB's broader Composite Context architecture. The Sliding Window Inference Pipeline makes ingested chunks self-contained before they hit the graph. Multi-Stage Retrieval fuses hybrid semantic search with graph traversal. The temporal graph is necessary but not sufficient on its own. The composite is what makes the system work end-to-end.
If you are running into stale recommendations, lost preferences, contradictory profile state, or unauditable agent decisions in production, the question is not which retrieval heuristic to swap in next. The question is whether your memory layer is architecturally capable of distinguishing what was true at the time the agent acted from what is true now. If the answer is no, the failures will keep coming, and the next tweak to the retrieval scoring function will not fix them.
Frequently Asked Questions
What is the State Confusion Problem in AI agent memory?
The State Confusion Problem occurs when an agent's memory layer overwrites or fails to time-order facts about a user, customer, or system. When state changes (a user moving cities, a customer downgrading their tier), a destructive store either loses the old value entirely or stores both without temporal ordering, leaving the agent unable to reason about what is current. The model reasons correctly over what it is given; the memory layer feeds it the wrong premise.
Why isn't a vector database enough for AI agent memory?
Vector databases are search indexes, not state stores. They retrieve by semantic similarity, which works for retrieval-augmented generation over static corpora but fails for stateful agents that need typed entities, versioned writes, transactional read isolation, and multi-record consistency. Two facts that should retrieve together are often causally connected but semantically distant in embedding space, and a flat vector index has no primitive for that.
What is a bitemporal knowledge graph?
A bitemporal knowledge graph stores every fact with two timestamps: a transaction time (when the system learned the fact) and a valid time (when the fact actually became true in the world). This lets the system answer two different questions about any historical point: what did the system know at time T, and what was actually true at time T. Append-only versioned graphs apply this model to typed entity relationships rather than rows in a relational table.
How does HydraDB handle conflicting or evolving facts?
HydraDB never mutates an existing edge. When a fact changes, HydraDB appends a new edge with a fresh transaction time and valid time, leaving the prior edge intact. The current state of any relationship is a function over the full edge log, so historical state and the reasoning behind each transition remain queryable. On LongMemEval-s, this architecture scores 97.43% on the Knowledge Update category, the benchmark axis that tests handling conflicting or evolving facts.
What is the difference between transaction time and valid time?
Transaction time records when the system ingested a fact. Valid time records when the fact actually became true in the world. A user might tell an agent in March that they moved in October: transaction time captures the March commit, valid time captures the October move. Conflating the two makes audit queries and correctness queries indistinguishable, so a bitemporal architecture keeps them as separate axes.
How does HydraDB compare to Mem0, Zep, and other agent memory systems on LongMemEval-s?
On LongMemEval-s, HydraDB scores 90.79% overall, +5.59 points absolute over the next-best system (Supermemory at 85.20%) and roughly +30 points over a GPT-4o full-context baseline (60.2%). Mem0-oss scores 29.07%, Zep 71.2%. HydraDB posts large gains in Knowledge Update (97.43%) and Temporal Reasoning (90.97%), the categories where bitemporal versioning matters most. The two top systems tie on Multi-session Reasoning at 76.69%, the suite's hardest category.
What are Hive Memories in HydraDB?
Hive Memories are HydraDB's cross-agent shared learning layer. When preference subgraphs accumulate across agents operating on behalf of the same organization, one agent's observed preference becomes available to the next agent that touches the same entity. Preferences propagate across the agent network instead of being relearned each session.
What are the tradeoffs of an append-only versioned memory graph?
Three costs are real. Storage grows monotonically and needs retention policies to stay sustainable; HydraDB introduces a Bio-Mimetic Decay Engine to score and evict low-salience memories. Write-path enrichment costs compute since sliding-window inference happens at ingestion; HydraDB sustains 2.5 million tokens per minute in production with sub-500ms wait times at peak. Query plans are more involved than nearest-neighbor lookup since traversal combines hybrid semantic search, graph paths, and cross-encoder reranking.




