Andrew Estey-Ang

Posted on • Originally published at pith.run

What Memory Benchmarks Don't Test

Every comparison of AI memory systems ranks on retrieval accuracy. None rank on what happens when the system retrieves confidently wrong information, holds contradictory beliefs simultaneously, or trusts stale knowledge as if it were current. Here's the evaluation framework they're missing.

In March 2026, three independent comparison posts evaluated AI agent memory systems. All three used LoCoMo as their benchmark. All three ranked systems by retrieval hit rate. All three declared a winner. None of them asked the question that actually matters in production: what does the system do when it's wrong?

This isn't a criticism of LoCoMo. It's an excellent benchmark for what it tests: whether a system can surface a relevant memory given a query. But retrieval accuracy is a necessary condition for useful memory, not a sufficient one. A system that retrieves the right fact 90% of the time and confidently hallucinates the other 10% — with no mechanism to distinguish between them — is not a production-grade system. It's a liability with a good benchmark score.

The three failure modes LoCoMo can't catch

1. Confident retrieval of stale beliefs

Memory systems accumulate knowledge over time. That's the point. But knowledge changes. Your user's tech stack changes. Their team changes. Their priorities change. A memory system that retrieved a fact accurately in session 3 and still returns that same fact with the same confidence in session 47 — despite contradicting evidence accumulated in between — isn't malfunctioning according to LoCoMo. It's scoring a hit. The fact matches the query. Correct retrieval, wrong answer.

The failure mode: staleness without decay. No benchmark measures whether confidence scores track the age and corroboration of evidence. No benchmark measures whether a superseded belief is surfaced less prominently than its replacement.
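To make "confidence tracks age and corroboration" concrete, here's a minimal sketch of staleness-aware scoring. The `Belief` shape, the exponential decay, and the 30-day half-life are illustrative assumptions, not any system's actual formula:

```python
import math
from dataclasses import dataclass

@dataclass
class Belief:
    text: str
    age_days: float      # time since the last supporting observation
    corroborations: int  # independent observations supporting this belief

def confidence(belief: Belief, half_life_days: float = 30.0) -> float:
    """Confidence decays exponentially with age; each corroboration
    extends the effective half-life, slowing the decay."""
    effective_half_life = half_life_days * (1 + belief.corroborations)
    return math.exp(-math.log(2) * belief.age_days / effective_half_life)

fresh = Belief("deadline is Q2", age_days=2, corroborations=3)
stale = Belief("deadline is Q3", age_days=120, corroborations=0)

assert confidence(fresh) > confidence(stale)
```

Under this toy model the 120-day-old, uncorroborated belief sits at four half-lives of decay, while the fresh, corroborated one barely decays at all; a benchmark for staleness would check exactly this ordering.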

2. Simultaneous contradictory beliefs

Information accumulates from multiple sessions, multiple sources, multiple moments in time. Contradictions are inevitable. "The project deadline is Q3." Then later: "The deadline moved to Q2." Both facts exist in the memory store. What does the system do?

Most systems do nothing. They return both. Or they return whichever was retrieved with higher cosine similarity. The agent then has to figure out which to trust — and usually, it can't, because the memory layer didn't tell it that a contradiction exists.

The failure mode: unresolved contradictions surfaced as equivalent facts. LoCoMo doesn't test for this because its evaluation set doesn't systematically introduce contradicting information and then query across both sides of the contradiction.
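The bookkeeping for contradiction-aware ingestion is simple to sketch. The code below uses exact subject keys as a stand-in for real semantic matching (which would need NLI or an LLM judge); `FactStore` and its methods are hypothetical names for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FactStore:
    """Toy store keyed by subject. Writing a new value for an existing
    subject is flagged as a contradiction and the old value is demoted,
    rather than both being stored as equivalent facts."""
    facts: dict = field(default_factory=dict)       # subject -> current value
    superseded: dict = field(default_factory=dict)  # subject -> [old values]

    def ingest(self, subject: str, value: str) -> bool:
        """Returns True if this write contradicted an existing belief."""
        old = self.facts.get(subject)
        contradicted = old is not None and old != value
        if contradicted:
            self.superseded.setdefault(subject, []).append(old)
        self.facts[subject] = value
        return contradicted

store = FactStore()
store.ingest("project_deadline", "Q3")
conflict = store.ingest("project_deadline", "Q2")
# conflict is True: Q3 moves to `superseded` instead of being
# returned next to Q2 as an equally valid answer
```

The point is the return value: the ingestion path surfaces the contradiction at write time, so the retrieval path never has to guess.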

3. No confidence signal for the consuming agent

Retrieval systems return memories. The best ones also return relevance scores — typically cosine similarity between the query embedding and the memory embedding. This is a retrieval signal, not an epistemic one.

A memory with high cosine similarity to the query isn't necessarily a memory worth trusting. It might be unverified. It might conflict with two other memories the system didn't surface. It might be a single-observation belief that was never corroborated. The consuming agent has no way to know.

The failure mode: retrieval scores treated as trust scores. The downstream agent can't calibrate. It either trusts everything or trusts nothing.
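One way to separate the two signals is to return epistemic metadata alongside the relevance score, so the consuming agent can calibrate. The field names and the 0.6 threshold below are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RetrievedMemory:
    text: str
    relevance: float         # retrieval signal: query/memory similarity
    confidence: float        # epistemic signal: corroboration + recency
    corroborations: int
    has_open_conflict: bool  # an unresolved contradiction exists in the store

def should_trust(m: RetrievedMemory, threshold: float = 0.6) -> bool:
    # relevance says "this matches the query"; only the epistemic
    # fields say "this is worth acting on"
    return m.confidence >= threshold and not m.has_open_conflict

hit = RetrievedMemory("deadline is Q3", relevance=0.92,
                      confidence=0.3, corroborations=0,
                      has_open_conflict=True)
# high relevance, but should_trust(hit) is False
```

A 0.92 cosine match that is uncorroborated and in open conflict gets rejected; without the extra fields, the agent would have trusted it.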

What a complete evaluation framework looks like

We're not proposing to throw out LoCoMo. We're proposing to add dimensions. Here's what a complete memory system evaluation should measure:

| Dimension | What it tests | Current benchmarks |
| --- | --- | --- |
| Retrieval accuracy | Does the system surface the right memory for a query? | ✓ LoCoMo, MemoryArena |
| Staleness decay | Does confidence decrease as evidence ages without corroboration? | ✗ Not tested |
| Contradiction detection | Does the system flag when new information conflicts with stored beliefs? | ✗ Not tested |
| Supersession chains | When a belief is updated, is the old belief demoted and linked to its replacement? | ✗ Not tested |
| Confidence calibration | Do confidence scores correlate with factual accuracy across sessions? | ~ MemGPT (partial) |
| Cold-start quality | How much context does a new session start with, and how relevant is it? | ~ MemoryArena (partial) |
| Irrelevant decay | Do low-relevance memories fade over time to reduce noise? | ✗ Not tested |

The core problem: current benchmarks optimize for recall at retrieval time. Production memory systems need to optimize for trust at inference time. These are related but different objectives. A system can score well on one while failing catastrophically on the other.
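A probe for the supersession-chains row could be as small as this. The `ingest`/`retrieve` interface and the toy `RecencyMemory` system are assumptions for illustration, not any real benchmark's or library's API:

```python
def eval_supersession(memory_system) -> bool:
    """Probe: after a belief is updated, does the replacement outrank
    the stale belief at retrieval time?"""
    memory_system.ingest("The project deadline is Q3.")
    memory_system.ingest("The deadline moved to Q2.")
    results = memory_system.retrieve("When is the project deadline?")
    scores = dict(results)
    return (scores.get("The deadline moved to Q2.", 0.0)
            > scores.get("The project deadline is Q3.", 0.0))

class RecencyMemory:
    """Toy system that ranks later ingests higher (recency-only)."""
    def __init__(self):
        self.items = []

    def ingest(self, text):
        self.items.append(text)

    def retrieve(self, query):
        n = len(self.items)
        return [(t, (i + 1) / n) for i, t in enumerate(self.items)]

assert eval_supersession(RecencyMemory())
```

A pure-recency toy passes this probe; a pure-similarity system might not, since "The project deadline is Q3." can match the query more closely than its replacement. That gap is exactly what the table's untested rows are about.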

Why this matters more as agents run longer

A March 2026 survey of LLM agent memory architectures (arxiv.org/abs/2603.07670) found that autonomous agents lack principled governance for contradiction handling, knowledge filtering, and quality maintenance — and that this leads to compounding trust degradation over time. The longer the agent runs, the worse it gets.

This is the regime where retrieval accuracy alone breaks down as a metric. In a bounded benchmark like LoCoMo, which tests recall over a fixed conversation history, there's minimal opportunity for contradictions to accumulate. In real agentic deployments — where an agent is running across dozens or hundreds of sessions, accumulating knowledge from multiple users and data sources — the epistemic quality of the memory layer becomes the dominant factor in output quality.

A MemoryArena benchmark paper from the same month models this formally: multi-session agentic tasks are naturally a Partially Observable Markov Decision Process (POMDP). The agent never directly observes the full underlying state. Memory exists to approximate belief-state estimation. Optimal memory returns all-and-only information necessary to infer current task state. But current SOTA systems are "optimized for generic recall or compression, not task-relevant state variable preservation."

In plain terms: they're retrieving. They're not reasoning about what to trust.

What we're building at Pith

Pith is the cognitive governance layer for agent memory. Contradictions are detected at ingestion, not at retrieval. Confidence scores reflect corroboration and recency — not embedding similarity. Beliefs move through a lifecycle: observed, corroborated, promoted, superseded, decayed.
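One way to read that lifecycle is as a small state machine. The transition table below is my own sketch of the five states named above, not Pith's actual implementation:

```python
from enum import Enum, auto

class BeliefState(Enum):
    OBSERVED = auto()
    CORROBORATED = auto()
    PROMOTED = auto()
    SUPERSEDED = auto()
    DECAYED = auto()

# Allowed transitions (illustrative): a belief can be superseded or
# decay from any active state, but never resurrect once decayed.
TRANSITIONS = {
    BeliefState.OBSERVED:     {BeliefState.CORROBORATED,
                               BeliefState.SUPERSEDED, BeliefState.DECAYED},
    BeliefState.CORROBORATED: {BeliefState.PROMOTED,
                               BeliefState.SUPERSEDED, BeliefState.DECAYED},
    BeliefState.PROMOTED:     {BeliefState.SUPERSEDED, BeliefState.DECAYED},
    BeliefState.SUPERSEDED:   {BeliefState.DECAYED},
    BeliefState.DECAYED:      set(),
}

def advance(state: BeliefState, target: BeliefState) -> BeliefState:
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {target.name}")
    return target
```

The useful property of modeling it this way: an agent can treat the state as a coarse trust signal, acting freely on `PROMOTED` beliefs and discounting anything `SUPERSEDED` or `DECAYED`.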

If you're building agents that run across multiple sessions, we'd like to show you what this looks like in practice.


Originally published at pith.run/blog/what-memory-benchmarks-dont-test