DEV Community

Discussion on: 5 AI Agent Memory Systems Compared: Mem0, Zep, Letta, Supermemory, SuperLocalMemory (2026 Benchmark Data)

Andrew Estey-Ang

Great benchmark roundup — LoCoMo is a solid eval framework for retrieval accuracy and the head-to-head comparisons are useful.

One dimension I'd love to see added: none of these systems are evaluated on what happens after retrieval. Specifically:
- When two stored memories contradict each other, which one wins?
- When a memory becomes stale (user changed jobs, moved cities, changed preferences), does the system detect and decay it?
- Does confidence in a retrieved memory track with the quality of the evidence that created it?

These are epistemic governance questions — they sit above the retrieval layer and determine whether what the agent remembers is actually what the agent should believe.

LoCoMo tests "did the agent recall the right thing?" There's an entire evaluation dimension above that: "should the agent trust what it recalled?" I'd argue that's where the next generation of memory benchmarks needs to go.

Would be curious if any of the five systems here have mechanisms for this, or if it's genuinely an open problem in the space.

Penfield

You're right that retrieval accuracy is only half the picture. We audited the LoCoMo benchmark specifically and found serious methodological issues that affect the validity of these scores and how they should be interpreted: github.com/dial481/locomo-audit

The deeper gap you're describing - contradiction resolution, confidence tracking, whether agents should trust retrieved memories - maps to what we think of as typed relationships at the storage layer. If a memory can explicitly supersede, contradict, or mark itself as an evolution_of a previous memory, the agent has the primitives to do epistemic governance without needing a separate system for it.
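If the storage layer carries those typed edges, the resolution rule itself can be small. A minimal sketch, assuming Python; the record shape, relation names, and the 0.5 conflict penalty are illustrative assumptions (borrowing the supersedes / contradicts / evolution_of vocabulary above), not the API of any system in the benchmark:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical relation types, named after the vocabulary in the comment.
SUPERSEDES = "supersedes"
CONTRADICTS = "contradicts"
EVOLUTION_OF = "evolution_of"

@dataclass
class Memory:
    id: str
    text: str
    confidence: float                       # 0..1, set at write time
    created_at: datetime
    relations: list = field(default_factory=list)  # (relation, target_id) edges

def resolve(memories):
    """Apply typed edges before the agent trusts anything.

    A memory that a newer memory explicitly supersedes is dropped;
    both sides of an unresolved contradiction are down-weighted.
    Returns (memory, effective_confidence) pairs for the survivors.
    """
    superseded = {t for m in memories for r, t in m.relations if r == SUPERSEDES}
    contradicted = {t for m in memories for r, t in m.relations if r == CONTRADICTS}
    surviving = []
    for m in memories:
        if m.id in superseded:
            continue  # an explicit successor wins outright
        conf = m.confidence
        if m.id in contradicted or any(r == CONTRADICTS for r, _ in m.relations):
            conf *= 0.5  # unresolved conflict: trust neither side fully
        surviving.append((m, conf))
    return surviving

now = datetime.now(timezone.utc)
old = Memory("m1", "User works at Acme", 0.9, now)
new = Memory("m2", "User works at Globex", 0.8, now, relations=[(SUPERSEDES, "m1")])
# resolve([old, new]) keeps only m2 at confidence 0.8
```

The point of the sketch is the division of labor: the storage layer only records the edges, and the (cheap) epistemic policy lives at read time, so the agent never sees a superseded fact at full confidence.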

We wrote about this more broadly here: dev.to/penfieldlabs/we-audited-loc...