A factual comparison of the five most-referenced AI agent memory systems on architecture, LoCoMo benchmark scores, and EU AI Act compliance.
Why This Post Exists
Every comparison post I've read either reads like marketing for one system or compares features without benchmark data. This one is different: I'm the author of one of the systems (SuperLocalMemory), which gives me a strong incentive to be honest. If I misrepresent the others, I undermine my own credibility.
All scores are from published papers or official documentation. I've noted where scores vary across sources.
The Five Systems
| System | Architecture | Creator | License | Status |
|---|---|---|---|---|
| Mem0 | Cloud-hosted | Mem0 AI ($24M funded) | Open core | Production |
| Zep | Cloud-hosted + self-host | Getzep | Apache 2.0 + Commercial | Production |
| Letta (MemGPT) | Agent framework + LLM memory | Letta AI | Apache 2.0 | Production |
| Supermemory | Cloud-hosted | Open source project | MIT | Production |
| SuperLocalMemory | Local-first mathematical | Independent research | MIT | Production |
LoCoMo Benchmark Results
The LoCoMo benchmark (Long Conversation Memory) is the most widely cited evaluation for this space — 81 question-answer pairs across long multi-session conversations.
| System | Score | Cloud LLM Required | Open Source |
|---|---|---|---|
| EverMemOS | 92.3% | Yes | No |
| MemMachine | 91.7% | Yes | No |
| Hindsight | 89.6% | Yes | No |
| SLM (SuperLocalMemory) V3 Mode C | 87.7% | Yes (synthesis) | Yes (MIT) |
| Zep | ~85% | Yes | Partial |
| Letta / MemGPT | ~83.2% | Yes | Yes (Apache) |
| SLM V3 Mode A | 74.8% | No | Yes (MIT) |
| Supermemory | ~70%* | Yes | Yes (MIT) |
| Mem0 (self-reported) | ~66% | Yes | Partial |
| SLM V3 Zero-LLM | 60.4% | No LLM at all | Yes (MIT) |
| Mem0 (independent) | ~58% | Yes | Partial |
*Supermemory score estimated from limited published data.
Key takeaway: every system that requires a cloud LLM clusters between 83% and 92%. SuperLocalMemory Mode A achieves 74.8% with zero cloud dependency, demonstrating that mathematical retrieval captures most of the benchmark value without cloud compute. Mode C reaches 87.7%, competitive with the top tier.
Architecture Comparison
Mem0
- Model: Cloud-first, API-based. Memories stored on Mem0's servers.
- Retrieval: Vector similarity over cloud embeddings (typically OpenAI).
- Best for: Teams needing shared memory, managed infrastructure, cross-device access.
- Limitation: Data sovereignty, offline use, and EU AI Act compliance all require additional work.
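The retrieval pattern described above (vector similarity over embeddings) can be sketched in a few lines. This is a toy illustration of the general technique, not Mem0's actual API; the embeddings are hand-written vectors, whereas a real system would call an embedding model.

```python
# Hedged sketch of vector-similarity retrieval: embed the query,
# rank stored memories by cosine similarity, return the best match.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy memory store: text -> pretend embedding
memories = {
    "user prefers dark mode": [0.9, 0.1, 0.0],
    "meeting moved to Friday": [0.1, 0.8, 0.3],
}

# Pretend embedding of "what theme does the user like?"
query_vec = [0.85, 0.15, 0.05]

best = max(memories, key=lambda text: cosine(query_vec, memories[text]))
print(best)  # "user prefers dark mode"
```

The same shape underlies all the cloud-first systems in this table; the differences lie in where the embeddings are computed and stored.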
Zep
- Model: Temporal knowledge graph hosted in cloud (or self-hosted Community Edition).
- Retrieval: Graph-based temporal reasoning + semantic similarity.
- Best for: Complex agent workflows requiring temporal entity relationships.
- Limitation: Self-hosting requires infrastructure management; cloud version has same data locality issues as Mem0.
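The temporal-knowledge-graph idea can be made concrete with a tiny sketch: edges carry validity intervals, so a query can ask what was true "as of" a given time. The schema and function names here are illustrative assumptions, not Zep's actual data model.

```python
# Toy temporal knowledge graph: facts are edges with validity intervals,
# so retrieval can reason about when a relationship held.
from datetime import date

# (subject, relation, object, valid_from, valid_to); None = still valid
edges = [
    ("alice", "works_at", "AcmeCo",  date(2022, 1, 1), date(2024, 6, 1)),
    ("alice", "works_at", "BetaInc", date(2024, 6, 2), None),
]

def facts_as_of(subject, relation, when):
    """Return objects whose validity interval contains `when`."""
    return [
        obj for s, r, obj, start, end in edges
        if s == subject and r == relation
        and start <= when and (end is None or when <= end)
    ]

print(facts_as_of("alice", "works_at", date(2023, 3, 1)))  # ['AcmeCo']
print(facts_as_of("alice", "works_at", date(2025, 1, 1)))  # ['BetaInc']
```

This is what "temporal reasoning" buys over plain vector search: a flat similarity index would happily return both employers with no notion of which fact is current.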
Letta (MemGPT)
- Model: OS-inspired agent framework. LLM manages memory tiers (core context, recall, archival).
- Retrieval: LLM-driven — the model decides what to retrieve and when.
- Best for: Building agents where memory management logic needs to be customizable by the LLM.
- Limitation: Requires LLM for all memory operations. Memory decisions inherit LLM opacity.
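The OS-inspired tiering idea can be sketched without any LLM in the loop: a bounded "core" context, with overflow paged out to an archival store and searched on a miss. Class and tier names are assumptions for illustration; MemGPT's real design has the LLM itself issue the paging and recall operations.

```python
# Minimal sketch of MemGPT-style memory tiers: bounded core context,
# eviction to archival storage, recall falls back to archival on a miss.
class TieredMemory:
    def __init__(self, core_size=3):
        self.core_size = core_size
        self.core = []      # in-context working memory (bounded)
        self.archival = []  # long-term store, searched on core miss

    def remember(self, item):
        self.core.append(item)
        while len(self.core) > self.core_size:
            # Page the oldest item out of context, like OS memory eviction.
            self.archival.append(self.core.pop(0))

    def recall(self, keyword):
        # Check the core context first, then fall back to archival search.
        hits = [m for m in self.core if keyword in m]
        return hits or [m for m in self.archival if keyword in m]

mem = TieredMemory(core_size=2)
for note in ["likes tea", "born in May", "works remotely"]:
    mem.remember(note)

print(mem.recall("tea"))  # ['likes tea'] -- evicted to archival, still found
```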
Supermemory
- Model: Cloud-hosted with importable sources (tweets, web pages, documents).
- Retrieval: Vector similarity + semantic search.
- Best for: Personal knowledge management with multi-source ingestion.
- Limitation: Cloud dependency; primarily designed for personal knowledge, not agent memory.
SuperLocalMemory V3
- Model: Local-first with three mathematical retrieval layers.
- Retrieval: 4-channel RRF fusion: Fisher-Rao geometric + BM25 lexical + entity graph + temporal.
- Best for: Privacy-required workloads, EU AI Act compliance, individual developer memory, zero-cloud operation.
- Limitation: Single-device by default; no native team sharing.
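The 4-channel RRF fusion mentioned above can be sketched generically. Reciprocal Rank Fusion scores each document by summing 1/(k + rank) across the ranked lists that contain it. The channel names and the constant k=60 are assumptions for illustration, not SuperLocalMemory's actual parameters.

```python
# Illustrative sketch of Reciprocal Rank Fusion (RRF) over multiple
# retrieval channels: score(d) = sum over channels of 1 / (k + rank(d)).
def rrf_fuse(channel_rankings, k=60):
    """Fuse ranked lists of doc IDs (best first) into one ranking."""
    scores = {}
    for ranking in channel_rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Four hypothetical channels returning memory IDs, best first:
geometric = ["m3", "m1", "m7"]   # geometric similarity channel
lexical   = ["m1", "m3", "m9"]   # BM25 lexical channel
entities  = ["m7", "m1"]         # entity-graph channel
temporal  = ["m1", "m2"]         # temporal/recency channel

fused = rrf_fuse([geometric, lexical, entities, temporal])
print(fused[0])  # "m1" wins: it ranks highly in all four channels
```

RRF's appeal here is that it needs only rank positions, not comparable scores, so heterogeneous channels (geometric distance, BM25, graph hits, recency) can be fused without calibrating their score scales against each other.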
EU AI Act Compliance (Takes Effect August 2, 2026)
This dimension is increasingly important for enterprise deployment in the EU.
| System | Mode A Compliance | Notes |
|---|---|---|
| SuperLocalMemory Mode A | ✅ By architecture | Data never leaves device. Zero cloud calls. |
| All others | ❌ Requires work | DPA required. Data sent to cloud providers. |
SuperLocalMemory is the only system in this table that claims compliance-by-architecture. The others can achieve compliance, but only through additional legal and technical measures.
The Right Tool for the Job
None of these systems is "best." The right choice depends on your requirements:
Need team memory? → Mem0 or Zep. Both are designed for shared memory.
Need LLM to manage memory logic? → Letta. It's designed for LLM-driven memory management.
Need data sovereignty or EU AI Act compliance? → SuperLocalMemory Mode A. Only local-first provides this by architecture.
Need the highest benchmark score? → None of the open systems. EverMemOS/MemMachine/Hindsight score higher, but aren't open source.
Need open source + high score? → SuperLocalMemory Mode C (87.7%) or Letta (~83.2%).
Need zero cloud costs forever? → SuperLocalMemory Mode A. No API costs, no subscription.
My System (Full Disclosure)
I'm the author of SuperLocalMemory V3. I've tried to be factually accurate about all five systems. If I've gotten something wrong, open an issue on the repo or comment below.
Paper: arXiv:2603.14588
Code: github.com/qualixar/superlocalmemory
Website: superlocalmemory.com
Varun Pratap Bhardwaj — Independent Researcher
A Qualixar Research Initiative