ClawBase

Posted on May 28

I Tested 33 AI Memory Engines — Here's What Actually Works

#ai #llm #aiagents #productivity

Six months ago, I asked my AI agent what we'd been working on the week before. It had no idea. The problem wasn't that it couldn't remember (ChatGPT has memory, Claude has memory). The problem was that I couldn't see what it stored, couldn't query it, and couldn't tell it what to forget. It was a black box with a toggle that says "memory: on."

So I started testing every memory framework I could find — 33 engines total, running on OpenClaw (350K+ GitHub stars). Most solved one problem well and failed at everything else.

After 6 months, I landed on an architecture that actually works. It's not about one magic engine — it's about layers.

The memory stack your agent actually needs

Before diving into the 33 engines, here's what I learned: agent memory isn't one thing. It's a stack, much as a human brain has short-term memory, long-term memory, and the ability to look things up.

A working agent memory stack has 3 layers:

Layer 1: Conversation compression — remembering what just happened

Every conversation eventually hits the context window limit. Without this layer, your agent literally forgets the beginning of your current conversation. A conversation compressor (like Lossless-Claw) keeps a DAG of summaries — compacting older turns into condensed summaries while keeping the most recent turns untouched. Your agent never loses mid-session context.
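To make the idea concrete, here is a minimal sketch of that kind of compaction. It is not Lossless-Claw's actual code: `ConversationCompressor`, `SummaryNode`, and `naive_summarize` are hypothetical names, and the summarizer is a truncating placeholder where a real compressor would call an LLM. Recent turns stay verbatim, older turns fold into summary nodes, and each node keeps links to what it covers so nothing is thrown away.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SummaryNode:
    """A node in the summary DAG: condensed text plus links to what it covers."""
    text: str
    children: list = field(default_factory=list)  # raw turns (str) or SummaryNode


def naive_summarize(texts: list[str]) -> str:
    # Placeholder only: a real compressor would ask an LLM for a summary here.
    return " | ".join(t[:40] for t in texts)


class ConversationCompressor:
    def __init__(self, keep_recent: int = 4, chunk: int = 4,
                 summarize: Callable[[list[str]], str] = naive_summarize):
        if chunk < 2:
            raise ValueError("chunk must be at least 2")
        self.keep_recent = keep_recent
        self.chunk = chunk
        self.summarize = summarize
        self.summaries: list[SummaryNode] = []
        self.recent: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        # Fold the oldest chunk of turns once it has aged out of the recent window.
        while len(self.recent) >= self.keep_recent + self.chunk:
            old, self.recent = self.recent[:self.chunk], self.recent[self.chunk:]
            self.summaries.append(SummaryNode(self.summarize(old), children=old))
        # Fold the oldest summaries into a higher-level summary so this list stays short.
        while len(self.summaries) > self.chunk:
            group, self.summaries = self.summaries[:self.chunk], self.summaries[self.chunk:]
            merged = SummaryNode(self.summarize([n.text for n in group]), children=group)
            self.summaries.insert(0, merged)

    def context(self) -> list[str]:
        """What the model sees: summaries first, then untouched recent turns."""
        return [f"[summary] {n.text}" for n in self.summaries] + self.recent
```

Because every `SummaryNode` keeps its `children`, the original turns remain reachable by walking down from any summary, which is the property that lets the agent expand a summary again when it needs the detail.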

Layer 2: Native files + semantic search — the persistent record

Plain markdown files your agent reads and writes: daily journals (2026-05-28.md), a curated MEMORY.md, preference files, project notes. Simple, version-controlled, human-readable. No database, no API, no dependencies — this is the memory layer that survives everything.
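As a rough illustration of how little machinery this layer needs, here is a sketch of an agent appending to a daily journal and reading its memory directory back. The helper names (`append_journal`, `load_memory`) and the file layout (a date heading followed by bullets) are my assumptions for the example, not a format OpenClaw prescribes.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_journal(memory_dir: Path, note: str, day: Optional[date] = None) -> Path:
    """Append a bullet to the daily journal file, creating the file if needed."""
    day = day or date.today()
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{day.isoformat()}.md"
    if not path.exists():
        # Assumed layout: one heading with the date, then one bullet per note.
        path.write_text(f"# {day.isoformat()}\n\n", encoding="utf-8")
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")
    return path


def load_memory(memory_dir: Path) -> dict[str, str]:
    """Read every markdown file in the directory, keyed by filename."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(memory_dir.glob("*.md"))}
```

Since these are ordinary files, you can put the directory under git and get diffs, blame, and rollback for your agent's memory for free.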

A local embedding model indexes these files and lets your agent search by meaning, not just keywords. "How did we handle the auth migration?" finds the right entry even if it never used the word "auth." QMD runs a 333MB GGUF model locally — sub-second search, no API costs, no data leaving your machine. The files are the source of truth; the embeddings make them instantly searchable.
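The indexing loop itself is simple, and a sketch helps show where the embedding model plugs in. Everything here is illustrative: `FileIndex` is a hypothetical class, not QMD's API, and `toy_embed` is a word-count stand-in that only matches shared words. Swapping in a real local embedding model for `embed` is what gives you search by meaning.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedder (word counts). A real setup would call a local model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class FileIndex:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed):
        self.embed = embed
        self.entries: list[tuple[str, str, Vector]] = []  # (source, paragraph, vector)

    def add(self, source: str, text: str) -> None:
        # Index one paragraph at a time so hits point at a specific passage.
        for para in filter(None, (p.strip() for p in text.split("\n\n"))):
            self.entries.append((source, para, self.embed(para)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str, str]]:
        """Return the top-k (score, source, paragraph) tuples by cosine similarity."""
        q = self.embed(query)
        scored = [(cosine(q, vec), src, para) for src, para, vec in self.entries]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:k]
```

The files stay the source of truth in this design: the index can be deleted and rebuilt from the markdown at any time.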

Layer 3: The long-term intelligence engine — this is where you choose

The first two layers are table stakes. Every serious agent needs them. The third layer is where the 33 engines I tested come in — and where the real differences emerge.

The 33 engines I tested

Here's every memory framework I put through real-world use — not benchmarks, not demos, actual daily agent work. They naturally group into 6 categories, each solving a different type of remembering:

Vector similarity — the foundation layer

These engines store embeddings and retrieve by semantic similarity. They're the building blocks most other memory systems are built on top of.

#	Engine	What it does
1	ChromaDB	Embedding-based semantic search, lightweight and developer-friendly
2	Qdrant	High-performance vector similarity search with filtering
3	Weaviate	Hybrid vector + keyword search with pluggable modules
4	Milvus	Distributed vector database built for scale
5	Pinecone	Serverless managed vector search
6	pgvector	Vector similarity search as a PostgreSQL extension
7	FAISS	Meta's similarity search library — raw speed, no frills
8	Redis Vector	Vector similarity on Redis Stack
9	Supabase Vector	pgvector on managed Postgres with auth and APIs
10	Marqo	End-to-end tensor search engine
11	Deep Lake	Vector store optimized for AI dataset versioning
12	Vespa	Hybrid search + ML serving at scale

These are excellent at "find me something similar to X" but they don't understand what they're storing. A vector store treats your preferences, your project architecture, and last Tuesday's standup notes the same way — as floating-point arrays. For RAG and document retrieval, they're essential. For agent memory, they're a necessary layer but not sufficient on their own.

Session & conversation memory — remembering the current thread

These keep track of what's been said within and across conversations.

#	Engine	What it does
13	Zep	Long-term conversation memory with automatic fact extraction
14	Motorhead	Redis-backed conversation memory server
15	OpenAI Memory	ChatGPT's native conversation memory
16	Claude Memory	Anthropic's native conversation memory

These solve the "I already told you this" problem within a session. Zep stands out here — it goes beyond simple buffer storage and extracts structured facts from conversations. But session memory alone doesn't give your agent a persistent understanding of your world.

Framework memory modules — memory as a feature

These are memory components built into larger agent/RAG frameworks.

#	Engine	What it does
17	LlamaIndex Memory	Chat memory + knowledge index integration
18	LangChain Memory	Buffer, summary, and entity memory modules
19	LangMem	Memory management primitives for LangChain/LangGraph
20	Haystack Memory	Document store memory in RAG pipelines
21	txtai	All-in-one embeddings database with workflows
22	CrewAI Memory	Short/long/entity memory for multi-agent crews

Good if you're already inside that ecosystem. They give you memory abstractions (buffers, summaries, entity tracking) but they're tightly coupled to their framework. Memory is a feature of these tools, not their core mission.

Agentic & autonomous memory — the agent manages its own memory

These let the agent itself decide what to remember and what to forget.

#	Engine	What it does
23	Letta (MemGPT)	Self-editing memory with inner/outer monologue
24	AutoGPT Memory	File + vector memory for autonomous agents
25	Memary	Knowledge graph memory for autonomous agents
26	AGiXT	Adaptive memory with chained agent context
27	BabyAGI	Task-driven memory with priority queues

Fascinating research direction. Letta/MemGPT in particular pioneered the idea of the model managing its own memory tiers. The challenge in production: you're trusting the LLM to decide what's worth keeping, and that decision quality varies with the model and context.

Personal AI & bookmarks — memory for humans, not agents

#	Engine	What it does
28	Khoj	Self-hosted personal AI with file-based memory
29	SuperMemory	AI-powered memory for saved content and bookmarks
30	Vanna	RAG-based memory for database queries

These are designed more as personal knowledge tools than agent memory layers. They work well for their use case, but they're solving a different problem — helping you remember things, not giving your agent persistent understanding.

Structured memory engines — purpose-built for agent intelligence

These are the engines designed specifically to give agents structured, queryable, persistent memory:

#	Engine	What it does
31	Mem0	Intelligent fact extraction, deduplication, contradiction resolution
32	Cognee	Entity-relationship knowledge graphs with 14 retrieval modes
33	Graphiti	Temporal knowledge graph with validity windows

This is where it gets interesting — and where I spent most of my 6 months.

The 3 tiers of long-term memory

After testing all 33, the structured memory engines stood out. But here's the insight that took me months to reach: these three aren't meant to run together. They're evolutionary tiers. Each one supersedes the previous one, adding capabilities while covering the lower tier's functionality.

Tier 1: Mem0 — facts and preferences

Mem0 (48K+ GitHub stars, $24M Series A) is the intelligent facts layer. Tell your agent "I prefer TypeScript" on Monday and "use Python for data scripts" on Thursday — Mem0 doesn't store two contradictory entries. It updates: TypeScript for general dev, Python for data. Every fact is categorized, timestamped, and confidence-scored.
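Here is a toy version of that update-instead-of-duplicate behavior. It is not Mem0's API or internals: `FactStore` and its methods are hypothetical, and the caller passes the topic and scope explicitly, whereas a real engine would extract them from the conversation with an LLM. Confidence scoring is left out to keep the sketch short.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    topic: str          # e.g. "language"
    scope: str          # e.g. "general" or "data scripts"
    value: str
    updated_at: datetime


class FactStore:
    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], Fact] = {}

    def remember(self, topic: str, value: str, scope: str = "general") -> Fact:
        """Upsert: the same (topic, scope) is overwritten, a new scope sits alongside."""
        fact = Fact(topic, scope, value, datetime.now(timezone.utc))
        self._facts[(topic, scope)] = fact
        return fact

    def recall(self, topic: str, scope: str = "general") -> Optional[str]:
        # Prefer the scoped fact, then fall back to the general one.
        fact = self._facts.get((topic, scope)) or self._facts.get((topic, "general"))
        return fact.value if fact else None

    def forget(self, topic: str, scope: str = "general") -> bool:
        """Delete a fact; returns True if something was actually removed."""
        return self._facts.pop((topic, scope), None) is not None


store = FactStore()
store.remember("language", "TypeScript")                     # Monday
store.remember("language", "Python", scope="data scripts")   # Thursday
```

After those two calls the store holds two scoped entries rather than two contradictory ones, and `forget` gives you the explicit "delete that" control the built-in memories lack.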

Where Zep's fact extraction is a feature bolted onto session memory, Mem0's entire architecture is built around making facts reliable. Your agent starts every session already knowing your preferences, your project's quirks, and your conventions. No re-explaining.

Best for: developers and technical use cases. If your agent mainly needs to remember preferences, conventions, and project details across sessions, Mem0 is the right choice. It's the simplest to set up and the most focused.

Tier 2: Cognee — relationships and reasoning (supersedes Mem0)

Cognee ($7.5M seed, GitHub Secure Open Source graduate, running in 70+ companies) builds a knowledge graph — not isolated facts, but a web of entities, relationships, and semantic connections.

Where Mem0 knows "the client prefers blue branding," Cognee knows that the client's brand guidelines connect to last month's campaign performance, which connects to the audience segments that engaged most, which connects to the content calendar. It ships 14 retrieval modes and a self-improving "memify" feature that strengthens connections the more you use them.

Cognee handles everything Mem0 does (facts are just nodes in the graph) plus it maps the relationships between them. That's why it supersedes Tier 1 — you don't need Mem0 if you're running Cognee.
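A small sketch shows why a graph answers questions a flat fact list can't. This is not Cognee's API: `KnowledgeGraph` is a hypothetical in-memory class, the entities and relation names are made up for the example, and the only retrieval mode is a plain breadth-first search for the chain of relationships between two entities.

```python
from collections import defaultdict, deque
from typing import Optional


class KnowledgeGraph:
    def __init__(self) -> None:
        # source entity -> list of (relation, target entity)
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges[source].append((relation, target))

    def path(self, start: str, goal: str) -> Optional[list[str]]:
        """Breadth-first search; returns the shortest chain of hops, or None."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, hops = queue.popleft()
            if node == goal:
                return hops
            for relation, target in self.edges.get(node, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, hops + [f"{node} -[{relation}]-> {target}"]))
        return None


graph = KnowledgeGraph()
graph.relate("brand guidelines", "shaped", "last month's campaign")
graph.relate("last month's campaign", "engaged", "top audience segment")
graph.relate("top audience segment", "targeted_by", "content calendar")
chain = graph.path("brand guidelines", "content calendar")
```

Here `chain` holds the three hops connecting the guidelines to the calendar, which is the kind of "how do these connect" answer a store of isolated facts has no way to produce.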

Best for: marketing, content, and multi-project work. If your agent needs to reason across brands, campaigns, audiences, and projects — understanding how things connect, not just what things are — Cognee is the right choice.

Tier 3: Graphiti — temporal reasoning (supersedes Cognee)

Graphiti by Zep is the temporal knowledge graph. Its core insight: knowing the current state isn't enough. You need to know when things changed and what was true before.

Every fact carries validity intervals. When new information conflicts with old, Graphiti doesn't overwrite — it creates a temporal record and invalidates the previous one, preserving full history. "When did this config change?" "What was different before the March deploy?" Graphiti answers directly, no digging through logs.
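The invalidate-don't-overwrite idea fits in a few lines. This sketch is not Graphiti's API: `TemporalStore` and `TemporalFact` are hypothetical names, the config key is invented, and it assumes facts are asserted in chronological order. Each record carries a validity window, and a conflicting new value closes the old window instead of replacing the record.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TemporalFact:
    key: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"


class TemporalStore:
    def __init__(self) -> None:
        self.history: list[TemporalFact] = []

    def assert_fact(self, key: str, value: str, at: datetime) -> None:
        """Close the currently valid record for this key, then append the new one."""
        for fact in self.history:
            if fact.key == key and fact.valid_to is None:
                fact.valid_to = at
        self.history.append(TemporalFact(key, value, valid_from=at))

    def as_of(self, key: str, at: datetime) -> Optional[str]:
        """What was true for this key at the given moment?"""
        for fact in self.history:
            if (fact.key == key and fact.valid_from <= at
                    and (fact.valid_to is None or at < fact.valid_to)):
                return fact.value
        return None

    def changes(self, key: str) -> list[tuple[datetime, str]]:
        """Every value this key has had, with the time it took effect."""
        return [(f.valid_from, f.value) for f in self.history if f.key == key]


store = TemporalStore()
store.assert_fact("db.pool_size", "10", datetime(2026, 1, 5))
store.assert_fact("db.pool_size", "50", datetime(2026, 3, 12))
before_deploy = store.as_of("db.pool_size", datetime(2026, 2, 1))
```

`before_deploy` answers "what was true before the March deploy?" directly from the validity windows, and `changes` answers "when did this config change?" without any log digging.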

According to Zep's published evaluation, it outperforms MemGPT on the Deep Memory Retrieval benchmark, using a combination of semantic search, keyword matching, and graph traversal.

Graphiti handles facts (like Mem0) and relationships (like Cognee), and it also tracks how they change over time. It supersedes both lower tiers, but it's also the heaviest to run: it needs a graph database such as FalkorDB, more compute, and more operational complexity.

Best for: operations, executive, and business use cases. If your agent needs cause-and-effect reasoning across time — "what changed," "when did it break," "what was true before" — Graphiti is the right choice.

Pick one, not all three

Your use case	Pick this tier	Why
Developer / DevOps	Mem0	You need fast, reliable fact recall. Preferences, conventions, project details.
Marketing / Content	Cognee	You need relationship reasoning. Brands, campaigns, audiences, how they connect.
Operations / Executive	Graphiti	You need temporal reasoning. What changed, when, and what broke.

The common mistake is assuming that more engines mean better memory. They don't. Each tier already includes the capabilities of the one below it. Running Mem0 alongside Graphiti is redundant, because Graphiti already stores facts. Running all three wastes compute and creates consistency conflicts when the engines disagree about the same fact.

Pick the tier that matches your work. Pair it with the base stack (conversation compression + native files with semantic search) and your agent will remember everything that matters.

The full architecture

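Here's what a complete agent memory stack looks like, from bottom to top: conversation compression, then native files with semantic search, then the one long-term engine you picked (Mem0, Cognee, or Graphiti).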

Every layer feeds context to the model. The bottom two are always-on. The top one is your choice based on what kind of reasoning your agent needs.
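In code, "every layer feeds context to the model" can be as simple as asking each layer for relevant snippets and concatenating them. The sketch below is an assumption about how such glue might look, not OpenClaw's implementation: `MemoryLayer`, `StaticLayer`, and `build_context` are hypothetical, and `StaticLayer` is a substring-matching stand-in for the real layers.

```python
from typing import Optional, Protocol


class MemoryLayer(Protocol):
    name: str

    def recall(self, query: str) -> list[str]: ...


class StaticLayer:
    """Stand-in layer: returns canned snippets that contain the query text."""

    def __init__(self, name: str, snippets: list[str]):
        self.name = name
        self.snippets = snippets

    def recall(self, query: str) -> list[str]:
        q = query.lower()
        return [s for s in self.snippets if q in s.lower()]


def build_context(query: str, compression: MemoryLayer, files: MemoryLayer,
                  long_term: Optional[MemoryLayer] = None) -> str:
    """Collect snippets from each layer, bottom to top, labeled by source."""
    layers = [compression, files]
    if long_term is not None:  # the top layer is optional and swappable
        layers.append(long_term)
    lines = []
    for layer in layers:
        for snippet in layer.recall(query):
            lines.append(f"[{layer.name}] {snippet}")
    return "\n".join(lines)
```

Keeping the long-term engine behind the same small interface as the other layers is what makes "pick one tier" cheap to change later: swapping engines means swapping one object, not rewriting the agent.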

Getting this running

The base stack (layers 1–2) is built into OpenClaw: conversation compression, native memory files, and semantic search work out of the box. The long-term engine (layer 3) requires additional setup: Mem0 needs a vector store, Cognee needs a graph database, and Graphiti needs a graph database such as FalkorDB.

OpenClaw is open source and you can self-host the full stack. If you want to skip the infrastructure work, I've been building ClawBase — managed OpenClaw hosting that pre-configures the right memory stack for your use case. But honestly, even if you self-host, the main takeaway here is the architecture: a 3-layer memory stack where you pick the long-term engine that matches your work.

Whichever way you run it, the memory compounds: the longer you use it, the more context your agent has to draw on.

One thing I keep coming back to: once your agent has a real memory stack, it opens the door to something bigger — consistent shared memory across multiple agents. Imagine a team of agents that don't just remember their own context, but share a unified understanding of your projects, preferences, and decisions. That's a different kind of architecture entirely, and one I'll dig into in a future article.

Top comments (1)

Harjot Singh • May 31

"A black box with a toggle that says memory: on" is the exact failure of every built-in memory, and you put your finger on the three things that actually matter and are all missing: you can't see what it stored, can't query it, and can't tell it what to forget. Those three properties (inspectable, queryable, editable) are what turn memory from a vibe into infrastructure you can trust. Forgetting is the most underrated, because a memory you can't correct accumulates wrong or stale facts and then confidently acts on them: the agent poisons its own memory. The ability to say "that's wrong, delete it" is therefore a correctness feature, not a convenience. Testing 33 and finding that most solve one slice and fail the rest tracks with the whole space being immature: everyone built storage, almost nobody built control. The bar I hold is that memory should be a transparent store I can audit and curate, plus a hard layer of negative constraints ("never do this again"), not an opaque on switch. That see-it, query-it, correct-it stance is core to how I build memory in Moonshift. Of the 33, did any nail the editability/forget piece, or was that the universal gap?