Six months ago, I asked my AI agent what we'd been working on the week before. It had no idea. The problem wasn't a lack of memory features (ChatGPT has memory, Claude has memory). It was that I couldn't see what it stored, couldn't query it, and couldn't tell it what to forget. A black box with a toggle that says "memory: on."
So I started testing every memory framework I could find — 33 engines total, running on OpenClaw (350K+ GitHub stars). Most solved one problem well and failed at everything else.
After 6 months, I landed on an architecture that actually works. It's not about one magic engine — it's about layers.
The memory stack your agent actually needs
Before diving into the 33 engines, here's what I learned: agent memory isn't one thing. It's a stack, much as the human brain has short-term memory, long-term memory, and the ability to look things up.
A working agent memory stack has 3 layers:
Layer 1: Conversation compression — remembering what just happened
Every conversation eventually hits the context window limit. Without this layer, your agent forgets the beginning of your current conversation. A conversation compressor (like Lossless-Claw) keeps a DAG of summaries: it compacts older turns into condensed summaries while leaving the most recent turns untouched. Your agent keeps its mid-session context instead of silently dropping it.
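Here's a minimal sketch of that compaction idea. It is not Lossless-Claw's actual implementation: `SummaryNode`, `compact`, and `naive_summarize` are hypothetical names, and the summarizer is a placeholder where a real compressor would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class SummaryNode:
    """A condensed summary that keeps pointers to what it replaced."""
    text: str
    children: List[object] = field(default_factory=list)  # raw turns or older summaries


Item = Union[str, SummaryNode]


def naive_summarize(texts: List[str]) -> str:
    # Placeholder (assumption): a real compressor would ask an LLM for a summary.
    return "summary(" + " | ".join(t[:20] for t in texts) + ")"


def compact(history: List[Item], keep_recent: int = 4,
            summarize: Callable[[List[str]], str] = naive_summarize) -> List[Item]:
    """Fold everything except the newest `keep_recent` items into one summary node."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    texts = [item.text if isinstance(item, SummaryNode) else item for item in old]
    # The new node points at the items it replaced, so repeated compaction
    # builds up summaries of summaries while the originals stay reachable.
    return [SummaryNode(summarize(texts), children=list(old))] + list(recent)


history = [f"turn {i}" for i in range(1, 8)]
compacted = compact(history, keep_recent=3)
```

After this call, `compacted` holds one summary node followed by the last three turns verbatim. Compacting again later folds the earlier summary into a new one, which is where the graph of summaries comes from.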
Layer 2: Native files + semantic search — the persistent record
Plain markdown files your agent reads and writes: daily journals (2026-05-28.md), a curated MEMORY.md, preference files, project notes. Simple, version-controlled, human-readable. No database, no API, no dependencies — this is the memory layer that survives everything.
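As a rough illustration of how little machinery this layer needs, here's a sketch that appends a note to a daily journal file. The `append_journal` helper, the `memory` directory name, and the note text are assumptions made for the example, not an OpenClaw API.

```python
from datetime import date
from pathlib import Path


def append_journal(root: Path, note: str, day: date) -> Path:
    """Append a bullet to the daily journal file, e.g. 2026-05-28.md."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{day.isoformat()}.md"
    if not path.exists():
        # Start each new day with a heading so the file reads well on its own.
        path.write_text(f"# {day.isoformat()}\n\n", encoding="utf-8")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"- {note}\n")
    return path


if __name__ == "__main__":
    # Hypothetical location and content, for illustration only.
    written = append_journal(Path("memory"), "Example note for today", date(2026, 5, 28))
    print(written)
```

Because the result is plain markdown, you can commit it to git, diff it, or edit it by hand like any other file.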
A local embedding model indexes these files and lets your agent search by meaning, not just keywords. "How did we handle the auth migration?" finds the right entry even if it never used the word "auth." QMD runs a 333MB GGUF model locally — sub-second search, no API costs, no data leaving your machine. The files are the source of truth; the embeddings make them instantly searchable.
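The indexing and search loop can be sketched in a few lines. This is not QMD's API: `toy_embed` is a word-count stand-in for the local embedding model, so it only matches on shared words and would not reproduce the "never used the word auth" behaviour. Swapping in a real embedding function is the part that makes search semantic.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (assumption): word counts instead of a real model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(docs: Dict[str, str],
                embed: Callable[[str], Vector] = toy_embed) -> Dict[str, Vector]:
    """Embed every file once; the files themselves stay the source of truth."""
    return {name: embed(text) for name, text in docs.items()}


def search(index: Dict[str, Vector], query: str,
           embed: Callable[[str], Vector] = toy_embed,
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_k` file names ranked by similarity to the query."""
    q = embed(query)
    scored = [(name, cosine(q, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical file contents, for illustration only.
docs = {
    "2026-05-28.md": "migrated login tokens to the new identity provider",
    "MEMORY.md": "user prefers typescript for general development",
}
hits = search(build_index(docs), "typescript preference of the user", top_k=1)
```

Here `hits` contains `MEMORY.md` as the best match. If the index is ever lost, it can be rebuilt from the files.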
Layer 3: The long-term intelligence engine — this is where you choose
The first two layers are table stakes. Every serious agent needs them. The third layer is where the 33 engines I tested come in — and where the real differences emerge.
The 33 engines I tested
Here's every memory framework I put through real-world use — not benchmarks, not demos, actual daily agent work. They naturally group into 6 categories, each solving a different type of remembering:
Vector similarity — the foundation layer
These engines store embeddings and retrieve by semantic similarity. They're the building blocks most other memory systems are built on top of.
| # | Engine | What it does |
|---|---|---|
| 1 | ChromaDB | Embedding-based semantic search, lightweight and developer-friendly |
| 2 | Qdrant | High-performance vector similarity search with filtering |
| 3 | Weaviate | Hybrid vector + keyword search with pluggable modules |
| 4 | Milvus | Distributed vector database built for scale |
| 5 | Pinecone | Serverless managed vector search |
| 6 | pgvector | Vector similarity search as a PostgreSQL extension |
| 7 | FAISS | Meta's similarity search library — raw speed, no frills |
| 8 | Redis Vector | Vector similarity on Redis Stack |
| 9 | Supabase Vector | pgvector on managed Postgres with auth and APIs |
| 10 | Marqo | End-to-end tensor search engine |
| 11 | Deep Lake | Vector store optimized for AI dataset versioning |
| 12 | Vespa | Hybrid search + ML serving at scale |
These are excellent at "find me something similar to X" but they don't understand what they're storing. A vector store treats your preferences, your project architecture, and last Tuesday's standup notes the same way — as floating-point arrays. For RAG and document retrieval, they're essential. For agent memory, they're a necessary layer but not sufficient on their own.
Session & conversation memory — remembering the current thread
These keep track of what's been said within and across conversations.
| # | Engine | What it does |
|---|---|---|
| 13 | Zep | Long-term conversation memory with automatic fact extraction |
| 14 | Motorhead | Redis-backed conversation memory server |
| 15 | OpenAI Memory | ChatGPT's native conversation memory |
| 16 | Claude Memory | Anthropic's native conversation memory |
These solve the "I already told you this" problem within and across sessions. Zep stands out here: it goes beyond simple buffer storage and extracts structured facts from conversations. But session memory alone doesn't give your agent a persistent understanding of your world.
Framework memory modules — memory as a feature
These are memory components built into larger agent/RAG frameworks.
| # | Engine | What it does |
|---|---|---|
| 17 | LlamaIndex Memory | Chat memory + knowledge index integration |
| 18 | LangChain Memory | Buffer, summary, and entity memory modules |
| 19 | LangMem | Memory management primitives for LangChain/LangGraph |
| 20 | Haystack Memory | Document store memory in RAG pipelines |
| 21 | txtai | All-in-one embeddings database with workflows |
| 22 | CrewAI Memory | Short/long/entity memory for multi-agent crews |
Good if you're already inside that ecosystem. They give you memory abstractions (buffers, summaries, entity tracking) but they're tightly coupled to their framework. Memory is a feature of these tools, not their core mission.
Agentic & autonomous memory — the agent manages its own memory
These let the agent itself decide what to remember and what to forget.
| # | Engine | What it does |
|---|---|---|
| 23 | Letta (MemGPT) | Self-editing memory with inner/outer monologue |
| 24 | AutoGPT Memory | File + vector memory for autonomous agents |
| 25 | Memary | Knowledge graph memory for autonomous agents |
| 26 | AGiXT | Adaptive memory with chained agent context |
| 27 | BabyAGI | Task-driven memory with priority queues |
Fascinating research direction. Letta/MemGPT in particular pioneered the idea of the model managing its own memory tiers. The challenge in production: you're trusting the LLM to decide what's worth keeping, and that decision quality varies with the model and context.
Personal AI & bookmarks — memory for humans, not agents
| # | Engine | What it does |
|---|---|---|
| 28 | Khoj | Self-hosted personal AI with file-based memory |
| 29 | SuperMemory | AI-powered memory for saved content and bookmarks |
| 30 | Vanna | RAG-based memory for database queries |
These are designed more as personal knowledge tools than agent memory layers. They work well for their use case, but they're solving a different problem — helping you remember things, not giving your agent persistent understanding.
Structured memory engines — purpose-built for agent intelligence
These are the engines designed specifically to give agents structured, queryable, persistent memory:
| # | Engine | What it does |
|---|---|---|
| 31 | Mem0 | Intelligent fact extraction, deduplication, contradiction resolution |
| 32 | Cognee | Entity-relationship knowledge graphs with 14 retrieval modes |
| 33 | Graphiti | Temporal knowledge graph with validity windows |
This is where it gets interesting — and where I spent most of my 6 months.
The 3 tiers of long-term memory
After testing all 33, the structured memory engines stood out. But here's the insight that took me months to reach: these three aren't meant to run together. They're evolutionary tiers. Each one supersedes the previous, adding capabilities while covering the lower tier's functionality.
Tier 1: Mem0 — facts and preferences
Mem0 (48K+ GitHub stars, $24M Series A) is the intelligent facts layer. Tell your agent "I prefer TypeScript" on Monday and "use Python for data scripts" on Thursday — Mem0 doesn't store two contradictory entries. It updates: TypeScript for general dev, Python for data. Every fact is categorized, timestamped, and confidence-scored.
Where Zep's fact extraction is a feature bolted onto session memory, Mem0's entire architecture is built around making facts reliable. Your agent starts every session already knowing your preferences, your project's quirks, and your conventions. No re-explaining.
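To make the update-instead-of-duplicate behaviour concrete, here's a toy fact store. It is not Mem0's API or its algorithm: `FactStore`, `remember`, and `recall` are hypothetical, and the scoping step (deciding that Thursday's statement applies only to data scripts) is taken as given here, whereas a real engine would infer it from the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class Fact:
    subject: str        # e.g. "language"
    scope: str          # e.g. "general" or "data scripts"
    value: str
    confidence: float
    updated_at: datetime


class FactStore:
    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Fact] = {}

    def remember(self, subject: str, scope: str, value: str,
                 confidence: float = 0.8,
                 now: Optional[datetime] = None) -> Fact:
        """Insert or overwrite: one entry per (subject, scope), never two."""
        fact = Fact(subject, scope, value, confidence,
                    now or datetime.now(timezone.utc))
        self._facts[(subject, scope)] = fact
        return fact

    def recall(self, subject: str, scope: str) -> Optional[Fact]:
        """Prefer the scoped fact, then fall back to the general one."""
        return (self._facts.get((subject, scope))
                or self._facts.get((subject, "general")))

    def __len__(self) -> int:
        return len(self._facts)


store = FactStore()
store.remember("language", "general", "TypeScript")    # Monday
store.remember("language", "data scripts", "Python")   # Thursday
```

With these two entries, `store.recall("language", "data scripts")` returns the Python fact, while any other scope falls back to TypeScript.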
Best for: developers and technical use cases. If your agent mainly needs to remember preferences, conventions, and project details across sessions, Mem0 is the right choice. It's the simplest to set up and the most focused.
Tier 2: Cognee — relationships and reasoning (supersedes Mem0)
Cognee ($7.5M seed, GitHub Secure Open Source graduate, running in 70+ companies) builds a knowledge graph — not isolated facts, but a web of entities, relationships, and semantic connections.
Where Mem0 knows "the client prefers blue branding," Cognee knows that the client's brand guidelines connect to last month's campaign performance, which connects to the audience segments that engaged most, which connects to the content calendar. It ships 14 retrieval modes and a self-improving "memify" feature that strengthens connections the more you use them.
Cognee handles everything Mem0 does (facts are just nodes in the graph) plus it maps the relationships between them. That's why it supersedes Tier 1 — you don't need Mem0 if you're running Cognee.
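The difference between isolated facts and a graph shows up as soon as you ask how two things connect. Below is a small sketch using a plain adjacency list and breadth-first search. It is not Cognee's API: `KnowledgeGraph`, `relate`, `path`, and the relation labels are invented for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]  # (relation, target)


class KnowledgeGraph:
    def __init__(self) -> None:
        self._edges: Dict[str, List[Edge]] = {}

    def relate(self, source: str, relation: str, target: str) -> None:
        """Add a directed edge; both endpoints become nodes."""
        self._edges.setdefault(source, []).append((relation, target))
        self._edges.setdefault(target, [])

    def path(self, start: str, goal: str) -> Optional[List[str]]:
        """Breadth-first search; returns node, relation, node, ... or None."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            trail = queue.popleft()
            node = trail[-1]
            if node == goal:
                return trail
            for relation, target in self._edges.get(node, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(trail + [relation, target])
        return None


graph = KnowledgeGraph()
graph.relate("brand guidelines", "shaped", "campaign performance")
graph.relate("campaign performance", "broken down by", "audience segments")
graph.relate("audience segments", "inform", "content calendar")
chain = graph.path("brand guidelines", "content calendar")
```

`chain` walks from the brand guidelines to the content calendar through the intermediate entities, which is the kind of multi-hop question a flat fact list can't answer.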
Best for: marketing, content, and multi-project work. If your agent needs to reason across brands, campaigns, audiences, and projects — understanding how things connect, not just what things are — Cognee is the right choice.
Tier 3: Graphiti — temporal reasoning (supersedes Cognee)
Graphiti by Zep is the temporal knowledge graph. Its core insight: knowing the current state isn't enough. You need to know when things changed and what was true before.
Every fact carries validity intervals. When new information conflicts with old, Graphiti doesn't overwrite — it creates a temporal record and invalidates the previous one, preserving full history. "When did this config change?" "What was different before the March deploy?" Graphiti answers directly, no digging through logs.
Zep reports that its Graphiti-based memory outperforms MemGPT on the Deep Memory Retrieval benchmark, using a combination of semantic search, keyword matching, and graph traversal.
Graphiti handles facts (like Mem0) and relationships (like Cognee) plus tracks how they change over time. It supersedes both lower tiers — but it's also the heaviest to run (FalkorDB, more compute, more complexity).
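Validity intervals are easier to see in code than in prose. The sketch below is not Graphiti's data model or API: `TemporalStore`, `assert_fact`, `as_of`, and the `db.pool_size` key are hypothetical, and it assumes facts arrive in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class TemporalFact:
    key: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"


class TemporalStore:
    def __init__(self) -> None:
        self._facts: List[TemporalFact] = []

    def assert_fact(self, key: str, value: str, at: datetime) -> None:
        """Close the currently valid fact for `key`, then record the new one."""
        for fact in self._facts:
            if fact.key == key and fact.valid_to is None:
                fact.valid_to = at  # invalidate instead of overwriting
        self._facts.append(TemporalFact(key, value, at))

    def as_of(self, key: str, when: datetime) -> Optional[str]:
        """Return the value that was valid at `when`, if any."""
        for fact in self._facts:
            still_open = fact.valid_to is None or when < fact.valid_to
            if fact.key == key and fact.valid_from <= when and still_open:
                return fact.value
        return None

    def history(self, key: str) -> List[TemporalFact]:
        return [fact for fact in self._facts if fact.key == key]


store = TemporalStore()
store.assert_fact("db.pool_size", "10", datetime(2026, 1, 5))
store.assert_fact("db.pool_size", "50", datetime(2026, 3, 12))
before_march = store.as_of("db.pool_size", datetime(2026, 2, 1))
```

`before_march` is the old value, and `store.history("db.pool_size")` still contains both records, so "what was true before the change" is a lookup instead of a log search.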
Best for: operations, executive, and business use cases. If your agent needs cause-and-effect reasoning across time — "what changed," "when did it break," "what was true before" — Graphiti is the right choice.
Pick one, not all three
| Your use case | Pick this tier | Why |
|---|---|---|
| Developer / DevOps | Mem0 | You need fast, reliable fact recall. Preferences, conventions, project details. |
| Marketing / Content | Cognee | You need relationship reasoning. Brands, campaigns, audiences, how they connect. |
| Operations / Executive | Graphiti | You need temporal reasoning. What changed, when, and what broke. |
A common mistake is assuming that more engines means better memory. It doesn't. Each tier already includes the capabilities of the one below it. Running Mem0 alongside Graphiti is redundant because Graphiti already stores facts. Running all three wastes compute and invites consistency conflicts between the stores.
Pick the tier that matches your work. Pair it with the base stack (conversation compression + native files with semantic search) and your agent will remember everything that matters.
The full architecture
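A complete agent memory stack puts the three layers together: conversation compression and native files with semantic search at the base, and one long-term engine on top.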
Every layer feeds context to the model. The bottom two are always-on. The top one is your choice based on what kind of reasoning your agent needs.
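One way to picture the layers feeding the model is a function that gathers snippets from each and assembles a single context block. This is only a sketch: the `Layer` type, `build_context`, the section titles, and the ordering are assumptions, and the lambdas stand in for real layer implementations.

```python
from typing import Callable, List

# Each layer is modelled as a function from the user's query to context snippets.
Layer = Callable[[str], List[str]]


def build_context(query: str, compressed_history: Layer,
                  file_search: Layer, long_term: Layer) -> str:
    """Collect snippets from all three layers into one prompt section."""
    sections = [
        ("Long-term memory", long_term(query)),        # layer 3: your chosen engine
        ("Files", file_search(query)),                 # layer 2: markdown + search
        ("Conversation", compressed_history(query)),   # layer 1: summaries + recent turns
    ]
    parts = []
    for title, snippets in sections:
        if snippets:  # skip layers that have nothing relevant
            parts.append(f"## {title}\n" + "\n".join(f"- {s}" for s in snippets))
    return "\n\n".join(parts)


# Hypothetical stand-ins for the real layers.
context = build_context(
    "what stack do we use?",
    compressed_history=lambda q: ["summary of earlier turns", "latest user turn"],
    file_search=lambda q: ["MEMORY.md: project notes"],
    long_term=lambda q: ["prefers TypeScript"],
)
```

Swapping the long-term engine then means changing one argument, which is the practical upside of treating layer 3 as a pluggable choice.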
Getting this running
The base stack (layers 1–2) is built into OpenClaw: conversation compression, native memory files, and semantic search work out of the box. The long-term engine (layer 3) requires additional setup: Mem0 needs a vector store, Cognee needs a graph database, and Graphiti needs a graph database backend such as FalkorDB.
OpenClaw is open source and you can self-host the full stack. If you want to skip the infrastructure work, I've been building ClawBase — managed OpenClaw hosting that pre-configures the right memory stack for your use case. But honestly, even if you self-host, the main takeaway here is the architecture: a 3-layer memory stack where you pick the long-term engine that matches your work.
The memory compounds over time: whichever way you run it, the more history your agent accumulates, the more context it can bring to each new session.
One thing I keep coming back to: once your agent has a real memory stack, it opens the door to something bigger — consistent shared memory across multiple agents. Imagine a team of agents that don't just remember their own context, but share a unified understanding of your projects, preferences, and decisions. That's a different kind of architecture entirely, and one I'll dig into in a future article.
