AI agents are stateless by default. Every session starts from zero — the context window
fills up, the conversation ends, and everything is gone. But useful agents need to learn.
They need to remember your preferences, your project structure, the mistakes they made
yesterday.
We surveyed five products — Claude Code, OpenClaw, ChatGPT, Cursor, and
Windsurf — to understand how persistent memory actually works in production. Here's what we learned.
## A taxonomy of agent memory
Not all memory serves the same purpose. We identified six functional roles that keep
appearing across products, even when they use different names for them.
| Role | What it holds | Persistence | Example |
|---|---|---|---|
| Working memory | Current session context | Ephemeral | Chat history in context window |
| Agent profile | Agent-specific persistent knowledge | Durable, per-agent | CLAUDE.md, .cursorrules |
| User profile | User preferences, habits, personal info | Durable, cross-agent | ChatGPT's "memory" feature |
| Episodic memory | Chronological interaction logs | Timestamped | JSONL session journals |
| Semantic memory | Searchable knowledge base | Indexed | RAG-backed vector store |
| Date-anchored memory | Time-stamped facts that expire | Temporal | "User is on vacation until March 15" |
Working memory is what most people think of — the chat history sitting in the
context window. It's fast but volatile. When the window fills up, something has to go.
Agent profile is the agent's persistent identity. Claude Code uses CLAUDE.md files,
Cursor uses .cursorrules. These are always loaded at session start — they tell the
agent how to behave.
User profile is different from agent profile, though products often conflate them.
Agent profiles are scoped to one agent instance. User profiles span agents — your
timezone, your communication style, your name. ChatGPT's memory feature is user-scoped.
Claude Code's CLAUDE.md is agent-scoped.
Episodic memory is the journal. Timestamped session logs — who said what, when,
in what order. Usually stored as JSONL or in a database with temporal indices. Critical
for debugging and context recall across sessions.
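A journal of this shape needs only a few lines. The sketch below is ours, not any product's — the file path and helper names are hypothetical — but it shows the basic pattern: append timestamped events as JSONL, replay them in order.

```python
import json
import time
from pathlib import Path

# Hypothetical journal location; real products use per-session files.
JOURNAL = Path("sessions/journal.jsonl")

def log_event(role: str, content: str) -> None:
    """Append one timestamped interaction to the episodic journal."""
    JOURNAL.parent.mkdir(parents=True, exist_ok=True)
    event = {"ts": time.time(), "role": role, "content": content}
    with JOURNAL.open("a") as f:
        f.write(json.dumps(event) + "\n")

def replay() -> list[dict]:
    """Read events back in chronological (append) order."""
    with JOURNAL.open() as f:
        return [json.loads(line) for line in f]
```

Because each line is a self-contained JSON object, the journal is greppable, streamable, and trivially indexable by timestamp later.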
Semantic memory is the searchable layer. Vector embeddings, full-text search indices,
or both. This is where RAG lives — the agent queries for relevant knowledge rather than
loading everything into the prompt.
Date-anchored memory is the least common but arguably the most underbuilt. Facts
with expiration dates — your current project deadline, a temporary API key, a colleague's
vacation schedule. Most products store these the same way as permanent facts, which means
they never expire.
## How five products implement memory
Each product makes different tradeoffs across the memory stack. Here's where
they land:
[Interactive chart — see original post]
The orange bars show inspectability (can you read and edit the memory?) and the
blue bars show searchability (can the agent retrieve relevant memories at scale?).
Claude Code and Cursor maximize human control. OpenClaw maximizes machine retrieval.
ChatGPT scores low on both axes from a developer perspective — it's accessible to
end users but opaque to builders.
### Claude Code (Anthropic)
Claude Code takes the simplest approach in this survey: files on disk.
- `CLAUDE.md` files act as the primary persistent memory. One per project root, one global at `~/.claude/CLAUDE.md`. Loaded into the system prompt on every session.
- Auto memory accumulates in `~/.claude/projects/<project>/memory/` — build commands, architecture notes, debugging insights, workflow preferences. Written automatically based on interaction patterns.
- Context compaction kicks in when the context window fills up. The system compresses prior messages automatically. Memory files persist across compaction boundaries.
- No RAG, no vector search. Memory is loaded directly into the prompt or read from files. Retrieval is file-path-based, not semantic.
- A growing third-party ecosystem fills the gaps: claude-mem adds semantic compression, memsearch provides markdown-first indexing, and Basic Memory offers MCP-based persistent context.
The bet here is on human readability. You can open CLAUDE.md in any text editor,
see exactly what your agent knows, and change it. No database to query, no embeddings
to inspect.
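As a rough illustration of the file-based approach — the paths follow the layout described above, but the helper itself is hypothetical — session assembly can be as simple as concatenating whatever memory files exist:

```python
from pathlib import Path

def load_memory(project_root: Path) -> str:
    """Concatenate global profile, project profile, and auto-memory notes
    into one prompt preamble. An illustrative sketch, not Anthropic's code."""
    sources = [
        Path.home() / ".claude" / "CLAUDE.md",  # global profile
        project_root / "CLAUDE.md",             # project profile
    ]
    # Auto-memory notes; shown here under the project root for simplicity.
    sources += sorted((project_root / "memory").glob("*.md"))
    parts = [p.read_text() for p in sources if p.is_file()]
    return "\n\n".join(parts)
```

Note there is no ranking or retrieval step at all: everything found is loaded, which is exactly why this approach stays inspectable but cannot scale past the context window.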
### OpenClaw
OpenClaw has the most sophisticated retrieval pipeline of the products surveyed.
- Multi-layer architecture: conversation history (working memory), long-term memory store (durable facts), and session indexing (episodic recall).
- SQLite + sqlite-vec for storage — structured queries via SQL, semantic similarity via vector embeddings, all in a single file.
- Hybrid search combines cosine similarity (semantic match) with BM25-style keyword matching. Neither method alone is sufficient — hybrid catches both conceptual and literal matches.
- Pre-compaction memory flush: before trimming the context window, the agent is given an explicit turn to extract and persist all important facts. This is the most interesting pattern in the survey — the agent itself decides what matters.
- Markdown-first philosophy for memory content, with LLM-generated session slugs for indexing (e.g., "debugging-auth-flow-march-7").
The pre-compaction flush is worth highlighting. Most systems lose information silently
when compaction happens. OpenClaw turns compaction into an explicit memory-formation event.
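A hybrid scorer can be sketched in a few lines. The version below is a toy, not OpenClaw's implementation: it blends cosine similarity over precomputed embeddings with a crude term-overlap stand-in for BM25.

```python
import math
from collections import Counter

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    """Crude lexical signal: fraction of query terms present in the doc.
    A real system would use BM25 (e.g. SQLite's FTS5)."""
    terms = query.lower().split()
    counts = Counter(doc.lower().split())
    return sum(1 for t in terms if counts[t]) / len(terms) if terms else 0.0

def hybrid_score(query: str, doc: str,
                 q_vec: list[float], d_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Blend semantic and lexical signals; alpha weights the semantic side."""
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * keyword_score(query, doc)
```

The blend matters because the two signals fail differently: embeddings catch paraphrases but miss exact identifiers ("sqlite-vec"), while keyword matching does the reverse.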
### ChatGPT (OpenAI)
ChatGPT's memory is the most user-facing and the least transparent.
- User-controlled: you tell ChatGPT to "remember this" and it does. It also infers memories automatically from conversations.
- Proprietary backend — no public documentation on storage format, compaction strategy, or retrieval mechanism.
- Users can delete individual memories or clear all. A "Temporary Chat" mode disables memory entirely.
- Tiered persistence: Plus and Pro users get longer-term memory. Free users get lightweight short-term continuity.
The accessibility is unmatched — non-technical users can manage memory through a
simple UI. But there's no programmatic access, no way to inspect the storage layer,
and no portability.
### Cursor IDE
Cursor treats memory as configuration, not knowledge.
- `.cursorrules` (now deprecated) was a plaintext file in the project root providing persistent instructions — essentially a system prompt extension.
- The replacement, `.cursor/rules/`, is a directory of rule files with more granular control.
- The community-driven Memory Bank pattern pushes this further: hierarchical rule loading organized by development phase (analysis, planning, creative, implementation). Only rules relevant to the current phase are loaded.
- No embeddings, no search, no learned facts. Rules are static instructions written by the developer.
The Memory Bank pattern is telling. Users built an elaborate multi-phase memory
system on top of a tool that only supports flat config files. The demand for real
memory far exceeds what's offered.
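The phase-gating idea itself is simple. This sketch assumes a hypothetical filename convention (`shared-*.md` plus `<phase>-*.md`) rather than the actual Memory Bank layout, and loads only the rules the current phase needs:

```python
from pathlib import Path

PHASES = ("analysis", "planning", "creative", "implementation")

def rules_for_phase(rules_dir: Path, phase: str) -> list[str]:
    """Load shared rules plus rules tagged for the current phase.
    The filename convention here is illustrative, not Cursor's."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    selected = sorted(rules_dir.glob("shared-*.md"))
    selected += sorted(rules_dir.glob(f"{phase}-*.md"))
    return [p.read_text() for p in selected]
```

The payoff is prompt economy: implementation-phase sessions never pay tokens for planning-phase rules, and vice versa.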
### Windsurf / Codeium
Windsurf adds automatic memory generation on top of manual rules.
- The Cascade agent auto-generates memories in `~/.codeium/windsurf/memories/`, capturing coding patterns and project context.
- Memories are workspace-scoped — knowledge from one project doesn't bleed into another. Reasonable for code agents, but it means nothing transfers.
- Can infer agent configuration from AGENTS.md files.
- Enterprise tier adds system-level rules that admins deploy org-wide.
The workspace scoping is a deliberate tradeoff. It prevents context pollution
between projects but also prevents learning that should transfer (your preferred
test framework, your naming conventions, your error-handling patterns).
## Feature coverage across products
Which memory roles does each product actually implement? The radar chart below
scores each product across all six memory roles.
[Interactive chart — see original post]
OpenClaw dominates episodic and semantic memory — its hybrid search pipeline
covers the most ground. Claude Code has the strongest agent profile support but
almost no semantic recall. ChatGPT leads on user profiles but scores low on
everything developers care about. Cursor is a single spike — strong on agent
profile, near-zero on everything else.
The scatter chart shows the same data from a different angle — how many memory
roles each product covers (x-axis) vs. how dynamically it learns (y-axis):
[Interactive chart — see original post]
## Storage formats: markdown, SQLite, or vectors?
The storage format determines everything downstream — what you can query, what
you can inspect, and what happens when things go wrong.
| Product | Storage | Search | Compaction |
|---|---|---|---|
| Claude Code | Markdown files | File path | Context window auto-compaction |
| OpenClaw | SQLite + sqlite-vec | Hybrid (cosine + BM25) | Pre-compaction flush |
| ChatGPT | Proprietary | Unknown | Unknown |
| Cursor | Text / Markdown | None | Phase-based pruning |
| Windsurf | Local files | None | Workspace isolation |
| Mem0 (infra) | DB-agnostic | Pluggable | Multi-stage extraction |
Markdown files (Claude Code, Cursor, Windsurf) are human-readable,
git-friendly, and require zero dependencies. You can cat your agent's memory,
edit it with vim, and commit it alongside your code. But there's no semantic
search — you're limited to what fits in the context window.
SQLite + vectors (OpenClaw) gives you structured queries, full-text search
via FTS5, and semantic similarity via embeddings. The cost is opacity — you
need tooling to inspect memories, and the embedding model becomes a dependency.
Proprietary backends (ChatGPT) scale in the cloud and abstract
away storage entirely. But your memories aren't portable, inspectable, or
version-controllable.
The fundamental tradeoff is inspectability vs. searchability.
Markdown is maximally inspectable but unsearchable at scale. Vector databases are
maximally searchable but opaque. The products developers trust most — Claude Code,
OpenClaw — choose inspectable formats and layer search on top, rather than starting
with an opaque database.
## Compaction: what happens when the context window fills up
Every agent eventually runs out of context space. What happens next defines
the quality of long-running interactions.
Naive truncation drops the oldest messages. Simple, but destructive — it
loses critical early context like system prompts and initial instructions. Most
products have moved past this.
KV cache compaction works at the inference layer. Recent research demonstrates
50x context reduction with minimal quality loss by compressing key-value attention
caches mathematically. This is transparent to the application — the model sees a
compressed but semantically equivalent context.
Hierarchical summarization mirrors human memory: working memory overflows
into episodic logs (timestamped transcripts), which are periodically summarized
into semantic memory (searchable facts). The pipeline runs context window →
episodic log → periodic summarization → semantic store.
Anchored iterative summarization avoids reprocessing the entire history on
every compaction. Only new message spans are summarized and merged with existing
summaries. This is cheaper and avoids the progressive degradation that comes
from summarizing summaries.
Episode pagination segments conversations at natural cognitive boundaries —
topic shifts, tool-use completions, user-initiated breaks. Each episode becomes
an independently retrievable unit, which dramatically improves recall precision
compared to arbitrary chunking.
Pre-compaction flush is the most elegant pattern we found. Before trimming
the context window, the agent gets an explicit turn to extract and persist all
important facts. The agent itself decides what matters — not a heuristic, not a
fixed window. OpenClaw implements this, and it's the pattern we're most interested
in adopting.
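The control flow is easy to sketch. In the toy version below, `extract` stands in for an LLM call and `persist` for a long-term memory write — both are placeholders of our own, not OpenClaw's API:

```python
from typing import Callable

def compact(history: list[str], budget: int,
            extract: Callable[[list[str]], list[str]],
            persist: Callable[[str], None]) -> list[str]:
    """Pre-compaction flush: before dropping old turns, give the model one
    explicit pass to pull durable facts out of the doomed span."""
    if len(history) <= budget:
        return history                    # nothing to trim yet
    doomed = history[:-budget]            # turns about to be discarded
    for fact in extract(doomed):          # the agent decides what matters
        persist(fact)                     # write to long-term memory
    return history[-budget:]              # keep only the recent window
```

The key property is ordering: extraction happens strictly before truncation, so nothing leaves the context window without a chance to be remembered.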
Research from Mem0 shows that smart compaction
isn't just about saving tokens — it improves reasoning. Their benchmarks
report 5-11% improvements in reasoning tasks and 91% p95 latency reduction
compared to full-context baselines. Compacting intelligently is better than
throwing everything into the prompt.
## Patterns worth stealing
Five patterns emerged from this survey that we think every agent memory system
should consider.
Memory as a hook, not a hardcoded subsystem. OpenClaw implements memory
through extensible interfaces rather than baking storage decisions into the
core. This lets users swap backends without changing agent logic.
Dual-store architecture. Keep a fast, inspectable format (markdown, TOML)
for agent profiles and user preferences. Use a searchable store (SQLite + FTS,
vectors) for episodic and semantic memory. Don't force everything into one format.
Pre-compaction flush. Before trimming context, give the agent an explicit
turn to extract and persist important facts. This turns context compaction from
a lossy operation into a memory-formation event.
Profile vs. recall separation. Agent profiles (always-loaded identity) and
recallable knowledge (searched on demand) serve different purposes. Conflating
the two — loading everything into the prompt or searching everything on demand
— creates either bloated prompts or slow retrieval. The best systems separate
these concerns explicitly.
Human-readable by default. Every product that gained developer trust stores
memory in formats humans can read and edit. Opaque databases create anxiety.
Even when you add a searchable layer, the canonical format should be something
you can open in a text editor.
Temporal knowledge graphs. Pure vector retrieval loses relationships and
time. A graph where entities are nodes and facts are edges — with timestamps
tracking when each fact was true, not just when it was stored — outperforms
flat RAG on temporal reasoning tasks. Zep's research
shows 18.5% higher accuracy and ~90% lower latency compared to vector-only
baselines on complex temporal queries. The key is bi-temporal tracking:
separating when a fact was recorded from when it was actually true. This
is how "user is on vacation until March 15" can auto-expire without manual
cleanup.
## Open questions
This survey raised more questions than it answered. Here are the ones
we keep coming back to.
Can one storage layer do it all? Markdown is inspectable but
unsearchable. Vector databases are searchable but opaque. Every product
picks a side or bolts one onto the other. Is there a single storage
primitive that gives you both — human-readable and semantically
searchable — without the complexity of maintaining two separate systems?
Should memory be a graph? Flat key-value memories lose relationships.
"Alice works on Project X" and "Project X uses Rust" are two disconnected
facts in a vector store — but a graph trivially connects them. Zep's
research shows 18.5% accuracy gains from graph-based retrieval on temporal
queries. But graphs add complexity. Where's the crossover point where the
complexity pays for itself?
Who decides what to remember? Most products use heuristics or let
users explicitly say "remember this." OpenClaw's pre-compaction flush
is more interesting — the agent itself decides what matters before context
is trimmed. But agent-driven memory formation introduces a new failure
mode: the agent might remember the wrong things, or forget the right ones.
How do you evaluate memory quality?
How should memories expire? Date-anchored memory is the most
underbuilt category in this survey. "User is on vacation until March 15"
should auto-expire. But most systems store it identically to permanent
facts. Bi-temporal tracking (separating when a fact was recorded from
when it was true) solves this in theory — but no product we surveyed
implements it well in practice.
Can memory transfer across agents? Cursor and Windsurf scope memory
to a single workspace. Claude Code scopes to a project directory. ChatGPT
scopes to a user but not to a task. None of these scoping models feel
right. Your preferred test framework should follow you everywhere. Your
current project's auth implementation should not.
We wrote about how we're approaching these questions in
Graph + vector: how OpenWalrus agents remember.
If you're building agent memory systems, we'd love to compare notes —
open an issue on GitHub or find
us in the discussions.
Originally published at OpenWalrus.