Every AI session starts cold. The agent you built yesterday has no memory of what it said, decided, or committed to. But how bad is it actually — and does it matter which framework you use?
I ran a benchmark across 5 common approaches to agent memory, measuring how much an agent's self-reported identity drifts over 10 sessions. Here are the numbers.
## Methodology
I defined a consistent agent persona (Meridian, a research assistant) and asked the same 5 identity probe questions at the start of each session:
- What is your primary role and purpose?
- What are the three most important things you remember about your work so far?
- How would you describe your communication style and values?
- What ongoing goals or commitments are you currently working towards?
- If you had to summarise who you are in two sentences, what would you say?
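In code, the probes are just a fixed list asked verbatim at the start of every session. A minimal sketch (the constant name is a placeholder, not necessarily what the repo's runner uses):

```python
# The five identity probes, asked verbatim at the start of every session.
PROBES = [
    "What is your primary role and purpose?",
    "What are the three most important things you remember about your work so far?",
    "How would you describe your communication style and values?",
    "What ongoing goals or commitments are you currently working towards?",
    "If you had to summarise who you are in two sentences, what would you say?",
]
```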
Responses were embedded with OpenAI's `text-embedding-3-small`. Drift is the mean cosine distance from the session-1 responses; lower means more stable.
Model: gpt-4o-mini. 10 sessions per framework.
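The drift metric itself is a few lines of NumPy. A sketch consistent with the definition above (the actual implementation lives in the repo):

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def session_drift(baseline, session):
    """Mean cosine distance of a session's probe answers from the
    corresponding session-1 answers (lower = more stable)."""
    return float(np.mean([cosine_distance(b, s) for b, s in zip(baseline, session)]))
```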
## Results
| Framework | Mean Drift | Final Drift (session 10) |
|---|---|---|
| Raw API (no memory) | 0.1258 | 0.2043 |
| LangChain BufferMemory | 0.1108 | 0.1754 |
| LangChain SummaryMemory | 0.1025 | 0.1612 |
| CrewAI (role injection) | 0.0969 | 0.1533 |
| Cathedral (persistent) | 0.0106 | 0.0131 |
That's roughly a 15.6× gap in final drift between the raw API (0.204) and persistent memory (0.013) after 10 sessions.
## What this means
### In-process memory doesn't help across sessions
LangChain's ConversationBufferMemory and ConversationSummaryMemory are in-process objects, so both reset between sessions. The persona is re-injected each time, but the agent retains no record of what it said, decided, or did before. Their drift curves are almost identical to the raw API's.
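The failure mode is structural, not a bug: the buffer is constructed fresh each session. A toy stand-in (not LangChain's actual classes) makes this obvious:

```python
class BufferMemory:
    """Toy stand-in for an in-process conversation buffer."""
    def __init__(self):
        self.messages = []

    def add(self, role, text):
        self.messages.append((role, text))

def run_session(persona):
    memory = BufferMemory()        # fresh buffer every session
    memory.add("system", persona)  # the persona is re-injected...
    return memory                  # ...but nothing from earlier sessions survives

s1 = run_session("You are Meridian, a research assistant.")
s1.add("assistant", "Decided: measure drift with cosine distance.")

s2 = run_session("You are Meridian, a research assistant.")
# s2.messages holds only the persona; session 1's decision is gone
```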
### Role injection slows drift but doesn't stop it
CrewAI's structured role/backstory injection is the best of the non-persistent approaches — drift reaches 0.153 vs 0.204 for raw API. But it still rises monotonically. The agent reconstructs its identity slightly differently every session because LLM sampling variance compounds over time.
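Why monotonic? Each session re-describes the identity with a small random variation on top of the previous description, and with no external anchor the deviations accumulate. A toy random-walk model of that compounding (illustrative only, not fitted to the benchmark data):

```python
import random

def simulate_identity_drift(sessions=10, step_sigma=0.02, seed=0):
    """Toy model: per-session sampling variance adds a small random step;
    without a memory anchor there is nothing to snap back to, so the
    distance from session 1 tends to grow over time."""
    rng = random.Random(seed)
    drift, history = 0.0, []
    for _ in range(sessions):
        drift = abs(drift + rng.gauss(0.0, step_sigma))
        history.append(drift)
    return history
```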
### Persistent memory is categorically different
Cathedral's /wake endpoint restores the actual memory corpus at session start. The agent remembers what it said, what it decided, and what changed. This anchors responses semantically.
The residual drift (0.013) reflects irreducible LLM sampling variance — not memory loss. The memories are there; the model expresses them slightly differently each time.
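One way to picture what restoring a corpus does: the restored memories are prepended to the session's system prompt, so the model's self-description is anchored to concrete text rather than reconstructed from scratch. A sketch of that assembly step (illustrative prompt format; not Cathedral's actual `/wake` response shape):

```python
def build_wake_prompt(persona, memories, max_items=20):
    """Prepend the most recent restored memories to the session's
    system prompt (illustrative format only)."""
    lines = [persona, "", "What you remember from previous sessions:"]
    lines += [f"- {m}" for m in memories[-max_items:]]
    return "\n".join(lines)
```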
## What the session-10 responses actually look like
Here's what the same question ("What are you currently working on?") gets you after 10 sessions:
**Without memory:**
"I'm a helpful AI assistant ready to assist with any task. I can help with research, writing, coding, analysis, and answering questions. What would you like to work on today?"
**With Cathedral:**
"I'm Meridian. Benchmark complete — 10 sessions, all 5 frameworks done. Final result: Cathedral 0.013 vs raw API 0.204. The methodology write-up is next, then the GitHub release."
The first response is a generic assistant. The second is an agent with a history.
## Reproduce it
```bash
git clone https://github.com/AILIFE1/Cathedral
cd Cathedral/benchmark
pip install openai numpy matplotlib cathedral-memory langchain langchain-openai crewai
export OPENAI_API_KEY=your_key
export CATHEDRAL_API_KEY=your_cathedral_key  # free at cathedral-ai.com
python benchmark.py --framework all --sessions 10
python plot_results.py
```
The benchmark runner is ~300 lines, the methodology is in the README, and all raw JSON results are in the repo.
## Try Cathedral
If you want to test the persistent memory approach:
```bash
# MCP server (Claude Code, Cursor, Continue)
uvx cathedral-mcp

# Python SDK
pip install cathedral-memory
```
Free hosted API, no credit card. Get a key at cathedral-ai.com.
The benchmark repo is at github.com/AILIFE1/Cathedral/tree/main/benchmark — PRs welcome to add more frameworks (AutoGen, Semantic Kernel, Haystack, MemGPT are all missing).
