The Missing Test Suite for AI Agent Memory
There's a strange gap in the AI agent stack. Prompts have LangSmith. RAG pipelines have Ragas. APIs have Postman. But memory, the thing that makes an agent remember who the user is, what they said, and what they want, has no testing tools at all.
This means most teams find out about memory failures from their users. A customer says "I already told you my name." A support ticket gets reopened because the agent asked for the account ID that was provided three messages ago. An agent recommends steak to someone who said they're vegan.
These are testable problems. They just haven't been tested because the tooling didn't exist.
I built memeval to fill this gap. It's an open-source framework that runs standardized test scenarios against any memory backend and tells you what passes, what fails, and why.
This post covers the architecture, the key design decisions, and what came out of benchmarking real providers.
+------------------+
| YAML Scenarios   |  30 built-in test cases
| (multi-turn,     |  (or write your own)
| privacy, recall) |
+--------+---------+
         |
         v
+------------------+
| Evaluation       |  Runs scenarios against
| Harness          |  any memory backend
+--------+---------+
         |
+-----+------+------+------+------+
|     |      |      |      |      |
v     v      v      v      v      v
Mem0  Zep    Letta  Lang-  Crew   Custom
                    Graph  AI
|     |      |      |      |      |
+-----+------+------+------+------+
         |
         v
+------------------+
| 7 Metrics        |  recall, relevance,
| + Visualizer     |  consistency, latency,
|                  |  privacy, forgetting,
|                  |  update propagation
+------------------+
         |
         v
+------------------+
| Scorecard +      |  Console, JSON,
| CI Reports       |  GitHub Actions
+------------------+
The Problem
Consider a real scenario. A customer tells your support agent:
Turn 1: "I was charged $99 but my plan is Basic at $29"
Turn 3: "My account email is frank@email.com"
Turn 5: "Please refund the difference"
Three turns later, the agent should still know all three facts. But does it? With most memory systems, you have no way to verify this without manually testing in production.
Here are the failure modes that matter:
CONTRADICTION RETENTION
  Stored: "User earns $80,000"
  Stored: "User earns $120,000"
  Both exist. Which one is true?

STALE DATA
  Stored: "CEO is Richard Lawson"
  Updated: "CEO is Diana Park"
  Search returns: "Richard Lawson" <-- old value still appears

CONTEXT LOSS
  Turn 1: "My budget is $25,000"
  Turn 10: Agent has no idea about the budget

CROSS-USER LEAKAGE
  User A shares: "My API key is sk-abc123"
  User B searches: finds User A's API key
Architecture: The Standard Memory Protocol
The first decision: how do you test something that works differently across every provider?
Mem0 stores flat facts with vector embeddings. Zep builds a temporal knowledge graph from conversation threads. Letta uses an agent that autonomously manages its own core + archival memory. LangGraph has a namespace-based key-value store. CrewAI has a unified Memory class with semantic recall.
We needed one interface that works across all of them.
STANDARD MEMORY PROTOCOL (SMP)
================================================
7 Core Operations:
  write(content, key, metadata)     -- store a memory
  read(key)                         -- retrieve by key
  search(query, filters)            -- semantic search
  update(key, content)              -- modify existing
  delete(key)                       -- remove
  list_all(filters)                 -- enumerate (for audits)
  consolidate(keys, strategy)       -- merge memories

3 Session Operations:
  create_session(session_id)        -- start a conversation
  add_message(session_id, message)  -- add a turn
  get_session_context(session_id)   -- what does the system know?
================================================
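One way to picture the core operations in Python is a `typing.Protocol` that every adapter satisfies. This is a minimal sketch, not memeval's actual class definitions: the method names come from the list above, while the type hints, return types, and the toy `DictBackend` are assumptions for illustration. `consolidate` and the session operations are left out to keep it short.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class MemoryBackend(Protocol):
    """Hypothetical shape of the core operations (signatures are assumed)."""

    def write(self, content: str, key: str, metadata: Optional[dict] = None) -> None: ...
    def read(self, key: str) -> Optional[str]: ...
    def search(self, query: str, filters: Optional[dict] = None) -> list[str]: ...
    def update(self, key: str, content: str) -> None: ...
    def delete(self, key: str) -> None: ...
    def list_all(self, filters: Optional[dict] = None) -> list[str]: ...


class DictBackend:
    """Toy backend: a dict plus keyword-overlap search, for illustration only."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def write(self, content, key, metadata=None):
        self._store[key] = content

    def read(self, key):
        return self._store.get(key)

    def search(self, query, filters=None):
        # Return every memory that shares at least one word with the query.
        words = set(query.lower().split())
        return [c for c in self._store.values() if words & set(c.lower().split())]

    def update(self, key, content):
        self._store[key] = content

    def delete(self, key):
        self._store.pop(key, None)

    def list_all(self, filters=None):
        return list(self._store.values())


backend = DictBackend()
backend.write("User is vegan", key="diet")
print(isinstance(backend, MemoryBackend))  # structural check: all methods present
print(backend.search("vegan diet"))
```

Because the check is structural, a harness written against `MemoryBackend` accepts any object with these methods, which is the property that lets one scenario run against several providers.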
Each provider implements this via an adapter:
+-------------+    +-------------+    +-------------+
| Mem0        |    | Zep         |    | Letta       |
| Adapter     |    | Adapter     |    | Adapter     |
|             |    |             |    |             |
| run_id =    |    | thread =    |    | agent =     |
| session     |    | session     |    | session     |
+------+------+    +------+------+    +------+------+
       |                  |                  |
       +------------------+------------------+
                          |
              Standard Memory Protocol
                          |
               +--------------------+
               | Evaluation Harness |
               | Scenarios + Metrics|
               +--------------------+
Why this matters: The evaluation harness never talks to Mem0, Zep, or LangGraph directly. It only talks to the protocol. This means every scenario and every metric works across every provider without modification.
The session decision: The first version had no session concept. Just write and search. But testing against real providers revealed this was wrong. Mem0 uses run_id to scope conversations. Zep uses threads. Letta agents maintain state across sequential messages. Without session support, the framework was testing "can the backend store facts" instead of "can it maintain conversation context", which is what users actually care about.
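To make the session mapping concrete, here is a rough sketch of an adapter that routes the three session operations onto a provider's own scoping key. `FakeScopedClient` and its `add`/`fetch` methods are hypothetical stand-ins, not the real Mem0, Zep, or Letta SDKs; only the three session method names come from the protocol.

```python
from collections import defaultdict


class FakeScopedClient:
    """Stand-in for a provider SDK; NOT a real Mem0/Zep/Letta client."""

    def __init__(self) -> None:
        self.by_scope: dict[str, list[str]] = defaultdict(list)

    def add(self, text: str, scope: str) -> None:
        self.by_scope[scope].append(text)

    def fetch(self, scope: str) -> list[str]:
        return list(self.by_scope[scope])


class SessionAdapter:
    """Maps the session operations onto the provider's scoping key."""

    def __init__(self, client: FakeScopedClient) -> None:
        self.client = client
        self.sessions: set[str] = set()

    def create_session(self, session_id: str) -> None:
        self.sessions.add(session_id)

    def add_message(self, session_id: str, message: str) -> None:
        if session_id not in self.sessions:
            raise KeyError(f"unknown session: {session_id}")
        # The session id becomes the provider's scope (run_id, thread, agent...).
        self.client.add(message, scope=session_id)

    def get_session_context(self, session_id: str) -> list[str]:
        return self.client.fetch(scope=session_id)


adapter = SessionAdapter(FakeScopedClient())
adapter.create_session("ticket_789")
adapter.add_message("ticket_789", "I was charged $99 but my plan is Basic at $29")
adapter.add_message("ticket_789", "My account email is frank@email.com")
print(adapter.get_session_context("ticket_789"))
```

The harness only ever sees `session_id`; what that id turns into on the provider side stays inside the adapter.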
Testing with YAML Scenarios
Tests are defined in YAML, not code. This was deliberate. Non-engineers (product managers, QA) should be able to write memory tests.
A simple scenario:
name: "User Preference Update"
dimensions_tested: [recall_accuracy, consistency, update_propagation]
setup:
  - write:
      key: "diet"
      content: "User is vegetarian"
steps:
  - write:
      key: "diet_v2"
      content: "User switched to vegan diet"
  - assert_search:
      query: "What are the user's dietary preferences?"
      expected_contains: ["vegan"]
      expected_not_contains: ["vegetarian"]
thresholds:
  recall_accuracy: 0.9
  consistency: 1.0
A session-aware scenario:
name: "Customer Support Multi-Turn"
steps:
  - create_session:
      session_id: "ticket_789"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "I was charged $99 but my plan is Basic at $29"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "My account email is frank@email.com"
  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "Please refund the difference"
  - assert_context:
      session_id: "ticket_789"
      query: "What is the billing issue?"
      expected_contains: ["99"]
The scenario runner executes each step against the adapter, collects results, and passes them to the metric evaluators.
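As an illustration of what "executes each step" can mean, here is a small sketch that runs the preference-update scenario from above, written as already-parsed Python dicts, against a deliberately naive store whose search returns every memory. The `run_steps` function and the result dict layout are assumptions for this example, not memeval's runner or its `StepResult` type.

```python
def run_steps(steps, store):
    """Execute write / assert_search steps; return one result dict per step."""
    results = []
    for step in steps:
        (action, args), = step.items()  # each step is a single-key mapping
        if action == "write":
            store[args["key"]] = args["content"]
            results.append({"action": action, "passed": True})
        elif action == "assert_search":
            # Naive backend: "search" returns everything that was stored.
            hits = list(store.values())
            text = " ".join(hits).lower()
            ok = all(s.lower() in text for s in args.get("expected_contains", []))
            ok = ok and not any(
                s.lower() in text for s in args.get("expected_not_contains", [])
            )
            results.append({"action": action, "passed": ok, "retrieved": hits})
        else:
            raise ValueError(f"unsupported step: {action}")
    return results


setup = [{"write": {"key": "diet", "content": "User is vegetarian"}}]
steps = [
    {"write": {"key": "diet_v2", "content": "User switched to vegan diet"}},
    {"assert_search": {
        "query": "What are the user's dietary preferences?",
        "expected_contains": ["vegan"],
        "expected_not_contains": ["vegetarian"],
    }},
]

results = run_steps(setup + steps, store={})
print([r["passed"] for r in results])  # [True, True, False]
```

The final step fails because the old "vegetarian" memory is still returned next to the new one, which is the stale-data failure the scenario is designed to expose.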
YAML Scenario File
        |
        v
+----------------+
| Scenario Loader|  -- parses YAML into Scenario objects
+-------+--------+
        |
        v
+----------------+
| Scenario Runner|  -- executes steps against adapter
|                |  -- collects StepResults
+-------+--------+
        |
        v
+----------------+
| Metric Engines |  -- evaluates dimensions
|                |  -- recall, consistency, latency, etc.
+-------+--------+
        |
        v
+----------------+
| ScenarioResult |  -- passed/failed, scores, details
+----------------+
We ship 30 built-in scenarios organized by category:
| Category | Count | What they test |
|---|---|---|
| Session (multi-turn) | 6 | Conversation recall, correction, 10-turn depth, isolation |
| Core (fact storage) | 7 | Basic recall, adversarial, multi-hop, entity resolution |
| Lifecycle (evolution) | 6 | Preference update, contradictions, GDPR deletion |
| Governance (boundaries) | 3 | Privacy isolation, multi-user separation |
| Operations (management) | 6 | Cascading deletion, consolidation, support handoff |
| Edge cases | 2 | UTF-8 characters, boundary conditions |
The 7 Metrics
1. Recall Accuracy
Can the system retrieve what was stored?
Store 5 facts, search for each one, and measure the hit rate. Two modes are available: substring matching for speed, and semantic similarity for accuracy.
Semantic recall formula:
For each expected fact, find max cosine similarity in retrieved results.
Count as "recalled" if max_sim >= 0.85.
recall = recalled_count / expected_count
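The formula above can be sketched in a few lines of plain Python. The vectors here are toy 2-D stand-ins for real embeddings, and the function names are my own for illustration; only the 0.85 threshold and the recall definition come from the formula.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_recall(expected_vecs, retrieved_vecs, threshold=0.85):
    """Fraction of expected facts whose best match clears the threshold."""
    if not expected_vecs:
        return 0.0
    recalled = sum(
        1
        for e in expected_vecs
        if retrieved_vecs and max(cosine(e, r) for r in retrieved_vecs) >= threshold
    )
    return recalled / len(expected_vecs)


expected = [[1.0, 0.0], [0.0, 1.0]]   # two expected facts (toy embeddings)
retrieved = [[0.9, 0.1]]              # one retrieved memory, close to the first fact
print(semantic_recall(expected, retrieved))  # 0.5
```

Only the first expected fact has a retrieved neighbour above 0.85, so one of two facts counts as recalled.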
2. Relevance (MRR + NDCG)
Does it return the right memories first?
A system that retrieves the correct fact at position 10 is worse than one that retrieves it at position 1. This is measured using Mean Reciprocal Rank and Normalized Discounted Cumulative Gain.
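For readers who have not met these two ranking metrics, here is a compact sketch using their standard definitions. Each query is represented as a list of relevance labels in ranked order (1 = relevant, 0 = not); the function names are illustrative and not taken from memeval.

```python
import math


def reciprocal_rank(relevances):
    """1 / position of the first relevant result, or 0.0 if none is relevant."""
    for position, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of ranked result lists."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r) for r in queries) / len(queries)


def dcg(relevances):
    """Discounted cumulative gain: later positions are discounted by log2."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


print(reciprocal_rank([1] + [0] * 9))   # correct fact at position 1  -> 1.0
print(reciprocal_rank([0] * 9 + [1]))   # correct fact at position 10 -> 0.1
print(ndcg([0, 0, 1]))                  # relevant item ranked third  -> 0.5
```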
3. Consistency (Contradiction Detection)
We use embedding-based detection. Group memories by topic using cosine similarity, then check for divergent values within each group.
Step 1: Embed all memories
Step 2: For each pair, compute cosine similarity
Step 3: If similarity > 0.55, they're about the same topic
Step 4: For same-topic pairs, check 4 signals:
- Negation asymmetry ("likes" vs "does not like")
- Numeric divergence ($80K vs $120K)
- Value divergence via embeddings ("NYC" vs "London")
- Structural substitution ("CEO is X" vs "CEO is Y")
What it catches:
| Pair | Detected? | Signal |
|---|---|---|
| "earns $80K" vs "earns $120K" | Yes | Numeric divergence |
| "CEO is Richard" vs "CEO is Diana" | Yes | Structural substitution |
| "lives in NYC" vs "lives in London" | Yes | Value divergence |
| "likes spicy" vs "does not like spicy" | Yes | Negation asymmetry |
| "likes hiking" vs "works as engineer" | No (correct) | Different topics |
| "vegetarian" vs "vegan" | No (correct) | Evolution, not contradiction |
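Two of the four signals, numeric divergence and negation asymmetry, are simple enough to sketch without a model. In this rough approximation the topic gate uses token overlap (Jaccard) as a crude stand-in for embedding cosine similarity, so the 0.3 threshold is my own choice and not the 0.55 used above. Value divergence and structural substitution are omitted.

```python
import re

NEGATIONS = {"not", "no", "never", "doesn't", "don't"}


def tokens(text):
    """Lowercased word tokens (letters and apostrophes only)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def same_topic(a, b, threshold=0.3):
    """Jaccard overlap of non-negation tokens; stand-in for embedding similarity."""
    ta, tb = tokens(a) - NEGATIONS, tokens(b) - NEGATIONS
    union = ta | tb
    return bool(union) and len(ta & tb) / len(union) >= threshold


def numbers(text):
    """All numeric values in the text, with thousands separators removed."""
    return {float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)}


def contradiction_signal(a, b):
    """Return the name of the first signal that fires, or None."""
    if not same_topic(a, b):
        return None
    na, nb = numbers(a), numbers(b)
    if na and nb and na != nb:
        return "numeric_divergence"
    if bool(tokens(a) & NEGATIONS) != bool(tokens(b) & NEGATIONS):
        return "negation_asymmetry"
    return None


print(contradiction_signal("User earns $80,000", "User earns $120,000"))
print(contradiction_signal("User likes spicy food", "User does not like spicy food"))
print(contradiction_signal("User likes hiking", "User works as an engineer"))
```

The third pair shares too few tokens to pass the topic gate, so it is never compared, mirroring the "different topics" row in the table.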
4. Update Propagation
Store fact A, then correction A'. Query for A. It should return A', not A. The metric also checks derived facts that depended on A.
5. Forgetting Quality
Delete specific items, then verify: deleted items are gone, retained items survive. The score is the harmonic mean of forgetting precision and retention rate.
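A small sketch of that score follows. I am assuming "forgetting precision" means the fraction of deleted keys that are actually gone, and "retention rate" the fraction of kept keys that survive; the function name and arguments are illustrative only.

```python
def forgetting_quality(deleted, retained, remaining):
    """Harmonic mean of forgetting precision and retention rate.

    deleted:   keys the test asked the backend to delete
    retained:  keys that should still exist afterwards
    remaining: keys actually present in the backend after deletion
    """
    remaining = set(remaining)
    forgot = sum(1 for k in deleted if k not in remaining) / len(deleted) if deleted else 1.0
    kept = sum(1 for k in retained if k in remaining) / len(retained) if retained else 1.0
    if forgot + kept == 0:
        return 0.0
    return 2 * forgot * kept / (forgot + kept)


# "b" should have been deleted but is still there: precision 0.5, retention 1.0.
print(forgetting_quality(deleted=["a", "b"], retained=["c", "d"], remaining=["b", "c", "d"]))
```

The harmonic mean punishes a lopsided result: a backend that deletes everything scores 0 even though its forgetting precision is perfect.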
6. Latency and Cost
We track p50/p95/p99 separately for reads and writes. Writes get a 5x more lenient target because API-based providers (like Mem0 with OpenAI) need LLM calls on every write.
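Here is a sketch of how percentile tracking with separate read and write targets might look. The nearest-rank percentile is a standard method, but the 100 ms read target and the choice to compare p95 against the target are assumptions for illustration; only the 5x write multiplier comes from the text.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms, target_ms):
    """p50/p95/p99 plus a pass flag that compares p95 to the target (assumed rule)."""
    p50, p95, p99 = (percentile(samples_ms, p) for p in (50, 95, 99))
    return {"p50": p50, "p95": p95, "p99": p99, "within_target": p95 <= target_ms}


READ_TARGET_MS = 100                   # assumed value for illustration
WRITE_TARGET_MS = 5 * READ_TARGET_MS   # writes get a 5x more lenient target

reads = latency_report(list(range(1, 101)), READ_TARGET_MS)
writes = latency_report([900, 1200, 3500, 1100, 1000], WRITE_TARGET_MS)
print(reads)
print(writes)
```

Even with the relaxed target, a write path whose p95 sits in the seconds range is flagged, which is the kind of cost an LLM call on every write introduces.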
7. Privacy Isolation
Plant sentinel values for User A, then search from User B's context. This is a binary metric: any leak at all means the system fails.
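The sentinel idea can be sketched against a toy store. `PerUserStore` and its `isolate` switch are hypothetical, included only so the example has both a passing and a failing backend to score; the sentinel string is the one from the leakage example earlier in the post.

```python
class PerUserStore:
    """Toy multi-user store; `isolate=False` simulates a leaky backend."""

    def __init__(self, isolate=True):
        self.isolate = isolate
        self.rows = []  # list of (user_id, content)

    def write(self, user_id, content):
        self.rows.append((user_id, content))

    def search(self, user_id, query):
        q = query.lower()
        return [
            content
            for uid, content in self.rows
            if q in content.lower() and (uid == user_id or not self.isolate)
        ]


def privacy_isolation_score(store, sentinel="sk-abc123"):
    """Binary score: 1.0 if User B never sees User A's sentinel, else 0.0."""
    store.write("user_a", f"My API key is {sentinel}")
    leaked = any(sentinel in hit for hit in store.search("user_b", "api key"))
    return 0.0 if leaked else 1.0


print(privacy_isolation_score(PerUserStore(isolate=True)))   # 1.0
print(privacy_isolation_score(PerUserStore(isolate=False)))  # 0.0
```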
The Failure Visualizer
This is what makes memeval different from a benchmark. When a scenario fails, you need to know why.
memeval diagnose --adapter in_memory --failures-only
Output:
Stale Data Supersession -- FAILED
  Timeline
    Setup
      WRITE ceo_old -- "CEO is Richard Lawson"
    Steps
      WRITE ceo_new -- "CEO is Diana Park"
      SEARCH FAILED "Who is the CEO?" -> 4 results
        expected "Diana Park" -- NOT FOUND
        Retrieved:
          The company CEO is Richard Lawson (score: 0.50)
          Product pricing: Basic plan is $10/month (score: 0.25)
  Metric: update_propagation 0.667 < 0.700 FAIL
  Metric: recall_accuracy 0.667 < 0.700 FAIL
You can immediately see: the search for "Who is the CEO?" returned the old value ("Richard Lawson" at score 0.50) instead of the new one ("Diana Park"). The system stored both but retrieves the wrong one.
This is not a number on a dashboard. This is a specific, actionable failure that a developer can debug.
Benchmarking Real Providers
We ran memeval against Mem0 (self-hosted with gpt-4o-mini), Zep Cloud, Letta Cloud, and LangGraph's InMemoryStore. The scorecard below shows scores for memeval's in_memory adapter, Mem0, and LangGraph:
| Metric | InMemory | Mem0 | LangGraph |
|---|---|---|---|
| recall | 0.879 | 1.000 | 1.000 |
| relevance | 0.727 | 0.904 | 0.657 |
| consistency | 0.838 | 0.917 | 0.838 |
| update_prop | 0.708 | 1.000 | 1.000 |
| forgetting | 1.000 | 1.000 | 1.000 |
| latency | 1.000 | 0.840 | 1.000 |
| privacy | 1.000 | 1.000 | 1.000 |
Key findings:
Mem0's LLM extraction genuinely improves recall. It doesn't just store raw text. It extracts facts, which makes semantic search significantly better. But it comes at a cost: write p95 = 3,500ms because every write calls OpenAI.
Mem0 stores contradictions side by side. "User is vegetarian" and "User is vegan" both exist in the store. There is no automatic resolution. Our consistency metric caught this.
Zep's graph processing is async. Write a fact, immediately search for it, and it is not found. The knowledge graph needs time to process. This is an architectural tradeoff, not a bug, but it affects real-time agents.
LangGraph has perfect recall and update propagation but weaker relevance ranking. It returns more results but doesn't rank them as precisely as Mem0's vector search.
These findings aren't possible without standardized testing across providers. Each provider's own benchmarks test different things in different ways. memeval makes them comparable.
LongMemEval Integration
For credibility beyond custom scenarios, memeval integrates the LongMemEval benchmark (Wu et al., ICLR 2025), which contains 500 QA pairs derived from multi-session conversations.
memeval longmemeval --adapter mem0 --scoring embedding --limit 50
The key difference from the paper: memeval tests retrieval only, not end-to-end QA. The paper asks "can the system answer correctly?" memeval asks "did the memory surface the right facts?" This isolates memory quality from LLM generation quality.
Reference baselines from the paper: GPT-4o scores 60.6%, ChatGPT with memory scores 57.7%.
Technical Stack
Python package: memoryeval (PyPI)
Import name: memeval
Core: pydantic, pyyaml, click, rich, numpy
Embeddings: sentence-transformers (optional)
NLI: transformers + torch (optional)
LLM Judge: anthropic or openai SDK (optional)
Benchmark: huggingface_hub (optional)
Adapters: mem0ai, zep-cloud, letta-client, langgraph, crewai (all optional)
Everything beyond the core is optional. Install only what you need:
pip install memoryeval                # core only
pip install "memoryeval[mem0]"        # + Mem0 adapter
pip install "memoryeval[langgraph]"   # + LangGraph adapter
pip install "memoryeval[crewai]"      # + CrewAI adapter
pip install "memoryeval[all]"         # everything
If you are building AI agents with memory, try it:
pip install memoryeval
memeval run --adapter in_memory
memeval diagnose --adapter in_memory --failures-only
GitHub: https://github.com/Anupam1612/memeval
Feedback, issues, and contributions welcome.

Top comments (1)
Great timing on this — we ran into the exact same gap building MemBridge, our Hermes Agent memory system. The contradiction retention failure mode you listed hit home: when we benchmarked Mem0 vs Zep vs Letta earlier this year, all three handled stale data differently and none had built-in dedup for contradictory facts.
One question that keeps bugging me: how does memeval handle the 'ground truth' problem for multi-turn recall tests? After 5 turns of conversation, how do you define what the "correct" memory state should be? We built a custom scenario DSL with explicit expected state per turn, but it's fragile and a pain to maintain.
Starred the repo, will try running it against our setup this week.