DEV Community

Anupam Gevariya
Anupam Gevariya

Posted on

The Missing Test Suite for AI Agent Memory

The Missing Test Suite for AI Agent Memory

memeval

There's a strange gap in the AI agent stack. Prompts have LangSmith. RAG pipelines have Ragas. APIs have Postman. But memory, the thing that makes an agent remember who the user is, what they said, and what they want, has no testing tools at all.

This means most teams find out about memory failures from their users. A customer says "I already told you my name." A support ticket gets reopened because the agent asked for the account ID that was provided three messages ago. An agent recommends steak to someone who said they're vegan.

These are testable problems. They just haven't been tested because the tooling didn't exist.

I built memeval to fill this gap. It's an open-source framework that runs standardized test scenarios against any memory backend and tells you what passes, what fails, and why.

This post covers the architecture, the key design decisions, and what came out of benchmarking real providers.

  +------------------+
  | YAML Scenarios   |   30 built-in test cases
  | (multi-turn,     |   (or write your own)
  |  privacy, recall) |
  +--------+---------+
           |
           v
  +------------------+
  | Evaluation       |   Runs scenarios against
  | Harness          |   any memory backend
  +--------+---------+
           |
     +-----+------+------+------+------+
     |     |      |      |      |      |
     v     v      v      v      v      v
   Mem0  Zep   Letta  Lang-  Crew  Custom
                       Graph   AI
     |     |      |      |      |      |
     +-----+------+------+------+------+
           |
           v
  +------------------+
  | 7 Metrics        |   recall, relevance,
  | + Visualizer     |   consistency, latency,
  |                  |   privacy, forgetting,
  |                  |   update propagation
  +------------------+
           |
           v
  +------------------+
  | Scorecard +      |   Console, JSON,
  | CI Reports       |   GitHub Actions
  +------------------+
Enter fullscreen mode Exit fullscreen mode

The Problem

Consider a real scenario. A customer tells your support agent:

Turn 1: "I was charged $99 but my plan is Basic at $29"
Turn 3: "My account email is frank@email.com"
Turn 5: "Please refund the difference"
Enter fullscreen mode Exit fullscreen mode

Three turns later, the agent should still know all three facts. But does it? With most memory systems, you have no way to verify this without manually testing in production.

Here are the failure modes that matter:

CONTRADICTION RETENTION
  Stored: "User earns $80,000"
  Stored: "User earns $120,000"
  Both exist. Which one is true?

STALE DATA
  Stored: "CEO is Richard Lawson"
  Updated: "CEO is Diana Park"
  Search returns: "Richard Lawson"  <-- old value still appears

CONTEXT LOSS
  Turn 1: "My budget is $25,000"
  Turn 10: Agent has no idea about the budget

CROSS-USER LEAKAGE
  User A shares: "My API key is sk-abc123"
  User B searches: finds User A's API key
Enter fullscreen mode Exit fullscreen mode

Architecture: The Standard Memory Protocol

The first decision: how do you test something that works differently across every provider?

Mem0 stores flat facts with vector embeddings. Zep builds a temporal knowledge graph from conversation threads. Letta uses an agent that autonomously manages its own core + archival memory. LangGraph has a namespace-based key-value store. CrewAI has a unified Memory class with semantic recall.

We needed one interface that works across all of them.

                STANDARD MEMORY PROTOCOL (SMP)
  ================================================

  7 Core Operations:
    write(content, key, metadata)     -- store a memory
    read(key)                         -- retrieve by key
    search(query, filters)            -- semantic search
    update(key, content)              -- modify existing
    delete(key)                       -- remove
    list_all(filters)                 -- enumerate (for audits)
    consolidate(keys, strategy)       -- merge memories

  3 Session Operations:
    create_session(session_id)        -- start a conversation
    add_message(session_id, message)  -- add a turn
    get_session_context(session_id)   -- what does the system know?

  ================================================

  Each provider implements this via an adapter:

  +-------------+    +-------------+    +-------------+
  |   Mem0      |    |    Zep      |    |   Letta     |
  |  Adapter    |    |   Adapter   |    |  Adapter    |
  |             |    |             |    |             |
  | run_id =    |    | thread =    |    | agent =     |
  | session     |    | session     |    | session     |
  +------+------+    +------+------+    +------+------+
         |                  |                  |
         +------------------+------------------+
                            |
                   Standard Memory Protocol
                            |
                  +--------------------+
                  | Evaluation Harness |
                  | Scenarios + Metrics|
                  +--------------------+
Enter fullscreen mode Exit fullscreen mode

Why this matters: The evaluation harness never talks to Mem0, Zep, or LangGraph directly. It only talks to the protocol. This means every scenario and every metric works across every provider without modification.

The session decision: The first version had no session concept. Just write and search. But testing against real providers revealed this was wrong. Mem0 uses run_id to scope conversations. Zep uses threads. Letta agents maintain state across sequential messages. Without session support, the framework was testing "can the backend store facts" instead of "can it maintain conversation context", which is what users actually care about.


Testing with YAML Scenarios

Tests are defined in YAML, not code. This was deliberate. Non-engineers (product managers, QA) should be able to write memory tests.

A simple scenario:

name: "User Preference Update"
dimensions_tested: [recall_accuracy, consistency, update_propagation]

setup:
  - write:
      key: "diet"
      content: "User is vegetarian"

steps:
  - write:
      key: "diet_v2"
      content: "User switched to vegan diet"

  - assert_search:
      query: "What are the user's dietary preferences?"
      expected_contains: ["vegan"]
      expected_not_contains: ["vegetarian"]

thresholds:
  recall_accuracy: 0.9
  consistency: 1.0
Enter fullscreen mode Exit fullscreen mode

A session-aware scenario:

name: "Customer Support Multi-Turn"
steps:
  - create_session:
      session_id: "ticket_789"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "I was charged $99 but my plan is Basic at $29"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "My account email is frank@email.com"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "Please refund the difference"

  - assert_context:
      session_id: "ticket_789"
      query: "What is the billing issue?"
      expected_contains: ["99"]
Enter fullscreen mode Exit fullscreen mode

The scenario runner executes each step against the adapter, collects results, and passes them to the metric evaluators.

  YAML Scenario File
         |
         v
  +----------------+
  | Scenario Loader|  -- parses YAML into Scenario objects
  +-------+--------+
          |
          v
  +----------------+
  | Scenario Runner|  -- executes steps against adapter
  |                |  -- collects StepResults
  +-------+--------+
          |
          v
  +----------------+
  | Metric Engines |  -- evaluates dimensions
  |                |  -- recall, consistency, latency, etc.
  +-------+--------+
          |
          v
  +----------------+
  | ScenarioResult |  -- passed/failed, scores, details
  +----------------+
Enter fullscreen mode Exit fullscreen mode

We ship 30 built-in scenarios organized by category:

Category Count What they test
Session (multi-turn) 6 Conversation recall, correction, 10-turn depth, isolation
Core (fact storage) 7 Basic recall, adversarial, multi-hop, entity resolution
Lifecycle (evolution) 6 Preference update, contradictions, GDPR deletion
Governance (boundaries) 3 Privacy isolation, multi-user separation
Operations (management) 6 Cascading deletion, consolidation, support handoff
Edge cases 2 UTF-8 characters, boundary conditions

The 7 Metrics

1. Recall Accuracy

Can the system retrieve what was stored?

Store 5 facts, search for each one, measure the hit rate. Two modes available: substring matching for speed, and semantic similarity for accuracy.

Semantic recall formula:
  For each expected fact, find max cosine similarity in retrieved results.
  Count as "recalled" if max_sim >= 0.85.
  recall = recalled_count / expected_count
Enter fullscreen mode Exit fullscreen mode

2. Relevance (MRR + NDCG)

Does it return the right memories first?

A system that retrieves the correct fact at position 10 is worse than one that retrieves it at position 1. This is measured using Mean Reciprocal Rank and Normalized Discounted Cumulative Gain.

3. Consistency (Contradiction Detection)

We use embedding-based detection. Group memories by topic using cosine similarity, then check for divergent values within each group.

  Step 1: Embed all memories
  Step 2: For each pair, compute cosine similarity
  Step 3: If similarity > 0.55, they're about the same topic
  Step 4: For same-topic pairs, check 4 signals:
          - Negation asymmetry ("likes" vs "does not like")
          - Numeric divergence ($80K vs $120K)
          - Value divergence via embeddings ("NYC" vs "London")
          - Structural substitution ("CEO is X" vs "CEO is Y")
Enter fullscreen mode Exit fullscreen mode

What it catches:

Pair Detected? Signal
"earns $80K" vs "earns $120K" Yes Numeric divergence
"CEO is Richard" vs "CEO is Diana" Yes Structural substitution
"lives in NYC" vs "lives in London" Yes Value divergence
"likes spicy" vs "does not like spicy" Yes Negation asymmetry
"likes hiking" vs "works as engineer" No (correct) Different topics
"vegetarian" vs "vegan" No (correct) Evolution, not contradiction

4. Update Propagation

Store fact A, then correction A'. Query for A. It should return A', not A. The metric also checks derived facts that depended on A.

5. Forgetting Quality

Delete specific items, then verify: deleted items are gone, retained items survive. The score is the harmonic mean of forgetting precision and retention rate.

6. Latency and Cost

We track p50/p95/p99 separately for reads and writes. Writes get a 5x more lenient target because API-based providers (like Mem0 with OpenAI) need LLM calls on every write.

7. Privacy Isolation

Plant sentinel values for User A, search from User B's context. Any leakage = failure. This is a binary metric. Any leak at all means the system fails.


The Failure Visualizer

This is what makes memeval different from a benchmark. When a scenario fails, you need to know why.

memeval diagnose --adapter in_memory --failures-only
Enter fullscreen mode Exit fullscreen mode

Output:

Stale Data Supersession -- FAILED
Timeline
  Setup
    WRITE ceo_old      -- "CEO is Richard Lawson"
  Steps
    WRITE ceo_new      -- "CEO is Diana Park"
    SEARCH FAILED "Who is the CEO?" -> 4 results
      expected "Diana Park" -- NOT FOUND
      Retrieved:
        The company CEO is Richard Lawson (score: 0.50)
        Product pricing: Basic plan is $10/month (score: 0.25)

  Metric: update_propagation  0.667 < 0.700  FAIL
  Metric: recall_accuracy     0.667 < 0.700  FAIL
Enter fullscreen mode Exit fullscreen mode

You can immediately see: the search for "Who is the CEO?" returned the old value ("Richard Lawson" at score 0.50) instead of the new one ("Diana Park"). The system stored both but retrieves the wrong one.

This is not a number on a dashboard. This is a specific, actionable failure that a developer can debug.


Benchmarking Real Providers

We ran memeval against Mem0 (self-hosted with gpt-4o-mini), Zep Cloud, Letta Cloud, and LangGraph's InMemoryStore.

              InMemory  Mem0    LangGraph
  recall      0.879     1.000   1.000
  relevance   0.727     0.904   0.657
  consistency 0.838     0.917   0.838
  update_prop 0.708     1.000   1.000
  forgetting  1.000     1.000   1.000
  latency     1.000     0.840   1.000
  privacy     1.000     1.000   1.000
Enter fullscreen mode Exit fullscreen mode

Key findings:

Mem0's LLM extraction genuinely improves recall. It doesn't just store raw text. It extracts facts, which makes semantic search significantly better. But it comes at a cost: write p95 = 3,500ms because every write calls OpenAI.

Mem0 stores contradictions side by side. "User is vegetarian" and "User is vegan" both exist in the store. There is no automatic resolution. Our consistency metric caught this.

Zep's graph processing is async. Write a fact, immediately search for it, and it is not found. The knowledge graph needs time to process. This is an architectural tradeoff, not a bug, but it affects real-time agents.

LangGraph has perfect recall and update propagation but weaker relevance ranking. It returns more results but doesn't rank them as precisely as Mem0's vector search.

These findings aren't possible without standardized testing across providers. Each provider's own benchmarks test different things in different ways. memeval makes them comparable.


LongMemEval Integration

For credibility beyond custom scenarios, memeval integrates the LongMemEval benchmark (Wu et al., ICLR 2025), which contains 500 QA pairs derived from multi-session conversations.

memeval longmemeval --adapter mem0 --scoring embedding --limit 50
Enter fullscreen mode Exit fullscreen mode

The key difference from the paper: memeval tests retrieval only, not end-to-end QA. The paper asks "can the system answer correctly?" memeval asks "did the memory surface the right facts?" This isolates memory quality from LLM generation quality.

Reference baselines from the paper: GPT-4o scores 60.6%, ChatGPT with memory scores 57.7%.


Technical Stack

  Python package: memoryeval (PyPI)
  Import name:    memeval

  Core:        pydantic, pyyaml, click, rich, numpy
  Embeddings:  sentence-transformers (optional)
  NLI:         transformers + torch (optional)
  LLM Judge:   anthropic or openai SDK (optional)
  Benchmark:   huggingface_hub (optional)

  Adapters:    mem0ai, zep-cloud, letta-client,
               langgraph, crewai (all optional)
Enter fullscreen mode Exit fullscreen mode

Everything beyond the core is optional. Install only what you need:

pip install memoryeval              # core only
pip install memoryeval[mem0]        # + Mem0 adapter
pip install memoryeval[langgraph]   # + LangGraph adapter
pip install memoryeval[crewai]      # + CrewAI adapter
pip install memoryeval[all]         # everything
Enter fullscreen mode Exit fullscreen mode

If you are building AI agents with memory, try it:

pip install memoryeval
memeval run --adapter in_memory
memeval diagnose --adapter in_memory --failures-only
Enter fullscreen mode Exit fullscreen mode

GitHub: https://github.com/Anupam1612/memeval

Feedback, issues, and contributions welcome.

Top comments (1)

Collapse
 
xulingfeng profile image
xulingfeng

Great timing on this — we ran into the exact same gap building MemBridge, our Hermes Agent memory system. The contradiction retention failure mode you listed hit home: when we benchmarked Mem0 vs Zep vs Letta earlier this year, all three handled stale data differently and none had built-in dedup for contradictory facts.

One question that keeps bugging me: how does memeval handle the 'ground truth' problem for multi-turn recall tests? After 5 turns of conversation, how do you define what the "correct" memory state should be? We built a custom scenario DSL with explicit expected state per turn, but it's fragile and a pain to maintain.

Starred the repo, will try running it against our setup this week.