DEV Community

Anupam Gevariya

The Missing Test Suite for AI Agent Memory


There's a strange gap in the AI agent stack. Prompts have LangSmith. RAG pipelines have Ragas. APIs have Postman. But memory, the thing that makes an agent remember who the user is, what they said, and what they want, has no testing tools at all.

This means most teams find out about memory failures from their users. A customer says "I already told you my name." A support ticket gets reopened because the agent asked for the account ID that was provided three messages ago. An agent recommends steak to someone who said they're vegan.

These are testable problems. They just haven't been tested because the tooling didn't exist.

I built memeval to fill this gap. It's an open-source framework that runs standardized test scenarios against any memory backend and tells you what passes, what fails, and why.

This post covers the architecture, the key design decisions, and what came out of benchmarking real providers.

  +------------------+
  | YAML Scenarios   |   30 built-in test cases
  | (multi-turn,     |   (or write your own)
  |  privacy, recall)|
  +--------+---------+
           |
           v
  +------------------+
  | Evaluation       |   Runs scenarios against
  | Harness          |   any memory backend
  +--------+---------+
           |
     +-----+------+------+------+------+
     |     |      |      |      |      |
     v     v      v      v      v      v
   Mem0  Zep   Letta  Lang-  Crew  Custom
                       Graph   AI
     |     |      |      |      |      |
     +-----+------+------+------+------+
           |
           v
  +------------------+
  | 7 Metrics        |   recall, relevance,
  | + Visualizer     |   consistency, latency,
  |                  |   privacy, forgetting,
  |                  |   update propagation
  +------------------+
           |
           v
  +------------------+
  | Scorecard +      |   Console, JSON,
  | CI Reports       |   GitHub Actions
  +------------------+

The Problem

Consider a real scenario. A customer tells your support agent:

Turn 1: "I was charged $99 but my plan is Basic at $29"
Turn 3: "My account email is frank@email.com"
Turn 5: "Please refund the difference"

Three turns later, the agent should still know all three facts. But does it? With most memory systems, you have no way to verify this without manually testing in production.

Here are the failure modes that matter:

CONTRADICTION RETENTION
  Stored: "User earns $80,000"
  Stored: "User earns $120,000"
  Both exist. Which one is true?

STALE DATA
  Stored: "CEO is Richard Lawson"
  Updated: "CEO is Diana Park"
  Search returns: "Richard Lawson"  <-- old value still appears

CONTEXT LOSS
  Turn 1: "My budget is $25,000"
  Turn 10: Agent has no idea about the budget

CROSS-USER LEAKAGE
  User A shares: "My API key is sk-abc123"
  User B searches: finds User A's API key

Architecture: The Standard Memory Protocol

The first decision: how do you test something that works differently across every provider?

Mem0 stores flat facts with vector embeddings. Zep builds a temporal knowledge graph from conversation threads. Letta uses an agent that autonomously manages its own core + archival memory. LangGraph has a namespace-based key-value store. CrewAI has a unified Memory class with semantic recall.

We needed one interface that works across all of them.

                STANDARD MEMORY PROTOCOL (SMP)
  ================================================

  7 Core Operations:
    write(content, key, metadata)     -- store a memory
    read(key)                         -- retrieve by key
    search(query, filters)            -- semantic search
    update(key, content)              -- modify existing
    delete(key)                       -- remove
    list_all(filters)                 -- enumerate (for audits)
    consolidate(keys, strategy)       -- merge memories

  3 Session Operations:
    create_session(session_id)        -- start a conversation
    add_message(session_id, message)  -- add a turn
    get_session_context(session_id)   -- what does the system know?

  ================================================

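One way to picture the protocol in Python is a `typing.Protocol` plus a toy dict-backed backend. This is a minimal sketch, not memeval's actual code: the class names `MemoryBackend` and `DictBackend`, the argument types, and the return types are all assumptions, and only the operation names come from the listing above. `consolidate` is left out because its merge semantics are not described here.

```python
from typing import Optional, Protocol


class MemoryBackend(Protocol):
    """Hypothetical typing of the SMP operations; the signatures are assumed."""

    def write(self, content: str, key: str, metadata: Optional[dict] = None) -> None: ...
    def read(self, key: str) -> Optional[str]: ...
    def search(self, query: str, filters: Optional[dict] = None) -> list[str]: ...
    def update(self, key: str, content: str) -> None: ...
    def delete(self, key: str) -> None: ...
    def list_all(self, filters: Optional[dict] = None) -> list[str]: ...
    def create_session(self, session_id: str) -> None: ...
    def add_message(self, session_id: str, message: dict) -> None: ...
    def get_session_context(self, session_id: str) -> list[dict]: ...


class DictBackend:
    """Toy backend: one dict for memories, one dict of lists for sessions."""

    def __init__(self) -> None:
        self._memories: dict[str, str] = {}
        self._sessions: dict[str, list[dict]] = {}

    def write(self, content, key, metadata=None):
        self._memories[key] = content

    def read(self, key):
        return self._memories.get(key)

    def search(self, query, filters=None):
        # Naive keyword overlap stands in for semantic search.
        words = set(query.lower().split())
        return [c for c in self._memories.values() if words & set(c.lower().split())]

    def update(self, key, content):
        self._memories[key] = content

    def delete(self, key):
        self._memories.pop(key, None)

    def list_all(self, filters=None):
        return list(self._memories.values())

    def create_session(self, session_id):
        self._sessions[session_id] = []

    def add_message(self, session_id, message):
        self._sessions[session_id].append(message)

    def get_session_context(self, session_id):
        return list(self._sessions[session_id])
```

Because the harness would only depend on `MemoryBackend`, swapping `DictBackend` for a provider-specific adapter should not require touching scenarios or metrics.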

Each provider implements this via an adapter:

  +-------------+    +-------------+    +-------------+
  |   Mem0      |    |    Zep      |    |   Letta     |
  |  Adapter    |    |   Adapter   |    |  Adapter    |
  |             |    |             |    |             |
  | run_id =    |    | thread =    |    | agent =     |
  | session     |    | session     |    | session     |
  +------+------+    +------+------+    +------+------+
         |                  |                  |
         +------------------+------------------+
                            |
                   Standard Memory Protocol
                            |
                  +--------------------+
                  | Evaluation Harness |
                  | Scenarios + Metrics|
                  +--------------------+

Why this matters: The evaluation harness never talks to Mem0, Zep, or LangGraph directly. It only talks to the protocol. This means every scenario and every metric works across every provider without modification.

The session decision: The first version had no session concept. Just write and search. But testing against real providers revealed this was wrong. Mem0 uses run_id to scope conversations. Zep uses threads. Letta agents maintain state across sequential messages. Without session support, the framework was testing "can the backend store facts" instead of "can it maintain conversation context", which is what users actually care about.


Testing with YAML Scenarios

Tests are defined in YAML, not code. This was deliberate. Non-engineers (product managers, QA) should be able to write memory tests.

A simple scenario:

name: "User Preference Update"
dimensions_tested: [recall_accuracy, consistency, update_propagation]

setup:
  - write:
      key: "diet"
      content: "User is vegetarian"

steps:
  - write:
      key: "diet_v2"
      content: "User switched to vegan diet"

  - assert_search:
      query: "What are the user's dietary preferences?"
      expected_contains: ["vegan"]
      expected_not_contains: ["vegetarian"]

thresholds:
  recall_accuracy: 0.9
  consistency: 1.0

A session-aware scenario:

name: "Customer Support Multi-Turn"
steps:
  - create_session:
      session_id: "ticket_789"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "I was charged $99 but my plan is Basic at $29"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "My account email is frank@email.com"

  - add_message:
      session_id: "ticket_789"
      role: "user"
      content: "Please refund the difference"

  - assert_context:
      session_id: "ticket_789"
      query: "What is the billing issue?"
      expected_contains: ["99"]

The scenario runner executes each step against the adapter, collects results, and passes them to the metric evaluators.

  YAML Scenario File
         |
         v
  +----------------+
  | Scenario Loader|  parses YAML into Scenario objects
  +-------+--------+
          |
          v
  +----------------+
  | Scenario Runner|  executes steps against adapter
  |                |  collects StepResults
  +-------+--------+
          |
          v
  +----------------+
  | Metric Engines |  evaluates dimensions
  |                |  recall, consistency, latency, etc.
  +-------+--------+
          |
          v
  +----------------+
  | ScenarioResult |  passed/failed, scores, details
  +----------------+
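The runner stage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not memeval's implementation: `naive_search` and `run_steps` are hypothetical names, the steps are plain dicts shaped like the parsed YAML above, and keyword overlap stands in for a real backend's search.

```python
def naive_search(store: dict[str, str], query: str) -> list[str]:
    """Stand-in for semantic search: memories sharing a word with the query."""
    words = set(query.lower().replace("?", "").split())
    return [text for text in store.values() if words & set(text.lower().split())]


def run_steps(steps: list[dict]) -> list[dict]:
    """Run parsed scenario steps against a fresh store; one result per assertion."""
    store: dict[str, str] = {}
    results = []
    for step in steps:
        (op, args), = step.items()  # each step is a single-key mapping, as in the YAML
        if op == "write":
            store[args["key"]] = args["content"]
        elif op == "assert_search":
            joined = " ".join(naive_search(store, args["query"])).lower()
            missing = [s for s in args.get("expected_contains", [])
                       if s.lower() not in joined]
            unexpected = [s for s in args.get("expected_not_contains", [])
                          if s.lower() in joined]
            results.append({
                "query": args["query"],
                "passed": not missing and not unexpected,
                "missing": missing,
                "unexpected": unexpected,
            })
        else:
            raise ValueError(f"unknown step: {op}")
    return results


steps = [
    {"write": {"key": "diet", "content": "User is vegetarian"}},
    {"write": {"key": "diet_v2", "content": "User switched to vegan diet"}},
    {"assert_search": {"query": "What is the user diet?",
                       "expected_contains": ["vegan"],
                       "expected_not_contains": ["vegetarian"]}},
]
results = run_steps(steps)
```

With this append-only toy store the assertion fails: "vegan" is found, but the superseded "vegetarian" memory is still returned, which is the stale-data failure mode the preference-update scenario is designed to expose.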

We ship 30 built-in scenarios organized by category:

  Category                  Count  What they test
  Session (multi-turn)      6      Conversation recall, correction, 10-turn depth, isolation
  Core (fact storage)       7      Basic recall, adversarial, multi-hop, entity resolution
  Lifecycle (evolution)     6      Preference update, contradictions, GDPR deletion
  Governance (boundaries)   3      Privacy isolation, multi-user separation
  Operations (management)   6      Cascading deletion, consolidation, support handoff
  Edge cases                2      UTF-8 characters, boundary conditions

The 7 Metrics

1. Recall Accuracy

Can the system retrieve what was stored?

Store 5 facts, search for each one, and measure the hit rate. Two modes are available: substring matching for speed and semantic similarity for accuracy.

Semantic recall formula:
  For each expected fact, find max cosine similarity in retrieved results.
  Count as "recalled" if max_sim >= 0.85.
  recall = recalled_count / expected_count
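The semantic recall formula translates directly into Python. In this sketch the function names are hypothetical and the hand-made 2-D vectors stand in for real sentence embeddings; only the 0.85 threshold and the recalled/expected ratio come from the formula above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_recall(expected, retrieved, threshold=0.85):
    """Fraction of expected embeddings whose best match in retrieved is >= threshold."""
    if not expected:
        return 1.0  # assumption: nothing to recall counts as perfect recall
    recalled = sum(
        1 for e in expected
        if retrieved and max(cosine(e, r) for r in retrieved) >= threshold
    )
    return recalled / len(expected)


# Toy vectors standing in for embeddings of three expected facts and two results.
expected = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
retrieved = [[0.9, 0.1], [1.0, 0.9]]
score = semantic_recall(expected, retrieved)  # two of three facts are recalled
```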

2. Relevance (MRR + NDCG)

Does it return the right memories first?

A system that retrieves the correct fact at position 10 is worse than one that retrieves it at position 1. This is measured using Mean Reciprocal Rank and Normalized Discounted Cumulative Gain.
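Both ranking metrics are short to write down. The sketch below uses the standard definitions with linear gain for DCG; the function names and the input shapes (ranked lists of memory IDs, a dict of graded relevance) are assumptions for illustration, not memeval's API.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """queries: list of (ranked_ids, relevant_ids) pairs, one per search."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg(ranked_ids, relevance, k=None):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ranked = ranked_ids[:k] if k else ranked_ids
    dcg = sum(relevance.get(item, 0.0) / math.log2(pos + 1)
              for pos, item in enumerate(ranked, start=1))
    ideal = sorted(relevance.values(), reverse=True)[:len(ranked)]
    idcg = sum(gain / math.log2(pos + 1) for pos, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

A correct fact at position 1 gives a reciprocal rank of 1.0, while the same fact at position 10 gives 0.1, which is the penalty the paragraph above describes.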

3. Consistency (Contradiction Detection)

We use embedding-based detection. Group memories by topic using cosine similarity, then check for divergent values within each group.

  Step 1: Embed all memories
  Step 2: For each pair, compute cosine similarity
  Step 3: If similarity > 0.55, they're about the same topic
  Step 4: For same-topic pairs, check 4 signals:
          - Negation asymmetry ("likes" vs "does not like")
          - Numeric divergence ($80K vs $120K)
          - Value divergence via embeddings ("NYC" vs "London")
          - Structural substitution ("CEO is X" vs "CEO is Y")
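Two of the four signals in Step 4 need no embeddings and can be sketched with the standard library. The code assumes the pair has already passed the same-topic check from Step 3. The function names, the 5% tolerance, and the negation word list are hypothetical choices for illustration; the embedding-based value divergence and structural substitution signals are not covered.

```python
import re

NEGATIONS = {"not", "no", "never", "don't", "doesn't"}  # assumed word list


def numbers_in(text):
    """Extract numeric values, treating a trailing K as thousands ("$80K" is 80000.0)."""
    values = []
    for digits, suffix in re.findall(r"(\d[\d,]*(?:\.\d+)?)([kK]\b)?", text):
        value = float(digits.replace(",", ""))
        values.append(value * 1000 if suffix else value)
    return values


def numeric_divergence(a, b, tolerance=0.05):
    """True if the first number in each text differs by more than the tolerance."""
    nums_a, nums_b = numbers_in(a), numbers_in(b)
    if not nums_a or not nums_b:
        return False
    x, y = nums_a[0], nums_b[0]
    return abs(x - y) > tolerance * max(abs(x), abs(y))


def has_negation(text):
    """True if the text contains any word from NEGATIONS."""
    return bool(NEGATIONS & set(text.lower().split()))


def negation_asymmetry(a, b):
    """True if exactly one of the two texts is negated."""
    return has_negation(a) != has_negation(b)


def contradiction_signals(a, b):
    """Names of the signals that fire for a pair already judged to be same-topic."""
    signals = []
    if numeric_divergence(a, b):
        signals.append("numeric_divergence")
    if negation_asymmetry(a, b):
        signals.append("negation_asymmetry")
    return signals
```

For example, `contradiction_signals("User earns $80K", "User earns $120K")` reports only the numeric signal, while "$80,000" against "$80K" reports nothing because both parse to the same value.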

What it catches:

  Pair                                      Detected?     Signal
  "earns $80K" vs "earns $120K"             Yes           Numeric divergence
  "CEO is Richard" vs "CEO is Diana"        Yes           Structural substitution
  "lives in NYC" vs "lives in London"       Yes           Value divergence
  "likes spicy" vs "does not like spicy"    Yes           Negation asymmetry
  "likes hiking" vs "works as engineer"     No (correct)  Different topics
  "vegetarian" vs "vegan"                   No (correct)  Evolution, not contradiction

4. Update Propagation

Store fact A, then correction A'. Query for A. It should return A', not A. The metric also checks derived facts that depended on A.

5. Forgetting Quality

Delete specific items, then verify: deleted items are gone, retained items survive. The score is the harmonic mean of forgetting precision and retention rate.
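A small sketch shows why the harmonic mean is a sensible choice here. The exact definitions are assumptions: forgetting precision is taken as the share of deleted keys that are really gone, and retention rate as the share of kept keys that are still present. The function name and arguments are hypothetical.

```python
def forgetting_quality(deleted, retained, still_present):
    """Harmonic mean of forgetting precision and retention rate.

    deleted:       keys the scenario deleted
    retained:      keys that must survive the deletion
    still_present: keys the backend still returns afterwards
    """
    deleted, retained, still_present = set(deleted), set(retained), set(still_present)
    precision = len(deleted - still_present) / len(deleted) if deleted else 1.0
    retention = len(retained & still_present) / len(retained) if retained else 1.0
    if precision + retention == 0:
        return 0.0
    return 2 * precision * retention / (precision + retention)


perfect = forgetting_quality(["a", "b"], ["c", "d"], ["c", "d"])
partial = forgetting_quality(["a", "b"], ["c", "d"], ["b", "c", "d"])  # "b" survived
wiped = forgetting_quality(["a"], ["c", "d"], [])  # backend dropped everything
```

A backend that wipes everything forgets perfectly but retains nothing, so it scores 0.0 rather than the 0.5 an arithmetic mean would give.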

6. Latency and Cost

We track p50/p95/p99 separately for reads and writes. Writes get a 5x more lenient target because API-based providers (like Mem0 with OpenAI) need LLM calls on every write.
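The percentile tracking can be sketched with a nearest-rank percentile. The 100 ms read target and the decision to judge pass or fail on p95 are assumptions made for this example; only the separate read/write tracking and the 5x write allowance come from the text above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(read_ms, write_ms, read_target_ms=100.0, write_factor=5.0):
    """p50/p95/p99 per operation type; writes get write_factor times the read target."""
    report = {}
    for name, samples, target in (
        ("read", read_ms, read_target_ms),
        ("write", write_ms, read_target_ms * write_factor),
    ):
        stats = {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
        stats["target_ms"] = target
        stats["within_target"] = stats["p95"] <= target  # assumed pass criterion
        report[name] = stats
    return report


report = latency_report(read_ms=list(range(1, 101)), write_ms=[3500.0] * 10)
```

Under these assumed targets, a write p95 of 3,500 ms still misses the relaxed 500 ms target.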

7. Privacy Isolation

Plant sentinel values for User A, then search from User B's context. This is a binary metric: any leak at all means the system fails.
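The sentinel technique is easy to demonstrate against a toy multi-user store. `ScopedStore` and `privacy_isolation` are hypothetical names, and the `isolate` flag exists only to simulate a leaky backend; a real run would go through a provider adapter instead.

```python
import uuid


class ScopedStore:
    """Toy multi-user store: every memory is kept with its owner's user_id."""

    def __init__(self, isolate=True):
        self.isolate = isolate
        self.rows = []  # list of (user_id, content) pairs

    def write(self, user_id, content):
        self.rows.append((user_id, content))

    def search(self, user_id, query):
        # A leaky backend (isolate=False) ignores user_id and searches every row.
        pool = [c for u, c in self.rows if u == user_id or not self.isolate]
        return [c for c in pool if query.lower() in c.lower()]


def privacy_isolation(store):
    """Return 1.0 if User B cannot see User A's sentinel, else 0.0 (binary)."""
    sentinel = f"SENTINEL-{uuid.uuid4().hex}"
    store.write("user_a", f"My API key is {sentinel}")
    leaked = any(sentinel in hit for hit in store.search("user_b", "api key"))
    return 0.0 if leaked else 1.0
```

Using a random sentinel means a hit in User B's results can only have come from User A's data, so there is no ambiguity about whether a match is a leak.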


The Failure Visualizer

This is what makes memeval different from a benchmark. When a scenario fails, you need to know why.

memeval diagnose --adapter in_memory --failures-only

Output:

Stale Data Supersession -- FAILED
Timeline
  Setup
    WRITE ceo_old      -- "CEO is Richard Lawson"
  Steps
    WRITE ceo_new      -- "CEO is Diana Park"
    SEARCH FAILED "Who is the CEO?" -> 4 results
      expected "Diana Park" -- NOT FOUND
      Retrieved:
        The company CEO is Richard Lawson (score: 0.50)
        Product pricing: Basic plan is $10/month (score: 0.25)

  Metric: update_propagation  0.667 < 0.700  FAIL
  Metric: recall_accuracy     0.667 < 0.700  FAIL

You can immediately see: the search for "Who is the CEO?" returned the old value ("Richard Lawson" at score 0.50) instead of the new one ("Diana Park"). The system stored both but retrieves the wrong one.

This is not a number on a dashboard. This is a specific, actionable failure that a developer can debug.


Benchmarking Real Providers

We ran memeval against Mem0 (self-hosted with gpt-4o-mini), Zep Cloud, Letta Cloud, and LangGraph's InMemoryStore. The scorecard below lists scores for the in-memory adapter, Mem0, and LangGraph.

              InMemory  Mem0    LangGraph
  recall      0.879     1.000   1.000
  relevance   0.727     0.904   0.657
  consistency 0.838     0.917   0.838
  update_prop 0.708     1.000   1.000
  forgetting  1.000     1.000   1.000
  latency     1.000     0.840   1.000
  privacy     1.000     1.000   1.000

Key findings:

Mem0's LLM extraction genuinely improves recall. It doesn't just store raw text. It extracts facts, which makes semantic search significantly better. But it comes at a cost: write p95 = 3,500ms because every write calls OpenAI.

Mem0 stores contradictions side by side. "User is vegetarian" and "User is vegan" both exist in the store. There is no automatic resolution. Our consistency metric caught this.

Zep's graph processing is async. Write a fact, immediately search for it, and it is not found. The knowledge graph needs time to process. This is an architectural tradeoff, not a bug, but it affects real-time agents.

LangGraph has perfect recall and update propagation but weaker relevance ranking. It returns more results but doesn't rank them as precisely as Mem0's vector search.

These findings aren't possible without standardized testing across providers. Each provider's own benchmarks test different things in different ways. memeval makes them comparable.


LongMemEval Integration

For credibility beyond custom scenarios, memeval integrates the LongMemEval benchmark (Wu et al., ICLR 2025), which contains 500 QA pairs derived from multi-session conversations.

memeval longmemeval --adapter mem0 --scoring embedding --limit 50

The key difference from the paper: memeval tests retrieval only, not end-to-end QA. The paper asks "can the system answer correctly?" memeval asks "did the memory surface the right facts?" This isolates memory quality from LLM generation quality.

Reference baselines from the paper: GPT-4o scores 60.6%, ChatGPT with memory scores 57.7%.


Technical Stack

  Python package: memoryeval (PyPI)
  Import name:    memeval

  Core:        pydantic, pyyaml, click, rich, numpy
  Embeddings:  sentence-transformers (optional)
  NLI:         transformers + torch (optional)
  LLM Judge:   anthropic or openai SDK (optional)
  Benchmark:   huggingface_hub (optional)

  Adapters:    mem0ai, zep-cloud, letta-client,
               langgraph, crewai (all optional)

Everything beyond the core is optional. Install only what you need:

pip install memoryeval                # core only
pip install "memoryeval[mem0]"        # + Mem0 adapter
pip install "memoryeval[langgraph]"   # + LangGraph adapter
pip install "memoryeval[crewai]"      # + CrewAI adapter
pip install "memoryeval[all]"         # everything (quotes keep zsh from globbing the brackets)

If you are building AI agents with memory, try it:

pip install memoryeval
memeval run --adapter in_memory
memeval diagnose --adapter in_memory --failures-only

GitHub: https://github.com/Anupam1612/memeval

Feedback, issues, and contributions welcome.

Top comments (6)

xulingfeng

Great timing on this — we ran into the exact same gap building MemBridge, our Hermes Agent memory system. The contradiction retention failure mode you listed hit home: when we benchmarked Mem0 vs Zep vs Letta earlier this year, all three handled stale data differently and none had built-in dedup for contradictory facts.

One question that keeps bugging me: how does memeval handle the 'ground truth' problem for multi-turn recall tests? After 5 turns of conversation, how do you define what the "correct" memory state should be? We built a custom scenario DSL with explicit expected state per turn, but it's fragile and a pain to maintain.

Starred the repo, will try running it against our setup this week.

Anupam Gevariya

Thanks for checking it out!

On the ground-truth problem, this is exactly why I chose YAML-based scenarios with explicit assertions rather than trying to infer what the "correct" state should be.

For example, a multi-turn recall test might look like:

- add_message: "My budget is $25,000"
- add_message: "The deadline is Friday"
- add_message: ... # 8 more turns of conversation
- assert_context:
    query: "What is the budget?"
    expected_contains: ["25,000"]

The ground truth is defined by the scenario author, not inferred by the framework.

The tradeoff is the same one you mentioned: someone has to define and maintain the expected state. However, YAML tends to be much easier to maintain than a custom DSL, and the 30 built-in scenarios cover the most common memory failure modes, so most teams won't need to write many custom tests.

For contradiction detection specifically, memoryeval doesn't require ground truth. The consistency metric uses embedding similarity to identify facts that appear to be about the same topic but contain conflicting values (for example, "$80K salary" vs "$120K salary"). It's essentially a pairwise consistency check across the stored memory.

I'd be curious to hear what you find when running it against your Hermes Agent setup.

If you run into edge cases that aren't covered by the built-in scenarios, feel free to open an issue or contribute a new scenario.

xulingfeng

Appreciate you taking the time to write that out. The 30 built-in scenarios covering common failure modes makes sense — I think most teams will get 80% of the value from those without touching custom tests.

One thing I keep circling back to though: how do you handle the case where two scenarios produce conflicting expected states? We ran into this when a conversation context said "budget is flexible" but a later assert expected a hard number. Ended up adding priority levels to our scenario definitions, but it felt hacky.

Curious if you hit anything similar.

Anupam Gevariya • Edited

Good question.

Each scenario in memoryeval runs in complete isolation. The adapter is reset between scenarios, so no state can leak from one test to another. Two scenarios can never conflict because they never share the same memory store.

Within a single scenario, assertions are evaluated against the memory state at that specific point in the conversation.

For example:

  • Turn 2: "My budget is flexible"
  • Turn 5: "My budget is $25,000"

The expected result depends on where the assertion is placed:

  • Assert after Turn 2 → expect "flexible"
  • Assert after Turn 5 → expect "$25,000"

The scenario author defines the ground truth at each checkpoint. memoryeval doesn't impose an automatic priority or conflict-resolution system; it simply verifies that the memory system returns the expected state at the specified moment in the conversation.

That said, your priority-based approach is interesting. My intuition is that if a scenario becomes complex enough to require priority levels, it's often a sign that the test should be split into smaller, more focused scenarios. In practice, one scenario per failure mode tends to be easier to understand, maintain, and debug.

xulingfeng

The scenario isolation approach makes sense. We were overcomplicating it with priority levels just to avoid splitting scenarios, but you're right: one failure mode per scenario, kept focused, is easier to maintain in the long run. Going to try restructuring MemBridge's test suite this way. Appreciate the detailed breakdown!