<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anupam Gevariya</title>
    <description>The latest articles on DEV Community by Anupam Gevariya (@anupam_gevariya_66b03d3ad).</description>
    <link>https://dev.to/anupam_gevariya_66b03d3ad</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962166%2F96335836-6bf8-4b25-8a85-cccf026a4f6b.png</url>
      <title>DEV Community: Anupam Gevariya</title>
      <link>https://dev.to/anupam_gevariya_66b03d3ad</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anupam_gevariya_66b03d3ad"/>
    <language>en</language>
    <item>
      <title>The Missing Test Suite for AI Agent Memory</title>
      <dc:creator>Anupam Gevariya</dc:creator>
      <pubDate>Mon, 01 Jun 2026 07:41:55 +0000</pubDate>
      <link>https://dev.to/anupam_gevariya_66b03d3ad/how-we-built-memeval-a-testing-framework-for-ai-agent-memory-1e2o</link>
      <guid>https://dev.to/anupam_gevariya_66b03d3ad/how-we-built-memeval-a-testing-framework-for-ai-agent-memory-1e2o</guid>
      <description>&lt;h1&gt;
  
  
  The Missing Test Suite for AI Agent Memory
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6qfu45jbualyjsjk3kv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6qfu45jbualyjsjk3kv.png" alt="memeval"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a strange gap in the AI agent stack. Prompts have LangSmith. RAG pipelines have Ragas. APIs have Postman. But memory, the thing that makes an agent remember who the user is, what they said, and what they want, has no equivalent, widely used testing tool.&lt;/p&gt;

&lt;p&gt;This means most teams find out about memory failures from their users. A customer says "I already told you my name." A support ticket gets reopened because the agent asked for the account ID that was provided three messages ago. An agent recommends steak to someone who said they're vegan.&lt;/p&gt;

&lt;p&gt;These are testable problems. They just haven't been tested because the tooling didn't exist.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/Anupam1612/memeval" rel="noopener noreferrer"&gt;memeval&lt;/a&gt; to fill this gap. It's an open-source framework that runs standardized test scenarios against any memory backend and tells you what passes, what fails, and why.&lt;/p&gt;

&lt;p&gt;This post covers the architecture, the key design decisions, and what came out of benchmarking real providers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  +------------------+
  | YAML Scenarios   |   30 built-in test cases
  | (multi-turn,     |   (or write your own)
  |  privacy, recall)|
  +--------+---------+
           |
           v
  +------------------+
  | Evaluation       |   Runs scenarios against
  | Harness          |   any memory backend
  +--------+---------+
           |
     +-----+------+------+------+------+
     |     |      |      |      |      |
     v     v      v      v      v      v
   Mem0  Zep   Letta  Lang-  Crew  Custom
                       Graph   AI
     |     |      |      |      |      |
     +-----+------+------+------+------+
           |
           v
  +------------------+
  | 7 Metrics        |   recall, relevance,
  | + Visualizer     |   consistency, latency,
  |                  |   privacy, forgetting,
  |                  |   update propagation
  +------------------+
           |
           v
  +------------------+
  | Scorecard +      |   Console, JSON,
  | CI Reports       |   GitHub Actions
  +------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Consider a real scenario. A customer tells your support agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn 1: "I was charged $99 but my plan is Basic at $29"
Turn 3: "My account email is frank@email.com"
Turn 5: "Please refund the difference"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three turns later, the agent should still know all three facts. But does it? With most memory systems, you have no way to verify this without manually testing in production.&lt;/p&gt;

&lt;p&gt;Here are the failure modes that matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONTRADICTION RETENTION
  Stored: "User earns $80,000"
  Stored: "User earns $120,000"
  Both exist. Which one is true?

STALE DATA
  Stored: "CEO is Richard Lawson"
  Updated: "CEO is Diana Park"
  Search returns: "Richard Lawson"  &amp;lt;-- old value still appears

CONTEXT LOSS
  Turn 1: "My budget is $25,000"
  Turn 10: Agent has no idea about the budget

CROSS-USER LEAKAGE
  User A shares: "My API key is sk-abc123"
  User B searches: finds User A's API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Architecture: The Standard Memory Protocol
&lt;/h2&gt;

&lt;p&gt;The first decision: how do you test something that works differently across every provider?&lt;/p&gt;

&lt;p&gt;Mem0 stores flat facts with vector embeddings. Zep builds a temporal knowledge graph from conversation threads. Letta uses an agent that autonomously manages its own core and archival memory. LangGraph has a namespace-based key-value store. CrewAI has a unified Memory class with semantic recall.&lt;/p&gt;

&lt;p&gt;We needed one interface that works across all of them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                STANDARD MEMORY PROTOCOL (SMP)
  ================================================

  7 Core Operations:
    write(content, key, metadata)     -- store a memory
    read(key)                         -- retrieve by key
    search(query, filters)            -- semantic search
    update(key, content)              -- modify existing
    delete(key)                       -- remove
    list_all(filters)                 -- enumerate (for audits)
    consolidate(keys, strategy)       -- merge memories

  3 Session Operations:
    create_session(session_id)        -- start a conversation
    add_message(session_id, message)  -- add a turn
    get_session_context(session_id)   -- what does the system know?

  ================================================

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
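
&lt;p&gt;To make the shape of the protocol concrete, here is a minimal Python sketch of a subset of it. The names &lt;code&gt;MemoryBackend&lt;/code&gt;, &lt;code&gt;DictBackend&lt;/code&gt;, and &lt;code&gt;recall_check&lt;/code&gt;, along with the exact signatures, are assumptions based on the operation list above, not memeval's actual classes. Keyword overlap stands in for real semantic search.&lt;/p&gt;

```python
from typing import Any, Optional, Protocol


class MemoryBackend(Protocol):
    """Hypothetical subset of the protocol (signatures are assumptions)."""

    def write(self, content: str, key: str, metadata: Optional[dict[str, Any]] = None) -> None: ...
    def read(self, key: str) -> Optional[str]: ...
    def search(self, query: str, filters: Optional[dict[str, Any]] = None) -> list[str]: ...
    def delete(self, key: str) -> None: ...


class DictBackend:
    """Toy backend: a dict plus keyword-overlap search."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def write(self, content: str, key: str, metadata: Optional[dict[str, Any]] = None) -> None:
        self._items[key] = content

    def read(self, key: str) -> Optional[str]:
        return self._items.get(key)

    def search(self, query: str, filters: Optional[dict[str, Any]] = None) -> list[str]:
        words = set(query.lower().split())
        scored = [
            (len(words.intersection(text.lower().split())), text)
            for text in self._items.values()
        ]
        # Keep only memories sharing at least one word, best match first.
        hits = sorted(
            (pair for pair in scored if pair[0] > 0),
            key=lambda pair: pair[0],
            reverse=True,
        )
        return [text for _, text in hits]

    def delete(self, key: str) -> None:
        self._items.pop(key, None)


def recall_check(backend: MemoryBackend, key: str, content: str, query: str) -> bool:
    """Harness-style check that depends only on the protocol, not the provider."""
    backend.write(content, key)
    return content in backend.search(query)
```

&lt;p&gt;Because &lt;code&gt;recall_check&lt;/code&gt; is typed against the protocol rather than a concrete class, the same check would run unchanged against any adapter that provides these methods.&lt;/p&gt;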



&lt;p&gt;Each provider implements this via an adapter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  +-------------+    +-------------+    +-------------+
  |   Mem0      |    |    Zep      |    |   Letta     |
  |  Adapter    |    |   Adapter   |    |  Adapter    |
  |             |    |             |    |             |
  | run_id =    |    | thread =    |    | agent =     |
  | session     |    | session     |    | session     |
  +------+------+    +------+------+    +------+------+
         |                  |                  |
         +------------------+------------------+
                            |
                   Standard Memory Protocol
                            |
                  +--------------------+
                  | Evaluation Harness |
                  | Scenarios + Metrics|
                  +--------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; The evaluation harness never talks to Mem0, Zep, or LangGraph directly. It only talks to the protocol. This means every scenario and every metric works across every provider without modification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The session decision:&lt;/strong&gt; The first version had no session concept, just write and search. Testing against real providers showed this was wrong. Mem0 uses &lt;code&gt;run_id&lt;/code&gt; to scope conversations. Zep uses threads. Letta agents maintain state across sequential messages. Without session support, the framework was testing "can the backend store facts" rather than "can it maintain conversation context", and the second question is the one users actually care about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing with YAML Scenarios
&lt;/h2&gt;

&lt;p&gt;Tests are defined in YAML, not code. This was deliberate. Non-engineers (product managers, QA) should be able to write memory tests.&lt;/p&gt;

&lt;p&gt;A simple scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Preference&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Update"&lt;/span&gt;
&lt;span class="na"&gt;dimensions_tested&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;recall_accuracy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;consistency&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;update_propagation&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;setup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet"&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vegetarian"&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diet_v2"&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;switched&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;vegan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;diet"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assert_search&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;user's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dietary&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;preferences?"&lt;/span&gt;
      &lt;span class="na"&gt;expected_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vegan"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;expected_not_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vegetarian"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;thresholds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;recall_accuracy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.9&lt;/span&gt;
  &lt;span class="na"&gt;consistency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
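
&lt;p&gt;One way a runner might evaluate the &lt;code&gt;assert_search&lt;/code&gt; step above is case-insensitive substring matching over the returned memories. The function name &lt;code&gt;check_search&lt;/code&gt; and its return shape are illustrative assumptions, not memeval's API.&lt;/p&gt;

```python
def check_search(results, expected_contains, expected_not_contains=()):
    """Return (passed, problems) for one assert_search step."""
    haystack = " ".join(results).lower()
    problems = []
    for term in expected_contains:
        if term.lower() not in haystack:
            problems.append(f"missing: {term}")
    for term in expected_not_contains:
        if term.lower() in haystack:
            problems.append(f"unexpected: {term}")
    return (not problems, problems)


# A backend that kept the old preference next to the new one fails the step.
stale_results = ["User is vegetarian", "User switched to vegan diet"]
passed, problems = check_search(stale_results, ["vegan"], ["vegetarian"])
print(passed, problems)  # False ['unexpected: vegetarian']
```

&lt;p&gt;Returning the list of problems rather than a bare boolean is what lets a report say which expectation was violated.&lt;/p&gt;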



&lt;p&gt;A session-aware scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Support&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Multi-Turn"&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;create_session&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_789"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;add_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_789"&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user"&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;was&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;charged&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;but&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Basic&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$29"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;add_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_789"&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user"&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;account&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;frank@email.com"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;add_message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_789"&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user"&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;difference"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;assert_context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;session_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_789"&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;issue?"&lt;/span&gt;
      &lt;span class="na"&gt;expected_contains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;99"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
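
&lt;p&gt;The session steps can be pictured with a toy backend that keeps every turn and returns the raw transcript as context. &lt;code&gt;SessionStore&lt;/code&gt; and this &lt;code&gt;assert_context&lt;/code&gt; helper are hypothetical stand-ins, and the sketch ignores the &lt;code&gt;query&lt;/code&gt; field. Real providers summarize, extract, or rank what they return, which is the behavior the scenario probes.&lt;/p&gt;

```python
class SessionStore:
    """Hypothetical in-memory stand-in for the three session operations."""

    def __init__(self):
        self._sessions = {}

    def create_session(self, session_id):
        self._sessions[session_id] = []

    def add_message(self, session_id, role, content):
        self._sessions[session_id].append((role, content))

    def get_session_context(self, session_id):
        # A real backend may summarize or drop turns here.
        return "\n".join(content for _, content in self._sessions[session_id])


def assert_context(store, session_id, expected_contains):
    """Return the expected strings that are missing from the session context."""
    context = store.get_session_context(session_id)
    return [term for term in expected_contains if term not in context]


store = SessionStore()
store.create_session("ticket_789")
store.add_message("ticket_789", "user", "I was charged $99 but my plan is Basic at $29")
store.add_message("ticket_789", "user", "My account email is frank@email.com")
store.add_message("ticket_789", "user", "Please refund the difference")

missing = assert_context(store, "ticket_789", ["99", "frank@email.com"])
print(missing)  # [] because the toy store keeps every turn verbatim
```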



&lt;p&gt;The scenario runner executes each step against the adapter, collects results, and passes them to the metric evaluators.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  YAML Scenario File
         |
         v
  +----------------+
  | Scenario Loader|  parses YAML into Scenario objects
  +-------+--------+
          |
          v
  +----------------+
  | Scenario Runner|  executes steps against adapter
  |                |  collects StepResults
  +-------+--------+
          |
          v
  +----------------+
  | Metric Engines |  evaluates dimensions
  |                |  recall, consistency, latency, etc.
  +-------+--------+
          |
          v
  +----------------+
  | ScenarioResult |  passed/failed, scores, details
  +----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We ship 30 built-in scenarios organized by category:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What they test&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Session (multi-turn)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Conversation recall, correction, 10-turn depth, isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Core (fact storage)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Basic recall, adversarial, multi-hop, entity resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifecycle (evolution)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Preference update, contradictions, GDPR deletion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance (boundaries)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Privacy isolation, multi-user separation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations (management)&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Cascading deletion, consolidation, support handoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;UTF-8 characters, boundary conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The 7 Metrics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Recall Accuracy
&lt;/h3&gt;

&lt;p&gt;Can the system retrieve what was stored?&lt;/p&gt;

&lt;p&gt;Store 5 facts, search for each one, and measure the hit rate. Two modes are available: substring matching for speed, and semantic similarity for accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Semantic recall formula:
  For each expected fact, find max cosine similarity in retrieved results.
  Count as "recalled" if max_sim &amp;gt;= 0.85.
  recall = recalled_count / expected_count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
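
&lt;p&gt;The formula above can be sketched in a few lines of plain Python. The vectors here are made-up two-dimensional stand-ins for real embeddings, and the function names, along with the choice to return 1.0 when nothing is expected, are assumptions for illustration.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_recall(expected_vecs, retrieved_vecs, threshold=0.85):
    """Fraction of expected facts whose best match reaches the threshold."""
    if not expected_vecs:
        return 1.0  # assumption: nothing expected means nothing was missed
    recalled = 0
    for vec in expected_vecs:
        max_sim = max((cosine(vec, r) for r in retrieved_vecs), default=0.0)
        if max_sim >= threshold:
            recalled += 1
    return recalled / len(expected_vecs)


# Two expected facts, one retrieved result that is close to the first only.
expected = [[1.0, 0.0], [0.0, 1.0]]
retrieved = [[0.9, 0.1]]
print(semantic_recall(expected, retrieved))  # 0.5
```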



&lt;h3&gt;
  
  
  2. Relevance (MRR + NDCG)
&lt;/h3&gt;

&lt;p&gt;Does it return the right memories first?&lt;/p&gt;

&lt;p&gt;A system that retrieves the correct fact at position 10 is worse than one that retrieves it at position 1. This is measured using Mean Reciprocal Rank and Normalized Discounted Cumulative Gain.&lt;/p&gt;
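
&lt;p&gt;For readers who have not met these two ranking metrics, here is a small sketch using their standard definitions. The function names and input shapes are assumptions, not memeval's internals.&lt;/p&gt;

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0.0 if none appears."""
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """queries is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg(ranked_ids, relevance, k=10):
    """relevance maps item id to a graded gain; unknown items count as 0."""

    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [relevance.get(item, 0.0) for item in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best else 0.0


# The correct memory at position 1 scores higher than the same memory at position 2.
print(ndcg(["fact", "noise"], {"fact": 1.0}))  # 1.0
print(ndcg(["noise", "fact"], {"fact": 1.0}))  # about 0.63
```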

&lt;h3&gt;
  
  
  3. Consistency (Contradiction Detection)
&lt;/h3&gt;

&lt;p&gt;We use embedding-based detection. Group memories by topic using cosine similarity, then check for divergent values within each group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Step 1: Embed all memories
  Step 2: For each pair, compute cosine similarity
  Step 3: If similarity &amp;gt; 0.55, they're about the same topic
  Step 4: For same-topic pairs, check 4 signals:
          - Negation asymmetry ("likes" vs "does not like")
          - Numeric divergence ($80K vs $120K)
          - Value divergence via embeddings ("NYC" vs "London")
          - Structural substitution ("CEO is X" vs "CEO is Y")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it catches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pair&lt;/th&gt;
&lt;th&gt;Detected?&lt;/th&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"earns $80K" vs "earns $120K"&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Numeric divergence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"CEO is Richard" vs "CEO is Diana"&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Structural substitution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"lives in NYC" vs "lives in London"&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Value divergence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"likes spicy" vs "does not like spicy"&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Negation asymmetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"likes hiking" vs "works as engineer"&lt;/td&gt;
&lt;td&gt;No (correct)&lt;/td&gt;
&lt;td&gt;Different topics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"vegetarian" vs "vegan"&lt;/td&gt;
&lt;td&gt;No (correct)&lt;/td&gt;
&lt;td&gt;Evolution, not contradiction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
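
&lt;p&gt;Two of the four signals, numeric divergence and negation asymmetry, are simple enough to sketch without an embedding model. The code below assumes the pair has already passed the same-topic similarity gate. The 5% tolerance, the negation word list, and all function names are assumptions for illustration.&lt;/p&gt;

```python
import re

NEGATIONS = {"not", "no", "never", "don't", "doesn't"}  # assumed word list


def numbers_in(text):
    """Extract numbers such as 80,000 or 120K as floats."""
    values = []
    for digits, suffix in re.findall(r"(\d[\d,]*(?:\.\d+)?)([kK]\b)?", text):
        value = float(digits.replace(",", ""))
        values.append(value * 1000 if suffix else value)
    return values


def numeric_divergence(a, b, tolerance=0.05):
    """True when the first number in each text differs by more than the tolerance."""
    nums_a, nums_b = numbers_in(a), numbers_in(b)
    if not nums_a or not nums_b:
        return False
    x, y = nums_a[0], nums_b[0]
    return abs(x - y) > tolerance * max(abs(x), abs(y))


def negation_asymmetry(a, b):
    """True when exactly one of the two texts contains a negation word."""
    has_a = bool(NEGATIONS.intersection(a.lower().split()))
    has_b = bool(NEGATIONS.intersection(b.lower().split()))
    return has_a != has_b


def contradiction_signals(a, b):
    """Names of the signals that fire for a same-topic pair."""
    signals = []
    if numeric_divergence(a, b):
        signals.append("numeric_divergence")
    if negation_asymmetry(a, b):
        signals.append("negation_asymmetry")
    return signals


print(contradiction_signals("User earns $80,000", "User earns $120K"))
# ['numeric_divergence']
```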

&lt;h3&gt;
  
  
  4. Update Propagation
&lt;/h3&gt;

&lt;p&gt;Store fact A, then a correction A'. A query about A should now return A', not A. The metric also checks derived facts that depended on A.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Forgetting Quality
&lt;/h3&gt;

&lt;p&gt;Delete specific items, then verify: deleted items are gone, retained items survive. The score is the harmonic mean of forgetting precision and retention rate.&lt;/p&gt;
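
&lt;p&gt;A minimal sketch of that score follows. It assumes forgetting precision means "share of deleted keys that are really gone" and retention rate means "share of kept keys that are still retrievable". Those definitions and the function name are assumptions, not taken from memeval's source.&lt;/p&gt;

```python
def forgetting_quality(deleted, retained, present):
    """Harmonic mean of forgetting precision and retention rate.

    deleted:  keys that were asked to be deleted
    retained: keys that should survive
    present:  keys still retrievable after the deletions
    """
    deleted, retained, present = set(deleted), set(retained), set(present)
    forgotten = len(deleted - present) / len(deleted) if deleted else 1.0
    kept = len(retained.intersection(present)) / len(retained) if retained else 1.0
    if forgotten + kept == 0:
        return 0.0
    return 2 * forgotten * kept / (forgotten + kept)


# One of two deletions did not take effect; both retained items survived.
score = forgetting_quality({"a", "b"}, {"c", "d"}, {"b", "c", "d"})
print(round(score, 4))  # 0.6667
```

&lt;p&gt;The harmonic mean punishes lopsided behavior: a backend that wipes everything forgets perfectly but retains nothing, so it scores 0 rather than 0.5.&lt;/p&gt;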

&lt;h3&gt;
  
  
  6. Latency and Cost
&lt;/h3&gt;

&lt;p&gt;We track p50/p95/p99 separately for reads and writes. Writes get a 5x more lenient target because API-based providers (like Mem0 with OpenAI) need LLM calls on every write.&lt;/p&gt;
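
&lt;p&gt;As a rough illustration, the percentiles and the relaxed write target could be computed as below. The nearest-rank percentile method, the comparison against p95, and the function names are assumptions. The samples must be non-empty.&lt;/p&gt;

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms, target_ms, is_write=False):
    """Summarize latencies and compare p95 against the (possibly relaxed) target."""
    # Writes get a 5x more lenient target, as described above.
    limit = target_ms * 5 if is_write else target_ms
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    report["within_target"] = limit >= report["p95"]
    return report


samples = list(range(1, 101))  # synthetic latencies: 1 ms to 100 ms
print(latency_report(samples, target_ms=50))
# {'p50': 50, 'p95': 95, 'p99': 99, 'within_target': False}
print(latency_report(samples, target_ms=50, is_write=True)["within_target"])  # True
```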

&lt;h3&gt;
  
  
  7. Privacy Isolation
&lt;/h3&gt;

&lt;p&gt;Plant sentinel values for User A, then search from User B's context. This is a binary metric: any leak at all means the system fails.&lt;/p&gt;
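
&lt;p&gt;The sentinel approach can be sketched with two toy stores, one that scopes searches per user and one with a deliberate bug. Every class, function, and value below is made up for illustration.&lt;/p&gt;

```python
SENTINEL = "sk-sentinel-0000"  # made-up value planted for user A


class ScopedStore:
    """Toy store that keeps a separate list of memories per user."""

    def __init__(self):
        self._by_user = {}

    def write(self, user_id, content):
        self._by_user.setdefault(user_id, []).append(content)

    def search(self, user_id, query):
        own = self._by_user.get(user_id, [])
        return [m for m in own if query.lower() in m.lower()]


class LeakyStore(ScopedStore):
    """Same store with a bug: search ignores the user boundary."""

    def search(self, user_id, query):
        everything = [m for items in self._by_user.values() for m in items]
        return [m for m in everything if query.lower() in m.lower()]


def privacy_isolation(store):
    """Return 1.0 if user B never sees user A's sentinel, else 0.0."""
    store.write("user_a", f"My API key is {SENTINEL}")
    store.write("user_b", "My API key is rotated monthly")
    leaked = any(SENTINEL in hit for hit in store.search("user_b", "api key"))
    return 0.0 if leaked else 1.0


print(privacy_isolation(ScopedStore()))  # 1.0
print(privacy_isolation(LeakyStore()))   # 0.0
```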




&lt;h2&gt;
  
  
  The Failure Visualizer
&lt;/h2&gt;

&lt;p&gt;This is what makes memeval different from a benchmark. When a scenario fails, you need to know why.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;memeval diagnose &lt;span class="nt"&gt;--adapter&lt;/span&gt; in_memory &lt;span class="nt"&gt;--failures-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stale Data Supersession -- FAILED
Timeline
  Setup
    WRITE ceo_old      -- "CEO is Richard Lawson"
  Steps
    WRITE ceo_new      -- "CEO is Diana Park"
    SEARCH FAILED "Who is the CEO?" -&amp;gt; 4 results
      expected "Diana Park" -- NOT FOUND
      Retrieved:
        The company CEO is Richard Lawson (score: 0.50)
        Product pricing: Basic plan is $10/month (score: 0.25)

  Metric: update_propagation  0.667 &amp;lt; 0.700  FAIL
  Metric: recall_accuracy     0.667 &amp;lt; 0.700  FAIL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can immediately see: the search for "Who is the CEO?" returned the old value ("Richard Lawson" at score 0.50) instead of the new one ("Diana Park"). The system stored both but retrieves the wrong one.&lt;/p&gt;

&lt;p&gt;This is not a number on a dashboard. This is a specific, actionable failure that a developer can debug.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarking Real Providers
&lt;/h2&gt;

&lt;p&gt;We ran memeval against Mem0 (self-hosted with gpt-4o-mini), Zep Cloud, Letta Cloud, and LangGraph's InMemoryStore. The scorecard below covers three of the runs: the in_memory adapter, Mem0, and LangGraph.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              InMemory  Mem0    LangGraph
  recall      0.879     1.000   1.000
  relevance   0.727     0.904   0.657
  consistency 0.838     0.917   0.838
  update_prop 0.708     1.000   1.000
  forgetting  1.000     1.000   1.000
  latency     1.000     0.840   1.000
  privacy     1.000     1.000   1.000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key findings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0's LLM extraction genuinely improves recall.&lt;/strong&gt; It doesn't just store raw text. It extracts facts, which makes semantic search significantly better. But it comes at a cost: write p95 = 3,500ms because every write calls OpenAI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0 stores contradictions side by side.&lt;/strong&gt; "User is vegetarian" and "User is vegan" both exist in the store. There is no automatic resolution. Our consistency metric caught this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zep's graph processing is async.&lt;/strong&gt; Write a fact, immediately search for it, and it is not found. The knowledge graph needs time to process. This is an architectural tradeoff, not a bug, but it affects real-time agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph has perfect recall and update propagation&lt;/strong&gt; but weaker relevance ranking. It returns more results but doesn't rank them as precisely as Mem0's vector search.&lt;/p&gt;

&lt;p&gt;These findings would be hard to reach without standardized testing across providers. Each provider's own benchmarks test different things in different ways. memeval makes them comparable.&lt;/p&gt;




&lt;h2&gt;
  
  
  LongMemEval Integration
&lt;/h2&gt;

&lt;p&gt;For credibility beyond custom scenarios, memeval integrates the LongMemEval benchmark (Wu et al., ICLR 2025), which contains 500 QA pairs derived from multi-session conversations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;memeval longmemeval &lt;span class="nt"&gt;--adapter&lt;/span&gt; mem0 &lt;span class="nt"&gt;--scoring&lt;/span&gt; embedding &lt;span class="nt"&gt;--limit&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key difference from the paper: memeval tests retrieval only, not end-to-end QA. The paper asks "can the system answer correctly?" memeval asks "did the memory surface the right facts?" This isolates memory quality from LLM generation quality.&lt;/p&gt;

&lt;p&gt;Reference baselines from the paper: GPT-4o scores 60.6%, ChatGPT with memory scores 57.7%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Python package: memoryeval (PyPI)
  Import name:    memeval

  Core:        pydantic, pyyaml, click, rich, numpy
  Embeddings:  sentence-transformers (optional)
  NLI:         transformers + torch (optional)
  LLM Judge:   anthropic or openai SDK (optional)
  Benchmark:   huggingface_hub (optional)

  Adapters:    mem0ai, zep-cloud, letta-client,
               langgraph, crewai (all optional)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything beyond the core is optional. Install only what you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval              &lt;span class="c"&gt;# core only&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval[mem0]        &lt;span class="c"&gt;# + Mem0 adapter&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval[langgraph]   &lt;span class="c"&gt;# + LangGraph adapter&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval[crewai]      &lt;span class="c"&gt;# + CrewAI adapter&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval[all]         &lt;span class="c"&gt;# everything&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;If you are building AI agents with memory, try it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;memoryeval
memeval run &lt;span class="nt"&gt;--adapter&lt;/span&gt; in_memory
memeval diagnose &lt;span class="nt"&gt;--adapter&lt;/span&gt; in_memory &lt;span class="nt"&gt;--failures-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/Anupam1612/memeval" rel="noopener noreferrer"&gt;https://github.com/Anupam1612/memeval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback, issues, and contributions welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
