<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shane Farkas</title>
    <description>The latest articles on DEV Community by Shane Farkas (@shane-farkas).</description>
    <link>https://dev.to/shane-farkas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3870105%2F34df5e35-256a-4ca0-b8e8-095342bca27b.jpg</url>
      <title>DEV Community: Shane Farkas</title>
      <link>https://dev.to/shane-farkas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shane-farkas"/>
    <language>en</language>
    <item>
      <title>I wanted to build an Agent Memory System and blundered my way into 92% on LongMemEval</title>
      <dc:creator>Shane Farkas</dc:creator>
      <pubDate>Sat, 11 Apr 2026 11:11:16 +0000</pubDate>
      <link>https://dev.to/shane-farkas/i-wanted-to-build-an-agent-memory-system-and-blundered-my-way-into-92-on-longmemeval-3b27</link>
      <guid>https://dev.to/shane-farkas/i-wanted-to-build-an-agent-memory-system-and-blundered-my-way-into-92-on-longmemeval-3b27</guid>
      <description>&lt;p&gt;Like most users of AI agents like Claude Code, I have been frustrated by the agent memory problem. The models have gotten extremely good and no longer lose focus in one long conversation like they used to, but across sessions the memory is pretty spotty whether it's a conversation with an LLM where it recalls imperfect or irrelevant data from previous chats, or a new Claude Code session where I feel like it’s Groundhog Day onboarding a brand new employee who’s smart and talented but knows nothing about my world. So I started looking into the various memory systems. I tried a folder of markdown files, Obsidian vaults etc. but every AI memory system I tried had the same problem: dump text into a vector store, retrieve by cosine similarity, hope for the best. &lt;/p&gt;

&lt;p&gt;It works fine for "what did we talk about last week?" but falls apart the moment you need real reasoning like when facts contradict each other, when the answer requires connecting information from three different conversations, or when the user asks "what changed since January?"&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://github.com/shane-farkas/memento" rel="noopener noreferrer"&gt;&lt;strong&gt;Memento&lt;/strong&gt;&lt;/a&gt;, a bitemporal knowledge graph memory system for AI agents. Then I put it through the best long-term memory benchmark I could find, &lt;a href="https://arxiv.org/abs/2410.10813" rel="noopener noreferrer"&gt;LongMemEval&lt;/a&gt;. This is the story of what worked, what didn't, and what I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Vector Store Memory
&lt;/h2&gt;

&lt;p&gt;Most AI memory looks like this: the user says something, you embed it into a vector and store it; later you embed the query, find the nearest neighbors, and paste the results into a prompt. This is basically document search. It has no concept of entities, so "John" and "John Smith" are two different chunks. It has no awareness of time, so a fact from January looks the same as one from yesterday. And it has no ability to detect when new information contradicts old information.&lt;/p&gt;
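The naive pipeline is only a few lines. In this toy sketch a bag-of-words `embed` stands in for a real embedding model, but the store-and-nearest-neighbor shape is the same (all names here are illustrative):

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words counts. A real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store = []  # (text, vector) pairs: no entities, no time, no contradiction checks

def ingest(text):
    store.append((text, embed(text)))

def recall(query, k=2):
    qv = embed(query)
    return [t for t, _ in sorted(store, key=lambda p: -cosine(qv, p[1]))[:k]]

ingest("John is VP of Sales.")
ingest("The weather was nice in January.")
print(recall("What is John's role?", k=1))
```

Every limitation in the paragraph above falls out of this shape: the store is a flat list of chunks, so there is nothing to attach an entity, a timestamp, or a contradiction to.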

&lt;p&gt;I wanted memory that could track entities, time, and contradictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Memento Does Differently
&lt;/h2&gt;

&lt;p&gt;Memento builds a knowledge graph from conversations. When you ingest text, it extracts entities (people, organizations, projects) and their properties, then resolves them against existing entities — "John," "John Smith," and "the sales VP" become one node via tiered matching: exact, fuzzy, phonetic, embedding similarity, then an LLM tiebreaker. It detects contradictions — if John's title was "VP of Sales" and now it's "SVP of Sales," that gets flagged. It tracks time bitemporally, recording both when a fact was true in the world and when the system learned it. And it stores verbatim text as a fallback — the raw conversation is always there via FTS5 and vector search, so extraction errors don't mean lost information.&lt;/p&gt;
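As a rough illustration of the tiered idea (not Memento's actual code: the embedding and LLM tiers are omitted since they need external services, and the Soundex here is deliberately minimal):

```python
import difflib

def soundex(name):
    # Minimal Soundex sketch for the phonetic tier; real systems use sturdier variants.
    codes = {"b": "1", "f": "1", "p": "1", "v": "1",
             "c": "2", "g": "2", "j": "2", "k": "2", "q": "2", "s": "2", "x": "2", "z": "2",
             "d": "3", "t": "3", "l": "4", "m": "5", "n": "5", "r": "6"}
    name = name.lower()
    out, last = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        last = code
    return (out + "000")[:4]

def resolve(mention, known):
    # Tiered matching: exact, then fuzzy, then phonetic. Memento adds embedding
    # similarity and an LLM tiebreaker as later tiers (not shown here).
    for k in known:                                  # tier 1: exact
        if mention.lower() == k.lower():
            return k, "exact"
    for k in known:                                  # tier 2: fuzzy string similarity
        if difflib.SequenceMatcher(None, mention.lower(), k.lower()).ratio() > 0.8:
            return k, "fuzzy"
    for k in known:                                  # tier 3: phonetic, on first names
        if soundex(mention.split()[0]) == soundex(k.split()[0]):
            return k, "phonetic"
    return None, "new-entity"                        # would fall to embedding/LLM tiers

print(resolve("John Smith", ["John Smith", "Alpha Corp"]))  # matches at the exact tier
print(resolve("Jon Smith", ["John Smith", "Alpha Corp"]))   # matches at the fuzzy tier
```

The ordering matters: cheap, high-precision tiers run first, and each later tier only sees mentions the earlier tiers could not settle.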

&lt;p&gt;When you recall information, it doesn't just search vectors. It traverses the graph: "What should I know before my meeting with John?" finds John's node, walks his relationships to Alpha Corp, finds Alpha Corp's pending acquisition, and composes a briefing — all within a token budget you specify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Agent / LLM
         │ (Query/Ingest)
         ▼
  Retrieval Engine  &amp;lt;───&amp;gt;  Ingestion Pipeline
         │                     │
         ▼                     ▼
    Temporal Knowledge Graph (SQLite)
         │
         ├── Consolidation Engine (decay, dedup, prune)
         ├── Verbatim Fallback (FTS5 + vector search)
         └── Privacy Layer (export, audit, hard delete)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
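The recall traversal described above can be pictured as a budgeted graph walk. A toy sketch, with made-up nodes and a crude word count standing in for real token accounting:

```python
from collections import deque

# Illustrative graph: node and relationship names are made up for this example.
GRAPH = {
    "John Smith": [("works_at", "Alpha Corp")],
    "Alpha Corp": [("acquiring", "Beta Inc")],
    "Beta Inc": [],
}

def recall(start, token_budget=20):
    # Walk relationships breadth-first, emitting facts until the budget runs out.
    facts, used = [], 0
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        for rel, other in GRAPH.get(node, []):
            fact = f"{node} {rel} {other}"
            cost = len(fact.split())          # crude stand-in for a token count
            if used + cost > token_budget:
                return facts
            facts.append(fact)
            used += cost
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return facts

print(recall("John Smith"))
```

Starting at John's node, the walk reaches Alpha Corp and then the pending acquisition of Beta Inc, stopping the moment the budget would be exceeded, which is what lets the caller cap the briefing size.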



&lt;h2&gt;
  
  
  Finding a Benchmark
&lt;/h2&gt;

&lt;p&gt;I wanted to evaluate against something more rigorous than my own hand-picked examples. I found &lt;strong&gt;LongMemEval&lt;/strong&gt;, a benchmark specifically designed for long-term conversational memory. It has 500 questions across five categories of increasing difficulty:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single-session recall&lt;/strong&gt; — facts stated by the user or assistant in one conversation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Preference tracking&lt;/strong&gt; — applying user preferences revealed in past conversations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-session reasoning&lt;/strong&gt; — synthesizing information scattered across multiple conversations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge updates&lt;/strong&gt; — returning the latest value when facts change over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Temporal reasoning&lt;/strong&gt; — understanding when events happened and their chronological order.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each question comes with haystack sessions (the conversations containing the answer) and a reference answer. A GPT-4o judge compares the system's output against the reference, following the LongMemEval paper's methodology; all runs used Claude Sonnet as the extraction and reasoning backbone. Each question also has an abstention variant where the correct answer is "I don't know," which tests that the system doesn't hallucinate.&lt;/p&gt;

&lt;p&gt;I ran against the oracle variant: evidence-only sessions with no distractors, 1–6 sessions per question. A clean test of whether the system can extract and reason over information it definitely has access to.&lt;/p&gt;

&lt;h2&gt;
  
  
  First Contact — 91.0%
&lt;/h2&gt;

&lt;p&gt;I'd just open-sourced Memento and didn't have a benchmark harness yet. I wrote one, ran a 5-question smoke test (5/5, just checking the plumbing), then kicked off a full 500-question run.&lt;/p&gt;

&lt;p&gt;But before that run, I'd already spotted two gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The timestamp gap.&lt;/strong&gt; Memento had no way to know &lt;em&gt;when&lt;/em&gt; a conversation happened. For temporal reasoning questions like "which happened first, X or Y?" the system was guessing. Fix: pipe session dates into ingestion as timestamps and prepend &lt;code&gt;[Conversation date: ...]&lt;/code&gt; headers to the text.&lt;/p&gt;
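The timestamp fix is mechanical. A sketch of the idea; the function name and everything beyond the quoted header format are illustrative:

```python
from datetime import datetime

def prepare_for_ingestion(session_text, session_date):
    # Prepend a date header so temporal questions ("which happened first?") have
    # an anchor, and return the timestamp alongside for bitemporal bookkeeping.
    ts = datetime.fromisoformat(session_date)
    header = f"[Conversation date: {ts.strftime('%Y-%m-%d')}]"
    return header + "\n" + session_text, ts

text, ts = prepare_for_ingestion("User: I started the new job today.", "2026-01-15")
print(text.splitlines()[0])  # [Conversation date: 2026-01-15]
```

The header makes the date visible to the extraction LLM in-band, while the structured timestamp feeds the graph's valid-time record.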

&lt;p&gt;&lt;strong&gt;The verbatim search gap.&lt;/strong&gt; Memento was ingesting each session as one big text block. If a user asked about a specific phrase, FTS5 was searching across entire concatenated sessions instead of individual turns. Fix: store each individual turn separately in the verbatim store for fine-grained keyword search, while still ingesting full sessions for entity extraction.&lt;/p&gt;
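The per-turn fix can be demonstrated with SQLite's FTS5 directly; the schema here is illustrative, not Memento's:

```python
import sqlite3

# In-memory FTS5 index over individual turns (toy schema for illustration).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE turns USING fts5(session_id, speaker, content)")

def ingest_session(session_id, turns):
    # Store each turn separately so a keyword query hits a single utterance,
    # not the whole concatenated session.
    for speaker, content in turns:
        db.execute("INSERT INTO turns VALUES (?, ?, ?)", (session_id, speaker, content))

ingest_session("s1", [
    ("user", "My dentist appointment is on the 14th."),
    ("assistant", "Noted, I'll remind you."),
])

rows = db.execute("SELECT content FROM turns WHERE turns MATCH 'dentist'").fetchall()
print(rows[0][0])
```

With one row per turn, the match comes back as the exact utterance rather than a multi-thousand-word session blob, which is what makes fine-grained phrase lookup work.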

&lt;p&gt;The first full run came back at 455/500 — &lt;strong&gt;91.0% overall, 92.4% task-averaged.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (assistant)&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (user)&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (preference)&lt;/td&gt;
&lt;td&gt;93.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal reasoning&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge update&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-session&lt;/td&gt;
&lt;td&gt;85.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Single-session questions were almost perfect. Multi-session at 85.0% was the weakest link.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Retrieval Trap — 89.6%
&lt;/h2&gt;

&lt;p&gt;The natural instinct: multi-session is weak, so give it more context. I widened the retrieval window — &lt;code&gt;top_k&lt;/code&gt; from 10 to 20, conversation cap from 5 to 10, token budget from 4K to 8K. I also added some prompt improvements: don't ask clarifying questions for preference queries, enumerate before counting.&lt;/p&gt;

&lt;p&gt;The full 500-question run came back: &lt;strong&gt;89.6% overall, 91.0% task-averaged.&lt;/strong&gt; A regression. Wider retrieval hurt single-session accuracy by 4–6 percentage points through context dilution, while only helping multi-session by +0.7%. When you dump 8K tokens of loosely related context into a prompt, the LLM gets confused and starts second-guessing itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: more retrieval is not better retrieval.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I reverted the retrieval widening but kept the prompt improvements, and ran again. &lt;strong&gt;91.2% overall, 92.4% task-averaged&lt;/strong&gt; — the best result yet. The prompt changes alone were pulling their weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Diminishing Returns — 86.0% → 90.8%
&lt;/h2&gt;

&lt;p&gt;Next I fixed five bugs (conflict references, idempotent decay, recall entity depth, soft-delete for relationships, confirmation counts) and implemented two new features: &lt;strong&gt;adaptive retrieval&lt;/strong&gt;, which classifies queries as "wide" (counting/enumeration) vs. "narrow" (single-fact recall) and adjusts parameters accordingly, and &lt;strong&gt;two-pass counting&lt;/strong&gt;, which enumerates items first, then counts.&lt;/p&gt;
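Adaptive retrieval can be as simple as a cue-based classifier gating the retrieval parameters. A heuristic sketch: the cue list is made up, and the narrow defaults mirror the 10-result/4K-token settings that scored best in the earlier runs:

```python
# Cues suggesting a "wide" counting/enumeration query (illustrative list).
WIDE_CUES = ("how many", "list", "all the", "each", "every", "what changed")

def classify(query):
    q = query.lower()
    return "wide" if any(cue in q for cue in WIDE_CUES) else "narrow"

def retrieval_params(query):
    # Wide queries get a bigger window; narrow single-fact queries keep the
    # tight budget that avoided the context-dilution regression.
    if classify(query) == "wide":
        return {"top_k": 20, "token_budget": 8000}
    return {"top_k": 10, "token_budget": 4000}

print(retrieval_params("How many times did I mention the gym?"))  # wide query
print(retrieval_params("What is John's title?"))                  # narrow query
```

The point is that the widening only happens for the queries that need it, so single-fact recall never pays the dilution cost.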

&lt;p&gt;Quick validation on 25 questions after the bug fixes: 92.0%. A 50-question stratified sample with the new features: 92.0%. Looking good.&lt;/p&gt;

&lt;p&gt;Full 500-question run: &lt;strong&gt;86.0% overall, 88.8% task-averaged.&lt;/strong&gt; A disaster. Two-pass counting helped roughly 5 multi-session questions but hurt 10+ temporal and knowledge-update questions. I also tested self-verification (having the LLM double-check its own answer), which dropped accuracy to 68%. Each additional LLM call is an opportunity to corrupt a correct answer.&lt;/p&gt;

&lt;p&gt;I ripped out two-pass counting, kept adaptive retrieval only, and ran the final evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final result: 454/500 — 90.8% overall, 92.2% task-averaged.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Correct&lt;/th&gt;
&lt;th&gt;Total&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (assistant)&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;98.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (user)&lt;/td&gt;
&lt;td&gt;68&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;97.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-session (preference)&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;93.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Temporal reasoning&lt;/td&gt;
&lt;td&gt;119&lt;/td&gt;
&lt;td&gt;133&lt;/td&gt;
&lt;td&gt;89.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge update&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;88.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-session&lt;/td&gt;
&lt;td&gt;115&lt;/td&gt;
&lt;td&gt;133&lt;/td&gt;
&lt;td&gt;86.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's how all five full runs played out:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Overall&lt;/th&gt;
&lt;th&gt;Task-avg&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;td&gt;92.4%&lt;/td&gt;
&lt;td&gt;Baseline (timestamps + verbatim turns)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;89.6%&lt;/td&gt;
&lt;td&gt;91.0%&lt;/td&gt;
&lt;td&gt;Wider retrieval (regression)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3&lt;/td&gt;
&lt;td&gt;91.2%&lt;/td&gt;
&lt;td&gt;92.4%&lt;/td&gt;
&lt;td&gt;Revert widening, keep prompt improvements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v4&lt;/td&gt;
&lt;td&gt;86.0%&lt;/td&gt;
&lt;td&gt;88.8%&lt;/td&gt;
&lt;td&gt;Two-pass counting (regression)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v5&lt;/td&gt;
&lt;td&gt;90.8%&lt;/td&gt;
&lt;td&gt;92.2%&lt;/td&gt;
&lt;td&gt;Remove two-pass, adaptive retrieval only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The headline number went &lt;em&gt;down&lt;/em&gt; from v1 to v5 (91.0% → 90.8%), but the category breakdown tells a different story. Single-session assistant dropped from 100% to 98.2% — likely noise on a 56-question sample — while single-session user climbed from 94.3% to 97.1% and multi-session improved from 85.0% to 86.5%. The task-averaged score stayed nearly identical at 92.2% vs. 92.4%. What really changed was robustness: adaptive retrieval made the system less brittle across query types, even if the aggregate number didn't move.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Knowledge graphs beat vector stores for structured memory.&lt;/strong&gt; Entity resolution and temporal tracking make the difference. When you need to answer "what changed about John's role since January?" you need entities, timestamps, and contradiction detection — not cosine similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval quality beats quantity.&lt;/strong&gt; Focused retrieval with a well-crafted prompt consistently outperformed broad context. The 4K token budget with 10 results beat the 8K budget with 20 results every time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-pass generation is a trap.&lt;/strong&gt; Each additional LLM call — self-verification, two-pass counting, chain-of-thought validation — is an opportunity to corrupt a correct answer. The simplest pipeline that works is the best pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small samples lie.&lt;/strong&gt; A 50-question sample scored 92% on the same configuration that scored 86% on 500 questions. Always run the full benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Memento works with any LLM provider (Claude, GPT, Gemini, Llama, Ollama) and any MCP client (Claude Desktop, Cursor, Claude Code, Cline, Windsurf).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pick your LLM provider:&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memento-memory[anthropic]   &lt;span class="c"&gt;# Claude (ANTHROPIC_API_KEY)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memento-memory[openai]      &lt;span class="c"&gt;# GPT   (OPENAI_API_KEY)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memento-memory[gemini]      &lt;span class="c"&gt;# Gemini (GOOGLE_API_KEY)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;memento-memory[openai]      &lt;span class="c"&gt;# Ollama (set MEMENTO_LLM_PROVIDER=ollama)&lt;/span&gt;

memento-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use it as a Python library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memento&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John Smith is VP of Sales at Alpha Corp.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alpha Corp is acquiring Beta Inc.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What should I know about John?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is MIT-licensed: &lt;strong&gt;&lt;a href="https://github.com/shane-farkas/memento-memory" rel="noopener noreferrer"&gt;github.com/shane-farkas/memento-memory&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
