DEV Community

Shane Farkas


I built an Agent Memory System for myself and got 90.8% (end-to-end) on LongMemEval

Like most people who use AI agents such as Claude Code, I've been frustrated by the agent memory problem. The models have gotten extremely good and no longer lose focus within a single long conversation the way they used to, but across sessions memory is spotty. Either it's a chat with an LLM that recalls imperfect or irrelevant details from previous conversations, or it's a new Claude Code session that feels like Groundhog Day: onboarding a brand-new employee who's smart and talented but knows nothing about my world.

I started looking into the various memory systems. I tried markdown files, Obsidian vaults, and a few memory tools — they all collapse into two approaches: flat vector similarity or LLM-distilled fact files. On LongMemEval's multi-session and temporal categories, vector stores drop into the 60s. Markdown does a bit better there, but loses 60% of assistant-side questions because LLM distillation skews toward user statements. Both approaches work fine for "what did we talk about last week?" but fall apart the moment you need real reasoning: when facts contradict each other, when the answer requires connecting information from three different conversations, or when the user asks "what changed since January?"

So I built Memento, a bitemporal knowledge graph memory system for AI agents. Then I put it through the best long-term memory benchmark I could find, LongMemEval. This is the story of what worked, what didn't, and what I learned along the way.

The problem with vector store memory

Standard AI memory goes like this: user says something, you embed it, store it, and later embed a query to find the nearest neighbors and shove the results into a prompt. That's document search, not memory.

It has no concept of entities - "John" and "John Smith" might be two different chunks. It has no awareness of time - a fact from January looks identical to one from yesterday. And it has no way to detect when new information contradicts old information.

I wanted memory that could track entities, time, and contradictions.

What Memento does differently

Memento builds a knowledge graph from conversations. When you ingest text, it extracts entities and their properties - people, organizations, projects - and resolves them against what it already knows. "John," "John Smith," and "the sales VP" collapse into one node using tiered matching: exact, fuzzy, phonetic, then embedding similarity, then an LLM tiebreaker.

It detects contradictions. If John's title was "VP of Sales" and now it's "SVP of Sales," that gets flagged. It tracks time bitemporally - when a fact was true in the world, and when the system learned it. And it stores verbatim text as a fallback, so extraction errors don't mean lost information. The raw conversation is always there via FTS5 and vector search.

When you query, it traverses the graph instead of just searching vectors. "What should I know before my meeting with John?" finds John's node, walks his relationships to Alpha Corp, finds Alpha Corp's pending acquisition, and assembles a briefing - all within a token budget you specify.

```text
   Agent / LLM
         │ (Query/Ingest)
         ▼
  Retrieval Engine  <───>  Ingestion Pipeline
         │                     │
         ▼                     ▼
    Temporal Knowledge Graph (SQLite)
         │
         ├── Consolidation Engine (decay, dedup, prune)
         ├── Verbatim Fallback (FTS5 + vector search)
         └── Privacy Layer (export, audit, hard delete)
```
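The briefing assembly described above is essentially a budgeted breadth-first traversal: start at the focus entity, collect its facts, then walk outward until the token budget is spent. A toy version, assuming a dict-based graph and a rough four-characters-per-token estimate (both are simplifications, not Memento's internals):

```python
from collections import deque

def assemble_briefing(graph, start, token_budget, estimate=lambda s: len(s) // 4):
    """Breadth-first walk from a focus entity, collecting facts on each
    node until the budget is spent. `graph` maps node -> (facts, neighbors)."""
    briefing, spent = [], 0
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        facts, neighbors = graph[node]
        for fact in facts:
            cost = estimate(fact)
            if spent + cost > token_budget:
                return briefing  # budget exhausted; nearest facts won
            briefing.append(fact)
            spent += cost
        for n in neighbors:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return briefing
```

Breadth-first order is what makes the budget meaningful: facts about John himself always beat facts two hops away, so a tight budget degrades gracefully instead of randomly.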

Finding a benchmark

I wanted to evaluate against real examples. LongMemEval is a benchmark built specifically for long-term conversational memory - 500 questions across five categories:

  • Single-session recall - facts stated in one conversation
  • Preference tracking - applying preferences revealed in past conversations
  • Multi-session reasoning - synthesizing information scattered across multiple conversations
  • Knowledge updates - returning the latest value when facts change
  • Temporal reasoning - understanding when events happened and their order

Each question includes haystack sessions and a reference answer. A GPT-4o judge compares the system's output against the reference, following the paper's methodology; all runs used Claude Sonnet for extraction and reasoning. Each question also has an abstention variant where the correct answer is "I don't know", which tests whether the system hallucinates.

I ran against the oracle variant: evidence-only sessions, no distractors, 1–6 sessions per question. A clean test of whether the system can extract and reason over information it definitely has.

First contact - 91.0%

I'd just open-sourced Memento and didn't have a benchmark harness yet. I wrote one, ran a 5-question smoke test (5/5, just checking the plumbing), then set up the full 500-question run.

But before that run, I'd already spotted two gaps.

The timestamp gap. Memento had no way to know when a conversation happened. For temporal reasoning questions like "which happened first, X or Y?" the system was guessing. Fix: pipe session dates into ingestion as timestamps and prepend [Conversation date: ...] headers to the text.

The verbatim search gap. Memento was ingesting each session as one big text block. If a user asked about a specific phrase, FTS5 was searching across entire concatenated sessions instead of individual turns. Fix: store each individual turn separately in the verbatim store, while still ingesting full sessions for entity extraction.
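Both fixes boil down to how a session is prepared before ingestion: one date-stamped transcript for entity extraction, plus individual turns for the verbatim store. A sketch with illustrative names (not Memento's real API):

```python
def prepare_for_ingestion(session: list[dict], session_date: str):
    """Return (full_text, turns): the date-stamped transcript for entity
    extraction, and per-turn records for the verbatim (FTS5) store."""
    header = f"[Conversation date: {session_date}]"
    full_text = header + "\n" + "\n".join(
        f"{t['role']}: {t['content']}" for t in session
    )
    turns = [
        {"text": t["content"], "role": t["role"],
         "session_date": session_date, "turn_index": i}
        for i, t in enumerate(session)
    ]
    return full_text, turns
```

The full transcript keeps cross-turn context for extraction, while the per-turn records mean a phrase search matches one turn instead of an entire concatenated session.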

The first full run: 455/500 - 91.0% overall, 92.4% task-averaged.

| Category | Accuracy |
| --- | --- |
| Single-session (assistant) | 100.0% |
| Single-session (user) | 94.3% |
| Single-session (preference) | 93.3% |
| Temporal reasoning | 91.0% |
| Knowledge update | 91.0% |
| Multi-session | 85.0% |

Single-session was nearly perfect. Multi-session at 85.0% was the weak point.

The retrieval trap - 89.6%

The obvious move: multi-session is weak, so give it more context. I widened the retrieval window: top_k from 10 to 20, the conversation cap from 5 to 10, the token budget from 4K to 8K. I also added some prompt improvements, such as "don't ask clarifying questions for preference queries" and "enumerate before counting."

Full 500-question run: 89.6% overall, 91.0% task-averaged. A regression. Wider retrieval hurt single-session accuracy by 4–6 percentage points through context dilution, while only helping multi-session by +0.7%. When you dump 8K tokens of loosely related context into a prompt, the model starts second-guessing itself on questions it had previously gotten right.

More retrieval is not better retrieval.

I reverted the retrieval widening but kept the prompt improvements and ran again: 91.2% overall, 92.4% task-averaged. The prompt changes alone were doing real work.

Diminishing returns - 86.0% → 90.8%

Next I fixed five bugs (conflict references, idempotent decay, recall entity depth, soft-delete for relationships, confirmation counts) and built two new features: adaptive retrieval, which classifies queries as "wide" (counting/enumeration) vs. "narrow" (single-fact recall) and adjusts parameters accordingly, and two-pass counting, which enumerates items first, then counts.
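Adaptive retrieval can be as simple as a keyword classifier that picks retrieval parameters per query. A hypothetical sketch; the hint list and parameter values are illustrative, not Memento's actual heuristics:

```python
import re

# Phrases that suggest the answer needs breadth (enumeration/counting)
WIDE_HINTS = re.compile(r"\b(how many|count|list|all|every|each|total)\b", re.I)

def retrieval_params(query: str) -> dict:
    """Classify a query as 'wide' (needs many sessions) or 'narrow'
    (single-fact recall, where extra context only dilutes)."""
    if WIDE_HINTS.search(query):
        return {"mode": "wide", "top_k": 20, "token_budget": 8000}
    return {"mode": "narrow", "top_k": 10, "token_budget": 4000}
```

The key property is asymmetry: wide queries get the big window that hurt accuracy globally in v2, but only when the query actually needs it, so narrow recall questions keep the tight, high-precision context.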

Quick validation on 25 questions after the bug fixes: 92.0%. A 50-question stratified sample with the new features: 92.0%. Looked good.

Full 500-question run: 86.0% overall, 88.8% task-averaged. A disaster. Two-pass counting helped maybe 5 multi-session questions but hurt 10+ temporal and knowledge-update ones. I also tested self-verification - having the model double-check its own answer - which dropped accuracy to 68%. Every additional LLM call is an opportunity to corrupt a correct answer.

I ripped out two-pass counting, kept adaptive retrieval, and ran the final evaluation.

Final result: 454/500 - 90.8% overall, 92.2% task-averaged.

| Category | Correct | Total | Accuracy |
| --- | --- | --- | --- |
| Single-session (assistant) | 55 | 56 | 98.2% |
| Single-session (user) | 68 | 70 | 97.1% |
| Single-session (preference) | 28 | 30 | 93.3% |
| Temporal reasoning | 119 | 133 | 89.5% |
| Knowledge update | 69 | 78 | 88.5% |
| Multi-session | 115 | 133 | 86.5% |

Here's how all five full runs played out:

| Run | Overall | Task-avg | What changed |
| --- | --- | --- | --- |
| v1 | 91.0% | 92.4% | Baseline (timestamps + verbatim turns) |
| v2 | 89.6% | 91.0% | Wider retrieval (regression) |
| v3 | 91.2% | 92.4% | Revert widening, keep prompt improvements |
| v4 | 86.0% | 88.8% | Two-pass counting (regression) |
| v5 | 90.8% | 92.2% | Remove two-pass, adaptive retrieval only |

The headline number went down from v1 to v5 (91.0% → 90.8%), but the breakdown tells a more interesting story. Single-session assistant dropped from 100% to 98.2% - probably noise on a 56-question sample. Single-session user climbed from 94.3% to 97.1%. Multi-session improved from 85.0% to 86.5%. Task-averaged stayed nearly flat at 92.2% vs. 92.4%. What actually changed was robustness - adaptive retrieval made the system less brittle across query types, even if the aggregate number didn't move.

What I learned

Knowledge graphs beat vector stores for structured memory. Entity resolution and temporal tracking make the difference. When you need to answer "what changed about John's role since January?" you need entities, timestamps, and contradiction detection. Cosine similarity doesn't get you there.

Retrieval quality beats quantity. Focused retrieval with a well-crafted prompt beat broad context every single time. The 4K token budget with 10 results outperformed the 8K budget with 20 results.

Multi-pass generation is a trap. Each additional LLM call - self-verification, two-pass counting, chain-of-thought validation - is another opportunity to corrupt a correct answer. The simplest pipeline that works is the right pipeline.

Small samples lie. A 50-question sample scored 92% on the same configuration that scored 86% on 500 questions. Run the full benchmark.

Try it

Memento works with any LLM provider (Claude, GPT, Gemini, Llama, Ollama) and any MCP client (Claude Desktop, Cursor, Claude Code, Cline, Windsurf).

```shell
# Pick your LLM provider:
pip install memento-memory[anthropic]   # Claude (ANTHROPIC_API_KEY)
pip install memento-memory[openai]      # GPT    (OPENAI_API_KEY)
pip install memento-memory[gemini]      # Gemini (GOOGLE_API_KEY)
pip install memento-memory[openai]      # Ollama (set MEMENTO_LLM_PROVIDER=ollama)

memento-mcp
```

Or use it as a Python library:

```python
from memento import MemoryStore

store = MemoryStore()
store.ingest("John Smith is VP of Sales at Alpha Corp.")
store.ingest("Alpha Corp is acquiring Beta Inc.")

# recall() traverses the graph, not just nearest-neighbor vectors
memory = store.recall("What should I know about John?")
print(memory.text)
```

MIT-licensed: github.com/shane-farkas/memento-memory
