
Penfield

Proposal: A Real Benchmark for Long-Term AI Memory Systems

The Problem

Nearly every AI memory system is publishing scores on benchmarks that don't measure what they claim to.

We audited LoCoMo and found 6.4% of the answer key is factually wrong (99 errors in 1,540 questions), the LLM judge accepts 63% of intentionally wrong answers, and 56% of per-category system comparisons are statistically indistinguishable from noise.

LongMemEval-S uses ~115K tokens per question — every frontier model can hold that in context. It's a better context window test than a memory test.

Meanwhile, each system uses its own ingestion, its own answer generation prompt, and sometimes its own judge configuration — then publishes scores in the same table as if they share a common methodology. The Mem0/Zep benchmark dispute illustrates this perfectly: two companies testing the same systems, arriving at wildly different numbers.

Ten Design Principles

1. Corpus must exceed context windows

1–2 million tokens of total context. Large enough to require genuine memory retrieval. Small enough to be economically feasible for independent researchers.

2. Corpus must model real agent usage

Multi-session conversations between one person and an AI assistant over ~6 months. Work projects, personal preferences, corrections, evolving facts — not disconnected chit-chat between strangers.

3. Ingestion is the system's problem, but must be disclosed

Each system ingests however it wants. But it must publish: ingestion method, model used, embedding model, total cost, and total time.

4. Answer generation: standardized OR fully disclosed

  • Standard track: prescribed model, prescribed prompt, single-shot. The only variable is what memory retrieves. Apples-to-apples.
  • Open track: use whatever you want, fully disclosed, reported separately. Never mixed with standard-track scores.

5. Equal statistical power across categories

400 questions per category. LoCoMo's smallest category has only 96 questions, giving Wilson score margins of error so wide that most per-category score differences are indistinguishable from noise.
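To make the sample-size point concrete, here is a small Python sketch computing the half-width of the 95% Wilson score interval at both category sizes. The 70% accuracy figure is illustrative, not a number from the proposal:

```python
import math

def wilson_margin(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score confidence interval for a proportion
    (z = 1.96 gives ~95% coverage)."""
    denom = 1 + z**2 / n
    return (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))

# An illustrative system at 70% accuracy, measured on LoCoMo's smallest
# category (96 questions) versus the proposed 400 questions per category:
print(f"n=96:  +/-{wilson_margin(0.70, 96):.3f}")   # roughly +/-0.090
print(f"n=400: +/-{wilson_margin(0.70, 400):.3f}")  # roughly +/-0.045
```

At 96 questions, two systems can differ by nearly nine points of accuracy and still have overlapping intervals; quadrupling the category size roughly halves the margin.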

6. Human-verified ground truth

Error-rate target: <1%, via model-council pre-screening, crowd-sourced review with bounties, and expert tiebreakers.

7. Adversarially validated judge

Generate intentionally wrong answers before launch. The judge must reject >95% of them. No more judges that can't distinguish vague, topically adjacent answers from correct ones.
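The validation loop itself is simple. In this sketch the function names, the toy word-overlap judge, and the two example triples are ours, purely for illustration:

```python
def judge_rejection_rate(judge, adversarial_set) -> float:
    """Fraction of intentionally wrong answers the judge rejects.
    `judge(question, gold, candidate)` returns True if it accepts the
    candidate as correct; `adversarial_set` holds (question, gold, wrong)
    triples generated before launch."""
    rejected = sum(1 for q, gold, wrong in adversarial_set
                   if not judge(q, gold, wrong))
    return rejected / len(adversarial_set)

# A toy judge that accepts any answer sharing a word with the gold answer --
# exactly the "topically adjacent" failure mode described above:
def lenient_judge(question, gold, candidate):
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))

adversarial = [
    ("Where does Ana work?", "at the hospital", "at the hospital gift shop"),
    ("What city is Sam in?", "Lisbon", "Porto"),
]
rate = judge_rejection_rate(lenient_judge, adversarial)
print(f"rejection rate: {rate:.2f}, passes >0.95 bar: {rate > 0.95}")
```

The lenient judge rejects only the answer with no word overlap, so it fails the bar; a production judge would have to clear 95% on the full adversarial set before launch.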

8. Abstention is scored

"I don't know" when the answer IS in the corpus: 0.10. Confidently wrong: 0.0. A system that knows its limits should beat one that hallucinates.

9. Multiple scoring dimensions

Accuracy alone hides everything interesting. The scorecard includes: accuracy (standard + open), retrieval precision (tokens per question), latency (p50/p95), abstention quality, and supersession handling.
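One way to keep those dimensions from collapsing back into a single headline number is to report them as a structured record. A sketch, with field names and all numeric values being our illustrative placeholders, not anything from the proposal:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Scorecard:
    accuracy_standard: float        # standard track: prescribed model + prompt
    accuracy_open: Optional[float]  # open track, reported separately (or absent)
    retrieval_tokens_per_q: float   # mean context tokens per question
    latency_p50_ms: float
    latency_p95_ms: float
    abstention_quality: float       # score on adversarial-abstention questions
    supersession_accuracy: float    # score on supersession/correction questions

# Illustrative placeholder values only:
card = Scorecard(0.71, None, 1850.0, 420.0, 1300.0, 0.64, 0.58)
print(asdict(card))
```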

10. Context-stuffing is measured, not hidden

Systems report the token count of context provided to the answer generation model for each question.

Six Question Categories

2,400 questions total — 400 per category:

Direct recall — Can you retrieve a specific fact that was stated explicitly?

Temporal reasoning — Can you reason about when things happened and how facts changed over time?

Multi-hop inference — Can you connect information from different conversations to answer a question never explicitly discussed?

Supersession and correction — Can you track when facts have been updated, corrected, or superseded?

Cognitive inference — Can you make connections that require understanding implications rather than explicit statements?

Adversarial abstention — Can you correctly identify when you DON'T have the information?

What We're NOT Doing

  • Not prescribing ingestion method
  • Not requiring a specific embedding model
  • Not testing with outdated models
  • Not making it cost-prohibitive to run
  • Not handing down a finished spec — this is a proposal and an invitation to collaborate

Read the Full Proposal

The complete write-up, including the corpus generation methodology, the model comparability framework, open questions, and full references, can be found here:

A Real Benchmark for Long-Term AI Memory Systems

The full LoCoMo audit with all 99 errors documented is public.


We're looking for memory system builders, benchmark designers, and researchers who share the goal of honest measurement. Feedback, criticism, and contributions welcome.
