DEV Community

MatrixOrigin

Benchmarking Memoria on LongMemEval: Strong Memory Retrieval, Clear Reader Separation

Memory systems for AI agents are easy to demo and much harder to evaluate.

A good anecdote can make any memory layer look impressive. A real benchmark has to answer a tougher question: when an agent is asked to recover user facts, track updates, reason over time, and connect information across sessions, does the memory system consistently surface the right context?

That is what we tested with Memoria on LongMemEval_s.

Using a single frozen retrieval snapshot from Memoria, we evaluated three different reader models on exactly the same retrieved memories, then scored all answers with a unified judge using the official LongMemEval task rules. The result is a cleaner measurement than a typical end-to-end benchmark: retrieval stays fixed, so differences in final accuracy mostly reflect how well a reader can use the memory Memoria provides.

The headline result: Memoria retrieval supported up to 88.78% overall accuracy on LongMemEval_s, with near-perfect performance on single-session recall and strong results on knowledge updates, temporal reasoning, and multi-session synthesis.

Why this benchmark matters

LongMemEval is a useful stress test because it goes beyond simple fact lookup. It evaluates whether a system can handle six distinct memory behaviors:

| Category | What it tests | Count |
| --- | --- | --- |
| SSU | Single-session user facts | 70 |
| SSA | Single-session assistant facts | 56 |
| SSP | Single-session personalization | 30 |
| KU | Knowledge updates and conflict resolution | 77 |
| TR | Temporal reasoning | 133 |
| MS | Multi-session synthesis | 133 |

There is also an Abstention subset of 30 questions, where the correct behavior is to recognize that the answer is not available from memory.

That makes LongMemEval a strong fit for evaluating an agent memory layer. A production memory system is not just supposed to store text. It needs to retrieve the right facts, preserve recency, support reasoning over timelines, and avoid pushing the model into confident hallucinations when the answer is unavailable.

Experimental setup

This run used:

  • Dataset: LongMemEval_s

  • Data file: benchmarks/longmemeval/data/longmemeval_s_cleaned.json

  • Retrieval backend: Memoria

  • Retrieval snapshot: benchmarks/longmemeval/results/retrieval_results.json

The snapshot covered 500 retrieval records, with 10 memories returned per question. One retrieval timed out (db467c8c), leaving 499 judged examples.

The evaluation pipeline was straightforward:

  1. Historical sessions were ingested into Memoria.

  2. Memoria retrieved relevant memories for each question.

  3. A unified reader prompt was generated from the same retrieval snapshot.

  4. Three reader models answered using identical retrieved context.

  5. All hypotheses were scored by a single GPT-5.4 judge using the official LongMemEval task-specific rubric.

The three readers were:

  • gpt-5.4

  • claude-opus-4.6

  • claude-sonnet-4.5

The important design choice here is that retrieval was frozen across all readers. This isolates the effect of downstream reasoning from the quality of the memory backend itself.
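The frozen-snapshot loop can be sketched in a few lines. This is an illustrative sketch only: the file layout, record fields, and callable signatures below are assumptions for the example, not Memoria's actual schema or API.

```python
import json

def evaluate_readers(snapshot_path, readers, judge):
    """Score several readers against one frozen retrieval snapshot.

    `snapshot_path` points at a JSON list of records; the question,
    retrieved memories, and reference answer fields are hypothetical.
    """
    with open(snapshot_path) as f:
        snapshot = json.load(f)

    accuracy = {}
    for name, reader in readers.items():
        verdicts = []
        for record in snapshot:
            # Every reader sees the exact same retrieved context.
            context = "\n".join(record["memories"])
            hypothesis = reader(record["question"], context)
            verdicts.append(judge(record["question"], hypothesis, record["answer"]))
        accuracy[name] = sum(verdicts) / len(verdicts)
    return accuracy
```

Because the snapshot is read once and shared, any accuracy difference between readers comes from how they use the context, not from what was retrieved.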

Overall results

| Reader run | Reader | Judge | Correct | Overall | IDK Count |
| --- | --- | --- | --- | --- | --- |
| gpt-5.4 | gpt-5.4 | gpt-5.4 | 424/499 | 84.97% | 3 |
| opus-4.6 | claude-opus-4.6 | gpt-5.4 | 443/499 | 88.78% | 0 |
| sonnet-4.5 | claude-sonnet-4.5 | gpt-5.4 | 353/499 | 70.74% | 79 |

The top-line takeaway is simple: Memoria retrieved enough useful context for a strong reader to answer nearly 89% of LongMemEval_s correctly.

That matters because this was not a jointly optimized stack. The retrieval snapshot was fixed. The judge was fixed. The only thing that changed was the reader. In other words, Memoria was already surfacing a context set strong enough to support high accuracy without changing the underlying memory results.
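Under this setup, each run's score is just an aggregation of per-question judge verdicts. A minimal sketch of that aggregation, with hypothetical field names on the judged records:

```python
from collections import defaultdict

def score_run(judged):
    """Turn per-question judge verdicts into overall and per-category
    accuracy. Each record carries a task `category` and a boolean
    `correct` flag (field names are assumptions for illustration)."""
    total, right = defaultdict(int), defaultdict(int)
    for rec in judged:
        total[rec["category"]] += 1
        right[rec["category"]] += rec["correct"]
    overall = round(100 * sum(right.values()) / sum(total.values()), 2)
    per_cat = {c: round(100 * right[c] / total[c], 2) for c in total}
    return overall, per_cat
```
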

Category breakdown

| Reader | SSU | SSA | SSP | KU | TR | MS | Abstention |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-5.4 | 98.57 | 100.00 | 86.67 | 77.92 | 88.72 | 71.43 | 56.67 |
| opus-4.6 | 100.00 | 100.00 | 76.67 | 89.61 | 90.23 | 78.95 | 93.33 |
| sonnet-4.5 | 95.71 | 100.00 | 43.33 | 58.44 | 64.66 | 64.66 | 86.67 |

These numbers tell a more interesting story than the overall score alone.

1. Memoria is extremely strong on direct factual recall

Single-session recall is close to saturated.

All three readers reached 100% on SSA, and SSU ranged from 95.71% to 100%. That suggests Memoria is consistently retrieving the right evidence for straightforward factual questions, whether the fact originated from the user or from the assistant.

This is exactly the baseline behavior a memory layer has to get right. If retrieval is weak, these categories usually collapse first. Here, they are effectively solved.

2. The real separator is not retrieval, but reasoning over retrieved memory

The largest gaps appear in:

  • Knowledge Update

  • Temporal Reasoning

  • Multi-Session

  • Personalization

Those are the categories that require more than locating a fact. The reader has to decide which memory is most recent, reconcile conflicting evidence, infer ordering, or synthesize information across sessions.

That distinction is important. A weaker result in these categories does not necessarily mean the memory backend failed. In many cases, it means the model failed to use the retrieved evidence correctly.

The strongest example is Knowledge Update. With the same Memoria retrieval snapshot, performance ranged from 58.44% to 89.61% depending on the reader. That is a large spread, and it strongly suggests the retrieved context often contained the necessary evidence, but not every model was equally good at choosing the latest valid fact.
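To see what "choosing the latest valid fact" means in code, here is a toy recency rule. The timestamped-memory structure below is an assumption for illustration, not Memoria's storage format, and a real reader does this reconciliation implicitly rather than with string matching:

```python
from datetime import date

def latest_fact(memories, keyword):
    """Resolve a knowledge-update conflict by preferring the most
    recently dated memory that mentions the keyword.
    (Memory schema is hypothetical.)"""
    hits = [m for m in memories if keyword in m["text"]]
    return max(hits, key=lambda m: m["date"])["text"] if hits else None

memories = [
    {"date": date(2024, 1, 5), "text": "User's phone is a Pixel 7"},
    {"date": date(2024, 6, 2), "text": "User's phone is an iPhone 15"},
]
```

A reader that ignores recency and answers from the first matching memory would get this category systematically wrong even with perfect retrieval.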

3. Memoria supports strong temporal and cross-session reasoning

Two of the hardest LongMemEval categories are Temporal Reasoning (TR) and Multi-Session (MS).

On the frozen Memoria retrieval snapshot, the best reader reached:

  • 90.23% on TR

  • 78.95% on MS

That is a strong result for a memory benchmark. These are not simple quote-retrieval tasks. They require a model to read multiple memory items, track dates or ordering, and compose an answer that reflects the correct timeline or session-level synthesis.

In practice, this is much closer to how agent memory is actually used. Real agents do not just need to remember that a user likes tea. They need to remember what changed, when it changed, and how different conversations relate.
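Temporal questions hinge on reconstructing order from scattered memory items. A toy sketch of that ordering step, assuming (purely for illustration) that each retrieved item carries an ISO-format timestamp:

```python
def build_timeline(memories):
    """Order memory items chronologically so ordering questions
    ("what happened first?") can be answered from the sorted list.
    ISO-8601 date strings sort correctly as plain strings."""
    return [m["text"] for m in sorted(memories, key=lambda m: m["timestamp"])]
```
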

4. Abstention reveals calibration, not just recall

The abstention subset is especially useful because it tests whether a model can recognize when memory is insufficient.

Here the best result came from claude-opus-4.6, which achieved 93.33% on abstention. claude-sonnet-4.5 was also relatively strong at 86.67%, while gpt-5.4 lagged at 56.67%.

The IDK Count helps explain why. GPT-5.4 emitted the exact string "I don't know" only 3 times, while Sonnet did so 79 times. GPT-5.4 was much more aggressive about answering; Sonnet was much more conservative.

That does not change the core result for Memoria, but it does show something valuable: the same retrieval layer can support very different downstream answer behaviors depending on the reader. A memory stack should be evaluated not only on what it retrieves, but also on how different readers convert that retrieval into answers or refusals.
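The IDK count itself is just an exact-match tally over reader outputs; a minimal sketch (the marker string follows the post, but whether the run counted it exactly this way is an assumption):

```python
def count_abstentions(hypotheses, marker="I don't know"):
    """Count how often a reader abstained with the exact marker string."""
    return sum(1 for h in hypotheses if h.strip() == marker)
```
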

What this says about Memoria

Taken together, these results show three things.

First, Memoria can retrieve the right evidence at high frequency. Near-perfect SSU and SSA performance across readers is difficult to achieve if the memory layer is not consistently surfacing the correct context.

Second, Memoria preserves enough structure for complex downstream use. Strong scores in knowledge updates, temporal reasoning, and multi-session tasks indicate that the retrieved memories are not just loosely relevant. They are sufficiently precise and ordered to support harder reasoning.

Third, Memoria has a high performance ceiling. When paired with a stronger reader, the same retrieval snapshot reaches 88.78% overall. That is a strong signal that the backend is doing real work and that additional gains can come from improving reader behavior rather than rethinking the memory layer from scratch.

In other words, Memoria is not just storing conversation history. It is producing retrieval outputs that are good enough for advanced models to answer difficult memory questions accurately.

Why the unified-judge setup matters

One reason benchmark results are often hard to interpret is that too many variables change at once. Retrieval changes. Prompting changes. Judging changes. The final number reflects the whole stack, but it is hard to know what actually improved.

This evaluation reduces that ambiguity:

  • one dataset

  • one retrieval snapshot

  • one judge

  • one official rubric

  • three readers

That makes the result more legible. The benchmark is not asking whether one model is “best.” It is asking what Memoria retrieval enables under controlled conditions.

And under those conditions, the answer is clear: Memoria provides a strong enough memory substrate for high-accuracy long-horizon QA, especially when paired with a capable reader.

Limitations

A fair reading should note what this benchmark does not prove.

This is one dataset, one retrieval configuration, and one judge. The benchmark also fixes retrieval at 10 memories per question, which may not be optimal for every reader. And because the final metric is answer accuracy rather than retrieval recall in isolation, some portion of the variance still belongs to the model, not the memory layer.

But those caveats do not weaken the main finding. They sharpen it.

The result here is not that Memoria solves memory in the abstract. It is that on a realistic long-memory benchmark, Memoria retrieval is strong enough to support state-of-the-art answer quality under a controlled unified evaluation.

Conclusion

Agent memory should be judged on more than demos.

In this LongMemEval evaluation, Memoria retrieved a fixed context snapshot that enabled:

  • 88.78% overall accuracy at best

  • 100% SSA

  • 100% SSU with the strongest reader

  • 89.61% Knowledge Update

  • 90.23% Temporal Reasoning

  • 78.95% Multi-Session

  • 93.33% Abstention on unanswerable questions

That is a compelling profile for a production memory layer.

The most important conclusion is not just that one reader scored higher than another. It is that Memoria consistently surfaced the evidence needed for strong readers to perform well across direct recall, updates, timelines, and cross-session synthesis.

For teams building AI agents, that is what memory infrastructure is supposed to do.

And on this benchmark, Memoria does it.

---

Experience the power of persistent memory for AI Agents. 🧠
💻 GitHub (star us!): https://github.com/matrixorigin/Memoria
🌐 Website: https://thememoria.ai/
👾 Discord: https://discord.com/
