Michael Cabaza
Perfect Retrieval Recall on the Hardest AI Memory Benchmark — Running Fully Local

We've been benchmarking Aingram's hybrid retrieval pipeline against LongMemEval, the most rigorous public benchmark for long-term memory in AI chat assistants. This post covers the retrieval-only results — before any LLM generation step — because we think they tell an important story about where memory system failures actually come from.


Background: What LongMemEval Tests

LongMemEval (Wu et al., ICLR 2025) is a benchmark of 500 hand-curated questions embedded in scalable user-assistant chat histories. The LongMemEval-S split gives each question a history of approximately 115,000 tokens (~40 sessions). Questions span five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

The standard evaluation is end-to-end: ingest the conversation history, retrieve relevant sessions, pass them to an LLM, generate an answer, and score with an LLM judge. Most published numbers (Zep: 71.2%, Emergence AI: 86%) are end-to-end accuracy. But LongMemEval also includes oracle metadata — ground-truth labels for which sessions contain the answer. That means you can measure pure retrieval quality separately from LLM reasoning quality, and we think this distinction matters a lot.
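Retrieval-only scoring against those oracle labels is straightforward to sketch. The functions below compute the `recall_any@k` and `recall_all@k` metrics reported later in this post; the session ids are illustrative, not LongMemEval's actual schema.

```python
# Minimal sketch of retrieval-only scoring against oracle session labels.
# "recall_any" asks whether ANY evidence session was retrieved;
# "recall_all" asks whether EVERY evidence session was retrieved.

def recall_any_at_k(retrieved, relevant, k):
    """1.0 if any relevant session id appears in the top-k results."""
    return float(any(sid in relevant for sid in retrieved[:k]))

def recall_all_at_k(retrieved, relevant, k):
    """1.0 only if every relevant session id appears in the top-k results."""
    top_k = set(retrieved[:k])
    return float(all(sid in top_k for sid in relevant))

# One query: ground truth says sessions {"s7", "s12"} hold the evidence.
retrieved = ["s12", "s3", "s7", "s41", "s9"]
relevant = {"s7", "s12"}

print(recall_any_at_k(retrieved, relevant, 1))  # s12 is relevant -> 1.0
print(recall_all_at_k(retrieved, relevant, 3))  # both in top 3 -> 1.0
print(recall_all_at_k(retrieved, relevant, 2))  # s7 outside top 2 -> 0.0
```

Averaging these per-query scores over all 500 questions yields the benchmark-level numbers in the tables below.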


The Oracle Run: Establishing the Retrieval Ceiling

We first ran Aingram's retrieval pipeline against longmemeval_oracle.json, which contains only the evidence sessions — a direct measure of whether our hybrid retrieval can find the right material.

| Metric | Score |
| --- | --- |
| ndcg_any@1 | 0.976 |
| ndcg_any@10 | 0.994 |
| recall_any@1 | 0.976 |
| recall_any@3 | 1.000 |
| recall_any@10 | 1.000 |
| recall_all@10 | 1.000 |
| Median latency | 22ms |

recall_any@3 = 1.000 across all 500 queries. The relevant session appeared in the top 3 results for every single question. At rank 10, all relevant sessions were present for every query. This tells us something specific: Aingram's retrieval component is not the bottleneck for end-to-end performance on this benchmark. Whatever end-to-end accuracy we achieve is bounded by LLM reasoning quality over the retrieved context, not by whether the right sessions were found.


The Real Benchmark: LongMemEval-S

The oracle split is an upper bound. LongMemEval-S is the real test: 500 instances with full noisy conversation histories, no hints about which sessions matter.

| Metric | Score |
| --- | --- |
| ndcg_any@10 | 0.836 |
| recall_any@1 | 0.759 |
| recall_any@3 | 0.902 |
| recall_any@10 | 0.955 |
| recall_all@10 | 0.883 |
| Median latency | 27ms |

recall_any@10 = 0.955: the relevant session appears in the top 10 results for 95.5% of queries. The gap down to recall_any@1 (0.759) tells you that the correct session isn't always ranked first — but it's almost always present within the first 10 results.


What This Means for End-to-End Performance

Zep's published end-to-end accuracy of 71.2% (using gpt-4o) and Emergence AI's 86% (using gpt-4o-2024-08-06) combine retrieval and LLM generation; neither has published retrieval-only numbers. Here's the key relationship: end-to-end accuracy cannot exceed retrieval recall. If the correct session isn't retrieved, no LLM can answer the question correctly, so a system with recall_any@10 = 0.71 can at best achieve 71% end-to-end accuracy, no matter how capable the LLM is. Aingram's recall_any@10 of 0.955 means the ceiling on end-to-end accuracy is set by LLM reasoning, not by retrieval failure: the system puts the right material in front of the LLM 95.5% of the time.
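The arithmetic behind that ceiling is simple enough to write down. The LLM accuracy figure in the second call is a hypothetical illustration, not a measured number:

```python
# End-to-end accuracy is bounded by retrieval recall: if the evidence
# session never reaches the context window, the LLM cannot answer.

def e2e_ceiling(retrieval_recall, llm_accuracy_given_context=1.0):
    # Even a perfect reasoner over the retrieved context is capped
    # at the retrieval recall rate.
    return retrieval_recall * llm_accuracy_given_context

print(e2e_ceiling(0.71))        # recall@10 of 0.71 caps e2e at 71%
print(e2e_ceiling(0.955, 0.9))  # ~0.86 with a hypothetical 90%-accurate LLM
```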


Retrieval Architecture

The recall numbers above come from Aingram's hybrid retrieval pipeline, which combines three signals via Reciprocal Rank Fusion (RRF):

  • FTS5 full-text search — keyword matching, fast, effective for exact terminology
  • sqlite-vec vector search — semantic similarity via nomic-embed-text-v1.5 (ONNX, 768 dims)
  • Knowledge graph traversal — entity relationships, multi-hop connections via CTE
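The fusion step can be sketched in a few lines. This is a generic RRF implementation over ranked session-id lists, not Aingram's actual code; the constant k=60 comes from the original RRF paper, and Aingram's chosen value is an assumption here.

```python
# Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per
# item, so items ranked highly by multiple signals float to the top.
from collections import defaultdict

def rrf(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, session_id in enumerate(ranking, start=1):
            scores[session_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

fts_hits   = ["s3", "s7", "s1"]   # FTS5 keyword ranking (best first)
vec_hits   = ["s7", "s2", "s3"]   # vector-similarity ranking
graph_hits = ["s7", "s3"]         # knowledge-graph traversal

print(rrf([fts_hits, vec_hits, graph_hits])[:2])  # -> ['s7', 's3']
```

Here s7 wins because all three signals rank it near the top, even though FTS5 alone preferred s3.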

This is the open-source Lite pipeline. Everything runs locally on SQLite — no external services, no cloud round-trip, no vector database to manage. Median retrieval latency is 22ms on an RTX 4060 8GB (measured on the oracle evaluation run with no caching layer active). The Pro tier adds a GPU-resident neural retrieval cache that shortcuts the full pipeline for high-confidence queries, keeping latency flat as memory grows. It doesn't change retrieval quality — the recall numbers here reflect the Lite pipeline alone.
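To make the "everything runs locally on SQLite" point concrete, here is the keyword leg of such a pipeline using the FTS5 extension bundled with most SQLite builds. The table name and contents are illustrative, not Aingram's schema:

```python
# Sketch of the FTS5 full-text leg: an in-memory SQLite database,
# a virtual table over session text, and a ranked keyword query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE sessions USING fts5(content)")
conn.executemany(
    "INSERT INTO sessions(content) VALUES (?)",
    [("booked a flight to Tokyo for the conference",),
     ("renewed the gym membership yesterday",)],
)

# FTS5 treats space-separated terms as an implicit AND; "rank" orders
# results by the built-in BM25 relevance score (more relevant first).
rows = conn.execute(
    "SELECT rowid, rank FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ("flight Tokyo",),
).fetchall()
print(rows)  # only the session mentioning both terms matches
```

The vector leg works the same way through the sqlite-vec extension, with both result lists then fused via RRF.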


Honest Caveats

These are retrieval metrics, not end-to-end accuracy. The comparison to Zep's 71.2% or Emergence AI's 86% requires running the full QA pipeline — which we're doing and will publish separately. The oracle run's perfect recall also reflects that oracle sessions are curated to be the exact evidence needed. LongMemEval-S is substantially harder because you're searching through ~40 sessions of noise to find 1–3 relevant ones.

Aingram v2.0.0-alpha.1 | RTX 4060 8GB | nomic-embed-text-v1.5


Aingram is a local-first, privacy-preserving shared memory layer for AI agent teams/swarms. The retrieval pipeline described here is the open-source Lite tier — the recall numbers reflect what the Lite architecture delivers on its own. We'll be publishing more benchmark results and opening up early access soon.

If you're interested in learning more, or want to contribute to Aingram, check out the git repo or aingram.dev!