Michael Cabaza
Perfect Retrieval Recall on the Hardest AI Memory Benchmark — Running Fully Local

We've been benchmarking Aingram's hybrid retrieval pipeline against LongMemEval, the most rigorous public benchmark for long-term memory in AI chat assistants. This post covers the retrieval-only results — before any LLM generation step — because we think they tell an important story about where memory system failures actually come from.


Background: What LongMemEval Tests

LongMemEval (Wu et al., ICLR 2025) is a benchmark of 500 hand-curated questions embedded in scalable user-assistant chat histories. The LongMemEval-S split gives each question a history of approximately 115,000 tokens (~40 sessions). Questions span five memory abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

The standard evaluation is end-to-end: ingest the conversation history, retrieve relevant sessions, pass them to an LLM, generate an answer, and score with an LLM judge. Most published numbers (Zep: 71.2%, Emergence AI: 86%) are end-to-end accuracy. But LongMemEval also includes oracle metadata — ground-truth labels for which sessions contain the answer. That means you can measure pure retrieval quality separately from LLM reasoning quality, and we think this distinction matters a lot.
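Retrieval-only scoring against those oracle labels is straightforward to sketch. The functions below compute the `recall_any@k` and `recall_all@k` metrics reported later in this post; the session ids are illustrative, not LongMemEval's actual schema.

```python
# Minimal sketch of retrieval-only scoring against oracle session labels.
# "recall_any" asks whether ANY evidence session was retrieved;
# "recall_all" asks whether EVERY evidence session was retrieved.

def recall_any_at_k(retrieved, relevant, k):
    """1.0 if any relevant session id appears in the top-k results."""
    return float(any(sid in relevant for sid in retrieved[:k]))

def recall_all_at_k(retrieved, relevant, k):
    """1.0 only if every relevant session id appears in the top-k results."""
    top_k = set(retrieved[:k])
    return float(all(sid in top_k for sid in relevant))

# One query: ground truth says sessions {"s7", "s12"} hold the evidence.
retrieved = ["s12", "s3", "s7", "s41", "s9"]
relevant = {"s7", "s12"}

print(recall_any_at_k(retrieved, relevant, 1))  # s12 is relevant -> 1.0
print(recall_all_at_k(retrieved, relevant, 3))  # both in top 3 -> 1.0
print(recall_all_at_k(retrieved, relevant, 2))  # s7 outside top 2 -> 0.0
```

Averaging these per-query scores over all 500 questions yields the benchmark-level numbers in the tables below.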


The Oracle Run: Establishing the Retrieval Ceiling

We first ran Aingram's retrieval pipeline against longmemeval_oracle.json, which contains only the evidence sessions — a direct measure of whether our hybrid retrieval can find the right material.

| Metric | Score |
| --- | --- |
| ndcg_any@1 | 0.976 |
| ndcg_any@10 | 0.994 |
| recall_any@1 | 0.976 |
| recall_any@3 | 1.000 |
| recall_any@10 | 1.000 |
| recall_all@10 | 1.000 |
| Median latency | 22ms |

recall_any@3 = 1.000 across all 500 queries. The relevant session appeared in the top 3 results for every single question. At rank 10, all relevant sessions were present for every query. This tells us something specific: Aingram's retrieval component is not the bottleneck for end-to-end performance on this benchmark. Whatever end-to-end accuracy we achieve is bounded by LLM reasoning quality over the retrieved context, not by whether the right sessions were found.


The Real Benchmark: LongMemEval-S

The oracle split is an upper bound. LongMemEval-S is the real test: 500 instances with full noisy conversation histories, no hints about which sessions matter.

| Metric | Score |
| --- | --- |
| ndcg_any@10 | 0.836 |
| recall_any@1 | 0.759 |
| recall_any@3 | 0.902 |
| recall_any@10 | 0.955 |
| recall_all@10 | 0.883 |
| Median latency | 27ms |

recall_any@10 = 0.955: the relevant session appears in the top 10 results for 95.5% of queries. The gap down to recall_any@1 (0.759) tells you that the correct session isn't always ranked first — but it's almost always present within the first 10 results.


What This Means for End-to-End Performance

Zep's published end-to-end accuracy of 71.2% (using gpt-4o) and Emergence AI's 86% (using gpt-4o-2024-08-06) combine retrieval and LLM generation; neither has published retrieval-only numbers. Here's the key relationship: end-to-end accuracy cannot exceed retrieval recall. If the correct session isn't retrieved, no LLM can answer the question correctly, so a system with recall_any@10 = 0.71 can at best achieve 71% end-to-end accuracy, no matter how capable the LLM is. Aingram's recall_any@10 of 0.955 means the ceiling on end-to-end accuracy is set by LLM reasoning, not by retrieval failure: the system puts the right material in front of the LLM 95.5% of the time.
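The arithmetic behind that ceiling is simple enough to write down. The LLM accuracy figure in the second call is a hypothetical illustration, not a measured number:

```python
# End-to-end accuracy is bounded by retrieval recall: if the evidence
# session never reaches the context window, the LLM cannot answer.

def e2e_ceiling(retrieval_recall, llm_accuracy_given_context=1.0):
    # Even a perfect reasoner over the retrieved context is capped
    # at the retrieval recall rate.
    return retrieval_recall * llm_accuracy_given_context

print(e2e_ceiling(0.71))        # recall@10 of 0.71 caps e2e at 71%
print(e2e_ceiling(0.955, 0.9))  # ~0.86 with a hypothetical 90%-accurate LLM
```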


Retrieval Architecture

The recall numbers above come from Aingram's hybrid retrieval pipeline, which combines three signals via Reciprocal Rank Fusion (RRF):

  • FTS5 full-text search — keyword matching, fast, effective for exact terminology
  • sqlite-vec vector search — semantic similarity via nomic-embed-text-v1.5 (ONNX, 768 dims)
  • Knowledge graph traversal — entity relationships, multi-hop connections via CTE
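The fusion step can be sketched in a few lines. This is a generic RRF implementation over ranked session-id lists, not Aingram's actual code; the constant k=60 comes from the original RRF paper, and Aingram's chosen value is an assumption here.

```python
# Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) per
# item, so items ranked highly by multiple signals float to the top.
from collections import defaultdict

def rrf(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, session_id in enumerate(ranking, start=1):
            scores[session_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

fts_hits   = ["s3", "s7", "s1"]   # FTS5 keyword ranking (best first)
vec_hits   = ["s7", "s2", "s3"]   # vector-similarity ranking
graph_hits = ["s7", "s3"]         # knowledge-graph traversal

print(rrf([fts_hits, vec_hits, graph_hits])[:2])  # -> ['s7', 's3']
```

Here s7 wins because all three signals rank it near the top, even though FTS5 alone preferred s3.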

This is the open-source Lite pipeline. Everything runs locally on SQLite — no external services, no cloud round-trip, no vector database to manage. Median retrieval latency is 22ms on an RTX 4060 8GB (measured on the oracle evaluation run with no caching layer active). The Pro tier adds a GPU-resident neural retrieval cache that shortcuts the full pipeline for high-confidence queries, keeping latency flat as memory grows. It doesn't change retrieval quality — the recall numbers here reflect the Lite pipeline alone.
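To make the "everything runs locally on SQLite" point concrete, here is the keyword leg of such a pipeline using the FTS5 extension bundled with most SQLite builds. The table name and contents are illustrative, not Aingram's schema:

```python
# Sketch of the FTS5 full-text leg: an in-memory SQLite database,
# a virtual table over session text, and a ranked keyword query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE sessions USING fts5(content)")
conn.executemany(
    "INSERT INTO sessions(content) VALUES (?)",
    [("booked a flight to Tokyo for the conference",),
     ("renewed the gym membership yesterday",)],
)

# FTS5 treats space-separated terms as an implicit AND; "rank" orders
# results by the built-in BM25 relevance score (more relevant first).
rows = conn.execute(
    "SELECT rowid, rank FROM sessions WHERE sessions MATCH ? ORDER BY rank",
    ("flight Tokyo",),
).fetchall()
print(rows)  # only the session mentioning both terms matches
```

The vector leg works the same way through the sqlite-vec extension, with both result lists then fused via RRF.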


Honest Caveats

These are retrieval metrics, not end-to-end accuracy. The comparison to Zep's 71.2% or Emergence AI's 86% requires running the full QA pipeline — which we're doing and will publish separately. The oracle run's perfect recall also reflects that oracle sessions are curated to be the exact evidence needed. LongMemEval-S is substantially harder because you're searching through ~40 sessions of noise to find 1–3 relevant ones.

Aingram v2.0.0-alpha.1 | RTX 4060 8GB | nomic-embed-text-v1.5


Aingram is a local-first, privacy-preserving shared memory layer for AI agent teams/swarms. The retrieval pipeline described here is the open-source Lite tier — the recall numbers reflect what the Lite architecture delivers on its own. We'll be publishing more benchmark results and opening up early access soon.

If you're interested in learning more, or want to contribute to Aingram, check out the git repo or aingram.dev!