DEV Community

Jordan McCann

How I Built a Memory System That Scores 96.2% on LongMemEval (#1 in the World)

Agentmemory V4 answers 481 of 500 questions correctly: 96.20% on the LongMemEval benchmark under real-retrieval conditions, the highest published score on this benchmark by any single-pass system.

I built it alone. No team. No funding. No degree. A mid-range gaming PC with an Intel Core i3-12100F, 16 days of work, roughly $1,000 in API costs, and around 300 million tokens consumed along the way.

The previous world record was 95.60%, held by PwC Chronos, a research team that published an arXiv paper. Before them, the leaderboard included Mastra (94.87%), OMEGA (93.2%), Hindsight from Vectorize/Virginia Tech (91.4%), Emergence AI (86%), Supermemory (85.86%), and Zep (71.2%). All funded companies or research labs with teams.

I want to write up exactly how this happened, because the path was not clean. It was systematic and slow, punctuated by a moment where I nearly accepted a completely invalid result.

What LongMemEval Is (and Why It Matters)
LongMemEval (Wu et al., 2024; ICLR 2025) is a 500-question benchmark designed to evaluate long-term memory in AI assistants. It's considered the gold standard in this space.

Each test case provides a system with multi-session conversation histories, roughly 115,000 tokens across ~40 sessions. The system must ingest these conversations into its memory, then answer a question purely from what it retrieves. No peeking at the original conversations at inference time. You either find the right memories and reason correctly, or you don't.

The 500 questions span six types: temporal reasoning (133 questions), multi-session aggregation (133), single-session user facts (70), single-session assistant observations (56), knowledge updates (78), and single-session preferences (30). There are also 30 deliberately unanswerable questions to test whether the system knows when to say "I don't know."

The score is simple: correct / 500 × 100. GPT-4o judges each answer against the gold standard using the benchmark's published evaluation templates. No weighting, no curves.

For context, if you just dump the entire 115k-token conversation history into GPT-4o's context window and ask the question directly, it only scores 60-64%. The context window can fit everything, but the model still can't find and reason over the right information. That's why retrieval architecture matters.

The Invalid Start
Early in development, I hit 98% and thought I was done.
I was running with USE_DIRECT_CONTEXT = True, a flag that bypasses the retrieval pipeline entirely and injects the raw conversation transcript directly into the model's prompt. That's oracle access. The system wasn't retrieving from memory; it was reading the original conversation with the answers sitting right there.

I didn't set this flag maliciously. It existed from early debugging when I was testing whether the generation and judging pipeline worked correctly, before the retrieval system was built. I just never turned it off. For days I thought my architecture was performing at 98%.
When I caught the mistake and flipped USE_DIRECT_CONTEXT to False, the score dropped to around 88%. I added a hard assert that crashes the entire benchmark run if anyone tries to enable it:
USE_DIRECT_CONTEXT = False
assert not USE_DIRECT_CONTEXT, "INVALID: must be False for legitimate evaluation"

The legitimate baseline after calibration was 82.0% (410/500). That was the real starting point. I'd wasted days celebrating a fake score.

The Architecture
The core of agentmemory is a six-signal hybrid retrieval pipeline. Every benchmark case gets a fresh :memory: SQLite MemoryStore, zero cross-case contamination.
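That isolation takes only a few lines with the standard library. A minimal sketch (the `memories` schema and `run_case` helper here are illustrative, not the repo's actual code):

```python
import sqlite3

def run_case(case_id: str) -> sqlite3.Connection:
    """Open a fresh in-memory database for one benchmark case.

    Each :memory: connection is a completely separate database,
    so nothing written during one case can leak into the next.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, text TEXT)")
    return conn

# Two cases get independent stores: writes to one are invisible to the other.
a = run_case("case-001")
b = run_case("case-002")
a.execute("INSERT INTO memories (text) VALUES ('user owns a bike')")
assert b.execute("SELECT COUNT(*) FROM memories").fetchone()[0] == 0
```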

Ingestion
All sessions from the conversation history are ingested with [Session date: YYYY-MM-DD] markers so temporal references like "last month" or "a few weeks ago" can be anchored to real dates. For temporal-reasoning questions, a post-ingestion event extraction pass identifies date-bearing statements and creates dedicated event nodes.

Retrieval
When a question comes in, async_recall() fires six parallel retrieval signals:
w_semantic: 0.30 # Cosine similarity (all-mpnet-base-v2, 768-dim)
w_lexical: 0.12 # BM25 via SQLite FTS5
w_activation: 0.18 # Recency/frequency scoring
w_graph: 0.18 # Knowledge graph spreading activation
w_importance: 0.10 # Node importance × calibrated confidence
w_temporal: 0.12 # Gaussian temporal proximity

Candidates come from three sources simultaneously: HNSW approximate nearest neighbor index, BM25 full-text search, and graph traversal through entity relationships. All results are merged into a single ranked list and reranked with cross-encoder/ms-marco-MiniLM-L-6-v2.
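The merge itself can be sketched as a weighted sum over normalized per-signal scores, using the weights listed above. The `fuse` helper is my illustration of the idea, not agentmemory's actual merge code:

```python
WEIGHTS = {
    "semantic": 0.30, "lexical": 0.12, "activation": 0.18,
    "graph": 0.18, "importance": 0.10, "temporal": 0.12,
}

def fuse(candidates: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Combine per-signal scores (each in [0, 1]) into one ranked list.

    A candidate missing from a signal contributes 0 for it, which is
    what a union-merge of several retrieval sources naturally produces.
    """
    fused = {
        node_id: sum(WEIGHTS[s] * scores.get(s, 0.0) for s in WEIGHTS)
        for node_id, scores in candidates.items()
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse({
    "m1": {"semantic": 0.9, "lexical": 0.2},
    "m2": {"semantic": 0.4, "graph": 0.8, "temporal": 0.7},
})
# m1: 0.30*0.9 + 0.12*0.2 = 0.294; m2: 0.30*0.4 + 0.18*0.8 + 0.12*0.7 = 0.348
```

Because the weights sum to 1.0, the fused score stays in [0, 1] and remains comparable across question types.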

Context Assembly
Retrieved memories are assembled into a context block within per-type token budgets:
TOKEN_BUDGETS = {
    "multi-session": 7500,
    "temporal-reasoning": 5000,
    "single-session-assistant": 3500,
    "single-session-preference": 3500,
    "knowledge-update": 2500,
    "single-session-user": 1500,
}

Multi-session questions get more context because they require synthesizing information scattered across many conversations. Single-session user facts need less: the answer is in one place; you just have to find it.

Generation and Judgment
Claude Opus 4.6 generates the answer (temperature=0). GPT-4o judges it against the gold answer (temperature=0, seed=42) using the benchmark's official evaluation prompts verbatim.

The Climb: 82% → 95.6%
I developed a systematic iteration process across 46 cycles. Each cycle followed the same pattern: analyze failures from the previous run, form a hypothesis about why specific cases failed, implement a targeted fix, run tests on the affected cases plus a regression check, then commit or revert.

Early Phase: Retrieval Improvements (82% → ~89%)
The first gains came from fixing obvious retrieval gaps. Better BM25 tokenization. Tuning HNSW parameters (M=16, ef_construction=200, ef_search=100). Adding the cross-encoder reranker. Building the knowledge graph with automatic entity extraction and spreading activation. These were infrastructure improvements that helped across all question types.

Middle Phase: Prompt Engineering (~89% → 95.6%)
Once retrieval was solid, the remaining failures were almost all reasoning errors: the model had the right context but drew the wrong conclusion. The fixes were surgical prompt rules, each discovered by analyzing specific failure cases:
BORN vs ADOPTED (ITER-44): When a question asks "how many babies were born to the user," only count natural births. The model was including adopted children.
PLANS TO ACQUIRE ≠ CURRENTLY OWNS (ITER-44): "I'm thinking about getting a Tesla" does not mean the user owns a Tesla. The model was treating hypothetical intent as established fact.
SOLO CLASS ASSIGNMENT ≠ LED (ITER-44): An academic assignment completed alone is not a "project the user led." Work projects and personal research initiatives count; homework does not.
SAME-SESSION INCREMENT OVERRIDE (ITER-45): When a conversation mentions both a running total and new items in the same session, the stated total already includes the new items. Don't double-count.

Each of these rules was worth 1-3 cases. They accumulated slowly over dozens of cycles.
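In practice, rules like these accumulate into a section of the generation prompt. The rule wording below is paraphrased from the list above; the `build_system_prompt` helper is my own sketch, not agentmemory's code:

```python
REASONING_RULES = [
    "Only count natural births when asked how many babies were born.",
    "Stated intent to acquire something does not imply current ownership.",
    "A solo class assignment is not a project the user led.",
    "A running total stated in the same session already includes new items.",
]

def build_system_prompt(base: str) -> str:
    """Append the accumulated reasoning rules as a numbered section."""
    rules = "\n".join(f"{i}. {r}" for i, r in enumerate(REASONING_RULES, 1))
    return f"{base}\n\nReasoning rules:\n{rules}"
```

The appeal of this shape is that each failure analysis adds one line, and a regression run immediately shows whether the new rule broke anything else.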

The Wall at 95.6%
Then I got stuck.
Opus1: 478/500 (95.6%). Opus2: 478/500. Opus3: 478/500. Opus4: 478/500.
Four consecutive full runs, each taking about six hours, and the exact same score every time. I was making prompt improvements that should have been worth 2-3 cases each, but the score wouldn't move.

This was genuinely demoralizing. I was tied with the Chronos world record, but I couldn't break past it. Every improvement I made was being canceled out by something.

Diagnosing the Problem
I spent an entire day tracing the issue. The root cause was non-determinism in the HNSW index: my retrieval was returning slightly different results on every run, and the variance was exactly large enough (±3 cases) to mask every improvement I made.

There were three independent sources:

  1. Insertion-order-dependent node levels. The HNSW implementation used random.Random(42).random() for level assignment; the seed was fixed, but the actual level each node got depended on the order in which it was inserted. Different async scheduling across runs meant different insertion orders, which meant different graph structures, which meant different search results.
  2. Python's randomized hash(). Python randomizes string hashing by default via PYTHONHASHSEED. The HNSW beam search iterates over sets of node IDs, and set iteration order depends on hash(). Different process launches meant different traversal orders.
  3. Claude API non-determinism. Even at temperature=0, the Claude API routes requests to different GPU nodes, which can produce slightly different outputs due to floating-point variation. This one I couldn't fix; it's on Anthropic's side.
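Source #2 is easy to demonstrate: once PYTHONHASHSEED is pinned, set iteration order becomes reproducible across process launches. A small sketch (the snippet and `run_with_seed` helper are mine, not agentmemory code):

```python
import os
import subprocess
import sys

SNIPPET = "print(list({'alpha', 'beta', 'gamma', 'delta'}))"

def run_with_seed(seed: str) -> str:
    """Run a tiny script in a fresh process with a pinned hash seed."""
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    )
    return result.stdout

# With the seed pinned, set iteration order is identical across processes;
# left unpinned, it can differ from one launch to the next.
assert run_with_seed("42") == run_with_seed("42")
```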

The Failed Fix
My first attempt was to replace HNSW with exact KNN, brute-force cosine similarity over all stored vectors. Zero randomness, mathematically correct results every time. I was confident this would work.

It made things worse: 476/500 (95.2%), a regression of two cases.

The counterintuitive finding: HNSW's approximate graph traversal was actually better than exact nearest-neighbor for this benchmark. The graph structure surfaces topically related clusters that pure cosine similarity misses. Some questions need context that's conceptually related but not the closest embedding match, and HNSW's exploration of the graph neighborhood finds these.
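For reference, the exact-KNN replacement really is just a few lines; there is nothing subtle to get wrong, which is what made the regression so surprising. A minimal sketch (the `exact_knn` helper is my own, not the repo's code):

```python
import math

def exact_knn(query: list[float], store: dict[str, list[float]], k: int = 2):
    """Brute-force cosine similarity over every stored vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0
    scored = [(node_id, cosine(query, vec)) for node_id, vec in store.items()]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:k]

hits = exact_knn([1.0, 0.0], {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7]})
# "a" ranks first: it points in nearly the same direction as the query.
```

This is mathematically exact, and that exactness is precisely what it loses: there is no graph neighborhood to explore, so conceptually related memories that aren't the closest embedding match never surface.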

The Actual Fix
Instead of replacing HNSW, I made it deterministic while preserving its graph-traversal properties:
@staticmethod
def _vector_random(vector: list[float]) -> float:
    """SHA-256 hash of embedding → deterministic level assignment."""
    n = min(16, len(vector))
    data = struct.pack(f"{n}f", *vector[:n])
    digest = hashlib.sha256(data).digest()
    h = int.from_bytes(digest[:8], "big") or 1
    return h / 0x1_0000_0000_0000_0000

Same content → same embedding → same hash → same level in the graph, regardless of insertion order. The PYTHONHASHSEED was fixed via subprocess re-execution:
_DESIRED_HASH_SEED = "42"
if os.environ.get("PYTHONHASHSEED") != _DESIRED_HASH_SEED:
    env = {**os.environ, "PYTHONHASHSEED": _DESIRED_HASH_SEED}
    result = subprocess.run([sys.executable] + sys.argv, env=env)
    sys.exit(result.returncode)

The result: 481/500 = 96.20%. The deterministic graph happened to be a superior retrieval configuration, one that had been masked by noise in every prior run.
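For completeness, here is a self-contained sketch of how such a hash-derived uniform value can become an HNSW node level. It assumes the standard exponential level formula floor(-ln(u) · mL) from the HNSW paper, with mL = 1/ln(M) and M=16 as configured earlier; whether agentmemory maps it exactly this way isn't shown in the post:

```python
import hashlib
import math
import struct

def deterministic_level(vector: list[float], m_l: float = 1 / math.log(16)) -> int:
    """Map an embedding to an HNSW level via floor(-ln(u) * mL),
    with u drawn deterministically from a SHA-256 hash of the vector."""
    n = min(16, len(vector))
    data = struct.pack(f"{n}f", *vector[:n])
    h = int.from_bytes(hashlib.sha256(data).digest()[:8], "big") or 1
    u = h / 0x1_0000_0000_0000_0000  # u in (0, 1], so -ln(u) >= 0
    return int(-math.log(u) * m_l)

v = [0.1, 0.2, 0.3]
# The same vector always lands on the same level, in any insertion order.
assert deterministic_level(v) == deterministic_level(v)
```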

What I Got Wrong and What's Honest
Possible benchmark overfitting. I iterated 46 times on the same 500 questions. So did every other system in this space; there's no train/test split. But my prompt rules are general reasoning principles, not memorized answers. I plan to run on LoCoMo and other benchmarks to test generalization.

Model strength matters. Using Claude Opus as the generator contributes to the score. Mastra showed a 10-point jump going from GPT-4o to gpt-5-mini with zero architecture changes. The retrieval system is doing real work (full-context Opus would likely score in the 60-65% range, like full-context GPT-4o), but I want to be honest that the score reflects the full stack, not just retrieval.

Benchmark ≠ production. LongMemEval hands you the haystack. Real-world memory systems need to decide what to keep across an unbounded lifetime of interactions with users who contradict themselves, change their minds, and never tell you what's important. That's a harder problem, and this benchmark doesn't test it. A commenter on Reddit made this point well, and they're right.

19 remaining failures. Eight are retrieval misses where the evidence never made it into the context. Five are cases requiring complex inference the model can't consistently perform. Six are borderline cases that flip based on which GPU node the Claude API routes to, which is outside my control.

The Journey in Numbers

Invalid oracle mode: ~98% (discarded)
Real retrieval cold start: ~88%
ITER-1 calibrated baseline: 82.0% (410/500)
After 32 iteration cycles: 91.4% (457/500)
Opus1–Opus4 (the wall): 95.6% (478/500)
Opus5 (exact KNN, regression): 95.2% (476/500)
Opus6 (deterministic HNSW): 96.2% (481/500)

What's Next
Better multi-session aggregation: it's my weakest category at 93.2%. The failures are mostly counting and list-reconstruction problems where the retrieval surfaces the right sessions but the model miscounts items or can't reconstruct original ordering. This is a prompt engineering and context-assembly problem, not a retrieval problem.

I'm also exploring opportunities in AI research and engineering. If you're building at the frontier of agent memory, retrieval systems, or long-context AI, whether that's a frontier lab, a startup, or a research team, I'd love to talk.

The full system is open source under MIT. Code, results, run logs, and a full legitimacy audit are all in the repository.

GitHub: github.com/JordanMcCann/agentmemory

Built by Jordan McCann.
