
t49qnsx7qt-kpanks


How I took LongMemEval oracle from 62% to 82.8% without touching the retriever

I've been building a memory layer for AI agents (MnemoPay) and LongMemEval is the public benchmark I've been beating my head against for the last two weeks.

Started at 62-64% (Sonnet-4 answerer, GPT-4o judge). Ended today at 82.8%. Here's what actually moved the number and what didn't.

Scoreboard

500-question oracle variant, GPT-4o as judge.

| Run | Overall | Notes |
| --- | --- | --- |
| Baseline | 62-64% | Sonnet-4 answerer, default prompt |
| Session summarizer | ~72% | Compressed each session to a tight recap before feeding into context |
| Entities + spreading | 77.2% | Entity graph + 1-hop spreading over recalled chunks |
| Azure gpt-4o answerer | 81.4% | Same pipeline, swapped the model |
| Preference-fix prompt | 82.8% | Classify the question before answering |

The biggest single lift was the last one: only 1.4 points overall, but it's a 20-point lift on the preference bucket (60% -> 80%), and it's basically free.

What moved the number

Session summarization. The judge doesn't read your haystack; it reads your answer. If the answer is a 2KB dump with the fact buried in paragraph 3, the judge is generous, but the model's own attention is not. Pre-summarizing each session before recall made the answers sharper.
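A minimal sketch of what that summarization pass can look like at ingest time. `chat_complete` is a hypothetical LLM call, not part of any real SDK; swap in whatever client you already use.

```python
def summarize_session(turns: list[dict], chat_complete) -> str:
    """Compress one session into a tight recap before it enters the recall index.

    `chat_complete` is a hypothetical callable: prompt string in, completion out.
    """
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    prompt = (
        "Summarize this conversation in a few dense sentences. "
        "Keep every concrete fact: names, dates, places, preferences.\n\n"
        + transcript
    )
    return chat_complete(prompt)


# Usage: recaps replace raw session text in the recall index.
sessions = [[{"role": "user", "content": "I'm visiting Denver in February."}]]
fake_llm = lambda prompt: "User plans a Denver trip in February."  # stand-in for a real model
recaps = [summarize_session(s, fake_llm) for s in sessions]
```

The point isn't the prompt wording; it's that the answerer sees a dense recap instead of the 2KB dump.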

Entity graph. A question like "what did I tell you about Denver in February" doesn't hit on the token "Denver" if the session talks about Red Rocks and Brandon Flowers. Spreading activation one hop through an entity graph fixes most of the multi-session recall misses. Not magic, just a dict of {entity: [session_ids]} built at ingest.
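The dict-of-session-ids version really is that small. Here's a sketch under one big assumption: entity extraction below is just "capitalized tokens", where a real pipeline would run an NER pass at ingest.

```python
from collections import defaultdict


def build_entity_index(sessions: dict[str, str]) -> dict[str, set[str]]:
    """Build {entity: {session_ids}} at ingest. Naive extraction: capitalized tokens."""
    index: dict[str, set[str]] = defaultdict(set)
    for sid, text in sessions.items():
        for token in text.split():
            word = token.strip(".,!?")
            if word[:1].isupper():
                index[word].add(sid)
    return index


def spread_one_hop(recalled: set[str], sessions: dict[str, str],
                   index: dict[str, set[str]]) -> set[str]:
    """Expand recalled sessions with every session sharing an entity (1 hop)."""
    expanded = set(recalled)
    for sid in recalled:
        for token in sessions[sid].split():
            expanded |= index.get(token.strip(".,!?"), set())
    return expanded


sessions = {
    "s1": "We saw Brandon Flowers at Red Rocks near Denver.",
    "s2": "Denver in February is cold.",
    "s3": "Unrelated chat about cooking pasta.",
}
index = build_entity_index(sessions)
# Lexical recall for "Denver in February" hits s2; spreading pulls in s1
# through the shared "Denver" entity, even though s2 never mentions Red Rocks.
hits = spread_one_hop({"s2"}, sessions, index)  # -> {"s1", "s2"}
```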

Question classification in the system prompt. This is the one nobody talks about. The preference bucket was sitting at 60% for a week. I dug into the failures: 9 out of 12 were spurious refusals. Questions like "any tips for my Tokyo trip?" got "the conversation history doesn't include tips about Tokyo."

Of course it doesn't. The user isn't asking you to recall tips, they're asking you to generate tips using what you remember about them.

Fix was one paragraph added to the system prompt:

First, classify the question. (A) Factual lookup: answer strictly from context, refuse if not present. (B) Recommendation / advice: extract signals about the user from context, ground your suggestion in named specifics. Never refuse a recommendation question with "history doesn't include recommendations about X."

Preference: 60% -> 80%. Didn't touch the retriever.
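Wiring that paragraph in is trivial. A sketch, assuming a `base_prompt` and `context` string your pipeline already produces (both names are placeholders):

```python
# The classification paragraph from above, verbatim, prepended to the
# answerer's system prompt. No retriever changes required.
CLASSIFY_PARAGRAPH = (
    "First, classify the question. "
    "(A) Factual lookup: answer strictly from context, refuse if not present. "
    "(B) Recommendation / advice: extract signals about the user from context, "
    "ground your suggestion in named specifics. Never refuse a recommendation "
    'question with "history doesn\'t include recommendations about X."'
)


def build_system_prompt(base_prompt: str, context: str) -> str:
    """Assemble the answerer's system prompt with the classification step first."""
    return f"{base_prompt}\n\n{CLASSIFY_PARAGRAPH}\n\nContext:\n{context}"


prompt = build_system_prompt("You answer from conversation memory.", "…recalled recaps…")
```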

What didn't move the number

HyDE + rerank. Classic RAG trick. I ran a full 500q with HyDE + a cross-encoder reranker. Result: 74.6%. Down 2.6 points from no-HyDE. The hallucinated query was too close to the benchmark's distractors. Killed it.

Bigger context window. Went from 8k to 32k on the same pipeline. Gained 0.4 points. The model has the information either way. Judge-graded accuracy is about sharpness, not recall breadth.

More retrieval candidates. Top-k 5 -> 20 was flat. More noise cancels more recall.

The 1M stress test

Separate from the benchmark, I run a 1,000,000 transaction stress harness on the SDK on every release. Same pipeline, same ledger invariants, 100 concurrent agents doing charges, settles, refunds, disputes, and memory ops.

Latest run:

Agents:              100
Total ops:           1,036,685
Throughput:          2,904 ops/sec
Latency P50/P95/P99: 29 / 70 / 93 ms
Errors:              0.31%
Adversarial txs:     5,798 injected, 5,598 blocked (96.6%)
Ledger imbalance:    $0.00 (100/100 agents balanced)

Accuracy numbers lie if the system can't hold the line under load. These are the two tests I trust: a public benchmark for the intelligence, a stress harness for the plumbing.
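For the curious, the "ledger imbalance: $0.00" line comes from an invariant check, not a summary stat. A hypothetical sketch of that kind of check (the entry format here is invented for illustration): every agent's debits and credits must net to zero after the run, using `Decimal` so rounding can't fake a balance.

```python
from decimal import Decimal


def ledger_imbalance(entries: list[tuple[str, Decimal]]) -> dict[str, Decimal]:
    """Return per-agent net balance; an empty dict means every agent balanced."""
    totals: dict[str, Decimal] = {}
    for agent, amount in entries:
        totals[agent] = totals.get(agent, Decimal("0")) + amount
    return {agent: net for agent, net in totals.items() if net != 0}


entries = [
    ("agent-1", Decimal("10.00")),   # charge
    ("agent-1", Decimal("-10.00")),  # refund
]
assert ledger_imbalance(entries) == {}  # all agents net to $0.00
```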

Takeaways if you're tuning your own RAG

  1. Look at your failures, not your aggregate score. A 20-point bucket lift hides behind 1.4 aggregate points.
  2. Read the failure samples. Don't run another hyperparameter sweep until you've read 20 of them.
  3. Most "recall" misses are actually answer-shape misses. The model has the information, it just packaged it wrong.
  4. Swapping the answerer model is high leverage. Retrievers are load-bearing, but answerers are the voice.
  5. Never trust a benchmark score without sample-level failures next to it.

Happy to share the harness if anyone wants to repro against their own stack. The benchmark repo is xiaowu0162/LongMemEval on GitHub if you're starting cold.


MnemoPay is the memory and payments layer I'm building. npm i @mnemopay/sdk if you want to poke at it. Python too: pip install mnemopay.
