A common objection to agent memory is that you don't need it: context windows are huge now, so just put the whole history in the prompt. We wanted a real answer, not a vibe, so we ran two public long-term-memory benchmarks against a full-context baseline. Here's what we found — including the case where the baseline wins.
The setup
We compared two configurations on the same questions. The full-context baseline stuffs the entire conversation history into the prompt. Eidentic memory ingests the history into its four-tier engine and retrieves only what each question needs. Both use the same model and the same LLM judge. We ran the full sets — no sampling — and we're publishing wins and losses together.
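The comparison above can be sketched as a small harness. This is only an illustration of the shape of the experiment, not the actual runner from the repo: `Question`, `full_context`, `evaluate`, and the `answer` and `judge` callables are hypothetical names standing in for the shared model and the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Question:
    history: list[str]  # full conversation history, one string per turn
    text: str           # the question asked at the end
    expected: str       # reference answer handed to the judge


def full_context(q: Question) -> str:
    """Baseline: put the entire history into the prompt."""
    return "\n".join(q.history)


def evaluate(
    questions: Iterable[Question],
    build_context: Callable[[Question], str],
    answer: Callable[[str, str], str],       # (context, question) -> prediction
    judge: Callable[[str, str, str], bool],  # (question, expected, prediction) -> correct?
) -> float:
    """Return accuracy for one configuration.

    Only build_context differs between configurations; the answer and
    judge callables are shared, so context is the only variable.
    """
    correct = 0
    total = 0
    for q in questions:
        context = build_context(q)
        prediction = answer(context, q.text)
        correct += bool(judge(q.text, q.expected, prediction))
        total += 1
    return correct / total if total else 0.0
```

A memory configuration would plug in a different `build_context` that returns only the retrieved snippets, while passing the same `answer` and `judge` to `evaluate`.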
LongMemEval: memory wins across the board
LongMemEval uses long histories: each one runs roughly 115k tokens across ~50 sessions, and the set has 500 questions. This is where memory should help, and it does: 55.2% overall vs 41.0% for full context, a 14.2-point gain, with wins on all six question types.
| Question type | Full context | Eidentic memory |
|---|---|---|
| Single-session · user | 67.1% | 84.3% |
| Single-session · assistant | 73.2% | 92.9% |
| Single-session · preference | 3.3% | 26.7% |
| Multi-session | 27.8% | 42.1% |
| Temporal reasoning | 20.3% | 34.6% |
| Knowledge update | 66.7% | 70.5% |
| Overall | 41.0% | 55.2% |
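To see where the gain is concentrated, the per-type differences can be computed directly from the table. The snippet below is a small sketch that only restates the published numbers; the variable names are ours.

```python
# (full context %, Eidentic memory %) per question type, copied from the table.
per_type = {
    "Single-session · user": (67.1, 84.3),
    "Single-session · assistant": (73.2, 92.9),
    "Single-session · preference": (3.3, 26.7),
    "Multi-session": (27.8, 42.1),
    "Temporal reasoning": (20.3, 34.6),
    "Knowledge update": (66.7, 70.5),
}
overall = (41.0, 55.2)

# Gain in percentage points, rounded to one decimal to match the table.
gains = {name: round(mem - full, 1) for name, (full, mem) in per_type.items()}
overall_gain = round(overall[1] - overall[0], 1)

# Print the question types from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: +{gain} points")
print(f"Overall: +{overall_gain} points")
```

The largest gain is on single-session preference questions (+23.4 points) and the smallest is on knowledge update (+3.8 points).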
The cost difference is the other half of the story. Memory answers each question with about 2,550 tokens of retrieved context; the baseline spends about 99,435 re-reading the whole history every time. That is roughly 39× fewer tokens per question for the better score. Retrieval isn't just more accurate here, it's dramatically cheaper.
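The token arithmetic behind that comparison is easy to check. The sketch below uses the two per-question averages quoted above and the 500-question set size; `price_per_million` is a placeholder parameter, not a real price, so substitute your own model's input rate.

```python
QUESTIONS = 500                        # LongMemEval set size
MEMORY_TOKENS_PER_QUESTION = 2_550     # average retrieved context
BASELINE_TOKENS_PER_QUESTION = 99_435  # average full-history prompt


def token_ratio(baseline_tokens: float, memory_tokens: float) -> float:
    """How many times more context tokens the baseline spends."""
    if memory_tokens <= 0:
        raise ValueError("memory_tokens must be positive")
    return baseline_tokens / memory_tokens


def run_total(tokens_per_question: int, questions: int) -> int:
    """Total context tokens for one pass over the question set."""
    return tokens_per_question * questions


def input_cost(total_tokens: int, price_per_million: float) -> float:
    """Input cost for a token total; price_per_million is an assumed rate."""
    return total_tokens / 1_000_000 * price_per_million


ratio = token_ratio(BASELINE_TOKENS_PER_QUESTION, MEMORY_TOKENS_PER_QUESTION)
memory_total = run_total(MEMORY_TOKENS_PER_QUESTION, QUESTIONS)
baseline_total = run_total(BASELINE_TOKENS_PER_QUESTION, QUESTIONS)

print(f"ratio: {ratio:.1f}x")
print(f"memory run:   {memory_total:,} tokens")
print(f"baseline run: {baseline_total:,} tokens")
```

Over the full set this comes to about 1.3 million context tokens for memory against about 49.7 million for the baseline, before counting the question text and the generated answers.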
LoCoMo: where full context still wins
LoCoMo has a much smaller haystack. When the entire history comfortably fits in the window, brute force is hard to beat: the model can see everything at once, and single- and multi-hop questions don't need retrieval. Here the full-context baseline comes out 7.8 points ahead. Memory still uses far fewer tokens (~893 vs ~19,030, roughly 21× fewer), but on a small history that saving doesn't make up for the lost accuracy.
Across these two benchmarks the pattern is consistent: the larger the history, the more memory wins, on accuracy and on cost. On small histories, full context stays competitive. We'd rather you know both numbers than just the flattering one.
What this means in practice
If your agent's conversations are short and bounded, you may not need a memory engine at all — and we'll tell you that. But the moment histories grow past what you want to pay to re-read on every turn, retrieval-based memory wins twice: better answers, far fewer tokens. That crossover arrives quickly in real products.
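For a rough feel of when the cost side of that crossover arrives, here is a back-of-the-envelope sketch. It assumes every turn adds a fixed number of tokens to the history and that retrieval returns a fixed-size context; both are simplifications, the function names are ours, and it says nothing about accuracy, only about prompt tokens.

```python
def full_context_tokens(turn: int, tokens_per_turn: int) -> int:
    """History tokens re-read at a given turn (1-indexed) under full context."""
    return turn * tokens_per_turn


def crossover_turn(tokens_per_turn: int, retrieved_tokens: int) -> int:
    """First turn at which re-reading the history costs more than retrieval.

    Solves turn * tokens_per_turn > retrieved_tokens for the smallest turn.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    return retrieved_tokens // tokens_per_turn + 1


# Assumed: 200 tokens added per turn. 2,550 is the average retrieved
# context reported for LongMemEval in this post.
turn = crossover_turn(tokens_per_turn=200, retrieved_tokens=2_550)
print(turn)                            # 13
print(full_context_tokens(turn, 200))  # 2600
```

Under those assumptions, full context is the cheaper prompt for the first twelve turns and the more expensive one from turn 13 onward, with the gap widening on every later turn.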
Full methodology, the harness, and the raw per-question records are in the benchmarks docs, and the runner lives in the repo. Reproduce it, and tell us where we're wrong.
Top comments (1)
This is a really valuable empirical study demonstrating the practical benefits of retrieval-based memory over full-context prompts for long histories. I appreciate how you present both accuracy and token-efficiency trade-offs—showing a 14.2-point gain with 39× fewer tokens in LongMemEval, while honestly noting that full context still wins on smaller histories like LoCoMo.
I’d love to collaborate and explore further—experimenting with hybrid memory strategies, dynamic context sizing, or cross-session memory optimizations. Sharing approaches for temporal reasoning, multi-session tracking, and knowledge updates could help developers design memory engines that balance cost and accuracy effectively.
Would you be open to discussing a joint experiment or benchmarking extension to test memory strategies on other long-horizon conversational tasks?