DEV Community

wontopos

WMB-100K: We built the first 100,000-turn benchmark for AI memory systems

Most AI memory benchmarks are surprisingly small.

LOCOMO tests 600 turns. LongMemEval tests around 1,000. That's roughly one week of casual usage.

But real AI companions, assistants, and memory systems don't get used for a week — they get used for months. Years. What happens to memory accuracy at that scale? Nobody had tested it.

So we built WMB-100K.

What it is

WMB-100K is an open-source benchmark that tests AI memory systems at 100,000 turns — roughly a year of heavy usage. It measures one thing: can your memory system find the right information when it matters?

Not LLM reasoning. Not response quality. Just memory.
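
For a sense of the scale gap, here is a quick back-of-the-envelope calculation of the daily turn rates implied by the figures quoted above. It uses only the numbers in this post (600 and 1,000 turns over about a week, 100,000 turns over about a year). The 7-day and 365-day windows are assumptions that make "a week" and "a year" concrete.

```python
# Implied daily usage, derived only from the figures quoted in this post.
DAYS_PER_WEEK = 7    # assumption: "one week" taken literally
DAYS_PER_YEAR = 365  # assumption: "a year" taken literally


def turns_per_day(total_turns: int, days: int) -> float:
    """Average number of turns per day over the given window."""
    return total_turns / days


locomo_rate = turns_per_day(600, DAYS_PER_WEEK)
longmemeval_rate = turns_per_day(1_000, DAYS_PER_WEEK)
wmb_rate = turns_per_day(100_000, DAYS_PER_YEAR)

print(f"LOCOMO:      ~{locomo_rate:.0f} turns/day for a week")
print(f"LongMemEval: ~{longmemeval_rate:.0f} turns/day for a week")
print(f"WMB-100K:    ~{wmb_rate:.0f} turns/day for a year")
```

So the daily rate is only about two to three times higher, while the total history a memory system has to search is 100 times longer or more.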

What makes it different

Three things set WMB-100K apart from existing benchmarks:

Scale: 100,000 turns across 10 life categories (daily life, relationships, health, career, finances, and more)
Difficulty levels: 3,134 questions across 5 levels, from simple fact lookup to multi-hop reasoning
False memory probes: 430+ questions about things that were never mentioned. "I don't know" is the correct answer, and confidently giving wrong information costs 0.25 points.

Why false memory matters

Every other benchmark rewards correct answers. WMB-100K also punishes hallucination.

An AI that forgets something is annoying. An AI that remembers something that never happened is dangerous.
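
To make the penalty concrete, here is a minimal sketch of how a scoring rule like this could look. It is not the benchmark's actual scorer: the only value taken from this post is the -0.25 penalty. The `Question` class, the `score_answer` function, the 1.0 points for a correct answer, the 0.0 for a missed fact, and the substring match (standing in for the OpenAI-based grading) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

FALSE_MEMORY_PENALTY = -0.25  # stated in this post
CORRECT_POINTS = 1.0          # assumption: the post does not give this value
ABSTAIN_PHRASES = ("i don't know", "i do not know")  # assumption


@dataclass
class Question:
    """Hypothetical question record, not the benchmark's real schema."""
    text: str
    expected: Optional[str]  # None marks a false memory probe (never mentioned)


def is_abstention(answer: str) -> bool:
    """True if the answer admits not knowing."""
    normalized = answer.strip().lower()
    return any(phrase in normalized for phrase in ABSTAIN_PHRASES)


def score_answer(question: Question, answer: str) -> float:
    abstained = is_abstention(answer)
    if question.expected is None:
        # False memory probe: abstaining is correct, anything else is a hallucination.
        return CORRECT_POINTS if abstained else FALSE_MEMORY_PENALTY
    if abstained:
        return 0.0  # assumption: a missed fact earns nothing but is not penalized
    # Substring match stands in for the benchmark's real grading step.
    return CORRECT_POINTS if question.expected.lower() in answer.lower() else 0.0


# Made-up examples: two probes about a dog that was never mentioned, two real facts.
results = [
    (Question("What is my dog's name?", None), "I don't know."),
    (Question("What is my dog's name?", None), "Your dog is named Rex."),
    (Question("Where do I work?", "Acme"), "You work at Acme."),
    (Question("Where do I work?", "Acme"), "I don't know."),
]
total = sum(score_answer(q, a) for q, a in results)
print(total)  # 1.0 - 0.25 + 1.0 + 0.0 = 1.75
```

Under a rule like this, a system that answers every question confidently loses points on each probe, while a system that admits it doesn't know is rewarded there.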

How to run it

The dataset is already included, so you only need an OpenAI API key for scoring.

Total cost: ~$0.07

Results so far

We tested memory systems at 100,000 turns. Accuracy drops to near zero at this scale — in ways smaller benchmarks never catch. Systems that score 66% at 600 turns flatline at 100K.

We're testing more systems and will be updating the leaderboard. If you run it against your own system, drop your results in the GitHub issues — or leave a comment below. Would love to see how different systems hold up.

If you find it useful, a ⭐ on GitHub goes a long way.

GitHub: https://github.com/Irina1920/WMB-100K
