My AI memory benchmark said 98.3%. The number was true — and worthless.

Daniel Nevoigt — Sat, 04 Jul 2026 10:06:49 +0000

In my last post I introduced Bastra Recall — an MIT-licensed MCP memory server that gives Claude persistent memory as plain Markdown in a local Obsidian vault. I promised a follow-up on retrieval and benchmarking.
Here it is. It starts with me being wrong.
The 98.3% that meant nothing
Early on, I ran an eval against my real vault: 59 memories, and for each one I used its own trigger phrase as the query. Result:

Recall@1: 98.3% (58/59)
Recall@3: 100%
MRR: 0.992

I was pleased for about a week. Then it sank in: this benchmark is a tautology. Every memory in Recall carries a recall_when field — trigger phrases describing when it should resurface. Querying each memory with its own trigger is like testing a search engine by searching for the exact title of the page you want. Of course it wins.
The number was real. It just didn't measure the thing that matters: does the right memory come back when a future session describes the situation in completely different words? Nobody re-types their trigger phrase weeks later. They paraphrase, they switch languages, they half-remember.
So I built a benchmark designed to hurt.
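For reference, a minimal sketch of how the three headline metrics fall out of one rank per query. The rank list is an assumption inferred from the reported figures: 58 targets at rank 1, and the single miss at rank 2 (rank 3 would give an MRR of about 0.989, not 0.992). The function names are illustrative and not part of Bastra Recall.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of queries whose target memory sits at rank <= k.

    Each entry is the 1-based rank of the correct memory for one query,
    or None when the memory was not retrieved at all.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries; unretrieved targets count as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Assumed rank distribution, reconstructed from the reported numbers:
# 58 of 59 targets at rank 1, the remaining one at rank 2.
ranks = [1] * 58 + [2]

print(f"Recall@1: {recall_at_k(ranks, 1):.1%}")      # 98.3%
print(f"Recall@3: {recall_at_k(ranks, 3):.1%}")      # 100.0%
print(f"MRR: {mean_reciprocal_rank(ranks):.3f}")     # 0.992
```

The metrics are computed correctly either way. The tautology lives in how the queries were chosen, not in the arithmetic.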
A benchmark that can actually fail
The setup, on my real 381-memory vault:

6 persona agents — separate LLM agents, each with a fixed voice: terse German, verbose German, junior-dev English, senior-dev English, German/English code-switching ("Denglisch"), and one that quotes error messages verbatim. One agent per voice, specifically to avoid mode collapse where every paraphrase sounds the same.
Each persona rewrites queries for 30 stratified memories → 180 queries total, evaluated at k=3.
Queries are split into near (close to the original wording) and far (heavily paraphrased) — because near-queries were already at ceiling and tell you nothing.

This is meant to simulate the real failure mode: an AI session weeks later, describing a stored situation in its own words.
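A rough sketch of how a run like this could be scored per bucket and per persona. The record layout, the persona labels, and the toy data are all hypothetical; they only illustrate slicing Recall@3 by near/far and by voice, and are not the actual harness.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class QueryResult:
    persona: str         # hypothetical label, e.g. "junior-dev-en"
    bucket: str          # "near" or "far"
    rank: Optional[int]  # 1-based rank of the target, None if not retrieved


def recall_at_k_by(
    results: Iterable[QueryResult],
    key: Callable[[QueryResult], str],
    k: int = 3,
) -> Dict[str, float]:
    """Recall@k computed separately for each group returned by key()."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for result in results:
        group = key(result)
        totals[group] += 1
        if result.rank is not None and result.rank <= k:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Toy data for illustration only, not taken from the benchmark.
toy = [
    QueryResult("terse-de", "near", 1),
    QueryResult("terse-de", "far", 2),
    QueryResult("junior-dev-en", "far", None),
    QueryResult("junior-dev-en", "far", 5),
]

by_bucket = recall_at_k_by(toy, key=lambda r: r.bucket)
by_persona = recall_at_k_by(toy, key=lambda r: r.persona)
```

Reporting the far bucket on its own is the point of the split: a single pooled number would let the near queries, which are already at ceiling, mask the failures.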
What the honest numbers look like
Lexical search (BM25) alone: 63.1% Recall@3 on far queries. That's the truth behind the 98.3%. On heavily paraphrased queries, pure keyword search fails to put the right memory in the top 3 for 36.9% of queries, more than a third.
Four findings surprised me more:

Embeddings rescue exactly the hard cases. Adding a local embedding layer (Ollama + embeddinggemma, hybrid with BM25) lifted far-recall from 63.1% to 79.6% (+16.5pp), and cut "not retrieved at all" from 20 cases to 7 out of 103. The hardest voices gained the most — the junior-dev-English persona jumped from 40.0% to 73.3%. If your users phrase things differently than you do (different language, different experience level), that's where vectors earn their keep.
My favorite feature did nothing here. recall_when trigger phrases are the highest-weighted search field in Recall, and on near-queries they're great. On paraphrased far-queries at k=3, their measured lift was approximately zero — in every arm of the test. The tautology cut both ways: the feature looked heroic in the old benchmark precisely because the old benchmark was rigged in its favor.
Write-time paraphrases didn't help either. Recall can optionally generate paraphrases of a memory's triggers at save time (doc2query) and index them alongside — the idea being that the wording a future session will type might already be sitting in the index. Sounds like exactly the right lever against paraphrased queries. In this far-query profile, that arm produced no lift over plain BM25 (~63% Recall@3, level with the lexical baseline). Only dense vectors closed the gap. Lesson: a plausible retrieval idea is not a lift — measure it before you believe it.
The remaining gap isn't recall — it's ranking. In the hybrid arm, 96–97 of the 103 far-query targets were in the candidate pool, sitting at a mean rank around 2.3–2.6. The index finds them; the ordering doesn't always surface them first. That's a precision/re-ranking problem, which is a different (and later) fight.
One caveat, because honest benchmarking means stating it: the persona queries were generated from memory digests, so absolute numbers aren't comparable across different runs — the robust signal is the cross-arm comparison on identical queries.
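The post does not say how the BM25 and embedding rankings are combined in the hybrid arm. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration of the idea; it is an assumption, not a description of Recall's implementation. The memory ids are made up, and k=60 is the conventional default for RRF.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked id lists; each list adds 1 / (k + rank) per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one query. "mem-d" is found only by the
# vector side, which is the situation paraphrased queries tend to create.
bm25_ranking = ["mem-a", "mem-b", "mem-c"]
vector_ranking = ["mem-d", "mem-c", "mem-a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['mem-a', 'mem-c', 'mem-d', 'mem-b']
```

Memories that both rankers agree on rise to the top, while a memory only one ranker found still enters the candidate pool instead of being lost.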
What this changed in the product

BM25 stays the default. A fresh npx bastra-recall install gives you zero-setup lexical search — no model downloads, no daemon dependencies. For queries anywhere near your original wording, it's already at ceiling.
Embeddings are one config line away, fully local. If you run Ollama, hybrid search switches on; in my benchmark that lifted far-recall by about 16.5 points (63.1% to 79.6%). No cloud, no API key: the vectors are computed on your machine.
Re-ranking is on the roadmap, gated behind vault scale, because the data says that's where the remaining points live.

Takeaways if you're building retrieval for anything

If your eval queries are derived from your index fields, your benchmark is a tautology. You're measuring string overlap, not retrieval.
Test paraphrase survival. The realistic query is written weeks later, by someone (or something) that doesn't remember your exact words. Multiple voices, multiple languages if that's your reality.
Separate "not retrieved" from "mis-ranked." They look identical in a Recall@k number and need completely different fixes.
Publish the number that hurts. 63.1% is a more useful fact than 98.3% ever was.
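To make the "not retrieved" versus "mis-ranked" split concrete, here is a small sketch that classifies each query outcome, followed by the counts implied by the figures reported above. The implied counts are derived, not measured: they assume the Recall@3 percentages and the "not retrieved at all" counts refer to the same 103 far queries.

```python
from collections import Counter
from typing import Iterable, Optional


def classify(rank: Optional[int], k: int = 3) -> str:
    """Label one query outcome given the 1-based rank of its target."""
    if rank is None:
        return "not_retrieved"  # target never entered the candidate pool
    return "hit" if rank <= k else "mis_ranked"


def breakdown(ranks: Iterable[Optional[int]], k: int = 3) -> Counter:
    """Count hits, mis-ranked targets, and unretrieved targets."""
    return Counter(classify(rank, k) for rank in ranks)


def implied_mis_ranked(total: int, recall_at_k: float, not_retrieved: int) -> int:
    """Targets that were retrieved but landed below the cutoff."""
    hits = round(total * recall_at_k)
    return total - hits - not_retrieved


# Derived from the reported far-query numbers (103 queries, k=3).
bm25_mis_ranked = implied_mis_ranked(103, 0.631, 20)   # 103 - 65 - 20 = 18
hybrid_mis_ranked = implied_mis_ranked(103, 0.796, 7)  # 103 - 82 - 7 = 14
```

Under those assumptions, the hybrid arm's remaining 21 misses split into 7 retrieval failures and 14 ranking failures, which matches the observation that most of the remaining gap is ordering.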

Try it

npx bastra-recall install

That starts a guided setup — pick your vault, your AI clients, and (optionally) semantic recall from selection menus, no flags needed. If you'd rather skip the questions:

npx bastra-recall install all

Still early (0.7.6), still macOS/Apple Silicon/Node 22+, still MIT: github.com/n0mad-ai/bastra-recall

If you've benchmarked retrieval for an AI memory system — or think my methodology has a hole in it — tell me in the comments. The last time I questioned my own numbers, the product got measurably better. And if honest benchmarks are your thing, a star helps other people find the repo.

Claude forgets everything between sessions. Here's how I fixed it.

Daniel Nevoigt — Tue, 23 Jun 2026 07:25:18 +0000

Every Claude session starts from zero.
You spend an hour explaining your architecture, your naming conventions, the three decisions you already made and don't want re-litigated. You close the tab. Next morning you open a new chat and Claude greets you like a stranger. You explain it all again.
After the fortieth time, I stopped re-explaining and built a fix. It's open source, MIT-licensed, and installs with one command. This post is the 5-minute version of how it works and how to run it yourself.
The actual problem
LLMs are stateless. Each conversation is a clean slate — by design. "Memory" features that do exist usually mean one of two things:

A Redis/Valkey server you have to stand up and keep running, or
A managed cloud service where you sign up, get an API key, and your context lives on someone else's infrastructure.

Both work. But both mean your project decisions, code snippets, and the occasional credential you pasted while debugging now sit on a server you don't control. For a tool whose entire job is to remember everything you tell your AI, that tradeoff bothered me.
I wanted memory that stays on my disk.
The approach: your notes are the database
Bastra Recall is an MCP server (Model Context Protocol — the open standard Claude uses to talk to external tools). Instead of a database, it writes memories as plain Markdown into a local Obsidian vault — a folder of .md files on your machine.
That design choice does a few things at once:

The data is yours and it's readable. Open any memory in a text editor. No export tool, no lock-in. If you delete the vault folder, the memory is gone — fully under your control.
One daemon, every tool. The same daemon feeds Claude Code, Claude Desktop, and Cursor. A decision you store in one shows up in the others.
No server to babysit. No Redis, no cloud account, no API key.

When you tell Claude "remember that we use Drizzle, not Prisma, on this project," that fact lands as a Markdown note. Next session — new tab, days later — Claude retrieves it automatically before answering.
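As an illustration of what "a fact lands as a Markdown note" can look like, here is a sketch that renders a note with front matter. The layout is hypothetical: the only field name taken from Recall is recall_when, and the real schema may well differ. Values are quoted with json.dumps so that colons or quotes in a phrase don't break the front matter.

```python
import json
from typing import List


def render_memory(title: str, body: str, recall_when: List[str]) -> str:
    """Render one memory as Markdown with a YAML-style front matter block.

    Hypothetical layout for illustration; not Bastra Recall's actual format.
    """
    lines = ["---", f"title: {json.dumps(title, ensure_ascii=False)}", "recall_when:"]
    lines += [f"  - {json.dumps(phrase, ensure_ascii=False)}" for phrase in recall_when]
    lines += ["---", "", body, ""]
    return "\n".join(lines)


note = render_memory(
    title="Use Drizzle, not Prisma",
    body="We use Drizzle, not Prisma, on this project.",
    recall_when=["choosing an ORM", "database access layer"],
)
print(note)
```

Because the result is plain text, any editor, grep, or version control system can read it without an export step.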
Install it (the whole thing)
One command patches the MCP config for every AI tool it detects, idempotently and with a backup:
npx bastra-recall install all --vault /absolute/path/to/your/vault
Then verify the registrations:
npx bastra-recall doctor
Restart Claude Code / Desktop / Cursor, and memory is live. That's it.
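What "idempotently and with a backup" means in practice can be sketched as follows. This is not the installer's code. It assumes a JSON config file with a top-level mcpServers object, which is the shape Claude Desktop and Cursor use, and the server name and entry are placeholders rather than the real Bastra Recall registration.

```python
import json
import shutil
from pathlib import Path


def register_server(config_path: Path, name: str, entry: dict) -> bool:
    """Add an MCP server entry to a JSON config; return True if the file changed.

    Running it twice with the same arguments leaves the file untouched the
    second time, and the previous file is copied to <name>.bak before any write.
    """
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    if servers.get(name) == entry:
        return False  # already registered exactly like this: nothing to do
    if config_path.exists():
        backup = config_path.with_name(config_path.name + ".bak")
        shutil.copy2(config_path, backup)  # keep the pre-patch version
    servers[name] = entry
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return True


# Placeholder entry for illustration; not the real registration.
example_entry = {"command": "example-memory-server", "args": []}
```

Checking for an identical entry before writing is what keeps repeated installs from piling up backups or reordering an otherwise unchanged file.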
Honest constraints, up front:

macOS, Apple Silicon, Node 22+ for now. Linux/Windows are on the roadmap.
It's early — currently 0.7.0-beta.1. Working, in daily use by me, but beta.
Expect rough edges, including during install. This is genuinely early software and something may break on your setup that never broke on mine. If it does, that's useful to me — please tell me exactly what went wrong, either as a comment on this post or as a GitHub issue. The more precise (OS version, Node version, the command you ran, the error you saw), the faster I can fix it.

How retrieval works (the 30-second version)
Storing is easy — anything is a file. The hard part is pulling the right memory back without flooding Claude's context with junk. Recall ranks stored memories by relevance to the current conversation and injects only the top matches, so you get the decision you need without burning your context window on everything you've ever said.
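To give a feel for "rank by relevance, inject only the top matches", here is a minimal BM25 scorer with a top-k cutoff. It is a textbook sketch under simple assumptions (whitespace-style tokenization, one text field per memory, common default parameters), not Recall's actual ranking code, and the three toy memories are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; deliberately naive."""
    return re.findall(r"\w+", text.lower())


def bm25_top_k(query: str, docs: Dict[str, str], k: int = 3,
               k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return up to k doc ids ranked by Okapi BM25 score for the query."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document

    query_terms = set(tokenize(query))
    scores: Dict[str, float] = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        if score > 0:
            scores[doc_id] = score  # documents with no overlap are dropped
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]


# Invented toy memories for illustration.
memories = {
    "orm": "We use Drizzle, not Prisma, on this project",
    "naming": "Database tables use snake_case names",
    "deploy": "Deploys go through the staging branch first",
}

print(bm25_top_k("which ORM do we use, Drizzle or Prisma?", memories))
# ['orm', 'naming']
print(bm25_top_k("welche Bibliothek für Datenbankzugriff", memories))
# []
```

The second query asks for the same memory in German and shares no tokens with it, so a purely lexical scorer returns nothing.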
If you want to go deeper on the retrieval and benchmarking, that's the next post. This one is just: here's the problem, here's a thing that fixes it, here's how to run it.
Try it / tear it apart
Repo (MIT): github.com/n0mad-ai/bastra-recall
If you've solved AI memory a different way, I want to hear it — especially if you think the local-Markdown approach is wrong. And if it saves you from explaining your stack for the forty-first time, a star helps other people find it.
It works the same in Cursor and any other MCP client, not just Claude. But Claude is where I felt the problem first.

DEV Community: Daniel Nevoigt
