I maintain mnemo, an MCP-native embedded memory database for agents. Its read path is retrieval: hybrid search (vector + BM25 + graph + recency) fused with RRF. This week two papers argued that retrieval-from-a-bank is the wrong default for long-horizon agents. Here is how I'm reading them as the person whose product is implicated.
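For readers who haven't met RRF (Reciprocal Rank Fusion): each retriever contributes `1 / (k + rank)` for every item it returns, and the sums decide the final order. The sketch below is generic, not mnemo's actual code. The function name, the memory ids, and the four example rankings are made up for illustration, and `k = 60` is just the constant commonly used with RRF.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    Each list adds 1 / (k + rank) to an item's score, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))

# Hypothetical per-signal rankings (best match first).
vector_hits = ["m3", "m1", "m2"]
bm25_hits = ["m1", "m3", "m4"]
graph_hits = ["m1", "m2"]
recency_hits = ["m4", "m3", "m1"]

fused = rrf_fuse([vector_hits, bm25_hits, graph_hits, recency_hits])
print(fused)
```

Note that recency is only one vote among four here, so a stale entry that scores well on the other three signals can still outrank the current one.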
The two papers
Mem-π (ServiceNow + Mila, arXiv:2605.21463) trains a separate model to generate guidance on demand instead of retrieving static entries. It decides when to emit guidance and what to emit, and it can abstain. Result: >30% relative improvement on web-navigation tasks over retrieval-based and prior RL memory baselines.
MINTEval (UNC, arXiv:2605.18565, code available) benchmarks memory under interference: facts get revised and contradicted across contexts of up to 1.8M tokens. Across 7 systems (long-context, RAG, memory frameworks), average accuracy is 27.9%, with the worst results on multi-target aggregation. Diagnosis: the bottleneck is retrieval and memory construction, and it gets worse as updates pile up.
What they get right
Static recall is the easy half. The hard half is the stale-fact case:
t0: user budget = 5000
t1: budget = 7000
t2: budget = 4000 <- current truth
query: "what is the budget?"
naive top-k similarity -> returns all three, ranks by cosine, not by recency
A vector index knows "similar," not "current." That gap is where MINTEval's 27.9% lives, and I've hit it in production.
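The trace above can be reproduced with a toy example. The two-dimensional "embeddings" below are invented purely for illustration (a real embedding model would produce different numbers), but they show the shape of the failure: the three budget entries are near-duplicates in vector space, so cosine ranking has no reason to put the newest one first.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# (timestamp, text, toy embedding). The embeddings are made up.
memories = [
    (0, "user budget = 5000", [1.0, 0.0]),
    (1, "budget = 7000", [0.9, 0.1]),
    (2, "budget = 4000", [0.8, 0.3]),  # current truth
]
query_embedding = [1.0, 0.0]  # stands in for "what is the budget?"

# Naive top-k: rank purely by similarity and ignore timestamps.
by_cosine = sorted(
    memories, key=lambda m: cosine(query_embedding, m[2]), reverse=True
)
for ts, text, emb in by_cosine:
    print(f"t{ts}  {cosine(query_embedding, emb):.3f}  {text}")
```

With these toy vectors the oldest entry ranks first and the current one ranks last, even though all three come back in the top-k.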
What I'm not switching for
Generation isn't free:
- a model call on the hot path of every recall
- more tokens
- a failure mode retrieval structurally cannot have: returning a memory that was never stored
A retrieval system can return the wrong entry. It cannot return a nonexistent one. For DPDP/HIPAA workloads with an audit requirement, an auditable retrieval log with a hash-chain beats an unauditable generation. On web navigation, where there's no auditor, generation may win. Different workloads, different defaults.
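To make "retrieval log with a hash-chain" concrete, here is a minimal sketch, assuming each log entry records the query, the ids that were returned, and the hash of the previous entry. The field names and the `append_entry` / `verify` helpers are hypothetical and say nothing about mnemo's real log format. The point is only that editing any past entry breaks every hash after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_entry(log, query, returned_ids):
    """Append a retrieval record that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"query": query, "returned_ids": returned_ids, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})

def verify(log):
    """Return True only if every entry's hash and back-link are intact."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("query", "returned_ids", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, "what is the budget?", ["m3"])
append_entry(audit_log, "who approved it?", ["m7", "m2"])
print(verify(audit_log))
```

A generated memory has no equivalent record to point at: there is no stored id to log, only the model's output.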
What I'm actually changing
Two narrow changes, both pointed at by the papers:
- Interference-eval harness — reproduce MINTEval's setup at small scale: revise a fact K times, query the latest, measure current-fact accuracy under K revisions instead of recall@k on a static set.
- Which-fact-is-current resolver — before candidates hit the LLM, resolve version conflicts on the timeline the DB already stores: prefer the most recent uncontradicted write, surface the supersession chain as evidence. Governed retrieval, not generation. Audit log intact.
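Both changes fit in one small sketch. It assumes each stored write carries a key, a value, a timestamp, and a flag saying whether a later write contradicted it. The `Write` type, the `retracted` flag, and all function names are hypothetical, not mnemo's schema. The harness revises one fact K times, asks for the latest value, and reports current-fact accuracy for any answer function you plug in.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    key: str
    value: str
    ts: int
    retracted: bool = False  # hypothetical: set when a later write contradicts this one

def resolve_current(candidates, key):
    """Return (current write or None, supersession chain ordered oldest to newest)."""
    chain = sorted((w for w in candidates if w.key == key), key=lambda w: w.ts)
    live = [w for w in chain if not w.retracted]
    return (live[-1] if live else None), chain

def resolver_answer(writes, key):
    current, _chain = resolve_current(writes, key)
    return current.value if current else None

def oldest_answer(writes, key):
    # Stand-in for a ranker that ignores time: returns the first stored write.
    return next((w.value for w in writes if w.key == key), None)

def current_fact_accuracy(answer_fn, max_k):
    """For K = 1..max_k, revise a fact K times and check the answer is the latest."""
    correct = 0
    for k in range(1, max_k + 1):
        writes = [Write("budget", str(1000 * i), ts=i) for i in range(k + 1)]
        if answer_fn(writes, "budget") == str(1000 * k):
            correct += 1
    return correct / max_k

history = [
    Write("budget", "5000", ts=0),
    Write("budget", "7000", ts=1),
    Write("budget", "4000", ts=2),
]
current, chain = resolve_current(history, "budget")
print(current.value, [w.value for w in chain])
print(current_fact_accuracy(resolver_answer, 5), current_fact_accuracy(oldest_answer, 5))
```

This toy harness is far easier than MINTEval: timestamps are clean and there is a single key. A realistic version would run the full retrieval pipeline as `answer_fn` and add contradictions across contexts, but the metric has the same shape.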
Takeaway
Retrieval isn't dead. Naive retrieval is. The product is the governed middle: retrieval that knows which fact is current and can prove where every answer came from.
If you run agent memory in prod, drop a comment: more "couldn't find it" failures, or more "found the wrong version" failures? That answer decides what to build first.