I build a retrieval-first agent memory DB. Two papers just said retrieval is the wrong default.

#ai #agents #mcp #machinelearning

I maintain mnemo, an MCP-native embedded memory database for agents. Its read path is retrieval: hybrid search (vector + BM25 + graph + recency) fused with RRF. This week two papers argued that retrieval-from-a-bank is the wrong default for long-horizon agents. Here is how I'm reading them as the person whose product is implicated.
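
For readers who haven't met RRF, here is a minimal sketch of reciprocal rank fusion over several ranked lists. It is not mnemo's actual code: the function name `rrf_fuse`, the memory ids, and the constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical per-signal rankings for one query (best hit first).
vector_hits = ["m3", "m1", "m2"]
bm25_hits = ["m1", "m3", "m4"]
graph_hits = ["m1", "m2"]
recency_hits = ["m4", "m3", "m1"]

fused = rrf_fuse([vector_hits, bm25_hits, graph_hits, recency_hits])
print(fused)  # ['m1', 'm3', 'm4', 'm2']
```

RRF only looks at ranks, so the four signals never need to share a score scale. It also means recency is just one vote among four, which matters for the stale-fact case discussed later.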

The two papers

Mem-π (ServiceNow + Mila, arXiv:2605.21463) trains a separate model to generate guidance on demand instead of retrieving static entries. It decides when to emit guidance and what to emit, and it can abstain. Result: >30% relative improvement on web-navigation tasks over retrieval-based and prior RL memory baselines.

MINTEval (UNC, arXiv:2605.18565, code available) benchmarks memory under interference: facts get revised and contradicted across contexts up to 1.8M tokens. Across 7 systems (long-context, RAG, memory frameworks): 27.9% average accuracy, worst on multi-target aggregation. Diagnosis: the bottleneck is retrieval + memory construction, and it gets worse as updates pile up.

What they get right

Static recall is the easy half. The hard half is the stale-fact case:

t0:  user budget = 5000
t1:  budget = 7000
t2:  budget = 4000   <- current truth
query: "what is the budget?"
naive top-k similarity -> returns all three, ranks by cosine, not by recency

A vector index knows "similar," not "current." That gap is where MINTEval's 27.9% lives, and I've hit it in production.
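
The same failure in a few lines of Python, as a toy sketch. The three-dimensional "embeddings" are made up by hand so the example runs without a model; real vectors will differ, but nothing in cosine similarity looks at the timestamp.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# (timestamp, text, toy embedding). The vectors are invented for illustration.
memories = [
    (0, "user budget = 5000", [0.9, 0.1, 0.2]),
    (1, "budget = 7000", [0.8, 0.3, 0.1]),
    (2, "budget = 4000", [0.7, 0.2, 0.4]),
]
query = [0.9, 0.1, 0.1]  # toy embedding of "what is the budget?"

# Naive top-k: rank purely by similarity to the query.
by_cosine = sorted(memories, key=lambda m: cosine(query, m[2]), reverse=True)
# What the user actually wants: the latest write.
by_recency = max(memories, key=lambda m: m[0])

print(by_cosine[0][1])  # user budget = 5000  (stale)
print(by_recency[1])    # budget = 4000       (current)
```

With these toy vectors the oldest write wins on similarity and the current one ranks last. That ordering is an artifact of the invented numbers, but the point holds: the ranking and the truth are decided by different fields.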

What I'm not switching for

Generation isn't free:

a model call on the hot path of every recall
more tokens
a failure mode retrieval structurally cannot have: a memory that was never stored

A retrieval system can return the wrong entry. It cannot return a nonexistent one. For DPDP/HIPAA workloads with an audit requirement, a hash-chained retrieval log that can be audited beats a generated answer that cannot. On web navigation, where there's no auditor, generation may win. Different workloads, different defaults.
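
To make "hash-chained retrieval log" concrete, here is a minimal sketch using only the standard library. The entry fields (`query`, `returned_ids`, `prev_hash`, `hash`) and the function names are assumptions for illustration, not mnemo's log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log, query, returned_ids):
    """Append a retrieval record that commits to the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"query": query, "returned_ids": returned_ids, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify(log):
    """Recompute every hash and check each link back to the genesis value."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("query", "returned_ids", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "what is the budget?", ["m3"])
append_entry(audit_log, "who approved it?", ["m7", "m2"])
print(verify(audit_log))  # True
```

Editing any earlier entry changes its hash, which breaks the `prev_hash` link of every entry after it. A real deployment would also need to anchor the latest hash somewhere the writer cannot rewrite; this sketch leaves that out.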

What I'm actually changing

Two narrow changes, both pointed at by the papers:

Interference-eval harness: reproduce MINTEval's setup at small scale. Revise a fact K times, query the latest, and measure current-fact accuracy under K revisions instead of recall@k on a static set.
Which-fact-is-current resolver: before candidates hit the LLM, resolve version conflicts on the timeline the DB already stores. Prefer the most recent uncontradicted write and surface the supersession chain as evidence. Governed retrieval, not generation. Audit log intact.
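
A rough sketch of both pieces together, under loud assumptions: the `Write` record, its `retracted` flag, and the random `naive_pick` baseline are hypothetical stand-ins, not mnemo's schema and not MINTEval's protocol. The harness revises one fact K times and scores whether the picker returns the latest write.

```python
import random
from dataclasses import dataclass


@dataclass
class Write:
    key: str
    value: int
    ts: int
    retracted: bool = False  # hypothetical flag: a later write contradicted this one


def resolve_current(candidates, key):
    """Return (current write, supersession chain oldest to newest) for a key."""
    chain = sorted((w for w in candidates if w.key == key), key=lambda w: w.ts)
    live = [w for w in chain if not w.retracted]
    return (live[-1] if live else None), chain


def resolver_pick(candidates, key, rng):
    """Adapter so the resolver has the same signature as the baseline."""
    return resolve_current(candidates, key)[0]


def naive_pick(candidates, key, rng):
    """Stand-in for similarity-only retrieval: every version looks equally good."""
    return rng.choice([w for w in candidates if w.key == key])


def current_fact_accuracy(pick, k_revisions, trials=200, seed=0):
    """Fraction of trials where `pick` returns the latest of k_revisions + 1 writes."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        writes = [Write("budget", rng.randint(1000, 9999), ts)
                  for ts in range(k_revisions + 1)]
        rng.shuffle(writes)  # candidate order carries no time information
        answer = pick(writes, "budget", rng)
        correct += answer.ts == k_revisions
    return correct / trials


history = [Write("budget", 5000, 0), Write("budget", 7000, 1), Write("budget", 4000, 2)]
current, chain = resolve_current(history, "budget")
print(current.value, [w.value for w in chain])  # 4000 [5000, 7000, 4000]
print(current_fact_accuracy(resolver_pick, k_revisions=5))  # 1.0
```

The resolver scores perfectly here only because the toy data has clean timestamps and no retractions. The useful number is how far a real retriever falls below it as K grows, which is what the harness is meant to expose.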

Takeaway

Retrieval isn't dead. Naive retrieval is. The product is the governed middle: retrieval that knows which fact is current and can prove where every answer came from.

If you run agent memory in prod, drop a comment: more "couldn't find it" failures, or more "found the wrong version" failures? That answer decides what to build first.

Top comments (2)

Harjot Singh • May 31

This is a great honest-builder post - building a retrieval-first memory DB and then engaging with papers that challenge your core default is exactly the right move, instead of defending the architecture you already shipped. The nuance the papers are usually pointing at: retrieval-by-default treats memory as a flat similarity lookup, but a lot of agent tasks need structure (recency, causality, "what did I already try and fail"), and pure top-k semantic retrieval can surface plausible-but-irrelevant context that actively misleads the agent. The fix isn't "retrieval is wrong," it's "retrieval is one strategy, and the policy for when to retrieve vs carry-forward vs summarize matters more than the retriever."

The thing I'd hold onto from your own work: a retrieval layer you control still beats a black-box context window, because you can measure and tune what gets surfaced. That measure-and-gate-what-you-surface instinct is core to how I build Moonshift, the thing I work on - a multi-agent pipeline that takes a prompt to a deployed SaaS, where what context an agent gets is deliberate and verified, not just "stuff the nearest chunks in." Multi-model routing keeps a build ~$3 flat, first run's free no card. Really like that you're updating on the evidence. Which paper's argument moved you most - the recency/structure one, or the "retrieval adds noise" one? Curious whether it changes your default or just adds a routing layer on top.
