DEV Community

Cover image for The Whitepaper Thunderdome: EvoMemBench vs. Remembering More, Risking More
Vektor Memory
Vektor Memory

Posted on

The Whitepaper Thunderdome: EvoMemBench vs. Remembering More, Risking More

Two papers. One ring. No referees. Real buttered popcorn is mandatory.

12 min read · 4 parts · Published by Vektor Memory

Part 1: The Peculiar Feeling of Progress at 2am
Welcome back to the Thunderdome.

If you missed the first edition, the premise is simple: two papers, both freshly dropped on arXiv, both touching some part of the same problem — agent memory — and both deserving more than a polite summary blog post that ends with “fascinating implications for the field.”

So instead of that, we battle politely in the cage—risk vs. measurement.

I want to tell you something about how scientific progress actually feels from the inside, because I think people on the outside have a slightly Hollywood version of it. The Hollywood version involves a lot of eureka moments, blackboards, and people bursting into lecture halls mid-sentence with a single elegant proof.

With Matt Damon saying, "How do you like those Apples?

The real version involves reading a paper at 2am, putting it down, picking it up again, muttering “they’re not wrong, but they’re not asking the right question either,” and then going back to your own work before accepting that both things can be true at the same time from different angles.

That’s what happened with these two papers.

They landed in the same week. They don’t cite each other — there wasn’t time. They’re not aware of each other. And yet they are, in a structural sense, having an argument. One paper says: we don’t know how to measure memory, and until we do, nothing else matters. The other says: memory will quietly poison your agent over time, and you won’t notice until it’s too late.

Both of these statements are correct. Both of them are also, in isolation, somewhat different.

There is a particular type of scientist — and I mean this as an observation, not a criticism — who is constitutionally incapable of building a thing until the measurement problem is fully solved. They will spend three years designing the ruler before they’ll saw the first plank. The ruler will be extraordinarily good. The house will remain a thought experiment.

And then there is the other type, the builder, who will nail the planks together and live in the house for six months before noticing that two load-bearing walls are a little crooked, after which they will fix them, often correctly, sometimes while still inside the house, occasionally by accident.

Science needs both types as people need solutions; innovation leads the enterprise, the great reveal on stage at their yearly conference. Engineering mostly needs the second type, but pretends to need the first type because it sounds more respectable at funding meetings.

What I like about both of these papers is that neither is purely one or the other. EvoMemBench is a measurement paper with real architectural instincts behind it. “Remembering More, Risking More” is a risk paper with meticulous empirical grounding. They are, genuinely, measuring and building simultaneously, just in very different directions.

Gather around as we enter the Thunderdome.

Part 2: The Contestants — What They’re Actually Arguing
In the left corner: EvoMemBench — Benchmarking Agent Memory from a Self-Evolving Perspective (arXiv:2605.18421, HKUST Guangzhou, Beijing University of Posts and Telecommunications, Beijing Institute of Technology, May 2026).

In the right corner: Remembering More, Risking More — Longitudinal Safety Risks in Memory-Equipped LLM Agents (arXiv:2605.17830, UC Davis, University of Michigan, May 2026).

Same week. Very different anxieties.

EvoMemBench’s argument, stripped down:

The memory benchmarking landscape is broken in a specific, structural way, and here is exactly how.

Every existing benchmark evaluates memory along one axis. Either it’s in-episode or cross-episode. Either it’s knowledge-oriented or execution-oriented. LoCoMo, LongMemEval, MemoryAgentBench — these are all testing the same narrow slice: can you retain and retrieve conversational facts within or across a few sessions? That’s useful. It’s also roughly equivalent to testing whether a car can start by only ever checking the ignition. The car might still fail to turn left.

What EvoMemBench proposes is a proper 2×2 grid:

In-episode knowledge evolution: can you retain and revise information during a single task? The user says “I love pears” halfway through a long conversation, then later corrects “sorry, I meant peas” — does your memory system catch the revision, or does it confidently continue building a preference model around the wrong legume?
In-episode execution evolution: can you maintain task-relevant state across multi-step tool use? Not just facts about the user, but what step you’re on, what the tool last returned, what the current partial result is?
Cross-episode knowledge evolution: can you accumulate reusable facts and rules across completely separate tasks that share the same underlying context?
Cross-episode execution evolution: can you distill procedural experience — not just what happened, but how to do things better — and apply it to novel tasks?
This is a significantly more demanding taxonomy than anything currently published, and it reveals something uncomfortable: no existing memory system is good at all four. Retrieval-based methods dominate knowledge-intensive settings and fall apart on execution tasks. Procedural memory works well on execution tasks but only when the stored procedures match the task structure closely. Long-context baselines — just giving the model the full history — remain competitive across nearly every setting, which is a polite way of saying that despite years of memory research, “just make the window bigger” still wins in many conditions.

The paper tests fifteen memory methods under this taxonomy and finds consistent divergence between what we think memory systems are doing and what they’re actually doing. Memory hurts performance in some conditions — notably when retrieval is unreliable and the context window would have been sufficient — and helps dramatically in others, but the shape of “when it helps” is more specific than the field has acknowledged.

The philosophy: you cannot improve what you cannot measure, and we have been measuring the wrong things, at the wrong granularity, for several years.

Remembering More, Risking More’s argument, stripped down:

You have built a persistent memory system. It works in the ways you tested. Now run it for three months.

That’s the experiment. And the results are alarming in a quiet, bureaucratic kind of way — not a sudden catastrophic failure, but a slow, compounding accumulation of small problems that the standard evaluation setup was structurally incapable of detecting.

The paper’s core observation is deceptively simple: safety evaluations for memory-equipped LLM agents almost universally measure within-task safety. Does the agent behave safely when completing this particular scenario, often with adversarial conditions baked in — a prompt injection here, a manipulative instruction there? That’s a real thing to measure. It’s also wildly insufficient.

Because memory changes the threat surface over time. An agent with persistent memory has a growing record of prior interactions, stored preferences, accumulated facts, and inferred user models. Any of those historical memories can be adversarially planted, semantically drifted, or silently updated by a malicious actor operating across sessions. The attack doesn’t have to succeed immediately. It can succeed gradually.

The paper introduces what they call longitudinal safety evaluation: running agents across multi-session scenarios where memory contamination can accumulate, and measuring whether safety properties degrade over time. They find they do. Agents that behaved safely in single-session evaluations began to exhibit measurable unsafe patterns after enough sessions — not because the model changed, but because the memory did.

Several specific failure modes emerge:

Memory persistence of unsafe context. A malicious instruction injected in session one can survive into sessions three and four, subtly conditioning agent responses in ways that would pass a single-session safety check. The contamination doesn’t look dangerous in isolation. It looks like context. The memory system dutifully preserves it.

Cross-session preference manipulation. An attacker operating across multiple benign-looking sessions can gradually build a false preference model in the agent’s memory — small nudges, each below detection threshold, accumulating into a systematic skew. By session ten, the agent has developed “preferences” it never actually observed.

Update-lag exploitation. Memory systems with delayed update cycles — where consolidation happens in background batch jobs rather than immediately — create temporal windows where a corrected fact hasn’t yet propagated but an older, incorrect version is still being retrieved. The agent is being misled by its own maintenance schedule.

What makes this paper particularly uncomfortable is that none of these failure modes require sophisticated attacks. They exploit the memory system doing exactly what it was designed to do: preserve, update, and retrieve information across sessions. The feature is the vulnerability.

The philosophy: your memory system is not just a capability. It is a new attack surface with a three-month lag between deployment and the point at which the problems become visible.

Part 3: The Actual Fight — Where They Diverge, Where They Overlap, and What’s Novel
Here is where it gets interesting.

What they agree on:

Both papers begin from the premise that the current evaluation infrastructure for agent memory is inadequate. EvoMemBench makes this argument from the capabilities side: we are not testing enough dimensions. “Remembering More, Risking More” makes it from the safety side: we are not testing the right time horizon. They are diagnosing the same infrastructure gap from opposite ends.

Both papers also arrive at a conclusion that the field finds slightly uncomfortable: long-running agents with persistent memory are a fundamentally different object from stateless models, and should be evaluated as such. You cannot characterise a long-running agent by how it performs on a single session. The state is the problem. The state is also the point.

Where they diverge:

EvoMemBench’s model of what makes memory hard is a structural model. There are different kinds of memory — in-episode vs. cross-episode, knowledge vs. execution — and they have different requirements, different failure modes, and different optimal architectures. The solution space is a matter of better design and better measurement: build the right taxonomy, test against it, build systems that pass.

“Remembering More, Risking More” has a temporal model of what makes memory hard. Even a well-designed system becomes dangerous given enough time, because its threat surface grows with its history. The solution space isn’t just better architecture — it’s active longitudinal monitoring, memory auditing, and what they call “commitment bounds”: explicit limits on how long a retrieved memory can influence agent behaviour before requiring revalidation.

These are compatible views but they pull in different directions. One says: classify memory needs better, build systems that match each class. The other says: no matter how well-classified your memory is, treat it as a liability that depreciates — or potentially corrupts — over time.

What’s novel:

EvoMemBench’s genuine contribution is the execution evolution quadrant. Every existing benchmark treats memory as a knowledge retrieval problem — facts, preferences, biographical details. EvoMemBench is the first paper I’ve seen to rigorously formalise execution state as a memory problem in its own right. The insight that a multi-step tool-use task requires the agent to maintain procedural working memory — not just declarative facts but what-I-was-doing and what-just-happened — and that this is empirically distinct from knowledge memory, is genuinely new framing. The embodied AI benchmarks gesture at this, but EvoMemBench is the first paper to operationalise the distinction cleanly.

“Remembering More, Risking More” contributes the longitudinal threat taxonomy. Prior work on adversarial memory attacks focuses on injection and extraction — put a bad thing in, pull a secret thing out. The cross-session accumulation attacks here are different: they are patient, they are statistical, and they are indistinguishable from normal memory operation when viewed locally. The “update-lag exploitation” finding in particular is new — it identifies a vulnerability class that is created specifically by the architecture choices of responsible, well-engineered memory systems. Better engineering creates the hole.

The verdict:

EvoMemBench wins on structural completeness. The 2×2 taxonomy is correct and overdue. Fifteen methods tested under a single unified protocol, with open-sourced code, is a genuine service to the field. If you are building a memory system and don’t have a plan for in-episode execution evolution, you now have a very polite piece of academic literature explaining why you should.

“Remembering More, Risking More” wins on urgency. The longitudinal contamination findings are the kind of result that should go in a warning box in every memory SDK’s documentation. The attack surface doesn’t get smaller as your agent gets more capable. It gets larger.

Neither paper renders the other obsolete. They are measuring the capability space and the threat space simultaneously, and the answer is: both are bigger than we thought.

There is a peculiar irony in the timing.

Both papers landed in the same week as a dozen other memory papers — EgoExoMem, LASAR, LongMINT, RecMem, H-Mem, and about forty more if you filtered arXiv:cs.CL for “memory” in May. The field is not lacking for papers. It is, if anything, drowning in them. The problem is not that no one is thinking about memory. The problem is that everyone is thinking about slightly different pieces of it, in slightly incompatible terms, on slightly different evaluation setups, with slightly different definitions of what “memory works” actually means.

EvoMemBench is an attempt to standardise that conversation. “Remembering More, Risking More” is a reminder that standardising the conversation is not enough if the conversation is only about what memory can do and not about what memory can break.

Tesla, again, would have appreciated the timing problem. He understood, better than most, that the gap between invention and consequence is not a gap you close by going faster. You close it by asking a different question.

The question EvoMemBench is asking: what does it mean for memory to work?

The question “Remembering More, Risking More” is asking: what does it mean for memory to be safe?

Both are long overdue. Neither has a complete answer yet.

Part 4: How This Connects to Vektor — and Why It Matters
Let’s be direct about why these two papers, arriving in the same week, are relevant to what we’re building.

On EvoMemBench:

The 2×2 taxonomy — in-episode vs. cross-episode, knowledge vs. execution — maps directly onto Vektor’s internal architecture in a way that is either reassuring or a little alarming, depending on your perspective.

Vektor’s MAGMA layer handles cross-episode knowledge evolution natively. Facts, preferences, entities, biographical details — these are stored, deduplicated, and updated across sessions. The BM25+vector dual recall with Reciprocal Rank Fusion is optimised for this quadrant. It works well here. The LoCoMo-class benchmarks would give us respectable numbers.

Cross-episode execution evolution is the one that should keep memory builders up at night, and honestly, it keeps us up a bit. Can Vektor learn procedural patterns across sessions — not just facts about the user, but how to do things better because of what previous sessions showed? The rl-memory and selforg modules gesture at this. The reinforcement layer rewards memories that get retrieved and used. But the full procedural distillation problem — abstracting a sequence of tool calls across three separate debugging sessions into a reusable "here's how this user likes to debug" workflow — is not fully solved. EvoMemBench just gave us a benchmark to test that honestly against. We intend to run it.

In-episode execution evolution is a category Vektor was not explicitly designed to address, because Vektor is a persistence layer, not a within-session working memory. But the paper’s findings suggest that the distinction between “within-session state” and “cross-session memory” is blurrier in practice than the architecture implies. Something to think hard about.

On “Remembering More, Risking More”:

This paper describes Vektor’s threat model with uncomfortable precision.

The cross-session preference manipulation attack — small nudges per session, each below detection threshold, accumulating into systematic skew — is possible against any memory system that doesn’t have explicit revalidation windows on retrieved beliefs. Vektor’s confidence and contradict modules do some of this work: confidence scores decay over time, and the contradiction detector will flag memories that conflict with newer information. But the statistical accumulation attack, where no individual memory is wrong but the ensemble is biased, is a harder problem.

The update-lag exploitation finding is directly relevant to Vektor’s briefing scheduler and batch consolidation. Because consolidation is asynchronous — running in the background between sessions, not inline with every write — there is a window where a corrected or deprecated memory hasn’t fully propagated and an older version is still being served. This is a known trade-off in the architecture, made for performance reasons. “Remembering More, Risking More” is the first paper to characterise it as a security trade-off, not just a consistency one.

Practically, this paper argues for what they call commitment bounds — explicit time-to-live semantics on memory items that influence high-stakes agent behaviour, requiring revalidation after N sessions or T days. Vektor doesn’t currently ship this. It probably should. It’s on the list now.

The broader point:

What both papers are, implicitly, is a maturity signal for the field. We have spent the last two years building memory systems. We are now — finally — building measurement systems to evaluate them and threat models to stress-test them. That’s the right order, but it took a while to get here.

Vektor was built with the conviction that agent memory is a first-class engineering problem, not a prompt engineering afterthought. These two papers validate that conviction and then immediately identify where the work isn’t done.

The measurement problem is not fully solved. The safety problem is not fully solved. The SQLite file is still running, the dual recall is still firing, the graph is still growing — and the benchmark that honestly evaluates it, and the threat model that honestly stresses it, both landed this week.

If you have read this far, let us know if you like this series and what two memory papers are due for a battle, newly released?

VEKTOR Slipstream is our open-source memory SDK — MAGMA graph memory, BM25+vector dual recall, verbatim event storage, and a full MCP server that runs as a single SQLite file on commodity hardware. No cloud. No GPU. Just memory that works.

→ vektormemory.com · @vektormemory

Memory Management
Vector Database
Arxiv
Artificial Intelligence

Top comments (0)