Your agent does not have memory just because it can retrieve old text.
That is probably one of the biggest misconceptions in agent engineering right now. I maintain a curated research list of 25+ papers on agent memory systems and build with these ideas in my own agent work. The pattern I keep seeing is simple: teams equate retrieval with memory, and that shortcut breaks down fast once agents have to operate across time.
Here is the gap in one glance:
❌ Retrieval (what most teams build first)
- Store everything as chunks
- Embed and retrieve top-k by similarity
- Prepend results to the prompt
- Hope the model uses them well
✅ Memory (what production agents need)
- Gate what enters storage
- Separate episodes from durable knowledge
- Merge recurring patterns into reusable facts
- Prune stale details before they pollute retrieval
- Measure whether memory actually helps
That gap is where a lot of agent systems quietly fall apart. It is also where some of the most interesting work is happening right now.
## What developers usually build first
Most teams start with something like this:
```python
# The "memory" system every tutorial teaches you
def remember(event, vector_store):
    embedding = embed(event.text)
    vector_store.upsert(event.id, embedding, event.text)

def recall(query, vector_store, k=5):
    results = vector_store.search(embed(query), top_k=k)
    return [r.text for r in results]

# On every turn:
memories = recall(user_message, store)
prompt = system_prompt + "\n".join(memories) + user_message
response = llm(prompt)
remember(Event(user_message + response), store)
```
I have built this version myself. It works well for a while. It is simple, practical, and easy to ship.
But once the agent runs longer, works across multiple tasks, or needs stable behavior over time, problems start piling up:
- Irrelevant memories keep coming back
- Useful details get buried under noise
- The prompt grows without getting smarter
- Contradictions accumulate silently
- The system never learns what to forget
That is not really memory. It is unstructured recall.
## What production memory actually needs
In practice, agent memory needs several layers of intelligence around storage and retrieval. Here are five that matter.
### 1. Admission control
Not every event deserves to become memory.
I learned this the hard way. A useful memory system needs a gate.
```python
# What admission control actually looks like
def should_remember(event, existing_memory) -> bool:
    scores = {
        "importance": score_importance(event),             # Was this consequential?
        "novelty": score_novelty(event, existing_memory),  # Is this genuinely new?
        "reusability": score_reusability(event),           # Will this matter again?
        "consistency": check_contradictions(event, existing_memory),
        "durability": estimate_shelf_life(event),          # How long is this relevant?
    }
    return weighted_score(scores) > ADMISSION_THRESHOLD
```
This is not just a nice idea. Workday AI’s A-MAC framework (https://arxiv.org/abs/2603.04549) operationalizes the same basic principle with a five-factor admission model that scores candidate memories before they enter long-term storage.
Without admission control, memory becomes a junk drawer.
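The scorers in the snippet above are deliberately abstract. A toy, runnable version of the gate, using only two of the factors with crude keyword and token-overlap heuristics (the keywords, weights, and threshold here are illustrative assumptions, not A-MAC's model), might look like:

```python
# A toy admission gate: importance + novelty only, crude heuristics.
ADMISSION_THRESHOLD = 0.5

def score_importance(text: str) -> float:
    # Crude proxy: explicit decisions and constraints tend to be consequential.
    keywords = ("always", "never", "decided", "must", "prefer")
    return min(1.0, sum(kw in text.lower() for kw in keywords) / 2)

def score_novelty(text: str, existing: list[str]) -> float:
    # Crude proxy: 1 minus the best token overlap with anything already stored.
    tokens = set(text.lower().split())
    if not tokens:
        return 0.0
    best = max((len(tokens & set(m.lower().split())) / len(tokens)
                for m in existing), default=0.0)
    return 1.0 - best

def should_remember(text: str, existing: list[str]) -> bool:
    weights = {"importance": 0.6, "novelty": 0.4}
    score = (weights["importance"] * score_importance(text)
             + weights["novelty"] * score_novelty(text, existing))
    return score > ADMISSION_THRESHOLD
```

Even this toy version rejects exact repeats and low-signal chatter, which is the whole point: the gate runs before storage, not after retrieval.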
### 2. Consolidation
Raw events should not all stay raw forever.
Some information should be merged into higher-level knowledge:
- repeated user preferences → a stable profile
- recurring operational patterns → reusable procedures
- multiple related events → one summary with links back to sources
- successful action sequences → learned policies
Human memory does this naturally through consolidation. Agent systems usually do not.
A-MEM (https://arxiv.org/abs/2502.12110) moves in this direction with dynamic note evolution: memories can be linked, updated, and reorganized over time instead of only accumulating as flat records.
That shift matters. A memory system should not just collect history. It should reshape history into something reusable.
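As a minimal sketch of the first bullet (the episode shape and helper name are assumptions for illustration), consolidation can start as simply as promoting any preference observed more than once into a durable fact that keeps links back to its source episodes:

```python
from collections import defaultdict

def consolidate_preferences(episodes: list[dict]) -> list[dict]:
    """Merge repeated (key, value) preference observations into profile facts.

    Each episode is e.g. {"id": "e1", "key": "editor", "value": "vim"}.
    A preference seen at least twice becomes one durable fact that retains
    references to the episodes it was distilled from.
    """
    seen = defaultdict(list)
    for ep in episodes:
        seen[(ep["key"], ep["value"])].append(ep["id"])

    facts = []
    for (key, value), sources in seen.items():
        if len(sources) >= 2:  # recurring pattern -> durable knowledge
            facts.append({"key": key, "value": value, "sources": sources})
    return facts
```

The "links back to sources" part matters: when a consolidated fact turns out to be wrong, you can trace it to the episodes that produced it instead of debugging a black box.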
### 3. Forgetting
Forgetting is not a bug. It is part of intelligence.
This was counterintuitive to me at first. A memory system that never forgets becomes noisy, expensive, and brittle. Some details should decay. Some should be archived. Some should be overwritten. Some should remain permanent.
```python
# Strategic forgetting — not deleting blindly, but managing memory over time
def forget_cycle(memory_store):
    for memory in memory_store.all():
        memory.relevance *= decay_rate(memory.age, memory.access_count)
        # Check the lowest threshold first: anything below PRUNE_THRESHOLD
        # would otherwise be caught by the archive branch and never removed.
        if memory.relevance < PRUNE_THRESHOLD:
            memory_store.remove(memory)
        elif memory.relevance < ARCHIVE_THRESHOLD:
            memory_store.archive(memory)
        elif memory.superseded_by:
            memory_store.merge(memory, memory.superseded_by)
```
Recent work on structured forgetting suggests that retaining everything can actively degrade retrieval under interference, while selective forgetting can improve long-horizon behavior. SleepGate (https://arxiv.org/abs/2603.14517) is one of the more striking recent examples, proposing selective eviction, compression, and consolidation mechanisms to reduce interference from stale context.
The hard problem is not remembering more. It is remembering the right things for the right duration.
### 4. Hierarchy
Not all memory is the same.
Useful agent systems often need multiple memory types:
| Type | What it holds | Lifespan |
|---|---|---|
| Working | Active task context | Minutes |
| Episodic | Past events, conversations | Days to weeks |
| Semantic | Distilled facts, preferences | Months to permanent |
| Procedural | Learned skills, workflows | Permanent until revised |
When everything is stored as flat text chunks, the system loses structure.
The survey Memory in the Age of AI Agents (https://arxiv.org/abs/2512.13564) does not argue for one single canonical taxonomy, but it clearly shows the field moving beyond the idea that all memory is just retrieval. The direction is toward more differentiated memory forms, functions, and dynamics.
That is a healthier framing than “just add a vector store.”
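One way to keep those tiers explicit in code is to tag each record with its type and give each type its own lifespan policy. This is a sketch (the type names mirror the table above; the TTL values are illustrative assumptions), but it shows the key move: the forgetting cycle can now treat a working-memory scratch note and a semantic fact differently instead of decaying all chunks uniformly.

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    WORKING = "working"        # active task context, minutes
    EPISODIC = "episodic"      # past events and conversations, days to weeks
    SEMANTIC = "semantic"      # distilled facts and preferences, months+
    PROCEDURAL = "procedural"  # learned workflows, permanent until revised

# Illustrative time-to-live per tier, in seconds (None = no automatic expiry).
DEFAULT_TTL = {
    MemoryType.WORKING: 15 * 60,
    MemoryType.EPISODIC: 14 * 24 * 3600,
    MemoryType.SEMANTIC: None,
    MemoryType.PROCEDURAL: None,
}

@dataclass
class MemoryRecord:
    text: str
    type: MemoryType
    created_at: float  # epoch seconds

    def expired(self, now: float) -> bool:
        ttl = DEFAULT_TTL[self.type]
        return ttl is not None and now - self.created_at > ttl
```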
### 5. Evaluation
This is the part many teams skip. I did too, for longer than I should have.
You cannot improve memory if your only metric is "retrieval seemed okay in this demo."
You need to evaluate questions like:
- Did memory improve downstream decisions?
- Did it reduce context cost?
- Did it help over long horizons?
- Did it preserve critical constraints?
- Did it surface stale or misleading information?
StructMemEval (https://arxiv.org/abs/2602.11243) is one of the first focused attempts to benchmark whether agents can organize memory into useful structures rather than just retrieve isolated facts.
That is an uncomfortable but necessary shift. A lot of memory systems still look stronger in architecture diagrams than in measured outcomes.
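A minimal way to start answering the first two questions above is an A/B harness: run the same task set with memory on and off, and compare downstream success and context cost instead of eyeballing retrievals. This is a sketch; `run_agent` and the result shape are assumptions you would adapt to your own agent loop.

```python
def evaluate_memory(run_agent, tasks):
    """A/B the agent with and without memory on the same task set.

    `run_agent(task, use_memory)` is assumed to return a dict like
    {"success": bool, "prompt_tokens": int}. What we compare is downstream
    outcomes and cost, not retrieval similarity scores.
    """
    results = {}
    for use_memory in (False, True):
        runs = [run_agent(task, use_memory=use_memory) for task in tasks]
        results[use_memory] = {
            "success_rate": sum(r["success"] for r in runs) / len(runs),
            "avg_prompt_tokens": sum(r["prompt_tokens"] for r in runs) / len(runs),
        }
    return {
        "success_lift": results[True]["success_rate"] - results[False]["success_rate"],
        "token_delta": results[True]["avg_prompt_tokens"] - results[False]["avg_prompt_tokens"],
    }
```

If `success_lift` is near zero while `token_delta` is positive, the memory system is pure overhead, and no architecture diagram can argue with that.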
## The economics are real too
There is also a practical cost argument here.
A March 2026 analysis, Memory Systems or Long Contexts? Comparing LLM Approaches to Factual Recall from Prior Conversations (https://arxiv.org/abs/2603.04814), compared a fact-based memory system against long-context LLM inference.
The result was more nuanced than “memory always wins.” Long-context GPT-5-mini achieved higher factual recall on some benchmarks, but the memory system had a much flatter per-turn cost curve and became cheaper at around 10 turns once context length reached roughly 100k tokens.
That means good memory design is not just an architectural choice. It is also a cost-shaping decision, especially once agents start accumulating enough history that long-context inference becomes expensive turn after turn.
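To see why the curves cross, here is a back-of-the-envelope model (the numbers are illustrative assumptions, not the paper's figures): a long-context agent re-sends the entire growing history every turn, so cumulative prompt tokens grow roughly quadratically, while a memory system sends a bounded retrieval slice per turn, so its cost grows linearly.

```python
def cumulative_prompt_tokens(turns: int, tokens_per_turn: int = 2_000,
                             memory_budget: int = 4_000):
    """Illustrative cost model: cumulative prompt tokens after N turns.

    Long-context: turn t re-sends all t prior turns of history, so the
    total is an arithmetic series (~quadratic in N). Memory: each turn
    sends a fixed retrieval budget, so the total is linear in N.
    """
    long_context = sum(t * tokens_per_turn for t in range(1, turns + 1))
    memory = turns * memory_budget
    return long_context, memory
```

Under these toy numbers the memory system is already cheaper by turn 10, and the gap keeps widening, which matches the qualitative shape of the paper's result: long context can win on recall early, but its per-turn cost compounds.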
## Where to go deeper
The industry is moving from “chat with tools” toward agents that operate over time. That changes the problem fundamentally.
Short-lived chat interactions can get away with context stuffing. Long-lived agents cannot.
I maintain a curated list of 25+ papers covering these areas:
👉 awesome-agent-memory
https://github.com/tfatykhov/awesome-agent-memory
It is organized by mechanism: admission, consolidation, forgetting, retrieval, evaluation, and cognitive or neuro-inspired memory. Venue metadata is verified where possible. Self-reported claims are flagged. My own synthesis is separated from the source material.
This is not another generic awesome-list. It is organized around a simple thesis: memory is an engineering discipline, not a retrieval trick.
I also build with these ideas in Nous:
https://github.com/tfatykhov/nous
Some of the ideas worked. Some of them failed. The wins went into the design. The failures went into the curation.
If you are building agents that need to run longer than a single conversation, memory is probably the next systems problem you are going to hit.
And if that is the problem you are hitting, the research is finally getting good enough to help.
If you find the list useful, a ⭐ on the repo helps more people discover it. PRs are welcome, especially if there are papers I missed.