Greg Mate

Posted on Jun 11

I Built a Python Agent That Uses a Vector DB as Memory, Not Retrieval

#ai #python #vectordatabase #llm

Vector databases are almost always talked about in the context of RAG. Store your documents, embed them, retrieve the relevant chunks at inference time. That's the default pattern and it works — until it doesn't.

I've been working on Actian VectorAI DB and started wondering: what if the vector DB isn't a document store at all? What if it's a memory layer for an agent?

So I built it to find out.

The Idea

The distinction sounds subtle but it matters. In a classic RAG setup, you pre-load a vector store with documents. The corpus is static. The agent queries it but never changes it.

What I wanted to build was different. An agent that writes to the vector store as it runs — storing every interaction as a vector — and then searches its own past conversations semantically when it needs context. The corpus is built from the agent's own history, not from documents you loaded upfront.

The agent is the author of its own knowledge base.

The Stack

Everything runs locally. No cloud, no external API calls, nothing leaving the machine:

Actian VectorAI DB: vector store and semantic search
Ollama + llama3.2: local LLM
BAAI/bge-small-en-v1.5: embedding model
Python: the glue

The fully local constraint wasn't just a preference; it was core to the premise. If the agent is storing personal memory, it shouldn't be doing it in someone else's cloud.

How It Works

Every time you send the agent a message, it does four things:

Embeds your message as a vector
Searches VectorAI DB for semantically similar past interactions
Injects the relevant memories into the system prompt
Responds, then stores the full exchange back into VectorAI DB

Here is the chat method that runs those steps:

def chat(self, user_message: str) -> str:
    """Process a user message and return the assistant reply."""
    # 1. Embed the incoming message for semantic search
    query_vec = embed(user_message)

    # 2. Recall semantically relevant memories (cross-session by default).
    # score_threshold=0.50 prevents loosely-related memories from being injected
    # as context. min_importance=0.5 excludes low-confidence episodic fragments
    # (episodes are stored at 0.3, explicit facts at 0.9).
    past_memories = self.memory.recall(
        query_vector=query_vec,
        limit=5,
        score_threshold=0.50,
        min_importance=0.5,
    )

    # 3. Build system prompt with injected memories
    system_prompt = self._build_system_prompt(past_memories)

    # 4. Extend short-term conversation window
    self.conversation.append({"role": "user", "content": user_message})

    # 5. Call the local LLM via Ollama
    messages = [{"role": "system", "content": system_prompt}] + self.conversation
    response = self.llm.chat.completions.create(
        model=self.model,
        messages=messages,
    )
    assistant_reply = response.choices[0].message.content

    # 6. Append reply to short-term window
    self.conversation.append({"role": "assistant", "content": assistant_reply})

    # 7. Persist this exchange as an episodic long-term memory
    # Episodic importance is kept low (0.3) intentionally: the agent's own
    # replies may contain errors or hallucinations. Explicit facts stored via
    # remember_fact() use importance=0.9 and will always rank above episodes.
    memory_text = f"User said: {user_message}\nAgent replied: {assistant_reply}"
    memory_vec = embed(memory_text)
    self.memory.remember(
        content=memory_text,
        vector=memory_vec,
        session_id=self.session_id,
        memory_type="episode",
        importance=0.3,
    )

    return assistant_reply

The search is cross-session by default. A memory from last Tuesday will surface today if it's semantically close enough to what you're asking. The collection lives on disk via Docker volume so it persists across restarts.

There's also a remember: <fact> command that stores explicit facts at a higher importance score (0.9), separately from the episodic conversation log.
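A minimal sketch of how the remember: command could be routed, assuming a simple prefix check in front of the normal chat path. The helper name parse_command and the returned tuple shape are hypothetical and not taken from the repo; the importance values are the ones described in this post.

```python
from typing import Tuple

FACT_PREFIX = "remember:"   # command prefix described in the post
FACT_IMPORTANCE = 0.9       # explicit facts
EPISODE_IMPORTANCE = 0.3    # ordinary chat exchanges


def parse_command(user_message: str) -> Tuple[str, str, float]:
    """Classify a message as an explicit fact or a normal chat turn.

    Returns (memory_type, content, importance). Hypothetical helper.
    """
    stripped = user_message.strip()
    if stripped.lower().startswith(FACT_PREFIX):
        fact = stripped[len(FACT_PREFIX):].strip()
        if fact:  # an empty "remember:" falls through to a normal turn
            return ("fact", fact, FACT_IMPORTANCE)
    return ("episode", stripped, EPISODE_IMPORTANCE)
```

The caller would embed the returned content and store it with the returned memory type and importance, so facts and episodes share one collection but sit on different rungs.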

What Broke Along the Way

The embedding model defaulted to a HuggingFace download on first run, which immediately broke the fully local setup. Fixed it by loading the model with local_files_only=True and requiring a one-time manual download before the first run — so the embedding step is fully offline on every subsequent run.

The Memory Decay Problem

The first version used fixed importance scores: every exchange was stored at 0.6 and explicit facts at 0.9. There was no decay and no forgetting, so the collection just grew indefinitely. That's fine as a proof of concept but it's not how memory actually works. Old, rarely referenced memories shouldn't compete equally with recent, frequently accessed ones.

So I added importance-weighted decay. Every memory now gets scored on four signals before being returned:

age_hours = (now - timestamp) / 3600
recency   = exp(-age_hours / 168)          # falls to 1/e after one week (half-life ~4.9 days)
freq      = min(access_count / 10.0, 1.0)  # saturates at 10 accesses

final_score = (
    0.6 * cosine_similarity
  + 0.2 * importance
  + 0.15 * recency
  + 0.05 * freq
)

Cosine similarity still does the heavy lifting. It has to, otherwise semantically irrelevant memories would surface. But recency and access frequency now influence ranking. A memory from six weeks ago that's never been referenced again will lose ground to a recent one, even when their raw cosine similarities are close.

The weights and the decay constant are module-level constants so they're easy to tune without touching the logic.
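The formula above can be written as a small runnable function. This is a sketch under assumptions: the constant names and the final_score signature are made up for illustration, while the numeric values come straight from the formula.

```python
import math
import time

# Module-level tuning constants (values from the formula in this post).
W_SIMILARITY = 0.6
W_IMPORTANCE = 0.2
W_RECENCY = 0.15
W_FREQUENCY = 0.05
DECAY_HOURS = 168.0      # recency falls to 1/e after one week
FREQ_SATURATION = 10.0   # access count at which frequency maxes out


def final_score(cosine_similarity, importance, timestamp, access_count, now=None):
    """Blend similarity, importance, recency and access frequency."""
    now = time.time() if now is None else now
    # Clamp at zero so a clock skew never produces a recency above 1.
    age_hours = max(now - timestamp, 0.0) / 3600
    recency = math.exp(-age_hours / DECAY_HOURS)
    freq = min(access_count / FREQ_SATURATION, 1.0)
    return (
        W_SIMILARITY * cosine_similarity
        + W_IMPORTANCE * importance
        + W_RECENCY * recency
        + W_FREQUENCY * freq
    )
```

Note that exp(-age_hours / 168) drops to 1/e after one week and to one half after 168 × ln 2 ≈ 116 hours, a little under five days. If a true one-week half-life is wanted, the divisor would need to be 168 / ln 2 ≈ 242.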

The recall path also tracks access — every time a memory surfaces in a query, its access_count increments and last_accessed updates. Memories that keep coming up stay relevant. Ones that don't, fade.
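Access tracking can be sketched like this, using an in-memory record instead of the real store. MemoryRecord and mark_accessed are hypothetical names; in the actual agent the same two fields would presumably be updated on the stored point rather than on a Python object.

```python
import time
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    """In-memory stand-in for one stored memory (hypothetical shape)."""
    content: str
    importance: float
    timestamp: float
    access_count: int = 0
    last_accessed: float = 0.0


def mark_accessed(records, now=None):
    """Bump access_count and last_accessed on every recalled record."""
    now = time.time() if now is None else now
    for record in records:
        record.access_count += 1
        record.last_accessed = now
    return records
```

Calling mark_accessed on the results of each recall is what feeds the frequency term: a memory that keeps surfacing climbs toward the saturation point, and one that never surfaces stays at zero.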

Here's what the ranked output looks like against four synthetic test memories:

Rank  Score    Imp   Content
  1   0.9135   0.9   recent + high access (1 hr old, 8 accesses)
  2   0.6776   0.9   old + high importance (30 days, 0 accesses)
  3   0.6704   0.3   recent + no access (2 hrs old, 0 accesses)
  4   0.5112   0.3   old + low importance (60 days, 0 accesses)

The recent, frequently accessed memory dominates. The old, low-importance one drops to the bottom: with importance, recency, and access frequency all near their minimum, only a much higher cosine similarity could lift it. That's the behavior you want from something calling itself memory.

The Hallucination Problem

Persistent memory introduces a risk that RAG pipelines don't have in the same way: if the agent hallucinates something and stores it, that hallucination gets recalled as a confident memory in the next session. The wrong information compounds.

Three risks needed fixing.

The LLM had no instruction to stay within recalled memories. The original system prompt said "use these memories when relevant" — permissive enough that the model would freely supplement from its training data when memory was thin. Three explicit rules were added: only use facts from the listed memories for personal claims, say "I don't know" when no memory covers a question, and never infer or guess personal details.
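A sketch of what a stricter prompt builder might look like. The exact wording of the rules and the build_system_prompt name are illustrative assumptions, not the prompt from the repo; only the three rules themselves come from the description above.

```python
from typing import List

# Illustrative wording of the three rules described in the post.
RULES = (
    "Rules:\n"
    "1. For personal claims about the user, use only facts from the memories listed below.\n"
    "2. If no memory covers the question, say \"I don't know\".\n"
    "3. Never infer or guess personal details.\n"
)


def build_system_prompt(memories: List[str]) -> str:
    """Assemble the system prompt from the rules and the recalled memories."""
    if memories:
        listed = "\n".join(f"- {m}" for m in memories)
    else:
        # Say so explicitly, so the model has no gap to fill on its own.
        listed = "(no relevant memories)"
    return (
        "You are an assistant with long-term memory.\n"
        f"{RULES}\n"
        f"Memories:\n{listed}"
    )
```

Stating the empty case explicitly matters as much as the rules: a missing memory section is an invitation to improvise, while "(no relevant memories)" pairs naturally with rule 2.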

Hallucinated replies were stored and recalled as truth. Every exchange was stored at importance=0.6, meaning a hallucinated reply could be recalled next session and treated as a confident memory. Episodic importance was lowered to 0.3 — well below explicit facts at 0.9 — so bad replies can never outrank things the user deliberately told the agent.

Weakly-matched memories were being injected as context. The recall threshold was low enough to pull in semantically distant memories that could mislead the LLM. The score threshold was raised to 0.50 and a min_importance=0.5 filter was added, so episodic fragments (stored at 0.3) are excluded from injection entirely. Only explicitly stored facts ever reach the LLM.
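The two gates can be sketched as a plain filter over search hits. The hit shape (a dict with score, importance, and content keys) and the function name filter_for_injection are assumptions for illustration; a real store may apply these filters inside the query instead.

```python
from typing import Dict, List

SCORE_THRESHOLD = 0.50   # minimum score for a hit to count as relevant
MIN_IMPORTANCE = 0.5     # sits between episodes (0.3) and facts (0.9)


def filter_for_injection(hits: List[Dict]) -> List[Dict]:
    """Keep only hits that clear both gates, best score first."""
    kept = [
        h for h in hits
        if h["score"] >= SCORE_THRESHOLD and h["importance"] >= MIN_IMPORTANCE
    ]
    return sorted(kept, key=lambda h: h["score"], reverse=True)
```

With these values an episode never passes, however well it matches, and a fact passes only when it is also relevant to the question.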

The importance ladder now looks like this:

importance=0.9  ->  explicit facts (remember: <fact>)   always recalled if score ≥ 0.50
importance=0.5  ->  the min_importance gate             <- filter line
importance=0.3  ->  episodic exchanges (chat history)   never recalled, never injected

A test suite with 5 offline pytest tests guards all three risks — mocking both the memory store and the LLM call, then inspecting the messages array sent to the model before it responds.
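One such test could look roughly like the sketch below. TinyAgent is a stripped-down stand-in written for this example, not the agent from the repo, and the recall signature is simplified; the point is the pattern of mocking both collaborators and then inspecting the messages argument.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


class TinyAgent:
    """Stand-in agent with just enough logic to test prompt injection."""

    def __init__(self, memory, llm):
        self.memory = memory
        self.llm = llm

    def chat(self, user_message: str) -> str:
        memories = self.memory.recall(query=user_message)
        listed = "\n".join(f"- {m}" for m in memories) or "(none)"
        messages = [
            {"role": "system", "content": f"Memories:\n{listed}"},
            {"role": "user", "content": user_message},
        ]
        response = self.llm.chat.completions.create(
            model="llama3.2", messages=messages
        )
        return response.choices[0].message.content


def make_llm(reply: str) -> MagicMock:
    """Mock client whose response mimics the choices[0].message.content shape."""
    llm = MagicMock()
    llm.chat.completions.create.return_value = SimpleNamespace(
        choices=[SimpleNamespace(message=SimpleNamespace(content=reply))]
    )
    return llm


def test_only_recalled_memories_reach_the_model():
    memory = MagicMock()
    memory.recall.return_value = ["User's dog is called Rex"]
    llm = make_llm("Your dog is called Rex.")

    reply = TinyAgent(memory, llm).chat("What is my dog called?")

    # Inspect exactly what was sent to the model.
    sent = llm.chat.completions.create.call_args.kwargs["messages"]
    assert sent[0]["role"] == "system"
    assert "User's dog is called Rex" in sent[0]["content"]
    assert sent[-1] == {"role": "user", "content": "What is my dog called?"}
    assert reply == "Your dog is called Rex."
```

Because neither the store nor the model is real, the test runs offline and fails only when the prompt assembly itself changes.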

5 passed in 10.56s ✓

What I Found

When I examined how VectorAI DB was actually being used in the implementation, the key finding was this:

The corpus is built dynamically from the agent's own past conversations, not from a pre-loaded document index. The agent is the author of its own knowledge base, which accumulates at runtime.

That's the thing that makes this memory rather than retrieval. It's a small shift in how you think about what a vector DB is for: not a document store you query at inference time, but a persistent layer that grows with the agent over time, and now one that forgets appropriately too.

The agent works. Cross-session recall is functioning, decay is verified, the stack is fully local.

What's Next

Testing retrieval quality as the memory grows over longer periods
Exploring what other use cases this pattern unlocks beyond conversation memory

Find the repo here. If you're working on anything in this space — agentic memory, local-first AI stacks, or just fighting with MCP setup — I'd love to hear what you're seeing in the comments.

Top comments (8)

Max Quimby • Jun 16

The "agent is the author of its own knowledge base" framing is the right one, and it's also where it gets dangerous. The moment an agent stores its own outputs as memory, it can recall a past mistake as authoritative context and compound it — I've watched a pipeline reinforce a wrong assumption for an entire run because it "remembered" its own earlier hallucination at score ~0.6.

Two things I'd love to know how you handle: staleness and contradiction. Your importance tiers (episodes 0.3, facts 0.9) are smart, but a fact stored at 0.9 today can be false next week. Do you decay importance over time, or version/supersede a memory when a newer interaction contradicts it? And does recall ever dedup near-identical memories, or can the same fact get injected five times and crowd the context window?

The local-first constraint is underrated too — personal memory living in someone else's cloud is a non-starter for a lot of use cases.

Greg Mate • Jun 22

the hallucination compounding problem is exactly what the importance tiers are designed to prevent. episodic exchanges at 0.3 never get injected as context, so a wrong answer from session A can't be recalled as authoritative in session B. only things explicitly stored via remember: reach 0.9 and get injected.
but you got it right: a fact at 0.9 today can be false next week and there's currently no contradiction detection or supersession. decay helps with staleness over time but doesn't handle "this directly contradicts something already stored." deduplication is also not in there yet. near-identical memories can stack and crowd the context window, which becomes a real problem as the collection grows. both are on the list.
and agreed on local-first. personal memory in someone else's cloud is a different product entirely.

Alex Shev • Jun 12

Using a vector DB as memory changes the product problem. Retrieval asks "what documents are relevant?" Memory asks "what should affect future behavior?" Those are not the same contract.

I would be careful about write rules and expiration. If the agent can store every convenient observation as memory, the system slowly becomes biased by stale or accidental context. Good memory needs admission criteria, provenance, and a way to revoke or downgrade old beliefs.

Greg Mate • Jun 22

"what should affect future behavior" vs "what's relevant" is a sharper way to frame it than how I was thinking about it. going to carry that framing forward.
on admission criteria and expiration though, that's mostly what the importance ladder and decay section covers. episodic exchanges start at 0.3 and fade with age and access frequency, only explicit facts reach 0.9 and persist. the honest gap is provenance and revocation. so if a fact is stored at 0.9 today and turns out to be wrong, there's no mechanism to downgrade it short of manually deleting the point. that's the next real problem.

Alex Shev • Jun 22

That distinction between retrieval and memory is the useful one. Once the vector store starts holding outcomes, preferences, and corrections, the question becomes lifecycle: when to promote a memory, when to decay it, and when a new run should override the old belief.

Luis Cruz • Jun 11

This is a fascinating approach to agent memory. I love the distinction you make between a traditional RAG setup and a dynamic, self-authored memory—treating the vector DB as a persistent, evolving knowledge layer rather than just a static retrieval store. The importance-weighted decay and min_importance filtering are elegant solutions to prevent hallucinations from contaminating long-term memory.
I’d love to collaborate and experiment with similar local-first AI agent workflows. It would be interesting to explore cross-session memory quality, vector DB memory structures, and safe memory injection strategies for complex agentic tasks. If you’re open to it, we could exchange test strategies, code patterns, and best practices for building robust, fully offline agent memory systems.
Have you thought about integrating episodic memory with structured knowledge graphs or tool execution histories? I’d be happy to help prototype some of those ideas.

Greg Mate • Jun 22

thanks! the knowledge graph angle is something I've been thinking about. episodic memory works well for conversation context but structured knowledge (relationships between entities, tool execution histories) probably needs a different representation. a hybrid where the vector store handles fuzzy semantic recall and a graph handles explicit relationships is where I'd want to take it next. open to experimenting with that, feel free to reach out.

Luis Cruz • Jun 22

Thanks for sharing this — I fully agree with your direction.

The hybrid setup makes a lot of sense: vector store for semantic recall and a graph layer for explicit structure like entity relations and tool execution history feels like the right separation of concerns. I’ve also seen that episodic memory alone starts to break down once you need consistent reasoning over multi-step tool use or cross-entity dependencies.

I’d be interested in experimenting with this as well. Especially around how we can log tool executions into a graph in a way that stays queryable without becoming too rigid or expensive.

Feel free to reach out — happy to collaborate and explore ideas here.