Greg Mate

I Built a Python Agent That Uses a Vector DB as Memory, Not Retrieval

Vector databases are almost always talked about in the context of RAG. Store your documents, embed them, retrieve the relevant chunks at inference time. That's the default pattern and it works — until it doesn't.


I've been working on Actian VectorAI DB and started wondering: what if the vector DB isn't a document store at all? What if it's a memory layer for an agent?

So I built it to find out.

The Idea

The distinction sounds subtle but it matters. In a classic RAG setup, you pre-load a vector store with documents. The corpus is static. The agent queries it but never changes it.

What I wanted to build was different. An agent that writes to the vector store as it runs — storing every interaction as a vector — and then searches its own past conversations semantically when it needs context. The corpus is built from the agent's own history, not from documents you loaded upfront.

The agent is the author of its own knowledge base.

The Stack

Everything runs locally. No cloud, no external API calls, nothing leaving the machine:

  • Actian VectorAI DB: vector store and semantic search
  • Ollama + llama3.2: local LLM
  • BAAI/bge-small-en-v1.5: embedding model
  • Python: the glue

The fully local constraint wasn't just a preference; it's core to the premise. If the agent is storing personal memory, it shouldn't be doing it in someone else's cloud.

How It Works

Every time you send the agent a message, it does four things:

  1. Embeds your message as a vector
  2. Searches VectorAI DB for semantically similar past interactions
  3. Injects the relevant memories into the system prompt
  4. Responds, then stores the full exchange back into VectorAI DB

Here is the chat method. Its numbered comments split those four steps into seven finer-grained ones:

def chat(self, user_message: str) -> str:
    """Process a user message and return the assistant reply."""
    # 1. Embed the incoming message for semantic search
    query_vec = embed(user_message)

    # 2. Recall semantically relevant memories (cross-session by default).
    # score_threshold=0.50 prevents loosely-related memories from being injected
    # as context. min_importance=0.5 excludes low-confidence episodic fragments
    # (episodes are stored at 0.3, explicit facts at 0.9).
    past_memories = self.memory.recall(
        query_vector=query_vec,
        limit=5,
        score_threshold=0.50,
        min_importance=0.5,
    )

    # 3. Build system prompt with injected memories
    system_prompt = self._build_system_prompt(past_memories)

    # 4. Extend short-term conversation window
    self.conversation.append({"role": "user", "content": user_message})

    # 5. Call the local LLM via Ollama
    messages = [{"role": "system", "content": system_prompt}] + self.conversation
    response = self.llm.chat.completions.create(
        model=self.model,
        messages=messages,
    )
    assistant_reply = response.choices[0].message.content

    # 6. Append reply to short-term window
    self.conversation.append({"role": "assistant", "content": assistant_reply})

    # 7. Persist this exchange as an episodic long-term memory
    # Episodic importance is kept low (0.3) intentionally: the agent's own
    # replies may contain errors or hallucinations. Explicit facts stored via
    # remember_fact() use importance=0.9 and will always rank above episodes.
    memory_text = f"User said: {user_message}\nAgent replied: {assistant_reply}"
    memory_vec = embed(memory_text)
    self.memory.remember(
        content=memory_text,
        vector=memory_vec,
        session_id=self.session_id,
        memory_type="episode",
        importance=0.3,
    )

    return assistant_reply

The search is cross-session by default. A memory from last Tuesday will surface today if it's semantically close enough to what you're asking. The collection lives on disk in a Docker volume, so it persists across restarts.

There's also a remember: <fact> command that stores explicit facts at a higher importance score (0.9), separately from the episodic conversation log.
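A minimal sketch of how such a command could be detected before the message reaches the normal chat path. The function name `extract_fact` and the case-insensitive prefix match are assumptions for illustration, not the repo's code.

```python
FACT_PREFIX = "remember:"  # command prefix described in the post


def extract_fact(user_input: str):
    """Return the fact text if the input is a 'remember:' command, else None.

    Hypothetical helper: the real agent may parse the command differently.
    """
    stripped = user_input.strip()
    if not stripped.lower().startswith(FACT_PREFIX):
        return None
    fact = stripped[len(FACT_PREFIX):].strip()
    # An empty fact ("remember:" with nothing after it) is treated as no command.
    return fact or None
```

A caller would store the returned text with importance=0.9 when it is not None, and fall through to the regular chat path otherwise.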

What Broke Along the Way

The embedding model defaulted to a HuggingFace download on first run, which immediately broke the fully local setup. Fixed it by loading the model with local_files_only=True and requiring a one-time manual download before the first run — so the embedding step is fully offline on every subsequent run.
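For reference, an offline load could look roughly like the sketch below. It assumes the model is loaded through the sentence-transformers library, which the post does not state; only the model name and the local_files_only=True flag come from the text above. The function name `load_embedder` is hypothetical.

```python
import os

# Model named in the stack list above.
MODEL_NAME = "BAAI/bge-small-en-v1.5"

# HF_HUB_OFFLINE=1 tells huggingface_hub not to make network requests.
# Set it before any Hugging Face library is imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_embedder(model_name: str = MODEL_NAME):
    """Load the embedding model strictly from the local cache (assumed setup)."""
    # Imported lazily so the environment variable above is already in place.
    from sentence_transformers import SentenceTransformer

    # With local_files_only=True the load fails instead of downloading when
    # the model is not cached, so a forgotten one-time download is caught early.
    return SentenceTransformer(model_name, local_files_only=True)
```

Setting the environment variable as well as the flag is a belt-and-braces choice: the flag covers this one load, the variable covers anything else that might reach for the Hub.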

The Memory Decay Problem

The first version used fixed importance scores: every exchange was stored at 0.6 and explicit facts at 0.9. There was no decay and no forgetting; the collection just grew indefinitely. That's fine as a proof of concept but it's not how memory actually works. Old, rarely referenced memories shouldn't compete equally with recent, frequently accessed ones.

So I added importance-weighted decay. Every memory now gets scored on four signals before being returned:

from math import exp

age_hours        = (now - timestamp) / 3600
recency          = exp(-age_hours / 168)          # decays to 1/e after 1 week (half-life ~4.9 days)
access_frequency = min(access_count / 10.0, 1.0)  # saturates at 10 accesses

final_score = (
    0.6 * cosine_similarity
  + 0.2 * importance
  + 0.15 * recency
  + 0.05 * access_frequency
)

Cosine similarity still does the heavy lifting. It has to; otherwise semantically irrelevant memories would surface. But recency and access frequency now influence ranking. A memory from six weeks ago that's never been referenced again will lose ground to a recent one, even if their raw cosine similarities are comparable.

The weights and the decay constant are module-level constants, so they're easy to tune without touching the logic.
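As a self-contained sketch, the scoring formula can be wrapped in a function with the constants pulled out. The constant and function names here are illustrative, not taken from the repo, and the similarity values in the usage lines are made up.

```python
import math

# Weights and decay constant from the formula above; names are hypothetical.
W_SIMILARITY, W_IMPORTANCE, W_RECENCY, W_FREQUENCY = 0.6, 0.2, 0.15, 0.05
DECAY_HOURS = 168.0       # recency falls to 1/e after one week
ACCESS_SATURATION = 10.0  # access frequency is capped at 10 accesses


def final_score(cosine_similarity, importance, age_hours, access_count):
    """Blend similarity, importance, recency, and access frequency into one score."""
    recency = math.exp(-age_hours / DECAY_HOURS)
    access_frequency = min(access_count / ACCESS_SATURATION, 1.0)
    return (
        W_SIMILARITY * cosine_similarity
        + W_IMPORTANCE * importance
        + W_RECENCY * recency
        + W_FREQUENCY * access_frequency
    )


# Two memories with identical similarity and importance; only age and access differ.
fresh = final_score(0.8, 0.3, age_hours=2, access_count=5)
stale = final_score(0.8, 0.3, age_hours=6 * 7 * 24, access_count=0)
```

Because the weights sum to 1.0, a perfect match that is brand new, maximally important, and saturated on accesses scores exactly 1.0, which makes the score easy to compare against a fixed threshold.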

The recall path also tracks access — every time a memory surfaces in a query, its access_count increments and last_accessed updates. Memories that keep coming up stay relevant. Ones that don't, fade.
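The access tracking might be sketched like this. The `MemoryRecord` dataclass and `mark_accessed` helper are stand-ins invented for illustration; only the field names access_count and last_accessed come from the post, and the real store presumably updates them in the database payload.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """Hypothetical in-memory stand-in for a stored memory's metadata."""
    content: str
    access_count: int = 0
    last_accessed: float = field(default_factory=time.time)


def mark_accessed(records, now=None):
    """Bump access_count and refresh last_accessed for every recalled record."""
    now = time.time() if now is None else now
    for record in records:
        record.access_count += 1
        record.last_accessed = now
    return records
```

Passing `now` explicitly keeps the helper deterministic under test.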

Here's what the ranked output looks like against four synthetic test memories:

Rank  Score    Imp   Content
  1   0.9135   0.9   recent + high access (1 hr old, 8 accesses)
  2   0.6776   0.9   old + high importance (30 days, 0 accesses)
  3   0.6704   0.3   recent + no access (2 hrs old, 0 accesses)
  4   0.5112   0.3   old + low importance (60 days, 0 accesses)

The recent, frequently accessed memory dominates. The old, low-importance one drops to the bottom: at 60 days old with no accesses, it earns almost nothing from the recency and frequency terms. That's the behavior you want from something calling itself memory.

The Hallucination Problem

Persistent memory introduces a risk that RAG pipelines don't have in the same way: if the agent hallucinates something and stores it, that hallucination gets recalled as a confident memory in the next session. The wrong information compounds.

Three risks needed fixing.

The LLM had no instruction to stay within recalled memories. The original system prompt said "use these memories when relevant" — permissive enough that the model would freely supplement from its training data when memory was thin. Three explicit rules were added: only use facts from the listed memories for personal claims, say "I don't know" when no memory covers a question, and never infer or guess personal details.
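A prompt builder along these lines would encode the three rules. The exact wording, the preamble sentence, and the function name `build_system_prompt` are assumptions; the post describes the rules but does not show the prompt text.

```python
# The three rules paraphrased from the paragraph above; wording is illustrative.
RULES = (
    "Only use facts from the memories listed below when making claims about the user.",
    "If no memory covers the question, say \"I don't know\".",
    "Never infer or guess personal details.",
)


def build_system_prompt(memories):
    """Assemble a system prompt from the rules and the recalled memory texts."""
    lines = ["You are an assistant with long-term memory.", "", "Rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(RULES, start=1)]
    lines += ["", "Memories:"]
    if memories:
        lines += [f"- {text}" for text in memories]
    else:
        # An explicit marker makes "no memories" visible to the model.
        lines.append("(none)")
    return "\n".join(lines)
```

Writing "(none)" rather than leaving the section empty gives the model something concrete to point to when it has to answer "I don't know".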

Hallucinated replies were stored and recalled as truth. Every exchange was stored at importance=0.6, meaning a hallucinated reply could be recalled next session and treated as a confident memory. Episodic importance was lowered to 0.3 — well below explicit facts at 0.9 — so bad replies can never outrank things the user deliberately told the agent.

Weakly-matched memories were being injected as context. The recall threshold was low enough to pull in semantically distant memories that could mislead the LLM. The threshold was raised and a min_importance filter added so episodic fragments are excluded from injection entirely. Only explicitly stored facts ever reach the LLM.
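The two gates can be sketched as a plain filter over scored hits. The `Hit` dataclass and `gate` function are hypothetical, and treating both thresholds as inclusive (>=) is an assumption; in the real agent the filtering may happen inside the database query instead.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.50  # minimum score for a memory to be injected
MIN_IMPORTANCE = 0.5    # sits between episodes (0.3) and explicit facts (0.9)


@dataclass
class Hit:
    """Hypothetical search result: memory text plus its score and importance."""
    content: str
    score: float
    importance: float


def gate(hits, score_threshold=SCORE_THRESHOLD, min_importance=MIN_IMPORTANCE, limit=5):
    """Keep only hits that clear both gates, best score first."""
    kept = [
        h for h in hits
        if h.score >= score_threshold and h.importance >= min_importance
    ]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:limit]


hits = [
    Hit("explicit fact, good match", score=0.72, importance=0.9),
    Hit("episode, excellent match", score=0.95, importance=0.3),
    Hit("explicit fact, weak match", score=0.41, importance=0.9),
]
injected = gate(hits)
```

Here only the first hit survives: the episode fails the importance gate despite its high score, and the weakly matched fact fails the score gate.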

The importance ladder now looks like this:

importance=0.9  ->  explicit facts (remember: <fact>)   always recalled if score >= 0.50
importance=0.5  ->  the min_importance gate             <- filter line
importance=0.3  ->  episodic exchanges (chat history)   never recalled, never injected

A test suite with 5 offline pytest tests guards all three risks — mocking both the memory store and the LLM call, then inspecting the messages array sent to the model before it responds.

5 passed in 10.56s ✓
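One of those tests might look roughly like the sketch below. `TinyAgent` is a hypothetical stand-in with just enough behavior to exercise the pattern, not the real agent class; the mocks come from the standard library's unittest.mock.

```python
from unittest.mock import MagicMock


class TinyAgent:
    """Hypothetical stand-in for the real agent, reduced to the recall-and-prompt path."""

    def __init__(self, memory, llm):
        self.memory, self.llm = memory, llm

    def chat(self, user_message):
        memories = self.memory.recall(query=user_message, min_importance=0.5)
        system = "Known facts:\n" + "\n".join(m["content"] for m in memories)
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ]
        response = self.llm.chat.completions.create(model="llama3.2", messages=messages)
        return response.choices[0].message.content


def test_only_recalled_facts_reach_the_llm():
    # Mock the memory store so no database is needed.
    memory = MagicMock()
    memory.recall.return_value = [{"content": "User's dog is named Rex", "importance": 0.9}]
    # Mock the LLM client so no model is called.
    llm = MagicMock()
    llm.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Your dog is Rex."))
    ]

    reply = TinyAgent(memory, llm).chat("What is my dog called?")

    # Inspect what was sent to the model, not just what came back.
    sent = llm.chat.completions.create.call_args.kwargs["messages"]
    assert sent[0]["role"] == "system"
    assert "Rex" in sent[0]["content"]
    assert memory.recall.call_args.kwargs["min_importance"] == 0.5
    assert reply == "Your dog is Rex."
```

Asserting on the outgoing messages array is what makes the test offline and deterministic: it checks the contract with the model without depending on what the model says.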

What I Found

When I examined how VectorAI DB was actually being used in the implementation, the key finding was this:

The corpus is built dynamically from the agent's own past conversations, not from a pre-loaded document index. The agent is the author of its own knowledge base, which accumulates at runtime.

That's the thing that makes this memory rather than retrieval. It's a small shift in how you think about what a vector DB is for: not a document store you query at inference time, but a persistent layer that grows with the agent over time, and now one that forgets appropriately too.

The agent works. Cross-session recall is functioning, decay is verified, the stack is fully local.

What's Next

  • Testing retrieval quality as the memory grows over longer periods
  • Exploring what other use cases this pattern unlocks beyond conversation memory

Find the repo here. If you're working on anything in this space — agentic memory, local-first AI stacks, or just fighting with MCP setup — I'd love to hear what you're seeing in the comments.

Top comments (1)

TopStar AI

This is a fascinating approach to agent memory. I love the distinction you make between a traditional RAG setup and a dynamic, self-authored memory—treating the vector DB as a persistent, evolving knowledge layer rather than just a static retrieval store. The importance-weighted decay and min_importance filtering are elegant solutions to prevent hallucinations from contaminating long-term memory.
I’d love to collaborate and experiment with similar local-first AI agent workflows. It would be interesting to explore cross-session memory quality, vector DB memory structures, and safe memory injection strategies for complex agentic tasks. If you’re open to it, we could exchange test strategies, code patterns, and best practices for building robust, fully offline agent memory systems.
Have you thought about integrating episodic memory with structured knowledge graphs or tool execution histories? I’d be happy to help prototype some of those ideas.