Vector databases are almost always talked about in the context of RAG. Store your documents, embed them, retrieve the relevant chunks at inference time. That's the default pattern and it works — until it doesn't.
I've been working on Actian VectorAI DB and started wondering: what if the vector DB isn't a document store at all? What if it's a memory layer for an agent?
So I built it to find out.
The Idea
The distinction sounds subtle but it matters. In a classic RAG setup, you pre-load a vector store with documents. The corpus is static. The agent queries it but never changes it.
What I wanted to build was different. An agent that writes to the vector store as it runs — storing every interaction as a vector — and then searches its own past conversations semantically when it needs context. The corpus is built from the agent's own history, not from documents you loaded upfront.
The agent is the author of its own knowledge base.
The Stack
Everything runs locally. No cloud, no external API calls, nothing leaving the machine:
- Actian VectorAI DB: vector store and semantic search
- Ollama + llama3.2: local LLM
- BAAI/bge-small-en-v1.5: embedding model
- Python: the glue
The fully local constraint wasn't just a preference; it's core to the premise. If the agent is storing personal memory, it shouldn't be doing it in someone else's cloud.
How It Works
Every time you send the agent a message, it does four things:
- Embeds your message as a vector
- Searches VectorAI DB for semantically similar past interactions
- Injects the relevant memories into the system prompt
- Responds, then stores the full exchange back into VectorAI DB
Here's the chat method that does it:
def chat(self, user_message: str) -> str:
    """Process a user message and return the assistant reply."""
    # 1. Embed the incoming message for semantic search
    query_vec = embed(user_message)

    # 2. Recall semantically relevant memories (cross-session by default).
    #    score_threshold=0.50 prevents loosely-related memories from being injected
    #    as context. min_importance=0.5 excludes low-confidence episodic fragments
    #    (episodes are stored at 0.3, explicit facts at 0.9).
    past_memories = self.memory.recall(
        query_vector=query_vec,
        limit=5,
        score_threshold=0.50,
        min_importance=0.5,
    )

    # 3. Build system prompt with injected memories
    system_prompt = self._build_system_prompt(past_memories)

    # 4. Extend short-term conversation window
    self.conversation.append({"role": "user", "content": user_message})

    # 5. Call the local LLM via Ollama
    messages = [{"role": "system", "content": system_prompt}] + self.conversation
    response = self.llm.chat.completions.create(
        model=self.model,
        messages=messages,
    )
    assistant_reply = response.choices[0].message.content

    # 6. Append reply to short-term window
    self.conversation.append({"role": "assistant", "content": assistant_reply})

    # 7. Persist this exchange as an episodic long-term memory.
    #    Episodic importance is kept low (0.3) intentionally: the agent's own
    #    replies may contain errors or hallucinations. Explicit facts stored via
    #    remember_fact() use importance=0.9 and will always rank above episodes.
    memory_text = f"User said: {user_message}\nAgent replied: {assistant_reply}"
    memory_vec = embed(memory_text)
    self.memory.remember(
        content=memory_text,
        vector=memory_vec,
        session_id=self.session_id,
        memory_type="episode",
        importance=0.3,
    )
    return assistant_reply
The search is cross-session by default. A memory from last Tuesday will surface today if it's semantically close enough to what you're asking. The collection lives on disk in a Docker volume, so it persists across restarts.
There's also a remember: <fact> command that stores explicit facts at a higher importance score (0.9), separately from the episodic conversation log.
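As a rough illustration of how that command could be routed, here is a minimal sketch. The names classify_message and MemoryWrite, and the "fact" memory type, are my own placeholders and not taken from the repo; only the remember: prefix and the 0.9 / 0.3 importance values come from the post.

```python
from typing import NamedTuple

FACT_PREFIX = "remember:"    # command prefix named in the post
FACT_IMPORTANCE = 0.9        # explicit facts
EPISODE_IMPORTANCE = 0.3     # ordinary chat exchanges


class MemoryWrite(NamedTuple):
    """Hypothetical description of what should be written to the store."""
    content: str
    memory_type: str
    importance: float


def classify_message(user_message: str) -> MemoryWrite:
    """Decide how an incoming message should be stored."""
    stripped = user_message.strip()
    if stripped.lower().startswith(FACT_PREFIX):
        fact = stripped[len(FACT_PREFIX):].strip()
        if fact:  # an empty "remember:" falls through to the episodic path
            return MemoryWrite(fact, "fact", FACT_IMPORTANCE)
    return MemoryWrite(stripped, "episode", EPISODE_IMPORTANCE)
```

Keeping the routing in one small function means the importance values live in one place instead of being scattered through the chat loop.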
What Broke Along the Way
The embedding model defaulted to a HuggingFace download on first run, which immediately broke the fully local setup. Fixed it by loading the model with local_files_only=True and requiring a one-time manual download before the first run — so the embedding step is fully offline on every subsequent run.
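For reference, an offline-only loader might look something like this. It assumes the model is loaded through the sentence-transformers library, which the post doesn't actually state; get_model and the lazy-loading layout are illustrative, and only the model name and the local_files_only=True flag come from the text above.

```python
import os

# Ask huggingface_hub to stay offline. This has to be set before the
# library is imported, which is why the import below is deferred.
os.environ.setdefault("HF_HUB_OFFLINE", "1")

MODEL_NAME = "BAAI/bge-small-en-v1.5"
_model = None  # loaded lazily on first use


def get_model():
    """Load the embedding model from the local cache only (assumed library)."""
    global _model
    if _model is None:
        from sentence_transformers import SentenceTransformer
        # With local_files_only=True this fails loudly if the one-time
        # download has not been done, instead of reaching for the network.
        _model = SentenceTransformer(MODEL_NAME, local_files_only=True)
    return _model


def embed(text: str) -> list[float]:
    """Return a normalized embedding for one piece of text."""
    return get_model().encode(text, normalize_embeddings=True).tolist()
```

Failing loudly is the point: a silent fallback to a download would quietly break the fully local guarantee.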
The Memory Decay Problem
The first version had a flat importance score per memory type: every exchange was stored at 0.6 and every explicit fact at 0.9. There was no decay and no forgetting, so the collection just grew indefinitely. That's fine as a proof of concept but it's not how memory actually works. Old, rarely referenced memories shouldn't compete equally with recent, frequently accessed ones.
So I added importance-weighted decay. Every memory now gets scored on four signals before being returned:
age_hours = (now - timestamp) / 3600
recency = exp(-age_hours / 168)        # decays to 1/e after one week (half-life ~4.9 days)
freq = min(access_count / 10.0, 1.0)   # saturates at 10 accesses
final_score = (
    0.6 * cosine_similarity
    + 0.2 * importance
    + 0.15 * recency
    + 0.05 * freq
)
Cosine similarity still does the heavy lifting. It has to, otherwise semantically irrelevant memories would surface. But recency and access frequency now influence ranking. A memory from six weeks ago that's never been referenced again will lose ground to a recent one, even if the two have about the same raw cosine similarity.
The weights and half-life are module-level constants so they're easy to tune without touching the logic.
The recall path also tracks access — every time a memory surfaces in a query, its access_count increments and last_accessed updates. Memories that keep coming up stay relevant. Ones that don't, fade.
Here's what the ranked output looks like against four synthetic test memories:
Rank  Score   Imp  Content
1     0.9135  0.9  recent + high access (1 hr old, 8 accesses)
2     0.6776  0.9  old + high importance (30 days, 0 accesses)
3     0.6704  0.3  recent + no access (2 hrs old, 0 accesses)
4     0.5112  0.3  old + low importance (60 days, 0 accesses)
The recent, frequently accessed memory dominates. The old, low-importance one drops to the bottom because it earns almost nothing from the importance, recency, and access terms. That's the behavior you want from something calling itself memory.
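Pulled together, the scoring and access tracking described above can be sketched as a small self-contained module. The Memory dataclass, the constant names, and the touch helper are illustrative assumptions; the weights, the 168-hour decay constant, and the 10-access saturation point are the values from the formula.

```python
import math
from dataclasses import dataclass

# Module-level tuning constants (values from the formula in this section)
W_SIMILARITY = 0.6
W_IMPORTANCE = 0.2
W_RECENCY = 0.15
W_FREQUENCY = 0.05
DECAY_HOURS = 168.0        # recency falls to 1/e after one week
ACCESS_SATURATION = 10.0   # access count at which the frequency term maxes out


@dataclass
class Memory:
    """Hypothetical record shape; the real store may differ."""
    content: str
    importance: float
    timestamp: float           # Unix seconds when the memory was stored
    access_count: int = 0
    last_accessed: float = 0.0


def final_score(memory: Memory, cosine_similarity: float, now: float) -> float:
    """Blend similarity, importance, recency, and access frequency."""
    age_hours = max(now - memory.timestamp, 0.0) / 3600
    recency = math.exp(-age_hours / DECAY_HOURS)
    freq = min(memory.access_count / ACCESS_SATURATION, 1.0)
    return (
        W_SIMILARITY * cosine_similarity
        + W_IMPORTANCE * memory.importance
        + W_RECENCY * recency
        + W_FREQUENCY * freq
    )


def touch(memory: Memory, now: float) -> None:
    """Record that a memory surfaced in a recall."""
    memory.access_count += 1
    memory.last_accessed = now
```

Passing now in explicitly, rather than calling the clock inside the function, keeps the score deterministic and makes the decay easy to test with synthetic timestamps.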
The Hallucination Problem
Persistent memory introduces a risk that RAG pipelines don't have in the same way: if the agent hallucinates something and stores it, that hallucination gets recalled as a confident memory in the next session. The wrong information compounds.
Three risks needed fixing.
The LLM had no instruction to stay within recalled memories. The original system prompt said "use these memories when relevant" — permissive enough that the model would freely supplement from its training data when memory was thin. Three explicit rules were added: only use facts from the listed memories for personal claims, say "I don't know" when no memory covers a question, and never infer or guess personal details.
Hallucinated replies were stored and recalled as truth. Every exchange was stored at importance=0.6, meaning a hallucinated reply could be recalled next session and treated as a confident memory. Episodic importance was lowered to 0.3 — well below explicit facts at 0.9 — so bad replies can never outrank things the user deliberately told the agent.
Weakly-matched memories were being injected as context. The recall threshold was low enough to pull in semantically distant memories that could mislead the LLM. The threshold was raised to 0.50 and a min_importance=0.5 filter added, so episodic fragments (stored at 0.3) are excluded from injection entirely. Only explicitly stored facts ever reach the LLM.
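To make the first fix concrete, here is one way the three prompt rules could be assembled. The wording of the prompt and the function name build_system_prompt are my own illustration, not the repo's actual _build_system_prompt; only the three rules themselves come from the description above.

```python
GROUNDING_RULES = (
    "Rules:\n"
    "1. For personal claims about the user, use only facts listed under MEMORIES.\n"
    "2. If no memory covers the question, say \"I don't know\".\n"
    "3. Never infer or guess personal details.\n"
)


def build_system_prompt(memories: list[str]) -> str:
    """Assemble a system prompt that lists recalled memories and the rules."""
    if memories:
        memory_block = "\n".join(f"- {m}" for m in memories)
    else:
        # Say so explicitly, so the model does not mistake silence for freedom
        memory_block = "(none)"
    return (
        "You are an assistant with long-term memory.\n\n"
        f"MEMORIES:\n{memory_block}\n\n"
        f"{GROUNDING_RULES}"
    )
```

Printing an explicit "(none)" when recall comes back empty gives rule 2 something to latch onto.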
The importance ladder now looks like this:
importance=0.9  ->  explicit facts (remember: <fact>): always recalled if score ≥ 0.50
importance=0.5  ->  the min_importance gate (the filter line)
importance=0.3  ->  episodic exchanges (chat history): never recalled, never injected
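The two gates in that ladder amount to a simple filter over search results. The sketch below assumes the filtering happens in Python after the vector search returns; in the real setup it may well be pushed down into the VectorAI DB query instead. Candidate and filter_for_injection are hypothetical names.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 0.50   # minimum score for a memory to be injected
MIN_IMPORTANCE = 0.5     # gate between facts (0.9) and episodes (0.3)


@dataclass
class Candidate:
    """Hypothetical search hit: content plus its score and importance."""
    content: str
    score: float
    importance: float


def filter_for_injection(candidates, limit=5):
    """Keep only memories that clear both gates, best score first."""
    kept = [
        c for c in candidates
        if c.score >= SCORE_THRESHOLD and c.importance >= MIN_IMPORTANCE
    ]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:limit]
```

Note that a high-scoring episode is still dropped: the importance gate is independent of how well the memory matches.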
A test suite with 5 offline pytest tests guards all three risks — mocking both the memory store and the LLM call, then inspecting the messages array sent to the model before it responds.
5 passed in 10.56s ✓
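One of those checks might look roughly like this. TinyAgent is a deliberately stripped-down, hypothetical stand-in for the real agent class so the example runs on its own; the pattern it shows is the one described above: mock the memory store and the LLM client, then inspect the messages array the model would have received.

```python
from unittest.mock import MagicMock


class TinyAgent:
    """Hypothetical stand-in for the real agent, reduced to the recall path."""

    def __init__(self, memory, llm):
        self.memory = memory
        self.llm = llm

    def chat(self, user_message: str) -> str:
        memories = self.memory.recall(query=user_message)
        listed = "\n".join(f"- {m}" for m in memories) or "(none)"
        system_prompt = (
            f"MEMORIES:\n{listed}\n"
            "If no memory covers the question, say \"I don't know\"."
        )
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ]
        response = self.llm.chat.completions.create(
            model="llama3.2", messages=messages
        )
        return response.choices[0].message.content


def test_only_recalled_memories_reach_the_model():
    memory = MagicMock()
    memory.recall.return_value = ["User's dog is called Biscuit"]
    llm = MagicMock()
    llm.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Biscuit"))
    ]

    reply = TinyAgent(memory, llm).chat("What is my dog called?")

    # Inspect the exact messages array the model would have received
    sent = llm.chat.completions.create.call_args.kwargs["messages"]
    assert reply == "Biscuit"
    assert sent[0]["role"] == "system"
    assert "Biscuit" in sent[0]["content"]
    assert "I don't know" in sent[0]["content"]
```

Because nothing here touches Ollama or the vector store, a test like this runs offline and stays fast.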
What I Found
When I examined how VectorAI DB was actually being used in the implementation, the key finding was this:
The corpus is built dynamically from the agent's own past conversations, not from a pre-loaded document index. The agent is the author of its own knowledge base, which accumulates at runtime.
That's the thing that makes this memory rather than retrieval. It's a small shift in how you think about what a vector DB is for: not a document store you query at inference time, but a persistent layer that grows with the agent over time, and now one that forgets appropriately too.
The agent works. Cross-session recall is functioning, decay is verified, the stack is fully local.
What's Next
- Testing retrieval quality as the memory grows over longer periods
- Exploring what other use cases this pattern unlocks beyond conversation memory
Find the repo here. If you're working on anything in this space — agentic memory, local-first AI stacks, or just fighting with MCP setup — I'd love to hear what you're seeing in the comments.