Julien L for WiScale

Posted on Jun 29

Your AI agent forgets. Mine doesn't - and it works on a plane, in a hospital, with wifi off.

#mcp #tutorial #python #ai

Six months ago you recommended switching your client's invoicing tool. Last week they asked why. You have no idea - the conversation happened in three meetings, a Slack thread, and a spreadsheet comparison no one archived. Your AI assistant is useless here too: it only knows what you paste into the prompt.

This is not a context-window problem. It is a memory architecture problem.

Why vector search alone is not enough

Most "persistent memory" solutions for LLMs work by storing past exchanges as text chunks and retrieving them by cosine similarity. Ask "what did we decide about the invoicing tool?" and a chunk mentioning the decision floats to the top - if your query looks like the answer.

It breaks the moment you ask why. The reason the CFO pushed back on the original tool was buried in a budget meeting note that shares no words with "invoicing decision". Pure vector search is blind to it by construction.
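
To make the failure mode concrete, here is a deliberately crude sketch that uses word counts in place of real embeddings. Real embedding models are less brittle than this, but the shape of the problem is the same: a note that looks nothing like the query scores near zero. The two notes and the helper names (bag_of_words, cosine) are made up for illustration and are not part of velesdb-memory.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical notes: one states the decision, the other holds the reason.
notes = {
    "decision": "We decided to replace the invoicing tool next quarter",
    "budget": "CFO set a hard cap of 12k EUR per year on finance software",
}

query = bag_of_words("what did we decide about the invoicing tool")
scores = {name: cosine(query, bag_of_words(text)) for name, text in notes.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

The decision note ranks first because it shares words with the query; the budget note, which holds the actual reason, shares none and scores zero.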

What you actually need is three distinct memory structures - the same ones cognitive science has described since the 1970s:

+-----------------+------------------------------+----------------------------------+
| Type            | What it stores               | Answers                          |
+-----------------+------------------------------+----------------------------------+
| Semantic        | Facts, decisions             | What? Why? What is our position? |
| Episodic        | Events with a timestamp      | When? Who said what?             |
| Procedural      | Learned patterns + steps     | How do we usually handle this?   |
+-----------------+------------------------------+----------------------------------+
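
As a rough mental model, the three rows of the table can be pictured as three record shapes. The class and field names below are assumptions chosen for illustration; they are not the storage format velesdb-memory uses.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMemory:
    """A fact or decision, with free-form metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class EpisodicMemory:
    """An event anchored to a point in time (Unix seconds)."""
    description: str
    timestamp: int


@dataclass
class ProceduralMemory:
    """A reusable pattern: ordered steps plus a confidence score."""
    name: str
    steps: list
    confidence: float = 0.5


fact = SemanticMemory("Pennylane chosen over Sage", {"project": "acme-corp"})
event = EpisodicMemory("CFO meeting on budget cap", timestamp=1768435200)  # 2026-01-15 00:00 UTC
procedure = ProceduralMemory("SME vendor selection", ["map hard constraints", "shortlist to 3"])
```

The point of keeping the shapes separate is that each one answers a different kind of question: content for "what and why", timestamp for "when", steps for "how".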

velesdb-memory is an MCP server that exposes exactly these three subsystems - as five high-level tools your agent can call without knowing anything about vectors, graphs, or databases.

What velesdb-memory actually is

It is a single binary that speaks the Model Context Protocol over stdio. Client and server run on the same machine. Memory never leaves your machine.

+------------------+        stdio/MCP        +-------------------+
| Claude Code      |  ───────────────────►   | velesdb-memory    |
| Cursor           |                         | (one binary)      |
| Cline / Zed      |  ◄───────────────────   |                   |
| Codex / opencode |                         | vector + graph    |
+------------------+                         | + columnar store  |
                                             +-------------------+
                                                      │
                                               ~/.velesdb-memory/
                                               (stays on your disk)
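
For the curious, this is roughly what travels over that stdio pipe. MCP is built on JSON-RPC 2.0, and a tool invocation is a tools/call request carrying the tool name and its arguments. The sketch below only builds the message; the helper name tool_call and the "query" argument key for recall are assumptions, so check the README for the real argument schema.

```python
import json


def tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single line of JSON-RPC 2.0."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# "query" is an assumed argument name, used here only for illustration.
line = tool_call(1, "recall", {"query": "invoicing decision"})
print(line)
```

In practice your MCP client builds and sends these messages for you; you never write them by hand.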

Five tools, all JSON:

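+----------+--------------------------------------------------------------+
| Tool     | What it does                                                 |
+----------+--------------------------------------------------------------+
| remember | store a fact, optionally tagged and linked to other memories |
| recall   | semantic search, with optional metadata filter               |
| relate   | create a typed edge between two memories                     |
| forget   | delete a memory by id                                        |
| why      | recall + multi-hop graph traversal (the differentiator)      |
+----------+--------------------------------------------------------------+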

There is a sixth tool, remember_extracted, that passes raw text through a local LLM and builds the graph automatically - but you do not need it to understand the core idea.

A scenario: Sofia, management consultant

Sofia advises companies on digital transformation. She runs three to five simultaneous engagements, each lasting six months. She needs her AI assistant to remember:

Strategic decisions and their rationale (semantic)
Key conversations: the CFO meeting, the risk workshop, the board presentation (episodic)
Learned procedures: how she runs a vendor selection, her risk assessment checklist (procedural)

Let us build her memory layer.

Setting up

# Recommended — install directly from crates.io (v0.3.1)
cargo install velesdb-memory

# Or build from source
cargo build --release -p velesdb-memory

The default build is dependency-free. For real semantic recall, build with Ollama support:

cargo build --release -p velesdb-memory --features ollama
ollama pull all-minilm

Then configure your client. For Claude Code:

claude mcp add velesdb-memory \
  --env VELESDB_MEMORY_PATH="$HOME/.velesdb-memory" \
  -- /path/to/velesdb-memory

For Cursor (~/.cursor/mcp.json), Cline (cline_mcp_settings.json), or any other MCP client:

{
  "mcpServers": {
    "velesdb-memory": {
      "command": "/path/to/velesdb-memory",
      "env": { "VELESDB_MEMORY_PATH": "/home/you/.velesdb-memory" }
    }
  }
}

Zed uses a slightly different key (context_servers), and Codex uses codex mcp add or a TOML config; full snippets are in the README.

Once configured, the agent discovers the tools automatically. No restarts, no plugins, no API keys.

What the agent does with the tools

Storing a strategic decision

At the end of a vendor selection meeting, Sofia's agent calls:

// remember - store a fact with metadata and a typed link to another memory
remember {
  "fact": "We recommended Pennylane over Sage for Acme Corp invoicing because Sage lacks multi-currency support and Pennylane's API team offered a 6-month implementation guarantee.",
  "metadata": { "project": "acme-corp", "type": "decision", "author": "sofia" },
  "links": [ { "target": 4820193847, "relation": "follows_from" } ]
}
→ { "id": 9876543210 }

The returned id is stable and derived from the content - storing the same fact twice is idempotent.
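
One way to get that property is to hash the text and use part of the digest as the id. The sketch below shows the idea; the hashing scheme (SHA-256 truncated to 64 bits) and the in-memory store are assumptions for illustration, and the scheme velesdb-memory actually uses may differ.

```python
import hashlib

store = {}


def content_id(fact: str) -> int:
    """Derive a stable 64-bit id from the text (hypothetical scheme)."""
    digest = hashlib.sha256(fact.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def remember(fact: str) -> int:
    """Insert keyed by content id, so storing the same text again changes nothing."""
    memory_id = content_id(fact)
    store[memory_id] = fact
    return memory_id


first = remember("Budget cap is 12k EUR per year")
second = remember("Budget cap is 12k EUR per year")
print(first == second, len(store))
```

A practical consequence: an agent that retries a failed call, or re-reads the same meeting notes, does not create duplicates.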

Recording a key conversation

// remember - the CFO meeting that triggered the re-evaluation
remember {
  "fact": "CFO at Acme Corp: budget cap is 12k EUR per year. Sage renewal is 14.8k. This is the hard constraint that ruled out Sage.",
  "metadata": { "project": "acme-corp", "type": "meeting", "date": "2026-01-15" },
  "links": [ { "target": 9876543210, "relation": "motivated" } ]
}
→ { "id": 4820193847 }

Storing a learned procedure

remember {
  "fact": "Vendor selection for SME finance tools - step 1: map hard constraints (budget, compliance, integration). Step 2: shortlist to 3. Step 3: run a 2-week pilot on live data. Step 4: present with a documented decision matrix.",
  "metadata": { "type": "procedure", "domain": "vendor-selection" }
}
→ { "id": 1122334455 }

Linking things together

After the client signed the contract:

relate {
  "from": 9876543210,
  "to": 4820193847,
  "relation": "decided_in"
}
→ { "edge_id": 7 }

The `why` query: what changes everything

Six months later, Acme Corp asks Sofia why they switched invoicing tools. She asks her agent:

why {
  "decision": "why did we switch from Sage to Pennylane",
  "filter": { "project": "acme-corp" },
  "max_hops": 2
}

The response:

{
  "nodes": [
    { "id": 9876543210, "hop": 0, "content": "We recommended Pennylane over Sage... multi-currency... 6-month implementation guarantee." },
    { "id": 4820193847, "hop": 1, "content": "CFO at Acme Corp: budget cap is 12k EUR... Sage renewal is 14.8k. This is the hard constraint that ruled out Sage." }
  ],
  "edges": [
    { "from": 9876543210, "to": 4820193847, "relation": "decided_in" }
  ]
}

A plain recall query would have returned the decision text (hop 0, which shares words with the query). It would most likely not have returned the CFO meeting note (hop 1): apart from the vendor name "Sage", that note talks about a "budget cap" and "14.8k", and has almost nothing in common with "why did we switch from Sage to Pennylane".

The graph reaches it because the relation exists. That is the gap.
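
The traversal half of why can be pictured as a bounded breadth-first walk. The sketch below assumes the seed ids come from an ordinary vector recall and that edges are followed in both directions; both are assumptions, and this is not the engine's actual implementation.

```python
from collections import deque


def why(seeds, edges, max_hops):
    """Breadth-first walk from the seed ids, recording the hop count of each node."""
    neighbours = {}
    for src, dst, _relation in edges:
        # Assumption: an edge can be walked in either direction.
        neighbours.setdefault(src, []).append(dst)
        neighbours.setdefault(dst, []).append(src)

    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop budget
        for nxt in neighbours.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


edges = [(9876543210, 4820193847, "decided_in")]
result = why(seeds=[9876543210], edges=edges, max_hops=2)
print(result)
```

Note that the walk never looks at the text of the CFO note. It is reached purely because an edge exists, which is why wording no longer matters at hop 1.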

How big is the gap, exactly?

The advantage of why is not just a claim - it is measured. The repo ships three reproducible benchmarks with no LLM in the scoring loop (pure retrieval metrics on public datasets):

Multi-hop recall (graph engine) - HotpotQA, 3000 dev questions:

vector only:   both bridge facts recalled  →  baseline
vector + graph: both bridge facts recalled →  +7.2 percentage points on bridge questions

The win replicates on 2WikiMultiHopQA (+3.1pp on bridged types).

Time-scoped recall (ColumnStore) - TimeQA (real Wikipedia bios):

vector only:   gold-sentence recall  →  baseline
vector + filter: year-range predicate →  +9.7 percentage points

A pure cosine score cannot distinguish "she won the award in 1987" from "she won the award in 2003". A numeric filter can.
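
A minimal sketch of that idea: apply the year predicate first, then rank what is left by similarity. The similarity scores below are invented for illustration (two near-identical sentences would score almost the same), and recall_in_range is a hypothetical helper, not a velesdb-memory function.

```python
def recall_in_range(candidates, year_from, year_to, top_k=3):
    """Keep candidates whose year falls in [year_from, year_to], then rank by score."""
    in_range = [c for c in candidates if year_from <= c["year"] <= year_to]
    return sorted(in_range, key=lambda c: c["score"], reverse=True)[:top_k]


# Illustrative values only: the scores are made up to show a near-tie.
candidates = [
    {"content": "she won the award in 1987", "year": 1987, "score": 0.91},
    {"content": "she won the award in 2003", "year": 2003, "score": 0.92},
]

hits = recall_in_range(candidates, 1980, 1990)
for h in hits:
    print(h["content"])
```

Without the filter the 2003 sentence would rank first on score alone; with it, only the 1987 sentence survives.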

The engines compound (tri-engine benchmark):

On a task that requires both multi-hop traversal and time-scoped filtering:

graph alone:       +7.2pp
columnstore alone: +9.7pp
both together:     +29pp  (more than the 16.9pp sum of the two)

Run any of these yourself:

# multi-hop benchmark
cargo run --release -p velesdb-memory --example bench_multihop

# time-scoped benchmark
cargo run --release -p velesdb-memory --example timeqa

What "offline" means in practice

The default binary has zero network dependencies. The memory store is a directory on your disk (~/.velesdb-memory/). The binary is around 9 MB.

With the default hash embedder, recall is keyword-style (deterministic, good for why because the graph does the heavy lifting). For real semantic recall, add Ollama - the model runs locally, so memory still never reaches the internet:

VELESDB_MEMORY_EMBEDDER=ollama \
VELESDB_MEMORY_OLLAMA_MODEL=all-minilm \
  /path/to/velesdb-memory
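
To see why a hash embedder behaves like keyword matching, here is a sketch of the general technique (feature hashing): every word is hashed into one bucket of a fixed-size vector, so two texts are similar only when they share words. This illustrates the technique in general; it is an assumption that the default embedder works along these lines, and the function names are hypothetical.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 384) -> list:
    """Feature hashing: each word increments one bucket, then the vector is L2-normalized."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int.from_bytes(hashlib.sha256(word.encode()).digest()[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def dot(a, b):
    """Dot product; equals cosine similarity for normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


a = hash_embed("Sage renewal quote")
b = hash_embed("Sage renewal quote")
c = hash_embed("Sage pricing")
print(a == b, round(dot(a, a), 3), dot(a, c) > 0)
```

No model, no network, and the same text always maps to the same vector. The trade-off is that synonyms and paraphrases do not match, which is the gap the Ollama embedder fills.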

This is not "privacy-preserving mode" - it is the only mode. There is no cloud path.

The auto-extraction shortcut

If you do not want to call remember and relate manually, the remember_extracted tool does it in one step. It sends raw text to a local LLM (via Ollama), extracts individual facts, wires the entity graph automatically, and stores everything:

remember_extracted {
  "text": "Met Yannick from the Acme procurement team. He confirmed the board approved the Pennylane migration. The CFO's concern about training cost has been resolved by the vendor's onboarding package."
}
→ { "ids": [11122233, 44455566, 77788899] }

Three facts stored, entity relationships auto-wired, all reachable by why. To enable it:

cargo build --release -p velesdb-memory --features extract
VELESDB_MEMORY_EXTRACTOR=ollama \
VELESDB_MEMORY_EXTRACTOR_MODEL=qwen3:8b \
  /path/to/velesdb-memory

The standard build does not include this - it keeps the default binary tiny and offline.

Using the Python library directly

If you prefer to embed memory into your own application rather than use the MCP server, the same engine is available as a Python package:

import time

import velesdb
from sentence_transformers import SentenceTransformer

db = velesdb.Database("./sofia_memory")
memory = db.agent_memory(384, snapshot_dir="./sofia_memory/snapshots")  # 384-dim embeddings

# load the embedding model once, not on every call
# (use sentence-transformers, Ollama, or any embedder)
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text):
    # all-MiniLM-L6-v2 produces 384-dim vectors, matching agent_memory(384)
    return model.encode(text, normalize_embeddings=True).tolist()

# store a fact
memory.semantic.store(
    id=1,
    content="Pennylane chosen over Sage: multi-currency support + budget fits 12k EUR cap",
    embedding=embed("Pennylane Sage invoicing decision")
)

# query
results = memory.semantic.query(embed("why Pennylane"), top_k=3)
for r in results:
    print(f"[{r['score']:.2f}] {r['content']}")

# episodic: the CFO meeting
memory.episodic.record(
    event_id=2,
    description="CFO confirmed: Sage renewal quote is 14.8k, over 12k cap",
    timestamp=int(time.time()) - 30 * 86400,  # 30 days ago
    embedding=embed("CFO budget constraint Sage renewal")
)

# procedural: a reusable pattern
memory.procedural.learn(
    procedure_id=3,
    name="SME vendor selection",
    steps=["map hard constraints", "shortlist to 3", "run 2-week pilot", "present decision matrix"],
    embedding=embed("vendor selection SME procedure"),
    confidence=0.9
)

# reinforce if the pattern worked well
memory.procedural.reinforce(procedure_id=3, success=True)

# snapshot to survive restarts
memory.snapshot()

pip install velesdb
python3 -c "import velesdb; print(velesdb.__version__)"
# 3.4.0

Using the Node.js package directly

The same engine ships as an npm package with prebuilt platform binaries — no Rust toolchain needed at install time:

npm install @wiscale/velesdb-memory-node

The API is a single async class — no subsystems, no embeddings to manage yourself:

import { MemoryService } from '@wiscale/velesdb-memory-node'

// Open (or create) a persistent store. Sync factory, all methods are async.
const mem = MemoryService.open('./sofia_memory', 'hash')
// Use 'ollama' as second arg for real semantic recall (requires Ollama running locally)

// Store a fact — returns its id as a decimal string
const decisionId = await mem.remember(
  'We recommended Pennylane over Sage: multi-currency support + 12k EUR budget cap',
  [],
  { project: 'acme-corp', type: 'decision' }
)

// Store the reason and link it
const reasonId = await mem.remember(
  'CFO confirmed: Sage renewal quote is 14.8k EUR, over the 12k annual cap',
  [],
  { project: 'acme-corp', type: 'meeting', date: '2026-01-15' }
)

// Typed link: decision was motivated by the CFO meeting
await mem.relate(decisionId, reasonId, 'decided_in')

// Plain recall — vector similarity
const hits = await mem.recall('why Pennylane', 3)
hits.forEach(h => console.log(`[${h.score.toFixed(2)}] ${h.content}`))

// why() — vector seed + multi-hop graph traversal
const { nodes, edges } = await mem.why('why did we switch from Sage to Pennylane', 2)
nodes.forEach(n => console.log(`hop ${n.hop}: ${n.content}`))
// hop 0: the decision  →  hop 1: the CFO meeting (few shared words; the graph found it)

One feature is exclusive to the Node.js binding: recallWhere, which combines vector search with ColumnStore range filters in a single call — no Python counterpart:

// Recall only memories whose date is on or after 2026-01-01
const recent = await mem.recallWhere(
  'budget constraint',
  [{ field: 'date', op: 'ge', value: '2026-01-01' }],
  5
)

What it is not

velesdb-memory is a single-process embedded library. It is not designed for concurrent access from multiple processes, nor for storing millions of memories on behalf of many users. It fits one agent, one user, one machine - which is exactly the shape the use cases above require.

Extraction quality depends on the local model you point remember_extracted at. A smaller model extracts noisier facts than a larger one. The graph and the retrieval engine are solid; the extraction layer is as good as the model you bring.

Getting started

# No clone needed — install directly from crates.io
cargo install velesdb-memory
velesdb-memory --help

# Or from source
git clone https://github.com/cyberlife-coder/VelesDB
cargo build --release -p velesdb-memory

Documentation and examples are at velesdb.com. If this was useful, a star on the GitHub repo helps other developers find the project, and we are always looking for partners with local-first or sovereign data requirements - details on velesdb.com.

Which use case resonates most with you - knowledge work (consulting, research, legal), coding assistance, or something else entirely? Drop a comment below.

Top comments (3)

Khalid Salameh • Jun 30

Ummm, two questions. 1) How is the performance on the why query when you have thousands of memories? 2) In edge cases, like when a user shares two conflicting facts about the same topic, how does it behave: does it surface both, pick the newest, or let the agent decide? I wanna understand plz.

Julien L WiScale • Jun 30 • Edited

Good questions, these are exactly the two things people hit first.

On 1: the why query is really two steps. It first does a normal vector recall to grab a few entry points, then walks the graph from there. The vector part runs on the same HNSW index as everything else, so it stays sub-millisecond even when the store grows (our scaling bench is basically flat, ~116µs at 100K and ~129µs at 1M, index only). The graph walk only depends on max_hops and how connected your nodes are, not on the total number of memories. So a few thousand is a non-issue. What actually slows it down is a very dense graph with a high max_hops, not the raw count.

On 2: it keeps both, it won't pick a winner for you. remember is idempotent on the text (same string, same id), but two genuinely different statements are two memories and both stay. recall ranks them by similarity, and why follows whatever links you created. There's no built-in truth resolution on purpose: the engine gives you the evidence, the agent decides. In practice people do one of three things: add a date in the metadata and prefer the most recent (the Node binding has recallWhere, which does the vector search + a range filter in one call), add a supersedes link from the new fact to the old one so why surfaces the change, or just hand both back to the agent and let it sort it out.

Which is basically the other half of your second question. Since that policy lives in the agent, the cleanest way to keep it consistent is to wrap the MCP in a small skill on the agent side (just an instructions file your coding agent loads) that decides when and how to call the tools: always tag a remember with project/type/timestamp so you can filter by time later, always add the typed link after a decision (that's what turns a flat recall into a working why), resolve conflicts the same way every time, and route the query (a "why" goes to why(), a "what did we know as of X" goes to a time-filtered recall, the rest to recall). The MCP gives you the raw tools, the skill gives the agent the discipline to use them consistently.

Happy to dig into either one. If you want to test the why path on your own data, the bench_multihop example in the repo is a good place to start.

Julien L WiScale • Jun 30

Quick follow-up to make the conflict case (point 2) concrete:

// store the old fact
remember { "fact": "Acme uses Sage for invoicing" }
→ { "id": 4820193847 }

// store the new fact, linked to the old one in the same call
remember { "fact": "Acme switched to Pennylane",
"links": [ { "target": 4820193847, "relation": "supersedes" } ] }
→ { "id": 9876543210 }

why { "decision": "what invoicing tool does Acme use?", "max_hops": 2 }
→ { "nodes": [ { "id": 9876543210, "hop": 0, "content": "Acme switched to Pennylane" },
               { "id": 4820193847, "hop": 1, "content": "Acme uses Sage for invoicing" } ],
    "edges": [ { "from": 9876543210, "to": 4820193847, "relation": "supersedes" } ] }

Both facts come back, and the supersedes edge tells the agent which one wins. One thing worth clearing up: the id you pass in links is just what the previous remember returned, so the agent threads it through from its own context, you're not typing ids by hand. And if you'd rather not manage links at all, remember_extracted takes raw text and wires the graph for you.