Your LLM forgets everything after each session. MCP lets you fix that with 3 tools and zero infrastructure.
No database cluster. No Redis. No "memory service" running in the background. Just a Python script and a 3MB embedded engine that persists to disk.
## What is MCP (in 30 seconds)
MCP (Model Context Protocol) is the standard way LLMs connect to external tools. Donated to the Linux Foundation, it's now supported by Claude, GPT, Gemini, and most agent frameworks. You write a server that exposes "tools"; the LLM discovers them and calls them when needed.
Think of it as a USB port for AI: plug in a memory server, and every LLM that speaks MCP gets long-term memory.
## The full server: 60 lines of Python
Here's a working MCP server that gives any LLM three memory tools: store facts, recall facts, and record events.
```python
from fastmcp import FastMCP
from sentence_transformers import SentenceTransformer
from datetime import datetime
import time

import velesdb

# --- Setup ---
model = SentenceTransformer("all-MiniLM-L6-v2")
db = velesdb.Database("./llm_memory")
memory = db.agent_memory(384)
mcp = FastMCP("LLM Memory Server")


def embed(text: str) -> list[float]:
    return model.encode(text).tolist()


# --- Tool 1: Store a fact ---
@mcp.tool
def remember_fact(content: str) -> str:
    """Store a fact in long-term semantic memory. Use this when you learn
    something important about the user, their project, or their preferences."""
    vec = embed(content)
    fact_id = int(time.time() * 1000) % 2_000_000_000
    memory.semantic.store(fact_id, content, vec)
    return f"Stored: {content}"


# --- Tool 2: Recall relevant facts ---
@mcp.tool
def recall_facts(query: str, top_k: int = 5) -> list[dict]:
    """Search long-term memory for facts relevant to a query.
    Use this at the start of conversations or when you need context."""
    results = memory.semantic.query(embed(query), top_k=top_k)
    return [{"content": r["content"], "score": round(r["score"], 4)} for r in results]


# --- Tool 3: Record a timestamped event ---
@mcp.tool
def record_event(description: str) -> str:
    """Record a timestamped event in episodic memory.
    Use this to log important moments: decisions, errors, milestones."""
    vec = embed(description)
    event_id = int(time.time() * 1000) % 2_000_000_000
    now = int(time.time())
    memory.episodic.record(event_id, description, now, vec)
    ts = datetime.fromtimestamp(now).strftime("%Y-%m-%d %H:%M")
    return f"Recorded at {ts}: {description}"


if __name__ == "__main__":
    mcp.run()
```
Save this as `memory_server.py`. That's the entire server.
## What each tool does
`remember_fact(content)` takes any string and stores it as a vector in semantic memory. The embedding is generated automatically using `all-MiniLM-L6-v2` (384 dimensions). When the LLM learns that you prefer dark mode, it calls this tool.

`recall_facts(query, top_k)` searches memory by semantic similarity. The LLM calls this at the start of a conversation to load context about you, or mid-conversation when it needs to check what it already knows.

`record_event(description)` logs a timestamped event: "User deployed to production." "Fixed the CSS bug." "User said the recursive chunking approach worked." This gives the LLM a timeline of what happened.
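Both tools derive record IDs the same way: a millisecond timestamp folded into a range that fits a signed 32-bit integer. A minimal sketch of that scheme (the `make_id` name is mine, not part of the server), with its main caveat spelled out:

```python
import time

def make_id() -> int:
    # Millisecond clock, folded into [0, 2_000_000_000) so it fits
    # comfortably in a signed 32-bit integer. Two calls in the same
    # millisecond would collide, so this is a convenience for a
    # single-user memory server, not a general-purpose ID scheme.
    return int(time.time() * 1000) % 2_000_000_000

uid = make_id()
assert 0 <= uid < 2_000_000_000
```

For a multi-user or high-throughput setup you'd swap this for a monotonic counter or UUID-derived integers.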
## Connect it to Claude Desktop

Add this to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "memory": {
      "command": "python",
      "args": ["memory_server.py"],
      "cwd": "/path/to/your/project"
    }
  }
}
```
On macOS, the config file lives at `~/Library/Application Support/Claude/claude_desktop_config.json`. On Windows: `%APPDATA%\Claude\claude_desktop_config.json`.
Restart Claude Desktop. You should see the memory tools appear in the tools list.
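If the tools don't appear, a malformed config is the usual culprit: Claude Desktop tends to ignore the server silently rather than report a JSON error. A quick stdlib sanity check of the config above (the path is a placeholder, as in the original):

```python
import json

# The same config as above; json.loads raises json.JSONDecodeError
# on a trailing comma, missing quote, or similar slip.
config_text = '''{
  "mcpServers": {
    "memory": {
      "command": "python",
      "args": ["memory_server.py"],
      "cwd": "/path/to/your/project"
    }
  }
}'''

config = json.loads(config_text)
assert "memory" in config["mcpServers"]
```

Running your real config file through `json.loads` the same way catches the syntax errors before a restart cycle does.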
## A real conversation
Here's what it looks like when the LLM actually uses the memory:
**You:** I'm building a RAG pipeline for legal documents. We're using LangChain with a recursive text splitter, chunk size 512.

Claude calls `remember_fact` and `record_event`:

```
remember_fact("User is building a RAG pipeline for legal documents using LangChain, recursive text splitter, chunk size 512")
→ Stored: User is building a RAG pipeline for legal documents...

record_event("Started discussing legal RAG pipeline architecture")
→ Recorded at 2026-03-28 10:15: Started discussing legal RAG pipeline architecture
```
The next day, you open a new chat:
**You:** Can you help me with my project?

Claude calls `recall_facts`:

```
recall_facts("user's current project")
→ [{"content": "User is building a RAG pipeline for legal documents using LangChain, recursive text splitter, chunk size 512", "score": 0.4218}]
```

**Claude:** Sure! Last time we talked about your legal RAG pipeline with LangChain. You were using a recursive text splitter with chunk size 512. Where did you leave off?
No context window stuffing. No manually copy-pasting previous conversations. The LLM just remembers.
## Let's verify the similarity scores
The scores above are real. Here's a standalone test you can run:
```python
from sentence_transformers import SentenceTransformer
import velesdb

model = SentenceTransformer("all-MiniLM-L6-v2")
db = velesdb.Database("./test_memory")
memory = db.agent_memory(384)

def embed(text):
    return model.encode(text).tolist()

# Store some facts
memory.semantic.store(1, "User is building a RAG pipeline for legal documents", embed("User is building a RAG pipeline for legal documents"))
memory.semantic.store(2, "User prefers dark mode in all applications", embed("User prefers dark mode in all applications"))
memory.semantic.store(3, "User's timezone is Europe/Paris", embed("User's timezone is Europe/Paris"))

# Query
results = memory.semantic.query(embed("user's current project"), top_k=3)
for r in results:
    print(f"[{r['score']:.4f}] {r['content']}")
```

The output:

```
[0.4218] User is building a RAG pipeline for legal documents
[0.1340] User prefers dark mode in all applications
[0.0892] User's timezone is Europe/Paris
```
The most relevant fact surfaces first with a clear score gap. That's cosine similarity doing its job, with a locally run model producing 384-dimensional embeddings.
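Those scores are plain cosine similarity: the dot product of two vectors divided by the product of their lengths. A from-scratch sketch on toy 3-dimensional vectors (stand-ins, not real 384-dimensional embeddings) shows how the ranking falls out:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]          # "user's current project"
fact_project = [0.8, 0.3, 0.1]   # toy vector pointing the same general way
fact_darkmode = [0.1, 0.2, 0.9]  # toy vector pointing elsewhere

# The fact whose direction matches the query scores higher.
assert cosine(query, fact_project) > cosine(query, fact_darkmode)
```

Direction is what matters, not magnitude: two texts about the same topic produce embeddings that point roughly the same way, so their cosine score is high even if the raw vectors differ in length.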
## Building this from scratch vs. using VelesDB
| What you need | From scratch | VelesDB MCP server |
|---|---|---|
| Vector storage | Pinecone/Weaviate/Chroma + API keys | `db.agent_memory(384)` |
| Embedding | OpenAI API calls ($) or local model setup | `sentence-transformers` (local, free) |
| Timestamped events | PostgreSQL + schema design | `memory.episodic.record()` |
| Persistence | Managed DB or Docker volume | Automatic (file on disk) |
| MCP integration | Custom protocol handler | `@mcp.tool` decorator |
| Infrastructure | 3+ services running | Zero; one Python file |
| Binary size | 500MB+ Docker images | ~3MB embedded engine |
## Getting started

Install:

```bash
pip install velesdb fastmcp sentence-transformers
```
The source is on GitHub. VelesDB is source-available under the Elastic License 2.0.
The full MCP server code from this article is ready to copy-paste. Save it as `memory_server.py`, point Claude Desktop at it, and your LLM will never forget again.
What's your biggest pain point with LLM memory right now? Are you building something with MCP? I'd love to hear about it in the comments.