Elvis Mørales Fdz
How to give Claude Code persistent memory with a self-hosted mem0 MCP server

Last week I spent two hours with Claude Code debugging a token refresh race condition. I traced it through the auth middleware, tested four approaches, and finally found that the session timeout window overlaps with the token refresh cycle on my setup. Three-line fix. The next day, a similar auth timing issue appeared in a different service. Claude suggested some of the exact approaches we'd already tried and rejected the day before.

That's the kind of knowledge that falls through the cracks between Claude Code sessions. Yes, CLAUDE.md stores static rules and Auto Memory saves compressed summaries. But neither captures the full diagnostic path, which approaches you tried, why three of them failed, and the specific conditions that made the fourth one work. That detail disappears when the session ends.

I went looking for MCP memory servers and other solutions that could fill that gap. Most either depended on running in the cloud, gave me too little control over the local setup, or required adding a separate API key for their internal LLM operations. Claude Code already authenticates through an OAuth Access Token (OAT), and the SDK supports it, so adding another key felt redundant and came with extra API costs.

During that search I came across mem0. I went through their documentation, tried the OpenClaw plugin to see how the library handles memory extraction and semantic search, and liked the approach. I patched it to reuse Claude Code's existing OAT token instead of requiring a separate key and submitted the change upstream. Their official MCP integration server is cloud-only though, so I built mem0-mcp-selfhosted, a local version backed by infrastructure I can fully control.

The stack runs on Qdrant for vector storage, Ollama for local embeddings, and optional Neo4j for a knowledge graph that I added later. I also set it up to route different operations to the best LLM for each task. It exposes eleven tools that let your Claude Code instance manage long-term memory, and your memory data never leaves your machine.

This article covers how this MCP server works, how to set it up in about 15 minutes, and how to get Claude using memory automatically without you triggering it.


Why Claude Code's built-in memory falls short for accumulated knowledge

Does Claude Code remember between sessions?

Partially. Claude Code has three persistence mechanisms that carry context forward: CLAUDE.md files you write yourself, Auto Memory where Claude saves notes about your project patterns and preferences, and Session Memory that extracts summaries from past conversations. All three load at session start, and they cover a lot of ground.

Static rules, project conventions, and compressed summaries of past work carry forward just fine. If you told Claude to use PostgreSQL last week, it might remember it.

What doesn't carry forward is the detailed reasoning behind your decisions. When you spend an afternoon choosing between Redis and database-backed sessions, weighing operational complexity and infrastructure costs, and ultimately picking database sessions because your traffic doesn't justify a separate Redis instance yet, that full reasoning chain gets compressed into a one-line summary at best. The next session, Claude might suggest Redis for caching and you have to walk through the tradeoff analysis again.

Three categories of knowledge get lost or compressed beyond usefulness:

  • Decision reasoning. Not what you decided, but why and under what conditions. "We chose in-memory caching over Redis because at current scale it's premature optimization. Revisit at 10k rps." Auto Memory might note the decision, but the conditional logic that makes it useful, the part about when to revisit, gets lost in compression.
  • Debugging insights. "The flaky test failures on CI were caused by state leakage between test groups, not async issues. We proved this by isolating test groups last Tuesday." Session Memory might summarize "fixed flaky tests" but not the three-hour diagnostic path that saves you from repeating the same investigation.
  • Cross-project patterns. You build JWT middleware on Project A. Two weeks later, Project B needs authentication. Auto Memory and Session Memory are project-scoped, and while a global CLAUDE.md can carry some context across repos, it's a static file, not a searchable knowledge base. The pattern exists in a different repo, but Claude has no way to find it.

The built-in memory helps, but it has structural limits

CLAUDE.md works well for project rules. Auto Memory adds automatic note-taking, which is a real improvement over manual curation alone. I use both, and I recommend them.

But they share three structural limitations:

  • No search. Everything loads at session start regardless of relevance. At 200+ entries, you're burning context tokens on information Claude doesn't need for this particular task.
  • Summaries, not reasoning. Auto Memory and Session Memory compress multi-hour sessions into short notes. The compression loses the detail that matters most, which approaches failed and why.
  • Mostly project-scoped. Auto Memory is strictly per-project. A global CLAUDE.md can carry rules across repos, but it's a flat file you maintain by hand, not a searchable store of accumulated knowledge.
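To make the "no search" cost concrete, here's a back-of-envelope comparison. All the figures (entry count, tokens per entry, top-k) are assumptions for illustration, not measurements:

```python
# Back-of-envelope: loading every memory entry at session start vs. retrieving
# only the top-k relevant ones. All figures are assumptions for illustration.
ENTRIES = 200          # stored memory entries
TOKENS_PER_ENTRY = 80  # rough average entry size in tokens
TOP_K = 5              # entries a semantic search would return

load_all = ENTRIES * TOKENS_PER_ENTRY  # tokens spent loading everything
top_k = TOP_K * TOKENS_PER_ENTRY       # tokens spent on targeted retrieval
print(load_all, top_k)  # 16000 400
```

A 40x difference in context spent per session, and the gap widens as entries accumulate.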

That's the gap Claude Code persistent memory with semantic search fills. Ask "what went wrong with Redis last month?" and get back the full reasoning: "rejected Redis for session storage because the operational overhead wasn't justified at our traffic levels. Switched to database-backed sessions. Revisit if we hit 10k concurrent users." The words don't match at all, but the meaning does.


What mem0 gives Claude Code: persistent memory with semantic search

This MCP server for Claude Code uses mem0ai as a library and exposes 11 MCP tools that Claude Code calls directly.

Here's what it looks like in practice:

Session 1 -- debugging a test suite:

```text
> Remember: flaky test failures in CI were caused by state leakage between
  test groups, not async timing. Fixed by resetting database between groups.
  Took 3 hours to isolate. Don't chase the async red herring again.
```

Session 2 -- two weeks later, different project, tests start flaking:

```text
> Search my memories for flaky test debugging
-> "flaky test failures in CI were caused by state leakage between test
   groups, not async timing. Fixed by resetting database between groups."
```

Claude retrieves the debugging insight and skips the three-hour investigation. It starts with the proven fix.

The difference from a flat file: semantic vector search. "Flaky test debugging" matches "state leakage between test groups" even with completely different wording. The server embeds memories using Ollama's bge-m3 model and stores them in Qdrant for approximate nearest neighbor search. Claude finds memories by meaning, not keywords.
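To see how meaning-based matching works mechanically, here's a toy cosine-similarity ranking. The 3-dimensional vectors are hand-picked stand-ins for real bge-m3 embeddings (which have 1024 dimensions), so the numbers are illustrative, not model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hand-picked 3-d stand-ins for real 1024-d bge-m3 embeddings.
memories = {
    "state leakage between test groups": [0.9, 0.1, 0.2],
    "rejected Redis for session storage": [0.1, 0.9, 0.3],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of "flaky test debugging"

best = max(memories, key=lambda text: cosine(query, memories[text]))
print(best)  # state leakage between test groups
```

The query vector shares no words with either memory; it wins on vector proximity alone, which is exactly what Qdrant's nearest-neighbor search does at scale.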

The 11 tools

| Tool | What it does |
| --- | --- |
| `add_memory` | Store text or conversations. The LLM extracts key facts automatically. |
| `search_memories` | Semantic vector search with filters, threshold, and reranking. |
| `get_memories` | Browse and filter stored memories (non-search). |
| `get_memory` | Fetch a single memory by UUID. |
| `update_memory` | Replace memory text. Re-embeds and re-indexes. |
| `delete_memory` | Delete a single memory. |
| `delete_all_memories` | Safe bulk delete (never nukes your collection). |
| `list_entities` | List which users/agents/runs have stored memories. |
| `delete_entities` | Cascade-delete an entity and all its memories. |
| `search_graph` | Search Neo4j entities by substring (optional). |
| `get_entity` | Get all relationships for a specific entity (optional). |

The last two require Neo4j, which is entirely optional. You get full Claude Code persistent memory with the first nine tools and nothing but Qdrant + Ollama running.
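Under the hood, each tool invocation is a JSON-RPC 2.0 `tools/call` message exchanged over stdio. A sketch of what that message looks like for `search_memories` -- the argument names ("query", "limit") are illustrative, not taken from the server's actual schema:

```python
import json

# Sketch of the JSON-RPC 2.0 message an MCP client sends over stdio when it
# invokes a tool. The tool name matches the table above; the argument names
# are illustrative, not the server's actual schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_memories",
        "arguments": {"query": "flaky test debugging", "limit": 5},
    },
}
line = json.dumps(request)  # one JSON object per line over stdio
print(json.loads(line)["params"]["name"])  # search_memories
```

FastMCP handles this protocol layer for you; you never write these messages by hand.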

Check out the full source and documentation at the mem0-mcp-selfhosted GitHub repo.


How the MCP server delivers Claude Code persistent memory

```text
Claude Code <-- stdio --> FastMCP Server
                            |-- auth.py          <- OAT token auto-discovery
                            |-- config.py        <- Env vars -> config
                            |-- helpers.py       <- Error handling, safe bulk-delete
                            |-- graph_tools.py   <- Direct Neo4j Cypher queries
                            '-- server.py        <- 11 MCP tools + prompt
                                  |
                                  |-- mem0ai Memory class
                                  |     |-- Vector: LLM fact extraction -> Ollama embed -> Qdrant
                                  |     '-- Graph: LLM entity extraction -> Neo4j (optional)
                                  |
                                  '-- Infrastructure
                                        |-- Qdrant     <- Vector store
                                        |-- Ollama     <- Embeddings (local)
                                        '-- Neo4j      <- Knowledge graph (optional)
```

The server is 7 modules, each with a specific responsibility. FastMCP handles the MCP protocol layer. The mem0ai library handles memory operations. Everything else is configuration, auth, and safety wrappers. Each Claude Code session connects via stdio, so the memory tools are available the moment you start working.

The vector memory path

When Claude calls add_memory:

  1. The text goes to Anthropic's API for fact extraction (using your Claude subscription)
  2. The extracted facts get embedded locally via Ollama (bge-m3, 1024 dimensions)
  3. The embedding vectors get stored in Qdrant

When Claude calls search_memories, Ollama embeds the query and Qdrant finds the nearest vectors by cosine similarity. The whole pipeline runs in 2-5 seconds.

Zero-config auth with OAT auto-discovery

Most memory MCP servers require separate API key configurations. This one reads your existing OAT (OAuth Access Token) directly from ~/.claude/.credentials.json. No configuration needed, and your persistent memory setup works the moment you connect.

The server uses a 3-tier fallback chain:

  1. MEM0_ANTHROPIC_TOKEN env var (explicit override)
  2. ~/.claude/.credentials.json (auto-discovery, zero config)
  3. ANTHROPIC_API_KEY env var (standard API key)

It detects whether the token is an OAT (sk-ant-oat...) or an API key (sk-ant-api...) and configures the SDK accordingly. OAT tokens use your existing Claude subscription. No separate billing, no additional API key to manage.
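A simplified sketch of that fallback chain in Python. The internal structure of `.credentials.json` shown here (a `claudeAiOauth.accessToken` field) is an assumption for illustration; check your own copy of the file:

```python
import json
import os
from pathlib import Path

def resolve_token(env=None, creds_path=Path.home() / ".claude" / ".credentials.json"):
    """Simplified sketch of the 3-tier credential fallback described above."""
    env = os.environ if env is None else env
    # Tier 1: explicit override
    token = env.get("MEM0_ANTHROPIC_TOKEN")
    # Tier 2: Claude Code's stored OAuth credentials (field names assumed)
    if not token and creds_path.exists():
        data = json.loads(creds_path.read_text())
        token = data.get("claudeAiOauth", {}).get("accessToken")
    # Tier 3: standard API key
    if not token:
        token = env.get("ANTHROPIC_API_KEY")
    if not token:
        return None, None
    # Prefix detection decides how the SDK gets configured
    kind = "oauth" if token.startswith("sk-ant-oat") else "api_key"
    return token, kind

print(resolve_token(env={"MEM0_ANTHROPIC_TOKEN": "sk-ant-oat01-example"}))
# ('sk-ant-oat01-example', 'oauth')
```

The real `auth.py` is more defensive, but the ordering and the prefix check are the essential parts.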


Setting up Claude Code persistent memory in 15 minutes

Prerequisites

Two services running locally:

  • Qdrant -- self-hosted vector database (one Docker command)
  • Ollama -- local embeddings (native install or Docker)

And Claude Code with an active subscription.

Step 1: start the infrastructure

```bash
# Start Qdrant
docker run -d -p 6333:6333 -p 6334:6334 \
  -v qdrant_storage:/qdrant/storage \
  --name qdrant qdrant/qdrant

# Start Ollama (skip if already installed natively)
docker run -d -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama ollama/ollama

# Pull the embedding model
docker exec ollama ollama pull bge-m3
```

If Ollama is already running natively on your machine, skip the Docker container and run ollama pull bge-m3 directly. That's it for infrastructure. Your self-hosted AI memory backend is ready for Claude Code to connect. See the full configuration guide for all available environment variables.
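If you want to sanity-check both services before wiring up Claude Code, here's a small Python probe. It isn't part of the server, just a convenience; any HTTP response at all counts as "up":

```python
import urllib.error
import urllib.request

def service_up(url: str, timeout: float = 2.0) -> bool:
    """True if anything answers HTTP at the given URL (any status counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the server responded, just with an error status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...

# The ports below match the docker commands above.
for name, url in [("qdrant", "http://localhost:6333"), ("ollama", "http://localhost:11434")]:
    print(name, "up" if service_up(url) else "DOWN")
```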

Step 2: add the MCP server to Claude Code

One command, available across all your projects:

```bash
claude mcp add --scope user --transport stdio mem0 \
  --env MEM0_QDRANT_URL=http://localhost:6333 \
  --env MEM0_EMBED_URL=http://localhost:11434 \
  --env MEM0_EMBED_MODEL=bge-m3 \
  --env MEM0_EMBED_DIMS=1024 \
  --env MEM0_USER_ID=your-user-id \
  -- uvx --from git+https://github.com/elvismdev/mem0-mcp-selfhosted.git mem0-mcp-selfhosted
```

uvx downloads, installs, and runs the server in an isolated environment. No manual pip install, no virtual env, no dependency conflicts.

Or add it to a single project with .mcp.json in the project root:

.mcp.json for project-scoped setup:

```json
{
  "mcpServers": {
    "mem0": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/elvismdev/mem0-mcp-selfhosted.git", "mem0-mcp-selfhosted"],
      "env": {
        "MEM0_QDRANT_URL": "http://localhost:6333",
        "MEM0_EMBED_URL": "http://localhost:11434",
        "MEM0_EMBED_MODEL": "bge-m3",
        "MEM0_EMBED_DIMS": "1024",
        "MEM0_USER_ID": "your-user-id"
      }
    }
  }
}
```

Step 3: make it automatic with CLAUDE.md

Add this to ~/.claude/CLAUDE.md (global) so Claude uses memory without you asking:

```markdown
## MCP Servers

- **mem0**: Persistent memory across sessions. At the start of each session,
  `search_memories` for relevant context before asking the user to re-explain
  anything. Use `add_memory` whenever you discover project architecture, coding
  conventions, debugging insights, key decisions, or user preferences. Use
  `update_memory` when prior context changes. When in doubt, save it -- future
  sessions benefit from over-remembering.
```

With this, Claude proactively searches memory at session start and saves things it learns as it goes. You stop re-explaining. Sessions build on each other. Your Claude Code memory across sessions is now fully automatic.

Step 4: try it

Restart Claude Code, then:

```text
> Search my memories for authentication decisions
> Remember that we rejected Redis for caching because connection pooling
  caused issues at our scale. Revisit at 10k concurrent users.
> Show me all entities in my memory
```

That's it. Qdrant stores your vectors, Ollama generates embeddings locally, and Claude Code now has persistent memory across every session and project.


Optional: add a knowledge graph with Neo4j

Vector search handles the core memory use case. If you want structured entity relationships on top, Neo4j adds a second dimension.

When you store "I prefer TypeScript with strict mode," the graph layer extracts entities and relationships:

```text
user -> PREFERS -> TypeScript
user -> PREFERS -> strict_mode
```

You can then ask "what does this user prefer?" and traverse the graph for structured answers rather than relying on text similarity alone.
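Conceptually, the graph layer is a set of (subject, relation, object) triples. A minimal in-memory sketch of that lookup -- the data and the helper function are illustrative, not the server's actual implementation:

```python
# Toy triple store: the kind of data the graph layer extracts from memories.
triples = [
    ("user", "PREFERS", "TypeScript"),
    ("user", "PREFERS", "strict_mode"),
    ("ProjectA", "USES", "TypeScript"),
]

def relationships(entity):
    """All triples where the entity appears as subject or object --
    roughly the shape of answer the get_entity tool gives back."""
    return [t for t in triples if entity in (t[0], t[2])]

print(relationships("user"))
# [('user', 'PREFERS', 'TypeScript'), ('user', 'PREFERS', 'strict_mode')]
```

Neo4j does the same thing with indexed Cypher queries instead of a list scan, which is what makes it viable at thousands of entities.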

Quick setup

```bash
docker run -d -p 7687:7687 -e NEO4J_AUTH=neo4j/mem0graph neo4j:5
```

Add to your MCP config:

```bash
MEM0_ENABLE_GRAPH=true
MEM0_NEO4J_URL=bolt://127.0.0.1:7687
MEM0_NEO4J_PASSWORD=mem0graph
```

The quota cost and how to avoid it

Each add_memory with graph enabled triggers 3 additional LLM calls: entity extraction, relationship generation, and contradiction resolution. That's a real quota cost on your Claude subscription.

To protect your quota, route graph operations to a cheaper model:

| Provider | Cost | Quality | Notes |
| --- | --- | --- | --- |
| Ollama (Qwen3:14b) | Free | 0.971 tool-calling F1 | ~7-8GB VRAM (Q4_K_M) |
| Gemini 2.5 Flash Lite | Near-free | 85.4% entity extraction | Cloud |
| gemini_split | Gemini + Claude | Best combined accuracy | 85.4% extraction + 100% contradiction |

With the Ollama path, the entire graph pipeline runs locally. Zero cloud dependency.

Environment variables for each graph provider

Ollama (free, local):

```bash
MEM0_GRAPH_LLM_PROVIDER=ollama
MEM0_GRAPH_LLM_MODEL=qwen3:14b
```

Gemini (near-free):

```bash
MEM0_GRAPH_LLM_PROVIDER=gemini
GOOGLE_API_KEY=your-google-api-key
```

Split-model (best accuracy):

```bash
MEM0_GRAPH_LLM_PROVIDER=gemini_split
GOOGLE_API_KEY=your-google-api-key
MEM0_GRAPH_CONTRADICTION_LLM_PROVIDER=anthropic
```

Neo4j is entirely optional. You get useful self-hosted AI memory with Qdrant and Ollama alone. See the project README for the complete list of environment variables.


How self-hosted mem0 compares to other Claude Code persistent memory solutions

Developers I talked to on Reddit had an interesting setup: an Obsidian vault connected to Claude via MCP, with all their chat logs and notes organized by project. When they needed context, they told Claude to load a specific project folder. It worked, but every load pulled in full transcripts, and as the vault grew, the context cost grew linearly with it.

One of the developers posted a good question: "Isn't this setup I have the same as what you built?" Not quite. The retrieval model is fundamentally different.

| Approach | Search | Storage | Curation | Cross-project |
| --- | --- | --- | --- | --- |
| CLAUDE.md + Auto Memory | None (loads all) | Markdown files | Mixed (manual + auto) | Per-project (global option) |
| mem0-mcp-selfhosted | Semantic vector | Qdrant vectors | Automatic | Global |
| Graphiti (Zep) | Hybrid graph + vector | Graph DB (required) | Automatic | Depends |
| Obsidian + MCP | Keyword or semantic | Vault files | Manual | Per-vault |

When each approach fits

CLAUDE.md + Auto Memory is perfect for small projects with manageable context. Zero setup, immediate value, and Auto Memory adds automatic note-taking on top. I let Claude Code do its thing and use both alongside mem0, and they complement each other well.

The CLAUDE.md tells Claude Code how to use memory tools. mem0 handles the semantic storage and retrieval.

mem0-mcp-selfhosted makes sense when you need LLM long-term memory that works across multiple projects and accumulates knowledge over weeks, or when your preferences have outgrown what a flat file handles gracefully. Semantic search is the differentiator at scale.

Graphiti is worth evaluating if structured temporal relationships are your primary need. It's graph-first, meaning a graph database is required, not optional. Neo4j is the primary backend, with FalkorDB, Kuzu, and Amazon Neptune also supported. It offers bi-temporal tracking that mem0 doesn't, recording both when a fact became true and when the system learned it. The infrastructure is heavier, and depending on your LLM provider you may need separate API keys for LLM and embedding operations.

Obsidian + MCP works well if you're already an Obsidian power user who wants visual browsing and manual editing of notes. Basic implementations use keyword search over vault files, though some servers like obsidian-mcp-tools add semantic search via the Smart Connections plugin. All implementations store full documents rather than distilled facts, so context costs scale with vault size.


Get started and let me know how it goes

Here's what we covered:

  • Claude Code's built-in memory captures rules and summaries, but not detailed reasoning chains. Claude Code persistent memory with semantic search requires an external tool.
  • mem0-mcp-selfhosted gives Claude Code 11 memory tools backed by self-hosted Qdrant + Ollama.
  • Semantic vector search finds memories by meaning, not keywords.
  • The CLAUDE.md integration makes memory usage automatic. No manual triggering needed.
  • Neo4j adds structured entity relationships, but it's entirely optional.
  • Zero-config auth reads your existing OAT token. No API key setup.

The setup takes about 15 minutes: two Docker containers, one claude mcp add command, and a CLAUDE.md snippet. After that, Claude Code persistent memory builds up knowledge over time across all your projects.

```bash
claude mcp add --scope user --transport stdio mem0 \
  --env MEM0_QDRANT_URL=http://localhost:6333 \
  --env MEM0_EMBED_URL=http://localhost:11434 \
  --env MEM0_EMBED_MODEL=bge-m3 \
  --env MEM0_EMBED_DIMS=1024 \
  --env MEM0_USER_ID=your-user-id \
  -- uvx --from git+https://github.com/elvismdev/mem0-mcp-selfhosted.git mem0-mcp-selfhosted
```

Full source code, documentation, and the issue tracker for mem0-mcp-selfhosted are on GitHub. If you're interested in more Claude Code tooling, check out my WordPress performance review skill.

I'd love to know:

  • Does Claude use memory proactively with the CLAUDE.md setup in your experience?
  • What would you want Claude to remember that it currently forgets?
  • How's the setup experience? Too many pieces or manageable?

Install it, search for something, and open an issue or drop a comment if the results surprise you.

elvismdev / mem0-mcp-selfhosted

Self-hosted mem0 MCP server for Claude Code. Run a complete memory server against self-hosted Qdrant + Neo4j + Ollama while using Claude as the main LLM. Uses the mem0ai package directly as a library, authenticates through your existing Claude subscription (OAT token), and exposes 11 MCP tools for full memory management.

Prerequisites

You need these services running:

| Service | Required | Purpose |
| --- | --- | --- |
| Qdrant | Yes | Vector memory storage and search |
| Ollama | Yes | Embedding generation (bge-m3 or similar) |
| Neo4j 5+ | Optional | Knowledge graph (entity relationships) |
| Anthropic API | Yes | LLM for fact extraction, entity extraction, memory updates (auto-authenticates via Claude Code's OAT token; no paid API key required) |
| Google API | Optional | Graph LLM for entity extraction (gemini/gemini_split providers) |

Python >= 3.10.
