More than five hundred markdown files.
That's what one of my projects has, and it's not even the largest (that clocks in at almost 2,500!). ROADMAP.md, ARCHITECTURE.md, CLAUDE.md, CHANGELOG.md, task folders with notes and lessons learned, editorial notes, half-complete drafts, memory files from past sessions. Each one holds a piece of the project's history — a decision, a rationale, a thing that broke and how it got fixed.
Claude Code can't see any of it unless I point it at the right file — or it reads them on its own, burning tokens on retrieval before the real work starts.
Claude Code isn't completely amnesiac — it has session memory, it reads CLAUDE.md, and with the right governance documents it can recover a lot of context at session start. For smaller projects, that's enough. But once you're past a few dozen files of accumulated institutional knowledge, the gap between "what the agent can reasonably read at startup" and "what the project actually knows" grows wider every week.
So I built pmem — a local RAG that gives Claude Code semantic search over your project's full history. No external APIs. No data leaves your machine. Setup in two minutes.
## The numbers
I ran the same query — "identify governance-related blog posts" — both ways on a project with 500+ markdown files:
| | pmem (index-based) | Fresh search (Explore agent) |
|---|---|---|
| Results | 18 posts | 11 posts |
| Time | ~20 seconds | ~90 seconds |
| Token cost | ~5,500 | ~20,000–24,000 |
The fresh search cost roughly 4× the tokens (cries in tokens) and found 7 fewer posts. The posts it missed were the ones where governance was a supporting theme rather than the headline — exactly the kind of semantic connection that keyword search can't make.
The agent's overhead — its own system prompt, tools, multi-step reasoning — is the hidden cost. It's worth it for open-ended exploration, but for a targeted retrieval question, the index was both cheaper and more thorough.
## The prompt that built it
Before I show the architecture, I want to show two prompts — because the contrast illustrates something about working with AI agents that I think a lot of people miss.
The vague prompt:
> "I want to give agents better memory."
This goes nowhere useful. No constraints, no architecture, no scope. The agent could build anything from a flat JSON file to a Kubernetes-deployed vector database with a React frontend. It would probably pick something in the middle and spend four hours building infrastructure you didn't need.
The prompt I actually used (simplified for readability):
> I need to enhance the memory capabilities of Claude Code. Since I use Claude Code for more than just writing code — managing tasks, building documentation, maintaining infrastructure — I can generate thousands of files and folders. While they do get archived regularly, digging through them is a token and time sink, and can sometimes prove inaccurate, especially with larger projects.
>
> We will use Ollama embeddings and build a RAG that the agent can use to query the entire project's files.
>
> The tool must also be able to connect to a local LLM (optional) in order to further reduce token usage when parsing results.
>
> For now, we are going to be focused on TXT and MD files, and will expand as needed.
The difference isn't length. It's that the second prompt contains a discovery phase. It names the problem, specifies the technology, defines the integration point, sets constraints, and draws an explicit scope boundary. The agent doesn't need a better prompt template. It needs you to finish thinking before you start asking.
## What pmem does
The flow is simple: Claude asks a question, pmem finds the answer in your project's files, and returns it with source citations.
Under the surface:
**Indexing.** `pmem index` walks your project's markdown and text files, splits them into semantic chunks using header-aware parsing (a section stays with its heading), and embeds each chunk locally using `nomic-embed-text` via Ollama. Chunks are stored in ChromaDB, a file-based vector database that requires no server process. Indexing is incremental — SHA-256 hashes track which files changed.

**Querying.** Claude calls the `memory_query` MCP tool with a natural-language question. pmem embeds the question, searches the vector store for semantically similar chunks using cosine similarity (ChromaDB's default), and returns results with source paths and relevance scores. Optionally, a local LLM synthesizes the chunks into a concise answer before returning it.

**Session rituals.** Three slash commands turn memory into a workflow: `/welcome` refreshes the index at session start, `/sleep` captures changes at session end, and `/reindex` refreshes mid-session. The index stays current because maintaining it is a side effect of the session workflow, not a separate chore.
No data leaves your machine. No API keys required for core functionality. The entire system runs on Ollama, ChromaDB, and Python.
## Architecture decisions
**No LangChain.** Not out of ideology — out of simplicity. pmem is around 2,000 lines of Python. The RAG pipeline is: embed → store → search → (optionally) synthesize. Four operations don't need a framework.
**ChromaDB over everything else.** File-based, no server process, persistent. I considered LanceDB but never formally evaluated it — ChromaDB was already working and the evaluation wasn't worth the detour. I also considered plain JSON with numpy cosine similarity, which works for small projects but doesn't scale — brute-force linear scan is O(n) per query. ChromaDB hit the sweet spot: real vector search without operational overhead.
**Header-aware chunking.** Most RAG tutorials split text by character count. That destroys semantic units. A section titled "Why we chose CloudFront over Fastly" that gets split between two chunks loses meaning in both. pmem uses markdown headers as natural split points, with a size-based fallback for sections that are too long. The heading becomes metadata on each chunk, so search results carry their context.
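That split logic can be sketched in a few lines. This is an illustrative reimplementation, not pmem's actual chunker, and the `max_chars` fallback value is invented:

```python
def chunk_markdown(text: str, max_chars: int = 1200) -> list[dict]:
    """Split markdown at headings so each chunk keeps its section heading
    as metadata; fall back to size-based splits for oversized sections."""
    chunks, heading, buf = [], "", []

    def flush():
        body = "\n".join(buf).strip()
        if not body:
            return
        # Size fallback: break an oversized section into max_chars pieces.
        for i in range(0, len(body), max_chars):
            chunks.append({"heading": heading, "text": body[i:i + max_chars]})

    for line in text.splitlines():
        if line.startswith("#"):       # heading = new section boundary
            flush()
            heading, buf = line.lstrip("# ").strip(), []
        else:
            buf.append(line)
    flush()
    return chunks

doc = "# Why CloudFront\nEdge caching won.\n\n## Costs\nCheaper at our scale."
for c in chunk_markdown(doc):
    print(c["heading"], "->", c["text"])
```

The point of keeping the heading on the chunk is that a result like `Costs -> Cheaper at our scale.` stays interpretable on its own, which is exactly what character-count splitting throws away.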
**CWD walk-up for project detection.** Same pattern git uses — walk up until you find a `.memory` directory. `pmem init` creates it, and from that point forward, any subdirectory just works.
## Setup
Prerequisites: Python 3.11+, Ollama running locally, and the `nomic-embed-text` model pulled.
```shell
pip install pmem-project-memory
ollama pull nomic-embed-text
```
Initialize any project:
```shell
cd ~/your-project
pmem init
pmem index
```
Install the session skills:
```shell
pmem install-skills
```
Register the MCP server in `~/.claude.json` (global) or `.mcp.json` (per-project). The README has the exact config block.
First index takes a few seconds for small projects, up to a minute for large ones. After that, incremental indexing only re-embeds changed files — typically under a second.
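The change detection behind that is plain content hashing. A sketch, assuming an in-memory manifest mapping path to SHA-256 (pmem's on-disk format may differ):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def files_to_reindex(paths: list[Path], manifest: dict[str, str]) -> list[Path]:
    """Return only the files whose content hash differs from the last run,
    updating the manifest as a side effect."""
    changed = []
    for path in paths:
        digest = file_hash(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest  # remember for the next run
    return changed
```

Embedding is the expensive step, so skipping every unchanged file is what turns a minute-long first index into sub-second refreshes.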
## What's next
Phase 2 is mostly complete: `pmem watch` for auto-reindexing, global config defaults, one-command skill installation, better error messages. Phase 3 is where it gets interesting — multi-collection support, non-markdown file support with language-aware chunking, optional image processing, and `pmem diff` to show how answers change over time.
The tool is open source, MIT licensed. It exists because I needed it, and I suspect anyone running Claude Code on a project with more than a few dozen files needs it too.
Sources: ChromaDB — Distance Functions · ANN Benchmarks (Aumüller, Bernhardsson & Faithfull)