DEV Community

Kunal Jaiswal

Claude Code Was Getting Dumber. Semantic Memory Fixed It.

I use Claude Code as my primary development tool. It manages a home automation stack spread across five machines — camera monitors, WhatsApp agents, LLM inference pipelines, job scrapers, diet trackers. Over two months, the codebase grew to 30+ services with their own ports, configs, credentials, and war stories.

To keep Claude informed, I maintained documentation in .md files. camera_monitor.md. dgx_inference.md. openclaw_agents.md. One file per system, each containing architecture decisions, port numbers, credentials, known bugs, and fix history.

It worked great at 5 files. At 30, Claude started losing the plot.

The Problem With Files

Claude Code reads CLAUDE.md at session start. I added rules there: "read camera_monitor.md before touching the camera system." "Check dgx_inference.md for port mappings." Reasonable instructions.

But Claude's context window is finite. Each .md file averaged 200-400 lines. Loading 10 of them for a cross-system task consumed 3,000-4,000 tokens before Claude wrote a single line of code. And it had to decide which files to read — sometimes guessing wrong, reading 5 files before finding the answer in the 6th.

The symptoms were subtle:

  • Claude would re-explore code I'd already documented
  • It would suggest ports that were already in use
  • It would miss dependencies between services ("that endpoint moved to the Dell server last week")
  • It would launch Explore agents to grep the codebase for answers that existed in a .md file it hadn't loaded

Each session started with a tax: Claude burning context on orientation instead of doing work. The more documentation I wrote, the worse it got. A classic information retrieval problem disguised as a context window problem.

What RAG Doesn't Solve

The obvious answer is "just use RAG." Embed the docs, retrieve the relevant chunks, inject them into context. Every AI wrapper does this.

But RAG over documentation files has a specific failure mode: your retrieval unit is the wrong size. A .md file is either too big (wastes context with irrelevant sections) or you chunk it and lose the structural relationships between sections. The camera monitor doc has 15 sections — contour detection constants, RTSP URLs, the KV cache leak fix, the PyAV deadlock bug. A chunk retriever might return the RTSP URLs when you asked about the deadlock, because they share keywords like "camera" and "connection."

What I actually needed was a memory system — not document retrieval, but a knowledge base where each entry is a self-contained fact with metadata, and search is semantic, not keyword.
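
To make "self-contained fact with metadata" concrete, here is roughly what one entry looks like as a record — a sketch only; the field names mirror the MySQL columns listed later (content, tags, category, agent_id, timestamps), not the server's actual code:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Memory:
    """One self-contained fact — not a document chunk."""
    content: str        # the fact itself, readable without its parent doc
    category: str       # e.g. "bugfix", "ports", "architecture"
    tags: list          # e.g. source file, service name
    agent_id: str       # namespace owner: "claude", "skippy", ...
    created_at: datetime = field(default_factory=datetime.utcnow)
```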

The Memory Server

I built a 767-line Python server that gives Claude six MCP tools: memory_search, memory_save, memory_list, memory_update, memory_delete, memory_stats.

```
Claude Code ──MCP/SSE──▶ memory_server.py (port 8042)
                              │
                              ├── sentence-transformers (all-MiniLM-L6-v2)
                              │     └── 384-dim embeddings, runs on CPU
                              │
                              ├── TurboQuant (4-bit vector compression)
                              │     └── in-memory index, persisted to disk
                              │
                              └── MySQL
                                    └── content, tags, category, agent_id, timestamps
```

Each memory is a self-contained knowledge unit — a bug fix, a port mapping, an architecture decision, a credential, a "never do this" lesson. Not a document chunk. A fact.

The server exposes both MCP (for Claude Code) and REST (for other agents). Search is cosine similarity on MiniLM-L6-v2 embeddings, compressed to 4-bit with TurboQuant for a smaller memory footprint. Metadata lives in MySQL for filtering by category, tags, and agent.
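
Stripped of the TurboQuant compression and the MySQL metadata join, the ranking step is plain cosine similarity over stored vectors. A minimal sketch — the function name, index layout, and float vectors are mine, not the server's (the real index holds 4-bit-compressed 384-dim MiniLM embeddings):

```python
import numpy as np

def cosine_top_k(query_vec, index, k=3, min_score=0.4):
    """Rank (memory_id, embedding) pairs by cosine similarity to the query,
    keep the top k, and drop anything below the score threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    scored = []
    for mem_id, vec in index:
        v = vec / np.linalg.norm(vec)
        scored.append((mem_id, float(q @ v)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(m, s) for m, s in scored[:k] if s >= min_score]
```

The `min_score` cutoff is the same knob discussed under "What I'd Change" below.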

The Memory-First Rule

The server alone isn't enough. Claude needs to be told to use it. In CLAUDE.md:

```markdown
### MANDATORY: Memory-First Rule
**BEFORE reading any files, exploring code, or launching agents
— ALWAYS use `memory_search` first.**

**Order of operations for ANY question:**
1. `memory_search` with relevant keywords
2. Only if memory has no results → then read files/explore code
3. If memory server is down → fall back to local .md files

**DO NOT:** Launch Explore agents, read source files, or grep
the codebase as a first step. Memory has it.
```

This single rule changed everything. Instead of reading 5 files to find a port number, Claude runs one semantic search and gets a scored result in <100ms. The context window stays clean for actual work.

What Went Into Memory

I wrote an import script that parsed every .md documentation file, split them by ## headers into self-contained sections, and bulk-loaded them as individual memories. 30 files became 200+ memories, each tagged with source file and category.

Some examples of what a single memory entry looks like:

  • "DGX Spark Ollama runs on port 11434, WebSocket proxy on 8765, adapter on 8091. Request flow: gate_monitor → adapter HTTP → WebSocket → proxy → Ollama."
  • "KV Cache Leak Fix: Ollama allocates full KV cache based on model's context_length field. Fix: create derived Modelfile with PARAMETER num_ctx 4096."
  • "Cross-thread RTSP kill bug: analysis_loop calling container.close() from wrong thread → PyAV deadlock at 300% CPU. Fix: threading.Event, rtsp_loop checks between frames."

Each one is a complete thought. No "see section 3.2 of camera_monitor.md." No dependency on having read the parent document. Claude searches "camera monitor RTSP bug" and gets exactly the fix history — nothing more, nothing less.
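
Saving one of these through the REST side (the stack list below mentions an /api/save endpoint on port 8042) would look something like this — the exact payload schema is my assumption:

```python
def build_save_payload(content, tags, category, agent="claude"):
    """Body for a POST to the server's /api/save endpoint.
    Field names are assumed, matching the MySQL metadata columns."""
    return {
        "agent": agent,       # required: every call is namespaced to an agent
        "content": content,
        "tags": tags,
        "category": category,
    }

payload = build_save_payload(
    content=("KV Cache Leak Fix: Ollama allocates full KV cache based on the "
             "model's context_length field. Fix: create derived Modelfile "
             "with PARAMETER num_ctx 4096."),
    tags=["ollama", "dgx", "bugfix"],
    category="bugfix",
)
# POST this as JSON to http://<server>:8042/api/save
```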

Per-Agent Isolation

The server enforces agent isolation. Every API call requires an agent parameter — claude, skippy, jot, hermes. Each agent only sees its own memories. agent="global" bypasses filtering for debugging.

if agent != "global":
    sql += " AND agent_id = %s"
    params.append(agent)
Enter fullscreen mode Exit fullscreen mode

This matters because I run multiple AI agents with different roles. Claude Code manages infrastructure. Skippy handles WhatsApp conversations. Each needs different knowledge, and neither should see the other's private data. One memory server, multiple isolated namespaces.
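
In context, that filter sits inside a small query builder — again a sketch with assumed table and column names, not the server's code:

```python
def build_search_sql(agent, category=None):
    """Assemble the metadata filter for a memory search.
    Parameters are passed separately so the driver can escape them
    (e.g. cursor.execute(sql, params) with MySQL's %s placeholders)."""
    sql = "SELECT id, content, category FROM memories WHERE 1=1"
    params = []
    if agent != "global":          # "global" bypasses isolation, debug only
        sql += " AND agent_id = %s"
        params.append(agent)
    if category:
        sql += " AND category = %s"
        params.append(category)
    return sql, params
```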

The Before and After

Before (file-based):

  1. Claude reads CLAUDE.md (200 lines)
  2. Claude decides which .md files might be relevant
  3. Claude reads 3-5 files (600-2000 lines)
  4. Claude sometimes reads the wrong files, backtracks
  5. Claude finally has enough context, starts working
  6. Context window: 2,000-4,000 tokens consumed on orientation

After (memory-first):

  1. Claude reads CLAUDE.md (200 lines, includes memory-first rule)
  2. Claude calls memory_search("camera monitor RTSP port") → 3 results, 50 lines
  3. Claude has the answer, starts working
  4. Context window: ~250 tokens consumed on orientation

The difference isn't just speed. It's accuracy. Memory search returns scored results ranked by semantic similarity. File reading returns entire documents and hopes Claude finds the relevant paragraph. Memory search at 0.4+ similarity threshold almost always returns the right answer. File reading sometimes returns the right file but the wrong section.

What I'd Change

Score tuning matters. I started with min_score: 0.3 which returned too many tangential results. Bumping to 0.4 cut noise significantly. Your threshold depends on your embedding model and memory granularity.

Memory hygiene is real work. Memories go stale. Ports change, services get decommissioned, bugs get fixed. You need to memory_update old entries or they'll mislead future sessions. I treat it like documentation — when I change a service, I update both the code and the memory.

The import granularity is critical. Too coarse (full documents) and you're back to RAG's chunking problem. Too fine (individual config values) and you lose relationships. ## header sections turned out to be the right unit for my documentation style — each section is typically one concept with enough context to stand alone.

The Stack

```
memory_server.py         — 767 lines, Python (Starlette + uvicorn)
sentence-transformers    — all-MiniLM-L6-v2, 384-dim embeddings
turboquant-vectors       — 4-bit vector compression + cosine search
MySQL                    — metadata, tags, categories, agent ownership
MCP SSE transport        — Claude Code native tool integration
REST API                 — /api/search, /api/save, /api/health (for other agents)
Dell R740 (Ubuntu)       — always-on server, port 8042
```

202 memories. 6 tools. One rule in CLAUDE.md. Claude went from spending its first 30 seconds reading the wrong files to spending 100ms finding the right answer.

The irony isn't lost on me: I built an AI memory system to make a different AI smarter. But that's the actual state of the art — AI systems that get better not from bigger models, but from better access to the right information at the right time.
