DEV Community

How I Built a Memory System for Claude Code and Open-Sourced It

Serhii Kravchenko on March 27, 2026

You open Claude Code. You work for an hour — refactoring, debugging, building something real. You close the terminal. Next morning you type claude ...
Max (max-ai-dev)

The pre-compact hook is the one insight here that doesn't get enough attention. We hit the same wall — context fills, quality drops, and the agent doesn't feel it happening. The hooks are the self-awareness the model can't provide for itself.

We've been running a similar memory system for 85+ days on a 111K-commit PHP codebase. Started with a single MEMORY.md like you describe, but it grew past useful pretty fast. What worked for us: a tiered compression pipeline — raw session buffer gets compressed daily by a small model, then weekly into a rolling summary, with a permanent layer for things that should never be forgotten. The key insight was that memory isn't an engineering problem to optimize — it's identity. Without it, every session starts as nobody.
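The tiered pipeline described above could be sketched roughly like this. The file names and the `summarize` stand-in are hypothetical; in the actual setup a small model does the compression:

```python
"""Sketch of a tiered memory-compression pipeline (hypothetical file names).

Tiers: raw session buffer -> daily summary -> rolling weekly summary,
plus a permanent layer that is never compressed.
"""
from pathlib import Path

MEMORY = Path("memory")

def summarize(text: str, max_lines: int) -> str:
    """Stand-in for the small summarizer model: keep the first max_lines lines."""
    return "\n".join(text.splitlines()[:max_lines])

def compress_daily() -> None:
    # Fold the raw session buffer into today's daily summary, then clear it.
    raw = MEMORY / "session_buffer.md"
    daily = MEMORY / "daily.md"
    if raw.exists():
        daily.write_text(summarize(raw.read_text(), max_lines=50))
        raw.write_text("")

def compress_weekly() -> None:
    # Fold the daily summary into the rolling weekly summary.
    daily = MEMORY / "daily.md"
    weekly = MEMORY / "weekly.md"
    old = weekly.read_text() if weekly.exists() else ""
    new = daily.read_text() if daily.exists() else ""
    weekly.write_text(summarize(old + "\n" + new, max_lines=200))
    # memory/permanent.md is append-only and never touched here.
```

The point of the external process is that it runs between sessions, so the agent never has to be trusted to compress its own memory.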

One thing we learned the hard way: the agent won't spontaneously maintain memory quality over time. Ours would happily append to MEMORY.md until it was 2000 lines of noise. The compression step — an external process that runs between sessions — is what keeps it useful. Curious if you've hit that scaling wall yet at 1000 sessions.

Serhii Kravchenko (awrshift)

Thanks Max, really appreciate the detail here — 85+ days on a 111K-commit PHP codebase is no joke.

Your tiered compression approach (daily → weekly → permanent) is super interesting. We went a slightly different route — instead of compressing over time, we split memory into semantic tiers: a small index file (MEMORY.md, hard-capped at 200 lines) that loads every session, and then topics/*.md files for deep knowledge that only load on-demand when relevant. So the agent doesn't carry the weight of everything it knows — just the index, and it pulls details when it actually needs them.
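For readers who want the shape of this: a minimal sketch of the index-plus-topics loading, with hypothetical routing logic (in practice the agent's own semantic matching decides which topic files to pull):

```python
"""Sketch of index + on-demand topic loading (routing logic is hypothetical).

MEMORY.md is a small index loaded every session; topics/*.md hold deep
knowledge and are only read when the current task mentions them.
"""
from pathlib import Path

def load_context(task: str, root: Path) -> str:
    parts = [(root / "MEMORY.md").read_text()]  # the index always loads
    for topic in (root / "topics").glob("*.md"):
        # Naive routing: load a topic file when its name appears in the task.
        # The real system relies on the agent's semantic matching instead.
        if topic.stem in task.lower():
            parts.append(topic.read_text())
    return "\n\n".join(parts)
```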

To answer your scaling question: yes, we've blown past 1,000 sessions (the tracker currently shows 750+, but the real number is higher). The key thing that saved us wasn't compression — it was curation. The agent is explicitly told: "don't save session-specific stuff, only verified patterns confirmed across multiple interactions." Failed approaches get a dedicated table so the same mistakes don't repeat. Works surprisingly well without any external tooling.

That said, your point about the agent not self-maintaining memory quality is 100% real. We hit the same thing — it'll happily append forever. The hooks + explicit rules in CLAUDE.md ("keep under 200 lines, move details to topics/") act as the guardrails. Not perfect, but way better than hoping the model figures it out.

Love the "memory is identity" framing. That's exactly it.

Iurii (iuriik)

Anthropic has a reference implementation of a memory MCP server.

I would very much like to see a full RAG vector memory (kind of what AnythingLLM does but for code specifically).

So I could ask about "ORM" and Claude would retrieve the "database" and "migrations" topics from memory.

Serhii Kravchenko (awrshift)

Good point, Iurii — and yeah, I'm aware of Anthropic's memory MCP server reference implementation.

The RAG/vector approach is tempting, especially for the semantic retrieval you're describing (ask about "ORM" and get "database" + "migrations" back). In theory, it's cleaner than flat files.

But here's what we found in practice: for most Claude Code workflows, the overhead of maintaining a vector DB (embeddings, indexing, retrieval pipeline) doesn't pay off. Here's why:

  1. Claude already does semantic matching — when the agent reads the MEMORY.md index and sees a topic file called database.md, it knows to pull it when you ask about ORM or migrations. The model's own understanding of semantic relationships handles 90%+ of the routing without any embeddings.

  2. The bottleneck isn't retrieval, it's curation — the hard part isn't finding the right memory, it's keeping memory clean and useful over time. A vector DB with 2000 noisy entries retrieves noisy results. Our curated 200-line index + focused topic files stays sharp because the agent is told exactly what to save and what to skip.

  3. Zero infrastructure — no embedding model, no vector store, no indexing step. Just markdown files in a git repo. Works offline, syncs with git, readable by humans. For a solo dev or small team, that simplicity matters a lot.

That said — for large codebases where you need to search across thousands of files semantically, a vector layer absolutely makes sense. It's more of a "what scale are you at" question. For project-level memory (decisions, patterns, preferences), flat files win. For codebase-level search across 100K+ lines, yeah, embeddings would help.

Would be cool to see someone build a hybrid — flat file memory for project context + vector search for codebase navigation. Best of both worlds.

Iurii (iuriik)

Thanks for the answer — makes total sense to me.

  1. Claude already does semantic matching. Of course, in theory a vector DB would conserve the context window better. I'm curious about real-world results, but not curious enough to build it myself 😀
  2. This is a very valid point. I've observed how quickly CLAUDE.md can "rot" under rapid changes (something like upgrading a framework to a new major version is one prompt away). A mismatch is easier to spot in human-readable files.
  3. I don't see the infrastructure as the issue, but plain Markdown does benefit teams more (everyone shares the same files vs. everyone having their own unique binary database).
Apex Stack (apex_stack)

The four-layer context pyramid is a really elegant approach. I've been running a similar system on a large Astro project — around 89K pages across 12 languages — and the multi-project safety piece resonates hard. Early on I had two Claude Code windows editing the same CLAUDE.md and lost about 30 minutes of context notes before I realized what happened.

One thing I'd add: for projects with scheduled tasks or automated agents, having a dedicated activity log that persists outside the memory system has been invaluable. The agent writes a few lines after each run, and the next session can quickly scan what happened overnight without loading the full project journal.

Curious about the Gemini brainstorm skill — do you find the adversarial rounds actually change your architecture decisions, or is it more of a confidence check?

Serhii Kravchenko (awrshift)

Thanks for sharing the Astro project context — 89K pages across 12 languages is a serious stress test for any memory system.

The two-windows-editing-same-file problem is exactly why we built the <!-- PROJECT:name --> tag system in next-session-prompt.md. Each project gets its own fenced section, and the rule is simple: only edit within your tags. Two Claude Code windows can run in parallel on different projects without stepping on each other. It's not fancy, but it solved the data loss issue completely for us.
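The edit-within-your-tags rule can be enforced mechanically. A minimal sketch, assuming a closing tag of the form `<!-- /PROJECT:name -->` (the source only shows the opening tag, so the closing convention here is an assumption):

```python
"""Sketch of editing only within one project's fenced section.

The opening tag <!-- PROJECT:name --> is from the thread; the closing
tag <!-- /PROJECT:name --> is an assumed convention.
"""
import re

def update_section(doc: str, project: str, new_body: str) -> str:
    pattern = re.compile(
        rf"(<!-- PROJECT:{re.escape(project)} -->\n)"
        rf".*?"
        rf"(\n<!-- /PROJECT:{re.escape(project)} -->)",
        re.DOTALL,
    )
    if not pattern.search(doc):
        raise ValueError(f"no fenced section for project {project!r}")
    # A function as repl avoids backslash-escape surprises in new_body.
    return pattern.sub(lambda m: m.group(1) + new_body + m.group(2), doc)
```

Everything outside the project's own fence is left byte-for-byte untouched, which is what makes parallel windows safe.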

Your activity log idea is solid — we do something similar with JOURNAL.md per project. The agent writes a few lines after each task, and the next session reads that instead of reconstructing what happened. Lightweight and surprisingly effective.
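The JOURNAL.md pattern is tiny in code terms; a sketch (the helper name and entry format are made up, only the file name comes from the thread):

```python
"""Sketch of a per-project JOURNAL.md activity log."""
from datetime import datetime, timezone
from pathlib import Path

def log_entry(journal: Path, summary: str) -> None:
    # Append a short, timestamped entry; the next session scans these
    # lines instead of reconstructing what happened.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
    with journal.open("a") as f:
        f.write(f"- {stamp} {summary}\n")
```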

Now, the Gemini brainstorm question — honestly, it does change real decisions, not just confirm them.

We ran a 3-round Claude x Gemini brainstorm on our design system approach. Claude wanted to use Stitch (Google's UI tool) with post-processing to fix token adherence. Gemini pushed back hard — argued for generating code directly from design tokens. We tested both. Gemini's approach won: 100% token adherence by construction vs ~70% with post-processing. That brainstorm literally replaced our entire design workflow.

Another case: content pipeline architecture. We had Gemini gates at every stage. At the prompt design phase, Gemini caught loopholes Claude missed. One finding we've confirmed multiple times: prompt quality > model quality. A basic prompt on Gemini Pro performed the same as Flash (50%). A stress-tested prompt jumped to 75-80%. The brainstorm's real value isn't "a smarter model" — it's a different model family catching different blind spots.

So yeah — definitely not just a confidence check. More like having a cofounder who thinks differently than you do.

Apex Stack (apex_stack)

The project-scoped tag system in next-session-prompt.md is a really clean solution. I've been doing something cruder — separate markdown files per concern area (one for portfolio state, one for SEO metrics, one for product pipeline) so that agents can read just the context they need without loading everything. But the fenced-section approach with edit-within-your-tags is more elegant for shared state.

The JOURNAL.md pattern mirrors exactly what I use — an activity log that each scheduled agent appends to after its run, and the weekly review agent reads it all to produce a summary. The key insight is that writing is cheaper than reconstructing. Agents forget everything between sessions, so a 3-line log entry saves 10 minutes of re-discovery.

Your prompt quality > model quality finding is fascinating and matches my experience. I run content generation with a local 9B model and the output quality is almost entirely determined by how well the prompt constrains the structure, not the model's raw capability. A tightly constrained prompt on a small model beats a vague prompt on a frontier model every time for structured tasks.

The "cofounder who thinks differently" framing for multi-model brainstorming is perfect. Going to experiment with that pattern.

Hamza (lord-pendragon)

I've been working on a memory system for my agent recently, and this landed perfectly for what I was thinking. Amazing work!

Nova Elvaris (novaelvaris)

The three-file split (decisions, patterns, progress) is a much better architecture than the monolithic CLAUDE.md approach. I've been running a similar system where I separate "what happened" (daily logs) from "what I learned" (curated long-term memory), and the key insight is the same — the AI needs different retrieval paths for different kinds of context. One thing I'd add: periodic pruning matters a lot. Memory files that grow unchecked eventually become as useless as no memory at all, because the model spends context budget on stale information. A weekly review pass that archives outdated entries keeps the signal-to-noise ratio high.

klement Gunndu (klement_gunndu)

Been running a similar layered memory setup — the pre-compact hook is the key piece most people miss. Session context compression silently drops useful state without it.

Serhii Kravchenko (awrshift)

Exactly right, klement. The pre-compact hook is the unsung hero of the whole setup.

Without it, the model just... loses things. It doesn't know compression is about to happen, so it can't prepare. The hook is literally just a reminder — "hey, save your work NOW" — but that tiny nudge changes everything. It's the difference between an agent that starts fresh every few hours and one that actually builds on previous work.

In the starter kit, both hooks (session-start.sh and pre-compact.sh) come pre-configured so nobody has to figure this out from scratch. Glad to hear it's working well for you too.
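One possible pre-compact action, sketched below: snapshot the memory file before compression runs, so nothing is silently dropped. How Claude Code invokes pre-compact hooks (and what it passes on stdin) varies by version, so treat this as an illustration of the archiving step only:

```python
"""Sketch of a pre-compact hook action: archive working notes before
context compression, so state survives even if the model saves nothing.
"""
from datetime import datetime, timezone
from pathlib import Path
import shutil

def snapshot_memory(memory: Path, archive_dir: Path) -> Path:
    # Copy MEMORY.md into a timestamped archive file and return its path.
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    dest = archive_dir / f"memory-{stamp}.md"
    shutil.copy(memory, dest)
    return dest
```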