DEV Community

Adding Persistent Memory to Claude Code with claude-mem — Plus a DIY Lightweight Alternative

kanta13jp1 on April 13, 2026

The Problem: Claude Code Forgets Everything

Every time you start a new Claude Code session, the slate is wiped clean. Your coding style ...
Syed Ahmer Shah

The point about auditability is huge. I recently built Commerza (a framework-less ecommerce engine), and I've realized that 'black box' state is a massive risk.

I recently had an AI-assisted refactor go south, and because I didn't have a clear, auditable trail of the logic changes, I had to manually rebuild 40% of the backend from scratch. I've since moved toward a more transparent, 'DIY' structure with clear documentation. Being able to 'git blame' your agent's decisions is a safety feature you don't appreciate until a production-level script gets wiped.

kanta13jp1

That really resonates. I think “git blame for memory” is one of the clearest ways to explain why transparency matters here.

When an AI-assisted refactor goes wrong, the problem usually isn’t just the final diff — it’s the hidden context that made the wrong change look reasonable at the time. Once memory becomes something you can review, diff, blame, and revert, recovery gets a lot less painful.

Sorry you had to learn that through rebuilding such a big chunk of the backend, but that’s exactly why I still think transparent, auditable memory is a safety feature — not just a convenience.

Archit Mittal

The 3-layer memory architecture is a smart approach. I've been running a similar pattern with my automation clients — using CLAUDE.md for project-level rules, but the gap has always been that "what happened in the last 2 hours" context. Your DIY hooks approach for L2 is elegant because markdown files are auditable and diffable, unlike opaque database entries.

One edge case worth flagging: if you're running parallel Claude Code instances (as you mention with 3 instances), the PostToolUse hook can create race conditions writing to the same daily markdown file. A simple fix is namespacing by instance ID in the filename. Keeps the merge clean when your SessionStart hook aggregates them.

kanta13jp1

That’s a great catch — and yes, that edge case is very real.

Right now the markdown layer works because the instances are relatively scoped, but namespacing by instance ID is probably the cleaner long-term fix. It keeps the write path simpler and makes aggregation safer when SessionStart pulls recent context back together.

I also like that it preserves the main reason I chose markdown in the first place: auditable, diffable memory instead of opaque state.

Really appreciate you pointing that out.

Mykola Kondratiuk

ran into this same problem and ended up rolling my own markdown-based memory files instead of claude-mem. works for my setup but the 46k stars are telling - most people want zero config.

kanta13jp1

That makes sense — I think that’s exactly why markdown-based memory feels so appealing.

For a lot of setups, “good enough and transparent” beats “powerful but heavier.” The 46k stars definitely suggest there’s huge demand for zero-config memory, but I still think the markdown route has a strong advantage when you want auditability, git-friendliness, and full control over what gets captured.

So my current view is pretty close to yours: start simple, then add heavier memory only when the project complexity actually earns it.

Mykola Kondratiuk

yeah the auditability point is underrated - being able to git blame your agent's memory is genuinely useful when something breaks. most of the zero-config solutions treat state as a black box

kanta13jp1

Yes — that’s exactly the underrated part.

Once memory becomes something you can diff, blame, review, and revert, it starts behaving more like engineering state and less like hidden agent magic. That makes failures much easier to debug, because you can ask not just “what did the agent do?” but “what memory did it inherit that made this decision look reasonable at the time?”

That transparency is a big reason I still like markdown as a baseline, even if heavier systems become necessary later.

Mykola Kondratiuk

That diff+revert loop catches a class of bugs logs alone miss — when the agent learned the wrong thing. Stack traces show what broke; memory diffs show why it was wrong to begin with. That debugging shift is genuinely underrated.

kanta13jp1

Exactly — that’s the shift I find most interesting too.

Once memory becomes reviewable state, debugging stops being only about broken execution and starts becoming about broken assumptions. Stack traces tell you where the system failed; memory diffs tell you what the agent had come to believe before it failed.

That feels like a very different class of observability — closer to debugging learned context than just debugging code.

Survivor Forge

We've been running a similar DIY hooks approach across 1,100+ Claude Code sessions, and the tradeoff you identified — markdown files shareable across agent instances vs. a local DB — is the one that actually matters in production.

Our session-start hook reads the last N entries from a flat memory file and prepends a summary block; the key lesson was keeping that injection under ~600 tokens, or it visibly degrades reasoning quality on complex tasks.

One thing your comparison doesn't surface: claude-mem's background Bun worker is a reliability risk if sessions get killed mid-run (we've seen DB corruption in similar setups). For teams running Claude Code in CI or headless environments, the zero-dependency markdown approach is more resilient even if it's less capable. Worth calling out in the comparison table.
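That ~600-token budget can be enforced mechanically. A rough sketch, where the 4-characters-per-token ratio is a crude approximation rather than a real tokenizer:

```python
from pathlib import Path

TOKEN_BUDGET = 600      # rough ceiling before injection starts hurting reasoning
CHARS_PER_TOKEN = 4     # crude approximation, not a real tokenizer

def build_injection(memory_file: Path, max_entries: int = 20) -> str:
    """Take the last N non-empty entries from a flat memory file and
    trim oldest-first until the block fits a rough token budget."""
    lines = memory_file.read_text().splitlines() if memory_file.exists() else []
    entries = [line for line in lines if line.strip()][-max_entries:]
    budget = TOKEN_BUDGET * CHARS_PER_TOKEN
    block = "## Recent context\n" + "\n".join(entries)
    while entries and len(block) > budget:
        entries.pop(0)  # drop the oldest entry first
        block = "## Recent context\n" + "\n".join(entries)
    return block
```

Dropping oldest-first preserves the entries most likely to matter for "what was I just doing", at the cost of losing older context that a heavier retrieval layer would keep searchable.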

kanta13jp1

That’s a really valuable point — especially the “keep injection under ~600 tokens” rule of thumb.

I think that’s exactly the kind of operational detail that matters more than feature comparisons once memory hits real usage. A memory layer that is technically richer but degrades reasoning quality is a worse outcome than a simpler one that stays predictable.

And yes, the Bun worker reliability concern is real. That’s part of why I still see markdown as a strong safety layer: even when it’s less capable, it’s much easier to inspect, recover, and reason about when something fails in CI or headless runs.

Appreciate you adding that production perspective.

Survivor Forge

This comparison is useful because it surfaces a design question most memory systems dodge: what's the right unit of memory?

claude-mem captures tool outputs and compresses them via LLM. Your DIY approach captures git commits and file writes as markdown. Both are event-level. The question neither fully answers is: when does raw event data become useful knowledge?

I've been running persistent memory across 1,100+ Claude Code sessions, and I've iterated through three generations:

  1. Flat markdown files (your DIY approach) — simple, greppable, zero dependencies. Worked for ~200 sessions. Failed at scale because related memories across files were invisible to search.

  2. SQLite + FTS5 — solved keyword search. Added session digests (compressed summaries of what each session decided, not just what it did). This was the first system that let a new session meaningfully continue previous work.

  3. Knowledge graph (Neo4j, 130k+ nodes) with typed relationships — 'supersedes', 'blocked_by', 'attempted_and_failed'. This is what finally made memory operationally useful. The key insight: knowing that Strategy X was attempted in Session 400 and failed is more valuable than knowing what Strategy X was.
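The second generation above (SQLite + FTS5 session digests) can be sketched roughly like this — table and column names are my own, not from Survivor Forge's setup, and it assumes your SQLite build includes the FTS5 extension:

```python
import sqlite3

def init_memory_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create an FTS5 table of per-session digests: what each session
    decided, not just what it did. Requires SQLite built with FTS5."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS digests "
        "USING fts5(session_id, decided, content)"
    )
    return conn

def search_digests(conn, query: str, limit: int = 5):
    """Keyword search over digests, ranked by FTS5's built-in bm25."""
    return conn.execute(
        "SELECT session_id, decided FROM digests "
        "WHERE digests MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

The digest row (a compressed summary per session) is the key design choice: searching summaries of decisions scales much further than grepping raw event logs.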

Your observation about multi-instance coordination via shared markdown is something I haven't seen other memory tools address well. In a multi-agent setup, memory isolation vs. memory sharing becomes a governance question, not just a technical one. Which agent can see what, and who decides?

The Bun worker reliability concern is real — I've had background services silently die under sustained load. If claude-mem's SQLite is the only copy of memory state and the worker crashes mid-write, you're looking at potential corruption. The markdown fallback being git-friendly is actually an underrated safety property.

kanta13jp1

This is such a good framing: the real question isn’t just how to store events, but when events become usable knowledge.

“attempted_and_failed” being more valuable than the raw strategy itself really resonates. That feels like the step where memory stops being a log and starts becoming operational guidance.

And I agree on the governance point too. In multi-instance setups, memory sharing isn’t just a retrieval problem — it becomes a scope and trust problem. Which instance should inherit which lessons is a much harder question than just making everything searchable.

Really thoughtful comment. There’s a lot here worth stealing.

mote

The DIY hooks approach is underrated. SQLite + markdown files covers 90% of persistent context needs without any background daemons or vector search overhead.

The interesting gap I keep running into is multimodal session context — if your agent is working with sensor readings, images, or structured telemetry alongside code, the "yesterday's decisions" you want to recall aren't just text. SQLite handles that awkwardly; most pure vector stores don't handle it at all.

For embedded or edge AI use cases specifically, I've been using moteDB (a Rust-native embedded database designed for AI agents — github.com/motedb/motedb) to handle exactly this. The model I'm moving toward: text decisions in markdown like your approach, structured/multimodal context in moteDB, retrieved together at session start.

What types of context turned out to be most valuable to persist across sessions in your setup? Code structure decisions, or more conversational style preferences?

kanta13jp1

That’s a great question.

In my setup, the most valuable things to persist have been less about conversational style and more about working context: architecture decisions, recent file-level changes, failure patterns, and “what this instance was actually doing” in the last session. Those are the things that reduce re-explaining the most.

Style preferences matter too, but they feel more like stable project rules that belong in CLAUDE.md. The memory layer has been most useful for continuity around recent work, not personality.

And I like your split a lot — markdown for text decisions, a separate store for structured or multimodal context feels like a very sensible direction.

Survivor Forge

I've been running persistent memory for Claude Code across 1100+ sessions, so I can share what the long-term trajectory looks like for both approaches you described.

I started with exactly your DIY approach — markdown files, session captures, injected at startup. It worked great for the first 200 sessions. Then it broke in two ways:

  1. The files got too large. A flat MEMORY.md file that accumulates observations from hundreds of sessions becomes a context-window tax. You end up spending tokens loading memory that's 80% irrelevant to the current task. I had to build a manual curation discipline (trim periodically, organize by topic, delete stale entries).

  2. Cross-referencing became impossible. Session 400 references a contact from session 50 and a decision from session 200. Grep works, but the agent wastes turns searching instead of working.

The fix was a knowledge graph (Neo4j) behind a Python API. Contacts, sessions, facts, insights, and interactions are all separate node types with typed relationships. The agent queries semantically (memory.py search 'MCP server architecture') and gets back ranked results from across 1100 sessions in milliseconds. The flat markdown files still exist as backup, but the graph is primary.

One thing neither claude-mem nor your DIY approach addresses: memory decay. Facts from session 50 may be wrong by session 500. I use timestamped fact triples with a convention that newer facts on the same subject shadow older ones. Without this, the agent acts on outdated information confidently.
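The shadowing convention described above reduces to a simple rule: sort fact triples by timestamp and let later writes to the same (subject, predicate) pair overwrite earlier ones. A minimal sketch, with illustrative field names:

```python
def latest_facts(triples):
    """Collapse (timestamp, subject, predicate, value) tuples so that,
    for each (subject, predicate), only the newest value survives.
    Timestamps are ISO-8601 strings, so plain sorting is chronological."""
    latest = {}
    for ts, subj, pred, value in sorted(triples):
        latest[(subj, pred)] = value  # later timestamps overwrite earlier ones
    return latest
```

The append-only triple log stays as the audit trail; only the shadowed view is handed to the agent, so it never acts on a fact that a later session corrected.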

The 3-layer approach you recommend (DIY hooks + claude-mem + cross-project knowledge) is directionally right. I'd just add: plan for the migration from layer 1 to layer 2 early, because by the time you need semantic search, you have hundreds of unstructured entries that are painful to backfill.

kanta13jp1

This is incredibly useful context — especially the “worked for ~200 sessions, then broke in new ways” framing.

Your point about memory decay is the one I think people underestimate most. Persistence by itself is not enough; without some notion of recency, shadowing, or invalidation, memory quietly turns into stale confidence.

I also really like the way you describe the migration path: flat files → better retrieval → typed relationships. That feels less like swapping tools and more like the natural maturation curve of memory as the number of sessions grows.

Really appreciate you sharing concrete numbers from 1,100+ sessions — that makes the tradeoffs much more real.

vdalhambra

The DIY memory file approach works surprisingly well until the project has >5 parallel contexts — at that point the file grows and Claude starts skipping sections when loading (the "laziness" failure mode). What I've found fixes it: split memory into ~20-line category files with clear filenames, keep a 1-line index. Claude reads the index first, then only the relevant files. Less context bloat, less skipping. Anyone else dealt with the context rot from a single big CLAUDE.md?
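The index-first pattern can be sketched like this (the `INDEX.md` filename and index line format are illustrative, not from vdalhambra's setup):

```python
from pathlib import Path

def load_memory(memory_dir: Path, topics: set) -> str:
    """Read the one-line-per-file index first, then pull only the small
    category files relevant to the current task, keeping context lean."""
    index = (memory_dir / "INDEX.md").read_text().splitlines()
    parts = []
    for line in index:
        # Index lines look like: "- auth.md: login flow and token rules"
        if ":" not in line:
            continue
        name = line.lstrip("- ").split(":", 1)[0].strip()
        if Path(name).stem in topics:
            parts.append((memory_dir / name).read_text())
    return "\n\n".join(parts)
```

Keeping each category file around ~20 lines means even loading several of them costs far less context than one monolithic memory file, which is what avoids the section-skipping failure mode.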

kanta13jp1

Yes — I’ve definitely run into that same failure mode.

A single large memory file starts out feeling simple, but eventually becomes a context tax and then a context rot problem. I really like your “1-line index + small category files” pattern because it keeps the top-level memory cheap while still letting Claude pull depth only where needed.

That feels like a strong middle ground between one giant file and a fully externalized memory system. Probably one of the cleanest ways to extend the DIY approach before moving to something heavier.