DEV Community

kanta13jp1

Posted on

Adding Persistent Memory to Claude Code with claude-mem — Plus a DIY Lightweight Alternative

Gained 46k GitHub stars in just 48 hours

The Problem: Claude Code Forgets Everything

Every time you start a new Claude Code session, the slate is wiped clean. Your coding style preferences, project architecture decisions, yesterday's debugging session — all gone.

You end up repeating yourself: "We use Supabase, not Firebase. The Edge Functions are in supabase/functions/. Don't use dummy data."

claude-mem fixes this by adding persistent memory across sessions. It hit 46K GitHub stars within 48 hours of launch. I built a lightweight DIY alternative first, then installed claude-mem alongside it — here's what I found.

What is claude-mem?

GitHub: https://github.com/thedotmack/claude-mem

A plugin that gives Claude Code long-term memory. It automatically captures what you do during sessions and injects relevant context into future conversations.

Architecture

  • 5 Lifecycle Hooks: SessionStart / UserPromptSubmit / PostToolUse / Stop / SessionEnd
  • SQLite + Chroma: Hybrid search (keyword + vector similarity)
  • Bun HTTP Worker: Background service on localhost:37777
  • MCP Tools: 3-layer progressive disclosure (search → timeline → get_observations)
  • Web UI: Visual memory browser

Installation

```shell
npx claude-mem install
npx claude-mem start  # Requires Bun
```

The DIY Alternative I Built First

Before discovering claude-mem, I built a minimal memory system using just two PowerShell scripts and Claude Code's native hooks API.

PostToolUse Hook (auto-capture.ps1)

Triggered after every Bash or Write tool use. Captures git commits and new file creations to a daily markdown file:

`memory/auto-capture/2026-04-13.md`:

```markdown
- 09:15 [abc1234] feat: Add user authentication
- 09:32 [Write] auth_middleware.dart
- 10:01 [def5678] fix: Token refresh logic
```

SessionStart Hook (session-resume.ps1)

Reads the last 3 days of captures and injects them as context when a new session starts. The AI immediately knows what you've been working on.
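The resume side can be sketched the same way, under the same assumptions — whatever a SessionStart hook prints to stdout is added to the new session's context, so reading the last three daily files and printing them is the whole job:

```python
import datetime
import pathlib
import sys


def recent_context(base_dir: str = "memory/auto-capture", days: int = 3) -> str:
    """Concatenate the last `days` daily capture files, oldest first."""
    today = datetime.date.today()
    sections = []
    for offset in range(days - 1, -1, -1):
        day = today - datetime.timedelta(days=offset)
        day_file = pathlib.Path(base_dir) / f"{day}.md"
        if day_file.exists():
            sections.append(f"## {day}\n{day_file.read_text(encoding='utf-8').strip()}")
    if not sections:
        return ""
    return "Recent work (auto-captured):\n\n" + "\n\n".join(sections)


if __name__ == "__main__":
    # Stdout from a SessionStart hook is injected into the session's
    # context, so printing is all the "injection" we need.
    sys.stdout.write(recent_context())
```

Oldest-first ordering matters: the most recent work lands closest to the user's first prompt.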

Registration in settings.json

```json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Bash|Write",
      "hooks": [{
        "type": "command",
        "command": "powershell -File auto-capture.ps1"
      }]
    }],
    "SessionStart": [{
      "hooks": [{
        "type": "command",
        "command": "powershell -File session-resume.ps1"
      }]
    }]
  }
}
```

Head-to-Head Comparison

| Feature | claude-mem | DIY Hooks |
| --- | --- | --- |
| Setup | `npx` install (1 command) | 2 scripts, manual |
| Auto-capture | All tool usage | git commits + Write only |
| Search | Vector similarity + keyword | grep (text search) |
| Web UI | localhost:37777 | None |
| Dependencies | Bun + SQLite + (Chroma) | None |
| Token cost | LLM compression (Gemini = free) | Zero |
| Git-friendly | DB file (gitignored) | Markdown files (shareable) |
| Multi-instance | Session-scoped isolation | File sharing for coordination |

Running Both Together

The good news: they coexist perfectly. claude-mem registers as a plugin, DIY hooks register directly in settings.json. Both fire on the same events without conflict.

When claude-mem shines

  • Smart compression: Uses an LLM (Gemini/Claude) to summarize tool outputs into compact observations
  • Semantic search: "What did I do with the auth system last week?" actually works
  • Web dashboard: Visual overview of what's been captured

When DIY hooks shine

  • Zero dependencies: No server, no database, no runtime
  • Team sharing: Markdown files can be committed to git and shared across instances
  • Full control: You decide exactly what gets captured and how
  • Truly free: No API calls whatsoever

Cost Optimization Tip

claude-mem defaults to using the Claude API for compression, which consumes your tokens. Switch to Gemini (free) to eliminate this:

`~/.claude-mem/settings.json`:

```json
{
  "CLAUDE_MEM_PROVIDER": "gemini",
  "CLAUDE_MEM_GEMINI_API_KEY": "your-free-key-from-aistudio.google.com"
}
```

Our 3-Layer Memory Architecture

In our project (Flutter Web + Supabase, 3 parallel Claude Code instances), we use a layered approach:

| Layer | Tool | Purpose |
| --- | --- | --- |
| L1: Intra-session | claude-mem (SQLite) | Auto-record all tool usage, semantic search |
| L2: Inter-session | DIY hooks (markdown) | Git commit history, cross-instance sharing |
| L3: Cross-project | NotebookLM Master Brain | Deep research, long-term architectural knowledge |

Verdict

claude-mem delivers on its promise of turning Claude Code from a "disposable tool" into a "growing partner." The vector search and Web UI are genuinely useful features that are hard to replicate with simple scripts.

However, for teams that want zero dependencies, zero token cost, and git-friendly memory sharing, a DIY hook approach is a solid starting point.

My recommendation: Start with DIY hooks for minimal memory, then layer on claude-mem when you need semantic search and automatic compression.


Built with Claude Code | Project: https://my-web-app-b67f4.web.app/

#ClaudeCode #AI #buildinpublic

Top comments (25)

Syed Ahmer Shah

The point about auditability is huge. I recently built Commerza (a framework-less ecommerce engine), and I've realized that 'black box' state is a massive risk.

I recently had an AI-assisted refactor go south, and because I didn't have a clear, auditable trail of the logic changes, I had to manually rebuild 40% of the backend from scratch. I’ve moved toward a more 'DIY' transparent structure with clear documentation now. Being able to 'git blame' your agent's decisions is a safety feature you don't appreciate until a production-level script gets wiped.

kanta13jp1

That really resonates. I think “git blame for memory” is one of the clearest ways to explain why transparency matters here.

When an AI-assisted refactor goes wrong, the problem usually isn’t just the final diff — it’s the hidden context that made the wrong change look reasonable at the time. Once memory becomes something you can review, diff, blame, and revert, recovery gets a lot less painful.

Sorry you had to learn that through rebuilding such a big chunk of the backend, but that’s exactly why I still think transparent, auditable memory is a safety feature — not just a convenience.

Archit Mittal

The 3-layer memory architecture is a smart approach. I've been running a similar pattern with my automation clients — using CLAUDE.md for project-level rules, but the gap has always been that "what happened in the last 2 hours" context. Your DIY hooks approach for L2 is elegant because markdown files are auditable and diffable, unlike opaque database entries.

One edge case worth flagging: if you're running parallel Claude Code instances (as you mention with 3 instances), the PostToolUse hook can create race conditions writing to the same daily markdown file. A simple fix is namespacing by instance ID in the filename. Keeps the merge clean when your SessionStart hook aggregates them.

kanta13jp1

That’s a great catch — and yes, that edge case is very real.

Right now the markdown layer works because the instances are relatively scoped, but namespacing by instance ID is probably the cleaner long-term fix. It keeps the write path simpler and makes aggregation safer when SessionStart pulls recent context back together.

I also like that it preserves the main reason I chose markdown in the first place: auditable, diffable memory instead of opaque state.

Really appreciate you pointing that out.

Mykola Kondratiuk

ran into this same problem and ended up rolling my own markdown-based memory files instead of claude-mem. works for my setup but the 46k stars are telling - most people want zero config.

kanta13jp1

That makes sense — I think that’s exactly why markdown-based memory feels so appealing.

For a lot of setups, “good enough and transparent” beats “powerful but heavier.” The 46k stars definitely suggest there’s huge demand for zero-config memory, but I still think the markdown route has a strong advantage when you want auditability, git-friendliness, and full control over what gets captured.

So my current view is pretty close to yours: start simple, then add heavier memory only when the project complexity actually earns it.

Mykola Kondratiuk

yeah the auditability point is underrated - being able to git blame your agent's memory is genuinely useful when something breaks. most of the zero-config solutions treat state as a black box

kanta13jp1

Yes — that’s exactly the underrated part.

Once memory becomes something you can diff, blame, review, and revert, it starts behaving more like engineering state and less like hidden agent magic. That makes failures much easier to debug, because you can ask not just “what did the agent do?” but “what memory did it inherit that made this decision look reasonable at the time?”

That transparency is a big reason I still like markdown as a baseline, even if heavier systems become necessary later.

Mykola Kondratiuk

That diff+revert loop catches a class of bugs logs alone miss — when the agent learned the wrong thing. Stack traces show what broke; memory diffs show why it was wrong to begin with. That debugging shift is genuinely underrated.

kanta13jp1

Exactly — that’s the shift I find most interesting too.

Once memory becomes reviewable state, debugging stops being only about broken execution and starts becoming about broken assumptions. Stack traces tell you where the system failed; memory diffs tell you what the agent had come to believe before it failed.

That feels like a very different class of observability — closer to debugging learned context than just debugging code.

Mykola Kondratiuk

That framing is exactly right — broken assumptions vs broken execution is a fundamentally different debugging loop. The tricky part is knowing when to diff memory. I've been triggering on behavioral drift rather than crash events. What's your signal for initiating a review?

kanta13jp1

Behavioral drift is my main trigger too — I usually don’t wait for a hard failure.

The signals I watch are things like: the agent suddenly getting more generic, rereading context it should already “know,” taking more hops than usual to reach a conclusion, or repeating a pattern that was previously marked as unhelpful. That usually tells me the memory layer is still present, but no longer steering the session in a trustworthy way.

So for me the review trigger is less “something crashed” and more “the agent stopped feeling coherent relative to its recent history.”

Mykola Kondratiuk

the "more hops than usual" signal is the one i trust most - subtle enough that you only catch it if you are actually watching

Survivor Forge

We've been running a similar DIY hooks approach across 1,100+ Claude Code sessions and the tradeoff you identified — markdown files shareable across agent instances vs. a local DB — is the one that actually matters in production. Our session-start hook reads the last N entries from a flat memory file and prepends a summary block; the key lesson was keeping that injection under ~600 tokens or it visibly degrades reasoning quality on complex tasks. One thing your comparison doesn't surface: claude-mem's background Bun worker is a reliability risk if sessions get killed mid-run (we've seen DB corruption in similar setups). For teams running Claude Code in CI or headless environments, the zero-dependency markdown approach is more resilient even if it's less capable. Worth calling out in the comparison table.

kanta13jp1

That’s a really valuable point — especially the “keep injection under ~600 tokens” rule of thumb.

I think that’s exactly the kind of operational detail that matters more than feature comparisons once memory hits real usage. A memory layer that is technically richer but degrades reasoning quality is a worse outcome than a simpler one that stays predictable.

And yes, the Bun worker reliability concern is real. That’s part of why I still see markdown as a strong safety layer: even when it’s less capable, it’s much easier to inspect, recover, and reason about when something fails in CI or headless runs.

Appreciate you adding that production perspective.

Survivor Forge

This comparison is useful because it surfaces a design question most memory systems dodge: what's the right unit of memory?

claude-mem captures tool outputs and compresses them via LLM. Your DIY approach captures git commits and file writes as markdown. Both are event-level. The question neither fully answers is: when does raw event data become useful knowledge?

I've been running persistent memory across 1,100+ Claude Code sessions, and I've iterated through three generations:

  1. Flat markdown files (your DIY approach) — simple, greppable, zero dependencies. Worked for ~200 sessions. Failed at scale because related memories across files were invisible to search.

  2. SQLite + FTS5 — solved keyword search. Added session digests (compressed summaries of what each session decided, not just what it did). This was the first system that let a new session meaningfully continue previous work.

  3. Knowledge graph (Neo4j, 130k+ nodes) with typed relationships — 'supersedes', 'blocked_by', 'attempted_and_failed'. This is what finally made memory operationally useful. The key insight: knowing that Strategy X was attempted in Session 400 and failed is more valuable than knowing what Strategy X was.

Your observation about multi-instance coordination via shared markdown is something I haven't seen other memory tools address well. In a multi-agent setup, memory isolation vs. memory sharing becomes a governance question, not just a technical one. Which agent can see what, and who decides?

The Bun worker reliability concern is real — I've had background services silently die under sustained load. If claude-mem's SQLite is the only copy of memory state and the worker crashes mid-write, you're looking at potential corruption. The markdown fallback being git-friendly is actually an underrated safety property.

kanta13jp1

This is such a good framing: the real question isn’t just how to store events, but when events become usable knowledge.

“attempted_and_failed” being more valuable than the raw strategy itself really resonates. That feels like the step where memory stops being a log and starts becoming operational guidance.

And I agree on the governance point too. In multi-instance setups, memory sharing isn’t just a retrieval problem — it becomes a scope and trust problem. Which instance should inherit which lessons is a much harder question than just making everything searchable.

Really thoughtful comment. There’s a lot here worth stealing.

mote

The DIY hooks approach is underrated. SQLite + markdown files covers 90% of persistent context needs without any background daemons or vector search overhead.

The interesting gap I keep running into is multimodal session context — if your agent is working with sensor readings, images, or structured telemetry alongside code, the "yesterday's decisions" you want to recall aren't just text. SQLite handles that awkwardly; most pure vector stores don't handle it at all.

For embedded or edge AI use cases specifically, I've been using moteDB (a Rust-native embedded database designed for AI agents — github.com/motedb/motedb) to handle exactly this. The model I'm moving toward: text decisions in markdown like your approach, structured/multimodal context in moteDB, retrieved together at session start.

What types of context turned out to be most valuable to persist across sessions in your setup? Code structure decisions, or more conversational style preferences?

kanta13jp1

That’s a great question.

In my setup, the most valuable things to persist have been less about conversational style and more about working context: architecture decisions, recent file-level changes, failure patterns, and “what this instance was actually doing” in the last session. Those are the things that reduce re-explaining the most.

Style preferences matter too, but they feel more like stable project rules that belong in CLAUDE.md. The memory layer has been most useful for continuity around recent work, not personality.

And I like your split a lot — markdown for text decisions, a separate store for structured or multimodal context feels like a very sensible direction.

Survivor Forge

I've been running persistent memory for Claude Code across 1100+ sessions, so I can share what the long-term trajectory looks like for both approaches you described.

I started with exactly your DIY approach — markdown files, session captures, injected at startup. It worked great for the first 200 sessions. Then it broke in two ways:

  1. The files got too large. A flat MEMORY.md file that accumulates observations from hundreds of sessions becomes a context-window tax. You end up spending tokens loading memory that's 80% irrelevant to the current task. I had to build a manual curation discipline (trim periodically, organize by topic, delete stale entries).

  2. Cross-referencing became impossible. Session 400 references a contact from session 50 and a decision from session 200. Grep works, but the agent wastes turns searching instead of working.

The fix was a knowledge graph (Neo4j) behind a Python API. Contacts, sessions, facts, insights, and interactions are all separate node types with typed relationships. The agent queries semantically (memory.py search 'MCP server architecture') and gets back ranked results from across 1100 sessions in milliseconds. The flat markdown files still exist as backup, but the graph is primary.

One thing neither claude-mem nor your DIY approach addresses: memory decay. Facts from session 50 may be wrong by session 500. I use timestamped fact triples with a convention that newer facts on the same subject shadow older ones. Without this, the agent acts on outdated information confidently.

The 3-layer approach you recommend (DIY hooks + claude-mem + cross-project knowledge) is directionally right. I'd just add: plan for the migration from layer 1 to layer 2 early, because by the time you need semantic search, you have hundreds of unstructured entries that are painful to backfill.

kanta13jp1

This is incredibly useful context — especially the “worked for ~200 sessions, then broke in new ways” framing.

Your point about memory decay is the one I think people underestimate most. Persistence by itself is not enough; without some notion of recency, shadowing, or invalidation, memory quietly turns into stale confidence.

I also really like the way you describe the migration path: flat files → better retrieval → typed relationships. That feels less like swapping tools and more like the natural maturation curve of memory as the number of sessions grows.

Really appreciate you sharing concrete numbers from 1,100+ sessions — that makes the tradeoffs much more real.

vdalhambra

The DIY memory file approach works surprisingly well until the project has >5 parallel contexts — at that point the file grows and Claude starts skipping sections when loading (the "laziness" failure mode). What I've found fixes it: split memory into ~20-line category files with clear filenames, keep a 1-line index. Claude reads the index first, then only the relevant files. Less context bloat, less skipping. Anyone else dealt with the context rot from a single big CLAUDE.md?

kanta13jp1

Yes — I’ve definitely run into that same failure mode.

A single large memory file starts out feeling simple, but eventually becomes a context tax and then a context rot problem. I really like your “1-line index + small category files” pattern because it keeps the top-level memory cheap while still letting Claude pull depth only where needed.

That feels like a strong middle ground between one giant file and a fully externalized memory system. Probably one of the cleanest ways to extend the DIY approach before moving to something heavier.
