Memory is arguably the most fundamental capability for a useful agent. Without it:
- The agent can't adapt to a user's preferences over time
- It repeats the same mistakes across sessions
- The user has to re-prompt the same rules every conversation ("always use strict typing", "never mock the database in tests", "our auth token expiry is 900s")
This isn't a minor UX annoyance — it's the difference between an agent that gets better the more you use it and one that resets to zero every session.
## The problem with Claude's native memory
Claude Code gives you two options:
- CLAUDE.md — you write it, you maintain it. Rules the agent should follow, project context, conventions. Fully manual.
- MEMORY.md — Claude maintains it automatically, appending notes during sessions. No manual work.
MEMORY.md sounds like it solves the problem. In practice it has three hard constraints:
| Constraint | What it means |
|---|---|
| Per-repo scope | Global preferences (code style, team conventions) have to be duplicated into every project |
| One-time startup injection | The file is read once at session start, not re-surfaced when relevant mid-session |
| Size-bounded by position | Only the first N lines are read — notes get dropped by position, not by relevance |
The result: as MEMORY.md grows, older notes fall off the bottom regardless of how important they are. And a rule written on day 1 of project A doesn't exist in project B.
## What we wanted instead
We set a design target before writing any code:
- Context-activated — hints surface when relevant to the current work, not at startup
- No extra user-facing calls — injection should be automatic, invisible, fast
- Unbounded storage — adding notes shouldn't degrade performance or relevance
- Global — preferences like "never use duck typing when object structure is defined" should apply across all repos
- Cheap — not a $0.10/tool-call overhead
- Easy to install — one command installation
## The core idea: a dedicated LLM whose context is your memory
Traditional retrieval — keyword search, vector embeddings — matches on surface similarity. It doesn't understand what the agent is currently doing.
The approach here is different: instead of a search index, use a separate LLM that holds your stored notes as its context (system prompt), and prompt it with the current run context to retrieve relevant hints.
```
┌─────────────────────────────────┐
│           Memory LLM            │
│  system prompt: stored notes    │ ← memory lives here as context
│                                 │
│  user prompt: current transcript│ ← what is the agent doing right now?
│               + upcoming tool   │
│                                 │
│  output: relevant hints         │ ← surfaces only what matters
└─────────────────────────────────┘
```
The LLM reasons about relevance rather than matching strings. Each note is stored with a plain-language `--when` activation condition:

```shell
cmr-memory write "auth token expiry is 900s not 3600s" --when "working on auth, tokens, config.ts"
```
A note with `--when "working on authentication"` will surface when the agent is editing a JWT middleware file, even if neither "authentication" nor "JWT" appears in the transcript at that moment. The memory LLM understands context — it's not doing substring search.
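As a concrete sketch, the retrieval prompt can be assembled along these lines. The `Note` shape and `buildMemoryPrompt` helper here are illustrative, not the package's actual internals:

```typescript
// Illustrative note shape: the note text plus its --when activation condition.
interface Note {
  text: string;
  when: string;
}

// Assemble the two halves of the memory-LLM call:
// stored notes become the (stable, cacheable) system prompt,
// the live transcript becomes the user prompt.
function buildMemoryPrompt(notes: Note[], transcript: string, upcomingTool: string) {
  const system = [
    "You hold the user's stored notes. Given the agent's current",
    "transcript and the tool it is about to call, return only the",
    "notes whose activation condition matches what the agent is doing.",
    "Return nothing if no note is relevant.",
    "",
    ...notes.map((n, i) => `[note ${i}] ${n.text} (when: ${n.when})`),
  ].join("\n");

  const user = `Upcoming tool: ${upcomingTool}\n\nTranscript:\n${transcript}`;
  return { system, user };
}
```

The `{ system, user }` pair would then go to the memory model via an API call; keeping the notes in the system prompt is what makes prompt caching pay off later.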
## Deduplication keeps context bounded
Once a hint is injected, it's already in the transcript. Re-injecting it on the next tool call adds tokens but zero information. Before running retrieval, we extract hints already present in the conversation and skip anything that's already there — including semantically equivalent hints worded differently.
Context overhead stays bounded by the number of unique topics the agent works on in a session, not by session length or total notes stored.
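A minimal sketch of that filter follows. The `dedupeHints` helper is hypothetical, and this version only does normalized substring matching; the real pipeline also drops semantically equivalent rewordings, which plain string checks cannot catch:

```typescript
// Drop candidate hints that are already present in the transcript,
// and duplicates within the candidate list itself.
function dedupeHints(candidates: string[], transcript: string): string[] {
  const norm = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();
  const seen = norm(transcript);
  const out: string[] = [];
  for (const hint of candidates) {
    const h = norm(hint);
    if (!seen.includes(h) && !out.some((o) => norm(o) === h)) {
      out.push(hint);
    }
  }
  return out;
}
```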
## Second core concept: map-reduce over memory chunks
Memory retrieval runs on every tool call — not occasionally, but every single time the agent does anything. That makes latency and cost non-negotiable. A single LLM call over all notes fails on three dimensions simultaneously: context window limits (1,000 notes × ~400 tokens = ~400k tokens), attention degradation in long contexts ("Lost in the Middle", Liu et al. 2023), and cost at scale.
The solution — split notes into fixed-size chunks and apply a map-reduce pattern:
```
Notes split into chunks (~50 notes each)
                │
           ┌────┴────┐
      chunk-1   chunk-2  ...  chunk-N   ← map: one parallel Haiku call per chunk
        │         │            │
      hints     hints        hints
           └────┬────┘
   reduce: merge + deduplicate against hints already in context
                │
   inject only new hints as <system-reminder>
```
- Map (scatter): one Haiku call per chunk fires in parallel. Each chunk is small enough for accurate attention. Calls are parallel, so time doesn't grow with memory size.
- Reduce (gather): merges all candidates, filters against hints already in the transcript, returns only what's new.
We chose Haiku specifically for this role: it's fast, cheap, and performs well on focused tasks with small contexts — exactly what each chunk call is. You don't need a frontier model to decide whether "auth token expiry is 900s" is relevant to what the agent is currently doing.
Two compounding benefits make this fast and cheap:
- Parallelism — 20 chunks take roughly the same wall-clock time as 1 chunk, because all map calls fire simultaneously.
- Prompt caching — each chunk's notes live in the system prompt, which is stable and never changes once the chunk is sealed (full). Anthropic's prompt caching means repeated retrievals against the same chunk are served from cache — dramatically lower cost and faster responses on every subsequent tool call.
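The orchestration can be sketched as follows, assuming a hypothetical `retrieveFromChunk` function that stands in for one Haiku call per chunk (injected here so the map-reduce logic is self-contained):

```typescript
const CHUNK_SIZE = 50; // ~50 notes per chunk, per the design above

// Split notes into fixed-size chunks.
function chunkNotes<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

async function scatterGather(
  notes: string[],
  transcript: string,
  retrieveFromChunk: (chunk: string[], transcript: string) => Promise<string[]>,
): Promise<string[]> {
  // Map (scatter): one retrieval call per chunk, all fired in parallel,
  // so wall-clock time does not grow with memory size.
  const perChunk = await Promise.all(
    chunkNotes(notes, CHUNK_SIZE).map((c) => retrieveFromChunk(c, transcript)),
  );
  // Reduce (gather): merge candidates, drop duplicates and anything
  // already present in the transcript.
  const merged = perChunk.flat();
  return merged.filter((h, i) => !transcript.includes(h) && merged.indexOf(h) === i);
}
```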
## Hooks make it fully automatic
The entire retrieval pipeline runs via Claude Code's PreToolUse/PostToolUse hooks — the agent doesn't call memory explicitly at all.
- PreToolUse hook: fires before every tool call. Reads the conversation transcript to understand current context, runs scatter-gather retrieval, and injects relevant hints as a `<system-reminder>` block. The agent sees the hints without doing anything.
- PostToolUse hook: fires after every tool call. Sends a static nudge (~15 tokens, no LLM call) asking the agent whether anything noteworthy just happened. If yes, the agent writes a note. No forced writes — the agent decides.
The transcript is the key input: it gives the memory LLM a full picture of what the agent is working on right now, which is what makes context-activated retrieval possible.
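For reference, hook registration in Claude Code lives in its settings file, with entries roughly of this shape. The `cmr-memory hook ...` subcommands here are illustrative; the package wires this up for you during `init`:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "*",
        "hooks": [{ "type": "command", "command": "cmr-memory hook pre" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [{ "type": "command", "command": "cmr-memory hook post" }]
      }
    ]
  }
}
```

Claude Code passes each hook a JSON payload on stdin that includes the path to the session transcript, which is how the retrieval pipeline gets its context.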
## Implementation: a single CLI package
Everything ships as one npm package. Running `cmr-memory init` does three things automatically:
- Registers the hooks — wires PreToolUse and PostToolUse into Claude Code's config
- Injects memory rules — adds instructions to CLAUDE.md so the agent knows how and when to write notes
- Configures the memory LLM — sets up Haiku as the retrieval model with your Anthropic API key
After that, the agent writes notes via:
```shell
cmr-memory write "auth token expiry is 900s not 3600s" --when "working on auth, config.ts, tokens"
```
Two commands to get started:
```shell
npm install -g @agynio/cmr-memory
cmr-memory init --api-key sk-ant-...
```
We built this as part of our open-source research into multi-agent engineering systems at agyn.io.