Digital Craft Workshop

Posted on • Originally published at generativeai.pub

3 Things That Bit Me Before My Claude Code Memory Setup Worked


I gave Claude Code a memory, and within a couple of days it had eaten every credit on one of my API keys.

That was version three. Version one had built a loop that triggered itself. Version two had my laptop fan screaming every time I closed a terminal. None of them were the clean recall-in, capture-out setup I run now — they were the three wrecks I drove on the way to it.

So this is the part I left out of the clean write-up. Three things that bit me, and what each one is actually warning you about if you go build memory into Claude Code yourself.

None of them was a typo or a missing semicolon. Each one looked completely reasonable at the moment I wrote it. That's what makes them worth writing down: these are the mistakes that don't announce themselves until they're running.

1. The hook that triggered itself

My capture step runs on SessionEnd. When I close a Claude Code session, a hook fires, the session's decisions get extracted, and they get written to the memory store. Simple enough.

The first version extracted those decisions by spawning a Claude process to read the transcript. Which meant the capture step was itself a Claude session. Which meant it ended too. Which fired SessionEnd again.

The hook was extracting the act of extracting. Each meta-session closed, triggered another meta-session to capture its decisions, and that one closed and triggered the next. A snake eating its own tail, one link at a time.

[Figure: A SessionEnd hook spawns a Claude process that is itself a session, which fires SessionEnd again]

It did not run away to infinity, because other limits caught it first. But it ran far enough to scare me, and the lesson stuck harder than most: a capture process must never be the same kind of thing as the event that triggers capture.

If your "extract what happened" step is itself a session, and sessions are what trigger extraction, you have written a recursion by accident, and it has no base case.

The fix is to break the type. The capture process has to be something the lifecycle doesn't watch — a plain script, a detached subprocess, anything that isn't itself a session. Once the extractor stopped being a Claude session, the loop had nothing left to feed on.
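A minimal sketch of that fix, assuming a Python hook script. It launches the extractor as a plain detached process and adds an environment-variable guard as a second line of defense. The `MEMORY_CAPTURE_ACTIVE` variable and the function names are hypothetical, not part of Claude Code.

```python
import subprocess

# Hypothetical marker name; the hook and anything it launches just have to agree on it.
GUARD_VAR = "MEMORY_CAPTURE_ACTIVE"


def should_capture(env):
    """False when this hook was fired from inside our own capture process."""
    return env.get(GUARD_VAR) != "1"


def spawn_extractor(argv, env):
    """Start a plain detached process and return without waiting for it."""
    child_env = dict(env)
    child_env[GUARD_VAR] = "1"  # inherited by anything the extractor starts
    return subprocess.Popen(
        argv,
        env=child_env,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach from the closing terminal
    )


def on_session_end(env, argv):
    """Hook entry point: spawn the extractor unless the guard is already set."""
    if not should_capture(env):
        return None
    return spawn_extractor(argv, env)
```

A hook script could call this as `on_session_end(os.environ, [sys.executable, "extract_decisions.py"])`, where `extract_decisions.py` stands in for whatever plain script does the extraction. The guard only matters if something downstream starts a session anyway; the real fix is still that the extractor is not one.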

This is the kind of bug that's obvious in hindsight and invisible while you're writing it, because at write-time "just have Claude read the transcript and pull out the decisions" is the most natural instruction in the world. The recursion is hiding in the word "session," and you don't see it until it's spinning.

I'd actually been bitten by a cousin of this before — an earlier version of the memory setup spawned seven parallel Claude instances from one "helpful" hook. The full story of that first build is in how I added persistent semantic memory to Claude Code in 15 minutes.

2. The embedder that cooked my laptop

Second version. The loop was gone, capture worked, and then I noticed my laptop fan spinning up every single time I closed a session.

The culprit was where I'd put the embedding step. I was generating embeddings locally with all-MiniLM-L6-v2, a small sentence-transformer model. The word "small" is carrying a lot of weight in that sentence.

Small for a model is not small for something that runs on every SessionEnd. The model file is around 80 MB. Loading it meant spinning up a Python runtime, pulling the weights into memory, and running the encode — and the whole spawn footprint landed near 500 MB of RAM and a real bite of CPU, every time I closed a terminal.

[Figure: Simulated top output showing a python process at high CPU and ~500MB RES, spawned at session close]

On a desktop you might never notice. On a laptop, closing several sessions an evening, you notice. The machine ran warm, the fan ran loud, and the cause was a memory feature I'd added to make my life easier, quietly taxing the machine every time I stepped away from it.

The thing to internalize here isn't "MiniLM is bad." MiniLM is fine. It's that the cost of a memory pipeline is dominated by where the heavy steps run, and how often they run there. Embedding locally, on CPU, on every session close, is a lot of repeated heavy lifting for a background nicety nobody asked to be expensive.

The move is to get the heavy step out of the hot path. Push embedding to a hosted endpoint, or batch it into one job that runs on a timer instead of once-per-close, or skip local embedding on a laptop entirely. The decision that matters is placement and frequency. The model name barely registers next to it.
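One way to picture the "batch it on a timer" option is a queue file: the hook only appends a line, and a separate scheduled job embeds everything in one go. This is a sketch under assumptions, not the setup described here. The function names and the JSONL queue format are hypothetical, and `embed_batch` is whatever callable turns a list of strings into a list of vectors.

```python
import json
from pathlib import Path


def enqueue(queue_path, text):
    """Hot path, run on session close: append one line. No model is loaded."""
    with open(queue_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"text": text}) + "\n")


def drain(queue_path, embed_batch):
    """Cold path, run from a timer: embed everything queued in one call."""
    path = Path(queue_path)
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    texts = [json.loads(line)["text"] for line in lines if line.strip()]
    if not texts:
        return []
    vectors = embed_batch(texts)  # the model loads once per batch, not once per close
    path.unlink()  # clear the queue only after embedding succeeded
    return list(zip(texts, vectors))
```

With this shape, five closes in an evening cost five file appends and one model load. The sketch ignores one race: a close that lands between the read and the unlink would be lost, so a real version would want a lock or a rename before reading.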

Want the Claude Code memory setup I build with? The Claude Code Memory Starter is a free email series.

3. The frontier model that quietly ate my credits

The extraction step — the part that reads a transcript and answers "what did we actually decide here" — needs a language model to do it. When I first wired it, I pointed it at the model I already had keys for: Claude Sonnet. A frontier model, on a key I was also using for a production app.

One transcript through Sonnet is cheap. The problem was the multiplier I'd stopped thinking about. At the end of an evening I close sessions in a burst — three, four, five at once — and each close fired its own extraction against that same shared key.

Within a couple of days the credits on it were gone. Not because extraction is expensive in principle, but because I'd wired the most expensive available model to the most frequent available trigger, drawing from a budget that something real depended on.
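The multiplier is easier to see as arithmetic. The sketch below counts only input tokens, and every number in it is a placeholder chosen to make the multiplication visible, not a measured value or a real price.

```python
def daily_extraction_cost(closes_per_day, tokens_per_transcript, usd_per_million_tokens):
    """Input-token cost of running one extraction per session close."""
    tokens = closes_per_day * tokens_per_transcript
    return tokens * usd_per_million_tokens / 1_000_000


# Placeholder inputs: swap in your own session count, transcript size, and price.
per_day = daily_extraction_cost(
    closes_per_day=20,
    tokens_per_transcript=50_000,
    usd_per_million_tokens=3.0,
)
```

The single-transcript cost is the number you check when wiring the hook. The `closes_per_day` factor is the one that goes unexamined.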

[Figure: Before and after. SessionEnd to Sonnet on a shared prod key drains credits; SessionEnd to Gemini Flash Lite on its own key is free]

The fix had two halves, and both are worth stealing.

First, the right model for the job. Transcript extraction is not a frontier-model task. You're asking it to summarize the decisions in a block of text, not to reason about a hard problem. A free-tier Gemini Flash Lite handles it without complaint, and a small OpenAI model, cheaper than Haiku, steps in as a fallback on the days the free tier hits its rate limit. For this specific job I couldn't see a quality difference, and the cost difference is the whole ballgame.

This is the same instinct I wrote up separately: when a cheap model clears the bar, don't quietly reach for a pricier one. More on that in how a Claude Code skill turns an hour of setup into 90 seconds — same lens, different chore.
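The first half of the fix, a free-tier model with a cheap paid fallback, comes down to an ordered list of providers. This is a sketch of that ordering only. `RateLimited` and the provider callables are hypothetical wrappers you would write around each vendor's client, not real SDK names.

```python
class RateLimited(Exception):
    """Hypothetical: raised by a provider wrapper when its quota is exhausted."""


def extract_with_fallback(transcript, providers):
    """Try each (name, call) pair in order; move on only when rate limited."""
    for name, call in providers:
        try:
            return name, call(transcript)
        except RateLimited:
            continue  # this provider is out of quota, try the next one
    raise RuntimeError("every provider was rate limited")
```

Only the rate-limit case falls through to the next provider. Any other error surfaces immediately, so a broken prompt or a bad key doesn't quietly shift every extraction onto the paid fallback.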

Second, the key. I'd been extracting on the same Anthropic key my production app billed against, so a runaway background hook and a live product were both pulling from one budget. Separating them — memory extraction on its own provider, production on its own — means a mistake in the plumbing can't take down the thing that actually pays for itself.

So the rule, in one line: match the model to the difficulty of the task, and never let a background convenience share a budget with something that matters. A hook firing on every session close is exactly the kind of process that belongs on the cheapest model that clears the bar, on a key that can fail quietly without taking anything else with it.
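The key half of the rule can be enforced in code rather than left to memory. A minimal sketch, assuming the keys arrive as environment variables; `MEMORY_EXTRACT_API_KEY` and `PROD_API_KEY` are hypothetical names chosen for illustration.

```python
def load_extraction_key(env):
    """Return the dedicated extraction key, refusing any shared or fallback key."""
    key = env.get("MEMORY_EXTRACT_API_KEY")  # hypothetical variable name
    if not key:
        # Fail loudly instead of silently borrowing whatever other key is set.
        raise RuntimeError("MEMORY_EXTRACT_API_KEY is not set")
    if key == env.get("PROD_API_KEY"):  # hypothetical variable name
        raise RuntimeError("memory extraction must not share the production key")
    return key
```

The point of the design is the missing fallback: if the dedicated key is absent, the hook fails and nothing gets billed, which is the "fail quietly" behavior a background convenience should have.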

What the three have in common

A loop, a hot fan, a drained wallet. They look like three unrelated failures. They're the same one in three disguises.

Every one of them came from underestimating how often a session-lifecycle hook fires. A thing that runs "when I close a session" doesn't run occasionally. It runs constantly, in bursts, in the background, while your attention is on something else entirely. Whatever you put in that path — a process type, a model load, an API call — gets multiplied by a number you are not currently looking at.

The setup I run now is boring for exactly that reason: nothing heavy lives in the hot path anymore. Extraction is detached from the session. The model is cheap and isolated on its own key. The embedding doesn't fight my CPU. None of it can loop, cook, or bill, because the expensive parts were moved out of the path that fires the most.

If you build memory into Claude Code, that's the lens I'd hand you. The question to keep asking isn't "does this work when I run it once." It's "what does this cost when it runs in the background, in a burst, while I'm not watching." Get that one right and the memory layer disappears into the background, which is the only place it's any use.

The clean wiring all of this eventually became — the recall-in, capture-out shape, end to end — is its own write-up, linked below, if you'd rather read the version that works than the three that didn't.

Further Reading

