Count how many times your agent told you "you're right" today. Count "good catch." Count "I should have noticed that." Now ask yourself how many of those corrections will survive into tomorrow's session.
The compliments are not praise. They are a confession. Every "you're right" is the agent admitting it just learned something it should have already known, in a context that will evaporate the moment the session ends. The data point is real. The retention is zero.
This is the actual problem people are trying to solve when they reach for elaborate self-improvement architectures: nightly reflection cron jobs, background agents that crawl yesterday's transcripts, autonomous proposal pipelines with grading subagents and dashboards for human review. The instinct is right. The solution is theater.
## Why this happens
Claude Code's runtime, like any agent runtime, starts each session from a fresh conversation. Anthropic's own framing for effective agents draws a line between workflows (predefined paths) and agents (LLMs deciding their own tool use). Both reset. The lesson you taught your agent at 3pm is encoded in the message history of that one conversation. Tomorrow morning, that history is gone. The model is the same model. The instructions in CLAUDE.md are the same instructions. But the specific correction — "no, on this codebase you have to use the absolute path because launchd reset PATH on you" — lives only in the transcript.
So you correct it again. And again. And the third time you notice the pattern, you start looking for a fix.
## The tempting wrong answer
The fix that gets blogged about goes something like this: build a nightly cron job that reads yesterday's transcripts, extracts candidate lessons, drafts them as JSON proposals with frontmatter, opens a dashboard, and asks a separate grading subagent to score the proposals. Human reviews. Promotes accepted ones into a "skill" file. Repeat.
This is ceremony, not discipline. Three problems with it:
- It substitutes process metrics for outcome. You can run the pipeline every night and ship zero durable improvement. The metric you actually care about is "did the agent stop making the same mistake," not "did we generate ten proposals last week."
- It moves the work from the moment that matters. The right time to write the rule is the moment you notice the agent got it wrong. Not eight hours later, after a reflection agent has interpreted what it thought happened.
- It puts a model in front of the file. The whole reason you're writing this down is that the model is the unreliable component. Layering more model-mediated steps on top of "remember this" is the architectural equivalent of asking the goldfish to file its own memos.
The runtime layer matters. So does the substrate. None of it replaces the rule.
## The actually-working answer
A single file. The agent reads it at the start of every session. Append-only. Numbered. Dated. Linked to the actual incident.
NEXUS — my agent setup, specifically the operating layer that wraps Claude Code on my machine — formalizes this in CLAUDE.md as a behavioral protocol called Mistakes Become Rules. The wording is exact and short:
**Trigger:** Any time Mike corrects your approach, points out an error, or says something like "no, not that" / "don't do X" / "you should have…"
**Action:** Immediately add a numbered entry to MEMORY.md's "Hard-Won Lessons" section: [next number]. **[short title]** — [what went wrong and the rule to follow going forward].
**On session start:** Read and internalize all Hard-Won Lessons before beginning work.
That is the entire loop. There is no reflection agent. There is no nightly job. There is no dashboard. The trigger is the correction itself, the action is one append, and the agent reads the file the next time it boots up.
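Concretely, one appended entry is a single numbered line in MEMORY.md. A hypothetical rendering of Lesson #15 (described below) in that format — the exact wording in my file differs:

```markdown
## Hard-Won Lessons

15. **LaunchAgent log paths go on local disk** — launchd-spawned
    processes were TCC-blocked from writing logs to the SMB-mounted
    path (exit code 78, no log output). Point LaunchAgent logs at
    local disk, never the NAS.
```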
The file currently has twenty entries. Each one came from a specific incident, on a specific date, that cost me time. A few of them, with the context that made them rules:
- Lesson #15 — LaunchAgent log paths must be on local disk, not SMB. On 2026-04-19, six LaunchAgents in my finance service silently broke. macOS TCC was blocking launchd-spawned processes from writing logs to the NAS-mounted path, even though the same SSH user could write there fine. Exit code 78. No log output, because the log path was the problem. Took an afternoon to diagnose. The rule is one sentence. The rule writes itself.
- Lesson #19 — Never `import()` a publish script "to test it." On 2026-04-29, two test imports of `publish-agent-id-role.ts` raced because the script invokes `main()` at module top level. Result: duplicate posts on LinkedIn (twice), X (twice), and Ghost (one extra, deleted via Admin API). Late.dev refuses to delete already-published content, so the cleanup was manual. The rule: validate publish scripts with `tsc --noEmit`, a `--dry-run` flag, or by reading them. Never with `import()`.
- Lesson #20 — PM2 `script: "npm"` ignores the app's `env.PATH`. On 2026-05-01, the health-api service kept reporting `online` while the port wasn't listening. PM2 was launching `npm` from the daemon's PATH, not the app's, which meant `better-sqlite3` (compiled for Node 22) was loading under Node 25 and crashing on `ERR_DLOPEN_FAILED`. Fix: pin `script` to the absolute path of the desired npm (sketched below). Same idea as Lesson #16, now for PM2 instead of launchd.
Each entry took less than a minute to write. Each one prevents the same hour-long failure from happening twice. The compounding is the entire point.
## Where hooks fit
Lessons live in markdown because that's how the agent absorbs them at session start. But there's a runtime layer underneath, and it has a real role.
Claude Code's hooks — PreToolUse, PostToolUse, SessionStart, and friends — let you intercept tool calls deterministically. If a lesson can be reduced to a regex on a command string ("never run rm -rf outside /tmp"), a hook is a better enforcement point than a markdown bullet, because the markdown bullet relies on the model reading and obeying it. The hook does not.
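A minimal sketch of that enforcement point, assuming the documented hook contract — the event arrives as JSON on stdin, and exit code 2 blocks a PreToolUse call while feeding stderr back to the model. The script name is hypothetical; it would be wired up in `.claude/settings.json` under PreToolUse with a Bash matcher:

```typescript
// pre-tool-guard.ts — a PreToolUse hook sketch. Payload shape and
// exit-code convention follow Claude Code's hook documentation.
interface PreToolUseEvent {
  tool_name: string;
  tool_input: { command?: string };
}

let raw = "";
process.stdin.on("data", (chunk) => (raw += chunk));
process.stdin.on("end", () => {
  const event: PreToolUseEvent = JSON.parse(raw);
  const cmd = event.tool_input?.command ?? "";

  // The lesson, made deterministic: never run rm -rf outside /tmp.
  if (event.tool_name === "Bash" && /rm\s+-rf\s+(?!\/tmp\b)/.test(cmd)) {
    console.error("Blocked: rm -rf is only allowed under /tmp (see MEMORY.md).");
    process.exit(2); // exit 2 blocks the tool call; stderr goes back to the model
  }
  process.exit(0); // anything else passes through
});
```

The markdown bullet explains why the rule exists; the hook enforces it whether or not the model reads anything.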
Same logic for Claude Code Skills: they're great for packaging a procedure with its own tools and supporting files. They are not a substitute for the rule. They're a substrate the rule can sit on top of.
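For shape: a skill is a directory whose SKILL.md carries YAML frontmatter the agent uses to decide when to load it. A rough illustration — the skill name and steps here are invented for the example, built on Lesson #19:

```markdown
---
name: publish-safely
description: Validate publish scripts without triggering their side effects.
---

## Procedure
1. Type-check with `tsc --noEmit`.
2. Prefer a `--dry-run` flag if the script has one.
3. Never `import()` the script to test it (Hard-Won Lesson #19).
```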
The hierarchy I run with: durable rules in the markdown file, deterministic enforcement in hooks where the rule is regex-shaped, and skills for procedures with multiple steps. None of those is a self-improvement loop. None of them runs at midnight. None of them has a grading subagent. They are all read or executed at the moment they apply.
## How you know it's working
The test is simple. You stop hearing the same compliment twice.
If your agent says "good catch" today, look it up tomorrow. Is the lesson in your file? Did the agent read the file before it started working? If yes to both, you should never hear "good catch" on that specific topic again. If you do, the rule is wrong, the file isn't being read, or the lesson didn't generalize. All three are debuggable. None of them require a reflection agent.
Praise without persistence is a leak. Patch the leak, do not build a recycling system for the runoff.