How I Taught My AI Coding Assistant to Stop Forgetting
Or: what happens when you write 5,000 lines of bash to solve a problem that might not need 5,000 lines of bash.
I lost a full day to a silent bug in my pipeline. Everything looked fine. The query ran, no errors, green across the board. But the data wasn't updating. Turns out the database call was succeeding and returning nothing at the same time, so the pipeline kept thinking there was no work to do. Every ten minutes, for an entire day, it ran and did absolutely nothing. I found the fix, wrote it up in my project instructions so I'd never make the same mistake, and moved on.
Next session, Claude made the same mistake. Word for word, the same pattern I'd just spent a day debugging. I corrected it. Explained why. Next session after that? Same thing. The rule was right there in the instructions. Claude had read it out loud. And then ignored it.
If you've used AI coding tools for any real project, you know this feeling. They're impressively capable. They can architect entire features and refactor complex codebases. But they can't remember what happened yesterday. Every session is a clean slate. You write detailed project rules, Claude reads them, agrees with them, and then does the exact thing you told it not to. It's not a question of intelligence. It's a memory problem. And instructions sitting in a context window are not the same thing as enforcement.
I got tired of being the one who remembers everything. So I built a plugin that does it for me. I call it Cortex.
What It Actually Does
The simplest way to explain it is to show you. A normal Claude Code session starts like this:
> claude
Welcome to Claude Code!
>
Nothing. No memory of yesterday. No idea what you were working on. No guardrails. Just a blank prompt.
With Cortex installed, the same session starts like this:
✏️ 0 edits · 📦 0 commits · 🧪 — · 🚀 —
💚 thriving • 🧠 62 absorbed • 🧬 0 mutations queued • → stable
Carry-over from yesterday:
- Scoring pipeline timeout needs investigation
- 3 open Dependabot PRs → review/merge
Lessons surfaced (pipeline work detected):
- PostgREST .update().or().select() returns empty → use read-then-write
- Yahoo+SEC parallel, 600-ticker batches, targeting ~200s
Before I've typed a single character, Cortex has already read yesterday's journal, pulled up the things I didn't finish, taken a guess at what domain I'm about to work in (keyword matching, nothing fancy, literally just grep), and surfaced past mistakes that are relevant to today's work.
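That domain guess really is just grep. A minimal sketch of the idea — the file names and keyword lists here are made up for illustration, not the plugin's real rules:

```shell
#!/usr/bin/env bash
# Guess the working domain by counting keyword hits in yesterday's journal.
# Keyword lists and the "domain:regex" format are illustrative.
detect_domain() {
  local journal="$1"
  local best_domain="" best_count=0
  local rules="pipeline:pipeline|batch|ticker|scoring
database:postgres|postgrest|migration|sql"
  while IFS=: read -r domain pattern; do
    local count
    # grep -c exits nonzero on zero matches but still prints "0"
    count=$(grep -ciE "$pattern" "$journal" 2>/dev/null || true)
    if [ "${count:-0}" -gt "$best_count" ]; then
      best_domain="$domain"
      best_count="$count"
    fi
  done <<< "$rules"
  echo "${best_domain:-general}"
}
```

The domain with the most keyword hits wins; if nothing matches, fall back to a generic bucket.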
On the other end of a session, it won't let me quit until I've committed my changes, run tests, and updated docs if I touched anything architectural. That gets annoying sometimes, I won't lie. But it's also caught me walking away from uncommitted work more times than I'd like to admit.
And then there's the enforcement layer. If Claude tries to write a dangerous pattern into a migration file, the tool call gets rejected before it touches the file. Not a warning in the chat. The write physically doesn't happen.
Why Enforcement and Not Just Instructions
This is the hill I'll die on: instructions don't work as guardrails. I tried, extensively. I wrote clear rules. Claude read them. Then it did the exact thing I told it not to. The rule was sitting right there in context. Didn't matter.
The realization I came to is that there's a fundamental difference between telling someone "don't do this" and making it physically impossible to do. Cortex leans heavily on the latter.
The migration linter is a good example. It checks SQL files for patterns that will break in production, like using now() in a WHERE clause (Postgres needs those functions to be immutable, and now() isn't). If it finds one, the write just doesn't happen. Claude sees the rejection, understands why, and rewrites it correctly. Every single time, without fail.
if echo "$content" | grep -qiE \
    'WHERE[^;]*\b(now\(\)|CURRENT_DATE|clock_timestamp\(\))'; then
  # -q keeps grep's matches off stdout, which must stay valid JSON for the hook
  printf '%s' '{"hookSpecificOutput":{"hookEventName":"PreToolUse",' \
    '"permissionDecision":"deny"},' \
    '"systemMessage":"BLOCKED: now()/CURRENT_DATE in WHERE clause. ' \
    'Not IMMUTABLE -> PostgreSQL rejects these in partial indexes."}'
  exit 0
fi
That's the entire enforcement for this rule. A grep and a JSON response. Nothing clever about it. But it's solved a problem that no amount of instruction writing could fix.
Not everything needs to be a hard block though. There are softer layers too. Commit message format, documentation freshness, test coverage after editing TypeScript files. Those are nudges, not walls. And then there's passive tracking underneath all of it. Edit counts, file modifications, tool call patterns. Data that feeds into the health system without getting in anyone's way.
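A nudge differs from a wall only in the JSON it emits: no deny decision, just a message that rides along while the tool call goes through. A sketch of what a soft post-edit reminder could look like — the TypeScript trigger is illustrative, not the plugin's exact rule:

```shell
#!/usr/bin/env bash
# Soft nudge: the edit already happened; this just surfaces a reminder.
# The .ts/.tsx trigger is an illustrative condition, not the plugin's real one.
nudge_if_typescript() {
  local file_path="$1"
  case "$file_path" in
    *.ts|*.tsx)
      # Note: no "permissionDecision":"deny" — nothing is blocked here
      printf '%s' '{"systemMessage":"Reminder: you edited TypeScript. Run the test suite before committing."}'
      ;;
    *)
      printf '%s' '{}'
      ;;
  esac
}
```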
The stop gates at session end are probably my favorite piece. Four checks run before you can close: uncommitted edits, docs updated, tests run, carry-over items addressed. If any fail, the session gets blocked. But I learned pretty quickly that you need an escape hatch. After two consecutive blocks on the same issue, it lets you through. I added that after the plugin literally wouldn't let me close my laptop at midnight one night.
# Escape hatch: after two consecutive blocks, let the session end anyway
consecutive=$(read_field "consecutive_blocks" "$STATE_FILE")
if [ "${consecutive:-0}" -ge 2 ]; then
  write_field "consecutive_blocks" "0" "$STATE_FILE"   # reset for next session
  printf '%s' '{"systemMessage":"Force-approved after acknowledgment."}'
  exit 0
fi
One thing I realized pretty early is that not every project needs all of this. Sometimes you're doing a quick fix and you don't want the full organism breathing down your neck. So there are three profiles: minimal, which runs only the enforcement hooks; standard, which adds the learning and tracking systems; and strict, which turns everything on, including the sensory checks that watch CI status and remote pushes. You set the profile with an environment variable or a config file, and the bootstrap filters which hooks get injected based on whichever one is active.
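The filtering itself doesn't need much. A sketch of how a bootstrap might decide which hook groups to inject, assuming a CORTEX_PROFILE environment variable and illustrative group names:

```shell
#!/usr/bin/env bash
# Decide whether a hook group is active under the current profile.
# Group names are illustrative; unknown profiles fail safe to enforcement only.
hook_enabled() {
  local group="$1"
  local profile="${CORTEX_PROFILE:-standard}"
  local allowed
  case "$profile" in
    minimal)  allowed="enforcement" ;;
    standard) allowed="enforcement learning tracking" ;;
    strict)   allowed="enforcement learning tracking sensory" ;;
    *)        allowed="enforcement" ;;
  esac
  # Whole-word membership test against the space-separated list
  case " $allowed " in
    *" $group "*) return 0 ;;
    *)            return 1 ;;
  esac
}
```

The bootstrap would call this per hook entry and skip injecting anything that returns nonzero.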
The Organism Thing
I want to be upfront about something. The "13 biological systems" framing wasn't some grand design decision. I didn't sit down with a whiteboard and architect a living organism. What actually happened is I kept building things. A migration linter one week, a state tracker the next, then context injection, health metrics, a stop gate. Each one solving whatever was annoying me at the time. At some point I stepped back and realized I had about a dozen bash scripts that all read from the same state files without any awareness of each other.
That's when the biological metaphor clicked. These scripts were behaving like independent organs. They didn't call each other. They just read shared signals and did their own thing. If one crashed, the others kept running. So I leaned into it and named them. Nervous system for state tracking, immune system for enforcement, circulatory for context injection, and so on down the line.
I'll be the first to admit some of this is over-engineered. Thirteen systems for a Claude Code plugin is a lot. Some of them barely do anything. The "social system" is maybe five lines of code that track which files keep getting re-edited across sessions. The "feedback system" is basically a grep. I could probably collapse half of them and nothing would change.
But the naming genuinely helps me think about the architecture. "The immune system should handle this" is a more useful thought than "add a regex to pre-dispatch.sh." And when people ask what the plugin does, saying "13 biological systems" communicates the scope immediately in a way that a feature list doesn't.
Underneath all the naming, it's bash scripts hooking into Claude Code's lifecycle events to enforce rules, track state, and inject context. The biology is just how I think about the pieces fitting together.
Health Tracking
I've been calling Cortex "self-improving," so I should probably be honest about what that actually means in practice. It's way less magical than it sounds.
At the end of every session, the plugin tallies things up. How many edits happened between commits, whether tests got run, how many files were touched, whether documentation was updated, how many times I had to step in and correct a mistake. All of that gets written to a health file that tracks rolling averages over the last 10 sessions.
The statusline gives you a read on where things stand:
💚 thriving • 🧠 62 absorbed • 🧬 0 mutations queued • ↗ improving
That heart color actually means something. Green when everything looks healthy, yellow for stable, orange when the plugin decides things have been rough lately and starts injecting extra planning reminders before letting you dive into code. Red when things are genuinely going poorly. It's not complicated logic. If reasoning corrections trend upward, the system gets more cautious. If they trend down, it relaxes.
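The trend logic can be sketched in a few lines of awk: compare the recent half of the window against the older half. This assumes a plain log with one corrections count per line, which is a simplification of the real health file:

```shell
#!/usr/bin/env bash
# Classify health from a log of per-session correction counts, one per line.
# Compares the mean of the newest half of the last 10 sessions to the older half.
health_state() {
  local log="$1"
  tail -n 10 "$log" | awk '
    { vals[NR] = $1; n = NR }
    END {
      if (n < 2) { print "stable"; exit }
      half = int(n / 2)
      for (i = 1; i <= n - half; i++) old += vals[i]
      for (i = n - half + 1; i <= n; i++) recent += vals[i]
      old_avg = old / (n - half); recent_avg = recent / half
      if (recent_avg > old_avg)      print "declining"   # corrections trending up
      else if (recent_avg < old_avg) print "improving"   # corrections trending down
      else                           print "stable"
    }'
}
```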
There's an evolution system too, where an analyzer agent looks at patterns in my corrections across sessions and proposes new rules. In practice I've used this maybe three times. It's the most aspirational part of the whole project. The idea of the system actually learning from my mistakes and proposing its own improvements is compelling, but the reality right now is that I manually run the analyzer and it gives me suggestions that I then decide whether to implement. The health tracking side of things is genuinely useful and I rely on it. The self-modification side is still finding its legs.
What Actually Sucked
Debugging this plugin has been one of the most frustrating things I've ever done as a developer. I'm not exaggerating when I say I spent multiple full days going in circles because of foundational bugs that would intermittently break the entire system. The kind of bugs where everything works, then doesn't, then works again, and you start questioning your own sanity.
The Hooks That Ate Themselves
Early on I built PreToolUse and PostToolUse hooks with conditional logic. "If the file is in this directory, show a reminder. Otherwise, skip." Sounds reasonable. What actually happened is the hook runtime interpreted the skip condition as a failure and started blocking every single tool call. Write? Blocked. Edit? Blocked. Bash? Blocked. Claude couldn't even fix the broken hooks because the hooks were intercepting the repair attempts. I ended up having to write raw JSON through a node -e one-liner because it was the only way to bypass the hook system entirely.
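The bypass itself was mundane. Something in this shape — writing the file through node so no Write tool call (and therefore no hook) is ever involved. The target path and the replacement content here are illustrative:

```shell
#!/usr/bin/env bash
# Escape a hypothetical broken-hook lockout: write JSON via node instead of the
# Write tool, so no PreToolUse hook can intercept it. Content is illustrative.
fix_settings() {
  node -e '
    const fs = require("fs");
    // Wipe the (hypothetical) broken hook entries and write the file directly
    fs.writeFileSync(process.argv[1], JSON.stringify({ hooks: {} }, null, 2));
  ' "$1"
}
```

Usage would be something like `fix_settings ~/.claude/settings.json`, run from a plain terminal rather than inside the session.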
The Hook Bug (The Real One)
Claude Code plugins have a hooks.json file where you declare what events to listen to. It doesn't work. Not in a flaky, sometimes-it-works way. It just straight up doesn't fire for most events. Only SessionStart works. I spent three sessions building increasingly stripped-down test scripts trying to figure out why before I accepted it was a platform bug and filed it.
So the workaround is: on every SessionStart (the one event that actually works), a bootstrap script shoves all the other hook events into the global settings file. It cleans up old entries when versions change, runs every boot. The entire plugin sits on top of a workaround for a platform bug. That's just where we're at.
But even that had layers. At first I thought only PreToolUse and PostToolUse were broken. Then I found out project-level settings files are also unreliable. Only the global ~/.claude/settings.json actually fires hooks consistently. I tested every combination of config file location and event type before I was confident in the workaround.
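Conceptually, the bootstrap is a merge into that one reliable file. A simplified sketch using jq — the real script also versions and prunes entries, and the matcher and command here are illustrative:

```shell
#!/usr/bin/env bash
# Inject a PreToolUse hook entry into a settings file (in practice, the global
# ~/.claude/settings.json, the only location where hooks fired reliably).
# Assumes jq is available; the matcher and hook command are illustrative.
inject_hook() {
  local settings="$1" cmd="$2"
  local tmp
  tmp=$(mktemp)
  jq --arg cmd "$cmd" '
    .hooks.PreToolUse = [
      { "matcher": "Write|Edit",
        "hooks": [ { "type": "command", "command": $cmd } ] }
    ]
  ' "$settings" > "$tmp" && mv "$tmp" "$settings"
}
```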
The Ghost Hooks
The plugin caching system doesn't clean up old versions. So when I updated from v2.0.0 to v3.1.0, both versions existed in the cache simultaneously. The old version still had the broken conditional hooks from the "hooks that ate themselves" incident. Random Write and Edit calls would get intercepted by the ghost version. I spent an entire session troubleshooting why hooks I'd already fixed were still firing before I found the stale cache folder sitting there quietly ruining everything.
A similar thing happened later with the bootstrap system. I had two versions of the stop-gate hook registered in the global settings, v3.4.1 and v3.6.0, both firing on the same event. Each one wrote to a different state file. The escape hatch (which opens after two consecutive blocks) never triggered because each version only ever saw one block. The counter never reached two. I only caught it because I added debug logging that printed which state file was being read on every invocation.
The Version Bump Trap
Here's one that burned me more than once: if you commit code changes to the plugin in a separate commit from the version bump, the marketplace update system sees "already at v3.4.0" and skips the download. Your code changes never make it to users. The fix is to always bump the version number in the same commit as the code. Simple rule. Took me three broken releases to figure it out.
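The rule is cheap to automate. A sketch of a pre-commit style check, with hypothetical paths (hooks/ standing in for plugin code, plugin.json for wherever the version lives):

```shell
#!/usr/bin/env bash
# Guard against releases where plugin code changes ship without a version bump
# in the same commit. Paths (hooks/, plugin.json) are hypothetical.
check_release() {
  local staged="$1"   # newline-separated paths, e.g. from `git diff --cached --name-only`
  if echo "$staged" | grep -q '^hooks/' && ! echo "$staged" | grep -q '^plugin\.json$'; then
    echo "blocked"
  else
    echo "ok"
  fi
}
```

Wired into .git/hooks/pre-commit as `check_release "$(git diff --cached --name-only)"`, it refuses exactly the commits that would produce a release the marketplace updater skips.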
Windows + Bash
This plugin is 5,276 lines of bash running on Windows through Git Bash. Here's a partial list of things that broke:
- cut -d: -f1 on C:/Users/... splits at the drive letter, not the path separator.
- awk -v path="C:/foo" silently rewrites the path through MSYS translation. You have to use ENVIRON["path"] instead.
- set -euo pipefail with ls *.local.md 2>/dev/null | head -1 kills the entire script silently when no files match, because ls returns exit 2 and pipefail propagates it.
- Git Bash converts /c/Users to C:\Users in some contexts but not others, with no obvious pattern.
Why bash and not TypeScript or Python? Because bash is already there if you have Git. No runtime to install, nothing to configure. The tradeoff is you eat all of these. I keep a file of every gotcha I've hit so I don't repeat them. That file is 62 entries long and growing.
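For what it's worth, the pipefail gotcha above has a safe shape I'd reach for instead: let the shell expand the glob itself and test the result, so no failing ls ever enters a pipeline. A sketch:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Safe replacement for `ls *.local.md | head -1` under pipefail:
# expand the glob in the shell and check whether anything matched.
first_local_md() {
  local dir="$1"
  shopt -s nullglob              # unmatched globs expand to nothing, not themselves
  local matches=("$dir"/*.local.md)
  shopt -u nullglob
  if [ "${#matches[@]}" -gt 0 ]; then
    echo "${matches[0]}"         # globs sort, so this is the first alphabetically
  fi
}
```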
Testing (The Only Reason Any of This Works)
Just over half of the codebase is tests: 2,663 lines across 28 test scripts, against 2,613 lines of production code. Fixture factories that generate mock state files, sandbox directories for isolation, regression tests for specific bugs I've been bitten by.
Is that a normal amount of testing for a bash project? Not even close. But this code runs invisibly inside someone else's tool. If a hook crashes, there's no error message. No stack trace. It just silently stops working and you have no idea until you notice something isn't being tracked anymore.
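The harness itself is nothing exotic: a sandbox directory per run plus a couple of assert helpers. Roughly this shape, with illustrative helper names rather than the plugin's real ones:

```shell
#!/usr/bin/env bash
# Minimal test scaffolding: an assert helper and a throwaway sandbox per run.
# Helper names are illustrative, not the plugin's actual harness.
failures=0

assert_eq() {
  local expected="$1" actual="$2" label="$3"
  if [ "$expected" = "$actual" ]; then
    echo "ok   - $label"
  else
    echo "FAIL - $label (expected '$expected', got '$actual')"
    failures=$((failures + 1))
  fi
}

make_sandbox() {
  # Isolated temp dir so tests never touch real state files
  mktemp -d "${TMPDIR:-/tmp}/cortex-test.XXXXXX"
}
```

A test script runs its assertions against a fresh sandbox and exits with the failure count, so CI sees a nonzero status on any failure.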
Nothing Can Be Trusted
This is the thing that took me the longest to internalize. Every hook has to assume the worst. Called twice in a row? Handle it. State file doesn't exist yet? Handle it. State file exists but is corrupted? Handle it. Fields that should be there aren't? Handle it. A previous version wrote a format you don't recognize? Handle it.
The healing system runs 9 self-repair checks every time the plugin boots up. Corrupted state files, out-of-range counter values, bloated file lists, missing section headers, oversized health logs, stale temp files, missing fields, orphaned sessions, and legacy format migration. All of that runs before any other system touches the state. Because I learned the hard way that if you don't check, the one time a file gets corrupted is the time that silently breaks your commit tracking for a week before you notice.
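Every healing check follows the same shape: validate, and on failure rewrite the file to a known-good default instead of crashing. One such check sketched below — the key=value state format is an assumption for illustration; the real format differs:

```shell
#!/usr/bin/env bash
# Self-repair check: if the state file is missing or doesn't look like
# key=value lines, reset it to known-good defaults rather than failing.
# The key=value format and field names are illustrative assumptions.
heal_state_file() {
  local state="$1"
  if [ ! -f "$state" ] || ! grep -qE '^[a-z_]+=' "$state" 2>/dev/null; then
    printf 'consecutive_blocks=0\nedits_since_commit=0\n' > "$state"
    echo "healed"
  else
    echo "ok"
  fi
}
```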
By the Numbers
For the people who like to skip to the stats:
- v3.7.0, 13 versions over a few months of building
- 5,276 lines of bash across 49 files (2,613 production, 2,663 tests)
- 7 hook events, 12 skills, 2 agents, 5 commands
- 4 stop-gate checks with a 2-block escape hatch
- 3 profiles: minimal (enforcement only), standard (+ learning), strict (everything)
- 9 self-repair checks on every boot
- 62 gotchas catalogued in lessons.md and counting
Where This Goes From Here
The biggest limitation right now is how Cortex decides what context is relevant. It's keyword matching. Literally grepping through file contents for domain words. That works fine for the patterns I've anticipated, but it completely misses anything I didn't think to tag ahead of time.
What I want to build next is semantic search. A local server running embeddings that can find past sessions by what they were about, not by whether I happened to use the right keyword. I want to be able to say "find the session where I debugged the pipeline timeout" and actually get it, instead of hoping past-me tagged it correctly.
I also think a lot about whether this whole thing is over-built for one person working on one project. And honestly, in some places it probably is. But the enforcement hooks pay for themselves every single session without fail. The health tracking has been genuinely useful for catching those stretches where my workflow starts slipping without me realizing it. The 13-system naming is mostly just how I keep things organized in my own head. And the evolution proposals are aspirational at best.
But the core idea, that AI coding tools need real persistent memory and real enforcement and not just bigger context windows and better prompts, I believe in that more now than when I started.
If You Want to Try It
Cortex is open source and free. Two commands to install:
claude plugins marketplace add Undercurrent-Studio/undercurrent-cortex
claude plugins install cortex@undercurrent-studio
Restart your session and you should see the statusline come up on the next boot. If you run into issues or want to see what 5,000 lines of bash looks like when it's trying to be a biological organism, the source is all on GitHub.
GitHub | Built by Will Flour at Undercurrent Studio
If you found this interesting, I write about building with AI tools and the data infrastructure behind Undercurrent, a stock research platform I'm building. You can find me on GitHub.