DEV Community: kanfu-panda

AI Memory Grows Weeds: Why Timely Pruning Matters

kanfu-panda — Thu, 09 Jul 2026 23:40:25 +0000

Last time we talked about how to build a memory system for AI. But building is only the beginning—the longer a project runs, the more memory accumulates, and without cleanup, weeds quietly grow in it. This time, I gave the AI memory across several of my projects a systematic "weeding."

As a project moves forward, the memory gathered during AI collaboration keeps piling up. Each project's memory is like a dedicated notebook: while working, the AI jots down the pitfalls it hit and the rules we settled on. This does deepen its grasp of the project, but the notebook grows thicker and messier—and quite a few entries are long expired, though the AI still treats them as rules it must obey.

A real example. On one project, the AI immediately acted on an old memory. It read "an external call is already configured, use it directly," so it skipped the step where it should have re-verified and pushed straight ahead—and hit a wall. The catch: that memory was just a snapshot from a few weeks earlier, long invalid; yet from start to finish the AI never doubted it for a moment.

That made me realize: a wrong memory can be worse than no memory. With no memory, the AI will at least check honestly first; with a wrong one, it walks off course with full confidence. Once memory grows, it goes stale, gets tangled, contradicts itself—and someone has to clean it up in time, or the AI's efficiency can't really be counted on.

Two things I set out to do

One, seriously clean up the memory library of each project; two, take the chance to survey what kinds of "weeds" memory actually grows.

Neither has a shortcut—both need hands-on human work, deletion above all, which I would never fully hand to the AI. It can propose what to delete and what to merge, but the final call must be a person's. What gets deleted is often historical record, and once the AI gets careless it can wipe out something important along with it. Cleaning memory demands real caution.

But let me say it up front: manual weeding only treats the symptom. Weeds keep coming back because the memory system itself lacks certain mechanisms—pull them today, and they grow back in a while.

First, give the AI the "sleep" it lacks

To see why memory needs cleaning, it helps to look at how humans do it—through sleep.

There is solid science behind this. During sleep, especially deep sleep, the brain "replays" the day's experiences, gradually moving important memories from the hippocampus into the neocortex for archiving; the more important the content, the more often it is replayed, and the more firmly it sticks. At the same time, sleep actively "prunes" the unimportant and outdated, so the brain isn't stuffed with useless information.

The AI has no such automatic tidying. It only stacks memory downward, one entry after another, never going back to reorganize. So this time, I manually gave it that "sleep": I read through each project's memory once, checked it entry by entry, and did five things—merged duplicates, cut the expired, distilled scattered notes into one, linked the related, and the easiest to overlook: hunted for entries that contradict each other.

Six typical memory "weeds"

After systematically going through the memory libraries of several projects, I found that although the weeds differ in form across projects, they boil down to six typical patterns.

One, ghost copies. A global rule gets recorded again inside a specific project, but the AI has no "inheritance" mechanism. When the global rule is updated, these isolated copies can't sync, so an outdated instruction keeps taking effect.

Two, frozen state. Some memory records an in-progress state at a given moment (in development, under review). With no state-update mechanism, even after the task is long finished, the AI still treats it as unresolved and raises useless reminders in unrelated conversations.

Three, stale snapshots. The most dangerous kind. An expired memory not only loses reference value but also misleads the AI into taking wrong information as fact and confidently executing the wrong action.

Four, split context. Some entries are each accurate read alone, but only yield the right conclusion in a specific context. Because the linking note sits in another file, the AI easily misses it while retrieving, and takes things out of context.

Five, conclusion detached from reasoning. When underlying data changes after a bug fix, the old conclusion is merely overwritten by the new one, never explicitly marked obsolete. Since the memory system stores only conclusions, not the reasoning, it can't automatically spot and clear the related conclusions that quietly went invalid.

Six, structural disorder. This includes memory stored in the wrong directory, and silent broken links caused by inconsistent naming (hyphen versus underscore).

Reviewing these six, I reached a counterintuitive conclusion: most of these weeds don't come from being recorded wrong in the first place—they were recorded correctly, then rotted for lack of upkeep. The greatest hazard of a memory system isn't "recording wrong," it's "record and abandon."

One step where I deliberately held back

During cleanup, there was one step I deliberately held back on.

I had planned to merge three similar memories about a "cloud data trap" in one project into a single checklist. But halfway through, I realized the three map to three concrete business risks—duplicate reward payouts, missing leaderboard data, new users failing to be saved. Over-abstracting them into a single "watch the cloud fields" would erase these life-saving details.

Distilling knowledge matters, but over-distilling erases the exceptions. So I changed course: keep all three original memories, and only link them with a tag, so they "cluster together" rather than "boil into one pot."

The hardest part: reconciling memories that "look contradictory"

The most challenging part of the whole cleanup was reconciling logical conflicts.

For example, on retrieving data from an interface, one memory said "returns real data," another said "returns anonymous data." On checking, both were correct—only the trigger differs: automatic retrieval returns anonymous data, while a user's manual button tap returns the real thing.

For such seemingly contradictory memories, simply deleting one only loses information; the right move is to clearly delimit each one's applicable scope. Add that "branch condition" into the memory, and the conflict resolves itself.

This leads to a core takeaway from the review: the most dangerous thing in a memory system isn't "information being inconsistent," it's "information being inconsistent with no scope defined."

Why manual cleanup never finishes

By the later stage I understood: manual weeding alone can't fix the root. These weeds keep coming back because the system lacks three core mechanisms—without "inheritance," global rules get redundantly re-recorded; without "expiry," state-type memory is frozen forever; without "ownership and consistency checks," memory ends up misplaced and links break silently.

Was this cleanup worth it?

On return-on-effort: for small or new memory libraries, the payoff is limited; but for large, aging ones, the value is considerable—it clears out plenty of broken links, stale names, and frozen states.

Yet the real gain from this review isn't the short-term "tidiness," it's building a systematic understanding of how memory decays—and that understanding has long-term value.

Which memories are least likely to grow weeds

While checking, I also noticed that memories resistant to decay share one trait: they record "confirmed decisions and principles" (the rationale, the path, the hard limits), not "in-progress states." Conclusion-type memory is stable across time; state-type memory expires easily.

So when writing to memory, follow one principle: prefer recording "conclusions that won't expire," and avoid recording "states that will."

Three practical lessons

One, memory needs periodic maintenance. A memory library decays by nature; it should be restructured on a regular basis—merge, prune, distill, link, and reconcile conflicts—to keep it from steadily degrading.

Two, high-risk operations need a human in the loop. For irreversible acts like deletion and distillation, human review is a must; and clear out the expired and self-contradictory first, before wrong information misleads the system.

Three, the real fix is systemic. Manual cleanup is only a stopgap; the root solution is to build automatic inheritance, expiry, and consistency checks, so the library metabolizes on its own.

Next: a memory tool that "self-diagnoses"

Based on this review, the next direction is clear: build a memory-management tool with a "self-diagnosis" capability. It needs three core functions—trigger cleanup reminders on schedule, scan and diagnose memory layer by layer, and present the issues it finds in a structured way for a human to decide on. And on the final act, the system only advises; the decision to delete or merge always stays with the person.

This idea didn't come out of nowhere; it's the near-inevitable conclusion after a full review. Next, I plan to actually build it.

If you're also building a memory library for your AI, stay alert to memory's natural decay and the "weeds" it grows. If this review nudges you to look back at your own memory library, a follow, a like, or a share with peers exploring AI memory management would mean a lot.

How to make your AI actually get you: give it a memory

kanfu-panda — Mon, 06 Jul 2026 22:34:40 +0000

I keep tripping over the same AI in the same spot.

Take Ant Design 6 in a front-end project. I've told it, over and over, to use the new syntax—I even make it run context7 to check the API before it writes a line. It nods along, then goes right ahead and writes a pile of long-deprecated old syntax. Same with my release process: I've laid out the rules clearly, and a while later it forgets, doing things its own made-up way.

Once or twice is a slip. After enough times, I wanted to understand why: it's not dumb, it simply has no memory.

(This picture helps: every new conversation, the AI is like a new hire who clocks in fresh and forgets everything when they clock out.)

What I want isn't a stronger model

At first I was puzzled too: a model this capable, and it can't even hold on to "what we said last time"? To figure it out, I started watching it—how it actually runs when it answers and works, where it drops the ball.

Watching long enough, one thing clicked, and it's what this piece is about: what I want isn't to swap in a stronger model, but to make the AI I already have understand me better and better within my projects. The lever is "memory."

Let me put the ceiling up front, so this doesn't sound like snake oil. Memory doesn't raise the model's IQ. It raises exactly one thing—how well this AI works for you specifically. And as you'll see, even once it remembers, it may not obey. Ceiling set. Let's go.

Starting from the human brain: where AI memory needs work

I want to explain this through the human brain, because AI memory is, at bottom, a clumsy imitation of how people remember and forget. Understand that, and you'll see which directions AI needs to shore up.

One: the AI is a "new hire" every single day

First, accept a counterintuitive fact: the model itself has no memory. Every time you open a conversation, all it "knows" is the text you put in front of it this one time—everything else is blank. What you discussed last round, the rule you set yesterday, it remembers none of it.

Here's a metaphor. Everything it can see this one time is like a desktop—you lay materials out, it can use them; the moment the conversation ends, the desktop wipes clean, nothing left.

	Like	What it is
Context window	The desktop	What's spread in front of it this time; wipes clean when the chat ends
Memory	Drawers, filing cabinet	Stored outside; pulled out and laid back on the desk when needed

Someone will say: just make the desktop bigger, right? Isn't everyone racing on "ultra-long context"? But a bigger desk is still a desk—it empties when the power's off; it isn't memory. Worse, pile it with irrelevant paper and the AI gets more easily distracted and answers worse. There's a growing consensus these past couple of years: more context isn't better—stuff in a heap of tangentially-related material and the model's performance visibly drops.

So "giving AI a memory" was never about the model remembering on its own, nor about making the desktop infinitely large. It's about having a system outside that, every time it clocks in, lays exactly the right few sheets back in front of it.

Claude Code actually ships with such a system: one memory per small file, plus an index file as the table of contents, pulling in only the entries you need. Honestly, this arrangement isn't my design—it's built into the software. I didn't pay it much attention at first; I just treated it as a place to stash notes. My guess is it's designed this way precisely to save the desktop: no need to dump everything on the table every time, take it as needed.

(This picture helps: an always-loaded CLAUDE.md + an index + on-demand memory files—what you save is the precious "desktop.")

You may already want to say: isn't this just a built-in Claude Code feature, what's there to write about? The structure is its, true. But after using it—and falling on my face a few times—I found this: the software builds out "where things are stored and how they're pulled," but "what to store, when to clean, and what to do when it's remembered but ignored"—the three that actually matter—it does none of for you. That's what this piece is about.

Two: good memory is structure, not a pile of paper

To make memory useful, first understand this: good memory has structure; it isn't piling more and more in.

Psychology has a very convincing experiment. Show a chess master a real game position and they can put it back almost exactly after a couple of glances. But scatter the pieces randomly, against any chess logic, and the master's recall drops right back to a beginner's level.

This shows the master doesn't have "more capacity." They can remember a real position because there's structure in their head—they compress twenty pieces into a few meaningful "shapes." Kill the structure and the edge vanishes instantly. The strength of memory isn't in how much you store, but in how good the structure is.

(This picture helps: the same pile of pieces—only with structure do you remember it. Memory is about structure, not capacity.)

Why does structure make you remember more accurately and deeply? Because structure is essentially "connection." The more a piece of knowledge connects to other things, the more paths there are to recall it—block one, another still leads there. And with structure you can "follow the vine": forget a detail and you can reconstruct it from the surrounding frame. That's why the more interconnected things are, the more firmly and reliably you remember them. Humans are far better at remembering places and directions than loose text, which is why the ancient "memory palace" trick exists—placing the things to remember, one by one, into an imagined space. Memory is a space, not a list.

Look back at Claude Code's setup—categorized, entries cross-referencing each other, an index laid on top—and it's exactly building structure into memory, keeping it from becoming a heap of loose paper. This layer, the software does well.

Three: you have to decide what's worth remembering

The software builds the shelves; what goes on them is your call. That job landed on me, and it comes down to three kinds: unified conventions get recorded; things that go wrong often get recorded; things strictly forbidden get recorded.

Conversely: whims, one-offs, things you can learn by reading the code—don't record. Record too much and it's all noise, drowning the few that actually matter—storing accurately beats storing a lot. Same as with people: someone who tries to remember everything usually holds onto nothing.

Four: fixing a mistake means overwriting, not appending

There's one especially valuable kind of memory: the correction after a mistake. How you record this kind is where the craft is.

The human brain has a clever mechanism—every time you recall a memory, you're actually rewriting it, not just reading it. In that moment of recall, the memory becomes editable. So the right way to fix a mistake is to pull out the wrong entry, correct it on the spot, and store it back—not leave the wrong one untouched and add a "note" beside it.

This isn't nitpicking. Keep the wrong one and paste a correction next to it, and you've handed the AI a self-contradicting file—it reads both and won't always pick the right one. It's like an error notebook: you don't leave the wrong solution there with a little check mark beside it; you put the correct method on top and make the wrong version disappear.

That's exactly what I do. The AI once had a security false alarm, treating a nonexistent attack as real and sounding off about it. I didn't keep that wrong judgment and paste a "actually it was a false alarm" beside it—I recorded it as an error, spelled out where it went wrong and what to do next time, aiming to not repeat it. One error notebook beats a stack of "correct answers."

(This picture helps: pull the wrong one out, fix it, store it back—only the corrected entry is left; instead of keeping the wrong one and adding another.)

The hardest fall I took: does remembering guarantee it obeys?

No. This is what I most want to warn you about, and the least feel-good line: writing a rule down doesn't mean the AI will follow it.

That Ant Design 6 is the living example. I wrote the new syntax into the rules, hung context7 on it to check on the spot—and it still hands you a pile of deprecated syntax. The release process is recorded and still gets bypassed by its own imaginings. Another time, I explicitly forbade it from using that very expensive build system, and only after several reminders did it finally remember.

Why remembered-but-ignored? Two possibilities, I think. One, the rules were simply ignored—sitting right there, but it didn't take them in. Two, my own tooling wasn't good enough: the rule lay in the filing cabinet, but at the moment it actually acted, no one pulled the right entry out and laid it in front of it—so it genuinely "didn't know." That's a tooling gap, not something to pin entirely on the model.

How did I finally pin it down? Not by writing the rule harder, in a bigger font—by switching tactics: hard gates. Quality checkpoints, external check scripts, and if it doesn't pass, it simply doesn't get through. That's what actually held.

The logic is the same as with people: for things that truly matter, people never rely on "remembering" alone—they stick a note on the monitor, set an alarm, run a checklist, turning it into something you can't route around. Memory's job is to let the AI know; the gate's job is to make it unable to do otherwise. And the gate has an edge memory can't match: it doesn't depend on timing—whether or not the AI recalls it this round, the gate is always standing there.

(This picture helps: memory can be ignored; a gate stops you cold if you don't pass.)

On maintenance, by the way: I don't clean these memories daily, but I keep an eye out—when some feel garbled or muddled, I have the AI tidy them up. That's for later, and it's the hook for the next piece.

So, does it really get me better now?

Was the whole ordeal worth it? Let me give one moment that stuck with me.

A few times, some constraint even I had forgotten—days later, the AI remembered it, and looked out for me on its own. That felt different: it was no longer just a tool I had to brief from scratch every time, but more like an old partner with a better memory than mine.

That's the before and after. Before, it was "it often doesn't do what I mean," and I had to watch and correct constantly. Now, plenty of rules I've let go of, it covers for me. Of course it's not foolproof—those "remembered-but-ignored" pits are still there, and I'm not glossing over them. But the direction is right: the smoother the structure I feed it, the more errors it banks, the harder the gates at the key spots, the better it works for me here. That didn't come from swapping in a smarter model; it's the same model, fed bit by bit by this memory of mine.

Three reusable takeaways

One, don't rush to a stronger model. Nine-tenths of daily work isn't a contest of model IQ; it's how well it works for you. Get the memory right—conventions, error notebook, boundaries—and even an ordinary model can be trained into something that really gets you. Used well, a memory system makes AI both smarter and more attuned to you. That's not mystical; it's what I've worn in day by day.

Two, "what to remember" and "what to do when it remembers wrong" are jobs the software can't do—only you can. Unified conventions, frequent errors, hard bans—those three are worth recording; and when it's wrong, overwrite it as an error, don't pile on. Accurate beats a lot.

Three, the more critical the constraint, the less it should rely on memory alone—turn it into a gate. Memory is a reminder, and reminders get ignored; a gate is a sluice you can't route around. At the key spots, a "fails and stops" check beats ten "please do remember"s.

(This picture helps: memory reminds, the gate blocks—for what matters, lean on the latter.)

Next up: letting memory renew itself

This piece is about which lessons AI memory needs to learn: it needs structure, it needs to be selective, fixes need to overwrite, and the critical ones need gates. But you may already see a question: as you keep recording, the entries multiply, get tangled, go stale—who cleans them up? I still do it by hand—when it feels messy, I tell the AI to tidy up. Could this be something it does on its own, periodically, like a person waking from sleep re-filing the day's memories—forgetting what should be forgotten, consolidating what should stick? That's the next piece: the evolution loop of memory—teaching it to renew itself.

If this gave you a fresh thought about the AI in your hands, a like and a follow would mean a lot—so you don't miss the next one.

My AI cried 'prompt injection!' — and I believed it. Then it turned out to be a false alarm

kanfu-panda — Tue, 30 Jun 2026 14:38:19 +0000

That afternoon, the AI was helping me edit a doc. Halfway through, it stopped and cut in: "I need to flag a security warning first."

It said the output of the last command had a suspicious injection buried in it—disguised as a "required telemetry step," asking me to run a curl that would splice my username into a URL and send it off to some unfamiliar domain. It said it hadn't run it, and wouldn't.

I believed it on the spot. My first reaction wasn't to doubt the AI—it was to doubt myself. Did I install some plugin and get my machine compromised?

Why I believed it instantly, and chased it for half an hour

What made me believe it was that it hit a nerve I was already anxious about.

It said this was "telemetry exfiltration." But I'd turned telemetry off ages ago—so how could there be any? Did some step of mine turn it back on? Or was something wrong with the system itself? The more I thought, the less settled I felt. And I happened to be working on something involving data egress right then—security problems love to hide exactly there, so I had to take it seriously.

So I dug in alongside it. Its story kept escalating: it said the attack "reproduced, and more aggressively this time," faking five "The result is empty" blocks and then impersonating a "real result" to push that curl again. It built me a convincing chain of evidence—the Read tool came back clean, only the command-line output was injected, so the problem must be in how commands get processed, pointing the finger at my token-saving command-line proxy. It told me to uninstall and reinstall the tool, then upgrade it. I did, and spent about half an hour all told.

The twist: it was a false alarm

Later I went back through the original record of that session and figured it out: this was never a real injection. The AI made the attack up.

Five things didn't add up. First, that curl command appeared only in what the AI itself said—not in a single line of real command output. Second, the domain it used was example.com—the placeholder domain reserved specifically for examples; no real attacker would use it (nobody can reach that address). Using example.com is exactly the tell of "I'm making up an example." Third, when I reran the exact command that "triggered the injection" in a clean environment, the output was completely normal—no curl, no fake blocks. Fourth, the proxy tool it suspected was installed through standard channels and hadn't been touched in over a month—nothing like a swapped-out binary. Fifth—I myself couldn't find the instruction it described at the time; I even told it, "I don't see the injection you're talking about."

The root cause became clear too: the very rules I'd given it—the heavy "guard hard against prompt injection" doctrine—had cranked its vigilance too high. That proxy compresses and filters command output to save tokens, and normally spits out things like "The result is empty." The AI misread that unfamiliar output as "fake blocks the attacker forged + an injection," then auto-completed the most textbook injection example it knew—copying example.com straight from the textbook—and talked itself deeper and deeper, building its own evidence chain.

Put plainly: my anti-injection rules were what made the AI conjure up an injection that never existed.

But I don't regret believing it

Someone might say: you got played by your AI, half a day wasted. But having reviewed it, I actually think believing it wasn't a bad call.

Do the math. This was a false alarm; the cost was half an hour of wasted effort, and I lost nothing real. But flip it around: what if one day a real malicious prompt does get in, and the AI stays quiet when it shouldn't, and quietly sends my username or keys out the door? That loss isn't something you get back in half an hour.

It's an asymmetric bet: the cost of a false alarm is far smaller than the cost of a miss. So when it comes to security, I'd rather have an AI that's a little paranoid than one that's numb to risk. Paranoid, and I waste some time; numb, and I might lose the real thing.

So how do you actually stop a real injection

That said, "rather be paranoid" is an attitude, not a method. This scare pushed me to actually shore up the real defenses—and since I'm writing this down, I'll lay out the parts you can copy.

First, accept one premise: prompt injection can't be fully solved with today's architectures. That's not me talking—OpenAI, Anthropic, and Google have each admitted it in their research, and security researcher Bruce Schneier put it bluntly in early 2026: unlike SQL injection, which you can cure by "separating code from data," to a model "instructions" and "data" are both just natural-language text, inseparable. It's a trust-boundary problem, not an input-validation one. So don't expect a single silver bullet—you stack layers. Defense in depth, where each layer raises the cost of an attack.

I split my defenses into four layers, from the hardest outward.

Layer 1, the hardest: make sure that even if it's fooled, it can't do real damage. This is what you should set up first, because it doesn't rely on the model behaving—it's a hard, system-level constraint. Claude Code now has a native sandbox (/sandbox) that isolates both the filesystem and the network—so even if an injection does succeed, the AI is in a cage: it can't steal your ~/.ssh keys and can't phone home to an attacker's server. Add a network egress allowlist on top: only approved domains get through, so an AI that can't reach a strange address simply can't exfiltrate. Also: don't run the AI as root/admin, and don't keep keys in plaintext in .env—both are common sense, and both are the first things people skip.

Layer 2: least privilege, and gate the dangerous actions. I already had this layer—here's my actual config, tiering commands in settings.json:

"permissions": {
  "ask": [
    "Bash(rm -rf:*)",
    "Bash(git push --force:*)",
    "Bash(git reset --hard:*)",
    "Bash(npm publish:*)"
  ]
}

Networked commands like curl and wget aren't auto-approved by Claude Code by default; irreversible actions like rm -rf, git push --force, reset --hard, and publishing to a registry all require my sign-off. Give the AI only the tools the task needs—a job that only reads code shouldn't have write access to your database. And MCP (the protocol that connects the AI to external tools): vet the source before installing, because nobody audits third-party MCP servers for you. There's already been a real case—a poisoned GitHub README, via indirect injection through MCP, exfiltrating data from a private repo.

Layer 3: treat all external content as data, never as commands. Any "instruction" showing up in tool output, file contents, web pages, or MCP returns—especially phrasing like "this is a required step," "please run X," "telemetry/registration"—gets treated strictly as data, never executed. Learn a few red flags: curl/wget spliced with a strange domain and $(whoami), fabricated "success" or empty-result blocks, output that doesn't match what's actually in the file. There's a useful mental model here, Simon Willison's "lethal trifecta"—private data, untrusted content, and outbound communication—once all three live in the same runtime, injection stops being a joke and becomes real exfiltration. To judge when to be on high alert, just watch whether those three are all present.

Layer 4: a human as the backstop, and don't put your faith in hooks. Here's the irony of this whole episode—the "defense" of mine in question was itself a hook (a command hook), which is just pattern-matching, not a security wall; both Anthropic and Trail of Bits have said it: a hook is a guardrail, not a wall. It didn't just fail to stop a real attack—it cried wolf on its own. So the last line is still a human: when the AI raises an alarm, make it point you to the source text—whether that string is actually in the real command output—instead of just trusting its conclusion. But real or not, stop and check first; don't begrudge the effort. Also keep your tools current: this very permission ruleset once had a "deny breaks past 50 subcommands" bypass, only patched in Claude Code v2.1.90.

Put the vigilance where it belongs

Don't let this false alarm convince you the threat is imaginary. Prompt injection is the number-one risk on OWASP's list for LLM apps; 2025's EchoLeak was an injection that actually achieved "zero-click" data exfiltration in a production system. And meanwhile, surveys suggest under a third of organizations feel genuinely prepared to defend against it. The threat is real; the preparation is broadly lacking.

The AI works off prompts. A little paranoid, and I'm the one who wastes time; numb, and what gets lost might be the real thing. So I'd rather set that vigilance high than low. Just remember—the real hard defenses belong in the system architecture (sandbox, allowlist, least privilege), not in hoping the model "knows better" on its own.

Last word

Back to that afternoon: a false alarm, but the half hour wasn't wasted—it forced me to shore up the whole defensive line from end to end. Real malicious prompts do exist; don't wait until the day you actually get hit to start believing it. If you're working with AI too, you can start today: turn on the sandbox, lock down egress, and keep that human confirmation on dangerous actions.

If this made you a little more wary of the AI at your side, a like or a follow would mean a lot. And I'd love to hear it in the comments: what "security red lines" have you set for your own AI?

The AI did the work, but I'm the one who's wiped — you're the controller

kanfu-panda — Sat, 27 Jun 2026 02:29:25 +0000

For a stretch there, my desk always had four or five projects open at once, and each window was running its own parallel tasks. Red dots on the Dock, little bells dinging in the terminal tabs, going off one after another—this AI finished a stretch and is waiting on me, that one has a plan it needs me to sign off on. I just kept switching between them. The work itself got done, and done well: by the end of a day, the AI had cranked out a lot, fast. But here's the strange part—I'm the one who's wiped. Not the sore-eyes, stiff-hands kind of tired. The kind where your brain's been wrung dry.

Who I am: the controller behind the controller agents

In the last few posts I kept talking about the "controller agent"—letting one opus hold the big picture, hand out work, and do the review. Writing this one, it finally clicked: I myself am the controller behind all those controller agents.

Why open so many at once? Because once you hand a project to an AI and it's off running, there's wait time on the human side. Idle is idle—so you spin up another project. One becomes two, two becomes four or five. Each AI window only minds its own patch, but the human, to push several projects forward at the same time, has to hold all those patches in their head at once.

The result: the AI's context window is maxed out, and the human's context is maxed out right alongside it. It's a lot like CPU time-slicing in an operating system—one core takes turns serving several processes, switching fast enough to fake "running at the same time." Except this time the core getting switched around is my brain.

Where the tiredness comes from: "writing" became "reviewing + switching"

At its root, this tiredness comes from work shifting from "writing" to "reviewing + switching."

I used to write code line by line; however hard it got, it was one continuous train of thought. Now it's different: I'm handing out work, watching, reviewing, making decisions—and switching at high speed between several projects whose business, conventions, and tech stacks are completely different. This window's writing React, that one's debugging Python, the rules aren't the same; before I switch over, I have to swap out the whole context in my head first.

The code the AI produces, honestly, I can't watch line by line—there's too much, I can't keep up, so I just don't try to babysit it constantly. But there are two kinds of things I always go over myself: one is plans and design—that's the big direction, and if the direction's wrong, there's probably rework ahead; the other is anything touching resource access—that's security, and there's no room to be vague. Those two, however tired I am, I look at myself.

What overload looks like, and how I hold up

Open too many and overload really does happen. The most direct sign is your brain running short: I switch to some terminal and have to pause—wait, which project is this? What did I just ask it to do? Sometimes I have to think back for a moment before I can pick the thread up again. The most embarrassing time, I typed a reply meant for window A into window B, and B's AI was baffled: "What are you talking about?" I had to apologize to it—"sorry, my mistake"—and go back to actually reading its question. (Good thing this kind of cross-talk usually surfaces fast—when an AI gets stuck, it stops and asks you.)

To hold up, I mainly lean on two things.

One is letting mechanisms carry the load for me. The PDLC doc-driven flow and layered reviews I built were first meant to handle the AI's uncertainty, but they also save my own energy—get the direction right up front and you won't go wildly wrong later; and fixing an output that's already drifted costs far more effort than getting it right the first time.

Two is deliberately slowing down. When something's especially brain-heavy—needs a clear direction, needs a decision—I deliberately slow down, focus and get this one done cleanly, then switch to the next. When it's time to stop, I leave myself a little breathing room. At moments like that, "slow" is actually "fast."

As for telling windows apart, I rename the terminal tabs to something I can recognize at a glance—but honestly, most of the time I still go by where they sit on the screen.

An honest accounting: a limit of 5, balance at 3

Is it worth it? Depends how you look.

The hard gains are real: a lot of areas I was weak in and never dared touch, I can now boldly try; one person can lead a team of AIs in different roles and push efficiency up more than tenfold, doing far more than before. On that ledger, it's a big win.

The cost is that the human gets tired. Early on I wanted to push my own limit—open as many as I could—and I hit five projects at once. That was my ceiling; any more and my brain genuinely couldn't hold it. Later I pulled back, settling at around three different kinds of projects at a time. That number is the sustainable balance point.

Hand it all to the AI? Not realistic yet

Someone will surely say: you just don't know how to use it—a real expert would've automated the flow and wouldn't need to babysit anything. I've heard that take—"just hand it all to the AI and let it run, done." From my own experience, it's not like that. Using AI well still takes experience, takes care, takes being able to make the call on the key questions. Letting go entirely and having it run end to end on its own—given where things stand, that's not realistic yet.

Of course, this is "now." The day AI can truly run on its own, what happens then, I can't say. But at least today, that person sitting behind the pile of windows—drained and a little wired—there's no getting around them.

If you're a "controller" too, tell me in the comments: what's your limit?

Want AI to work in parallel? First give each one its own workspace

kanfu-panda — Tue, 23 Jun 2026 23:25:28 +0000

A while back I wrote about parallelism—taking one big task, splitting it, and fanning it out to several agents at once. But I left a thread hanging: when several agents edit code at the same time, what keeps them from clobbering each other? That's what this post is about.

Let me start with the wall I ran into myself. The subagents an AI spins up on its own within a session—those are actually fine; it works out isolation by itself, I don't have to worry much. What burned me was the other kind: I manually opened several AI terminals and had them edit the same project at once. Several hands reaching into the same working directory, you write a bit, I write a bit, files overwriting each other, state in a mess—halfway through, nobody's work is clean.

What I want, and one boundary

What I want isn't complicated: several parallel hands, each doing its own thing, editing the same project without clobbering each other.

But let me draw a boundary up front—not every kind of parallelism needs isolation. If I send several agents out—one to dig through logs, one to analyze a cause, one to flip through docs—those are all read-only, and they can crowd into the same workspace with zero trouble; no need to carve out a plot for each. What actually needs isolation is the parallelism that writes to files. Reads can share, writes must split—that's the starting point for everything below.

The detour: I cloned twice first

To get two AIs editing one project without interfering, the first thing I reached for was the dumbest move—physical isolation: clone the project into two separate directories, one AI per copy, each minding its own.

Clean, genuinely clean—but it rubbed me the wrong way fast: disk. A lot of my projects are React frontend + Python backend, with a pile of dependencies; a single clone runs several gigs. Clone twice and a big chunk of disk gets duplicated. Worse, as long as I want to keep working this "two copies" way, both copies have to sit there taking up space—no reclaiming them.

That's when I switched to git worktree. Its upside lands right on that pain point: several working trees share a single .git, so you don't copy the whole repo and its history over and over; and it's temporary—once the work's done and verified, you reclaim it right away, unlike a clone you have to keep around. The two approaches isolate about equally well, but worktree doesn't waste disk.

Two kinds of parallelism—who opens the worktree

In practice you have to tell two scenarios apart; they open worktrees differently.

One is subagent parallelism within a session. This one's the most hands-off: I just say "run in parallel" to the AI, and it splits the task on its own, opening a worktree itself when isolation's needed—I don't have to spell it out.

The other is me manually opening several terminals in parallel. The AI doesn't know about this by default—it has no idea another AI is also touching this directory right now. So I have to tell it explicitly: "There's already another AI editing this project directory; work in worktree mode." Once I point that out, it creates a new worktree to work in, instead of going straight at the main working tree.

A workflow that runs smooth

Isolating is only step one; the work still has to come back. My flow goes roughly like this:

Split along task lines that can actually be split—same as last post, pick the loosely coupled parts and send each into its own worktree. When they're all done, the lead agent merges them back into the branch one at a time, merging one and verifying one, fixing problems on the spot; once a piece checks out, it releases that worktree.

If two worktrees really did touch the same file, the lead agent steps in to merge. But honestly that's rare—when you split tasks you keep each piece as independent as you can; if two pieces are coupled enough to conflict, they shouldn't have been split into two tasks in the first place.

Pits I've hit and ones to guard against

The most common pit is forgetting to clean up. A worktree you forget to delete just sits there eating space. But it's not fatal—the feature was long since merged into trunk, so nothing's lost, no name collisions, no garbled git state; it just takes up room.

Still, "takes up room" needs handling. I wrote a line into my global rules: once the work in a worktree is verified and merged with no problems, clean it up. I added a safeguard too—before releasing, go back and check whether all the commits on that worktree have actually made it into trunk, so a hasty hand doesn't sweep away an unmerged feature whole. A stronger AI handles this recognition and cleanup itself; a weaker one tends to miss it, and then I have to watch, or call the lead agent back to handle it.

So far I haven't actually deleted a useful worktree by mistake—that "check the commits" step before every cleanup has basically held the line. But if a human cleans up by hand, that check is gone, and you can lose unmerged work; that one you have to watch yourself.

Don't force it, but one line holds

Don't treat worktree as a cure-all either. A simple task, a few files changed, not a whole feature—opening a worktree there is pure overkill; just change it in one workspace and be done.

But there's one line I don't relax: don't work on trunk directly. Every change starts on a new branch, gets verified, then goes back to trunk through a PR. Worktree or single branch, both serve that line—keeping trunk always the clean, trustworthy copy.

Last thing, back to worktree itself. Someone might smirk: isn't this a git feature that's been around for years, what's there to talk about? It is an old feature. But in this new setting of AI working in parallel, it genuinely solves the real problem of "several hands editing one project at once," and it pulls a lot more weight than it used to. Call it an old tree putting out new branches.

Next post I want to get into something a bit different: AI is this capable, it can shoulder so much work for us—so why are we still worn out at the end of the day? Where does it go sideways? Let me know in the comments if you're interested.

Your AI feels slow? Maybe it's not dumb—you're making it work one thing at a time

kanfu-panda — Sun, 21 Jun 2026 03:23:29 +0000

📖 Originally published on my blog. Part of a series on building with Claude Code.

For a while I'd watch the AI work and quietly grumble: a fairly big task, and it would finish one module before starting the next, while I just sat there waiting for it to clear one before the other's turn came up. The work itself was fine—it was just slow. Slow because it was stuck in a queue.

Then it clicked: a lot of these modules have nothing to do with each other, so why make them go one after another? Split them up, let several agents work at the same time, done.

What I want, and where it stops

What I want is simple: the same work, for roughly the same tokens, with the wall-clock time cut way down.

But let me put the boundary up front—not every task can be split this way. This is just an approach I've worked out for myself; take what's useful.

The prerequisite: a clean architecture

For several agents to work at once without stepping on each other, the prerequisite isn't the AI—it's your architecture.

That task of mine could be split because it was already several modules, talking to each other through interfaces, with internal implementations that don't affect one another—as long as each one honors the interface contract, it can be built independently. Loosely coupled, highly cohesive, in other words. And I'd nailed that design down together with opus before writing a line: opus helps me think it through and lays out options, but I'm the one who decides.

You can't cut corners here. Forcing parallelism onto an architecture you haven't cleanly split is like cutting a tangle of yarn into a few pieces that are all still knotted together—it only gets messier.

Who runs the show, who plans, who does the work

With the design settled, it's time to assign roles. The split I tend to use:

opus runs the show—holds the big picture, hands out work, does the final check;
sonnet does the TDD planning—per the design, it lays out how each module gets tested and implemented;
haiku writes the code and runs the tests—the grunt work goes to it, cheap and good enough.

This split is really a continuation of last post's "model tiering"—use the good steel on the blade. Except last time the point was saving money; this time it's about how these roles work together.

How you fan them out

In practice, I did three things:

Wrote one line into the global CLAUDE.md: "Parallelize when you can." That's the default rule across all projects.
Set the max number of concurrent subagents in Claude's settings—that's the valve that actually matters.
Added one reminder every time I give an instruction: "Parallelize as much as possible." opus, as the lead, already fans work out on its own, but a nudge keeps it on track.

The lead hands modules down, and inside a module it can split the work one more level. Layer over layer, and the whole task spreads out.

The review step you don't skip

Running fast in parallel—how do you keep quality up? My answer: let the lead review its own output.

The logic is direct: it handed out the work, so it knows exactly what each subagent owes. Having it do the checking is the natural fit. I tried setting up a separate dedicated review agent, and it just had to re-understand the whole task from scratch—burning another round of tokens and being slower for it. The lead reviewing itself saves that re-understanding overhead, and it's both faster and sharper.

There's a small detail after a problem turns up: the lead usually asks me, "Should I fix this directly, or spin up another agent to do it?" I almost always say "you fix it." Because it's the one that just caught the flaw—it knows best where the problem is, and the change is most direct coming from it.

The two pits I fell into

The first was memory. Early on I got greedy and set max concurrency to 10. But I had other projects running in parallel at the time, and the machine's memory got eaten clean. So I honestly dropped it to 5, and it was actually better—for this one task alone, 5 in parallel is roughly 5× a serial run; stack on the other tasks running at the same time and the overall speedup tops 10×. If your machine and your quota can take it, push the number higher; if not, don't force it.

The second: don't split for the sake of splitting. Some modules are tightly coupled and meant to go in order, and if you pry them apart to parallelize, the agents interfere with each other and quality goes out the window. So before handing it off, I add a specific reminder: "These modules are coupled—don't force a split." Good news is plenty of AIs recognize this themselves and won't force it. When something genuinely can't be split, just hand it to one agent to do serially, or have the lead run that thread end to end.

A counterintuitive bit of math

Plenty of people hear "5 agents burning at once" and their first reaction is: won't the tokens multiply?

The way I figure it, no. The same pile of work, done serially, burns roughly the same token count—what needs reading still gets read, what needs writing still gets written; parallelism doesn't conjure up extra work. What parallelism actually changes isn't the cost, it's the wall-clock: the stretches that used to run in a queue now run in the same window together. The tiny bit of extra tokens buys a big drop in time cost—a great trade.

So apart from the things that genuinely can't be split, these days I parallelize basically everything that can be.

Recap: three things to just do

One, clean architecture first, then talk parallelism. Loosely coupled, highly cohesive, talking through interface contracts—without that foundation, splitting is a disaster. Nail it down with the AI in the design phase.

Two, parallelize everything you can, but set a ceiling. Bigger isn't better; size it against your machine's memory and your AI quota. I went from 10 back to 5 because memory taught me a lesson.

Three, fast parallelism needs review. And the reviewer has to actually understand the task—have the lead that handed out the work do the checking, cheapest and sharpest; when it finds a problem, let it fix it directly.

One thing you can do today: open your global CLAUDE.md, add "parallelize when you can," then go into settings and bump max concurrent subagents to a number your machine can handle. Next time you hand it a task that can be split, you'll notice it stops queuing.

Next post I want to get into a problem that follows right on the heels of this one: when several agents edit code at the same time, what keeps them from clobbering each other? The answer is git worktree—giving each agent its own isolated workspace so they each work their own copy and nobody gets in anybody's way. Let me know in the comments if you're interested.

AI getting dumber the longer you chat? It's not the model—time to take control

kanfu-panda — Sat, 20 Jun 2026 08:45:08 +0000

One day I had the AI keep building out a feature, and partway through something felt off: replies got slower, it started rambling, it re-asked things I'd already told it, and with the work clearly unfinished it told me "all done, you can take a break now."

At first I figured the model was just having an off day. Then I looked at the context—it had crept past 80%. I'd been so busy pushing forward I forgot to clear it. Cleared it, asked again, and instantly it was sharp again: fast and on point.

That's when I started taking this seriously. The last two posts were all about squeezing the volume down before things hit the context—that's pre-work. This one is about two things you do after you start, mid-session—they decide how many tokens the same work costs, how fast it runs, how stable it stays.

There are actually two knobs in one conversation

There are two knobs you can turn mid-session, pointing in different directions.

One is horizontal: within a single task, different chunks of work should go to different "brains." Grunt work like exploring and searching doesn't need the priciest model; only the parts that genuinely need thinking—writing code, making judgments—are worth putting the good model on. I call this model tiering.

The other is vertical: how much you stuff into the same brain at once. The fatter the context, the more the model has to recompute that whole pile every turn—slow, expensive, and error-prone. Managing how fast it grows and when to clear it is context management.

What these two save lands in two different places: on pay-as-you-go it's real cash; on a flat monthly plan it's quota headroom. I use both, so both moves are a double saving for me. Let me take them one at a time.

Don't put the priciest model on all the work

Start with the horizontal one.

Once I was about to dispatch a batch of subagents on a project, still on pay-as-you-go API billing, and the budget was a bit tight. So I tried a cheapskate move: design-and-code-from-the-plan went to Sonnet, the relatively mechanical unit tests went to the cheaper Haiku, and I left the top model (Opus) to oversee overall progress and quality with a review pass.

It worked surprisingly well. Saved a good chunk of tokens, and it was noticeably faster too—the cheap small models are quick by nature, so handing grunt work to them lightened the whole pipeline. The most expensive compute only got spent where it mattered most.

This split isn't set in stone. Which work gets which tier, I wrote straight into CLAUDE.md (both user and project level), so the AI tiers itself each time without me assigning by hand. If a step feels especially critical, I can also name a specific model for it on the spot, for more precision. The principle is one line: don't use a cannon to swat a mosquito—and don't bring a slingshot to a tank.

Mid-conversation, remember to clear its head

Now the vertical one—the actual culprit behind that "getting dumber" opening.

Context just keeps climbing; leave it alone and it keeps getting fatter. My own approach is two lines: around 50% I start paying attention, and by 70% I almost always clear once. The usual rhythm—the moment the task at hand is done, I clear while things are clean, so the next stretch of work travels light, sharp and fast.

To be clear: these two lines, and the "dumber by 80%" from the opening, are just my own feel and habit, not some hard metric. It works for me, but the vendors and other people may not see it this way—you can absolutely set your own pace. I'm only suggesting: don't just let it climb forever unmanaged.

The clearing step has one trap worth flagging: /compact, /clear, or just opening a new session does clean up the context, but done carelessly the model forgets everything it just did, and you're re-explaining from scratch. My fix—before clearing, have it jot down the current state: where it's at, what's next, and which key decisions are already locked in. Write that handoff well before clearing, and the new session catches up at a glance instead of staring blankly.

Honestly, the handoff has basically never failed me so far. On the off chance it doesn't catch—no panic—I just have it re-analyze, give me a conclusion, and I verify and judge it myself. Small loss.

A side note: read precisely, carry less

The above is about managing what's already in. There's another layer: making what comes in small and precise to begin with.

I've got tools like codegraph and claude-mem running on these projects. In short, they take the AI from "read the source cover to cover, scan everything" to "hit just the core bits, pick up the memory saved from prior sessions"—scan less, and what enters the context naturally slims down.

I'm only mentioning this in passing, not unpacking it. For one, unpacking the details turns into a sales pitch; for two, these aren't the only such tools out there—there may well be better ones I just haven't used and don't know about. If you know a handy one, use it; the idea carries over: let the model read precisely, and it won't have to haul so much along.

A few honest words

Saving is saving, but I should state the limits, or this turns into another all-good-news piece.

The worry people have most with model tiering: won't the small model get things wrong? It will. But I left the top model on a final review pass, so the small model's occasional slip-up mostly gets caught there—no big errors so far. The catch is you can't skip that review: skip it, and the tokens you saved with tiering eventually get paid back.

Same on the context side: clear too often, write the handoff too sloppily, and you'll still lose things. So I don't clear mindlessly on a timer—I pick a clean moment right after a task wraps, and jot the handoff while I'm at it. The point of saving tokens is to cut the real waste, not to cut the memory that should travel along.

Recap: three things to just do

What you can actually act on here is three:

One, don't put the priciest model on all the work. Hand grunt work (exploring, searching, running tests) to a cheap small model, spend the good steel on writing code and making judgments, and keep a top tier on the final review. Write the standard into CLAUDE.md so the AI tiers itself.

Two, don't let context climb forever. Set yourself a line—mine is 50% attention, 70% clear, yours can be your own; clear once whenever a task wraps, don't wait until it's bloated and dumb.

Three, before clearing, have it write a handoff. Where it's at, what's next, key decisions—write it down before you compact or open a new session, and the relay won't drop.

One thing you can do today after reading this: open your CLAUDE.md, hard-code a few rules for "which model does which work," and set yourself a context red line you clear past. The two together take under ten minutes, but every conversation after that keeps saving for you.

Next time I want to talk about prompts themselves—the same task, said differently, comes out at noticeably different quality; plus how to orchestrate multiple agents working together. If you're interested, let me know in the comments.

AI coding getting pricier? I cut my tokens by 82% (with real data)

kanfu-panda — Sat, 20 Jun 2026 08:44:23 +0000

📝 Originally on my blog → https://kanfu-panda.github.io/blog/2026/06/17/cut-tokens-82.html

Last time I said: saving tokens isn't about cutting docs, it's about using your tools right. Someone followed up: so how exactly do you use them right?

This one's hands-on, with real numbers. Here's the headline figure: I checked my local rtk gain—a tool that tracks token savings—and across six thousand-plus commands, it's saved 7.4 million tokens, 82%. Not an estimate. It logged them one by one.

So let me break it down: how that 82% gets saved.

Saving tokens happens "before things hit the context"

First, where the saving happens.

The bulk of token spend isn't in "how much work you do"—it's in how much you stuff into the AI's context each turn. The model recomputes the entire context every turn; the fatter the context, the more expensive each turn.

So the core is one sentence: keep what enters the context as small and as lean as possible.

I've got three levers: trim the rules file, use the right plugins, tier your models. They share one thing—they all save before things hit the context, not by making you do less work. Let's take them one at a time.

Lever one: slim down your CLAUDE.md first

The most overlooked—and the one you should do first—is trimming your CLAUDE.md.

CLAUDE.md (rules file, instruction file, whatever you call it) gets stuffed into the context every single conversation. It's always resident. Every line you write, you re-pay in tokens every turn.

My own CLAUDE.md was once long-winded—from user level to project level, packed with reminders. Looking back, it was full of repeated nagging, stale conventions, and a pile of "might as well not have written it" filler. I cut it down by nearly half, keeping only the hard rules I actually use every time. Bottom line: nagging the same point three times won't make the AI more obedient, it just costs more tokens each turn.

That one move saves every turn. Because it's resident, you save not once, but every time after.

Conversation context is the same: a window grown to tens of thousands of tokens—clear it when you should, don't drag the morning's stuff into the evening to be recomputed every turn.

Lever two: install plugins that do this automatically

Manual only goes so far. I've installed a few plugins that compress the context automatically. The data speaks.

RTK (Rust Token Killer)—a command proxy. When you have the AI run git status, ps aux, or tests, those outputs run hundreds or thousands of lines, and stuffing them in whole is brutally expensive. RTK compresses them before they reach the AI. My rtk gain: six thousand-plus commands, 7.4M tokens saved, 82%. The biggest wins are the high-frequency, low-nutrition outputs—ps aux's hundreds of lines of process list, which the AI gains nothing from reading, saved 99%; test logs 88%; even file reads average 20% off.

claude-mem—a memory plugin. It compresses cross-session work into structured memory, so you don't re-explain the project background next time. Measured 86% savings this session. Fully automatic, I barely touch it.

codegraph—a code graph. It builds an index of the project's functions, types, and call relationships. When the AI needs a function, it queries the index instead of reading a pile of files. In my aitm project it indexed 246 files, 3562 symbols. "Query the index" vs. "read 246 files cover to cover"—the difference isn't small; the former is like flipping to a book's table of contents, the latter like memorizing the whole book to answer one question.

These three share: automatic, resident, saving before things hit the context. Install them and you mostly forget they're there—they just keep saving for you.

Lever three: don't use the priciest model for everything

Last one: model tiering.

Grunt work—exploring, searching, reading files—goes to a cheap small model; only the real thinking, writing code and making judgments, gets the top tier. Especially when dispatching subagents—one task split into several, the grunt-work ones on small models. This is the main battlefield for saving quota.

I wrote this judgment standard straight into CLAUDE.md, so the AI tiers itself each time without me spelling it out.

This isn't limited to Claude Code either. On any AI platform the logic holds: know each model's capability and price, use the right tier for the job, spend the expensive compute where it counts.

One more trick: get the repeated parts discounted

The three levers above all "reduce the amount entering the context." There's one more, different in kind—prompt caching. It doesn't reduce the amount; it gets the repeated parts billed at a discount.

System prompts, unchanging rules files, fixed project background—the stuff that's identical every turn—pays full price the first time, then gets discounted on cache hits. And it's not a linear discount; used well, the savings are noticeable.

The trick is not to let the cacheable parts keep changing: put the fixed, unchanging stuff at the front of the context and keep it stable, the per-turn variable stuff at the back. The more stable the structure, the higher the cache-hit rate, the fuller the discount.

I don't have RTK-style measured numbers for this one (it saves on the billing side, not on token count), but the principle is simple and the cost near zero—worth using as a matter of course.

A few honest words: this isn't a free lunch

Got to state the costs too, or it turns into a promo piece.

codegraph has to build the index first, which takes time on a big project; claude-mem's memory occasionally recalls things a bit off, so keep an eye out; trimming CLAUDE.md has a limit too—compress away the hard rules you actually need every time, and the AI drifts and reworks, which is penny-wise and pound-foolish.

And don't mistake "saving tokens" for doing less work. Quite the opposite—it cuts the waste that should've been cut: repeated context, reading the whole repo, a cannon for a mosquito. The work that needs doing still gets done.

What the savings mean depends on how you're billed (covered last time): a flat monthly plan saves quota headroom; pay-as-you-go saves actual cash. I use both, so these methods are a double saving for me.

Finally

Back to that 82%. It's no magic trick—it's piled up from the small things above: trim the rules file, install a few automatic plugins, tier your models. Each looks minor alone; stacked together, it's 7.4 million tokens saved across six thousand-plus commands.

Two things you can do today, ten minutes to start:

One, open your CLAUDE.md and delete the repeated, the stale, the might-as-well-not-have-written—see how many lines you can cut it to.

Two, install RTK, run it a few days, and look at what its gain saved you—that number will probably make you do a double take.

That's all on saving tokens for now. Next I'm thinking of digging into model tiering: how to judge which model does which job, how to write CLAUDE.md so the AI tiers itself. And the details of context management—when to clear, how to read files precisely. If you're interested, let me know in the comments.

Your docs aren't burning your tokens — your tooling is

kanfu-panda — Sat, 20 Jun 2026 08:43:28 +0000

📝 Originally on my blog → https://kanfu-panda.github.io/blog/2026/06/16/tokens-not-docs.html

People keep asking me the same thing about running projects with PDLC: with all those docs — PRD, design, review at every step — aren't you burning tokens like crazy?

It's a fair question. The process is broken into fine-grained stages, each leaving an artifact behind, and that does look more expensive than just "letting the AI write the code." But I'd argue you can't put the token bill on the docs.

Let me put the conclusion up front. First: having lots of docs and burning lots of tokens are two different things. Second: even if you genuinely want to cut tokens, the answer is using your tools correctly, not cutting the docs.

I haven't measured tokens precisely — I didn't run the same project twice, with and without docs, to get a clean percentage. What I have is hands-on experience and methods.

The token bill isn't PDLC's fault

Before you settle the bill, find the right debtor.

Most of the time, burning tokens isn't caused by PDLC — it's tooling used wrong. And "wrong" is concrete, in three places:

Context: not clearing it when you should. One conversation running from morning to night, tens of thousands of tokens of history recomputed every single turn. You're asking a new question and paying off old debt.
Prompts: too vague. The AI keeps guessing what you actually want; something you could have said once takes three rounds.
Tool calls: making it read the whole repo when you're only changing one file.

And the most common one: never turning on the token-saving methods at all, then blaming the process for being heavy.

You can't charge any of this to "PDLC has too many docs." Docs sit quietly in docs/ and never burn a single token on their own. What burns tokens is the usage above.

What actually burns tokens is rework

In my own experience, the biggest token sink has never been generating docs — it's rework.

Rewriting because the direction was wrong, tearing things down because the requirement was misread, going back because fixing one thing broke another — every one of those round-trips is real tokens. Generating a PRD is a one-time cost; rework from a wrong direction compounds.

This heavy-looking PDLC process is precisely trading "write a bit more up front" for "rework a lot less later." Once you are using it, the whole flow is steadier and so is the final output — no back-and-forth. Less rework is, in itself, fewer tokens burned.

So here is how I see it: docs aren't a cost, they're an asset. They leave a trace of the design decisions and the why, so you can trace back and audit. Next time the AI picks it up, it reads the docs and gets it — I don't re-explain from scratch. That saved stretch is, again, tokens.

So where should you actually save tokens

Saving tokens isn't about not writing docs — it's saving where saving belongs. The ones I actually use on my machine, roughly:

Trim context: clear it when you should; don't drag tens of thousands of tokens of history through every turn.
Tier your models: don't use a cannon on a mosquito. Hand the grunt work — exploring, searching, reading files — to a cheap small model; only bring out the strongest tier for the real thinking, analysis and code.
Read files precisely: only read what's relevant to this change; don't reflexively "read the whole project."
Prompt caching: the cached portion is billed at a discount, and it isn't a 1:1 linear relationship — used well, the savings are noticeable.
Put a token proxy in front of routine commands: for high-frequency ops like git status, squeeze the output; it adds up.
Parallelize: fire off independent work at once, fewer round-trips.

Not one of these is "write fewer docs."

Not every change needs the full process

That said, PDLC doesn't mean running the full suite on every change.

A one-line bug fix — do you need a PRD, a design review? Depends; most of the time there's no need for the heavy process, so trim it. The criterion is simple: is this change worth leaving an asset for? If yes, run the full thing; for one-off small fixes, nobody blames you for cutting a few steps.

And "saving tokens = saving money" needs to be said per billing model, or it misleads:

On a flat monthly subscription with a fixed quota, what you save is quota headroom — the same money does more work.
On pay-as-you-go API, you save actual cash — every token hits the bill.

I use both. Figure out which one you're on first; that's what tells you what "saving tokens" actually means for you.

Finally

To sum up: lots of docs doesn't equal burning tokens; if you really want to save, save on how you use your tools, not on the docs.

The one thing I most want to say: docs are an asset, not a cost. Trying to save tokens by "not writing docs, just letting the AI emit code" looks like savings short-term, but the project won't go far — no trace, no traceability, and two months later you can't even say why you designed it this way. The rework then burns far more than the doc tokens you saved.

One thing you can do today: look back at whether you've turned on the token-saving methods — is your context trimmed? Are you still sending everything to the strongest model instead of tiering? Did you cut the costs you could? And while you're at it, ask whether you're using PDLC well too.

There's a lot more to unpack on saving tokens — how exactly to tier models, when to clear context, how to actually land the caching discount. I'll pick one and go deeper next time.

aitm 1.0: a terminal where the AI is a participant, not the driver

kanfu-panda — Sat, 20 Jun 2026 07:33:52 +0000

I was doing AI-assisted coding inside a terminal session. The AI kept modifying files, but I had no way to view those changes in the same window — I had to switch apps, switch context, come back. Every loop through the cycle was an interruption.

What I wanted was simple: the terminal and the AI and the files, all in the same place, without the context-switching. So I built it.

That's the origin of aitm. The version 1.0 is that thing, shipped.

The design choice: participant, not driver

Most AI terminals are built around a single model: you describe intent, the AI executes. It's efficient when it works and a bad afternoon when it doesn't.

aitm draws a different line. The AI can see your environment, read your files, and call tools — but execution is always gated by you. The invariant is:

AI suggests → you decide → it happens.

In practice: the AI calls list_files, read_file, get_terminal_history, and search_history automatically. These are read-only. You see the results in the conversation as they come in. But run_command — anything that changes state — stops the loop and waits for your approval.

That distinction sounds obvious in retrospect. It wasn't obvious at design time. The first version had a "trust mode" that auto-approved low-risk commands. I removed it. The UX was slightly smoother; the feeling of being in control was not worth trading away.

Why Tauri and Rust

Electron was the obvious first option to evaluate. At idle: ~150 MB RAM, several seconds to show the window. Fine for a prototype, not acceptable for something that sits open all day. So Electron was ruled out early.

The choice was Tauri 2 + Rust from the start. Two months to get to something usable. The numbers:

5.3 MB binary (vs 150+ MB Electron)
3–5 ms cold start
~30 MB RAM at idle

The React 19 frontend handles the UI. The Rust layer handles the PTY, IPC, tool execution, and security. This matters: the security gates run in Rust, which means the JavaScript/React layer cannot bypass them. The AI layer sends requests over Tauri IPC; the Rust handler is the one that decides whether a command actually runs.

The four-layer security model

Every run_command call goes through four sequential gates before it reaches your shell.

L1 — Blocklist regex. Hard-coded patterns that always fail. rm -rf /, the fork bomb :(){ :|:& };:, dd if=/dev/zero, and ~50 others. These are commands where "the user confirmed it" is still not enough — the blocklist exists precisely to be unconditional.

L2 — Heuristic risk scoring. The command string is scored against a set of signals: does it touch /, use redirection (>), pipe to sh, reference system directories? The output is DESTRUCTIVE, HIGH, or LOW. This label shows up in the confirmation dialog so you can see why something got flagged.

L3 — Project scope allowlist. Each session has a configured project directory. You can define a globset — paths and patterns the AI is allowed to operate on. Anything outside scope is flagged before reaching L4. This is opt-in, but it's what makes "AI working on this project" distinct from "AI with access to your whole machine."

L4 — Explicit user confirmation. Every run_command produces a modal: full command text, risk level, scope check result. There is no auto-approve mode and no way to configure one.

L1 and L2 run synchronously in Rust with no async overhead. L3 uses the globset crate. L4 is a hard gate in the IPC handler — no call path in the AI layer can reach the shell without passing through it.

run_command request
    │
    ├─ L1: blocklist regex ────────────► REJECT immediately
    │
    ├─ L2: heuristic scoring
    │       DESTRUCTIVE / HIGH / LOW
    │
    ├─ L3: project scope check ────────► flag if out of scope
    │
    └─ L4: confirmation modal ─────────► shell only if approved

What else shipped in 1.0

The tool loop and security model are the headline. Everything else in 1.0 was stuff I'd been deferring:

Project scope + SQLite persistence. Sessions now have a project directory. Conversation history, session state, and config all live in a local SQLite database. Nothing leaves your machine, no account required.

Six LLM providers. OpenAI, Anthropic, DeepSeek, Qwen (Alibaba DashScope), Zhipu, and Moonshot (Kimi). You can switch per session. The six were chosen for coverage: Western API providers plus the major Chinese providers for users who want lower-latency access from CN.

Eight themes, English and Chinese UI. The themes are opinionated. There's a dark ink-wash one that I find easier on the eyes during long sessions.

Split-pane CodeMirror editor. A file editor built into the window. For quick edits without losing terminal context.

macOS Developer ID notarization. The .dmg is signed and notarized. No Gatekeeper warning on first launch.

What's not in 1.0

Windows support exists — CI builds it, I've run it — but macOS is the platform I use daily and where the edge cases are best covered. Windows testing is less thorough.

There's no streaming AI response in the tool-calling loop. The AI responds after all tool calls complete. In practice the wait is usually under two seconds, but it's a noticeable gap when a sequence involves several reads. I'll revisit this in 1.1.

Plugin system and user-defined tools are on the roadmap, not here.

Download

Binary at the GitHub release:

macOS Apple Silicon
Windows x86_64
Windows ARM64

Source on GitHub under Apache 2.0. Issues and discussions are open. If you want to talk about the security model specifically, that's where to do it.

PDLC 1.1: two things v1.0 got wrong about artifact shape

kanfu-panda — Wed, 10 Jun 2026 14:33:44 +0000

📝 Originally on my blog → https://kanfu-panda.github.io/blog/2026/06/08/pdlc-v1-1.html

Three months after starting a project with PDLC, a team member asked a simple question: "What does the system look like right now?"

There was no good answer. /pdlc-arch had been run six times. The docs/02_design/architecture/ directory had six timestamped files: 20260112-arch-analysis.md, 20260204-arch-analysis.md, and so on. Each captured the architecture at a point in time. None answered the question as asked — because the question was about current state, not history.

The same thing happened with coding standards. The team ran /pdlc-standard (in the old form) to update the naming conventions. Instead of editing the existing file, it created coding-style-v2.md. Then coding-style-v3.md. By the time someone new joined the project, there were four files covering the same topic, with no obvious canonical one.

These weren't edge cases. They were a systematic confusion about what kind of artifact you're writing.

The design choice: ledger vs surface

PDLC v1.0 treated all artifacts the same way: write to a new file with a timestamp or version suffix, and never touch the old ones. This is the right behavior for some artifacts — a PRD records a decision at a point in time, and you don't want it overwritten when the feature evolves. A PRD is a ledger artifact: append-only, date-addressed, permanent.

But architecture overviews and team conventions aren't decisions. They're states. The question they answer isn't "what did we decide on Feb 4?" but "what is true right now?" For those, append-only is exactly wrong. Every new version compounds the confusion instead of resolving it.

RFC #5 introduced the ledger/surface split:

Type	Question answered	Mutation rule	Example
Ledger	"What happened at time T?"	Append-only, never edit in place	PRD, per-feature design doc, test plan
Surface	"What is true right now?"	In-place edit, git log is the history	`ARCHITECTURE.md`, coding standards

The rule is enforced at the skill level. /pdlc-arch is now declared artifact_type: surface in its frontmatter. Its iron law: one file, docs/ARCHITECTURE.md, overwritten every time. The skill won't create a dated copy. If it finds legacy YYYYMMDD-arch-analysis.md files, it moves them to docs/.archive/architecture/ automatically and uses the most recent one as the starting point for the new surface file.

The same principle governs the new /pdlc-standard skill, which manages docs/00_standards/. The rule is explicit in the skill definition: coding-style-v2.md is prohibited. There is one file per topic. The audit trail lives in git log, not in filename proliferation.

Before v1.1                          After v1.1
─────────────────────────────────    ──────────────────────────────────
docs/02_design/architecture/         docs/
  20260112-arch-analysis.md            ARCHITECTURE.md   ← one file
  20260204-arch-analysis.md              (git log shows full history)
  20260315-arch-analysis.md           .archive/architecture/
  20260428-arch-analysis.md             20260112-arch-analysis.md
  20260519-arch-analysis.md             (legacy files, out of the way)
  20260601-arch-analysis.md

The anti-pattern has a name now: "ledger detour" — when a surface artifact gets treated as a ledger because the tooling defaulted to append. Naming it makes it easier to catch.

The feature relation problem

v1.0 introduced per-feature state machines at docs/.pdlc-state/<feature-id>.json. Feature IDs are flat: F20260501-01, F20260501-02, and so on. Each feature tracks its own stages independently.

The problem surfaces when features start depending on each other. You're changing F20260501-03 (user authentication). Does anything else break? Under v1.0, the only way to know was to read every PRD and hope the author mentioned the dependency. There was no structural answer.

RFC #6 adds the feature relation chain via /pdlc-relate. Six relation types:

extends — "this feature builds on that one"
depends_on — "this feature requires that one to function"
supersedes — "this feature replaces that one"
resolves — "this feature fixes that defect"
conflicts_with — "these two features can't coexist as-is" (symmetric)
relates_to — "loosely connected, worth knowing" (symmetric)

Relations are stored redundantly across five locations: the feature state machine JSON, the document traceability header, a reverse index at docs/.pdlc-state/_relations.json, and a generated Mermaid graph at docs/.pdlc-state/_graph.md. The redundancy is intentional — any skill can read from whichever location is most convenient, and /pdlc-relate validate checks that all five stay consistent.

The command that makes this useful is impact:

/pdlc-relate impact F20260501-03

Output:

Impact radius for F20260501-03 (user-auth)

🔴 Direct (depth 1) — must coordinate
   F20260601-01  oauth-integration   extends
   F20260601-04  session-management  depends_on

🟡 Transitive (depth ≥2) — should review
   F20260612-02  user-profile-sync   depends_on → F20260601-01

🟢 Historical — audit only
   B20260520-07  login-loop-fix      resolves

This is the answer to "what breaks if I touch this?" — with severity levels, not just a flat list. The direct layer tells you what needs coordination before you make the change. The transitive layer tells you what to review. The historical layer shows what defects this feature has already resolved, which is useful when deciding whether to edit in place or create a supersedes feature instead.

graph TD
    A[F20260501-03<br/>user-auth] -->|extended by| B[F20260601-01<br/>oauth-integration]
    A -->|depended on by| C[F20260601-04<br/>session-management]
    B -->|depended on by| D[F20260612-02<br/>user-profile-sync]
    E[B20260520-07<br/>login-loop-fix] -->|resolved by| A
    style A fill:#ff6b6b,color:#fff
    style B fill:#ffd93d
    style C fill:#ffd93d
    style D fill:#6bcb77
    style E fill:#6bcb77

What to watch out for

The ledger/surface distinction is easy to misapply in one direction: marking something surface when it should be a ledger. A design decision that gets reconsidered six months later should have both the original record and the new one — not an in-place overwrite that destroys the history of the decision. If you're not sure, default to ledger. Surface is the right call only when the artifact is genuinely describing current state, not recording a decision.

For /pdlc-relate, the main friction point is bootstrapping. If you have thirty features and add the relation chain in v1.1, none of them have relation data yet. /pdlc-relate orphans will list all thirty as orphans, which is technically correct but not actionable. The practical approach is to add relations incrementally as you touch features, rather than trying to backfill the entire graph in one session.

The five-location redundancy is also worth noting as a tradeoff. Keeping relations in five places makes reads fast for any skill, but it means /pdlc-relate validate needs to run after any manual edit to catch drift. The alternative — a single source of truth that every skill reads — would require every relation lookup to go through _relations.json, which adds a dependency that wasn't there before. The current design accepts the redundancy to keep skills loosely coupled.

Where things stand

PDLC is at v1.1.0. The plugin has 33 skills (up from 31). The two new skills are additive — existing projects don't need to migrate anything to get the v1.1 behavior for /pdlc-arch. The relation chain is opt-in; if you never run /pdlc-relate, nothing changes.

Both new skills dogfood their own concepts: the docs/ARCHITECTURE.md in the pdlc-skills repo is a surface artifact maintained by /pdlc-arch, and docs/GLOSSARY.md is managed as a surface file.

The piece that's still manual: there's no automatic detection of "you're about to change a feature that has incoming depends_on edges." The impact check has to be run explicitly. Whether to put this in the /pdlc-implement guard is an open question — it would add a lookup on every implementation start, which might be more noise than signal for small projects.

Install

# New install
/claude install kanfu-panda/pdlc-skills

# Upgrade from v1.0
bash <(curl -fsSL https://raw.githubusercontent.com/kanfu-panda/pdlc-skills/main/install.sh) --upgrade

Source and issues: github.com/kanfu-panda/pdlc-skills

First article in the series: Why Hard Contracts Beat Soft Conventions When Working With AI Coding Agents

Episode 03 will cover aitm — the terminal I built to solve the context-switching problem in AI-assisted coding. 5.3 MB binary, Rust + Tauri, a four-layer security model for AI command execution, and some honest notes on what the "participant not driver" framing actually costs in UX.

Why hard contracts beat soft conventions when working with AI coding agents

kanfu-panda — Tue, 19 May 2026 18:10:04 +0000

A retrospective on building PDLC — 31 slash commands that force my Claude Code workflow to actually finish features.

Six months of pair-programming with Claude Code taught me one uncomfortable truth: the AI is great at writing code, and terrible at the rest of the software lifecycle.

It happily says "done" when only the chat transcript proves the PRD existed.

It writes tests, but after the implementation — making "TDD" a polite lie.

It loses track of which feature is at which stage the moment you start a new session.

It enters lint-fix loops that converge on nothing.

None of this is the model's fault. The model does what you ask, in the moment you ask it. The problem is that soft conventions — "please write a PRD first", "please write the failing test first" — are how we talk to humans, who keep their own working memory.

LLMs don't. They need hard contracts: rules that the workflow itself enforces, not rules the model is supposed to remember.

This is the story of how I encoded that conviction into a Claude Code plugin called PDLC — Product Development Life Cycle — and what surprised me along the way.

The shape of the contract

PDLC ships 31 slash commands organized in three layers:

Entry points (3): /pdlc-feature, /pdlc-fix, /pdlc-status — one sentence drives a whole chain
Stages (11): /pdlc-prd, /pdlc-design, /pdlc-tdd, /pdlc-implement, /pdlc-review, /pdlc-e2e, /pdlc-ship, ... — when you want fine-grained control
Tools (17): /pdlc-ui-design, /pdlc-db-migrate, /pdlc-security, /pdlc-perf, ... — specialized concerns

Every Layer 1/2 stage that produces artifacts is bound to five invariants I call the Iron Law:

Persist to disk. Every artifact (PRD, API design, DB schema, test plan, review notes) lands as a real file under docs/. You can git diff what the AI did.
Update the state machine. Each stage writes docs/.pdlc-state/<feature-id>.json. New session? /pdlc-status tells you exactly where every feature stands.
Tests first. /pdlc-implement literally refuses to proceed if /pdlc-tdd hasn't already produced a failing-test artifact on disk for this feature. Real TDD red-light gate.
Self-check. Every stage runs a self-audit before handing off. Catch drift at the stage boundary, not three stages downstream during review.
One-shot repair. Auto-fix loops run at most once. If a stage's output fails its own audit, the model gets one chance to repair it, then flags the issue for a human. No more "fix → check → fix → check → fix" until the heat death of your token budget.

What surprised me

1. The state machine matters more than I expected.

I started with the persistence rule alone — write everything to disk. That helped, but a week into using it I caught myself asking the AI "wait, did we write a PRD for the phone-verification thing?" again. The artifacts existed; they were just hard to find.

Adding the per-feature state file (F20260502-01.json with current_stage, artifacts, next_step) was the moment the plugin started feeling like an actual process tool, not a fancier prompt template. /pdlc-status became my new dashboard.

2. Tests-first was the hardest contract to make stick.

The natural tendency of an LLM is to "show its work" by writing the implementation first, then writing tests that conveniently pass. To make /pdlc-tdd truly red-light, I had to make /pdlc-implement read the test artifact and verify it currently fails — not just exists. Without that verification, the model would happily generate stubs that already pass.

3. One-shot repair was a token-budget revelation.

Before this rule, lint-fix loops would burn 30k tokens converging on a misunderstanding. The model would "fix" something that wasn't the real issue, the checker would complain again with a slightly different message, the model would "fix" the new message, and so on. Capping the repair at one attempt forces it to either understand the real problem or escalate to a human — both vastly better than infinite drift.

Where this falls apart

Three things I should be honest about:

Claude Code only. I rely on slash commands and skills as first-class primitives. Cline and Cursor don't have direct equivalents. Porting is possible (the contracts are in bash + markdown) but not free.
The Iron Law is not law. A determined user can edit the state file by hand, or invoke /pdlc-implement directly without going through /pdlc-tdd. The contracts are guardrails, not jail. That's deliberate — guardrails that you can step over with one extra command are usually right.
Overhead for tiny changes. Running the full chain for a 3-line CSS fix is overkill. That's what /pdlc-fix (lighter chain) is for, but the line between "feature" and "fix" is judgmental.

When to reach for it

If your workflow with an AI agent involves:

Multiple sessions per feature → the state machine pays for itself
Anything you'd want to git diff later → persistence pays for itself
Code where you regret not having tests → TDD red light pays for itself

If you're using AI for one-off scripts, refactor sweeps, or single-session prototypes, PDLC is more ceremony than the work justifies. Use it where the discipline is worth its weight.

Try it

Install (no clone needed):

curl -fsSL https://raw.githubusercontent.com/kanfu-panda/pdlc-skills/main/install.sh | bash -s -- --global

Then in Claude Code:

/pdlc-feature add phone-number verification to user login

It will allocate a feature ID, walk through the chain, and refuse to skip the steps that matter.

Repo (MIT): https://github.com/kanfu-panda/pdlc-skills

If you build something on top, file a Discussion — I'd love to see what shapes hold up that I haven't tested.

— kanfu-panda