DEV Community: Igor Kramar

I wrapped Claude Code in a zsh function. Here's every decision I almost got wrong.

Igor Kramar — Mon, 25 May 2026 14:05:46 +0000

Claude Code's --help lists 50+ flags. After two weeks of using it daily, I built a zsh wrapper called cco that bakes in the flags I actually want. The wrapper itself is 60 lines. The interesting part is the decisions behind those 60 lines — most of them I had to backtrack on at least once.

This is the decision log. If you're using Claude Code seriously, some of these will save you the same backtracks.

Decision 1: Function, not alias, not shell script

The dumb instinct is alias cc="claude --permission-mode acceptEdits --append-system-prompt ...". It works until you want subcommands. cco plan, cco safe, cco review — aliases can't branch on arguments.

Standalone shell script in ~/.local/bin/cc was the next thought. It works for most cases, but it runs as a separate child process rather than in the current shell. That's fine for stateless commands. It's not fine when the command wraps an interactive process that wants the parent terminal's tty for tmux attachment and prompt rendering. Worked in testing, behaved oddly in edge cases.

A zsh function runs in the current shell. Inherits the tty cleanly. Can dispatch on subcommands. Can be tab-completed via compdef. That's what I went with.

Cost: lives in your .zshrc (or a sourced module file). Not portable to bash users without rewriting. I'm fine with that — I'm not shipping this to other people.

Decision 2: `cc` vs `cco`

I picked cc first. Two letters, mnemonic for "Claude Code". I almost committed it.

Then I checked my aliases file. cl was already taken by cargo clippy --all-targets. Fine, I wasn't using cl anyway. But that made me look at cc more carefully.

cc on macOS lives at /usr/bin/cc and points at the system C compiler. I have /opt/homebrew/opt/llvm/bin ahead in $PATH, but which cc still resolves to system clang. A zsh function would shadow it: functions take precedence over $PATH lookups in interactive shells.

The argument for shadowing it anyway: I never type cc directly to invoke a compiler. Cargo, CMake, Make — they all call it programmatically.

The counter-argument: programmatic calls happen via execvp, which doesn't see shell functions. But — Rust's cc crate (used by openssl-sys, ring, zstd-sys, and a thousand other dependencies) sometimes invokes cc through shell wrappers in build scripts. The probability of hitting this is low. The debugging cost when it does happen — staring at a ring build failure that makes no sense — is high.
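The precedence rules behind this argument can be checked directly. A minimal sketch, using uname as a harmless stand-in for cc (the helper name shadow_demo and the "shadowed" string are illustrative, not part of the wrapper): a function wins over $PATH inside the shell that defines it, while `command` and a child process started through env both bypass it.

```shell
# Illustrative only: shadow a real binary with a function, then compare
# three ways of invoking the same name.
shadow_demo() {
  uname() { echo "shadowed"; }          # stand-in for a cc() wrapper
  echo "plain:   $(uname)"              # function wins in this shell
  echo "command: $(command uname)"      # "command" skips functions
  echo "env:     $(env uname)"          # child found via $PATH, no functions
  unset -f uname                        # remove the shadow again
}

shadow_demo
```

Only the first line reports "shadowed"; the other two print whatever the real uname returns on your machine.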

Renamed to cco. Three keystrokes instead of two. Worth it.

Lesson: before claiming a short command name, grep your aliases file and run type <name>. Two minutes of due diligence beats an hour of "why won't this crate build."
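The due-diligence step can be scripted. A small sketch, assuming a hypothetical helper called name_is_free; it relies on `command -v`, which reports aliases, functions, builtins and binaries in zsh. Run it in your interactive shell so your aliases are actually loaded.

```shell
# Hypothetical helper: report whether a candidate command name is already
# claimed by an alias, function, builtin, or binary on $PATH.
name_is_free() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "taken: $1 -> $(command -v "$1")"
    return 1
  fi
  echo "free:  $1"
}

# Check a shortlist in one pass; "|| true" keeps going past taken names.
for candidate in cc cl cco; do
  name_is_free "$candidate" || true
done
```

The output depends entirely on your machine, which is the point: the shortlist gets settled by elimination before any name ends up in muscle memory.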

Decision 3: System prompt lives in a separate file

Claude Code accepts --append-system-prompt "string". Tempting to inline it in the function. Don't.

System prompts grow. Mine started as three lines (anti-sycophancy, confidence marking, counterargument-first) and is now closer to thirty. Editing thirty lines inside a shell function is painful — escaping, line continuation, no syntax highlighting for the content.

I put mine in ~/.config/claude/system-prompt.md. The function reads it at invocation:

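# Path to the prompt file; edits are picked up on the next invocation
local sys_prompt="${HOME}/.config/claude/system-prompt.md"
# Bail out early, with the message on stderr, if the file is missing
[[ ! -f "$sys_prompt" ]] && { echo "✗ System prompt not found: $sys_prompt" >&2; return 1; }
# $(<file) reads the whole file without spawning cat
local prompt_content="$(<"$sys_prompt")"
# ... later ...
claude --append-system-prompt "$prompt_content" ...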

Three benefits:

Edit in your real editor. Markdown syntax highlighting. Spell check. Git diff when you tweak it.
Separate from code. Different lifecycle. I commit my zsh modules to a public dotfiles repo. My system prompt I might not — it contains opinions I haven't published yet.
Reload without sourcing. Edit the file, next cco invocation picks up the change. No source ~/.zshrc.

The trade-off: one more file dependency. If the file is missing, the function bails out with an error. Acceptable.

Decision 4: Subcommands instead of flags

The function dispatches on the first argument:

case "$sub" in
  plan)   ...  # read-only analysis
  safe)   ...  # dontAsk + tight whitelist
  review) ...  # ultrareview
  resume) ...  # session picker
  here)   ...  # current branch, no worktree
  run|*)  ...  # default: worktree + tmux + acceptEdits
esac

I considered cco --plan, cco --safe, etc. Two reasons against:

Flag parsing collides with Claude's flags. cco --plan could mean "wrapper's plan mode" or "pass --plan to claude" (which doesn't exist, but the parsing logic gets ambiguous fast).
Subcommands compose better with tab completion. cco <Tab> shows the menu. cco --<Tab> would dump every claude flag.

The default case is run|* — bare cco or cco "some prompt" both work. The run keyword exists mostly so tab completion has something to show in the menu for the default.

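There's one edge case I left in: an unquoted cco plan my vacation matches the plan) branch because the first argument is plan. (A quoted cco "plan my vacation" arrives as a single argument, so it only collides if the dispatch looks at the first word rather than the whole argument.) If anyone ever hits this, cco run "plan my vacation" is the workaround. I judged the collision rare enough not to care.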
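The dispatch logic is easier to reason about with the claude calls stubbed out. A runnable sketch under stated assumptions: cco_mode is a hypothetical stand-in that only prints which branch was taken, and it compares the whole first argument, which may not be exactly what the real wrapper does.

```shell
# Hypothetical stand-in for the wrapper's dispatcher: prints the chosen
# mode and the remaining prompt instead of launching claude.
cco_mode() {
  local sub="${1:-run}"        # bare invocation falls through to run
  case "$sub" in
    plan|safe|review|resume|here)
      shift
      echo "mode=$sub prompt=$*"
      ;;
    run)
      if [ "$#" -gt 0 ]; then shift; fi   # drop the explicit keyword
      echo "mode=run prompt=$*"
      ;;
    *)
      echo "mode=run prompt=$*"           # anything else is a prompt
      ;;
  esac
}

cco_mode                          # mode=run prompt=
cco_mode "fix the null check"     # mode=run prompt=fix the null check
cco_mode plan my vacation         # mode=plan prompt=my vacation
cco_mode run "plan my vacation"   # mode=run prompt=plan my vacation
```

With whole-argument matching, the explicit run keyword is the escape hatch whenever a prompt starts with a subcommand name.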

Decision 5: `--tmux` in default mode

This one I want to be honest about: I almost left tmux out, because I assumed nobody would want yet another tmux session per Claude invocation.

I asked myself point-blank: do you live in tmux? Yes. Default stays tmux-on.

If you don't live in tmux, the value proposition collapses. --tmux only matters if:

You want to detach the session and reattach later from another shell.
You want multiple Claude tasks running in parallel, switchable from one terminal.
You SSH into your dev machine sometimes.

If none of those apply, --tmux just leaks tmux sessions. After a week of work you'll have 40 zombie sessions in tmux ls. Skip it.

I added a cleanup alias just in case:

alias cco-cleanup='tmux ls 2>/dev/null | grep "^cco-" | cut -d: -f1 | xargs -I{} tmux kill-session -t {}'
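Before killing anything, a dry run is cheap. A sketch with two hypothetical helpers: cco_sessions filters session names by the same cco- prefix the alias assumes, and cco_cleanup_preview feeds it from tmux's format string instead of cutting the human-readable tmux ls output.

```shell
# Hypothetical helper: read session names (one per line) on stdin and
# print only those carrying the wrapper's prefix.
cco_sessions() {
  grep '^cco-' || true   # grep exits 1 on no match; treat that as "none"
}

# Dry run: list what the cleanup would kill, without killing it.
cco_cleanup_preview() {
  tmux list-sessions -F '#{session_name}' 2>/dev/null | cco_sessions
}

# The filter can be exercised without a running tmux server:
printf 'cco-auth\nmain\ncco-docs\n' | cco_sessions
```

The last line prints cco-auth and cco-docs and skips main.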

Decision 6: Worktree by default, "here" mode as escape hatch

--worktree creates a separate git worktree per invocation. Claude works on a parallel branch in a parallel directory. Your main checkout is untouched.

The upside is real, especially on probation at a new job: Claude can refactor aggressively, and if it goes sideways, I just git worktree remove and nothing happened. No git stash, no "wait what state was I in", no panic.

The downside: sometimes you don't want isolation. You're mid-task, files open in VSCode, mental model loaded. You want Claude to fix one bug here, not in a parallel reality.

So I added a here subcommand:

here)
  shift
  local branch=$(git symbolic-ref --short HEAD 2>/dev/null || echo 'detached')
  echo "📍 here mode — current branch ($branch), no worktree"
  claude --permission-mode acceptEdits \
         --append-system-prompt "$prompt_content" \
         "$@" 2>&1 | tee "$log_file"
  ;;

Same system prompt, same acceptEdits, same logging. But no worktree, no tmux. Drop in, do the thing, drop out.

Use cco for "go change the auth-store architecture." Use cco here for "fix the null check on line 42."

Decision 7: `caffeinate -is` wrapping the default mode

macOS clamshell sleep ruins long-running agent tasks. Close the lid, fetch tea, come back — the task has been paused since you walked away.

caffeinate -is keeps the system awake (-s) and prevents idle sleep (-i) for the duration of the wrapped process. When Claude exits, caffeinate releases its assertion. No leaked state.

caffeinate -is claude --worktree "$wt_name" --tmux ... "$@"

Honest limitation: caffeinate -s only works while on AC power. Apple's SMC enforces clamshell sleep on battery regardless of what userland says. There's no way around it without third-party kexts I would never install.

So: lid closed + AC + (external display OR keyboard) → works via standard clamshell mode. Lid closed + battery → sleeps no matter what. I tell people up front, because the alternative is them thinking the wrapper is broken.

I only added caffeinate to the default mode, not to plan, here, safe, or review. Reasoning: the other modes are short-lived. Default mode (worktree + long refactor) is where caffeination earns its keep.
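If the same dotfiles ever land on a machine without caffeinate, a guard keeps the default mode working. A minimal sketch, assuming a hypothetical keep_awake helper; the wrapper described here calls caffeinate directly.

```shell
# Hypothetical helper: run a command under caffeinate when it exists,
# otherwise run it unchanged (for example on Linux or in CI).
keep_awake() {
  if command -v caffeinate >/dev/null 2>&1; then
    # -i: no idle sleep; -s: no system sleep (honoured on AC power only)
    caffeinate -is "$@"
  else
    "$@"
  fi
}

keep_awake echo "still awake"
```

Because caffeinate runs the command as its child, the assertion is released as soon as the command exits, with or without the guard.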

Decision 8: Tee everything, except interactive pickers

Each invocation logs to ~/claude-logs/<timestamp>_<projectname>.log via tee:

claude ... "$@" 2>&1 | tee "$log_file"

This gives me a searchable history without relying on Claude's internal session storage. When something goes wrong three days later — "wait, what did Claude say about the auth refactor on Tuesday" — I rg the logs.

Exception: cco resume uses Claude's interactive session picker. Piping that through tee breaks the picker's TUI rendering. No log for resume. I considered fixing it with script(1) but that's a yak shave for a feature I'd use rarely.
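The log path itself is a one-liner worth pulling into a helper. A sketch, assuming a hypothetical cco_log_path function; the exact timestamp format is a guess, since the post only gives the <timestamp>_<projectname>.log shape.

```shell
# Hypothetical helper: build <dir>/<timestamp>_<project>.log and make
# sure the directory exists. The timestamp format is an assumption.
cco_log_path() {
  local dir="${1:-$HOME/claude-logs}"
  mkdir -p "$dir" || return 1
  echo "$dir/$(date +%Y-%m-%d_%H%M%S)_$(basename "$PWD").log"
}

# Usage with a stand-in command; swap echo for the real invocation.
log_file="$(cco_log_path "${TMPDIR:-/tmp}/claude-logs-demo")"
echo "hello from the agent" 2>&1 | tee "$log_file"
```

Sortable timestamps first and project name second keep ls output chronological, and rg over the directory still finds everything for one project by filename.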

What I'd do differently

If I were starting over:

Pick the name first, by elimination, not by aspiration. I lost 15 minutes flip-flopping between cc, cl, and cco. Running type <candidate> against four options up front would have settled it immediately.
Write the system prompt before the wrapper. The wrapper is plumbing. The system prompt is the actual leverage — that's where you tell Claude how to think. I built the wrapper first because it was the fun part. Wrong order.
Don't add features speculatively. I almost added a --wide flag for --add-dir to pull in shared types and notes directories. I cut it before writing it. Six months in, I still don't need it. Good cut.

The wrapper

Full code: gist.github.com/IgorKramar/9b4c698909047934ee8e5dd775e94ebc

If you build something similar, you'll make different decisions. Some of mine were context-specific (probation at a new job → worktree isolation matters more), some are tooling-specific (tmux user → --tmux default). The point isn't to copy the code. The point is: when your wrapper hits 60 lines, every line should be a deliberate choice, not a default someone else's tutorial gave you.

Mechanistic Interpretability is a 2026 Breakthrough Technology. Here's What That Means for the "LLMs Are Just Matrix Multiplication" Debate

Igor Kramar — Sun, 10 May 2026 04:37:08 +0000

Today a friend of mine — let's leave him nameless — said the line I've been hearing since 2022: "It's still just matrices multiplying, guessing the most probable next word." For a long while I had no good rebuttal beyond intuition.

I've been using LLMs as a daily tool since the original ChatGPT shipped in November 2022. I've gone through the lot — GPT-3.5, GPT-4, every Claude, Gemini, DeepSeek, Qwen, Mistral, the local Gemma stack on LM Studio. I write Claude plugins, build automation pipelines, run agents in production. I'm a frontend developer by trade, but AI tooling has become roughly half of what I do these days.

So when someone tells me LLMs are just statistics, four years of practitioner intuition tell me something doesn't add up. But intuition is a rather poor argument. I went looking for what the research actually says in 2026. This piece is what I found.

The short version: the "just matrix multiplication" framing was reasonable in 2021. In 2026, it isn't. There's a parallel story about cognitive offloading and what happens to the user's mind when leaning on these tools — and there the sceptics are largely right. Both stories matter, and they're often muddled together.

Where "Stochastic Parrots" Came From

The phrase comes from Bender, Gebru, McMillan-Major and Shmitchell's 2021 paper, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? It's worth reading in full — foundational work, and the critique it raises about bias, environmental cost, and hype hasn't aged a day.

The technical claim was narrower than how it's usually quoted. The authors argued that language models, as understood at the time, were stitching together linguistic forms from training data without any underlying model of meaning or the world the language refers to. Given what was visible from outside in 2021 — GPT-3 had only just shipped, interpretability tooling was primitive — this was a fair description.

The phrase then became a kind of shibboleth. People who wanted to deflate AI hype would invoke "stochastic parrots" as shorthand. People who wanted to push back called the framing reductive. Both camps mostly stopped reading the paper itself and started using the phrase as a flag.

What's changed since 2021 isn't whether the original critique was correct. What's changed is that we now have direct empirical access to what's happening inside these models. And what's inside doesn't square with the "no world model, just surface statistics" picture.

The First Crack: Othello-GPT (2022)

Kenneth Li and colleagues at Harvard published Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (https://arxiv.org/abs/2210.13382). They trained a small GPT-style transformer on a single task: given a sequence of Othello moves, predict the next legal move. The model was given no rules, no board, no description of the game whatsoever. Just sequences.

After training, the model played legal moves with high accuracy. The researchers then probed its internal activations and found a representation of the current board state — not in the input, not in the output, but constructed inside the model's residual stream. They confirmed this causally: by editing the internal representation of a single square, they could change the model's predicted next move in exactly the way you'd expect if it were "looking at" a modified board.

Neel Nanda extended the work in 2023 (https://arxiv.org/abs/2309.00941), showing the board representation was actually linear — recoverable with simple probes and editable with vector arithmetic. Adam Karvonen reproduced the result on real chess games in 2024 (https://arxiv.org/abs/2403.15498).

This isn't surface statistics. A pure n-gram model trained on Othello move sequences would develop conditional probabilities over move tokens, full stop. That a transformer trained on the same data builds a board internally, and uses it causally, is a hard empirical fact about what next-token prediction can produce as a side effect of optimisation pressure.

A small model on a toy domain, granted. But it's a proof of concept: "predict the next token" is a sufficient training signal for representations of the underlying generative process to emerge as a by-product.

Scaling Up: Anthropic's Interpretability Work (2024–2026)

Anthropic ran the same playbook on a production model. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (https://transformer-circuits.pub/2024/scaling-monosemanticity/) used sparse autoencoders to extract millions of interpretable features from a real production model. The features were abstract, multilingual, multimodal — concepts like "Golden Gate Bridge," "code with security vulnerabilities," "sycophantic praise," each represented as an identifiable direction in the model's activation space.

The follow-up papers in 2025 went further. Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model (https://transformer-circuits.pub/2025/attribution-graphs/biology.html) didn't just identify features — the team traced how features connect into circuits performing computation across layers. The findings:

When the model writes a poem with a rhyme scheme, it picks the rhyming target word before generating the line, then plans backward to that target. That's planning, not left-to-right generation.
Asked "what's the capital of the state Dallas is in," the model first activates a Texas representation as an intermediate step, then retrieves Austin from there. Multi-hop reasoning through internal state, not direct lookup.
In medical scenarios, the model forms an internal candidate diagnosis that influences its follow-up questions, even when the diagnosis is never stated aloud.

In January 2026, MIT Technology Review named mechanistic interpretability one of its 10 Breakthrough Technologies of the year (https://www.technologyreview.com/2026/01/12/1130003/mechanistic-interpretability-ai-research-models-2026-breakthrough-technologies/). That isn't an Anthropic press release — it's the field reaching the point where working tools for looking inside have actually arrived.

And critically, those tools are open. Anthropic released the circuit-tracer library, which works on open-weight models like Gemma-2-2b and Llama-3.2-1b. The Neuronpedia community platform (https://www.neuronpedia.org) hosts features and circuits for open models you can browse by hand. You needn't take anyone's word for it — run it yourself and have a look. Subhadip Mitra's January 2026 write-up (https://subhadipmitra.com/blog/2026/circuit-tracing-production/) frames this as interpretability shifting from "interesting research direction" to "practical engineering discipline."

If you want to see this for yourself, the interactive notebooks at https://transformer-circuits.pub and Neuronpedia are the obvious starting points. Pick a feature, see what activates it, try to break the explanation. The most efficient cure for "it's just matrices" I know.

So Is the "Just Prediction" Argument Dead?

Not quite. Intellectual honesty matters here, and I'd be writing a worse piece if I pretended the sceptic camp had nothing left.

Schaeffer, Miranda, and Koyejo's Are Emergent Abilities of Large Language Models a Mirage? (https://arxiv.org/abs/2304.15004) showed that some claimed "emergent" capabilities are partly artefacts of metric choice — swap a hard accuracy threshold for a smooth metric and the discontinuous jump disappears. The original Wei et al. emergence paper still holds for some capabilities, but the picture is more nuanced than "abilities suddenly appear at scale."

Chain-of-thought rationales aren't always faithful to the actual computation that produced an answer. Models hallucinate. They have no online learning and no persistent memory between sessions outside of explicit memory systems. Much of Bender et al.'s critique about bias and hype landed hard and still stands.

What's gone is the strong version: "LLMs only learn surface statistics, no internal model of the generating process exists." That claim is empirically falsified — by Othello-GPT, by Anthropic's circuits work, by Karvonen's chess results, by the linear spatial representations Tehenan et al. (https://arxiv.org/abs/2506.02996) found in LLM activations.

The interesting questions have moved on. They're now about which aspects of the world models are accurate, when models reason faithfully versus rationalise, how to use interpretability tools to debug failures. "Are they really thinking?" was the 2021 question. It isn't the 2026 question.

The Other Side: Cognitive Offloading is Real

Here the sceptics are mostly right, and we AI enthusiasts — myself included — have to be honest about it.

Lee et al. at Microsoft Research and Carnegie Mellon (CHI 2025, https://www.microsoft.com/en-us/research/publication/the-impact-of-generative-ai-on-critical-thinking-self-reported-reductions-in-cognitive-effort-and-confidence-effects-from-a-survey-of-knowledge-workers/) surveyed 319 knowledge workers using GenAI weekly, with 936 first-hand task examples. Higher confidence in the AI was associated with less critical thinking. Higher confidence in one's own expertise was associated with more. The mechanism is cognitive offloading: when you trust the tool, you stop checking, and over time you stop developing the skill that would let you check in the first place.

Gerlich (2025, Societies, https://www.mdpi.com/2075-4698/15/1/6) found the same pattern in a separate 666-participant study. Lodge and Loble at the University of Technology Sydney published a March 2026 review of cognitive offloading in education (https://www.uts.edu.au/news/2026/03/experts-warn-unstructured-ai-use-in-schools-risks-cognitive-atrophy), warning that unstructured AI use in schools risks cognitive atrophy.

But — and the story gets rather more interesting here — there's a strong counter-observation from a recent BCG study.

Randazzo, Lifshitz, Kellogg, Dell'Acqua, Mollick, Candelon and Lakhani — Cyborgs, Centaurs and Self-Automators: The Three Modes of Human-GenAI Knowledge Work (HBS Working Paper 26-036, December 2025, https://ssrn.com/abstract=4921696). 244 BCG consultants tracked through a seven-stage strategic problem-solving workflow. Three empirically distinct modes of AI use emerged:

Cyborgs (~60%) weave their work deeply with the AI across the entire workflow. Iterative dialogue, AI personas, using it to shape both the problem and the solution. They develop new AI-related capabilities — what the authors call newskilling.
Centaurs (14%) maintain a clear division of labour: humans decide what to do and how, AI is used selectively for specific support tasks. This group produced the highest accuracy in business recommendations, outperforming both other modes.
Self-Automators (the remainder) copy data into AI, tweak slightly, paste it back. No skill gains at all. This is the offloading-into-atrophy pattern in its pure form.

So the data immediately tells you that "using AI" is too coarse a category. Cyborgs and centaurs both grow, just along different trajectories. Self-automators lose out.

There's a classic counter-effect in cognitive load theory that complements this: Sweller's worked example effect (1985, replicated dozens of times). For novices in a domain, studying a worked-out solution is often more effective than struggling through problem-solving from scratch — because reduced cognitive load frees working memory for schema acquisition. When I watch Claude work through an architectural problem in a domain I'm new to, then attempt a similar problem myself, I'm running the worked-example pattern. That's the opposite of offloading.

The difference between the two patterns comes down to one thing: whether you alternate observation with active retrieval. Watching Claude solve and nodding along is offloading. Watching, predicting the next step, attempting a variant yourself, comparing — that's learning.

A Small Concrete Example

Yesterday I was setting up internet at my dacha. Weak LTE, antenna pointing the wrong way, no obvious solution. I sent Claude the coordinates of my plot and the base station, plus a photo of the antenna location with a compass and a timestamp visible in the EXIF.

Claude calculated the initial bearing from the two coordinates using the great-circle bearing formula (the haversine formula proper gives distance, not direction) and got 226°. Then it noticed something: the EXIF timestamp said 17:03, and at that hour in early May in Omsk, the sun should be at roughly 225–230° azimuth. The compass in the photo confirmed the direction matched. All three checks lined up.

There's no magic in this. It's three small inferential steps glued together: a geographic calculation, a known fact about solar position, a visual cross-check. I could have done each step myself, in three different tools, in twenty minutes, with a calculator and a sun-position website. Claude did it in one prompt as a coherent answer.

That's the concrete shape of what these tools are good at right now. Not "thinking" in any mystical sense. Integration of small inferential steps across domains, fast enough that the overhead of doing it yourself stops being worth paying. And it's exactly where the offloading risk lives — because the next time I have a similar problem, I shan't reach for the calculator. I'll reach for Claude.
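For anyone who wants to keep the calculator habit alive, the geographic step from the example above fits in a few lines. A sketch of the standard initial great-circle bearing, wrapped in a hypothetical bearing helper that takes latitude and longitude of both points in decimal degrees; the coordinates below are test values, not the ones from the dacha.

```shell
# Hypothetical helper: initial great-circle bearing from point 1 to
# point 2, in degrees clockwise from north.
bearing() {
  awk -v lat1="$1" -v lon1="$2" -v lat2="$3" -v lon2="$4" 'BEGIN {
    rad = atan2(0, -1) / 180              # pi / 180
    p1 = lat1 * rad; p2 = lat2 * rad
    dl = (lon2 - lon1) * rad
    y = sin(dl) * cos(p2)
    x = cos(p1) * sin(p2) - sin(p1) * cos(p2) * cos(dl)
    deg = atan2(y, x) / rad
    if (deg < 0) deg += 360               # normalise to 0..360
    printf "%.1f\n", deg
  }'
}

bearing 0 0 0 1    # due east:  90.0
bearing 0 0 -1 0   # due south: 180.0
```

Doing this once by hand and comparing against the model's number is the retrieval half of the worked-example pattern.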

What This Means for Practitioners

If you write code with LLM assistance every day, the "just matrices" framing isn't useful for thinking about what you're working with. It isn't wrong at the hardware level — it's simply not at the right level of description for the questions that actually matter to you. Your CPU is "just transistors switching," and that fact doesn't help you reason about your application either.

The framings that actually predict behaviour in 2026:

LLMs build internal representations of the structures generating their training data. Those representations can be accurate, partial, or distorted, and you can sometimes inspect them with interpretability tools.
Their reasoning is partial, sometimes parallel, often faithful but not always. Treat chain-of-thought as evidence, not proof.
Cognitive offloading is real and measurable. The tools grow you when you contest their outputs and shrink you when you trust them.
The question "use AI or don't" is the wrong one. The right question is which of the BCG modes you operate in, and whether you're choosing it consciously.

If you want to dig deeper, three concrete starting points: read On the Biology of a Large Language Model end to end (it's HTML with interactive diagrams). Spend an hour on Neuronpedia trying to break the feature explanations. Read the Lee et al. Microsoft paper, then watch yourself for a week and notice when you skip the verification step.

Closing

The 2021 debate about whether LLMs were "really thinking" was good for its time. The 2026 debate is about which patterns of human-AI interaction grow human capability and which atrophy it. The first debate had no decent empirical handles. The second one does, and they point in fairly clear directions.

When someone tells you "it's still just matrices," send them this article — or better, send them to Neuronpedia and ask them to explain what they're looking at. The conversation usually shifts from there.

I built a Claude Code plugin that argues with me about architecture. Then it caught me lying to it.

Igor Kramar — Fri, 08 May 2026 20:29:24 +0000

TL;DR. I built a Claude Code plugin (MIT, github.com/IgorKramar/archforge-marketplace) that turns Claude into a senior architect. After running one deep cycle on a real architectural decision, you get back: a multi-page ADR with explicit architectural rules, two or three honest alternatives with trade-offs, and a five-role adversarial roast. On one of my own ADRs the roast found a confused-deputy attack vector at the LLM-tool boundary that I'd missed — flagged independently by both the security role and the compliance role. The plugin then failed to follow its own language rules in a specific, instructive way. This article is what I learned, including how the failure led to the most useful feature I added.

I've been using Claude as a daily collaborator for about a year. For most things — code, writing, debugging — it's excellent. But I noticed a specific failure mode in architectural conversations. I'd say "I'm thinking of using Postgres for this." Claude would say "Great choice, here's why Postgres works." I'd say "Actually, let me reconsider, maybe SQLite." Claude would say "Excellent, SQLite is perfect for this case, here's why." Both answers couldn't be right. They couldn't even both be useful.

Architecture is one of the few engineering activities where disagreement is the work. You need someone who refuses to let you skip ahead. Someone who insists on alternatives. Someone who points at the downside you don't want to hear. Default LLM tuning is the opposite of that.

So I built a plugin. It's MIT-licensed, the repo is at the end. This article is what I learned building it and putting it through a real architectural cycle on a real project — including the part where the plugin caught itself failing, and how that led to the most interesting feature in the whole thing.

Why a cycle, not a template

Most AI-architecture tooling I'd seen was templates. "Here's an ADR template. Fill it in." But ADRs are the artifact, not the work. A good ADR is the residue of a process that included three things you can't fake by filling in a template:

Discovery — making constraints, forces, and prior art explicit before you allow yourself to think about solutions.
Real alternatives — not "the option I want" plus a strawman, but two or three genuine options each with honest trade-offs.
Push-back — someone willing to argue, especially when your first instinct is to skip ahead.

So I structured the plugin around a six-phase cycle:

1. DISCOVER  → constraints, forces, prior art, requirements
2. RESEARCH  → current information from the web (versions, prices, regulation)
3. DESIGN    → 2–3 alternatives, each with trade-offs
4. DECIDE    → pick one, state why, state when it breaks
5. DOCUMENT  → ADR + update ARCHITECTURE.md + diagrams
6. REVIEW    → architectural review when code lands

Each phase is a slash command (/archforge:discover, :research, etc.). There's /archforge:cycle "<problem>" that walks the whole thing with user gates between phases. The phases enforce discipline. You can't propose solutions in discover. You can't introduce new alternatives in decide. You can't write an ADR until you've actually decided. The structure forces the conversation to slow down where it matters.

The router skill — the one that activates whenever the user mentions architecture or design — has a section that turned out to matter more than anything else:

### Hold position. Argue. Don't soft-cave.

If the user proposes something you consider weak, say so directly and argue.
Do not collapse at the first pushback. Soft pushback that folds is a form of
disrespect — the user came for honest critique, not validation.
Maintain the position until presented with a real counter-argument.

That paragraph carried more weight in practice than I expected.

Want to try it before reading the story?

Skip ahead if you want to know what it does first. But if you'd rather see for yourself:

# In Claude Code:
/plugin marketplace add https://github.com/IgorKramar/archforge-marketplace
/plugin install archforge@archforge-marketplace

# In any project (preferably one with an open architectural question):
/archforge:init
/archforge:cycle "should I extract this module into its own service?" --scale=light

A light-scale cycle takes about 10 minutes, walks you through Discover → Design → Decide → Document with pauses for your input, and produces a short ADR with two alternatives and explicit reasoning. If you find the result useful, run a --scale=deep cycle on a real decision next.

The rest of this article is the story of what the cycle produced when I ran it on a hard problem in my own project, and what failed when I did.

What the plugin actually contains

Three layers of components, but the structure isn't the interesting part — what each layer does is.

Skills — knowledge and disposition. Ten of them. A router skill that sets the architect persona. Specialists for ADR writing, system design, frontend architecture, backend architecture, AI agents architecture, code review, web research, an integration skill for Compound Engineering, an architectural-diagrams skill (C4, sequence, state, ER, deployment — all Mermaid). Skills are markdown files Claude pulls in when their description matches the current task.

Slash commands — 17 of them in v0.4. The cycle phases. Plus shortcuts (/archforge:adr), plus tooling (/archforge:map for a decision dependency graph, /archforge:observe for finding architectural decisions made implicitly in code, /archforge:upgrade for migrating projects between plugin versions, /archforge:diagram for any of the five diagram types).

Sub-agents — nine. Three for long-running tasks (architect, reviewer, researcher). Five for adversarial review — but I'll get to those, because they're where the story turns. And one for catching the plugin failing itself. That last one is the entire point of this article.

Plus soft-warning hooks that nudge you when you've changed many files without an ADR, when a new top-level directory appears, when many modules have been touched without architectural observation. Reminders, not gates.
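The "many files changed, no ADR" nudge is simple enough to sketch. This is not the plugin's implementation: adr_nudge is a hypothetical function, and the threshold of 10 and the docs/adr/ directory are illustrative defaults. It reads changed paths on stdin and always exits 0, which is what makes it a reminder rather than a gate.

```shell
# Hypothetical check: read changed paths (one per line) on stdin and print
# a reminder when many files changed but nothing under the ADR directory did.
adr_nudge() {
  local threshold="${1:-10}" adr_dir="${2:-docs/adr/}"
  awk -v max="$threshold" -v adr="$adr_dir" '
    NF { n++ }                            # count non-empty lines
    index($0, adr) == 1 { touched = 1 }   # path starts with the ADR dir
    END {
      if (n >= max && !touched)
        printf "note: %d files changed without an ADR update\n", n
    }'
  return 0   # a reminder, never a gate
}

# Typical feed: paths changed relative to HEAD.
# git diff --name-only HEAD | adr_nudge 10 docs/adr/
printf 'src/a.ts\nsrc/b.ts\nsrc/c.ts\n' | adr_nudge 3 docs/adr/
```

The demo line prints the note because three files changed and none of them sits under docs/adr/.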

The case study: an AI agent for a regulated market

I'm building a SaaS product — call it "the project" for this article — an AI-driven landing page builder targeting small businesses in a regulated jurisdiction where strict data residency requirements rule out US-hosted LLM providers. Florists, barbers, coffee shops. The differentiation is that an AI agent walks the user from "I want a website" to "the website is published and I can see analytics" without the user having to know design, code, or AI prompting.

The first architectural decision was the AI agent layer. This is hard because everything else depends on it: the backend's role, the data model, the load profile, the legal posture. And there are hard constraints — local data-protection law regulates personal data processing, and Anthropic's API isn't reachable from the target jurisdiction's infrastructure, which makes Claude itself problematic for the project technically and legally.

I ran /archforge:cycle "AI agent architecture for the editor". Below is what happened, condensed.

Discover, round 1 — and the constraint I'd missed

The discovery phase produced a structured document: restated problem, functional requirements, quality attributes, constraints (team: solo developer, budget: startup, compliance: local data-protection law), prior art (Lovable, v0, Cursor, regional AI website builders), forces, and six open questions with polar options.

I answered them. Then I noticed I'd written "Anthropic API only, server-side" as a constraint, and the plugin had built the rest on top of that. But Anthropic isn't reachable from infrastructure in the target jurisdiction, and sending end-user personal data to US servers violates the local data-protection law.

I pointed this out. The plugin produced a second round of discover — clearly labeled "Section 7: Second round of discover (after user input)". It:

Acknowledged the gap honestly: "this should have come up in round one; I accepted the user's premise without verifying it."
Expanded alternatives to four LLM-provider categories: regional commercial APIs (regulator-certified), self-hosted open-source, Anthropic via tunnel (and explicitly flagged it as not solving the data-residency requirement — data still physically leaves the jurisdiction), and a hybrid.
Pushed back on two of my answers to the original questions.

On Q3 (depth of business context) I'd answered "profile + dynamic memory (RAG)." The plugin pushed back: "Profile is cheap — 5–10 fields in the DB. Dynamic memory is a separate module with embedding provider, vector store, retrieval, ranking, periodic re-indexing. Is dynamic memory needed from day one of MVP, or is this clearly a V2 milestone? If V2 — fix it: profile now, dynamic memory in a later ADR."

I said V2.

On Q5 (evaluation), I'd asked for a full observability stack with weekly session labeling. The plugin pushed back: "Full stack with weekly labeling of 30-50 sessions is 3-5 hours per week of your time plus extra infrastructure. For a single developer, that either eats 5-10% of your time or becomes ceremony. Realistic path: log all LLM calls and tool calls structurally from day one (this is non-negotiable), defer the labeling and observability layer until you have ≥100 real sessions to label."

I agreed. Both push-backs ended up in the final ADR as explicit V2 commitments.

This was a senior engineer telling me that what I was asking for was disproportionate to my situation, and being right about it. The "Hold position. Argue. Don't soft-cave" instruction was earning its place.

Research, design, decide

Research surfaced 50+ dated sources on regional LLM providers, their certification posture, pricing, and capability ceilings. Design produced three alternatives with honest trade-offs — Alt 1 minimalist, Alt 2 gateway with redundancy, Alt 3 full agent platform — and a 15-row comparison matrix. The plugin recommended Alt 2, not Alt 3 (which scored higher on coverage). The reasoning: 5-6 weeks to MVP for a single developer is too long; many of Alt 3's features are premature.

I picked Alt 2. The plugin produced ADR-0001 with seven explicit architectural rules, among them a PromptProvider trait, a tool registry as enum + struct, a dedicated Orchestrator module, a current_state column from day 1, structured logging with zero-cost null fields for evaluation, and vector storage installed but embedding() returning unimplemented!() until a future ADR.

Real ADR. The kind I'd write if I'd spent an afternoon on it.
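To make a few of those rules concrete, here is a minimal Rust sketch. It is not the ADR's actual code. The names PromptProvider, chat_completion(), embedding(), required_mode and read_state come from the article; StaticPromptProvider, update_section, and every signature are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Rule: prompts come from a trait, so a DB-backed provider can replace a
/// file-backed one later without touching callers.
trait PromptProvider {
    fn system_prompt(&self, role: &str) -> Option<String>;
}

/// Stand-in for a YAML-backed provider; a real one would parse a file.
struct StaticPromptProvider {
    prompts: HashMap<String, String>,
}

impl PromptProvider for StaticPromptProvider {
    fn system_prompt(&self, role: &str) -> Option<String> {
        self.prompts.get(role).cloned()
    }
}

/// Rule: the tool registry is a closed enum plus a descriptor struct.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tool {
    ReadState,
    UpdateSection, // hypothetical tool name
}

#[allow(dead_code)]
struct ToolSpec {
    tool: Tool,
    name: &'static str,
    /// Reserved field; always None in this sketch.
    required_mode: Option<&'static str>,
}

static REGISTRY: [ToolSpec; 2] = [
    ToolSpec { tool: Tool::ReadState, name: "read_state", required_mode: None },
    ToolSpec { tool: Tool::UpdateSection, name: "update_section", required_mode: None },
];

/// Unknown tool names resolve to None instead of being executed.
fn lookup_tool(name: &str) -> Option<Tool> {
    REGISTRY.iter().find(|s| s.name == name).map(|s| s.tool)
}

/// Rule: the gateway exposes embedding() from day one, but it stays
/// unimplemented until a later ADR defines it.
struct Gateway;

impl Gateway {
    fn chat_completion(&self, system: &str, user: &str) -> String {
        // Placeholder: a real gateway would call a provider here.
        format!("[{system}] echo: {user}")
    }

    #[allow(dead_code)]
    fn embedding(&self, _text: &str) -> Vec<f32> {
        unimplemented!("deferred to a future ADR")
    }
}

fn main() {
    let mut prompts = HashMap::new();
    prompts.insert("editor".to_string(), "You help build landing pages.".to_string());
    let provider = StaticPromptProvider { prompts };
    let system = provider.system_prompt("editor").unwrap_or_default();
    println!("{}", Gateway.chat_completion(&system, "hello"));
    println!("{:?}", lookup_tool("read_state"));
}
```

The point of the shape is that the V2 pieces already have a slot: a second PromptProvider implementation, a populated required_mode, and a real embedding() body can each arrive without changing call sites.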

Here's the C4 Component-level (L3) diagram the plugin produced for the AI module inside the backend, showing the path from client through gateway to LLM providers — the seven architectural rules made structural:

graph TB
    Client["SPA (editor)<br/>via REST/SSE"]

    subgraph Backend["Backend — ai/ module"]
        direction TB
        Orchestrator["<b>Orchestrator</b><br/>run_turn(session, message)<br/>dispatch_to_role() — V2"]
        ToolRegistry["<b>Tool Registry</b><br/>enum + struct<br/>required_mode (V2)<br/>execute_tool()"]
        PromptProvider["<b>PromptProvider</b> trait<br/>YamlPromptProvider — V0<br/>DbPromptProvider — V2"]
        Gateway["<b>LLM Gateway</b><br/>chat_completion()<br/>embedding()<br/>routing-policy"]
        Sanitizer["<b>Sanitizer</b><br/>strips PII<br/>before sending to<br/>secondary provider"]
        StateMachine["<b>SessionState</b><br/>current_state<br/>('active' in V0)"]
        Logger["<b>StructuredLogger</b><br/>provider, latency,<br/>tokens, cost,<br/>eval_label (V2)"]
    end

    subgraph DB["Database"]
        AISession[("ai_session<br/>+ current_state")]
        AIMessage[("ai_message<br/>+ provider, latency,<br/>tokens, cost, eval_*")]
        AIToolCall[("ai_tool_call")]
        BusinessProfile[("business_profile<br/>+ embedding vector NULL<br/>(V2)")]
        Prompts[("prompt_template — V2")]
    end

    subgraph Providers["LLM providers"]
        ProviderA["<b>Provider A</b><br/>regulator-certified<br/>primary, PII-safe"]
        ProviderAClassif["<b>Provider A (Lite)</b><br/>classification"]
        ProviderB["<b>Provider B</b><br/>secondary,<br/>sanitized payloads only"]
        ProviderEmbed["<b>Provider C</b><br/>embeddings"]
    end

    subgraph Future["V2 (same gateway)"]
        SelfHosted["<b>Self-hosted LLM</b><br/>large open model<br/>via local inference server"]
        SelfHostedEmbed["<b>Self-hosted embeddings</b>"]
    end

    Client -->|API call| Orchestrator
    Orchestrator -->|reads<br/>system prompt| PromptProvider
    Orchestrator -->|selects tools| ToolRegistry
    Orchestrator -->|chat_completion| Gateway
    Orchestrator -->|state<br/>transitions| StateMachine
    Orchestrator -->|every turn| Logger

    Gateway -->|PII-sensitive| ProviderA
    Gateway -->|classification / intent| ProviderAClassif
    Gateway -->|sanitized creative<br/>+ failover| Sanitizer
    Sanitizer -->|cleaned text| ProviderB
    Gateway -->|embedding| ProviderEmbed

    Gateway -.->|V2: added<br/>via config| SelfHosted
    Gateway -.->|V2: via config| SelfHostedEmbed

    StateMachine --> AISession
    Logger --> AIMessage
    Logger --> AIToolCall
    Orchestrator -->|reads context| BusinessProfile
    PromptProvider -.->|V2| Prompts

    ToolRegistry -->|read_state<br/>update_*<br/>query_*| Orchestrator

Solid arrows are V0 connections (the initial implementation per ADR-0001). Dashed arrows are V2 — components the gateway and schema are ready to accept without rewriting calling code. The whole "V2" subgraph isn't built yet; the architecture is shaped to absorb it incrementally through later ADRs. The seven rules from the ADR are visible structurally: the PromptProvider trait abstraction, the typed tool registry with reserved fields, the dedicated Orchestrator module, the current_state column from day 1, the gateway's split between chat_completion() and embedding(), the sanitizer pipeline as a guarded path before the secondary provider. None of this was generated as boilerplate — it followed from the discovery and design phases that came before.

Review found three blockers in its own ADR

This was the first thing that genuinely surprised me. The review phase ran the code-review-architectural skill on the same ADR the plugin had just produced. It found three blocking issues:

B-1. The routing policy "PII data → primary provider, non-PII → secondary" is in the ADR. But who decides at runtime which is which? Three possible interpretations: static tool-to-provider mapping, attribute on the prompt, runtime classifier on message content. This is an architectural seam that will diverge in implementation.

B-2. The sanitizer for the secondary channel is described as nontrivial. But the ADR doesn't say what happens when the sanitizer fails. Falls through? Blocks? Falls back to the primary provider? Gap between "the architecture protects the user" and "the developer protects the user."

B-3. Failover between providers in the middle of an unfinished tool-use loop is not addressed. Tool-use loops have multiple round-trips. If the primary provider fails on the third of five round-trips, what happens? The secondary provider's history format is incompatible. The most common failure point in hybrid gateways, not addressed.

I applied fixes for all three. The review document was updated with Status: Applied 2026-05-07 and a closeout block listing each fix. The cycle compounded: the next ADR builds on a corrected base.
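For illustration, here is one way B-1 and B-2 could be closed, sketched in Rust. This is an assumption, not the resolution recorded in the ADR: it picks the "static classification per call" reading of B-1 and a fail-closed answer to B-2, where a sanitizer failure sends the payload back to the primary provider. All type and function names are hypothetical.

```rust
/// Hypothetical data classification, assigned statically by the caller
/// (one of the three interpretations raised in B-1).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataClass {
    Pii,
    NonPii,
}

#[derive(Debug, PartialEq)]
enum Route {
    /// Send the original payload to the primary provider.
    Primary(String),
    /// Send a sanitized payload to the secondary provider.
    Secondary(String),
}

/// Toy sanitizer: refuses any text containing '@' rather than trying to
/// redact it. A real sanitizer would be far more involved.
fn sanitize(text: &str) -> Result<String, String> {
    if text.contains('@') {
        Err("possible e-mail address".to_string())
    } else {
        Ok(text.to_string())
    }
}

/// Fail-closed routing: if sanitization fails, the payload never reaches
/// the secondary provider; it falls back to the primary instead.
fn route(class: DataClass, payload: &str) -> Route {
    match class {
        DataClass::Pii => Route::Primary(payload.to_string()),
        DataClass::NonPii => match sanitize(payload) {
            Ok(clean) => Route::Secondary(clean),
            Err(_) => Route::Primary(payload.to_string()),
        },
    }
}

fn main() {
    println!("{:?}", route(DataClass::NonPii, "write a tagline for a florist"));
    println!("{:?}", route(DataClass::NonPii, "contact anna@example.com"));
}
```

Encoding the decision in the return type means the "what happens when the sanitizer fails" question has exactly one answer in code, rather than one per call site.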

I wrote a draft of this article at this point. It was decent. But the story wasn't done.

v0.4: adversarial roast with five roles

In v0.4 I added something I'd been thinking about for a while: multi-perspective adversarial review. Not one reviewer — five, each with strict non-overlapping scope.

devil-advocate — pressure-test for failure modes, hidden assumptions, edge cases, concurrency bugs.
pragmatist — operational reality, on-call burden, real cost over time, skills/bus factor, deployment risk.
junior-engineer — clarity for a fresh reader six months from now. Undefined terms, unfollowable steps, broken cross-references.
compliance-officer — regulatory exposure, PII flows, jurisdiction, audit, third-party risk.
futurist — 1-3 year horizon, structural drift, technology lifecycle, hiring, regulatory drift.

Each role has an explicit "what I cover / what I do NOT cover" table referencing the other roles. The point: when independent perspectives converge on the same finding, the finding is real. When one role is silent on something, it's because another role owns it.
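The convergence idea can be sketched as a grouping step. This is a simplification under a big assumption: that each finding already carries a normalized subject key, which is the hard part in practice and presumably something the summarizing step does with judgment rather than string matching. The struct and function names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

struct Finding {
    role: &'static str,
    /// Hypothetical normalized key for what the finding is about.
    subject: &'static str,
}

/// Returns the subjects flagged by at least two distinct roles.
fn cross_cutting(findings: &[Finding]) -> Vec<&'static str> {
    let mut roles_by_subject: BTreeMap<&'static str, BTreeSet<&'static str>> = BTreeMap::new();
    for f in findings {
        // A set per subject, so the same role repeating itself counts once.
        roles_by_subject.entry(f.subject).or_default().insert(f.role);
    }
    roles_by_subject
        .into_iter()
        .filter(|(_, roles)| roles.len() >= 2)
        .map(|(subject, _)| subject)
        .collect()
}

fn main() {
    let findings = [
        Finding { role: "devil-advocate", subject: "tool-db-boundary" },
        Finding { role: "compliance-officer", subject: "tool-db-boundary" },
        Finding { role: "pragmatist", subject: "on-call-load" },
    ];
    println!("{:?}", cross_cutting(&findings));
}
```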

Command: /archforge:roast <ADR-NNNN>. Output: a directory of six documents — one summary plus one per role.

At --scale=deep, the roast runs automatically between Decide and Document, so important decisions never become accepted ADRs without passing the multi-role gauntlet.

The roast on ADR-0002

ADR-0002 was a modular monolith on a Cargo workspace — the second decision in the project, after the AI gateway was settled. I ran roast on it.

The output was strong. 14 high-severity findings, 16 medium, 16 low, plus 8 structural risks from the futurist. Cross-cutting concerns — issues that multiple roles independently surfaced — were the most valuable part. Six of them. The most consequential:

CC-1: a class of confused-deputy attack at the boundary between LLM tools and the database access layer. When AI tools execute under elevated database privileges (a common pattern when tool calls need to read across tenant boundaries efficiently) without authorization checks in the tool itself, prompt injection can cause horizontal data leakage between tenants. The devil-advocate flagged it from a security lens (an attacker with prompt-injection access can use the trusted tool to do things the attacker shouldn't be able to). The compliance-officer independently flagged the same code path from a regulatory lens (horizontal leakage of personal information across tenant boundaries violates the local data-protection law). Two roles, two angles, same architectural attack vector.

This is exactly the "compound" mode I built the plugin for. Two independent reviewers from different lenses converged on a single architectural attack vector. That's a real finding, not noise.

The recommended fix was straightforward — add a rule to the ADR requiring tools to accept an authorization context and perform ownership checks, or alternatively use a constrained database role that enforces row-level security at the connection level. Plus open two new ADRs (operational baseline and compliance contour) before any paying users.
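The first option, tools that take an authorization context and check ownership themselves, might look roughly like this in Rust. It is a sketch under stated assumptions, not the project's code: AuthContext, Page, read_page and the in-memory slice standing in for the database are all hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum ToolError {
    Forbidden,
    NotFound,
}

/// Built from the authenticated request, never from model output.
struct AuthContext {
    tenant_id: u32,
}

struct Page {
    id: u32,
    tenant_id: u32,
    title: &'static str,
}

/// Tool entry point. The page id may come from the LLM (and so from an
/// injected prompt), but the ownership check uses the caller's context.
fn read_page<'a>(
    auth: &AuthContext,
    pages: &'a [Page],
    page_id: u32,
) -> Result<&'a Page, ToolError> {
    let page = pages
        .iter()
        .find(|p| p.id == page_id)
        .ok_or(ToolError::NotFound)?;
    if page.tenant_id != auth.tenant_id {
        return Err(ToolError::Forbidden);
    }
    Ok(page)
}

fn main() {
    let pages = [
        Page { id: 1, tenant_id: 10, title: "Florist home" },
        Page { id: 2, tenant_id: 20, title: "Barber home" },
    ];
    let auth = AuthContext { tenant_id: 10 };
    println!("{:?}", read_page(&auth, &pages, 1).map(|p| p.title));
    println!("{:?}", read_page(&auth, &pages, 2).map(|p| p.title));
}
```

Because the signature requires an AuthContext, a tool cannot be called without one, which moves the rule from convention into the type system. A production version would likely return the same error for "missing" and "not yours" so that an attacker cannot probe which ids exist.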

This was the single best architectural moment I'd had with an AI tool. The plugin found a class of security bug in my own ADR through structured adversarial review, and proposed a concrete fix. I was ready to publish the article.

And then the report was garbled

The roast output came back in mixed Russian and English. Section headers were in English (## Headline findings, ## Cross-cutting concerns) but the prose was in Russian, and that prose was full of transliterated or untranslated English: "обзервабилити" (a phonetic transliteration of "observability") instead of "наблюдаемость", "operational baseline" instead of "операционный минимум" ("operational minimum"), "compile-time гарантии" instead of "гарантии на этапе компиляции" ("compile-time guarantees").

I'd written an explicit terminology rule into the architect skill in v0.3 specifically to prevent this. Russian artifacts must use proper Russian engineering terminology, not transliterated English calques. The rule was there. It clearly hadn't been applied.

I told Claude: "your output is full of anglicisms, please apply the terminology rule."

What happened next is, in some ways, more interesting than the original bug.

Overcorrection: the failure mode nobody talks about

Claude's response started: "Виноват — это уже второй раз с тем же правилом, явно нарушил." (Guilty — second time with the same rule, clearly violated.)

And then it rewrote the summary. With problems.

It translated Devil-advocate to "Обвинитель" (Prosecutor). And Pragmatist to "Прагматик" (Pragmatist). And Junior-engineer to "Новый разработчик" (New developer). And Compliance-officer to "Специалист по соответствию" (Compliance specialist). And, worst of all, Futurist to "Стратег" (Strategist).

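It translated the section headers. ## Headline findings became ## Главное (roughly "The main points"). ## Cross-cutting concerns became ## Сквозные проблемы ("Cross-cutting problems"). ## Severity counts became ## Распределение по тяжести ("Distribution by severity").

It translated the finding IDs. CC-1 became СП-1 (the Cyrillic initials of "Сквозные проблемы"). CC-2 became СП-2.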

I read this and immediately knew there was something wrong, beyond just being a different translation choice. These weren't translations. They were structural breaks.

Devil-advocate is the name of an agent file. agents/devil-advocate.md. There's a name: field in the frontmatter that says devil-advocate. That string is invoked by the orchestrating command. Translating it to "Обвинитель" means:

Future references to that agent in other artifacts won't resolve.
The roast directory's per-role files (01-devil-advocate.md, 02-pragmatist.md...) suddenly don't match the roles named in the summary.
Comparing this roast to a future one is impossible — two roasts on the same ADR will name the roles differently.
The role concept itself was changed: "Стратег" (strategist) is not a translation of Futurist (long-horizon role); it's a different role. The role was substituted.

## Headline findings is a section header prescribed by the roast command file. The command writes that header literally. Other tooling (and a future user reading the directory) expects that header. Translating it to ## Главное means the document diverged from what the plugin's own templates promised.

CC-1 is a finding ID that gets cross-referenced. The summary says "CC-1 — see devil-advocate F1.1 + compliance C1.1". If you rename CC-1 to СП-1 in the summary but the agent docs still call it CC-1 (because they were written first, in English), the cross-references break. References pointing at IDs that don't exist anymore.

This is the plugin's own templates, names, and identifiers being silently translated under translation pressure. After being told "apply the terminology rule", the assistant pattern-matched too aggressively and translated things that aren't terminology. They're identifiers.

And here's the failure mode in a clean form: AI tools fail in two directions, and most reviews of AI tools only test one direction.

Undercorrection — the LLM ignores a rule. ("обзервабилити" instead of "наблюдаемость".)
Overcorrection — the LLM applies a rule too widely after being corrected. (Translating identifiers along with prose.)

The blog posts about LLM problems usually focus on the first. Hallucination, ignored constraints, dropped instructions. Overcorrection — over-eager application of corrections to inappropriate scope — is also a serious failure mode, and it's exactly the failure mode that trying harder produces.

Root cause: rules in one place, applied in many

Why did this happen? The terminology rule lived in skills/architect/SKILL.md, the router skill. The router skill is loaded into Claude's context when architectural intent is detected. But the roast command spawns five sub-agents (devil-advocate, pragmatist, etc.). Each sub-agent is a separate context. Each one reads its own agents/*.md file, not the router's SKILL.md, so it doesn't automatically inherit rules from the router.

So the structural bug was: the rule was authored once, but its enforcement depended on it being read in each of the contexts where output gets generated. In the roast, that was six contexts (five role agents plus the summarizing main thread). Five of them never saw the rule.

After my correction, the main thread did know the rule. But it applied it to everything that vaguely resembled a Russian-with-anglicisms problem — including identifiers that should never have been translated.

The fix could not be "remember to apply the rule everywhere". That's the same fix that already existed and didn't work. The fix had to be structural: the rule needs to be embedded in every place where output is generated, with explicit guards against both directions of failure.
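One way to make "embedded in every place" checkable rather than remembered is a small lint over the agent definitions. The sketch below is an illustration, not part of the plugin: it assumes each agent file is expected to carry the rule under a fixed section header, and it takes file contents as in-memory strings instead of reading a directory.

```rust
/// Section header each agent definition is assumed to carry.
const REQUIRED_SECTION: &str = "## Language and terminology";

/// Takes (file name, file body) pairs and returns the names of the agent
/// files whose text lacks the required section header on its own line.
fn agents_missing_rule<'a>(agents: &[(&'a str, &str)]) -> Vec<&'a str> {
    agents
        .iter()
        .filter(|(_, body)| !body.lines().any(|l| l.trim() == REQUIRED_SECTION))
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    let agents = [
        ("devil-advocate", "# Role\n## Language and terminology\nKeep identifiers."),
        ("futurist", "# Role\nNo rule here."),
    ];
    for name in agents_missing_rule(&agents) {
        println!("missing language rule: {name}");
    }
}
```

A check like this only proves the section exists, not that the model follows it, so it complements a review of the output rather than replacing it.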

The fix, and what we built around it

In v0.4-rc2, I did three things:

1. Embedded the language rule in every sub-agent. All five roast agents and all three core agents (architect, reviewer, researcher) now have an explicit ## Language and terminology section in their agents/*.md files referencing the architect skill's taxonomy. Each agent's section names the specific identifiers that role uses (its own name, its own finding-ID scheme, the section headers it produces) that must never be translated.

2. Rewrote the language rule in the router skill with explicit categories. A 10-category taxonomy distinguishing what gets translated from what doesn't:

A. Plugin component identifiers — never translate.
B. Software, library, protocol names — never.
C. Standard abbreviations — never.
D. Laws, regulations, standards — never.
E. Artifact identifiers (finding IDs, rule numbers, ADR IDs) — never.
F. Plugin template section names — never (verbatim English even when body is in another language).
G. Project-internal proper nouns — never when capitalized.
H. Term-of-art with no clean Russian equivalent — keep with first-occurrence gloss.
I. Calques — translate per the calque table.
J. Prose verbs and connectors — translate.

Plus a section literally titled "Overcorrection is also a failure" with the exact examples of what had just gone wrong: Devil-advocate should not become "Обвинитель", ## Headline findings should not become ## Главное, CC-3 should not become СП-3. The negative examples are now part of the spec.

3. Added a new agent: meta-reviewer. This is the most important part.

What the meta-reviewer does

The five roast roles attack the substance of an ADR. The meta-reviewer doesn't. It checks the form.

It reads artifacts produced by the plugin and checks them against the plugin's own templates and rules. Five categories:

Template conformance — does this ADR have all the required sections? Does this roast directory have a summary plus one file per role? Does this review have a Status section?
Identifier preservation — are agent names, command names, template section headers, finding IDs, software names, regulation names, all in their original form?
Language-pass evidence — for Russian artifacts, did the calque pass actually run? Is there a one-line note at the end stating what was changed?
Cross-reference integrity — does "ADR-NNNN" point at an ADR that exists? Does "see B-1" resolve to a finding that's in the document?
Lifecycle integrity — has anyone substantively edited an Accepted ADR (which should be superseded, not edited)?

Critically, the meta-reviewer does not evaluate architectural quality. That's roast's job. The meta-reviewer is the role that asks: "does this artifact match what the plugin's own files said it should look like?"

It uses the plugin's own source files as the specification. It reads commands/roast.md to know what sections a roast summary must have. It reads skills/architect/SKILL.md to know which strings count as identifiers. The plugin is grading itself against its own promises.
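Two of those checks, identifier preservation and cross-reference integrity, are mechanical enough to sketch in Rust. This is an illustration of the idea, not how the meta-reviewer agent works (it is a prompt-driven sub-agent, not a program). The function names and the ID shape (uppercase ASCII letters, a hyphen, digits) are assumptions.

```rust
use std::collections::BTreeSet;

/// Identifier preservation: every protected string present in the original
/// must still appear verbatim in the revised text.
fn lost_identifiers<'a>(original: &str, revised: &str, protected: &[&'a str]) -> Vec<&'a str> {
    protected
        .iter()
        .copied()
        .filter(|id| original.contains(*id) && !revised.contains(*id))
        .collect()
}

/// Collects tokens shaped like `CC-3` or `B-1`: ASCII uppercase letters,
/// a hyphen, then ASCII digits. Cyrillic look-alikes do not match.
fn finding_ids(text: &str) -> BTreeSet<String> {
    text.split(|c: char| !(c.is_ascii_alphanumeric() || c == '-'))
        .filter(|tok| match tok.split_once('-') {
            Some((prefix, num)) => {
                !prefix.is_empty()
                    && prefix.chars().all(|c| c.is_ascii_uppercase())
                    && !num.is_empty()
                    && num.chars().all(|c| c.is_ascii_digit())
            }
            None => false,
        })
        .map(str::to_string)
        .collect()
}

/// Cross-reference integrity: IDs referenced in the summary that none of
/// the role documents mention.
fn dangling_refs(summary: &str, role_docs: &[&str]) -> Vec<String> {
    let defined: BTreeSet<String> = role_docs.iter().flat_map(|d| finding_ids(d)).collect();
    finding_ids(summary)
        .into_iter()
        .filter(|id| !defined.contains(id))
        .collect()
}

fn main() {
    let original = "CC-1: devil-advocate and compliance-officer agree. ## Headline findings";
    let revised = "СП-1: Обвинитель и compliance-officer согласны. ## Главное";
    let protected = ["CC-1", "devil-advocate", "compliance-officer", "## Headline findings"];
    println!("{:?}", lost_identifiers(original, revised, &protected));
    println!("{:?}", dangling_refs("See B-1 and B-4.", &["B-1: routing", "B-2: sanitizer"]));
}
```

The list of protected strings is the interesting input: in the plugin's framing it would be derived from the command and agent files themselves, which is what lets the check catch overcorrection as well as undercorrection.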

In --scale=deep cycles, the meta-reviewer runs automatically: after the auto-roast (on the roast directory) and after Document (on the freshly-written ADR). High-severity divergences pause the cycle.

So: the plugin found an architectural bug in its own ADR (review of ADR-0001). Then it found a security bug in another of its ADRs (roast of ADR-0002). Then it failed to follow its own language rule. Then it overcorrected and translated identifiers it shouldn't have. Now it has a role specifically designed to catch the kind of bug the previous version had.

This is what compounding looks like, in the actual Compound Engineering sense: each cycle leaves the next cycle better-equipped, because the system itself is the artifact being improved.

What I'd build differently if I started over

One thing, mainly. I'd put the language rule in every relevant context from the start, not in the router skill alone. The "DRY" instinct says rules should live in one place. For LLM tooling that turns out to be wrong: rules need to be in every context where they're enforced, even at the cost of duplication. Sub-agent contexts don't automatically share state with the parent. Architecting LLM-tool systems is partly the discipline of accepting that LLMs do not naturally inherit context the way functions inherit lexical scope. The meta-reviewer pattern — a role specifically dedicated to checking that artifacts conform to the system's own rules — should probably exist in any AI plugin that produces structured artifacts, not just this one.

What's coming

The plugin's roadmap covers three minor versions. v0.5 "Sharper feedback" tightens the existing feedback loops: an /archforge:diff <ADR> command that checks whether an accepted ADR is still reflected in the actual code, an anti-patterns skill (concrete named ones: distributed monolith, database-as-integration-layer, sync chains, cache as source of truth), an architectural-metrics skill, and an /archforge:export command for shipping artifacts to articles or portfolios. v0.6 "Memory and history" adds a historian agent that reads the project archive, retrospectives, decision-map evolution, and optional pre-commit hooks. v0.7 "Teams and budgets" covers cost as a first-class architectural variable and multi-architect coordination.

The full plan, including the explicit anti-roadmap (no code generation from ADRs, no project-management integrations, no doc-site generators, no voting workflows on ADRs, no per-language plugin variants), is in ROADMAP.md.

The compound pattern, more directly

Most AI-tooling discourse focuses on automation — getting AI to do more work for you. That's real and matters. But for some kinds of work, automation is the wrong frame.

Architecture is one of those. The work isn't to produce an artifact faster. The work is to make a defensible decision under uncertainty. The artifact is downstream of the decision. Defensible decisions require structured disagreement, real alternatives, honest trade-off analysis, and someone willing to push back when you skip ahead.

If you cast AI as "thing that produces artifacts", you'll get bad architecture faster. If you cast AI as "thing that argues with me about my reasoning until I either change my mind or strengthen my case, while also checking that the artifacts of our conversation match the rules we agreed on", you'll get better architecture, slower, on purpose.

archforge is one attempt at the second framing. It pairs naturally with Compound Engineering: CE handles feature-level workflow (Brainstorm → Plan → Work → Review → Compound), archforge handles architectural decisions (Discover → Research → Design → Decide → Document → Review). Architectural decisions feed into CE plans by ADR number; CE compound learnings can produce new ADRs. The integration is materialized by running /archforge:remember-compound-integration once per project.

If you try the plugin, the most useful thing you can give back is a specific failure. Not "it's great" — that's nice but not actionable. "I ran roast on this ADR and the futurist role completely missed X" — that's actionable, that's how v0.5 gets shaped.

Repository: github.com/IgorKramar/archforge-marketplace. Issues, PRs, and especially specific bug reports of the plugin failing on your project are welcome. The plugin gets better when it fails in instructive ways. That's the whole point.

Superpowers vs Compound Engineering: is the 'vs' even real?

Igor Kramar — Mon, 04 May 2026 13:02:08 +0000

TL;DR — Superpowers and Compound Engineering aren't competitors. They're optimised for different worlds. Superpowers is gold for mature codebases with established methodology (TDD shops, large legacy systems, teams enforcing standards). Compound Engineering is gold for early-stage products where one person owns a feature end-to-end. Pick by what your codebase looks like, not by which README sounds shinier.

If you've spent any time in the Claude Code plugin ecosystem in the last few months, you've almost certainly heard about both:

Superpowers by Jesse Vincent: a "complete software development methodology" plugin, ~42k stars, in Anthropic's official marketplace.
Compound Engineering by Every: a 36-skill, 50-agent framework around the idea that "each unit of engineering work should make the next one easier", ~16k stars.

Both ship as Claude Code plugins. Both wrap roughly the same surface: brainstorm, plan, work, review. Both have evangelists writing "I 100x'd my output" posts. So the natural question gets asked a lot: which one wins?

I installed both, ran them on the same projects for a few weeks (work — a Nuxt/Vue 3 platform product; personal — a small landing-page tool and my wife's florist site), and the answer turned out to be more interesting than I expected.

The "vs" is the wrong question. Let me show you why.

What both plugins are actually solving

Bare Claude Code is excellent at small tasks and dangerous at large ones. Without scaffolding it tends to:

Skip planning and start typing code on turn two.
Forget the lessons from yesterday's bug-hunt by tomorrow morning.
Drift further from your project's conventions the longer the session runs.
Treat each new request as if the codebase were a stranger.

Both plugins exist to fix this, but they fix it from different ends. To see the difference, you have to look past the README marketing and ask what each one forces you to do.

The 90% overlap

Honestly: most of the surface is the same. Here's the side-by-side:

| Phase | Superpowers | Compound Engineering |
| --- | --- | --- |
| Refine a vague idea | `brainstorming` | `/ce-brainstorm` |
| Turn it into a plan | `writing-plans` | `/ce-plan` |
| Isolated execution | `using-git-worktrees` | `ce-worktree` |
| Code review before merge | `requesting-code-review` | `/ce-code-review` |
| Systematic debugging | `systematic-debugging` | `/ce-debug` |

If you only look at this table, the conclusion is "they're the same plugin with different prefixes". That conclusion is wrong, and the difference is hidden in two places: what each plugin enforces, and what each one adds beyond the shared surface.

The real difference

Superpowers is a discipline-enforcement engine

Read the Superpowers source and you find it again and again: this plugin is opinionated, and the opinions have teeth. The clearest example is TDD:

If Claude tries to write code before tests, this skill literally makes it delete the code and start over. No exceptions.

That's not a tip. That's a guard rail. Other examples:

The brainstorming skill activates automatically when you describe a feature — you can't accidentally skip it.
systematic-debugging runs a 4-phase root-cause process and triggers a mandatory architectural review after three failed fix attempts.
YAGNI and "evidence over claims" are baked in as non-negotiables.

Superpowers is, in Jesse Vincent's own framing, a methodology. It's there to make Claude behave the way a senior engineer at a disciplined shop would behave, whether you remember to ask or not.

The cost of this is real: Superpowers will fight you when you don't want to do TDD, when you want to vibe-code a quick spike, when you'd rather see something running before you write the test. That's not a bug. That's the entire point.

Compound Engineering is a knowledge-accumulation framework

Compound Engineering's central claim is in its name: each cycle should make the next cycle cheaper. The unique skills, the ones Superpowers has no real equivalent for, all serve that claim:

/ce-compound: after a feature ships, you write down what was learned. Bug patterns, gotchas, surprising decisions. These get indexed and pulled into the context of future plans.
/ce-compound-refresh: periodically reviews stored learnings and decides whether to keep, update, replace or archive them. Without this, your knowledge base drifts.
/ce-strategy: maintains a STRATEGY.md at the repo root (target problem, persona, key metrics, tracks). /ce-ideate, /ce-brainstorm and /ce-plan all read it as grounding.
/ce-product-pulse: time-windowed reports of what users actually experienced, saved to docs/pulse-reports/ so a timeline of outcomes builds up over time.

Notice what's happening here: CE is reaching above the engineering loop (strategy) and below it (user outcomes), and tying both back into the planning context. It's trying to be a product loop, not just an engineering one.

The cost is also real: 36 skills and 50+ agents is a lot of surface. Without discipline you end up running ceremonial workflows on tasks that didn't need them. And /ce-compound only works if you actually use it after every cycle — skip it for two sprints and CE collapses into "Claude Code with extra slash commands".

Where each one breaks

This is the part most comparison posts skip, so let me be specific.

Superpowers breaks when your domain isn't TDD-shaped.

I tried Superpowers on a Nuxt/Vue 3 SSR feature where most of the work was prop-drilling, layout tweaks and Pinia state plumbing. The TDD-first enforcement turned a 90-minute change into a 3-hour session of writing tests for code that's mostly visual. For SSR-specific bugs (hydration mismatches, server-only state) the discipline pays for itself. For "make this card layout responsive", it's pure friction.

Superpowers also struggles with rapid iteration. Brainstorm-plan-test-implement is brilliant for a feature you'll ship to a million users. It's overkill for a 30-minute spike where the goal is to know whether an approach is even viable.

Compound Engineering breaks when you skip the compound step.

This is the trap. CE has so many slash commands that it feels productive even when you're just running /ce-plan and /ce-work on autopilot. But CE without /ce-compound is not compound engineering — it's just a more verbose Claude Code session. The plugin's value compounds only if you compound. I've watched myself skip it under deadline pressure on three consecutive cycles before I noticed the framework had quietly become decorative.

CE also breaks at team scale. The named-persona reviewers (ce-dhh-rails-reviewer, ce-kieran-typescript-reviewer) encode somebody else's taste. On a personal project that's fine — useful, even. On a team project, "the reviewer Claude is roleplaying as DHH" is not a conversation I want to have with a senior colleague at standup.

The shape that actually predicts which one fits

Here's the rule of thumb I landed on after two months of switching back and forth:

| Your situation | Better fit |
| --- | --- |
| Mature codebase, established conventions, real test suite | Superpowers |
| Greenfield product, you own it end-to-end, conventions still forming | Compound Engineering |
| Team enforcing TDD or similar discipline | Superpowers |
| Solo dev or small team, knowledge dies if not written down | Compound Engineering |
| Legacy system where consistency matters more than novelty | Superpowers |
| Early-stage product where strategy shifts week to week | Compound Engineering |
| You want the plugin to constrain you | Superpowers |
| You want the plugin to remember for you | Compound Engineering |

Or stated more bluntly: Superpowers is for codebases with a methodology to enforce. Compound Engineering is for products where the knowledge to compound is still being created.

A startup with a 6-month-old codebase and a single engineer per feature has very little to enforce — there are no conventions yet, the test suite is patchy, the architecture is in flux. What it desperately needs is a memory: why did we pick Pinia over Vuex, what broke last sprint, who is the persona we're optimising for. CE's /ce-strategy, /ce-product-pulse and /ce-compound are exactly that memory.

A 10-year-old enterprise codebase with 30 engineers has the opposite shape. The knowledge already exists — in the test suite, in code review norms, in the architectural decision records. What it needs is enforcement, because the failure mode is drift away from established standards under deadline pressure. Superpowers' "delete the code, write the test first, no exceptions" is precisely calibrated to that failure mode.
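
The table above can be read as a simple tally: count how many rows point each way for a given project. Here is a minimal sketch of that reading as a shell function. The function name plugin_fit and the trait labels are hypothetical names I made up to stand for the table's rows, and weighting every row equally is an assumption, not something the rule of thumb itself claims.

```shell
# Hypothetical helper: tally the rule-of-thumb table for one project.
# Each argument is a made-up label standing for one row of the table.
# Assumption: every row counts equally; real projects may weigh them differently.
plugin_fit() {
  local superpowers=0
  local compound=0
  local trait
  for trait in "$@"; do
    case "$trait" in
      # Rows whose "better fit" column says Superpowers
      mature-codebase|team-tdd|legacy-consistency|wants-constraint)
        superpowers=$((superpowers + 1)) ;;
      # Rows whose "better fit" column says Compound Engineering
      greenfield|solo-or-small-team|shifting-strategy|wants-memory)
        compound=$((compound + 1)) ;;
      *)
        echo "unknown trait: $trait" >&2
        return 2 ;;
    esac
  done
  if [ "$superpowers" -gt "$compound" ]; then
    echo "Superpowers"
  elif [ "$compound" -gt "$superpowers" ]; then
    echo "Compound Engineering"
  else
    echo "no clear fit"
  fi
}
```

Calling plugin_fit mature-codebase team-tdd prints Superpowers, and plugin_fit greenfield wants-memory prints Compound Engineering. A tie prints "no clear fit", which is probably the honest answer for a project that straddles both shapes.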

What I actually run, and where

For my work codebase (Nuxt/Vue 3, established team, real conventions), I lean on Superpowers, but selectively: brainstorming and writing-plans for any non-trivial feature, and systematic-debugging for tricky SSR bugs. I let TDD enforcement run on Pinia store logic and turn it down on pure UI work.

For personal projects (the landing-page tool, the florist site), I run Compound Engineering. /ce-strategy once at the start gives every subsequent /ce-plan real context about what the product is for. /ce-compound after each meaningful feature actually compounds — I've already had /ce-plan surface a learning from a previous cycle that saved me an evening.

I don't run both in the same repo. They both touch CLAUDE.md, they both want to be the source of truth for how the agent behaves, and the conflict isn't worth it.

The honest version of "vs"

If someone asks "which plugin should I install?" — that's the wrong question. The right one is: what shape is the work?

Mature codebase, established discipline, the failure mode is drift → Superpowers.

Early product, single owner per feature, the failure mode is forgotten learnings → Compound Engineering.

The "vs" framing is what marketing produces when two tools are competing for the same star count. Engineering produces a different question: what fits this codebase, this team, this stage?

Try both. Run each on the project where its philosophy actually matches. You'll know within a week which one earned its place.

If you've run them both and your shape is different from mine, I'd genuinely like to hear it in the comments — especially the cases where I'm probably wrong.