Why a single tool got rewritten five times
Inside sonmat there's a slash command called /devil. When you arrive at a confident conclusion, it sits down and beats on it for a minute. Devil's advocate. A well-known reasoning move. I figured wrapping it in a slash command would be a one-shot job.
Five rewrites.
The first cut shipped on April 2nd as v0.6.0. The last meaningful overhaul shipped on April 26th as v0.11.0. Twenty-four days, five steps, and at every step the tool taught me one thing it didn't know yet. This post is that arc.
v0.6 — meet imp
In the first post in this series I went after the gap that swallows AI work — the model produces confident nonsense, the human gets persuaded by the confidence, nobody verifies. To close that gap I needed a tool that would force me into self-rebuttal.
I called it imp. Little gremlin. I liked the playful name. The description was just "Devil's advocate for reasoning."
The first design was simple. Restate the user's claim in one line. Attack on three axes — Evidence (cherry-picked? missing data?), Logic (any leaps?), Alternatives (could the same facts support a different conclusion?). Name the cognitive bias in play. Close with a balance table — Strength column, Verdict column.
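That four-step shape is easy to model. Here's a minimal Python sketch of the v0.6 report skeleton; the function name and axis wording are my reconstruction from the description above, not sonmat's actual prompt:

```python
# Hypothetical skeleton of the v0.6 /imp output, for illustration only.
# The three attack axes and their probe questions, as described in the post.
AXES = {
    "Evidence":     "Cherry-picked? Missing data?",
    "Logic":        "Any leaps between premise and conclusion?",
    "Alternatives": "Could the same facts support a different conclusion?",
}

def imp_report(claim: str, bias: str) -> str:
    """Restate the claim, attack on three axes, name the bias, close with the table."""
    lines = [f"Claim: {claim}"]
    for axis, probe in AXES.items():
        lines.append(f"{axis}: {probe}")
    lines.append(f"Bias in play: {bias}")
    lines.append("| Original claim | Counter-argument | Strength | Verdict |")
    return "\n".join(lines)
```

Nothing clever, just a fixed walk through the four steps, which is exactly why the weak spots that follow were all in the wording rather than the structure.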
It worked. The first two times.
v0.7 — the name was lying
Third use, I caught myself hesitating. The command was /imp, but every line of the description called it "devil's advocate." Same tool, two names, in the same document. Every time I went to use it, my brain did a little "wait, was it imp," and that little hitch added up.
Sounds petty. It isn't. The name is the tool's identity. If the identity is fuzzy, every invocation costs you a context switch you shouldn't be paying for. Pile up enough of those and you stop calling the tool at all.
v0.7.0 (April 6th) renamed /imp → /devil. Breaking change. The user (still me) had to relearn a command. Did it anyway. A lie you're carrying gets more expensive the longer you carry it.
That same release also dropped Rhythm Rules (Pace / Weight / Learn) into core. So /devil stopped being a standalone tool and became one component inside a larger verification system. Pace is "when do I use this," Weight is "how heavy a pass do I need," Learn is "how do I save what I find." /devil ended up being the thing you reach for when Pace and Weight tell you to.
v0.7.1 → v0.10 — the table was unreadable
Once it actually got used, the balance table started misbehaving:
```
| Original claim | Counter-argument | Strength | Verdict |
```
Strength of what? The original claim's strength? The counter-argument's strength? Every single time I read this table, I had to drop back into the body text to figure it out. The column name wasn't carrying the meaning it was supposed to.
v0.7.1 (April 10th) — pinned the subjects to the labels:
```
| Original claim | Counter-argument | Counter strength | Claim verdict |
```
Better. But a week later it was annoying me again. "Counter strength" is a noun phrase that's too compact — it doesn't tell the reader what question the column is answering. The first thing you ask when you read a column header is "what is this column asking me?" The faster the header answers that, the faster the table reads.
v0.10.0 (April 17th) — turned the noun phrases into questions:
```
| Original claim | Counter-argument | How strong is the counter-argument? (strong/moderate/weak) | Claim after challenge |
```
The actual header text is "How strong is the counter-argument?" with the answer options shown in parentheses. "Verdict" — abstract noun — got replaced with "Claim after challenge," which has time embedded in it. Much more direct.
Two passes for what looks like a tiny change. But column headers are the first signal a reader gets when they hit a table. If the first signal is fuzzy, every row underneath turns into noise. The header is the meaning carrier; the cells are just data riding it. Worth two passes.
v0.11 — devil started wearing me out
The biggest shift came on April 26th, and here's the awkward part: the problem was that the tool was working.
/devil was too good at doubting.
Every time I made a decision and called /devil, it would dutifully surface real weaknesses. Not fluff — real ones. Cherry-picking risks. Causal directions I hadn't verified. Alternative hypotheses that explained the same data equally well. All true.
The catch: none of it changed what I did next.
Concrete example. I'd ask /devil about a decision like "which series should I file this post under?" and it would honestly point out that series taxonomies are arbitrary, that readers don't browse by series, that my own category metrics are basically empty. All true. But I still had to put the post somewhere, I could move it a week later if I felt like it, and none of the surfaced weaknesses were going to change my next click. The tool had done real work. The result was churn.
I gave it a name — reactive contradiction. Real weakness, but tangential. The tool looks smart. The user gets nothing actionable.
Realizing this was the tool's signature failure mode is what triggered v0.11.
The fix was a gate. v0.11 added a new section called §2.5: a project-relevance gate.
After CCT (Claim-crux / Counter-fit / Cause-chain — the existing step where devil hunts for the load-bearing assumption), there's now a gate it has to pass before §3 depth drive runs. Three questions:
| Question | What it asks |
|---|---|
| Stakes | If this reasoning is wrong, what does the user actually lose here? |
| Amendment cost | Where in the lifecycle is this decision — draft, or shipped operational state? |
| Next-action delta | If this counter is surfaced, does the user's next action actually change, or do we just add words? |
The three answers produce a verdict:
```
[devil] Project relevance: "{material | load-bearing-but-low-stakes | off-project}"
```

- `material` — real stakes. Proceed to §3 depth.
- `load-bearing-but-low-stakes` — real weakness, but stakes don't justify a deep grilling. Note it briefly and stop.
- `off-project` — the weakness is technically correct but disconnected from the actual decision. Say so and stop.
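To make the gate's shape concrete, here's a minimal Python sketch, assuming the three answers collapse to booleans. The names are mine and this illustrates the decision shape described above, not sonmat's actual implementation:

```python
# Hypothetical sketch of the §2.5 project-relevance gate (illustration,
# not sonmat's code). Three answers in, one of three verdicts out.
def relevance_gate(stakes_are_real: bool,
                   decision_shipped: bool,
                   next_action_changes: bool) -> str:
    # Next-action delta: if surfacing the counter changes nothing,
    # the challenge itself missed. Honest exit.
    if not next_action_changes:
        return "off-project"
    # Stakes and amendment cost: something real to lose, or the decision
    # has already shipped and is expensive to amend. Run the depth pass.
    if stakes_are_real or decision_shipped:
        return "material"
    # Real weakness, but cheap to amend and low stakes: note it and stop.
    return "load-bearing-but-low-stakes"
```

The ordering is the point: next-action delta is checked first, because a counter that changes nothing is off-project no matter how true it is.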
That last verdict — off-project — is the one that mattered. Until v0.11, /devil could only land on "claim survives" or "claim weakened/flipped." Both of those describe what happens after a challenge lands. There was no way to express "the challenge itself missed." So /devil would throw the missed punches anyway. That was the structural cause of reactive contradiction.
Once off-project existed, /devil had the option of an honest exit. A tool shouldn't stand up when there's no work for it to do. That's what the gate is for.
Five rewrites, four lessons
1. A verification tool needs to be verified too.
/devil can't audit itself, so the signal is user fatigue. If after invoking the tool you find yourself sighing instead of nodding, the tool isn't doing its job — no matter how many real weaknesses it found.
2. The name is the tool's identity.
v0.7 was about not carrying that imp/devil lie one more day. Every invocation that costs your reader a small mental swap is a tax that compounds.
3. Column headers are where signal turns into noise.
Two passes to get them right. The leap was going from a noun phrase to a question. Tables hold data. Headers hold the question the data is answering. Get the second wrong and the first becomes meaningless.
4. A busy-looking tool isn't necessarily a useful tool.
When /devil was finding real weaknesses on every call, it looked like the tool was earning its keep. It wasn't. It was generating output that looked like work but wasn't actually moving anything. The fix wasn't to make /devil smarter. It was to give it permission to leave.
What's already pulling on the next rewrite
The gate landed less than a week ago and there are already cracks I can see:
- The user doesn't always state Stakes explicitly. So how does `/devil` estimate them? If the estimate's wrong, doesn't the whole gate collapse back into reactive contradiction?
- The boundary between `load-bearing-but-low-stakes` and `off-project` is sharp on paper but probably blurry in practice. Both end in "stop here," but they mean different things.
- When does `/devil` need to self-critique its own verdict? "Your gate verdict is too conservative" is a meta-challenge that doesn't have a home yet.
Five rewrites in, the work isn't done. What goes into the next release will be whatever the user (still me) gets tired of next.
Try it
```
/plugin marketplace add jun0-ds/sonmat
/plugin install sonmat@sonmat
```
After install, try calling /devil on any decision you're sitting on. If the result feels like churn, that's the signal — the tool has no real work here. Catching that signal is how you train the tool, and yourself.
Part of the series **Building sonmat**. Previous: *I built sonmat to fix this. Then sonmat had the same bug.*