Make Claude Code disagree with you: a 14-rule counterpart toolkit (install in 1 command)

#claudecode #ai #productivity #doctrine

The 60-day lesson, ROI on three axes

Thirty-two days of solo production on an ERP, 118,808 lines of TypeScript, six doctrine versions, four external reviewers integrated. I've compiled what I learned into fourteen operational rules, installable in one command: the Counterpart Toolkit v0.4.1. It is both the material lesson of sixty days coding solo with Claude Code, and the mapping of the fourteen silent failure modes I've seen repeat — for anyone coding alone with an AI in production, who no longer has a PR review to catch the drift.

ROI is quantified on three axes, measured on Rembrandt:

R4 *Falsify before fix* — five to ten minutes of upstream protocol prevent thirty to ninety minutes of fix-then-rollback cycle when the first plausible hypothesis turns out wrong. ROI 6 to 18× per incident. Over sixty days, I stopped losing an hour two to three times a week on fixes that fixed nothing.

R2 *Filesystem over summary* paired with R6 *Live/Snapshot/Cache* and the daily drift probes — median apparition→detection of a silent divergence drops from invisible to 35.3 days over a 90-day rolling window. M3 publicly recalibrated to ≤ 30 days in the manifesto, because the original target (≤ 7 days) was an intuition practice refused.

A sub-agent challenger producing objections in the imposed format Tool / Question / Refutation criterion. Material disagreement, not emotional "are you sure?" that pushes you to revise without a new fact.

Here's how, in 1400 words and one install command.

The diagnosis — the incident that triggered the doctrine

Coding alone with an AI means compounding two complaisances. The agent's, sycophantic by construction because reinforcement learning from human feedback trained it to please the prompter. And the solo's, self-validating by humanity, who validates their own work because no one is left to contest it. End to end, these two complaisances produce a drift that neither agent nor human flags — and that only surfaces at audit time, long after.

It was while preparing the late-April source-of-truth audit — ADR-0024, a deep-dive on divergences I had been putting off for three months — that I stumbled onto the gap, by chance, crossing two queries no one had ever crossed before. One student record, initials Y.B.: the contacts.montant_total column carried €1,159 entered by hand somewhere in 2024, untouched since. The actual sum of instalments, computed on the fly, came to €2,262. A thousand-euro gap, on a single record, with no alarm ever ringing. I widened the grep: five hundred and sixty contacts in the same state, some off by several thousand euros. And yet montant_total was read every day in the treasury dashboard — a derivable value being stored without a refresher, treated as an immutable past fact when it should have lived on the fly. This is exactly the trap R6 Live/Snapshot/Cache is meant to prevent, and R6 came out of that moment.

R4 Falsify before fix, the one rule exposed here

The toolkit states R4 as a five-step textual protocol. The falsify-before-fix skill is its invocable instance — the version Claude Code loads into its session, and cannot skip when about to write fix code.

name: falsify-before-fix
description: Activate this skill before writing the fix code on a bug or
  incident. Triggers on "fix", "bug", "patch", "hotfix", "workaround",
  "doesn't work", "diagnose", "hypothesis", "root cause". Enforces a
  single-sentence causal hypothesis and three material probes designed
  to refute it before any line of fix code is committed.
  Operational instance of R4 of the Counterpart Toolkit.

The protocol holds in five steps: (1) formulate the hypothesis in one sentence as a cause, not a symptom ("the counter reads from the old table after the 12 May migration" beats "the counter is wrong"); (2) list three probes designed to refute, not to confirm, because a confirmation probe always finds what it's looking for by selection; each probe carries its three fields Tool / Question / Refutation criterion; (3) execute and report raw output, never paraphrased; (4) branch — no probe refutes → write the fix; one probe refutes → restart with a new hypothesis; ambiguous probes → fourth sharper probe before any code; (5) output the retained hypothesis, the probes executed, the diff, and the post-fix observation criterion.

Why a skill and not the textual rule that already lived in CLAUDE.md? Because the textual rule didn't hold under pressure. 6 May, mid-afternoon. A Sentry alert reports the day's enrolment counter at zero while three enrolments were entered that morning. My reflex arrives before the protocol, and my hand is already on the keyboard — "the cache isn't invalidated." I commit, I deploy. Thirty minutes later, rollback: the bug is still there. The actual cause, that a ninety-second grep probe would have surfaced, is that no cache_invalidate call existed in the enrolment pipeline at all — not a stale cache, an absent one. R4 had been in CLAUDE.md for three weeks. I simply didn't follow it that day, because no apparatus interrupted my course between bug and commit.

That's the difference that makes everything. A textual rule in CLAUDE.md, the agent reads it at session start, and nothing forces it to summon the rule at the exact moment it would need it. A skill is something else: a material mechanism that loads automatically on keywords ("fix", "bug", "doesn't work") and imposes its protocol on the active session — not a self-reminder, a switch the session has already flipped for you. Ten days after the 6 May incident, the falsify-before-fix skill was committed. The lesson of the failure becomes an apparatus. It is the one rule of the 14 exposed in this article, and the other 13 are in the repo, all built on the same principle: materialise what would remain pious in text alone.

The toolkit applied to itself — figures and retractions

Six versions in 32 days, each anchored in a documented new fact. v0.3 → v0.3.1 on a first external reviewer (disproportionate theoretical apparatus, retracted). v0.3.1 → v0.3.2 on a second (seven recommendations, two open work-items). v0.3.3 — M1–M5 instrumentation published. v0.4 — toolkit/manifesto separation on a third reviewer. v0.4 → v0.4.1 on a fourth external reviewer (Claude.ai web), a consolidated commit integrating three refactors plus the measured LOC.

LOC corrected to 118,808 lines measured by find + wc -l on TS/TSX/JS/JSX (explicit exclusions), against the 35k figure cited in earlier versions — the doctrine itself had sinned against R2, Cache without refresher projected onto its own description. M3 recalibrated ≤ 7 d → ≤ 30 d with public justification: the original target was an intuition unhonoured by the 35.3 days measured in practice. M1 and M5 documented as instrumentation failures, not as successes: M1 over-sensitive at 12.33 vs ≤ 1 target, M5 classifies 90% of briefs as unknown.

Four external readers across six versions. The last two audited v0.3.3 then v0.4.1 — their objections are integrated in the release notes (manifesto §From v0.4 to v0.4.1). One reviewer called the falsify-before-fix skill "the best artifact of the doctrine among the ones I read". v0.5 scheduled 15 July 2026.

Steal three things in 20 minutes

Three rules to try before installing the rest.

R4 *Falsify before fix* — prevents the fix → rollback cycle triggered by the first plausible but wrong hypothesis. Detailed above.

R6 *Live / Snapshot / Cache* — prevents a stored derived value from silently diverging from its source. Any derivable column declares its category in the commit that creates it, or the commit is rejected.

R10 *Silent failure forbidden* — prevents catch {}, await without { error } destructuring, 2>/dev/null and other swallowing mechanisms from lying to your observability until production cracks on a downstream dependency.

The repo: github.com/michelfaure/doctrine-counterpart. The install command:

git clone https://github.com/michelfaure/doctrine-counterpart.git && \
  cd doctrine-counterpart && \
  ./install.sh --yes /path/to/your/project

License CC-BY-4.0. The manifesto promises nominal citation in v0.5 for anyone who proposes actionable feedback — which rule is missing for your stack? DEV.to comments are direct inputs for the next version. R14 *spike escape hatch* covers prototype code meant to disappear within seven days, exempt from R6/R7/R8: adoption does not impose the same friction on a spike as on production code.

Coda

An agent that doesn't disagree with you isn't a counterpart — it's a faster typist. These 14 rules restore the disagreement — materially, not mentally. They don't ask the agent to be less sycophantic, nor the solo to be more vigilant; they install the apparatus (invocable skills, blocking hooks, sub-agent challenger) that interrupts the productive course where complaisance compounds. The toolkit is the prosthesis the solo has left when PR review has disappeared and they refuse to code by ear. If a single one of the 14 spares you a fix-rollback cycle next week, it has already paid for itself.

Counterpart Toolkit v0.4.1, fourteen operational rules in ~200 lines, six iterations in 32 days, four external reviews integrated. Tested on 60+ days of solo ERP (118,808 lines, 65+ ADRs). License CC-BY-4.0: github.com/michelfaure/doctrine-counterpart