DEV Community

Odilon HUGONNOT
Odilon HUGONNOT

Posted on • Originally published at web-developpeur.com

I Versioned the Way I Think. Then I Forced It to Comply.

One morning I pasted four principles into my CLAUDE.md, the global instruction file Claude Code reads at the start of every session. "Think before you code", "simplicity first", that kind of maxim you see fly by on X, credited to Andrej Karpathy. I felt clever for about a day.

Then I watched Claude read the file, nod, and carry on exactly as before. A CLAUDE.md is a suggestion box. The model nods, then does whatever it wants. If I wanted it to code my way, writing it down wasn't going to cut it. I had to enforce it.

What follows is what that frustration turned into: a config in four layers, reinstallable in one command, and a discovery that runs through everything else. The only rigor that counts is the one a model can't grant itself.

Four layers, and only one really changes the behavior

My config has four floors, from softest to hardest.

The brain is CLAUDE.md: how I work, not the docs for my code. The rule that sums it up lives inside it: "what not to add: anything Claude rediscovers by reading the code." It holds my design principles, my stance on orchestrating subagents (I size up, I delegate, I verify: "I stay the brain, they're the hands"), and one line that becomes the thread running through the whole thing.

The references: a go-best-practices.md file the brain points to in plain text whenever Go is involved.

The skills: ten of them. A skill is a folder with a playbook that Claude loads on demand for a specific job: review code, write an article, distill a book. Mine are packaged as a marketplace, in a public GitHub repo, with a changelog and a version number. That's the real differentiator: versioned tooling, not just rules scribbled in a file.

The guardrails, finally. And this is the only layer that reliably changes behavior. The first three, the model can read and ignore. The fourth, it can't.

The four config layers, from softest (the model can ignore) to hardest (the model is bound by the guardrail) Brain CLAUDE.md: how I work References go-best-practices.md Skills 10 versioned skills Guardrails hooks: settings.json + guard.sh read, then ignored enforced

Three layers the model can ignore, one it can't.

From the "FORBIDDEN" you ignore to the block you can't

My CLAUDE.md said, in capitals: never read .env files. A suggestion. The model honored it when convenient.

So I wrote a hook. A hook is a script Claude Code runs automatically at a precise point in its cycle. guard.sh attaches to PreToolUse, intercepts every Read and every Bash before it runs, and decides:

case "$tool" in
  Read)
    is_secret_path "$path" && block "reading a secrets file ($path)."
    ;;
  Bash)
    printf '%s' "$cmd" | grep -Eq -- '--no-verify|core\.hooksPath=' \
      && block "bypassing git hooks is forbidden."
    ;;
esac
Enter fullscreen mode Exit fullscreen mode

block exits with code 2. For Claude Code, a PreToolUse hook exiting with 2 is a veto: the tool never runs. Reading a .env, bypassing git hooks with --no-verify: these are no longer things I ask it not to do, they're things that don't happen.

The hook is fail-open. The slightest doubt, a missing jq, a malformed JSON, and it allows. A bug in the guardrail has never blocked a legitimate tool. It only shuts the door on a confirmed violation.

The proof is outside, never inside the model

Here's the CLAUDE.md line that feeds everything else:

External rigor: the only proof that counts is verifiable from the outside (a test, a command, a grep, real output), never the model's self-assessment. "Are you sure?" → an LLM always says yes; demand a fact, not a claim.

Ask a model if it's sure, it says yes. Always. So my skills are built so the proof is executable, not declared.

feature-loop ships a feature in a loop, but before trusting a critical test, it mutates it, checks that it goes red, then restores it. A test that stays green after mutation tests nothing: it gets rewritten. The reason is spelled out in the skill: 76% of LLM-written tests miss the red-to-green transition. A test that always passes is as reassuring as it is dishonest.

rediger-cir, my skill for French R&D tax-credit dossiers, pushes the same logic to the extreme. No reference enters a dossier without resolving through a metadata API: the DOI has to resolve, the journal can't be predatory, the paper has to actually say what you make it say. The golden rule fits in one sentence: never ask the model that just hallucinated a citation whether that citation is real. It's precisely the one you shouldn't ask.

Why the reviewer is never the author

If I had to keep a single decision out of this whole config, it'd be this one.

senior-review reviews code before merge. The obvious trap would be to ask the model that just wrote the code to review itself. Except a model judging its own output over-rates it: self-preference bias is documented and causal (Panickssery, 2024). An honest reviewer has to be blind. My reviewer agents run in a clean context, never see the implementation prompt, and when the author is a known model, they're from a different family. The developer's name never enters the prompt.

Then comes the gate that does the cleanup: no finding without a receipt. Every flag cites a file:line and passes a check, a grep, a sandbox run, a failing test. An unverifiable finding is dropped silently. Models over-flag by reflex; grounding each flag in a real execution removes around 60% of false positives. The goal isn't to find the most problems. It's to surface only the real ones.

Ten skills, from scoping to wrap-up

feature-loop, senior-review and rediger-cir, I've already dissected. But they don't live alone. The ten skills form a chain: four cover the life cycle of a dev task, from scoping to commit, and each knows when to hand off to the next. The other six are orthogonal, pulled out when needed.

Skill

Its marrow

issue-mr

Scope. Turns a fuzzy task into a clean issue, branch and PR. An analyse mode settles the design before a line is written.

feature-loop

Build. A quality loop where author, tester and reviewer are separate agents: an objective gate (build, tests, red-check) before any review, a mandatory live smoke test.

senior-review

Review. Blind reviewers per dimension, no finding without a receipt, an adversarial panel on the critical ones.

branch-wrap-up

Close out. Proposes a conventional commit, push and PR, captures knowledge. Never commits without my go-ahead.

code-mentor

Teach. Slows down at each decision, explains the why, makes me predict before it reveals. The opposite of copy-paste.

skill-builder

Design. The principles for a skill that actually triggers and stays focused: one anchor word, a predictable process.

tech-article

Write. An article that sounds human, not AI. The one you're reading came out of it.

book-distill

Distill. A faithful reading note of a book, every quote verified word-for-word against the text.

rediger-cir

Justify. A French R&D tax-credit dossier where every reference resolves through an API, zero hallucination.

vide-contexte

Remember. Before a context reset, persists the non-obvious insights into memory files.

One thread runs through all of them: the author is never the reviewer, the objective gate comes before the model's judgment, and nothing commits without my consent. These aren't ten tools, they're one way of working, declined ten times.

How I work, held together by the CLAUDE.md

The brain of the config comes down to a few rules I apply to myself as much as to the model.

  • The mother matrix. I don't run everything myself. I size up the difficulty, then delegate to calibrated subagents: a Haiku for the mechanical, an Opus for the hard reasoning. I stay the brain, they're the hands.
  • Expose, don't guess. Before coding, I state my assumptions. Several readings possible, I lay them all out instead of silently picking one.
  • Surgical. Every changed line traces back to the request. No opportunistic refactor on code that isn't broken, no style drifting in on the back of a fix.
  • The simplest thing that works. YAGNI. A one-implementation interface, a just-in-case abstraction, one more file for no reason: all stop signals.
  • Comments for a senior dev. He gets the what in five seconds. A comment survives only if it carries a non-obvious why, a trap, an invariant.

The sixth rule you already know: the proof is external. It's the one that turns all the rest from good intentions into guarantees.

222,793 stars prove nothing

While building this config, I studied the big repos in the space. The numbers, checked against the GitHub API the day I'm writing this, are dizzying. "Everything Claude Code" tops out at 222,793 stars. Next to it, a very clean skills repo by Matt Pocock has 148,737, and another, sold as "Karpathy's skills", 183,652.

That last one deserves a warning, because it illustrates the theme better than any test. The repo is nothing official: it's a third party's reconstruction of a single tweet, duplicated into four formats. The label says Karpathy. The source says a copied tweet. Trusting the label is exactly the mistake my whole config tries to make impossible.

As for the 222,793-star behemoth, its traction comes from distribution, not code: a won Anthropic hackathon, threads with millions of views, an umbrella name, a README turned into a landing page. Under the hood, two or three solid hooks, whose .env-blocking and --no-verify idea I borrowed, by the way, a "learning" loop disabled by default, and internal counters that contradict each other from one section to the next.

My own repo sits at zero stars. And its guardrails are stricter than the average of the one with 222,793. The lesson isn't bitter, it's useful: stars measure a well-told story, not rigor. The two have almost nothing to do with each other.

The check an LLM can't grant itself

Before publishing, I had my own skills audited. The verdict: the scientific references were 98% real, but four falsely precise numbers and one misattributed quote had slipped into the batch. Fixed. Even an author who thinks he's rigorous can't certify himself alone.

This article is the final proof. I had one of the ten skills write it, from the public repo, on the strength of a brief I'd written myself. That brief claimed "Everything Claude Code" was a lone outlier, five times bigger than everything else. The check against the GitHub API, right before writing, showed there's actually a whole pack of six-figure repos. My own brief was wrong, and only a call to an external source caught it.

That's the whole point. The model doesn't know it's wrong. Neither do I, half the time. The only thing that knows is the test that goes red, the DOI that resolves, the file:line you can open. A Claude Code config worth having isn't a list of good intentions. It's the place where proof stops being a claim.

Top comments (0)