DEV Community: Peter Huang

Writing a CLAUDE.md that Claude actually follows

Peter Huang — Wed, 10 Jun 2026 06:19:16 +0000

TL;DR — CLAUDE.md is powerful, but most people fill it with vague preferences that Claude acknowledges and then ignores. The instructions that stick are specific, verifiable, and binary — not stylistic. This post shows the difference, with examples.

The CLAUDE.md that got ignored

My first CLAUDE.md had 30 lines. Things like:

- Write clean, readable code.
- Keep functions small and focused.
- Follow best practices.
- Be concise in explanations.
- Don't over-engineer solutions.

Claude read it. Every session it would acknowledge the file. And then it would write 400-line explanations, functions that did four things, and solutions with three layers of abstraction nobody asked for.

I blamed Claude. But the real problem was me: I'd written instructions that I couldn't have enforced even if I were reviewing a junior engineer's PR. "Be concise" isn't actionable. It's an aspiration. And aspirations don't constrain LLM behavior — they just get absorbed into the same ambient politeness that produces confident, plausible, wrong output.

Why most CLAUDE.md instructions don't stick

An LLM follows instructions the way a very eager-to-please person follows vague advice: by interpreting them in whatever way is consistent with what they were already going to do. "Write clean code" is consistent with almost anything. "Don't over-engineer" depends entirely on what counts as over-engineering, which the model will assess using its own judgment, which is the thing you were trying to override.

The instructions that actually change behavior share a few properties:

1. They're binary, not scalar.

"Be concise" is scalar — there's a spectrum and the model decides where on it to land. "Do not write an introduction paragraph that summarizes what you're about to do" is binary. Either there's an intro paragraph or there isn't. Binary instructions are enforceable; scalar ones get interpreted.

2. They name a specific artifact or action.

"Keep functions small" is vague. "If a function exceeds 40 lines, split it before returning" names a specific thing Claude can check. The more you describe the output rather than the quality, the more Claude can follow it.

3. They have no wiggle room for judgment.

"Prefer composition over inheritance" invites Claude to decide when composition is preferable. "Do not use class inheritance in this repo — use composition" closes the judgment loop. Instructions that leave room for discretion will be filled with the model's discretion.

A before/after on common instructions

Here's a CLAUDE.md rewrite that applies those three properties:

Before: - Write concise explanations.

After: - When I ask a question, answer in 3 sentences or fewer unless I explicitly ask you to elaborate. No bullet lists in conversational responses.

Before: - Follow the project's code style.

After: - Before writing any code, read the nearest existing file in that directory and match its import style, spacing, and naming exactly. Do not introduce new patterns.

Before: - Don't add features I didn't ask for.

After: - Do not modify any file I didn't mention in my request. If you think an adjacent change would help, describe it in a sentence at the end — do not make it.

Before: - Be careful with database changes.

After: - Never generate a migration without a rollback. Every migration file must have both an \upand a \downfunction before you show it to me.

The right column isn't longer because it's more thorough. It's longer because specificity takes more words. But each instruction on the right has exactly one interpretation.

The two categories of CLAUDE.md instructions

After rewriting mine a few times, I've found that instructions fall into two useful categories:

Constraints on output — things Claude must or must not produce. These work well when they're binary and verifiable at a glance. "Do not add comments to code unless I ask" is a constraint on output. So is "all functions must have return type annotations" or "never produce a file over 200 lines without splitting."

Procedural instructions — steps Claude must take before doing the main task. "Before writing a function, read the test file for that module" is procedural. So is "before answering a question about the codebase, run grep for the relevant symbol and cite the file:line." Procedural instructions work because they give Claude a concrete first step that shapes what it does next.

What doesn't work well: value statements, quality adjectives, relative preferences ("prefer X", "try to Y", "consider Z"). These are absorbed. They don't constrain.

What to put at the top

CLAUDE.md instructions have an effective range. The further down the file, the more likely they are to be deprioritized under a long task. Put the most important constraints at the top — not as an introduction or preamble, but as the first actual rules Claude encounters.

My current CLAUDE.md opens with three lines that are binary, specific, and cover the mistakes I'd otherwise see every session:

# Rules (always follow)

1. Do not modify files I did not explicitly mention. If a related change seems useful, list it at the end and wait.
2. Do not add code comments unless I ask. Well-named variables explain themselves.
3. When showing diffs or edits, show only the changed lines plus 2 lines of context — no full-file reprints.

Those three lines have changed my sessions more than the other 40 instructions combined.

Testing whether it works

You'll know your CLAUDE.md is working when you stop seeing the behavior you wrote it to prevent. That sounds obvious, but most people write CLAUDE.md once, never check whether it changed anything, and assume it did.

A practical test: after updating CLAUDE.md, trigger the exact scenario the new instruction was meant to handle. Write a request that would previously have produced the unwanted behavior. If it still appears, the instruction isn't binary enough, isn't specific enough, or is too far down the file. Revise until it's gone.

CLAUDE.md is not documentation. It's configuration. Treat it like a test suite: if the behavior you're trying to prevent still appears, the config is wrong.

The free agent + where to go deeper

If you're building a real CLAUDE.md-driven workflow, the same design philosophy applies to subagents: narrow scope, binary constraints, explicit refusal rules. The free shipping-coach agent in this repo is a working example of that approach — it's a 50-line .md file worth reading for its structure, not just its functionality:

👉 github.com/allcanprophesy-ops/claude-code-shipping-coach

A few more agents built the same way (pr-surgeon, regression-sentinel) are linked from the README if you want to see how the pattern scales beyond a single agent. But start by reading your own CLAUDE.md and asking: which of these instructions is binary? Which ones leave room for the model to decide? Fix the second category first.

What's the one instruction in your CLAUDE.md that you've tried multiple phrasings for and it still doesn't work? Drop it in the comments — I'll suggest a binary rewrite.

Why Claude Code never runs your subagent

Peter Huang — Wed, 03 Jun 2026 06:30:07 +0000

TL;DR ‚Äî Claude Code's description field is not a human-readable summary. It's the routing signal the dispatcher reads to decide whether to invoke your subagent. Write it like documentation and your agent never fires. Write it in trigger phrases and it runs when it should.

The agent that sat there doing nothing

I wrote a subagent I was proud of. Solid instructions, tight scope, the works. Dropped it in ~/.claude/agents/, opened a project, did exactly the kind of task it was built for ‚Äî and nothing happened. Claude just did the work itself, inline, ignoring the agent entirely.

I assumed I'd misconfigured something. I hadn't. The agent was fine. The problem was the one field I'd treated as an afterthought: the description. I'd written it the way you'd write a README line:

description: An expert code reviewer that analyzes your code for quality and bugs.

That sentence is for a human browsing a list. It is almost useless to the thing that actually decides whether to call my agent.

The description is a routing signal, not a summary

Here's the part the docs don't emphasize enough: Claude Code has a dispatcher that, on each turn, looks at what you asked for and decides whether any subagent should handle it. The only thing it has to make that decision ‚Äî before reading the agent's body at all ‚Äî is the description field.

So the description isn't documentation. It's a classifier input. Its entire job is to help the dispatcher answer one question: "Does this user's request match this agent?"

"An expert code reviewer that analyzes your code for quality and bugs" answers that badly. It describes an identity ("an expert reviewer"), not a trigger. The dispatcher is trying to match your request ‚Äî "can you check this before I push?" ‚Äî against the description, and "expert code reviewer" gives it very little surface to match on.

What a triggering description looks like

Compare. Here's the same agent, rewritten so the dispatcher can actually route to it:

description: Use when the user is about to commit, push, or open a PR and
  wants a final check on their changes. Triggers on "review my changes",
  "is this ready", "check before I push", "anything I missed". Reviews the
  current diff for bugs and risk ‚Äî not style or naming.

Look at what changed:

It leads with *when to use it* ‚Äî "Use when the user is about to commit, push, or open a PR." The dispatcher is matching situations, so describe the situation.
It lists literal trigger phrases ‚Äî "review my changes", "is this ready", "check before I push". These are the actual words a user types. Giving the dispatcher real phrasings to match against is the single biggest improvement you can make.
It states an anti-trigger ‚Äî "not style or naming." This stops the agent from grabbing requests that belong to a different agent (or to Claude itself). Scoping out is as valuable as scoping in.

That's the whole trick. The body of the agent can stay exactly the same. Just the description determines whether it ever gets a chance to run.

The anatomy of a description that fires

A reliable pattern, in order:

description: Use [PROACTIVELY] when [situation]. Triggers on [literal user
phrases]. [What it does in one clause]. [Anti-trigger: when NOT to use it].

Use PROACTIVELY when... ‚Äî the word PROACTIVELY is a real signal. It nudges the dispatcher to offer the agent without the user explicitly asking, for situations where that's wanted (a pre-commit check, say). Use it deliberately; don't sprinkle it on everything or every agent becomes pushy.
Triggers on "..." ‚Äî the highest-leverage part. Put 3‚Äì5 real phrasings a user would actually type. Think about how you would ask for this thing on a tired Friday, not how you'd formally describe it.
One clause on what it does ‚Äî enough for the dispatcher to disambiguate between two similar agents.
An anti-trigger ‚Äî "not X", "does not handle Y". Prevents overlap and false fires.

How to test whether it routes

Don't guess. After writing the description, open Claude Code and say the trigger phrase you put in the description. If the agent fires, the routing works. If it doesn't, your description and your real phrasing have drifted apart ‚Äî fix the description to match how you actually talk.

A good loop: use the agent for a week, notice the times you wanted it and it didn't fire, and add those exact phrasings to the Triggers on list. The description is a living classifier; tune it against reality.

See it in a real agent

If you want a working reference, my free, MIT-licensed Claude Code agent has a description written exactly this way ‚Äî situation, trigger phrases, anti-triggers, and a deliberate PROACTIVELY:

üëâ github.com/allcanprophesy-ops/claude-code-shipping-coach

cp shipping-coach.md ~/.claude/agents/

Open the file and read the frontmatter ‚Äî the description is doing real routing work, and it's the part worth copying before anything else. A few more agents built the same way (pr-surgeon, regression-sentinel, test-gap-hunter) are linked from the README, but start by reading one description closely. It's the most important line in the file.

Have you written a Claude Code agent that just never fires? Paste its description in the comments and I'll suggest how to make it trigger.

Your AI writes PR descriptions from your commit messages. That's the bug.

Peter Huang — Sun, 31 May 2026 06:37:43 +0000

TL;DR ‚Äî Commit messages describe your intentions. The diff describes reality. They drift apart over the life of a branch, and most AI PR-description tools summarize the wrong one. A good PR agent reads the diff. Here's the design, and a free agent to start from.

The PR description that lied

A reviewer pinged me on a PR last year: "The description says this adds rate limiting, but I'm looking at the diff and it also changes how we hash session tokens. Was that intentional?"

It was intentional. I'd done both on the same branch. But my PR description only mentioned the rate limiting ‚Äî because I'd generated it from my commit messages, and my commit messages were wip, rate limit middleware, fix, fix again, and tweak. The token-hashing change rode in under fix again. The tool that wrote my description faithfully summarized my commits, and my commits faithfully hid half of what I'd actually done.

The reviewer caught it. But the whole point of a PR description is that the reviewer shouldn't have to reconstruct the change from the diff themselves. That's the job I was supposed to have done.

Commits are intentions; the diff is reality

Here's the core problem with generating PR descriptions from commit messages: a commit message is what you believed you were doing at the moment you typed it. It's written before the change is finished, often mid-thought, frequently as fix or address review comments. It captures intent, badly, at a point in time.

The diff against the base branch is different. It's the complete, current, factual statement of what will actually change when this merges. Every renamed function, every altered return shape, every migration, every accidental console.log ‚Äî all of it is in the diff, and none of it lies, because the diff is the thing being merged.

When an AI tool paraphrases your commit messages, it's summarizing a summary ‚Äî one written hastily, by you, before you were done. Garbage in, plausible-sounding garbage out. The description reads fine. It's just not true.

A PR agent should read the diff

The fix is almost embarrassingly simple to state: have the agent read git diff <base>...HEAD, not git log. But there are a few design choices that separate a PR agent you trust from one that produces filler.

Here's the shape of one (this is a real Claude Code subagent; the structure matters more than the exact wording):

---
name: pr-surgeon
description: "Use when about to open or push a pull request. Reads the"
  actual diff against the base branch and writes a tight PR title + body
  with a real test plan. Triggers on "open a PR", "PR description",
  "ready to merge".
tools: Bash, Read, Grep, Glob
model: inherit
---

You write PR descriptions that reviewers actually read.

## Procedure

1. Find the base branch (gh pr view --json baseRefName, else origin HEAD,
   else main).
2. Read the FULL diff: `git diff <base>...HEAD`. Also read the commit
   list ‚Äî but the diff is the source of truth. Commit messages lie; diffs
   don't.
3. Identify the ONE thing this PR does. If it does more than one thing,
   say so explicitly under a "Note to reviewer" heading ‚Äî do not hide it.

## Output

### Title  ‚Äî <70 chars, imperative, matches the repo's recent PR style>
### What changed ‚Äî 2‚Äì5 bullets, each a concrete behavior change, not a file
### Why ‚Äî the user-visible reason. If you can't find it, ASK. Don't invent.
### Test plan ‚Äî a checklist a reviewer can actually run. Real commands.
### Risk / rollback ‚Äî one line: what breaks if this is wrong, how to revert.

Three things in there are doing the real work:

1. "The diff is the source of truth. Commit messages lie." This single instruction is the whole thesis. The agent is allowed to read the commits for context, but it must reconcile them against the diff and trust the diff. That's what would have caught my token-hashing change ‚Äî it's in the diff whether or not a commit mentions it.

2. "If it does more than one thing, say so explicitly." Agents, like people, want to present a clean single-purpose story. An honest PR agent surfaces scope creep instead of smoothing it over. The "Note to reviewer" heading is where the token-hashing change would have been forced into the open.

3. "Why ‚Äî if you can't find it, ASK. Don't invent." This is the anti-hallucination guard. The diff tells you what changed but rarely why. A weaker agent fills that vacuum with confident fiction ("this refactor improves maintainability"). A good one admits the gap and asks you, because a made-up rationale is worse than a blank.

The section everyone skips: the test plan

Most PR descriptions ‚Äî human or AI ‚Äî stop at "what changed." But the highest-value part of a PR description for the person reviewing it is the test plan: a concrete, runnable checklist of how to verify the change does what it claims.

Not "tested locally." That's noise. A real test plan looks like:

- [ ] `npm run test:rate-limit` passes
- [ ] Hit `/api/login` 6√ó in 10s ‚Üí 6th returns 429
- [ ] Existing session tokens still validate after deploy (no forced logout)

That last line is exactly the kind of thing a diff-reading agent can generate and a commit-summarizing agent never could ‚Äî because the token-hashing change is in the diff, so the agent knows to tell the reviewer to check that existing sessions survive.

Build your own, or start from a free one

The pattern generalizes to any "summarize a change" agent: read the artifact that represents reality (the diff, the schema, the built output), not the artifact that represents intention (commit messages, ticket titles, your own memory).

If you want a starting point, my free, MIT-licensed Claude Code agent is a good template for the structure of a focused agent ‚Äî tool scoping, fixed output format, explicit refusal rules:

üëâ github.com/allcanprophesy-ops/claude-code-shipping-coach

cp shipping-coach.md ~/.claude/agents/

It's a pre-merge checker rather than a PR writer, but it's built on the same bones, and reading one well-structured agent file teaches you more than any amount of theory. The pr-surgeon agent sketched above ‚Äî plus a few others (regression-sentinel, test-gap-hunter) ‚Äî are linked from that repo's README if you'd rather not build from scratch. But honestly: read the free one first, then decide.

What's the worst PR description you've ever had to review ‚Äî or write? I'm collecting examples of where "summarize the commits" goes wrong. Drop one in the comments.

The best Claude Code agents are defined by what they refuse to do

Peter Huang — Sun, 31 May 2026 06:09:35 +0000

TL;DR ‚Äî When I write a Claude Code subagent, the most important part isn't the instructions for what it should do. It's the list of what it must refuse to do. This post explains why, and walks through a real ~50-line agent (free, MIT) that catches the embarrassing stuff in your diff before you merge.

The agent that was too helpful

The first useful-sounding Claude Code subagent I ever wrote was a "code reviewer." The prompt was the obvious thing: "You are a senior engineer. Review this diff thoroughly. Comment on bugs, style, naming, architecture, performance, security, and test coverage."

It worked, technically. It produced a review. A long one. Every single time.

And that was the problem. A review that flags 23 things trains you to read zero of them. The signal ‚Äî "you left a hardcoded API key in here" ‚Äî was buried on line 14 between "consider extracting this into a helper" and "this variable name could be more descriptive." I started skimming its output. Then I started ignoring it. An agent you ignore is worse than no agent, because you've paid the latency and convinced yourself you "have review covered."

The fix wasn't a better "what to do" list. It was the opposite.

Refusal lists

Here's the reframe that made my agents actually useful: a good agent has exactly one job, and an explicit list of things it will not do ‚Äî even when it could.

The "will not do" list is the load-bearing part. LLMs are eager. Given any opening to be more thorough, more helpful, more comprehensive, they take it. Left unconstrained, every agent drifts toward the same bloated generalist that comments on everything. The refusal list is what holds the agent to a sharp edge.

Concretely, a refusal list looks like this (from a pre-merge check agent):

## Rules

- Do not autofix. The user fixes; you verify.
- Do not comment on naming, design, architecture, or "could be cleaner."
  Other tools do that. Your job is narrower.
- Do not pad the report. If there are no blockers, say so in one line.
  A short honest report beats a long padded one.
- Do not run anything destructive (db resets, --fix flags that rewrite
  files) without explicit user request.
- If a check tool isn't installed, say "skipped: <reason>" ‚Äî do not
  fake a pass.

Every line there is closing a door the model would otherwise wander through. "Don't pad the report" exists because the model wants to look thorough. "Don't fake a pass" exists because the model wants to give you good news. You are not describing a task; you are fencing in a behavior.

A real example: a 90-second pre-merge check

Let me make this concrete with an agent I actually use on every project. It has one job: catch the stuff that would embarrass you in PR review, before you open the PR. Leftover console.log. A hardcoded key. A .skip you forgot to remove. A .DS_Store in the diff.

Here's the shape of it (the full file is on GitHub, MIT ‚Äî link at the end):

---
name: shipping-coach
description: "Use as the final pre-merge / pre-deploy check. Runs a fast,"
  opinionated checklist over the diff. Triggers on "ready to ship",
  "pre-flight", "before I merge", "final check".
tools: Bash, Read, Grep, Glob
model: inherit
---

You are the last set of eyes before code ships. Be fast, be specific,
be hard to argue with.

## Checklist (run in parallel where possible)

### 1. Debug residue
Search the diff for: console.log, print(, debugger, .only(, .skip(,
new TODO/FIXME. Report only matches ADDED by this diff.

### 2. Secret leaks
Search for API key shapes (sk-, ghp_, AKIA, AIza), .env contents,
credentials in URLs, private keys. Treat any match as stop-the-line.

### 3. Type / lint / test status
Detect the project's check commands from package.json / Makefile /
pyproject.toml. Run typecheck, then lint, then tests. Stop on first fail.

### 4. Tracked junk
Check for .DS_Store, .env, node_modules/, build output that snuck
past .gitignore.

## Output format

## Pre-ship report (took <Xs>)
### Blockers (N)        <- empty heading if none, never omit it
### Worth a look (N)
### Passed

Three design choices in there are worth calling out, because they're the difference between an agent you trust and one you mute:

1. "Report only matches ADDED by this diff." Without this, the agent flags every pre-existing console.log in the repo and the report is instantly noise. Scope is the diff, not the codebase. One sentence, huge signal difference.

2. A fixed output format with a "Blockers" section that's never omitted. Even when there are zero blockers, the heading stays (showing "Blockers (0)"). This sounds pedantic but it's a trust mechanism ‚Äî you learn the report's shape, so you can read it in two seconds and know exactly where to look. Variable-shape output forces re-reading every time.

3. tools: Bash, Read, Grep, Glob ‚Äî and nothing else. No Write, no Edit. The agent cannot modify your files even if it wanted to, because you didn't give it the tools. Tool scoping is a guardrail you enforce at the schema level, not a promise you hope the prompt keeps.

Try building your own

The pattern generalizes. Pick any narrow job ‚Äî writing a PR description from the actual diff, finding which behaviors your change might break, auditing dependencies ‚Äî and write the agent as:

One job, stated in a sentence.
A description that the dispatcher will actually route to ‚Äî write it in trigger phrases ("when the user says X"), not abstract capability claims.
Tool scoping ‚Äî give it only the tools the job needs. Withhold Write/Edit from anything that should only report.
An output format ‚Äî so results are pipeable and skimmable.
A refusal list ‚Äî the doors you're closing. This is the part everyone skips and it's the part that matters.

Drop the file in ~/.claude/agents/ and Claude Code picks it up automatically. No framework, no config.

The free agent + where to go deeper

The shipping-coach agent above is free and MIT-licensed ‚Äî the whole thing is one ~50-line .md file you can read, fork, and modify:

üëâ github.com/allcanprophesy-ops/claude-code-shipping-coach

Install is one command:

cp shipping-coach.md ~/.claude/agents/

Then in any repo with uncommitted changes, just say "run the pre-flight check on my diff."

If the pattern clicks and you want more agents built the same way ‚Äî pr-surgeon (writes PR descriptions from the actual diff), regression-sentinel (reads a diff asking only "what could this break?"), test-gap-hunter, and a few others ‚Äî they're linked from the repo's README. But the free one is genuinely standalone; start there.

What's the one task in your workflow you wish was automated but isn't? I'm collecting edge cases where a narrow agent would help ‚Äî drop them in the comments.