CLAUDE.md instructions get followed ~60-70% of the time. Mitchell Hashimoto's AGENTS.md in Ghostty has zero aspirational lines — every entry traces to a real agent mistake. Use the Failure-to-Constraint Decision Tree: dangerous actions go to Hooks, repeatable workflows go to Commands, style/convention goes to CLAUDE.md.
Two CLAUDE.md files. Same project. Different philosophies:
# ❌ Before: instruction-first CLAUDE.md (typical)
# 47 lines of well-meaning rules
- "Be careful with production database."
- "Always write tests."
- "Use TypeScript strict mode."
- "Follow our naming conventions."
# Claude reads these, weighs them against 200K tokens... follows ~65%.
# ✅ After: failure-first CLAUDE.md (Hashimoto method)
# 12 lines, each traced to a specific incident
- "NEVER use git push --force. Use --force-with-lease."
# Failure: 2026-03-12, force push overwrote teammate's commits on feature/auth
- "Run npm test before ANY git commit. No exceptions."
# Failure: 2026-02-28, broken import pushed to main, CI caught 20min later
One file has 47 lines of advice. The other has 12 lines of scars. Which one does the agent actually follow?
The answer isn't close. The 12-line file wins every time, because every line carries weight. Every line exists for a reason the model can evaluate. The 47-line file is a wishlist. The 12-line file is a harness.
Why do most CLAUDE.md files fail?
Most CLAUDE.md files fail because developers write them like job descriptions: aspirational, comprehensive, bloated. LLMs don't execute instructions like code executes functions. They weigh each instruction against the full context window. More lines means more dilution, which means lower compliance per line.
The data backs this up. An ETH Zurich study (Gloaguen et al., 2026) tested context files across 138 real GitHub issues and found that LLM-generated agent files actually reduced success rates by 0.5-2% while increasing inference costs by 20-23%. Even developer-provided files only improved performance by ~4% on average. The typical developer-written file averaged 641 words across 9.7 sections.
That's a lot of instructions for a 4% gain.
| Metric | 200-line CLAUDE.md | 40-line CLAUDE.md |
|---|---|---|
| Instructions | ~200 | ~40 |
| Compliance | ~60-70% | ~85-90% |
| Maintenance | Monthly pruning needed | Self-maintaining |
Frontier LLMs can follow approximately 150-200 instructions with reasonable consistency. Your 200-line CLAUDE.md already exceeds that budget before counting the system prompt (another ~50 instructions). Community benchmarks put compliance at 60-70% for files over 200 lines. That's a coin flip for your most important rules.
What is the Mitchell Hashimoto method for AGENTS.md?
Mitchell Hashimoto (creator of Terraform, Vagrant, and now Ghostty) treats AGENTS.md as a failure log, not an instruction file. Every single line in Ghostty's AGENTS.md exists because the agent made that specific mistake at least once. No line is aspirational. Every line is a scar from a real incident.
In his own words:
"Each line in that file is based on a bad agent behavior, and it almost completely resolved them all" — mitchellh.com, 2026
The mental model shift matters:
| Instruction-first | Failure-first |
|---|---|
| "What should the agent do?" | "What has the agent broken?" |
| Proactive, aspirational | Reactive, evidence-based |
| High volume, low signal | Low volume, high signal |
| Added before problems occur | Added after problems occur |
| Dilutes over time | Strengthens over time |
Instructions are wishes. Constraints are lessons. LLMs don't need more wishes. They need fewer, sharper constraints with concrete context about why each one exists.
How do you build CLAUDE.md from failures instead of imagination?
Start with a minimal CLAUDE.md containing only your project overview and tech stack. Run the agent on real tasks. When it breaks something, convert that failure into a constraint. Then route the constraint to the right layer.
Step 1: Start minimal
Your initial CLAUDE.md should be 5-10 lines:
# Project: Acme SaaS
TypeScript, Next.js 15, Drizzle ORM, deployed on Vercel.
## Build
npm run build && npm test
That's it. No rules. No conventions. No aspirational guidelines. Just enough context for the agent to understand what it's working on.
Step 2: Run the agent, observe failures
Use the agent for real work. Don't preemptively add rules. When the agent makes a mistake, write down exactly what happened:
- What: force-pushed to main
- When: 2026-03-12
- Impact: overwrote teammate's commits on feature/auth
Step 3: Convert the failure into a constraint
Turn the incident into a specific, testable rule:
NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth
The pattern is always the same: CONSTRAINT + REASON + FAILURE DATE.
Step 4: Route it with the decision tree
Not every constraint belongs in CLAUDE.md. Route each failure with this decision tree:
Agent made a mistake
│
├── Is the action irreversible or dangerous?
│ YES → Hook (PreToolUse block)
│ Examples: delete production files, force push, edit .env
│
├── Is it a repeatable workflow the agent should automate?
│ YES → Command or Skill (.claude/commands/)
│ Examples: run tests after refactor, update changelog
│
└── Is it a style, convention, or context issue?
YES → CLAUDE.md constraint
Examples: naming conventions, test patterns, commit format
If you take one thing from this post, take the decision tree. It replaces the instinct of "something went wrong, let me add a line to CLAUDE.md" with a structured routing decision.
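The first branch of the tree is worth making concrete. Here is a minimal sketch of the blocking logic a PreToolUse hook might use, written as a shell function — the function name and messages are illustrative, not from Ghostty or any real repo. In an actual hook script, Claude Code sends the tool call as JSON on stdin, so you would first extract the command (e.g. with `jq`) and then exit with this function's return code, since exit code 2 is what blocks the action:

```shell
# Sketch of the blocking logic for a hypothetical PreToolUse hook.
# In the real hook script, the command arrives as JSON on stdin, e.g.:
#   cmd=$(jq -r '.tool_input.command // empty')
# Exiting with code 2 tells Claude Code to deny the tool call and feed
# stderr back to the model.
block_force_push() {
  cmd="$1"
  case "$cmd" in
    *--force-with-lease*)
      return 0 ;;                     # the safe variant is allowed
    *"git push"*--force*)
      echo "Blocked: use git push --force-with-lease instead" >&2
      return 2 ;;                     # 2 = deny the tool call
  esac
  return 0
}
```

Registered as a PreToolUse hook on the Bash tool, this turns a 60-70% "NEVER force push" suggestion into 100% enforcement.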
What does a CLAUDE.md look like before and after?
Before: instruction-first (47 lines)
# Project: Acme SaaS
## Rules
- Be careful with production database.
- Always write tests.
- Use TypeScript strict mode.
- Follow naming conventions.
- Don't use deprecated APIs.
- Keep functions under 50 lines.
- Use ESLint and Prettier.
- Comment complex logic.
- Don't hardcode environment variables.
- Use meaningful variable names.
# ... 37 more aspirational rules like these
Every line is reasonable. None is specific. The agent reads all 47, retains maybe 30, and consistently follows maybe 25.
After: failure-first (18 lines)
# Project: Acme SaaS
TypeScript, Next.js 15, Drizzle ORM, Vercel.
## Build
npm run build && npm test
## Constraints (each from a real failure)
NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth
Run `npm test` before ANY git commit.
# 2026-02-28: broken import shipped to main, CI caught 20min later
Schema migrations: always generate with `drizzle-kit generate`.
# 2026-03-05: hand-written migration missed NOT NULL, broke staging
API routes: validate input with zod schemas, never trust req.body.
# 2026-03-18: unvalidated input caused 500 errors for 2 hours
18 lines. 4 constraints. Each one backed by a real incident with a date. The agent knows not just what to avoid but why, which makes the constraint stickier in context.
How do you categorize failures into the right layer?
| Layer | Enforcement | Compliance | Example |
|---|---|---|---|
| Hook | Deterministic (shell script) | 100% | Block `git push --force` |
| Command | Deterministic (executed) | 100% | Run tests after refactor |
| CLAUDE.md | Probabilistic (LLM context) | 60-90% | Use camelCase naming |
Category A: Structural failures → Hook. File deletion, sensitive config edits, force pushes. For irreversible actions, you need 100% enforcement, not 60-70%.
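Graduating a Category A constraint to a hook is a config change, not a CLAUDE.md edit. A sketch of what the registration might look like in `.claude/settings.json`, assuming a hypothetical blocking script at `.claude/hooks/block-force-push.sh` — verify the exact schema against the current Claude Code hooks documentation:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/block-force-push.sh" }
        ]
      }
    ]
  }
}
```

The script receives the tool call as JSON on stdin; exiting with code 2 blocks the action and returns stderr to the model.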
Category B: Style and convention failures → CLAUDE.md. Variable naming, comment style, test patterns, commit format. Low-stakes if violated occasionally.
Write them as failure-derived constraints:
- Use camelCase for variables, PascalCase for components.
# 2026-03-20: agent used snake_case in 3 React components, broke style consistency
- Test files go in `__tests__/` next to the source file, not in a top-level `test/` dir.
# 2026-02-15: agent created test/api/users.test.ts, missed by our jest config
Category C: Workflow failures → Commands/Skills. "Always run tests after refactor." "Always update the changelog after API changes." These are repeatable processes. Don't remind the agent. Automate it.
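For example, the "run tests after refactor" reminder could become a slash command instead. A sketch of a hypothetical `.claude/commands/refactor-check.md` — in Claude Code, the file body becomes the prompt and `$ARGUMENTS` is replaced by whatever follows the command:

```markdown
Run the test suite for the code you just refactored in $ARGUMENTS:

1. Run `npm test` and report any failures.
2. If a test fails, fix the regression before making further changes.
3. List the files you touched and the tests that cover them.
```

Invoking `/refactor-check src/auth` injects this prompt the same way every time, instead of relying on the model to remember a CLAUDE.md line.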
How do you keep CLAUDE.md lean over time?
Prune monthly. HumanLayer's production CLAUDE.md is under 60 lines. Bloat is the number one killer of CLAUDE.md effectiveness.
Monthly pruning checklist:
For each constraint in CLAUDE.md, ask:
1. Has the agent triggered this constraint in the past 3 months?
NO → candidate for removal
2. Has this constraint graduated to a Hook?
YES → remove from CLAUDE.md (now enforced, not suggested)
3. Is this a workflow that could be a Command instead?
YES → move to .claude/commands/, remove from CLAUDE.md
4. Can I name the specific failure behind this line?
NO → delete it (it's aspirational, not evidence-based)
5. Does the agent already do this correctly without the instruction?
YES → delete it (you're wasting instruction budget)
I did this exercise on a 90-line CLAUDE.md last month. It dropped to 23 lines. The agent's compliance on the remaining rules went up noticeably within the first session. Fewer rules, better followed.
FAQ
What is the difference between CLAUDE.md and AGENTS.md?
CLAUDE.md is Claude Code's project-level instruction file, loaded automatically at session start. AGENTS.md is an emerging open standard backed by OpenAI Codex, Amp, Google Jules, and Cursor that serves the same purpose but is agent-agnostic. Both are repository-level context files. If you use Claude Code, write CLAUDE.md. If you want cross-agent compatibility, also add an AGENTS.md. The failure-first methodology applies to both.
Should I start CLAUDE.md from scratch or use a template?
Start from scratch with only three things: project name, tech stack, build commands. Then build it through the failure-first workflow: run the agent, observe mistakes, add constraints one at a time. Templates encourage instruction-first thinking, which is the exact problem this post addresses.
Can the agent override or ignore CLAUDE.md constraints?
Yes. CLAUDE.md is "soft" context. The LLM weighs it against other context but can ignore it. Compliance runs 60-70% with large files, higher with lean files. For constraints that must be followed 100% of the time, use Hooks instead. Hooks run as shell scripts and physically block the action. The model cannot bypass them.
How many lines should CLAUDE.md have?
As few as possible. Research suggests LLMs follow ~150-200 instructions consistently, but that budget is shared with the system prompt (~50 instructions). Aim for 30-60 lines of failure-derived constraints plus a minimal project overview. If your file exceeds 100 lines, audit it with the failure-first test: can you name the specific incident behind each line?
Try it now: Open your CLAUDE.md right now. For each line, write the specific failure that caused you to add it. If you can't name the incident, delete the line.
How many lines survived? Drop your before/after count in the comments.
Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.
I had a 90-line CLAUDE.md last month. Rules for everything. Naming conventions, test patterns, git workflows, API design, deployment checklist. Carefully organized with headers and bullet points.
Claude followed about 65% of it.
Then I learned about Mitchell Hashimoto's approach to AGENTS.md in Ghostty. Every single line in his file traces to a real agent mistake. No aspirational rules. No "best practices." Just scars.
So I tried it. I went through my 90-line file and asked one question for each line: "Can I name the specific failure that caused me to add this?"
23 lines survived.
The problem with instruction-first thinking
Most of us write CLAUDE.md like a job description — comprehensive, aspirational, bloated. But LLMs don't execute instructions like code executes functions. They weigh each instruction against everything else in the context window.
ETH Zurich tested this across 138 real GitHub issues. LLM-generated context files actually reduced success by 0.5-2% while increasing costs by 20-23%. Even developer-written files only improved things by ~4%.
The math is brutal: 200 lines of instructions, shared with a system prompt that already has ~50 instructions, competing for a model's attention across 200K tokens. Your most important rule has the same weight as "use meaningful variable names."
The failure-first method
Hashimoto's approach is the opposite. Start with almost nothing — project name, tech stack, build command. That's 5 lines. Then run the agent on real work. When it breaks something, you have three choices:
Is the action dangerous or irreversible? → Don't put it in CLAUDE.md. Put it in a Hook. A PreToolUse hook that exits with code 2 physically blocks the action. 100% enforcement. No exceptions. Force pushes, file deletions, production edits — these need hooks, not suggestions.
Is it a repeatable workflow? → Put it in .claude/commands/. A command runs deterministically every time. A CLAUDE.md instruction runs when the model remembers it.
Is it a style or convention issue? → Now it belongs in CLAUDE.md. But write it as: CONSTRAINT + REASON + FAILURE DATE.
NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth
The failure context makes the constraint stickier. The model doesn't just know what to avoid — it knows why. That matters for a model that weighs instructions instead of executing them.
The result
My 90-line file → 23 lines. Compliance on the remaining rules went up noticeably in the first session. Fewer rules, better followed. The dangerous ones graduated to Hooks where they're enforced 100%. The workflows became Commands. What remained in CLAUDE.md was lean, specific, and battle-tested.
The monthly pruning rule: for each line, can you name the incident? No? Delete it. Has it graduated to a Hook? Remove it. Is the agent already doing it right without being told? You're wasting instruction budget.
Read the full breakdown with the decision tree, before/after examples, and pruning checklist →
This week's takeaway: Your CLAUDE.md is probably too long. The fix isn't writing better instructions — it's deleting the ones without scars behind them.
How many lines is your CLAUDE.md right now? Reply — I'm genuinely curious about the range people are working with.
If you know someone drowning in a 200-line CLAUDE.md, forward this. They'll thank you.