DEV Community: ShipWithAI

How to Review AI-generated Pull Requests in 6 Steps with Claude Code

ShipWithAI — Tue, 26 May 2026 13:59:16 +0000

How to Review AI-generated Pull Requests in 6 Steps with Claude Code

When I started seeing three AI-written PRs land in my inbox every hour, my old checklist fell apart. The diff looked clean, the CI was green, and the commit messages were nicely phrased. Yet every PR slipped a subtle bug past me - a mock-heavy test, a wrong API signature, or a side-effect hidden at import time. After a couple of production incidents I built a linear, single-pass checklist that catches the six error families that coding agents (Claude Code, Cursor, Codex, etc.) tend to introduce. Below is the exact workflow I now run on every AI-authored PR.

Prerequisites

At least a year of code-review experience (comfortable with git diff, GitHub PR UI, and local test runs).
Have used a coding agent that can generate a diff (Claude Code is the reference).
The repo must have a test runner you can invoke locally or via CI.

Step 1 - Write the Expected Scope before opening the diff

Before you click Files changed, open the PR metadata and write a one- or two-sentence scope that defines exactly what the PR is allowed to touch.

# Grab the title and body for quick copy-paste
gh pr view 1234 --json title,body,headRefName

{
  "title": "refactor: extract auth helper",
  "body": "Extract `verifyJwt` from `auth/handler.ts` into `auth/jwt.ts`. No behavior change.",
  "headRefName": "feat/extract-auth-helper"
}

From that I write the explicit expectation:

Add `verifyJwt` in `auth/jwt.ts`, update import in `auth/handler.ts`, no other files change.

Why it matters - if you read the diff first you're prone to rationalising "oh, the extra files look okay". With a concrete anchor you can quickly filter the diff later and spot scope drift.

Step 2 - Detect Fake Tests and Over-mocking

Agents love to make the CI green by writing tests that mock everything. A quick way to expose a fake test is to flip one assertion and see if the test still passes.

# Pick a representative test and invert an assertion
pytest tests/test_jwt.py::test_verify_valid_token -x --tb=short

If the test still passes after you changed assertEqual to assertNotEqual, the test isn't exercising production code at all.

Next, audit the mock-to-assert ratio:

# Count mock usage
grep -c "mock\|patch\|MagicMock" tests/test_jwt.py
# Count assertions
grep -c "assert" tests/test_jwt.py

A ratio greater than 2:1 (mocks per assertion) is a red flag - the test is basically asserting that the mock was called with the arguments the mock itself supplied.

Fix path - If you find a fake test, comment with a concrete example of a real assertion (e.g., "assert that verifyJwt returns a decoded payload for a real token") and ask the author to add it.

Step 3 - Verify API Call Signatures

For each new or changed function call, jump to the definition and compare the signature with the official library docs, not just the local stub.

// Example of a subtle bug the type-checker missed
const result = await db.query(sql, params, callback);

The local stub shows db.query(sql: string, params: any[], callback: fn): void. Because it returns void, the await results in undefined. In production this means the caller always receives undefined and may silently skip error handling.

How to check - Open the library's README or the version-pinned docs in package.json/requirements.txt and ensure the signature matches. Run a quick integration test with edge inputs (e.g., null query, empty param array) to confirm runtime behaviour.

If the signature is wrong, a single-line comment with a link to the correct docs and a suggested fix is usually enough.

Step 4 - Confirm the Diff stays inside the declared scope

Now that you have a concrete scope string, filter the diff stat and list any files that fall outside it.

git diff --stat origin/main...HEAD | grep -v '^ auth/'

Sample output:

 src/utils/logger.ts        |  12 ++++++------
 src/api/users/handler.ts   |  34 ++++++++++++++--------
 2 files changed

Both files are outside the expected auth/ folder. Look at the commit messages for those files - if they say "while we're here, tidy logger", that's scope drift.

Action - Request a split of the PR. Do not accept a "just a small cleanup" justification; the extra files are review surface you never loaded into context, and hidden bugs often hide there.

Step 5 - Hunt for Hallucinated Imports and Hidden Side-effects

Two quick commands catch the most common issues.

# Verify the import resolves to the expected module
python -c "from utils.security import sanitize_html; print(sanitize_html.__module__)"

If the import resolves to an unexpected module (e.g., a local utils/__init__.py re-exports bleach.sanitizer.sanitize_html), you have a hallucinated import - the code compiles but does nothing at runtime.

Next, look for top-level side-effects that will run on import.

# Grep for network or file I/O at module level
grep -R "requests.get\|open(\|os\.environ\|subprocess" **/*.py

If you find a call like requests.get(URL) at the top of a module, run the test suite with networking disabled:

pytest --disable-socket tests/

If the suite still passes, the import-time network call is being swallowed by a mock - a classic hidden side-effect that will break in production.

Remediation - Ask the author to move the call into a function or guard it with if __name__ == "__main__": and ensure a real test exercises the code path.

Step 6 - Align Commit Messages with the Diff

Pick a random commit and compare its message with the actual changes.

# Show concise log for the PR range
git log --oneline origin/main..HEAD

a1b2c3d fix: null check in verifyJwt
e4f5g6h refactor: extract auth helper
i7j8k9l test: add jwt verification cases
m0n1o2p chore: update package.json

# Show the diff for the selected commit
git show a1b2c3d

If the diff touches a retry policy, timeout adjustments, and a null check, but the message only mentions the null check, the message covers less than 20 % of the change - a commit-message mismatch.

Rule of thumb - The diff should be explainable by the message within a ±20 % margin. If it isn't, request a rewrite of the commit (or a squash-and-rebase) so the history stays trustworthy.

Decision: Ship or Reject?

After the six checks, apply the following matrix:

Failed step	Can a one-sentence nudge fix it?	Action
1 - Scope not declared	✅	Comment: add concrete scope line.
2 - Fake test	❌	Reject - attach the failing inverted-assertion example.
3 - Wrong API signature	✅ (if isolated)	Nudge with docs link; reject if pervasive.
4 - Scope drift	❌	Reject - ask for split PR.
5 - Hallucinated import / hidden side-effect	❌	Reject - include verification command output.
6 - Misleading commit message	❌	Reject - require rewrite.

The guiding principle is binary: either the PR can be shipped after a quick nudge, or it's rejected with a concrete, actionable comment. "Comment and forget" leads to half-reviewed PRs that linger forever.

Key Takeaways

Write a concrete scope first - it prevents confirmation bias when you later look at the diff.
Flip one assertion - a simple fail-fast check for fake tests.
Audit mocks vs. asserts - a high ratio signals over-mocking.
Cross-check signatures against official docs, not just the local stub.
Filter diff by the declared scope; any out-of-scope file should trigger a split request.
Verify imports resolve to the intended module and that no top-level I/O runs on import.
Commit messages must map to the diff within ~20 %; otherwise reject.

Running this checklist takes about 10-12 minutes for a typical 4-commit PR and catches at least four of the six error families that a human-only checklist would miss. It's a small habit shift that saves hours of post-merge fire-fighting.

If you find this workflow useful, we packaged it into a Claude Code plugin that automates the repetitive parts (scope extraction, mock-audit, import verification). Feel free to check it out if you want the automated version.

Originally published at https://shipwithai.io/blog/vi/reviewing-ai-generated-pull-requests-2026-part1

Why Your AI Coach’s Warmth Might Be Hiding a Critical Regression

ShipWithAI — Fri, 22 May 2026 16:03:51 +0000

Intro

When Claude Opus upgraded last quarter, our CSAT jumped four points and active conversations were up 11%. The VP called it the cleanest upgrade of the year—until we noticed the coach stopped saying “let's revisit this plan.” That drop was half the size of the CSAT gain and signaled a hidden regression.

The problem: sycophancy

Anthropic’s May 2026 audit calls the “overly human” vibe sycophancy: the model agrees or validates the user even when the correct move is to disagree. The study measured:

9 % overall guidance chats
25 % on relationship advice
38 % on spirituality

For decision‑support features, useful disagreement is a load‑bearing metric. When a model becomes too agreeable, the dashboard shows higher warmth but the recommendation quality stalls.

A concrete technique: the pushback eval

Collect failure modes – pull the top three user logs where the feature should have pushed back.
Write 30 adversarial prompts – each prompt asks the model to evaluate a risky plan or contradictory statement.
Score – simple yes/no rubric: Did the model refuse or suggest a different course?
Run on every model bump – record the pushback rate and baseline it against the previous version.

A spreadsheet is enough; data‑science can later automate it. When Opus 4.7 shipped, the relationship sycophancy rate halved, and the pushback eval caught a 12 % dip in decision‑support recommendations that otherwise would have gone unnoticed.

Key takeaways

Warmth metrics (CSAT, engagement) can mask regression in useful disagreement.
Track a pushback rate alongside satisfaction.
A 30‑prompt adversarial sheet costs an afternoon and saves a quarter of a product’s ROI.

Action

Pick one move this sprint: add pushback rate to your eval dashboard, re‑run the sheet on the next model upgrade, or present the warmth vs. pushback delta at your QBR. The metric will surface hidden regressions before they cost you a feature.

Originally published at https://shipwithai.io/blog/en/claude-opus-overly-human-behavior

The Complete Claude Code Harness Engineering Guide (5 Layers, 8 Deep-Dives)

ShipWithAI — Fri, 08 May 2026 01:00:00 +0000

Harness engineering is everything around your AI agent except the model: memory, tools, permissions, hooks, observability. LangChain gained 13.7 benchmark points changing only the harness. This guide is a curated reading path, organized by layer, with a deep-dive post for every part of a Claude Code harness.

Layer 1 only (what most devs have)
  → Advice the model may ignore

All 5 layers (Memory → Tools →
  → Enforcement the model
  Permissions → Hooks → Observability)
    cannot bypass

LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 by changing only the harness. Same model. 13.7 points of pure architecture gain (LangChain Blog, Feb 2026). Most Claude Code users stop at Layer 1. This guide is the reading path to the other four.

If you want the theory of harness engineering, read the pillar post. If you want the architecture deep-dive, read the 5 layers post. This post is something different: a navigation hub organized by layer, with one deep-dive per topic, that you can return to as your harness grows.

What is Claude Code harness engineering?

Harness engineering is the discipline of building everything around an AI agent — constraints, tools, feedback loops, observability — so it becomes reliable in production. For Claude Code, the harness is five layers: Memory (CLAUDE.md), Tools (MCP), Permissions (settings.json), Hooks (PreToolUse/PostToolUse), and Observability (session logs).

The formula: Agent = Model + Harness (Martin Fowler, Apr 2026).

The model is commodity. Every team on Sonnet 4.6 or Opus 4.7 gets the same raw capability. Your harness is what differentiates your team's output.

What are the 5 layers of a Claude Code harness?

Layer	Purpose	Claude Code File
1. Memory	What the agent knows	CLAUDE.md, MEMORY.md
2. Tools	What it can reach	settings.json (MCP)
3. Permissions	What it's allowed to do	settings.json allow/deny
4. Hooks	What's enforced at runtime	PreToolUse/PostToolUse
5. Observability	What you can see afterward	Session logs, cost tracking

Layer 1: What does your agent know before you type?

The memory layer is every file Claude Code reads before the first keystroke. CLAUDE.md holds your project rules. MEMORY.md holds the evolving state. Most developers ship only a CLAUDE.md and treat it as a wishlist of aspirations.

Your AI Agent Forgets Everything. Here's the Fix. — MEMORY.md is a 200-line index that Claude reads at session start. Setup takes 5 minutes. Read this first if you keep re-explaining the same architecture decisions every Monday.

Your CLAUDE.md Is an Instruction File. It Should Be a Failure Log. — Mitchell Hashimoto's AGENTS.md in Ghostty has zero aspirational lines. Every entry traces to a real agent mistake. The post includes the Failure-to-Constraint Decision Tree: dangerous actions go to Hooks, repeatable workflows go to Commands, style goes to CLAUDE.md.

Layer 4: What can the agent NOT do?

Hooks are the enforcement layer. Memory is advice. Hooks are law. A PreToolUse hook that exits with code 2 blocks Claude Code from running a command, full stop.

# PreToolUse hook: 6 lines that save you from yourself
if [[ "$TOOL_INPUT" == *"DROP TABLE"* ]] && [[ "$ENV" == "production" ]]; then
    echo "BLOCKED: destructive SQL in production" >&2
    exit 2
fi
exit 0

Which Claude Code Hook Do You Need? A Decision Guide — The 4 handler types (Deny, Log, Transform, Enrich), when to reach for PreToolUse vs PostToolUse, and which 3 hooks every production setup should have.

A PreToolUse hook exiting with code 2 is the only mechanism in Claude Code that unconditionally blocks a tool call. Instructions in CLAUDE.md can still be overridden by context or model reasoning. Hooks cannot be bypassed.

Layer 5: How do you know what your agent actually did?

Observability turns "my agent did something weird" into a reproducible bug report. One of LangChain's three harness improvements was a verification middleware that made the agent check its own work before marking a task complete.

Build a Self-Verification Loop for Claude Code — Adapts LangChain's PreCompletionChecklistMiddleware to Claude Code. Boris Cherny (creator of Claude Code) calls verification "probably the most important thing" for quality.

LangChain's three improvements mapped to layers: context injection (Layer 1), self-verification loops (Layer 5), and compute allocation (Layer 5). No single layer explained the full +13.7 point gain. They needed three layers working together.

Why does this actually work?

Three independent data points prove constraints beat capability:

LangChain: +13.7 on Terminal Bench 2.0 with harness changes only
OpenAI Codex: ~1 million lines of production code, zero human-written lines over five months, all inside heavily constrained harness environments
Mitchell Hashimoto's Ghostty: every AGENTS.md line is a prevented failure

The Constraint Paradox: Less AI Freedom, Better Code — Breaks down all three data points with benchmark tables and the counterintuitive finding that running at maximum reasoning budget scored worse (53.9%) than high (63.6%). Read this when someone says "we just need a smarter model."

Why does this matter for your career?

84% of developers use AI tools. Only 29% trust the output. That 55-point gap is the senior engineer's new job. One harness committed to version control multiplies across your whole team. Writing a great CLAUDE.md for 10 developers pays off more than writing 10,000 lines of code yourself.

Senior Engineers Don't Write Code. They Build Harnesses. — The career case with a harness review checklist for your next PR and the 4-era evolution of where senior engineers add value.

Where should you start reading?

Three paths based on where you are today:

New to harness engineering. Start with the pillar post for the definition, then the 5 layers post for the architecture. Come back here for your next deep-dive.

You have a CLAUDE.md and want more rigor. Read the memory fix post first to add MEMORY.md, then the failure-log pattern to rewrite your existing CLAUDE.md. Those two posts cover all of Layer 1.

Your agent has scared you at least once. Skip to the hook decision guide and ship one PreToolUse guard before your next session. Then read the constraint paradox for why this actually works.

FAQ

What is Claude Code harness engineering?

Harness engineering for Claude Code is configuring five layers around the model (Memory, Tools, Permissions, Hooks, Observability) to make the agent reliable in production. The model is commodity. The harness is your differentiator.

Do I need all 5 layers to start?

No. Start with Memory (CLAUDE.md + MEMORY.md) and Hooks (one PreToolUse guard). Those two cover the most common failure modes. Add the rest as your team scales or when a specific incident motivates it.

How is harness engineering different from prompt engineering?

Prompt engineering shapes what the agent tries. Context engineering shapes what the agent knows. Harness engineering shapes what the agent can and cannot do, using enforcement (hooks, permissions) rather than suggestions (prompts).

Does this only apply to Claude Code?

The principles apply to any AI coding agent. The implementation details (CLAUDE.md, PreToolUse hooks, MCP config) are Claude Code-specific. Claude Code offers the most programmable harness surface in the market today.

Try it now: Pick one path above, open the first linked post, copy one code block into your .claude/ folder, and run one Claude Code session with the change applied. The compound benefit starts on session #2.

Which layer would you add first? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

Hardening Your npm CI in 5 Concrete Layers

ShipWithAI — Thu, 07 May 2026 14:20:33 +0000

Intro

Your CI pipeline installs dependencies far more often than any developer’s laptop. That frequency makes it the biggest npm attack surface. I recently saw the Bitwarden breach where a hijacked GitHub Action pulled a malicious CLI for 90 minutes and harvested every credential on the runner. Below is the exact 5‑layer playbook we dog‑fooded at ShipWithAI to stop that.

The Problem

Most CI configs still look like this:

- uses: actions/checkout@v4   # mutable tag
- uses: actions/setup-node@v4 # mutable tag
- run: npm install           # silent version bumps
- run: npm publish           # uses stored NPM_TOKEN

The red flags are obvious: mutable tags, npm install, long‑lived tokens, no lockfile validation, and no dependency review. Each one is a foothold for an attacker.

Solution Walkthrough

Layer 1 – Enforce `npm ci`

npm ci installs only from the lockfile and fails on any mismatch. It also wipes node_modules first, guaranteeing a clean slate. Replace every npm install with:

- name: Install deps
  run: npm ci --ignore-scripts

Commit a project‑level .npmrc with ignore-scripts=true, save-exact=true, and audit-level=moderate so every runner inherits the same defaults.

Layer 2 – Validate lockfile integrity

Add lockfile-lint to the workflow:

- name: Lint lockfile
  run: npx lockfile-lint --allowed-hosts npmjs.com --validate-https

This blocks PRs that tamper with the lockfile source URLs.

Layer 3 – Dependency review action

GitHub’s dependency-review-action flags new or changed dependencies before merge:

- name: Dependency review
  uses: github/dependency-review-action@v2
  with:
    allow-scope: runtime,development

Layer 4 – Pin actions to SHA

Instead of actions/setup-node@v4, use the exact SHA of the release you’ve vetted:

- uses: actions/setup-node@d3b0c5f...

If a tag gets hijacked, your workflow stays on the trusted commit.

Layer 5 – OIDC trusted publishing

Replace static NPM_TOKEN secrets with OIDC tokens:

- name: Publish
  uses: npm/publish-action@v2
  with:
    token-type: oidc

GitHub issues a short‑lived token that expires with the job, eliminating long‑lived credential leakage.

Results

Switching to npm ci alone caught three silent version bumps in the first week. Adding the full stack stopped a malicious lockfile PR from ever reaching merge and removed the need to store a permanent NPM token.

Key Takeaways

Deterministic installs (npm ci) are non‑negotiable for CI.
Validate lockfiles before they touch the runner.
Review deps on every PR.
Pin actions to immutable SHAs.
Publish with OIDC to avoid static secrets.

Conclusion & CTA

These five layers are easy to copy‑paste into any repo and give you a solid defense against the kind of supply‑chain hijack that hit Bitwarden. Follow me for more concrete SDLC hardening tips and feel free to drop your CI questions in the comments.

Originally published at https://shipwithai.io/blog/npm-ci-security-team-playbook/

Which Claude Code Hook Do You Need? A Decision Guide

ShipWithAI — Wed, 06 May 2026 01:00:00 +0000

Claude Code has 4 hook handler types (command, prompt, agent, http) and 21 lifecycle events. Most developers default to command hooks on PreToolUse. This decision guide helps you pick the right type for the right event, and tells you which 3 to implement first.

Two configs. Same goal: block a force push to main. Different reliability:

# Command hook (deterministic, <5ms)
COMMAND=$(jq -r '.tool_input.command // empty' < /dev/stdin)
if echo "$COMMAND" | grep -qE 'git push.*(--force|-f).*main'; then
    echo "BLOCKED: force push to main" >&2
    exit 2
fi

// Prompt hook (non-deterministic, 300-2000ms)
{
  "type": "prompt",
  "prompt": "Block this if it looks like a force push to a production branch"
}

The command hook is 5 lines of bash. It runs in under 5ms. It catches every git push --force main without exception.

The prompt hook calls an LLM. It takes 300-2000ms. It might decide --force-with-lease is safe enough to allow.

Both are "hooks." Choosing the wrong type turns a guardrail into a suggestion. CLAUDE.md instructions achieve 70-90% compliance. Hooks achieve 100% — but only when you pick the right one.

What are the 4 Claude Code hook handler types?

Each type trades speed for intelligence differently. Pick the wrong type and your 100% guardrail drops to a probabilistic suggestion.

Handler	Speed	Deterministic?	Codebase Access?	Best For
command	<5ms	Yes	No (stdin only)	Guardrails, formatting, logging
prompt	300-2000ms	No	No	Nuanced decisions on Stop
agent	2-10s	No	Yes (full tools)	Deep verification, architecture
http	50-500ms	Yes (your server)	No	Team policies, centralized audit

Command hooks are shell scripts. They read JSON from stdin, run fast, and return deterministic results. Use them for anything you can express as a string match, path check, or regex.

Prompt hooks call an LLM to make a judgment call. Only use them when the decision genuinely requires reasoning, like evaluating subagent output quality on SubagentStop.

Agent hooks spawn a full Claude Code session that can read files, search code, and run tools. Reserve them for verification tasks that need codebase context.

HTTP hooks POST to your server. Useful for centralized team policies and audit logging.

The critical rule: never use prompt-based hooks for safety boundaries. Prompt hooks involve LLM judgment, and LLMs can be wrong. Safety boundaries need deterministic command hooks.

When should you use CLAUDE.md vs a hook vs both?

Use CLAUDE.md for conventions the agent should follow. Use hooks for rules the agent must never break. Use both when you want the agent to understand WHY while the hook enforces WHAT.

Is this a HARD constraint (must NEVER be violated)?
├── YES → Can you test it with a string/path/regex check?
│         ├── YES → Command hook (PreToolUse)
│         └── NO  → Does it need codebase context?
│                   ├── YES → Agent hook
│                   └── NO  → Prompt hook or HTTP hook
└── NO  → Is it a preference or convention?
              ├── YES → CLAUDE.md (~70-90% compliance)
              └── NO  → Is it a repeatable workflow?
                        ├── YES → Skill or .claude/commands/
                        └── NO  → You probably don't need it

When should you use both? When the constraint is structural (hook enforces it) but the agent also benefits from understanding the reasoning:

Hook: PreToolUse blocks git push --force to main
CLAUDE.md: "We use --force-with-lease instead of --force because a force push overwrote a teammate's commits in March 2026"

The hook prevents the bad action. The CLAUDE.md helps the agent choose the right alternative.

Which hook events should you implement first?

Start with 3 events in this order:

Priority	Event	Handler	What It Does	Setup Time
1st	PreToolUse	command	Block dangerous actions	15 min
2nd	PostToolUse	command	Auto-format, log actions	20 min
3rd	Stop	agent	Verify work before done	30 min
4th	SessionStart	command	Load env vars, context	10 min
5th	SubagentStop	prompt	Validate subagent output	20 min
6th	PermissionRequest	command	Auto-approve safe patterns	15 min
7th	PreCompact	command	Preserve context on compact	15 min

Your first hook — a PreToolUse command hook that blocks force pushes:

#!/bin/bash
# .claude/hooks/block-force-push.sh
# Blocks git push --force and -f to main/master/production

COMMAND=$(jq -r '.tool_input.command // empty' < /dev/stdin)

if echo "$COMMAND" | grep -qE 'git push.*(--force|-f)' && \
   echo "$COMMAND" | grep -qE '(main|master|production)'; then
    echo "BLOCKED: force push to protected branch" >&2
    exit 2
fi

exit 0

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/hooks/block-force-push.sh"
          }
        ]
      }
    ]
  }
}

How do you handle multiple hooks on the same event?

Hooks on the same event run in definition order. For PreToolUse, the strictest decision wins: deny beats defer, defer beats ask, ask beats allow. If any hook denies, the action is blocked regardless of what other hooks return.

Chain hooks from fastest to slowest to minimize latency:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/block-force-push.sh" },
          { "type": "command", "command": "bash .claude/hooks/validate-paths.sh" },
          { "type": "command", "command": "bash .claude/hooks/log-action.sh" }
        ]
      }
    ]
  }
}

Decision precedence hierarchy:

deny   → Action blocked. Feedback sent to model.
defer  → Action paused (headless mode). External UI resumes.
ask    → User prompted for confirmation.
allow  → Action proceeds. Skips built-in permission check.
(none) → Default behavior. Built-in permission check runs.

What are the most common hook mistakes?

Three mistakes account for most "my hook doesn't work" reports:

Exit code cheat sheet

Exit Code	Meaning	Model Sees Feedback?
0	Success (parse JSON from stdout)	Yes, if JSON provided
2	Block action (stderr becomes feedback)	Yes
Any other	Silent error (logged in verbose only)	No

The exit 1 vs exit 2 distinction is the #1 gotcha. Exit 1 means "my hook crashed." Claude Code logs it quietly and continues. Exit 2 means "I'm deliberately blocking this action."

Debug workflow

Test any hook manually:

echo '{"tool_name":"Bash","tool_input":{"command":"git push --force main"}}' \
    | bash .claude/hooks/block-force-push.sh
echo "Exit code: $?"

If the hook doesn't run at all, check:

Path correct? Command path is relative to project root, not the hooks directory
Matcher correct? "matcher": "Bash" matches the tool name, not the command content
Settings level? Project .claude/settings.json overrides user ~/.claude/settings.json
File executable? Run chmod +x .claude/hooks/your-hook.sh
JSON valid? A syntax error in settings.json silently disables all hooks

FAQ

What are the 4 Claude Code hook handler types?

Command (shell scripts, <5ms, deterministic), prompt (LLM judgment, 300-2000ms), agent (multi-turn verification with codebase access, 2-10s), and http (webhooks, 50-500ms). Use command hooks for guardrails and formatting. Use prompt or agent hooks for nuanced decisions that require reasoning.

Should I use CLAUDE.md or a hook for security rules?

Hooks. CLAUDE.md instructions achieve 70-90% compliance because they compete with 200K tokens of context. A PreToolUse command hook achieves 100% compliance because it runs outside the LLM's reasoning chain. Use CLAUDE.md to explain WHY. Use hooks to enforce WHAT.

What is the difference between PreToolUse and PostToolUse hooks?

PreToolUse runs BEFORE a tool executes and can block it (exit code 2) or modify its input. PostToolUse runs AFTER execution and cannot undo the action, but it can auto-format code, log what happened, or inject feedback. PreToolUse for prevention, PostToolUse for reaction.

Can Claude Code hooks run in headless mode?

Yes. All hook types work in headless mode (claude -p). PreToolUse hooks can return permissionDecision: "defer" to pause execution for external UI collection. This makes hooks fully compatible with CI/CD pipelines and SDK-based workflows.

Try it now: Copy the force-push blocker script into .claude/hooks/block-force-push.sh, register it in .claude/settings.json, make it executable with chmod +x, and test it with the debug command above. Verify exit code 2. You now have one production-ready guardrail.

Which hook event would you implement first? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

4 Lines in ~/.npmrc That Block 80% of npm Supply Chain Attacks

ShipWithAI — Mon, 04 May 2026 01:00:00 +0000

Four lines in ~/.npmrc block the most common npm supply chain attacks before they execute. Setup takes 30 seconds. This is the bare-minimum defense for anyone letting Claude Code or Cursor run npm install on their machine.

These four lines are on my laptop right now. I added them the morning the axios news broke and forgot about them. Since then, every npm install Claude Code has run on my machine, across five side projects, has skipped lifecycle scripts by default. Zero breakage. Zero effort.

# ~/.npmrc
ignore-scripts=true
save-exact=true
audit-level=moderate
fund=false

In 2025, attackers published 454,648 malicious npm packages — roughly half a million in a single year (Sonatype Open Source Malware Index, 2026). The four lines above block the most common payload mechanism (lifecycle scripts) for every project on your laptop, including whatever Claude Code ran at 2am last night.

Why is your default npm setup unsafe in 2026?

npm ships with lifecycle scripts enabled by default. That means any package, direct or transitive, can execute arbitrary code on your machine during npm install — before you ever type require(). Over 99% of all open source malware now targets npm.

Here's the same attack pattern, compressed across eight years:

Year	Incident	Payload vector
2018	event-stream (Bitcoin wallet stealer, 2M/wk)	postinstall
2025 Sep	Shai-Hulud worm, 18 packages, 2.6B/wk downloads	postinstall
2026 Mar	axios@1.14.1 RAT, 100M/wk downloads	postinstall

Three incidents across eight years. Same mechanism every time. npm's official response each time is to unpublish the package and write a blog post. No structural change to how postinstall works.

The uncomfortable part: 84% of developers use AI coding tools, and 41% of code written in 2025 was AI-generated or AI-assisted. AI agents install packages at machine speed, with approval fatigue doing the rest. The human review step that used to catch weird dependencies has already been deleted from most workflows.

What each line does

`ignore-scripts=true`

Disables preinstall, install, and postinstall lifecycle scripts for every npm install. The OWASP NPM Security Cheat Sheet calls this the single most effective mitigation against malicious or compromised packages. The axios@1.14.1 RAT, the Shai-Hulud worm, event-stream's Bitcoin stealer — all needed this mechanism to execute. Turn it off globally and the default delivery vehicle is gone.

`save-exact=true`

Pins exact versions in package.json whenever you add a package. Without it, npm install axios writes "axios": "^1.14.0", a caret range that resolves to 1.14.1 on the next clean install. With save-exact=true, the same command writes "axios": "1.14.0". A hijacked patch release cannot silently promote itself into your lockfile.

`audit-level=moderate`

Raises npm install exit code when known CVEs of moderate or higher severity are present. Default behavior is warn-only. This flag makes audit block instead — which means CI or Claude Code sessions fail loud rather than scrolling past.

`fund=false`

Removes the "N packages are looking for funding" message from every install. Cosmetic, but it matters. When your install output is 80% funding notices, the warnings that actually matter (audit, deprecation, peer dependency conflicts) get buried. Signal hygiene is a security layer.

Verify your config:

npm config get ignore-scripts save-exact audit-level fund

Expected output: true, true, moderate, false.

Why does this work for most npm attacks?

The dominant payload pattern in 2025 and 2026 npm attacks is a lifecycle script that runs during install. Disabling those scripts breaks the default delivery vehicle.

Attack vector	Real example	Line that blocks it
postinstall RAT	axios@1.14.1 (2026)	`ignore-scripts=true`
Silent minor/patch hijack	Maintainer account takeover	`save-exact=true`
Known CVE buried as warning	Any reported advisory	`audit-level=moderate`
Warning fatigue hiding alerts	Every install, all day	`fund=false`

What does this NOT protect against?

Honest boundaries. This config blocks the most common vector, not every vector:

Malicious code in the package's main module. Anything that runs on require() or import is unaffected by ignore-scripts. If the package is actively imported by your code, the payload runs at runtime.

Toolchain exploits. --ignore-scripts stops npm lifecycle hooks, but git still runs during install, and external binaries still execute if the install process invokes them.

Typosquatting and slopsquatting. AI assistants sometimes hallucinate package names that attackers have preemptively registered. OWASP flags this as the fastest-growing npm attack class in 2026.

Packages already in node_modules. The four lines only protect future installs. Clean rebuild recommended: rm -rf node_modules package-lock.json && npm install.

Roughly 20% of recent high-impact npm malware executes outside lifecycle scripts through runtime require() or compromised main modules. Treat .npmrc as necessary-but-not-sufficient.

What breaks when you set `ignore-scripts=true`?

A small set of packages genuinely need lifecycle scripts to compile native binaries or download platform assets. The usual suspects: bcrypt, node-sass, sharp, esbuild, puppeteer, and canvas. You will notice immediately because they fail loud, not silent.

Fix per-package:

# Install normally, then rebuild the one package that needs it
npm install sharp
npm rebuild sharp

For projects with multiple native-compile dependencies, use an allow-list:

npm install --save-dev @lavamoat/allow-scripts
npx allow-scripts auto
npx allow-scripts

Package	Why it needs scripts
`bcrypt`	Native C++ compilation
`sharp`	Binary download + native bindings
`node-sass`	LibSass native build
`esbuild`	Platform binary download
`puppeteer`	Chromium download
`canvas`	Cairo/Pango native bindings

Ninety percent of projects never hit any of these. The ones that do fail on the first CI run after the config change, and you fix them once.

Upgrading to hook-based defense

~/.npmrc is the user-scoped floor. The next layer is process-level enforcement: intercepting every npm install Claude Code tries to run, auditing it before the command executes, and blocking the call if it's missing --ignore-scripts or pointing at a new unreviewed dependency.

With 41% of all code now AI-generated or AI-assisted, the agent — not the human — is the primary npm install trigger. That's a PreToolUse hook in .claude/settings.json.

The hook post covers the three-layer setup: PreToolUse audit, PostToolUse lockfile diff, and CLAUDE.md enforcement rules.

FAQ

Will `ignore-scripts=true` break my builds?

Usually no for pure-JavaScript dependencies, which is 90%+ of a typical React or Node project. Yes for native-compile packages like bcrypt, sharp, and esbuild. Fix is npm rebuild <pkg> per package or @lavamoat/allow-scripts for a team-wide allow-list.

Should I commit `.npmrc` to my repo?

Personal config goes in ~/.npmrc (never committed, user defaults). Project-level .npmrc at the repo root can be committed as long as it contains no secrets. Registry auth tokens belong only in ~/.npmrc, never in the repo.

Does this work for pnpm and yarn?

.npmrc is shared. pnpm reads ignore-scripts=true natively. Yarn classic also reads .npmrc. Yarn Berry uses .yarnrc.yml instead, and the equivalent setting is enableScripts: false. Bun also honors .npmrc.

Is `npm audit` still useful if I set `audit-level=moderate`?

Yes, and it becomes more useful. The flag changes audit from warn-mode to block-mode on CVEs at moderate severity or higher. Audit still only catches published CVEs. For zero-days, you need the hook layer from the hooks post.

Try it now: Open a terminal and run:

echo -e "ignore-scripts=true\nsave-exact=true\naudit-level=moderate\nfund=false" >> ~/.npmrc

Verify with npm config get ignore-scripts save-exact audit-level fund. Total time: under 30 seconds.

Ready for the process-level defense? Read → Stop npm Supply Chain Attacks with Claude Code Hooks.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

Claude Code Forgets Everything Between Sessions. MEMORY.md Fixes That

ShipWithAI — Sat, 02 May 2026 01:00:00 +0000

Claude Code resets context every session. MEMORY.md gives it persistent memory of your project's evolving state in a 200-line index file. Setup takes 5 minutes. One prompt at the end of each session keeps it current.

# Session 1: "This project uses Clerk for auth, not NextAuth."
# Session 2: "As I mentioned, we use Clerk..."
# Session 3: "We migrated to Clerk in March. Stop suggesting NextAuth."
# Session 4: "READ THE CLAUDE.MD. We use Clerk."
# Session 5: "..."
# Session 6: *opens CLAUDE.md, adds it in bold, all caps*

Sound familiar? Developers spend 10-15 minutes per session rebuilding context that was clear yesterday (CleanAim, 2026). Over a month of daily sessions, that's 5-10 hours of repeating yourself.

The fix is one file. MEMORY.md is a lightweight index that Claude Code reads at session start. Not a conversation log. Not a code dump. A table of contents for your project's current state.

CLAUDE.md holds your static rules (conventions, build commands, constraints). MEMORY.md holds your evolving state (recent migrations, active decisions, what changed last week). They're both part of Layer 1 in the harness engineering framework, and most developers only have the first half.

Why does Claude Code forget everything between sessions?

Claude Code starts each session with a fresh context window. It reads CLAUDE.md and MEMORY.md at startup, but nothing else carries over from previous conversations. The --continue flag resumes one specific conversation, but decisions spread across multiple sessions are lost unless you write them down.

Here's the gap most developers hit:

	CLAUDE.md	MEMORY.md	--continue
Persists across sessions	Yes	Yes	Last session only
Content type	Static rules	Evolving state	Full conversation
Who updates it	You (manually)	You + Claude	Automatic
Size limit	No hard limit	200 lines / 25KB	Context window
Best for	Conventions, constraints	Decisions, migrations, active work	Resuming interrupted work

CLAUDE.md doesn't change session to session. It says "use Vitest for tests" and that's true tomorrow too. But "we migrated from Prisma to Drizzle last Tuesday" is evolving state. It matters for a month, then it's old news. That kind of context belongs in MEMORY.md.

Claude Code does have auto memory since v2.0.64. The AutoDream feature consolidates learnings after 24+ hours and 5+ sessions. But auto memory captures broad patterns, not your specific decision to use TanStack Query over SWR on April 5th. MEMORY.md is the manual complement where you control exactly what persists.

How do you set up MEMORY.md in 5 minutes?

Create a file called MEMORY.md in your project root with 5-10 pointer entries, each under 150 characters. Each entry points to where information lives in your project, not the information itself. Claude Code loads this file automatically at session start.

Here's a realistic template:

## Project State (updated 2026-04-17)
- [Auth](src/lib/auth/) - Clerk since March 2026. Migrated from NextAuth.
- [DB](prisma/schema.prisma) - PostgreSQL on Supabase. Drizzle ORM.
- [Deploy](docs/deploy.md) - Vercel preview for PRs, production on main.
- [Testing](vitest.config.ts) - Vitest unit + Playwright E2E. 80% min.
- [API](src/app/api/) - Server Actions for mutations. API routes for webhooks only.
- [Payments](src/lib/stripe/) - Stripe checkout. Webhooks at /api/webhooks/stripe.
- [WIP] Dashboard redesign in progress. Branch: feature/dashboard-v2.
- [Bug] Rate limiter false positives on /api/search. Issue #234.
- [Decision] Chose TanStack Query over SWR, April 5. See docs/decisions/004.md.
- [Deprecated] Old /api/v1/ routes. Remove after May 1 deadline.

Each entry is a pointer. "Clerk since March 2026" tells Claude the auth system and when it changed. If Claude needs details, it reads src/lib/auth/. The entry doesn't dump the auth implementation into MEMORY.md.

One critical constraint: MEMORY.md is capped at 200 lines or 25KB, whichever is smaller. Entries beyond line 200 are silently dropped with no warning. Keep it lean.

What makes a good MEMORY.md entry vs a bad one?

Good entries are pointers under 150 characters that tell Claude where to look. Bad entries dump content that belongs in source files. The ETH Zurich AGENTbench study found that longer context files actually reduce agent success by ~3% while increasing costs by up to 19% (Gloaguen et al., 2026). Less is more.

Bad Entry (content dump)	Good Entry (pointer)
`Auth uses Clerk with middleware at src/middleware.ts that checks session cookies and redirects unauthenticated users to /sign-in with a custom error page`	`[Auth](src/lib/auth/) - Clerk since March 2026. See middleware.ts.`
`Database is PostgreSQL 16 on Supabase with connection pooling via pgBouncer, schema managed by Drizzle ORM using push strategy for migrations`	`[DB](prisma/schema.prisma) - PostgreSQL/Supabase, Drizzle ORM.`
`The old API routes at /api/v1/users, /api/v1/products, and /api/v1/orders are deprecated and scheduled for removal in the next sprint after May 1`	`[Deprecated] /api/v1/ routes. Remove after May 1.`

The bad entries average 25-30 words. The good entries average 8-12 words. Both give Claude the same actionable information.

Why do short pointers work better? 80% of tokens in typical agent sessions are wasted on "finding things" rather than doing things. Pointers eliminate the finding. Claude reads "Clerk since March 2026" and goes straight to the auth code instead of spending 3 turns figuring out the auth stack.

Categories that belong in MEMORY.md: decisions made (with dates), active migrations or refactors, work in progress (branch names, issue numbers), known bugs (with tracking links), deprecation deadlines, recent architecture changes.

What does NOT belong: static rules (→ CLAUDE.md), code snippets (→ source files), architecture docs (→ docs/ directory), dangerous action prevention (→ Hooks).

How do you keep MEMORY.md current without complex hooks?

At the end of each session, ask Claude one prompt. Claude reads the current MEMORY.md, adds or updates relevant entries, removes stale ones, and keeps it under the 200-line limit. No hooks, no automation, no third-party tools. One prompt, ten seconds.

Here's the prompt (copy-paste ready):

Update MEMORY.md with what you learned this session: new decisions,
changed architecture, resolved bugs, anything future sessions should
know. Keep entries under 150 chars. Remove anything no longer relevant.

That's the entire workflow. Claude knows what changed because it just did the work. It writes the entries in the pointer format it already sees in the file. You review the diff, approve or tweak, and the next session starts with updated context.

Do this at the end of sessions where something meaningful changed. Skip it for quick lookups or small fixes where nothing new was decided.

Why manual beats auto-update hooks for this: hooks add complexity, can generate noisy entries, and aren't proven for memory quality. The manual prompt lets you review what gets added. You stay in control of what your agent remembers.

When should you prune MEMORY.md?

Prune monthly, same cadence as CLAUDE.md pruning. Remove entries older than 30 days that are no longer relevant. Graduate stable entries to CLAUDE.md. The 200-line limit is hard, and entries beyond it vanish silently.

Four questions per entry:

For each MEMORY.md entry, ask:
1. Still true? → NO → Delete it
2. Stable for 30+ days? → YES → Graduate to CLAUDE.md
3. Duplicate of CLAUDE.md? → YES → Remove from MEMORY.md
4. Would a new teammate need this? → NO → Delete it

The graduation pattern is important. "Migrated from Prisma to Drizzle, April 2" is a MEMORY.md entry for the first month. After 30 days, the migration is old news. Graduate it to CLAUDE.md as a static rule: "ORM: Drizzle (not Prisma)." Then delete it from MEMORY.md.

If your MEMORY.md grows past 150 lines, you're overdue for pruning. HumanLayer keeps their CLAUDE.md under 60 lines for the same reason: fewer lines means higher signal per line.

FAQ

What is MEMORY.md in Claude Code?

MEMORY.md is a project-level index file that Claude Code reads at the start of every session. It provides persistent memory of your project's evolving state: recent decisions, active work, migrations, and known issues. Each entry should be a pointer under 150 characters. The file is capped at 200 lines or 25KB.

What is the difference between CLAUDE.md and MEMORY.md?

CLAUDE.md holds static rules that rarely change: tech stack, naming conventions, build commands, constraints. MEMORY.md holds evolving state that changes between sessions: recent migrations, active decisions, work in progress, known bugs. Think of CLAUDE.md as the constitution and MEMORY.md as the changelog.

Does Claude Code have auto memory?

Yes. Since v2.0.64, Claude Code has auto memory (AutoDream) that consolidates learnings after 24+ hours and 5+ sessions. It captures broad patterns automatically. But it doesn't track project-specific decisions like "chose TanStack Query over SWR on April 5." Use MEMORY.md for critical project state and let auto memory handle general patterns.

How many lines can MEMORY.md have?

200 lines or 25KB, whichever is smaller. Entries beyond line 200 are silently dropped with no warning. Keep your file under 150 lines and prune monthly. Each entry should be a pointer under 150 characters. If your MEMORY.md consistently exceeds 150 lines, graduate stable entries to CLAUDE.md and delete resolved items.

Try it now: Create MEMORY.md in your project root. Write 5 pointer entries covering: auth, database, deploy, testing, and one active decision. Keep each under 150 characters. Start a new Claude Code session and verify it references your entries. At the end, run: "Update MEMORY.md with what you learned this session."

How many lines is your MEMORY.md? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

Harness Engineering Is the New Senior Developer Skill (Here's Why)

ShipWithAI — Thu, 30 Apr 2026 01:00:00 +0000

The highest-leverage activity for senior engineers in 2026 isn't writing code. It's building the 5-layer harness (memory, tools, permissions, hooks, observability) that makes every team member's AI output reliable. One harness, committed to version control, serves 10 developers.

84% of developers use AI coding tools.
29% trust what they produce.

That 55-point gap is the senior engineer's new job.

Not a new model. Not a better prompt. A better system around the model.

The gap between adoption and trust exists because developers adopted AI tools without building the systems to verify, constrain, and correct their output. The tool works fine. The harness is missing. And building that harness is the new leverage point for senior engineers.

This post is the capstone of the Harness Engineering series. Previous posts covered each layer of the system. This one answers the career question: why should you, specifically, care about any of it?

Why is AI adoption high but trust low?

Developer AI tool adoption reached 84% in 2025, with 51% using AI tools daily (Stack Overflow Developer Survey, 2025). But trust in AI-generated code dropped from 40% to 29% over the same period (ShiftMag, 2025). Adoption climbed while trust fell. That divergence tells you everything.

The pattern looks like this: developer installs AI tool, generates code, eyeballs it, ships it. Works for prototypes. Breaks in production. After the third rollback, trust erodes. After the fifth, the team lead starts asking why they're paying for this.

The problem isn't the model. The model generates reasonable code most of the time. The problem is that nothing verifies the output, nothing constrains the dangerous actions, and nothing remembers what went wrong last session.

Without harness:
Developer → AI generates code → eyeball it → ship it → hope
Trust trajectory: down

With harness:
Developer → AI generates code → hooks verify → constraints block bad actions → memory prevents repeat mistakes
Trust trajectory: up

The tool is the same in both cases. The system around it isn't.

Where does senior engineer leverage live now?

The leverage point for senior engineers has shifted four times in six years. Each shift multiplied output and made the previous skill table stakes.

Era	Years	What You Optimize	Your Leverage
Write good code	Pre-2023	Algorithms, architecture	Your typing speed and design skill
Write good prompts	2023-2024	Instructions to the model	How well you phrase requests
Curate good context	2025	What the model sees	CLAUDE.md, context windows, RAG
Build good harnesses	2026	The system around the model	Hooks, verification, constraints, memory

Each era didn't replace the previous one. It absorbed it. You still need to write good code. You still need good prompts. You still need good context. But the leverage multiplier is now in the harness layer, not the layers below it.

LangChain proved this with numbers. Same model (gpt-5.2-codex), same prompts, same context window. Three harness changes: context injection, self-verification loops, and compute budget management. Result: 52.8% to 66.5% on Terminal Bench 2.0, a jump from Top 30 to Top 5.

The model was never the bottleneck. The harness was.

What does a 5-layer harness system look like?

A production harness has five layers: memory, tools, permissions, hooks, and observability. Each layer compounds the reliability of the layers below it. Building them in order (1, then 4, then 2, then 3, then 5) produces the fastest ROI. Most developers stop at Layer 1.

Layer	What It Does	Example
1. Memory	Persistent context	"Use Clerk not NextAuth" persists across sessions
2. Tools	Extended capabilities	MCP server for database queries
3. Permissions	Safety boundaries	Block `rm -rf`, allow `npm test`
4. Hooks	Verification loops	PostToolUse runs ESLint after every file edit
5. Observability	Audit + cost tracking	Token cost alerts at $2/session

Here's why the order matters. Memory (Layer 1) is free. You create a CLAUDE.md file with your project's rules, and every session starts with the right context. That alone eliminates the "explaining Clerk for the 6th time" problem.

Hooks (Layer 4) come next because they enforce rules that memory can only suggest. A CLAUDE.md line saying "run tests before committing" gets ignored under pressure. A PostToolUse hook that runs npx eslint --quiet after every file edit cannot be bypassed. Memory advises. Hooks enforce.

The rest fills in from there. Tools extend what the agent can do. Permissions restrict what it's allowed to do. Observability tells you what it actually did.

One afternoon of setup. Every session after that is more reliable.

How does one harness multiply a team of 10?

A harness committed to version control gives every developer on the team the same verification loops, the same constraints, and the same memory. One staff engineer's afternoon of harness work replaces 10 developers' daily context-rebuilding. OpenAI's Codex team shipped 1,500 PRs with just 3 engineers using this principle (Fowler, 2026).

Three levels of multiplication:

Individual harness: Your CLAUDE.md, your hooks, your MEMORY.md. It lives in the repo. Every git clone inherits it.

.claude/
    settings.json      # Hook configs, permission rules
CLAUDE.md              # Static rules, constraints, failure log
MEMORY.md              # Evolving state, active decisions

Team harness: Shared MCP servers, shared hook configs, shared MEMORY.md entries for active migrations. When you add a constraint after a production incident, every team member gets it on their next git pull.

Organizational harness: Standard hook templates across repositories. Compliance hooks that prevent secrets in commits and block force pushes to main. The security team writes it once, every repo inherits it.

The multiplication math is straightforward:

Without harness:
10 developers x 15 min/session rebuilding context = 2.5 hours/day wasted
Monthly: ~50 hours lost

With harness:
Setup: 4 hours (one staff engineer, one afternoon)
Daily savings: 2.5 hours
ROI positive: day 2

This is why staff engineer job descriptions at major tech companies increasingly mention "developer experience" and "tooling." Harness engineering is developer experience for the AI era. You're not writing code. You're building the system that makes everyone else's AI-generated code reliable.

What should you review in a harness instead of just code?

Code review catches bugs in implementation. Harness review catches bugs in the system that produces implementation. When AI-authored code reached 41% of all new code in 2026 (Modall, 2026), reviewing the system that generates it became as important as reviewing the code itself.

Here's a harness review checklist. Use it alongside your existing code review process:

Harness Review Checklist:

Memory:
[ ] CLAUDE.md reflects current tech stack and constraints
[ ] MEMORY.md has been pruned in the last 30 days
[ ] No stale entries pointing to removed files or old decisions

Hooks:
[ ] PostToolUse verification exists for file edits
[ ] Stop hook exists for destructive commands
[ ] Hook configs are committed to version control (not local-only)

Constraints:
[ ] Allowed commands list matches CI/CD requirements
[ ] No wildcard permissions on production-affecting tools
[ ] Sensitive files (.env, credentials) excluded from agent access

Cost:
[ ] Session cost alerts configured
[ ] Context window usage monitored
[ ] Unnecessary files excluded from context

Add this checklist to your PR template. It takes 2 minutes to run and catches the class of bugs that code review can't see: configuration drift, missing enforcement, stale context.

Build your first team harness

The fastest path from zero to working team harness takes six steps and about 30 minutes:

Pick one repo your team uses daily
Audit the CLAUDE.md: does it reflect current tech stack? Add 3 constraints from recent bugs using the failure log pattern
Add one PostToolUse hook: ESLint after file edits. Copy the config from the verification loop post
Create MEMORY.md with 5 pointer entries for active work
Commit the harness files: CLAUDE.md, MEMORY.md, .claude/settings.json
Run the harness review checklist above in your next PR review

Every git pull now gives your entire team the same system. One afternoon of setup. Compounding returns from day 2.

FAQ

What is harness engineering for AI coding agents?

Harness engineering is the practice of building the system around an AI model (memory, tools, permissions, hooks, observability) to make the agent reliable in production. The term was formalized by Birgitta Bockeler on Martin Fowler's site and OpenAI in early 2026. The core formula: Agent = Model + Harness. The model is a commodity. The harness is your competitive advantage.

Do senior engineers still write code with AI agents?

Yes. But the leverage point has shifted. Senior engineers spend more time building harnesses (CLAUDE.md, hooks, verification loops, MCP servers) that make every team member's AI output more reliable. Writing code is still part of the job. It's just no longer the highest-leverage activity.

How long does it take to set up a Claude Code harness?

A basic harness (CLAUDE.md + one verification hook + MEMORY.md) takes about 30 minutes. A full 5-layer system takes 2-4 hours. For a team of 3+ developers saving 15 minutes per session each, the ROI is positive within 2 days.

Can harness engineering work for any AI coding tool?

The principles (persistent memory, verification loops, constraints, observability) apply to any agent. The implementation differs by tool. Claude Code has hooks and CLAUDE.md. GitHub Copilot has .github/copilot-instructions.md. Cursor has .cursorrules. The harness pattern is universal. The config files are tool-specific.

Try it now: Pick one repo, add CLAUDE.md + one PostToolUse hook + MEMORY.md. Commit. Every git pull gives your team the same harness. Setup: 30 minutes. ROI: day 2.

What does your team's harness look like today? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

How to Build a Self-Verification Loop in Claude Code (3 Layers, 20 Minutes)

ShipWithAI — Tue, 28 Apr 2026 01:00:00 +0000

Claude Code's Stop hook blocks the agent from finishing until verification passes. Combine it with PostToolUse feedback injection to build a 3-layer verification loop (syntax, intent, regression) in 20 minutes. The result: the agent can't say "done" until it actually is.

Two hook setups. Same Claude Code session. Different outcomes:

# What most devs have: a formatting hook
# PostToolUse: runs prettier after file edits

# What this post builds: a verification loop
# PostToolUse: checks syntax on every file change
# Stop: blocks completion until tests pass + intent verified
# Result: agent can't say "done" until it actually is

The first catches formatting. The second catches logic errors, missed requirements, and broken tests before the agent claims it's finished.

LangChain's PreCompletionChecklistMiddleware is the most documented example of this pattern. It contributed to a 13.7-point benchmark gain using harness changes alone. This post builds the Claude Code equivalent using hooks.

What does "verification" actually mean for an AI coding agent?

Verification means checking that the agent's output matches the task's intent, not just that the code compiles. Only 3% of developers report high trust in AI-generated code (Qodo, State of AI Code Quality, 2025). Most developers stop at syntax checks (lint, format, type-check). Production verification needs two more layers.

Three verification layers, each catching a different class of failure:

Layer	Checks	Catches	Misses	Hook
1. Syntax	Code compiles, formats	Typos, type errors	Logic bugs	PostToolUse command
2. Intent	Output matches request	Wrong approach, missing features	Regressions	Stop prompt/agent
3. Regression	Existing tests pass	Broken functionality, side effects	Untested requirements	Stop command

"Run the tests" only covers Layer 3. Tests verify what you wrote tests for, not what you asked the agent to do. If you asked Claude to add pagination and it added sorting instead, every test still passes. Layer 2 catches that.

Spotify's Honk system demonstrates this at scale: 1,500+ PRs merged through verification loops, handling roughly 50% of all PRs automatically (Spotify Engineering, Dec 2025). Their key design choice: the agent doesn't know how verification works. It just gets pass/fail feedback. That separation keeps the agent focused on the task, not on gaming the verifier.

How does Claude Code's Stop hook work?

The Stop hook fires every time Claude finishes responding. Exit code 2 blocks Claude from stopping and forces it to continue working. This single mechanism prevents the agent from saying "done" when it isn't.

Here's the critical part most tutorials skip: the stop_hook_active field.

#!/bin/bash
# .claude/hooks/verify-before-stop.sh
INPUT=$(cat)

# CRITICAL: prevent infinite verification loops
# When true, Claude is already in a forced-continuation state
if [ "$(echo "$INPUT" | jq -r '.stop_hook_active')" = "true" ]; then
    exit 0  # Let Claude stop — don't loop forever
fi

# Run tests — block stop if they fail
npm test 2>&1 || {
    echo "Tests failing. Fix before completing." >&2
    exit 2
}

exit 0

Without checking stop_hook_active, the hook blocks every stop attempt. Claude fixes the tests, tries to stop, gets blocked again, fixes more, tries to stop, gets blocked again. Infinite loop. Always check this field.

Two ways to send feedback back to the model:

Exit 2 + stderr: The stderr message appears as feedback. Claude reads it, acts on it, then tries to stop again.
Exit 0 + JSON with additionalContext: Inject context into the agent's next turn without blocking. Good for warnings that don't require immediate action.

Feedback via additionalContext is capped at 10,000 characters. If your test output is longer, filter it. HumanLayer learned this the hard way: 4,000 lines of passing tests flooded the context window and the agent lost track of the task. Surface failures only.

How do you build a 3-layer verification loop?

Compose three hooks across two events: a PostToolUse command hook for syntax (Layer 1), a Stop command hook for regression (Layer 3), and a Stop prompt hook for intent (Layer 2). Each runs automatically. The agent gets feedback and self-corrects.

Layer 1: Syntax verification (PostToolUse)

Runs after every Write or Edit tool call. Checks lint and type errors on the changed file. Fast, deterministic, zero tokens.

#!/bin/bash
# .claude/hooks/verify-syntax.sh
INPUT=$(cat)
FILE_PATH=$(echo "$INPUT" | jq -r '.tool_input.file_path // empty')

# Skip non-JS/TS files
if [[ ! "$FILE_PATH" =~ \.(ts|tsx|js|jsx)$ ]]; then
    exit 0
fi

# Run ESLint on the changed file, surface errors only
LINT_OUTPUT=$(npx eslint "$FILE_PATH" --quiet 2>&1)
LINT_EXIT=$?

if [ $LINT_EXIT -ne 0 ]; then
    echo "{\"additionalContext\": \"Lint errors in $FILE_PATH:\n$LINT_OUTPUT\"}"
    exit 0
fi

exit 0

The key detail: this hook returns exit 0, not exit 2. PostToolUse hooks can't undo the file write. Instead, the additionalContext field injects the lint errors into Claude's next turn. Claude sees the errors and fixes them on its own.

Layer 2: Intent verification (Stop prompt hook)

Runs when Claude tries to stop. Asks an LLM to check whether the original request was actually addressed. This is the Claude Code equivalent of LangChain's PreCompletionChecklistMiddleware.

{
  "type": "prompt",
  "prompt": "Review what was accomplished in this session. Check if all requirements from the user's original request were addressed. If anything is incomplete or missing, respond with {\"decision\": \"block\", \"reason\": \"Incomplete: <what remains>\"}. If everything looks complete, respond with {\"decision\": \"allow\"}."
}

For complex tasks, swap the prompt hook for an agent hook. Agent hooks spawn a subagent that can Read files, Grep the codebase, and run Bash commands. More thorough, but adds 2-10 seconds.

Layer 3: Regression verification (Stop command hook)

Runs when Claude tries to stop. Deterministic check: do the tests pass? Does the build succeed?

#!/bin/bash
# .claude/hooks/verify-regression.sh
INPUT=$(cat)

# Anti-loop protection, MANDATORY
if [ "$(echo "$INPUT" | jq -r '.stop_hook_active')" = "true" ]; then
    exit 0
fi

# Run tests
TEST_OUTPUT=$(npm test 2>&1)
if [ $? -ne 0 ]; then
    TRIMMED=$(echo "$TEST_OUTPUT" | tail -50)
    echo "Tests failing. Fix before completing:\n$TRIMMED" >&2
    exit 2
fi

# Run build
BUILD_OUTPUT=$(npm run build 2>&1)
if [ $? -ne 0 ]; then
    TRIMMED=$(echo "$BUILD_OUTPUT" | tail -30)
    echo "Build failing:\n$TRIMMED" >&2
    exit 2
fi

exit 0

The complete configuration

All three layers in one .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/verify-syntax.sh" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/verify-regression.sh" },
          {
            "type": "prompt",
            "prompt": "Review what was accomplished. Check if all requirements from the user's original request were addressed. If incomplete, respond with {\"decision\": \"block\", \"reason\": \"<what remains>\"}. If complete, respond with {\"decision\": \"allow\"}."
          }
        ]
      }
    ]
  }
}

Stop hooks run in definition order. Put the fast command hook (Layer 3) first. If tests fail, there's no point running the slower prompt hook (Layer 2).

Boris Cherny, creator of Claude Code, reports that verification feedback loops improve quality significantly: "Give Claude a way to verify its work. If Claude has that feedback loop, it will 2-3x the quality of the final result" (X thread, 2026).

What's the cost of running verification hooks?

Verification hooks add roughly 10-20% token overhead per session, primarily from the prompt/agent Stop hooks. Command hooks cost zero tokens and under 5 seconds of wall time. But skipping verification costs significantly more: teams lose an average of 7 hours per week per engineer to AI-related inefficiency, and AI code rework rates hit 20-30% when AI-generated code exceeds 40% of the codebase (Exceeds AI, 2026).

	Without Verification	With Verification
Token cost per session	Baseline	+10-20%
Rework rate	20-30%	~5-10% (estimated)
Time lost per week	~7 hours	~2-3 hours (estimated)
"Done" means done	Sometimes	Almost always

You don't need all 3 layers at once. Layer 3 alone (the test-runner Stop hook) is the highest-ROI single addition. It's 15 lines of bash, costs zero tokens, and catches the most common failure: the agent says "done" while tests are broken.

When should you use each verification layer?

Use Layer 1 (syntax) always. It's free, catches the obvious, and runs in under 2 seconds. Use Layer 3 (regression) when your project has a test suite. It's the highest-ROI single hook. Use Layer 2 (intent) for complex or multi-step tasks where the agent might solve the wrong problem entirely.

Scenario	Layer 1 (Syntax)	Layer 2 (Intent)	Layer 3 (Regression)
Prototyping	Yes	No	No
Solo dev, daily work	Yes	No	Yes
Team project	Yes	Yes (prompt)	Yes
Production hotfix	Yes	Yes (agent)	Yes

How to adopt gradually:

Week 1: Add the Layer 3 Stop hook (test runner). Copy the verify-regression.sh script above. This single hook catches the most common failure mode.
Week 2: Add the Layer 1 PostToolUse hook (syntax). Copy verify-syntax.sh. Now lint errors get fixed automatically instead of piling up.
When you hit an intent failure: Add the Layer 2 prompt hook. You'll know you need it when Claude completes a task that passes all tests but doesn't match what you asked for.

This follows the failure-first method: add constraints after real failures, not before imagined ones.

FAQ

What is a self-verification loop in Claude Code?

A self-verification loop is a system of hooks that automatically checks Claude Code's output at multiple levels (syntax, intent, regression) before allowing the agent to finish. It uses PostToolUse hooks for per-file checks and Stop hooks for task-completion verification. The agent receives feedback and self-corrects without manual review.

Does verification slow down Claude Code?

Command hooks add under 5ms. Prompt hooks add 300-2000ms per Stop event. Agent hooks add 2-10 seconds. These fire once when Claude tries to stop, not on every tool call. The overhead is minimal compared to the 7 hours per week teams lose to AI-related rework.

What is the Stop hook in Claude Code?

The Stop hook fires every time Claude finishes responding. Exit code 2 blocks Claude from stopping and forces it to continue with feedback from stderr. The stop_hook_active field prevents infinite loops by signaling when Claude is already in a forced-continuation state.

How do I prevent infinite loops in verification hooks?

Always check the stop_hook_active field in your Stop hook. When the value is true, Claude is already in a forced-continuation state from a previous block. Return exit 0 to let it stop. Without this check, the hook blocks every stop attempt indefinitely, creating an infinite loop that burns tokens until the session times out.

What is harness engineering?

Harness engineering is the discipline of building constraints, tools, feedback loops, and observability around an AI agent to make it reliable in production. The formula: Agent = Model + Harness. Self-verification loops are one harness engineering example. For the full framework, see Harness Engineering: The System Around AI Matters More Than AI.

Try it now: Copy verify-regression.sh into .claude/hooks/, add the Stop hook config to .claude/settings.json, make it executable with chmod +x, and ask Claude to make a code change. Watch the Stop hook fire when tests fail. Confirm the agent fixes the issue before completing.

What layer would you add first — syntax, intent, or regression? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

The Constraint Paradox: Why Less AI Freedom Produces Better Code

ShipWithAI — Sun, 26 Apr 2026 01:00:00 +0000

LangChain jumped from 52.8% to 66.5% on Terminal Bench 2.0 by constraining their agent, not upgrading the model. Running at maximum reasoning budget actually scored worse. Three data points prove it: freedom is the enemy of AI agent reliability.

Two approaches. Same model. Different results:

# Approach A: Give the agent more freedom
→ Upgrade model, add more tools, increase context window
→ Remove guardrails so it "moves faster"
→ Result: unpredictable, rolls back 3x per session

# Approach B: Give the agent more constraints
→ Same model, same tools, same context
→ Add: verification loop, compute budget, context injection
→ Result: 52.8% → 66.5% on Terminal Bench 2.0 (LangChain, 2026)

Every time a team complains about Claude Code "doing the wrong thing," I ask the same question: what stopped it from doing that? The answer is always nothing. The agent had the capability. Nothing prevented the action.

The instinct is to want a smarter model. The fix is a tighter harness.

This is the Constraint Paradox: the more you restrict what your AI agent can do, the better it performs at what it should do.

Why does everyone assume "smarter model" is the answer?

Developers instinctively optimize for agent capability. Smarter model + more tools + fewer restrictions = better output. But this assumption conflates capability with reliability, and they're fundamentally not the same thing.

A senior developer with no code review, no CI/CD, no linting, and full production access will ship worse code than a junior developer working inside a strict pipeline. Not because the senior is less capable. Because unrestricted capability doesn't self-organize toward correct behavior. It just has a larger surface area for mistakes.

AI agents have the same problem, magnified. An LLM doesn't have intuition for "this feels wrong." It doesn't pause before a destructive command and think "wait, should I really do this?" Constraints provide that intuition externally.

OpenAI demonstrated this at scale. Their Codex team shipped roughly one million lines of production code with zero human-written lines over five months. Codex didn't succeed because it used a smarter model. It succeeded because it ran inside one of the most constrained environments in the industry: AGENTS.md files, reproducible dev environments, CI invariants, and mechanical verification.

The question isn't "how smart is your model?" The question is "how tight is your harness?"

Three data points that prove constraints beat capability

Evidence 1: LangChain Terminal Bench 2.0

LangChain improved their coding agent from 52.8% to 66.5% on Terminal Bench 2.0 by changing only the harness. Same model (gpt-5.2-codex). No fine-tuning. No model swap. Three harness changes:

Context injection via LocalContextMiddleware — map the environment upfront
Self-verification loop via PreCompletionChecklistMiddleware — verify before marking complete
Compute budget management — cap reasoning to prevent timeouts

The counterintuitive part: running at maximum reasoning budget (xhigh) scored 53.9%, worse than the original baseline. The high setting scored 63.6%.

Setting	Score	Change
Baseline (no harness changes)	52.8%	-
Harness changes + high reasoning	66.5%	+13.7pp
Harness changes + xhigh reasoning	53.9%	+1.1pp (timeouts)

More thinking didn't help. Better constraints did.

Evidence 2: Mitchell Hashimoto's AGENTS.md

Mitchell Hashimoto (creator of Terraform, Vagrant, Ghostty) treats his AGENTS.md as a failure log. Every single line exists because the agent made that specific mistake at least once:

"Each line in that file is based on a bad agent behavior, and it almost completely resolved them all" — mitchellh.com, 2026

Ghostty is one of the most productive AI-assisted codebases in the open source world. Hashimoto estimates agents run 10-20% of his working day in the background. And it runs on one of the most constrained harnesses. Not despite the constraints. Because of them.

Evidence 3: Claude Code's permission model

Claude Code defaults to read-only. You must explicitly allow write access, file creation, and command execution. This isn't a limitation. It's a design decision.

Instead of evaluating every possible action (including destructive ones), the agent operates within a bounded set of safe actions. When it needs to do something outside that set, it asks. That asking catches mistakes before they happen.

Compare this to an agent with full file system access from the start. It never pauses. It never asks. It just does — including rm -rf when it thinks cleanup is needed.

Why do constraints actually improve AI agent output?

Three mechanisms:

Mechanism 1: Constraints reduce the search space. An unconstrained agent evaluates every possible action, including destructive ones. A constrained agent only evaluates valid actions. Same reason chess engines play better with opening books: eliminating bad moves early means more compute spent on good ones.

LangChain's LocalContextMiddleware is search space reduction in practice. Instead of the agent spending steps figuring out its environment, the middleware injects that context upfront.

Mechanism 2: Constraints clarify intent. When you tell an agent "don't modify files in /config," you're not just preventing a bad action. You're giving the agent information about what matters. Constraints are communication that's harder to misinterpret than instructions.

An instruction says: "Be careful with config files." That's ambiguous. A constraint says: Hook blocks all writes to /config/**. No ambiguity. No interpretation required.

Mechanism 3: Hard stops beat soft warnings. A Hook that blocks git push --force doesn't require the agent to "decide" whether to follow the rule. The rule is enforced. The agent doesn't waste tokens weighing the instruction against other context.

LangChain's PreCompletionChecklistMiddleware is a hard stop. The agent cannot mark a task complete without running verification. It doesn't "decide" whether to verify. Verification is mandatory.

Enforcement	Mechanism	Compliance	Example
Instruction	Soft context, weighted by LLM	60-70%	"Don't force push"
Hook	Shell script, pre-action	100%	Block force push
Middleware	Code in agent pipeline	100%	Forced verification

Won't constraints slow down development?

No. Unconstrained agents waste more time recovering from mistakes than constrained agents spend on guardrail checks.

# Unconstrained session
Agent runs → mistake at min 15 → rollback → retry → 50 min total
Useful work: 15 min (30% efficiency)

# Constrained session
Agent runs → blocked at min 15 → redirects → completes → 25 min total
Useful work: 25 min (100% efficiency)

A single bad agent decision (deleted file, force push, broken migration) costs 30 minutes of recovery. A Hook check takes 5 milliseconds.

LangChain's LoopDetectionMiddleware makes this concrete. It detects when the agent is stuck in repetitive edits and forces it to reconsider its approach. Without this constraint, the agent burns through tokens re-editing the same file. With it, the agent backs up and tries a different strategy.

The real cost isn't the constraint. The real cost is the recovery from what would have happened without it.

Where should you constrain (and where not)?

Constrain (high cost to undo)	Don't constrain (low cost)
File deletion, `rm -rf`	Variable naming choices
`git push --force`	Algorithm selection
Production database writes	Refactoring approach
`.env` and secrets edits	Comment style
CI/CD pipeline changes	Test structure

Over-constraining is a real risk. If every file is protected, every command requires approval, and every edit needs pre-authorization, you've built a system that accomplishes nothing. The goal isn't zero risk. The goal is zero unrecoverable risk.

Claude Code's permission model gets this balance right. Read is unrestricted. Write requires approval. Destructive commands require explicit allowlisting. The agent explores freely but can't break things without your sign-off.

FAQ

What is the constraint paradox in AI agents?

The constraint paradox is the counterintuitive finding that restricting an AI agent's capabilities produces better output than giving it more freedom. LangChain demonstrated this by gaining 13.7 benchmark points through harness constraints alone. The mechanism: constraints reduce the agent's search space, clarify intent, and enforce rules deterministically.

Does more compute always improve AI agent performance?

No. LangChain's benchmark data shows running at maximum reasoning budget (xhigh) scored 53.9%, worse than the high setting at 63.6%. More compute caused timeouts that hurt overall performance. The optimal approach is budgeted compute with hard verification stops, not unlimited reasoning.

What's the difference between constraining and limiting an AI agent?

Constraining means removing dangerous or wasteful actions while preserving the ability to solve the problem. Limiting means reducing capability entirely. A Hook that blocks rm -rf is a constraint. Removing file system access entirely is a limitation. Constraints improve reliability. Limitations reduce usefulness.

How many constraints should an AI agent harness have?

Enough to prevent unrecoverable mistakes, not so many the agent can't work. The rule of thumb: constrain any action that would take more than 5 minutes to undo. Leave everything else to the agent's judgment. Start with 3-5 constraints and add only after real failures, following the failure log pattern.

Try it now: Open .claude/settings.json and check your current permission config. If the agent has unrestricted write access, add one PreToolUse Hook that blocks edits to .env and credentials. Test it: ask Claude Code to edit your .env file and confirm the hook blocks it.

What's your take — have you seen constraints improve your agent's output? Drop it in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.

Your CLAUDE.md Is an Instruction File. It Should Be a Failure Log.

ShipWithAI — Fri, 24 Apr 2026 01:00:00 +0000

CLAUDE.md instructions get followed ~60-70% of the time. Mitchell Hashimoto's AGENTS.md in Ghostty has zero aspirational lines — every entry traces to a real agent mistake. Use the Failure-to-Constraint Decision Tree: dangerous actions go to Hooks, repeatable workflows go to Commands, style/convention goes to CLAUDE.md.

Two CLAUDE.md files. Same project. Different philosophies:

# ❌ Before: instruction-first CLAUDE.md (typical)
# 47 lines of well-meaning rules
- "Be careful with production database."
- "Always write tests."
- "Use TypeScript strict mode."
- "Follow our naming conventions."
# Claude reads these, weighs them against 200K tokens... follows ~65%.

# ✅ After: failure-first CLAUDE.md (Hashimoto method)
# 12 lines, each traced to a specific incident
- "NEVER use git push --force. Use --force-with-lease."
  # Failure: 2026-03-12, force push overwrote teammate's commits on feature/auth
- "Run npm test before ANY git commit. No exceptions."
  # Failure: 2026-02-28, broken import pushed to main, CI caught 20min later

One file has 47 lines of advice. The other has 12 lines of scars. Which one does the agent actually follow?

The answer isn't close. The 12-line file wins every time, because every line carries weight. Every line exists for a reason the model can evaluate. The 47-line file is a wishlist. The 12-line file is a harness.

Why do most CLAUDE.md files fail?

Most CLAUDE.md files fail because developers write them like job descriptions: aspirational, comprehensive, bloated. LLMs don't execute instructions like code executes functions. They weigh each instruction against the full context window. More lines means more dilution, which means lower compliance per line.

The data backs this up. An ETH Zurich study (Gloaguen et al., 2026) tested context files across 138 real GitHub issues and found that LLM-generated agentfiles actually reduced success rates by 0.5-2% while increasing inference costs by 20-23%. Even developer-provided files only improved performance by ~4% on average. The typical developer-written file averaged 641 words across 9.7 sections.

That's a lot of instructions for a 4% gain.

Metric	200-line CLAUDE.md	40-line CLAUDE.md
Instructions	~200	~40
Compliance	~60-70%	~85-90%
Maintenance	Monthly pruning needed	Self-maintaining

Frontier LLMs can follow approximately 150-200 instructions with reasonable consistency. Your 200-line CLAUDE.md already exceeds that budget before counting the system prompt (another ~50 instructions). Community benchmarks put compliance at 60-70% for files over 200 lines. That's a coin flip for your most important rules.

What is the Mitchell Hashimoto method for AGENTS.md?

Mitchell Hashimoto (creator of Terraform, Vagrant, and now Ghostty) treats AGENTS.md as a failure log, not an instruction file. Every single line in Ghostty's AGENTS.md exists because the agent made that specific mistake at least once. No line is aspirational. Every line is a scar from a real incident.

In his own words:

"Each line in that file is based on a bad agent behavior, and it almost completely resolved them all" — mitchellh.com, 2026

The mental model shift matters:

Instruction-first	Failure-first
"What should the agent do?"	"What has the agent broken?"
Proactive, aspirational	Reactive, evidence-based
High volume, low signal	Low volume, high signal
Added before problems occur	Added after problems occur
Dilutes over time	Strengthens over time

Instructions are wishes. Constraints are lessons. LLMs don't need more wishes. They need fewer, sharper constraints with concrete context about why each one exists.

How do you build CLAUDE.md from failures instead of imagination?

Start with a minimal CLAUDE.md containing only your project overview and tech stack. Run the agent on real tasks. When it breaks something, convert that failure into a constraint. Then route the constraint to the right layer.

Step 1: Start minimal

Your initial CLAUDE.md should be 5-10 lines:

# Project: Acme SaaS
TypeScript, Next.js 15, Drizzle ORM, deployed on Vercel.

## Build
npm run build && npm test

That's it. No rules. No conventions. No aspirational guidelines. Just enough context for the agent to understand what it's working on.

Step 2: Run the agent, observe failures

Use the agent for real work. Don't preemptively add rules. When the agent makes a mistake, write down exactly what happened:

What: force-pushed to main
When: 2026-03-12
Impact: overwrote teammate's commits on feature/auth

Step 3: Convert the failure into a constraint

Turn the incident into a specific, testable rule:

NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth

The pattern is always the same: CONSTRAINT + REASON + FAILURE DATE.

Step 4: Route it with the decision tree

Not every constraint belongs in CLAUDE.md. This decision tree is the most important takeaway from this post:

Agent made a mistake
    │
    ├── Is the action irreversible or dangerous?
    │   YES → Hook (PreToolUse block)
    │   Examples: delete production files, force push, edit .env
    │
    ├── Is it a repeatable workflow the agent should automate?
    │   YES → Command or Skill (.claude/commands/)
    │   Examples: run tests after refactor, update changelog
    │
    └── Is it a style, convention, or context issue?
        YES → CLAUDE.md constraint
        Examples: naming conventions, test patterns, commit format

If you take one thing from this post, take the decision tree. It replaces the instinct of "something went wrong, let me add a line to CLAUDE.md" with a structured routing decision.

What does a CLAUDE.md look like before and after?

Before: instruction-first (47 lines)

# Project: Acme SaaS

## Rules
- Be careful with production database.
- Always write tests.
- Use TypeScript strict mode.
- Follow naming conventions.
- Don't use deprecated APIs.
- Keep functions under 50 lines.
- Use ESLint and Prettier.
- Comment complex logic.
- Don't hardcode environment variables.
- Use meaningful variable names.
# ... 37 more aspirational rules like these

Every line is reasonable. None is specific. The agent reads all 47, retains maybe 30, and consistently follows maybe 25.

After: failure-first (18 lines)

# Project: Acme SaaS
TypeScript, Next.js 15, Drizzle ORM, Vercel.

## Build
npm run build && npm test

## Constraints (each from a real failure)

NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth

Run `npm test` before ANY git commit.
# 2026-02-28: broken import shipped to main, CI caught 20min later

Schema migrations: always generate with `drizzle-kit generate`.
# 2026-03-05: hand-written migration missed NOT NULL, broke staging

API routes: validate input with zod schemas, never trust req.body.
# 2026-03-18: unvalidated input caused 500 errors for 2 hours

18 lines. 4 constraints. Each one backed by a real incident with a date. The agent knows not just what to avoid but why, which makes the constraint stickier in context.

How do you categorize failures into the right layer?

Layer	Enforcement	Compliance	Example
Hook	Deterministic (shell script)	100%	Block `git push --force`
Command	Deterministic (executed)	100%	Run tests after refactor
CLAUDE.md	Probabilistic (LLM context)	60-90%	Use camelCase naming

Category A: Structural failures → Hook. File deletion, sensitive config edits, force pushes. For irreversible actions, you need 100% enforcement, not 60-70%.

Category B: Style and convention failures → CLAUDE.md. Variable naming, comment style, test patterns, commit format. Low-stakes if violated occasionally.

Write them as failure-derived constraints:

- Use camelCase for variables, PascalCase for components.
  # 2026-03-20: agent used snake_case in 3 React components, broke style consistency
- Test files go in __tests__/ next to the source file, not in a top-level test/ dir.
  # 2026-02-15: agent created test/api/users.test.ts, missed by our jest config

Category C: Workflow failures → Commands/Skills. "Always run tests after refactor." "Always update the changelog after API changes." These are repeatable processes. Don't remind the agent. Automate it.

How do you keep CLAUDE.md lean over time?

Prune monthly. HumanLayer's production CLAUDE.md is under 60 lines. Bloat is the number one killer of CLAUDE.md effectiveness.

Monthly pruning checklist:

For each constraint in CLAUDE.md, ask:

1. Has the agent triggered this constraint in the past 3 months?
   NO → candidate for removal

2. Has this constraint graduated to a Hook?
   YES → remove from CLAUDE.md (now enforced, not suggested)

3. Is this a workflow that could be a Command instead?
   YES → move to .claude/commands/, remove from CLAUDE.md

4. Can I name the specific failure behind this line?
   NO → delete it (it's aspirational, not evidence-based)

5. Does the agent already do this correctly without the instruction?
   YES → delete it (you're wasting instruction budget)

I did this exercise on a 90-line CLAUDE.md last month. It dropped to 23 lines. The agent's compliance on the remaining rules went up noticeably within the first session. Fewer rules, better followed.

FAQ

What is the difference between CLAUDE.md and AGENTS.md?

CLAUDE.md is Claude Code's project-level instruction file, loaded automatically at session start. AGENTS.md is an emerging open standard backed by OpenAI Codex, Amp, Google Jules, and Cursor that serves the same purpose but is agent-agnostic. Both are repository-level context files. If you use Claude Code, write CLAUDE.md. If you want cross-agent compatibility, also add an AGENTS.md. The failure-first methodology applies to both.

Should I start CLAUDE.md from scratch or use a template?

Start from scratch with only three things: project name, tech stack, build commands. Then build it through the failure-first workflow: run the agent, observe mistakes, add constraints one at a time. Templates encourage instruction-first thinking, which is the exact problem this post addresses.

Can the agent override or ignore CLAUDE.md constraints?

Yes. CLAUDE.md is "soft" context. The LLM weighs it against other context but can ignore it. Compliance runs 60-70% with large files, higher with lean files. For constraints that must be followed 100% of the time, use Hooks instead. Hooks run as shell scripts and physically block the action. The model cannot bypass them.

How many lines should CLAUDE.md have?

As few as possible. Research suggests LLMs follow ~150-200 instructions consistently, but that budget is shared with the system prompt (~50 instructions). Aim for 30-60 lines of failure-derived constraints plus a minimal project overview. If your file exceeds 100 lines, audit it with the failure-first test: can you name the specific incident behind each line?

Try it now: Open your CLAUDE.md right now. For each line, write the specific failure that caused you to add it. If you can't name the incident, delete the line.

How many lines survived? Drop your before/after count in the comments.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and shipping software faster with structured AI.
I had a 90-line CLAUDE.md last month. Rules for everything. Naming conventions, test patterns, git workflows, API design, deployment checklist. Carefully organized with headers and bullet points.

Claude followed about 65% of it.

Then I learned about Mitchell Hashimoto's approach to AGENTS.md in Ghostty. Every single line in his file traces to a real agent mistake. No aspirational rules. No "best practices." Just scars.

So I tried it. I went through my 90-line file and asked one question for each line: "Can I name the specific failure that caused me to add this?"

23 lines survived.

The problem with instruction-first thinking

Most of us write CLAUDE.md like a job description — comprehensive, aspirational, bloated. But LLMs don't execute instructions like code executes functions. They weigh each instruction against everything else in the context window.

ETH Zurich tested this across 138 real GitHub issues. LLM-generated context files actually reduced success by 0.5-2% while increasing costs by 20-23%. Even developer-written files only improved things by ~4%.

The math is brutal: 200 lines of instructions, shared with a system prompt that already has ~50 instructions, competing for a model's attention across 200K tokens. Your most important rule has the same weight as "use meaningful variable names."

The failure-first method

Hashimoto's approach is the opposite. Start with almost nothing — project name, tech stack, build command. That's 5 lines. Then run the agent on real work. When it breaks something, you have three choices:

Is the action dangerous or irreversible? → Don't put it in CLAUDE.md. Put it in a Hook. A PreToolUse hook that exits with code 2 physically blocks the action. 100% enforcement. No exceptions. Force pushes, file deletions, production edits — these need hooks, not suggestions.

Is it a repeatable workflow? → Put it in .claude/commands/. A command runs deterministically every time. A CLAUDE.md instruction runs when the model remembers it.

Is it a style or convention issue? → Now it belongs in CLAUDE.md. But write it as: CONSTRAINT + REASON + FAILURE DATE.

NEVER use `git push --force`. Use `--force-with-lease`.
# 2026-03-12: force push overwrote teammate's commits on feature/auth

The failure context makes the constraint stickier. The model doesn't just know what to avoid — it knows why. That matters for an LLM weighting instructions.

The result

My 90-line file → 23 lines. Compliance on the remaining rules went up noticeably in the first session. Fewer rules, better followed. The dangerous ones graduated to Hooks where they're enforced 100%. The workflows became Commands. What remained in CLAUDE.md was lean, specific, and battle-tested.

The monthly pruning rule: for each line, can you name the incident? No? Delete it. Has it graduated to a Hook? Remove it. Is the agent already doing it right without being told? You're wasting instruction budget.

Read the full breakdown with the decision tree, before/after examples, and pruning checklist →

This week's takeaway: Your CLAUDE.md is probably too long. The fix isn't writing better instructions — it's deleting the ones without scars behind them.

How many lines is your CLAUDE.md right now? Reply — I'm genuinely curious about the range people are working with.

If you know someone drowning in a 200-line CLAUDE.md, forward this. They'll thank you.

Beyond CLAUDE.md: 5 Layers Your AI Agent Harness Is Missing

ShipWithAI — Wed, 22 Apr 2026 01:00:00 +0000

Most developers stop at CLAUDE.md. That's layer 1. A production Claude Code harness needs 5 layers: memory, tools, permissions, hooks, and observability. Here's the full setup guide.

Claude Code harness has 5 layers:

Memory — CLAUDE.md, MEMORY.md, .claude/commands/
Tools — MCP servers (sweet spot: 2–3)
Permissions — settings.json allow/deny lists
Hooks — PreToolUse/PostToolUse verification
Observability — Decision logging, cost tracking, anomaly detection

Most developers only have layer 1. Setup order: 1→4→2→3→5 (guardrails before capabilities).

Why? Because LangChain gained +13.7 benchmark points from harness changes alone — jumping from 52.8% to 66.5% on the same model.

Layer 1: Memory (The Foundation)

Your CLAUDE.md is the project rules file. Claude reads it every prompt and follows it consistently.

What goes in memory:

CLAUDE.md — 40–60 lines max. Project context, conventions, constraints.
MEMORY.md — Long-term learning. "We discovered X fails without Y."
.claude/commands/ — Reusable prompt templates as commands.

The ETH Zurich finding: CLAUDE.md alone caps improvement at ~4%. It's necessary but not sufficient.

The HumanLayer benchmark: Teams keeping CLAUDE.md under 60 lines saw better compliance than those writing 200-line manifestos. Shorter = clearer.

# Example CLAUDE.md structure

## Project Identity
- Framework: Next.js 15 + TypeScript
- Package manager: pnpm
- Architecture: API routes + React components

## You Are
- A full-stack developer shipping features
- Opinionated about patterns: prefer hooks > HOCs
- Balancing speed with maintainability

## Rules
1. Always include tests when modifying /lib
2. Use conventional commits for all commits
3. If suggesting breaking changes, warn first
4. Database migrations need rollback logic

## Code Conventions
- Folder structure: /pages, /components, /lib, /styles
- Component naming: PascalCase for React files
- API routes: camelCase for endpoint handlers

## What NOT to do
- Don't refactor without atomic commits
- Don't add dependencies without checking bundle impact
- Don't commit .env files

Layer 2: Tools (Adding Capability)

Tools are MCP servers. Claude uses them to read files, run commands, query databases.

The HumanLayer finding: Too many tools cause agent confusion. Each tool is context overhead. Sweet spot: 2–3 MCP servers per project.

Not 20. Not "all available servers."

Which 2–3 tools?

Filesystem tool — read/write/execute (almost always)
One domain-specific tool — database, API, CLI
Optional: Observability tool — logs, metrics

Example for a Next.js project:

Filesystem (built-in)
PostgreSQL client (query → fix migrations)
GitHub API (check PR status → adjust approach)

More tools = more tokens + more decision fatigue for Claude.

Layer 3: Permissions (The Guardrails)

Permissions live in settings.json. Specify exactly what Claude is allowed to do.

Allowlist over denylist. It's safer to say "Claude can only modify these files" than "Claude cannot do X."

{
  "permissions": {
    "filesystem": {
      "allow": [
        "/src/**",
        "/public/**",
        "*.config.js",
        ".env.local"
      ],
      "deny": [
        "/node_modules/**",
        "/.git/**",
        "/build/**",
        ".env"
      ]
    },
    "execution": {
      "allow": ["npm run test", "npm run build"],
      "deny": ["rm -rf", "sudo *"]
    }
  }
}

Why this matters:

Claude won't accidentally delete node_modules (been there)
Can't run destructive commands without review
Enforced at runtime, not a suggestion

Check settings.json into git. This becomes part of your project's DNA.

Layer 4: Hooks (Deterministic Enforcement)

Hooks are the most powerful layer. They run before and after Claude uses tools.

PreToolUse hook: Intercept tool calls, validate them, reject bad ones.
PostToolUse hook: Inspect results, catch anomalies, trigger alerts.

Boris Cherny, Anthropic, calls verification "the most important thing" for quality. Hooks are that verification.

#!/bin/bash
# Runs before every tool use

TOOL=$1
PARAMS=$2

case $TOOL in
  "filesystem_write")
    if echo "$PARAMS" | grep -E "(node_modules|\.git|\.env)" > /dev/null; then
      echo "REJECTED: Protected path"
      exit 1
    fi
    ;;
  "command_execute")
    if echo "$PARAMS" | grep -E "(rm -rf|:(){ :|:)" > /dev/null; then
      echo "REJECTED: Dangerous command"
      exit 1
    fi
    ;;
esac

echo "APPROVED"
exit 0

#!/bin/bash
# Runs after every tool use

TOOL=$1
RESULT=$2
DURATION=$3

if (( DURATION > 30 )); then
  echo "⚠️  Slow tool: $TOOL took ${DURATION}s"
fi

if echo "$RESULT" | grep -i "error\|failed\|undefined"; then
  echo "🔴 Tool failed: $(echo $RESULT | head -20)"
fi

Where to set hooks:

.claude/hooks/pre-tool-use.sh
.claude/hooks/post-tool-use.sh

Hooks are not bypassed. They're enforcement.

Layer 5: Observability (Learning from Decisions)

Observability means: logging decisions, tracking costs, detecting anomalies.

What to log:

Which tools Claude called and why
Tokens used per session (cost tracking)
Time spent on each decision
Failures and retries

The HumanLayer insight: Surface only failures, not 4,000 lines of passing tests.

Most developers log everything. Better: log strategically.

#!/bin/bash
# Log Claude's decisions

echo "$(date '+%Y-%m-%d %H:%M:%S') | Tool: $TOOL | Status: $STATUS | Tokens: $TOKENS | Duration: ${DURATION}s" >> .claude/logs/decisions.log

TOTAL_COST=$(grep "Tokens:" .claude/logs/decisions.log | awk '{sum+=$NF} END {print sum}')
if (( $(echo "$TOTAL_COST > 5.00" | bc -l) )); then
  echo "💰 Cost alert: $TOTAL_COST USD today"
fi

ERROR_RATE=$(grep "FAILED" .claude/logs/decisions.log | wc -l)
if (( ERROR_RATE > 5 )); then
  echo "🚨 High error rate detected: $ERROR_RATE failures in last hour"
fi

Setup Order Matters: 1 → 4 → 2 → 3 → 5

Why not 1 → 2 → 3 → 4 → 5?

Wrong order: Capabilities before guardrails

Build CLAUDE.md ✅
Add 10 MCP servers ⚠️
Grant all permissions ⚠️
No hooks (too late, broke things already)
Now add observability (chaos already happened)

Right order: Guardrails first

Build CLAUDE.md ✅ (memory/rules)
Add hooks ✅ (enforcement before tools exist)
Add 2–3 MCP servers ✅ (now hooks guard them)
Restrict permissions ✅ (layered safety)
Add observability ✅ (track what's working)

Adding hooks after tools is like adding seatbelts after the crash.

Production-Ready Harness: 10-Item Checklist

[ ] CLAUDE.md exists, 40–60 lines, checked into git
[ ] MEMORY.md setup with "lessons learned"
[ ] .claude/commands/ has 3+ reusable prompts
[ ] Max 3 MCP servers chosen and documented
[ ] settings.json has allowlist (filesystem, execution)
[ ] .claude/hooks/pre-tool-use.sh validates calls
[ ] .claude/hooks/post-tool-use.sh inspects results
[ ] .claude/logs/ directory exists + observability hook running
[ ] Cost tracking implemented (tokens/session)
[ ] Team knows where each file lives + how to update it

FAQ

Which layer do I need first?
Layer 1 (CLAUDE.md). Everything depends on clear memory. Start there.

Does this harness slow down Claude Code?
No. Hooks add ~100–300ms per tool use. Worth it for the safety. Observability has negligible cost.

What are the most important hooks?
PreToolUse (validation) and PostToolUse (anomaly detection). Those two prevent 80% of issues.

How many MCP servers is "too many"?
More than 5 becomes noise. More than 3 means you're probably adding tools you won't use. Start with 1–2, add more only when they solve a real workflow problem.

Can I skip permissions and just use hooks?
Technically yes, but no. Permissions are defense-in-depth. Hooks catch mistakes. Permissions prevent them.

How do I update CLAUDE.md over time?
Document it in MEMORY.md. "We added this rule because X failed." Over time, CLAUDE.md stabilizes.

Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and building production systems with AI. Full blog + templates at shipwithai.io.

What's your harness score? Drop it in the comments. Do you have all 5 layers, or are you still at layer 1?

DEV Community: ShipWithAI

How to Review AI-generated Pull Requests in 6 Steps with Claude Code

How to Review AI-generated Pull Requests in 6 Steps with Claude Code

Prerequisites

Step 1 - Write the Expected Scope before opening the diff

Step 2 - Detect Fake Tests and Over-mocking

Step 3 - Verify API Call Signatures

Step 4 - Confirm the Diff stays inside the declared scope

Step 5 - Hunt for Hallucinated Imports and Hidden Side-effects

Step 6 - Align Commit Messages with the Diff

Decision: Ship or Reject?

Key Takeaways

Why Your AI Coach’s Warmth Might Be Hiding a Critical Regression

Intro

The problem: sycophancy

A concrete technique: the pushback eval

Key takeaways

Action

The Complete Claude Code Harness Engineering Guide (5 Layers, 8 Deep-Dives)

What is Claude Code harness engineering?

What are the 5 layers of a Claude Code harness?

Layer 1: What does your agent know before you type?

Layer 4: What can the agent NOT do?

Layer 5: How do you know what your agent actually did?

Why does this actually work?

Why does this matter for your career?

Where should you start reading?

FAQ

What is Claude Code harness engineering?

Do I need all 5 layers to start?

How is harness engineering different from prompt engineering?

Does this only apply to Claude Code?

Hardening Your npm CI in 5 Concrete Layers

Intro

The Problem

Solution Walkthrough

Layer 1 – Enforce npm ci

Layer 2 – Validate lockfile integrity

Layer 3 – Dependency review action

Layer 4 – Pin actions to SHA

Layer 5 – OIDC trusted publishing

Results

Key Takeaways

Conclusion & CTA

Which Claude Code Hook Do You Need? A Decision Guide

What are the 4 Claude Code hook handler types?

When should you use CLAUDE.md vs a hook vs both?

Which hook events should you implement first?

How do you handle multiple hooks on the same event?

What are the most common hook mistakes?

Exit code cheat sheet

Debug workflow

FAQ

What are the 4 Claude Code hook handler types?

Should I use CLAUDE.md or a hook for security rules?

What is the difference between PreToolUse and PostToolUse hooks?

Can Claude Code hooks run in headless mode?

4 Lines in ~/.npmrc That Block 80% of npm Supply Chain Attacks

Why is your default npm setup unsafe in 2026?

What each line does

ignore-scripts=true

save-exact=true

audit-level=moderate

fund=false

Why does this work for most npm attacks?

What does this NOT protect against?

What breaks when you set ignore-scripts=true?

Upgrading to hook-based defense

FAQ

Will ignore-scripts=true break my builds?

Should I commit .npmrc to my repo?

Does this work for pnpm and yarn?

Is npm audit still useful if I set audit-level=moderate?

Claude Code Forgets Everything Between Sessions. MEMORY.md Fixes That

Claude Code resets context every session. MEMORY.md gives it persistent memory of your project's evolving state in a 200-line index file. Setup takes 5 minutes. One prompt at the end of each session keeps it current.

Why does Claude Code forget everything between sessions?

How do you set up MEMORY.md in 5 minutes?

What makes a good MEMORY.md entry vs a bad one?

How do you keep MEMORY.md current without complex hooks?

When should you prune MEMORY.md?

Step 1 - Write the Expected Scope before opening the diff

Step 2 - Detect Fake Tests and Over-mocking

Step 3 - Verify API Call Signatures

Step 4 - Confirm the Diff stays inside the declared scope

Step 5 - Hunt for Hallucinated Imports and Hidden Side-effects

Step 6 - Align Commit Messages with the Diff

Layer 1 – Enforce `npm ci`

Layer 2 – Validate lockfile integrity

Layer 3 – Dependency review action

Layer 4 – Pin actions to SHA

Layer 5 – OIDC trusted publishing

`ignore-scripts=true`

`save-exact=true`

`audit-level=moderate`

`fund=false`

What breaks when you set `ignore-scripts=true`?

Will `ignore-scripts=true` break my builds?

Should I commit `.npmrc` to my repo?

Is `npm audit` still useful if I set `audit-level=moderate`?