ShipWithAI

Posted on • Originally published at shipwithai.io

Beyond CLAUDE.md: 5 Layers Your AI Agent Harness Is Missing

Most developers stop at CLAUDE.md. That's layer 1. A production Claude Code harness needs 5 layers: memory, tools, permissions, hooks, and observability. Here's the full setup guide.

Claude Code harness has 5 layers:

  1. Memory — CLAUDE.md, MEMORY.md, .claude/commands/
  2. Tools — MCP servers (sweet spot: 2–3)
  3. Permissions — settings.json allow/deny lists
  4. Hooks — PreToolUse/PostToolUse verification
  5. Observability — Decision logging, cost tracking, anomaly detection

Most developers only have layer 1. Setup order: 1→4→2→3→5 (guardrails before capabilities).

Why? Because LangChain gained +13.7 benchmark points from harness changes alone — jumping from 52.8% to 66.5% on the same model.


Layer 1: Memory (The Foundation)

Your CLAUDE.md is the project rules file. Claude Code loads it into context automatically, so its rules apply to every prompt.

What goes in memory:

  • CLAUDE.md — 40–60 lines max. Project context, conventions, constraints.
  • MEMORY.md — Long-term learning. "We discovered X fails without Y."
  • .claude/commands/ — Reusable prompt templates as commands.
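Entries in `.claude/commands/` are plain markdown files; the filename becomes the slash command. A sketch with a hypothetical `/review` command (the checklist content is made up for illustration):

```shell
# Scaffold a reusable prompt template; review.md becomes the /review command
mkdir -p .claude/commands
cat > .claude/commands/review.md <<'EOF'
Review the staged diff for:
1. /lib changes missing tests
2. Commit messages that aren't conventional commits
3. Accidental edits to .env or node_modules
EOF
```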

The ETH Zurich finding: CLAUDE.md alone caps improvement at ~4%. It's necessary but not sufficient.

The HumanLayer benchmark: Teams keeping CLAUDE.md under 60 lines saw better compliance than those writing 200-line manifestos. Shorter = clearer.
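One way to keep that budget honest is a small guard you can call from a pre-commit hook. A sketch (the 60-line threshold comes from the benchmark above; adjust to taste):

```shell
# Fail if the rules file has grown past the 60-line budget
check_claude_md() {
  local lines
  lines=$(wc -l < "$1")
  if (( lines > 60 )); then
    echo "$1 is $lines lines (budget: 60), trim it"
    return 1
  fi
  echo "$1 OK ($lines lines)"
}

# e.g. in .git/hooks/pre-commit:  check_claude_md CLAUDE.md || exit 1
```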

# Example CLAUDE.md structure

## Project Identity
- Framework: Next.js 15 + TypeScript
- Package manager: pnpm
- Architecture: API routes + React components

## You Are
- A full-stack developer shipping features
- Opinionated about patterns: prefer hooks > HOCs
- Balancing speed with maintainability

## Rules
1. Always include tests when modifying /lib
2. Use conventional commits for all commits
3. If suggesting breaking changes, warn first
4. Database migrations need rollback logic

## Code Conventions
- Folder structure: /pages, /components, /lib, /styles
- Component naming: PascalCase for React files
- API routes: camelCase for endpoint handlers

## What NOT to do
- Don't refactor without atomic commits
- Don't add dependencies without checking bundle impact
- Don't commit .env files

Layer 2: Tools (Adding Capability)

Tools are MCP servers. Claude uses them to read files, run commands, query databases.

The HumanLayer finding: Too many tools cause agent confusion. Each tool is context overhead. Sweet spot: 2–3 MCP servers per project.

Not 20. Not "all available servers."

Which 2–3 tools?

  1. Filesystem tool — read/write/execute (almost always)
  2. One domain-specific tool — database, API, CLI
  3. Optional: Observability tool — logs, metrics

Example for a Next.js project:

  • Filesystem (built-in)
  • PostgreSQL client (query → fix migrations)
  • GitHub API (check PR status → adjust approach)

More tools = more tokens + more decision fatigue for Claude.
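Project-scoped servers can be declared in a `.mcp.json` checked into the repo, so teammates get the same toolset on checkout. A sketch using the reference MCP server packages (the connection string is a placeholder):

```shell
# Declare 2 project-scoped MCP servers in .mcp.json
cat > .mcp.json <<'EOF'
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/mydb"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"]
    }
  }
}
EOF
```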


Layer 3: Permissions (The Guardrails)

Permissions live in .claude/settings.json. They specify exactly what Claude is allowed to do.

Allowlist over denylist. It's safer to say "Claude can only modify these files" than "Claude cannot do X."

{
  "permissions": {
    "allow": [
      "Edit(src/**)",
      "Edit(public/**)",
      "Edit(.env.local)",
      "Bash(npm run test)",
      "Bash(npm run build)"
    ],
    "deny": [
      "Edit(node_modules/**)",
      "Edit(build/**)",
      "Read(.env)",
      "Bash(rm -rf:*)",
      "Bash(sudo:*)"
    ]
  }
}

Why this matters:

  • Claude won't accidentally delete node_modules (been there)
  • Can't run destructive commands without review
  • Enforced at runtime, not a suggestion

Check .claude/settings.json into git. This becomes part of your project's DNA.
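Because the file is enforced at runtime, a malformed edit silently weakens it. A cheap guard is to parse it before committing (a sketch, assuming `python3` is on the PATH):

```shell
# Refuse to proceed if the permissions file is not valid JSON
check_settings() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "$1 parses cleanly"
  else
    echo "$1 is missing or malformed"
    return 1
  fi
}

# e.g. in a pre-commit hook:  check_settings .claude/settings.json || exit 1
```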


Layer 4: Hooks (Deterministic Enforcement)

Hooks are the most powerful layer. They run before and after Claude uses tools.

PreToolUse hook: Intercept tool calls, validate them, reject bad ones.
PostToolUse hook: Inspect results, catch anomalies, trigger alerts.

Boris Cherny of Anthropic calls verification "the most important thing" for quality. Hooks are that verification.

#!/bin/bash
# Runs before every tool use

TOOL=$1
PARAMS=$2

case "$TOOL" in
  "filesystem_write")
    if echo "$PARAMS" | grep -Eq 'node_modules|\.git|\.env'; then
      echo "REJECTED: Protected path"
      exit 1
    fi
    ;;
  "command_execute")
    # Regex metacharacters escaped: matches "rm -rf" and the classic fork bomb
    if echo "$PARAMS" | grep -Eq 'rm -rf|:\(\)\{ :\|:'; then
      echo "REJECTED: Dangerous command"
      exit 1
    fi
    ;;
esac

echo "APPROVED"
exit 0
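Before wiring the script in, it's worth exercising the rejection logic in isolation. A standalone sketch of the same pattern (the JSON params are made up):

```shell
# Same protected-path check as the hook, as a directly testable function
validate_write() {
  if echo "$1" | grep -Eq 'node_modules|\.git|\.env'; then
    echo "REJECTED: Protected path"
    return 1
  fi
  echo "APPROVED"
}

validate_write '{"path": "node_modules/x.js"}'   # prints: REJECTED: Protected path
validate_write '{"path": "src/index.tsx"}'       # prints: APPROVED
```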
#!/bin/bash
# Runs after every tool use

TOOL=$1
RESULT=$2
DURATION=$3

if (( DURATION > 30 )); then
  echo "⚠️  Slow tool: $TOOL took ${DURATION}s"
fi

if echo "$RESULT" | grep -qiE 'error|failed|undefined'; then
  echo "🔴 Tool failed: $(echo "$RESULT" | head -n 20)"
fi

Where to set hooks:

  • .claude/hooks/pre-tool-use.sh
  • .claude/hooks/post-tool-use.sh

Hooks can't be talked around the way prompt instructions can. They're enforcement, not suggestion.
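In current Claude Code releases, hooks are registered in `.claude/settings.json` under a `hooks` key, with regex matchers over the built-in tool names (`Write`, `Edit`, `Bash`, etc.), and they receive tool details as JSON on stdin rather than positional arguments, so check the hooks docs for your version. A registration sketch (merge it with your permissions block):

```shell
# Register the two hook scripts; matchers are regexes over tool names
mkdir -p .claude/hooks
cat > .claude/settings.json <<'EOF'
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit|Bash",
        "hooks": [{ "type": "command", "command": ".claude/hooks/pre-tool-use.sh" }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": ".*",
        "hooks": [{ "type": "command", "command": ".claude/hooks/post-tool-use.sh" }]
      }
    ]
  }
}
EOF
```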


Layer 5: Observability (Learning from Decisions)

Observability means: logging decisions, tracking costs, detecting anomalies.

What to log:

  • Which tools Claude called and why
  • Tokens used per session (cost tracking)
  • Time spent on each decision
  • Failures and retries

The HumanLayer insight: Surface only failures, not 4,000 lines of passing tests.

Most developers log everything. Better: log strategically.

#!/bin/bash
# Log Claude's decisions (TOOL, STATUS, TOKENS, DURATION set by the calling hook)

mkdir -p .claude/logs
echo "$(date '+%Y-%m-%d %H:%M:%S') | Tool: $TOOL | Status: $STATUS | Tokens: $TOKENS | Duration: ${DURATION}s" >> .claude/logs/decisions.log

# Sum the Tokens field (the last field of each line is duration, not tokens)
TOTAL_TOKENS=$(awk -F'Tokens: ' 'NF > 1 { split($2, a, " "); sum += a[1] } END { print sum + 0 }' .claude/logs/decisions.log)

# Rough estimate; swap in your model's actual per-million-token rate
TOTAL_COST=$(echo "$TOTAL_TOKENS * 3 / 1000000" | bc -l)
if (( $(echo "$TOTAL_COST > 5.00" | bc -l) )); then
  echo "💰 Cost alert: $TOTAL_COST USD today"
fi

ERROR_COUNT=$(grep -c "FAILED" .claude/logs/decisions.log)
if (( ERROR_COUNT > 5 )); then
  echo "🚨 High error rate: $ERROR_COUNT failures logged"
fi
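Once the log exists, simple one-liners answer "what is Claude actually doing?". A sketch (the sample entries stand in for real hook output):

```shell
# Sample entries in the format the logging hook writes
mkdir -p .claude/logs
cat > .claude/logs/decisions.log <<'EOF'
2025-06-01 10:00:00 | Tool: filesystem_write | Status: OK | Tokens: 1200 | Duration: 2s
2025-06-01 10:00:05 | Tool: command_execute | Status: FAILED | Tokens: 300 | Duration: 31s
2025-06-01 10:00:09 | Tool: filesystem_write | Status: OK | Tokens: 800 | Duration: 1s
EOF

# Calls per tool, most frequent first
awk -F' \\| ' '{ split($2, t, ": "); count[t[2]]++ }
  END { for (tool in count) print count[tool], tool }' \
  .claude/logs/decisions.log | sort -rn
# prints:
#   2 filesystem_write
#   1 command_execute
```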

Setup Order Matters: 1 → 4 → 2 → 3 → 5

Why not 1 → 2 → 3 → 4 → 5?

Wrong order: Capabilities before guardrails

  1. Build CLAUDE.md ✅
  2. Add 10 MCP servers ⚠️
  3. Grant all permissions ⚠️
  4. No hooks (too late, broke things already)
  5. Now add observability (chaos already happened)

Right order: Guardrails first

  1. Build CLAUDE.md ✅ (memory/rules)
  2. Add hooks ✅ (enforcement before tools exist)
  3. Add 2–3 MCP servers ✅ (now hooks guard them)
  4. Restrict permissions ✅ (layered safety)
  5. Add observability ✅ (track what's working)

Adding hooks after tools is like adding seatbelts after the crash.


Production-Ready Harness: 10-Item Checklist

  • [ ] CLAUDE.md exists, 40–60 lines, checked into git
  • [ ] MEMORY.md set up with "lessons learned"
  • [ ] .claude/commands/ has 3+ reusable prompts
  • [ ] Max 3 MCP servers chosen and documented
  • [ ] settings.json has allowlist (filesystem, execution)
  • [ ] .claude/hooks/pre-tool-use.sh validates calls
  • [ ] .claude/hooks/post-tool-use.sh inspects results
  • [ ] .claude/logs/ directory exists + observability hook running
  • [ ] Cost tracking implemented (tokens/session)
  • [ ] Team knows where each file lives + how to update it
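A throwaway script can grade most of this checklist automatically. A sketch covering the file-existence items:

```shell
# Audit the harness: check that the checklist's key files exist
audit_harness() {
  local f
  for f in CLAUDE.md MEMORY.md .claude/settings.json .mcp.json \
           .claude/hooks/pre-tool-use.sh .claude/hooks/post-tool-use.sh \
           .claude/logs; do
    if [ -e "$f" ]; then echo "✅ $f"; else echo "❌ $f missing"; fi
  done
}

audit_harness
```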

FAQ

Which layer do I need first?
Layer 1 (CLAUDE.md). Everything depends on clear memory. Start there.

Does this harness slow down Claude Code?
Barely. Hooks add ~100–300ms per tool use, a worthwhile trade for the safety. Observability logging has negligible cost.

What are the most important hooks?
PreToolUse (validation) and PostToolUse (anomaly detection). Those two prevent 80% of issues.

How many MCP servers is "too many"?
More than 5 becomes noise. More than 3 means you're probably adding tools you won't use. Start with 1–2, add more only when they solve a real workflow problem.

Can I skip permissions and just use hooks?
Technically yes, but no. Permissions are defense-in-depth. Hooks catch mistakes. Permissions prevent them.

How do I update CLAUDE.md over time?
Document it in MEMORY.md. "We added this rule because X failed." Over time, CLAUDE.md stabilizes.


Originally published on ShipWithAI. I write about Claude Code workflows, AI-assisted development, and building production systems with AI. Full blog + templates at shipwithai.io.

What's your harness score? Drop it in the comments. Do you have all 5 layers, or are you still at layer 1?
