DEV Community: Collin Wilkins

What Is an AI Agent Harness?

Collin Wilkins — Mon, 01 Jun 2026 23:56:41 +0000

This is for people trying to understand the infrastructure around large language models: Claude Code, Codex, Cursor, LangGraph, MCP servers, repo instructions, permissions, hooks, and all the agent plumbing that has popped up.

The short answer:

The model is the brain. The harness is the operating environment that makes the brain useful.

Or shorter:

Models generate answers. Harnesses generate trust.

Start with the agent

At the simplest level, an AI agent is a model inside a loop:

You give it a task
The model thinks about what to do
It answers or calls a tool
The tool does something
The result goes back to the model
The loop repeats until the task is done

That loop can be a few dozen lines of code or a whole product with permissions, memory, tools, logs, tests, and UI around it. The loop is the seed. The harness is what makes it safe and useful in real work.

Same model, different results

The easiest mistake with agents is blaming the model for everything. Sometimes the model really is the problem but it often isn't.

The same model and task can produce different output because agents are probabilistic plus the environment changes what the model sees, what it can do, how it gets checked, and when it is allowed to stop.

One setup produces a clean pull request, runs the tests, catches the edge case, and leaves a useful summary.

While another edits the wrong file, forgets the project conventions, says "done" without verification, and hands you a pile of slop.

A better model can help, a better harness narrows the variance.

Model vs. agent vs. harness

People blur these words together, so separate them first.

Term	Plain-English meaning	Example
Model	The brain that predicts, reasons, and writes	Claude, GPT, Gemini
Agent	The model plus a loop that lets it act	Claude Code fixing a bug
Harness	The system around the agent that guides and checks the work	Instructions, tools, memory, tests, hooks
Tool	Something the agent can use	Shell, browser, file search, calculator, MCP server
Memory	Context that survives beyond one prompt	`CLAUDE.md`, `AGENTS.md`, project memory, handoff notes

If you only remember one line:

A model thinks. An agent acts. A harness keeps the agent from acting like an idiot.

A harness is not a framework

A framework helps humans assemble agents. LangGraph, LangChain, and similar tools give you graphs, state, nodes, tool bindings, memory, middleware, and routing. You can use those pieces to build a harness.

But the framework itself is not automatically the harness.

A harness is the working environment around the agent. It runs the loop, exposes tools, injects project context, enforces permissions, verifies outputs, and keeps useful memory.

That distinction matters. If you buy or build a framework, you still own the harness design. If you use Claude Code, Codex, Cursor, or Windsurf, you are already working inside a harness. The question is how well it fits your codebase, risk tolerance, and workflow.

The simplest harness you already know: CLAUDE.md

Claude Code is a useful doorway into this idea because the harness is visible.

Every serious Claude Code setup has a CLAUDE.md file or the cross-tool version AGENTS.md. Anthropic's docs describe memory as persistent instruction Claude reads at the start of a session. It carries commands, structure, standards, workflow preferences, and recurring mistakes.

That is harness work.

Your CLAUDE.md isn't the agent or the model. It is one guide inside the harness.

A good one feels like a brief you would give a sharp contractor before they touched your codebase:

# CLAUDE.md

## Project
One sentence on what this project does and who uses it.

## Commands
- Dev: `npm run dev`
- Build: `npm run build`
- Type check: `npx tsc --noEmit`

## Architecture
- `src/lib/services/` - business logic
- `src/components/` - UI components

## Rules
- Never commit `.env` files or secrets
- Make minimal changes and avoid unrelated refactors
- Run type check after code changes
- Static export only, no server-side features

## Workflow
- Ask before making architectural changes
- Run tests before saying the task is done
- When unsure, explain the tradeoff

That file doesn't impact model intelligence. It makes the environment stricter and gives the model fewer ways to wander off.

The harness decides what context the model gets, what it can touch, how it proves the work, and what survives after the session ends.

What a harness actually does

A harness has two jobs:

Help the agent get it right the first time
Catch problems early enough that the agent can correct itself

Martin Fowler's framing is useful here. He splits a harness into guides and sensors.

Piece	What it does	Examples
Guides	Shape the agent before it acts	`CLAUDE.md`, `AGENTS.md`, skills, tool descriptions, project docs
Sensors	Check the agent after it acts	tests, linters, type checks, screenshots, review agents
Memory	Carries forward what should survive the session	project memory, handoff notes, session recaps, index files

Guides steer the agent before it acts. Sensors tell you whether the work held up. Memory is the notebook for future reference. A useful harness needs all three.

Pay attention to the last arrow. A harness controls how the agent acts this time and how the next run starts a little less cold.

The architecture underneath

Once you move past the simplest version, a modern harness starts to look like a small control plane.

The Arize primer frames nine pieces as one system: loop, context, tools, system prompt assembly, permissions, hooks, persistence, built-in skills, and sub-agent management.

That is what makes this more than prompt engineering. A prompt can ask the model to be careful. A harness can route the model through checks, permissions, and recovery paths that make care more likely.

The best harnesses push judgment to the model and keep control in the system. The model decides which files to work on. The harness decides whether it is allowed to modify them.

Same model, better harness

This is not just semantics.

LangChain published a useful example in February 2026. They kept the model fixed and changed the harness around their coding agent. The score moved from 52.8 to 66.5 on Terminal Bench 2.0.

A better harness improved the same model by roughly 14 points.

The changes included better prompts, tool setup, verification middleware, trace analysis, and loop detection. That is the pattern teams should steal.

Don't start by asking, "Which model will solve this?" Ask, "What does the model need around it to do this reliably?"

A simple example

Say you ask an agent to fix a failing test. Without much harness, the session often looks like this:

The model reads your prompt
It scans a few files
It changes some code
It rereads its own code
It says "done"

That sounds fine until you notice the missing step: it never ran the tests.

With a basic harness, the flow changes:

The agent starts and loads the project instructions
It sees the commands, conventions, and files that matter
It changes the code
The harness tells it to run the test command before exiting
If tests fail, the agent keeps going
When it finishes, it writes a short handoff note

The minimum viable harness

For most teams, a minimum viable harness has four layers:

Layer	Small version	Why it matters
Instructions	`CLAUDE.md` or `AGENTS.md`	The agent knows the project before you re-explain it
Tools	One or two tools it actually needs	The agent can act instead of only talk
Verification	Test, lint, type check, screenshot, review pass	The agent has to prove the work
Memory	Handoff note, index file, project memory	The next session doesn't start cold

That is already a harness. Keep it simple - the point is reliability, not complexity.

What this looks like in an engineering org

For an enterprise team, the harness becomes part of engineering operations. It's a shared operating agreement:

A repo-level AGENTS.md or CLAUDE.md with architecture, commands, boundaries, and review expectations
A small approved tool set: search, edits, shell, browser, test runner, docs lookup
Permission modes for read-only work, workspace edits, and dangerous actions
Hooks that block secrets, destructive commands, production credentials, or unreviewed deploy paths
Verification gates that make the agent run the same checks a human engineer would run
Session notes that explain what changed
A small eval set of real tasks that shows whether the harness is getting better or just louder

How to write the instruction file

The blunder with CLAUDE.md is treating it like a wish list.

"Be a senior engineer."

"Think step by step."

"Write clean code."

Fine, but mostly wasted space. Use the file for things the agent would otherwise get wrong:

Critical commands: build, test, lint, type check, run one file.
Architecture map: where things live and what belongs where.
Hard rules: the specific mistakes the agent must not make.
Workflow preferences: when to ask, act, change, and verify.
Out of scope: files, systems, or integrations the agent should not touch.

Keep it short.

Anthropic's current memory docs emphasize hierarchy, imports, recursive lookup, and specific instructions over vague guidance. A CLAUDE.md shapes behavior. It does not physically prevent bad actions.

For enforcement, you need settings, permissions, tests, hooks, or human review. That is the difference between a guide and a guardrail.

Where tools fit

Tools are the agent's hands.

They let the model search files, run commands, query APIs, calculate, open a browser, read a spreadsheet, or edit a document.

The common mistake is thinking more tools equals a smarter agent. Usually it means a more confused one. Tool-selection accuracy drops from around 43% to under 14% as the tool count grows, and every tool definition spends roughly 300-1,400 tokens whether it gets used or not.

Start with the smallest tool set that can do the job. Rewriting probably needs no tools. Fixing a bug needs file access and a test command. Researching current prices needs web search or an API.

One tool should have one clear job.

Bad:

manage_files(action, file, destination, overwrite, format, permissions)

Better:

read_file(path)
write_file(path, content)
delete_file(path)

Tool design is part of the harness.

Where memory fits

Memory is where people make this sound harder than it needs to be. There are two simple versions: what has been said in this session, and what should survive across sessions.

Early on, longer-term memory can be boring markdown:

what changed
what is still broken
what command to run next
what the next session should not repeat

That can live in a handoff note, an INDEX.md, a session recap, or a project memory file. In my note system, the agent reads an index, topic map, and CLAUDE.md before touching anything.

Not glamorous. Useful.

Workflows before agents

Not every problem needs a fully autonomous agent.

Anthropic makes a useful distinction: workflows follow predefined code paths, while agents dynamically decide their process and tool use. If the steps are predictable, use a workflow:

write outline
check outline
write draft
review draft

If the steps are not predictable, use an agent:

inspect this unfamiliar codebase
figure out why the tests fail
decide which files to change
verify the fix

Start with the simplest pattern that works. Most useful setups begin as workflows and only become agents when the model genuinely needs to choose the path.

When multiple agents make sense

Start with one agent and a good harness. Multiple agents make sense when the roles are meaningfully different:

one researches, one writes
one implements, one reviews
one can read sensitive data, one can execute actions
one routes tasks, specialists handle narrow domains

If the second agent doesn't have a different job, permission set, or evaluation role, you probably added complexity for vibes. The safer pattern is a supervisor:

User -> Main agent -> Specialist agent only when needed -> Verification

There is one multi-agent pattern worth knowing early: separate the builder from the judge.

Anthropic Labs described this in their long-running application harness: planner, generator, evaluator. The generator built. The evaluator inspected the result with Playwright, graded it, and sent feedback into the next sprint.

The principle scales down:

Small setup	Bigger setup
Write code, then run tests	Generator builds, evaluator tests in browser
Ask a second agent to review	Dedicated evaluator grades each sprint
Write a checklist before starting	Planner and evaluator negotiate what "done" means

The point is not "use three agents." The point is: don't let the same session that made the thing be the only judge.

What not to overbuild

Don't turn your first harness into a platform. Start with one job, one agent, one clear instruction file, one or two tools, one verification step, and a few test prompts. Then watch where it fails. When the same failure repeats, tweak the harness, iterate.

The practical build path

If you want to build your first harness today:

Write one sentence describing the agent's job.
List the tools it truly needs (follow least privilege!)
Write the rules it must follow.
Define the output format.
Add one verification step.
Test it on five real examples.
Add memory only when the next session needs something from this one.

You can ask an LLM to help:

I want to build an AI agent.

Goal:
[what I want it to do]

Example user requests:
[5 messy examples]

Tools it may use:
[web search / files / calculator / custom API / none]

Rules it must follow:
[non-negotiables]

It must never:
[boundaries]

Please turn this into an agent spec, system prompt, minimal tool list, verification checklist, and 10 test cases.

The formula is simple:

Agent = Role + Goal + Tools + Rules + Output format
Harness = Instructions + Tools + Verification + Memory

That is enough to start.

Takeaway

A harness is the setup around an agent that makes its work more reliable.

It isn't just an SDK or just a prompt. It isn't just CLAUDE.md.

It is the operating environment: instructions, tools, tests, memory, permissions, hooks, recovery paths, and feedback loops.

The model gives you capability. The harness decides whether you can trust it.

Some of the harness is durable: verification discipline, context preparation, running tests before calling something done. Some is scaffolding that dissolves as models improve. Know which half you're building on.

Taste Is a Moat

Collin Wilkins — Thu, 23 Apr 2026 13:59:47 +0000

"Taste is a moat." "Avoid AI slop." You see those lines everywhere right now.

But what do they actually mean? How is taste determined, and why does it matter so much?

Here are my thoughts...

Taste is subjective but it's not arbitrary. It's rooted in feeling — wanting to be liked, wanting to be accepted, wanting to seem smart. Eventually those motivations turn into pride and craft, and the craft is the thing you can see in the work (output).

Which is why AI can't get there.

Taste lives in the chooser

Taste is earned through lived judgment, the kind that separates us from them.

It comes from defending a choice to peers who pushed back. From late hours meeting a deadline. From social norms you broke (on purpose or by accident) and had to sit with. Each one leaves a small calibration mark.

Taste forms early. Maybe things came naturally to you, or maybe you had to grind through something you weren't good at. Either way, those hours of repetition drilled little thoughts into your head. "You gotta make the leap." "Keep that left foot down." "Moving too fast." Thousands of tiny feedback loops.

Taste only develops from doing. It doesn't come from reading, and it doesn't come from inference.

From System 2 to System 1

Another way I think about it comes from Kahneman's Thinking, Fast and Slow. System 2 is deliberate thinking — slow, effortful, conscious. System 1 is fast and automatic. Taste is what happens when System 2 runs so often on the same kind of decision that it drops into System 1. Choices aren't conscious anymore. You just know.

And it shows up everywhere. Clothing choices. Writing patterns. Speech. Knowing what to say at 2am to an angry tenant whose pipe broke. The email to the CEO with your big proposal. That's all taste in the form of judgment fast enough to feel like instinct. Real life teaches weight, proportion, what's important, what survives contact with reality.

Taste comes from omission too. What you decided not to do, what you cut, the font you avoided, the slogan you didn't choose. Taste is the refusal mechanism that protects quality.

You know it when you see it

Taste itself is hard to describe. It's like porn — you know it when you see it. The difference is almost always in the small details. Ken Griffey Jr.'s swing. Roger Federer's backhand. Paul Rand's logos. Frank Lloyd Wright's homes. Everything looks fluid and effortless, but only because the sweating happened a long time ago. In that sense, taste is a skill. It can be refined.

Poor taste is the easy mirror

You learned this early too, by making a crude comment about a friend's sister and getting punched in the nose. By making a bad joke nobody laughed at. You feel it in the used car salesman's pitch, in the bad piece of copy where you know you're being sold to. You can feel the calibration miss. Something pushy, or polished in the wrong way. A pitch that tried too hard, or didn't try hard enough.

Poor taste and slop feel low intelligence, and that's the point. Models getting smarter won't change it. The feeling is about whether anyone actually chose. Whether any decision in the work cost the maker something. Whether someone even cared. Output quality is downstream of all of that.

Where the moat lives

The copycat vs the originator. The amateur vs the master. Masters don't overlook subtleties, and it's always in the small details, because at some point those details were the whole problem. They had to sweat them to get where they are. The details stopped being small to them a long time ago.

And that is where the moat lives.

Is the DOE Framework Still Relevant in the Age of Claude Skills?

Collin Wilkins — Wed, 22 Apr 2026 19:02:31 +0000

Claude Opus 4.7 shipped this week. Every new release is the kind of moment that makes anyone running their own stack ask whether the shape still fits. Skills exist now, Managed Agents exist, Cowork ships features every few weeks. Why would anyone still roll their own directive layer?

Short answer: yes, still relevant. And the 4.7 rollout is a clean example of why.

MarginLab's Claude Code tracker showed 4.6 degrading through the back half of its run, enough that you could feel it in daily work. The status page had its own bumps alongside. That's the real risk when you're tied to one model: service degradation, uptime dips, mid-cycle quality drift. A stronger harness is what protects you when the model wobbles. Audited skills, reviewed outputs, guardrails around the failure modes you've already watched happen once. DOE is the shape of that harness.

I upgraded my sessions to 4.7 the morning it dropped and re-ran a few existing workflows. The output moved, and not always for the better. The harness is what catches what the model itself can't promise — reliable behavior across releases, not just within one.

A quick DOE recap

If you didn't read the original, the shape is three layers with a clean separation between them:

Directive. Plain-language instructions in markdown. Goals, inputs, expected outputs, edge cases. What should happen.
Orchestration. The control plane. Reads directives, routes work, manages errors and state. When and how.
Execution. Deterministic code doing the actual work. API calls, file ops, data processing. The part you don't trust the model to improvise.

Each layer gets simpler the more explicit the boundary between them is. That's still the point.

Full original appendix lives in an earlier post on enterprise AI practices.

What Skills replaced

Skills are nearly a 1:1 replacement for the Directive layer. They load on demand based on frontmatter descriptions, scope tool access per skill, and compose cleanly. You don't have to write a separate directive router, because Claude's harness does the routing for you through the description field. Invocation is mostly automatic — Claude picks the right skill based on the description, but you can also trigger one directly with a /skill command if you know the name.

Here's the concrete before/after. I used to run a slash-command directive called /review-script in my old DOE sandbox:

---
description: "Review an execution script with fresh eyes for quality, efficiency, and reliability"
argument-hint: [script-path]
model: claude-opus-4-5-20251101
---

## Your Role
You are a code reviewer sub-agent. Your job is to review the execution script at `$ARGUMENTS` with completely fresh eyes...

## Automated Analysis
!`python3 execution/review_script.py --script $ARGUMENTS --format text --output .tmp/review_report.txt 2>&1`

## Review Checklist
### 1. Effectiveness (Does it work?) ...
### 2. Efficiency (Is it fast?) ...
### 3. Reliability (Does it fail gracefully?) ...
### 4. Maintainability ...
### 5. Security ...

## Output Format
**Verdict**: APPROVE | REVISE | REWRITE
**Scores** (1-10): Effectiveness, Efficiency, Reliability, Maintainability
**Critical Issues:** [line] description → fix
**Recommendations:** [HIGH/MEDIUM/LOW] ...

Roughly 60 lines of directive. It worked, but it was bolted onto a slash command, hard-coded to a specific model version, and invoked the review script via a shell call embedded in the prompt. The invocation contract was rigid.

The replacement is a Skill:

---
name: agent-review
description: >
  Implementer-Reviewer-Resolver loop with fresh-context subagents.
  After any non-trivial implementation, spawns a reviewer with NO prior context
  to audit for correctness, edge cases, security, and simplification. If Critical/High
  issues found, spawns a resolver to fix them. Triggers on "review this implementation,"
  "agent review," "self-review," or "/agent-review".
---

Same job, generalized. The description field carries the trigger phrases so Claude auto-routes when I say "review this" without me having to remember a slash command. The model choice moved out of the frontmatter because it's an orchestration decision now, not a directive one — which means you can swap models globally without touching individual skills. The behavior is richer (three-agent loop instead of single reviewer) and the file is shorter. That's the upgrade.

For personal work, the directives/ folder in my old DOE sandbox has been deprecated. Skills do the job better. My agent-review skill replaced a loose "review after implementation" rule I used to stuff into CLAUDE.md. The rule lives in the SKILL.md, the prompts that shape the reviewer live in prompts/reviewer.md, and the whole thing only loads when the trigger fires. That's the directive pattern with better ergonomics.

Skills don't have to be pure directive — they can bundle execution too. The Anthropic docs describe the skill folder structure as:

my-skill/
├── SKILL.md           # Main instructions (required)
├── template.md        # Template for Claude to fill in
├── examples/
│   └── sample.md      # Example output showing expected format
└── scripts/
    └── validate.sh    # Script Claude can execute

Since skills can run bundled scripts in any language, you can pack a directive (SKILL.md) and its execution (scripts/) into one folder — the gray area where D and E collapse into a single unit. I lean D-only when the execution is shared across multiple directives and keep scripts external. But when the script only makes sense inside one directive and you want the whole thing to be copy-pastable, bundling is fine.

If your D layer is just markdown SOPs you read every session, Skills are a pure upgrade. There's also a growing community directory at skills.sh worth browsing before you build something that already exists.

What Skills didn't replace

Orchestration still matters for anything non-trivial. A few cases where I still want my own O layer.

Multi-model routing. When I ingest articles into my knowledge base, the context I need is in the article itself — not much logic involved. That goes to Sonnet because Opus would be burning tokens on a task where the intelligence premium doesn't buy me anything. When I'm planning an architecture decision or designing a workflow, that goes to Opus because the failure mode there is cheap mistakes I'll pay for later. Managed Agents is Anthropic-only, which is fine for isolated tasks but can't make per-step model decisions like this. Once you introduce non-Anthropic tools (n8n, Modal, Supabase triggers, a paid API outside the stack), the orchestration decisions aren't Anthropic's to make.

Hard role separation between sub-agents. A reviewer shouldn't be able to write files. A documenter shouldn't touch production code. That separation has to be enforced somewhere, and the model isn't where. More on this below.

If you're shipping a cloud-deployed workflow with end-users, DOE still earns its keep. Determinism matters more when the stakes aren't just yours.

Where each surface fits

Here's how I split work across the three Anthropic surfaces now.

Claude Code (Codex) plus the DOE pattern is still 70 to 80 percent of what I do. It's where the harness lives: memory, skills, sub-agents with scoped tools, the outer loops. Anything that touches a vault, a repo, a scheduled job, or a client deliverable runs here because the state and the tools need to be local.

Chat (Claude web, ChatGPT, Gemini) is 10 to 15 percent. One-off work that doesn't need memory or file access. Drafting a quick template, iterating on a prompt before it becomes a skill, summarizing a long paste I don't want to keep.

Cowork is a small but growing slice. Anthropic's scheduled-execution environment for tasks you don't want your laptop responsible for. Right now I use it for weekly vault housekeeping and a daily trend scanner that reads like an RSS feed for topics I follow. Cowork handles scheduling and state between runs.

Routines just shipped. Good fit for simple scheduled checks (the housekeeping task is a candidate), but the 1-hour minimum cadence and lack of conditional branching rule them out for the scanner, which needs multi-step logic. /loop is the session-scoped cousin — useful for polling while a terminal is open, dies when the session closes.

None of these surfaces replace DOE. They sit at different points in the stack.

Two kinds of self-annealing

The original DOE post used "self-annealing" for one thing when it should have been two.

Runtime self-annealing is handling a failure gracefully inside one run. My scanner gets rate-limited every so often, and the retry path catches it so the workflow doesn't break. If the primary API hits its token ceiling, it falls back to a second. Both were one-time setups I haven't had to revisit. Skills handle this kind of thing adequately because the model's problem-solving and the fallback logic both sit inside the skill's context.

System self-annealing is the setup getting smarter across runs. That's a different problem, and neither Skills nor Managed Agents solve it. You need your own outer loop, and that's where the interesting work sits now.

Three mechanisms in my setup that address it.

Persistent agent memory. My sub-agents have a memory: user or memory: project field in their frontmatter. There's a MEMORY.md per agent, first 200 lines auto-injected at startup. After each task the agent writes learnings back — recurring antipatterns, codebase conventions, failure modes it hit and wants to remember. Here's a trimmed excerpt from my content-writer agent's MEMORY.md:

- "Measure X. Measure Y. Measure Z." repetitive sentence openers are an AI tell.
  Vary with "Track," "Look at," "Pull."
- Bold-header advisory sections (5 consecutive **Bold.** paragraphs) need rhythm
  variation. Convert 1-2 to narrative paragraphs to break the wall.
- Composited openings (presenting survey/aggregate data as one person's story)
  erode trust. Either use the survey directly or clearly label the composite.

None of that was in the agent on day one. Every line was a correction I gave it during a specific task, which it wrote back so the next task starts ahead of where the last one ended.

The recent autoMemoryDirectory config in the Claude Code changelog turns this pattern into a first-class knob. What was a convention six months ago is now a supported flag, so you can point any sub-agent at a custom memory path without hand-rolling the plumbing.

Session recaps plus Map-of-Content breadcrumbs. Every session ends with a 1-3 sentence breadcrumb on the relevant topic hub, capped at eight entries. When the cap hits, the oldest rolls into a "Topic Status" block so the next session starts with compressed state instead of raw history. That's how context compounds across sessions instead of resetting each time.

A weekly housekeeping task that runs in Cowork. The job asks it to sync the vault index against the filesystem, refresh each topic hub's metrics and breadcrumbs, flag stale drafts and KB entries past their review date, and scan for keywords recurring in 5+ entries without a dedicated concept page. It's a Sonnet cron job that treats the vault like a codebase and runs a weekly lint pass on it.

The LLM doesn't improve itself. The system around it accumulates state that the next run reads. Layer-2 self-annealing lives in that loop. And it belongs to you.

Role bleed and scoped tools

There's a failure mode I watched happen too many times without having a good name for it: the orchestrator does work that belongs to a sub-agent. It fixes code instead of calling the reviewer. It writes docs instead of calling the documenter. You set up the sub-agents, you write delegation rules, and the O still reaches for the easier path because the friction of spawning a sub-agent is higher than the friction of just doing the thing.

A delegation protocol helps — invocation contracts, explicit role boundaries in markdown. That's the protocol layer and it matters, but it's a soft boundary. Nothing mechanically prevents the orchestrator from writing to the wrong file.

The mechanical fix is scoped tools. Here's the actual frontmatter on my code-reviewer sub-agent:

---
name: code-reviewer
description: "Expert code reviewer for security, performance, and quality audits."
tools: Read, Glob, Grep, Bash
disallowedTools: Write, Edit
model: opus
maxTurns: 15
memory: user
---

With disallowedTools: Write, Edit set, this agent cannot modify files. The harness enforces the separation, not the agent's discipline.

Caveat worth knowing: notice memory: user sitting right below disallowedTools. When a sub-agent has persistent memory enabled, Read/Write/Edit come back on automatically. The Claude Code docs put it plainly: "Read, Write, and Edit tools are automatically enabled so the subagent can manage its memory files." The enablement is global — not scoped to the memory path. Those two fields contradict each other silently. If you want hard mechanical scoping, either disable memory on the reviewer, or split the role across two agents — one memory-on for pattern learning, one memory-off with scoped tools for the actual review.

Both matter. Protocol tells the orchestrator when to delegate. Tool scoping guarantees the delegate can't overreach even if the orchestrator gets lazy. Practical rule: every sub-agent you build, ask what the minimum tool set for this role is, and block the rest in frontmatter.

This was missing from the original DOE post. I'm adding it to my operating rules.

The changelog lifecycle I didn't have

One piece of rigor I'd skipped entirely in the original post: a changelog with explicit item states inside the script itself. Every review finding gets a label (PENDING, SUGGESTED, IMPLEMENTED, DEFERRED, or WONT_FIX), and the orchestrator only acts on the ones marked critical.

Here's what it looks like in practice — a scraper with seven review passes tracked in the script's own changelog, each finding keyed with an ID:

## Review Pass 1 — 2026-02-22

**Reviewer verdict:** PASS WITH SUGGESTIONS

### Should Fix

| ID        | Location                    | Issue                                                                 | Status       |
|-----------|-----------------------------|-----------------------------------------------------------------------|--------------|
| VVNO-I001 | `_resolve_rating()`         | Rating guard silently drops a rating of `0.0`. Should use `is not None`. | IMPLEMENTED |
| VVNO-I002 | `extract_preloaded_state()` | Regex terminator requires a trailing newline. No `JSONDecodeError` guard. | IMPLEMENTED |

### Nice to Have

| ID        | Location                        | Issue                                                                 | Status       |
|-----------|---------------------------------|-----------------------------------------------------------------------|--------------|
| VVNO-S001 | `build_output_from_preloaded()` | `grapes` comprehension uses `g["name"]` — raises `KeyError` if missing. | IMPLEMENTED |
| VVNO-S002 | `main()`                        | Output path is cwd-relative instead of script-relative.               | IMPLEMENTED |

Pass 5 flagged three nice-to-haves. Pass 6 implemented them and noted one remaining. Pass 7 confirmed the fix and logged it closed. The orchestrator only touches Should Fix items on its own initiative. Nice to Have sits in the backlog until a human says otherwise.

That distinction prevents the orchestrator treating every finding as equal priority — which is how backlog items get closed silently, meaning the next reviewer run has no record they ever existed.

Karpathy described the same pattern a couple weeks ago — filing outputs back into a markdown wiki so explorations "always add up." A knowledge surface where every run makes the next one smarter.

Where the interesting work moved

DOE described separation of concerns, not 2025 tooling. Skills ate the Directive layer. Managed Agents handles Orchestration for simple cases inside the Anthropic stack. The Execution layer is still your Python, and it always was.

What changed is where the interesting work lives. It used to be in writing good directives. Now it's in building the outer loop that makes the whole system get smarter across runs. The session-level equivalent is context engineering — same harness thinking, aimed at what loads into a single conversation. Both matter. Both belong to you.

That's where I'd point anyone asking what to build next.

If you're wondering whether your version of this actually holds up or if your team is getting the most out of its AI adoption, that's the kind of assessment I do.

LLM Gateway Architecture: When You Need One and How to Get Started

Collin Wilkins — Mon, 06 Apr 2026 11:49:06 +0000

The monthly cloud invoice came in $12K higher than expected and nobody can explain it.

Engineering added Opus for a summarization feature... Product had QA testing vision with GPT-4o... the data team switched from Sonnet to a fine-tuned model on Bedrock three weeks ago and forgot to mention it...

This is the database connection problem, replayed for LLMs. Every service talking directly to an external provider, no abstraction layer, no visibility, no fallback. You solved this for database connections a decade ago with connection pools. The LLM gateway is the same pattern, and most mid-market engineering teams don't have one yet.

What an LLM Gateway Actually Does

An LLM gateway sits between your application code and your model providers. Instead of each service importing the OpenAI SDK or the Anthropic SDK or the Bedrock client and calling providers directly, every request routes through a single layer. Your code talks to the gateway. The gateway talks to the providers.

Think API gateway (Kong, Envoy), but built for LLM traffic patterns specifically. LLM calls stream responses, bill per token, throw provider-specific errors like Anthropic's 529 overloaded, and can run for 30+ seconds on complex prompts. A generic API gateway doesn't handle any of that well.

The practical value comes down to two things: reliability and cost visibility. Everything else the gateway does supports one of those.

On the reliability side, automatic fallback means Anthropic returns a 529 and the gateway retries on Bedrock. The outage becomes a log entry instead of a P1 incident. Prompt format differences between providers require some compatibility work upfront (system message handling, tool schemas), but once that's configured the failover is hands-off. Your application code calls one unified API regardless of which provider handles the request.

On the cost side, tag every request with team, feature, and environment, and suddenly you can say "the summarization feature costs $2,400/month and 80% of that is the QA environment." That sentence is impossible without the gateway. With it, the answer takes five minutes to pull up. Routing rules send classification to Haiku and generation to Opus from a config file instead of hardcoding model names across repositories. Per-team rate limits and budget caps keep a runaway loop from burning through your monthly allocation in an afternoon.

Cost visibility gets the gateway approved. Once the team sees automatic failover survive a provider outage at 2am without a page, nobody proposes removing it.

Do You Actually Need Multiple Providers?

Most teams don't need multiple providers yet. Every major provider ships a model family with tiers designed for exactly this kind of routing. Anthropic has Opus for complex reasoning, Sonnet for everyday code and logic, Haiku for classification and lightweight tasks. OpenAI has a similar spread. Google has Gemini Pro and Flash. One provider, three tiers, handles a surprising percentage of use cases.

The price gaps between tiers make this worth doing even without a gateway. As of April 2026, Claude API pricing per million tokens:

Model	Input ($/MTok)	Output ($/MTok)	Best For
Opus 4.6	$5.00	$25.00	Complex reasoning, coding agents, multi-step tasks
Sonnet 4.6	$3.00	$15.00	Balanced performance, general production workloads
Haiku 4.5	$1.00	$5.00	High-throughput, simple queries, cost-sensitive apps

Routing a classification task from Opus to Sonnet saves 40%. Routing it to Haiku saves 80%. If half your LLM traffic is simple classification and extraction running on Opus, those numbers compound fast.

This setup uses the same API, SDK, billing, and auth. This doesn't need a gateway. You can manage it with a model parameter that changes per task.

I run LeadSync this way. Haiku handles lead scoring, Sonnet handles email content generation, and the routing is a config value per task. Same pattern works for agent orchestration: route expensive models to code review and content scoring where errors cost the most, cheaper models to research and classification. None of it requires a gateway because it all runs through one provider.

So when does a gateway actually earn its keep? Provider redundancy is the big one — if Anthropic goes down, a gateway fails over to Bedrock or Azure OpenAI automatically. Cost arbitrage matters when Bedrock pricing differs from direct API pricing on the same model. Capability gaps force multi-provider setups when no single provider is best at everything (vision, code generation, long context, and structured output might each have a different best-in-class model). And compliance requirements make multi-provider routing mandatory when European customers' data needs to route through EU-hosted models.

If none of those apply yet, single-provider routing is the right starting point. Add the gateway when you actually hit the wall.

When to Add the Gateway

Single provider, fewer than 3 services — No gateway needed. Route by model tier in your app config. Revisit when you cross 3 services or $3K/month.
3+ services OR $3K+/month LLM spend — Centralized gateway. Start with cost tagging and one fallback provider.
Multiple providers required (redundancy, compliance, capability gaps) — Centralized gateway with multi-provider routing.
Data residency requirements — Layer edge routing on top.

If you can't answer "what is each team spending per feature per month," you need the gateway regardless of where you fall on this list.

Three Architecture Patterns

The deployment pattern depends on team size, how many services are making LLM calls, and whether you have data residency requirements.

Pattern	How It Works	Latency Impact	Visibility	Best For
Sidecar Proxy	Gateway runs as a library or sidecar alongside each service	Minimal (in-process or localhost)	Per-service only	Small teams, fewer than 3 services
Centralized Gateway	Dedicated service all LLM traffic routes through	One network hop	Full cross-service visibility	Mid-market teams, 3-20 services
Edge Routing	Gateway at CDN/edge, routing by geography or compliance zone	Variable by region	Full with regional breakdown	Multi-region, data residency

Sidecar proxy is the fastest way in. Import LiteLLM as a Python library, point your existing model calls at it, and you have basic routing and fallback working in an afternoon.

Centralized gateway is where most mid-market teams should land. Deploy LiteLLM in proxy mode (or Portkey) as a standalone service and point each application at the gateway's URL instead of the provider's. One dashboard shows every team's spend, every model's usage, every feature's cost.

Edge routing adds geographic or compliance-based routing on top. European requests go to EU-hosted models for GDPR, APAC to the closest region for latency. Most teams don't need this yet. If you don't have data residency requirements, Pattern 2 covers you.

The decision shortcut: fewer than 3 services, sidecar. Three or more, centralized. Data residency requirements, layer edge routing on top.

Routing Strategies That Actually Save Money

The gateway gives you routing. The strategy determines how much value you extract from it.

Cost-based routing has the highest impact and the simplest logic. A support ticket classifier doesn't need Opus. Haiku handles it for a fraction of the cost with comparable accuracy on well-defined tasks. The gateway lets you make that distinction in one routing table instead of hunting through application code for hardcoded model names.

Capability-based routing sends vision tasks to models with vision support, long-context requests to large-window models, and structured output requests to models with native JSON mode. Without a gateway this means importing four SDKs and writing provider-specific conditionals that nobody wants to maintain. With a gateway you define the capability map once and application code doesn't care which model handles the request.

Latency-based routing sends streaming chat responses to the fastest available provider and batch jobs to the cheapest. The gateway can measure provider performance empirically and shift traffic away from degraded providers before users start complaining. This is where the reliability engineering value shows up, since the gateway is making routing decisions based on real-time performance data rather than static configuration.

A/B testing routes a percentage of traffic to a new model, compares quality against the baseline, and promotes or rolls back. Without a gateway this means feature flags, comparison infrastructure, and new deployment code. With a gateway you change a routing weight and let it run.

Most teams combine cost-based with one other strategy. That covers the vast majority of the value.

Here's what a basic cost-based routing config looks like in LiteLLM proxy mode:

model_list:
  - model_name: "fast-classify"
    litellm_params:
      model: "anthropic/claude-haiku-4-5-20251001"
  - model_name: "generate"
    litellm_params:
      model: "anthropic/claude-sonnet-4-20250514"
  - model_name: "generate"
    litellm_params:
      model: "bedrock/anthropic.claude-sonnet-4-v1"

router_settings:
  routing_strategy: "simple-shuffle"
  num_retries: 2

Your application calls fast-classify for ticket routing and tagging, generate for content and reasoning. Two entries for generate means if the direct Anthropic API fails, the gateway retries on Bedrock automatically. The routing decision lives in this config file, not scattered across your application code.

Build vs. Adopt

Most teams should start with LiteLLM in proxy mode. It's open source, supports 100+ providers through a unified API, runs as a Python library or standalone proxy, and handles cost tracking, fallback, and rate limiting out of the box. SaaS alternatives like Portkey and Helicone exist if you don't want to run the proxy yourself, but the per-request pricing adds up. Building a custom routing layer is almost never justified — routing models by task complexity is a configuration problem, not a software engineering problem.

Getting It Into Production

The sequence matters more than the timeline. With AI-assisted scaffolding you can get through this in a few days, but doing the steps out of order is where teams get burned.

Deploy the proxy with one service. Point a single existing service at LiteLLM without any changes. If something breaks, you want to find out before migrating anything else.
Add cost tags. Team, feature, environment on every request. Let baseline data collect. This is where teams have their first real conversation about LLM spend, because the data almost always surfaces something nobody expected — QA running expensive calls around the clock, a retry loop doubling costs on one endpoint, a feature nobody uses still generating hundreds of requests a day.
Configure automatic fallback. Primary provider returns a 429 or 529, gateway retries on a secondary. Test by blocking the primary in staging while you're watching, not during an actual outage at 2am.
Downgrade one use case. Pick a task where you're using an expensive model for something simple and switch it to Haiku-class. Measure quality against your baseline. If it holds (and it usually does for classification and extraction), that's your first real cost savings. If quality drops, switch back and try a different task boundary.
Roll out and publish the dashboard. I know, another dashboard to worry about.

Migrate remaining services and share the cost dashboard with engineering leadership. Teams that can see their LLM costs start optimizing without anyone writing a policy memo.

What Goes Wrong

This section matters more than the implementation playbook, because the mistakes are where the real money goes.

The QA environment is the silent budget killer. A test suite running Opus calls against every PR, 24/7, with nobody reviewing the results. The fix takes five minutes once cost tagging by environment is in place, but without it the spend is invisible. This is the single most common cost surprise and it's also the easiest to fix, which makes it a good argument for the gateway all by itself.

Retry loops compound faster than you'd expect. A service gets a 429 rate limit, retries with exponential backoff, but the backoff ceiling is set too high and the service hammers the same provider with progressively more expensive calls (longer prompts on each retry because context accumulates). Gateway fallback routing eliminates this entirely since the retry goes to a different provider instead of beating on the rate-limited one.

Over-engineering the routing logic. The first strategy should be simple: expensive model for complex tasks, cheap model for simple tasks, one fallback provider. The teams that get the most value from gateways are the ones that start with simple routing rules and add more only when the cost data shows they need them.

Treating the gateway as a one-time cost savings project. Teams deploy the gateway, save 30% through routing, and call it done. They never build the cost dashboard or set up ongoing tagging for new services. Cost savings are great, but the bigger win is permanent visibility into what you're spending, where, and why. That requires treating the gateway as infrastructure, not a project.

I write about AI infrastructure and engineering every couple weeks. Subscribe to the newsletter if this was useful.

The Claude Code Productivity Paradox

Collin Wilkins — Wed, 11 Mar 2026 18:41:16 +0000

originally published at collinwilkins.com

Anthropic surveyed 132 of their own engineers about Claude Code. The numbers looked incredible. 67% more merged PRs per day. Usage jumped from 28% to 59% of daily work. Self-reported productivity gains between 20% and 50%.

Then someone checked the organizational dashboard. The delivery metrics hadn't moved.

That's the productivity paradox, and the gap between those two sets of numbers is where it gets interesting.

I've been running Claude Code as my main dev tool for months now (it's basically replaced my terminal workflow at this point). I've written about context engineering, specialized agents, and agentic orchestration patterns. All of that assumed AI coding tools deliver net-positive outcomes. Turns out the picture is messier than I thought.

The individual numbers are impressive

Those Anthropic survey numbers are all individual metrics — how much each engineer used the tool, how fast they felt, how many PRs they shipped. By any of those measures, the tool was clearly working.

The solo developer story is even more dramatic. A case study published in February 2026 documented one developer delivering what was scoped as a "4 people x 6 months" project in 2 months, working alone. That's a raw 12x multiplier on person-months (the kind of number that gets screenshot'd and passed around without context), and by my rough math, about 3x when you weight for task mix. The breakdown by task type tells the real story:

Task Type	Speedup
Boilerplate and scaffolding	~10x
Complex logic and debugging	~2x
Architecture and planning	Minimal

That distribution matters. The mechanical work got dramatically faster, but the judgment work barely moved.

This matches my own usage almost exactly. When I'm scaffolding a new module or wiring up boilerplate integrations, Claude Code flies. I can stand up a full project structure in minutes. But the architecture decisions, the "which service owns this data" conversations, the debugging where the root cause is three layers removed from the symptom? Those take the same time they always did.

Faros AI's analysis confirmed the same shape: 21% more tasks completed, 98% more PRs merged.

If you stopped reading here, the conclusion is obvious... Ship Claude Code to your whole team and watch the numbers climb!

Don't stop reading here.

The organizational numbers tell a different story

Faros AI measured DORA metrics on the same teams: deployment frequency, lead time, change failure rate, time to restore service. Unchanged. Meanwhile code review times increased 91%. The METR study found experienced developers on familiar codebases took 19% longer on real-world tasks while estimating they were 20% faster. Developers felt faster. The customer deliverables didn't move.

Where the speedup goes

Code generation got faster but everything downstream of it didn't. Planning, design, prioritization, code review, QA — still run at the same speed. When one stage of the pipeline accelerates and the rest stays flat, you get a pile-up at the next bottleneck, not faster delivery.

What I'm seeing in a lot of teams right now: AI writes the code, opens a PR, and then another AI tool (or human) on the review side suggests meaningful changes. That leads to additional back-and-forth after the PR is already open — churn that doesn't show up in "PRs merged" but absolutely shows up in cycle time.

Anthropic's own survey found that more than 50% of their engineers could "fully delegate" only 0-20% of their daily work to Claude Code. These are Anthropic engineers, on Anthropic's own tool, in an environment optimized for exactly this usage. If the people who built the tool can only fully hand off a fifth of their work, the ceiling for a typical team is lower.

I'd put myself in that 0-20% bucket too. Most of my Claude Code usage is collaborative, not delegated. I'm reviewing output, re-prompting when it drifts, catching architectural decisions the agent doesn't have context for.

The "build it because you can" trap

There's a subtler problem that doesn't show up in any of the studies. Because coding delivery sped up, more features feel feasible. A feature that would have taken two sprints now looks like a long afternoon, and that changes the calculus on whether it's worth building.

It shouldn't. The cost of building a feature was never just the implementation time. It's the maintenance, the cognitive load on the team, the opportunity cost of not building something else, the QA cycles, the documentation, the support burden. AI made the implementation cheaper. It didn't make any of those other costs cheaper.

What I'm seeing is that the bar for "let's just build it" has dropped. It's easy to prompt a new feature into existence, so naturally the threshold for opening a PR lowers. Teams should keep a high bar and think hard about whether a feature is worth shipping at all, regardless of how fast it can be coded.

A lot of teams are also freezing hiring or laying people off based on early perceptions of AI development speed. There's a common assumption that AI simply raises the bar for everybody. In my experience, that's not the case. The gains are uneven, task-dependent, and often illusory when you measure end-to-end.

A Hacker News thread from March 2026 captured the human side of this well. Comments offered the most useful frame: "Do you enjoy the 'micro' of getting bits of code to work, or the 'macro' of building systems that work? If it's the former, you hate AI agents. If it's the latter, you love AI agents." That split is real and doesn't resolve with better tooling. It requires honest conversations about what each person's role becomes.

What this means for team adoption

Your current metrics are probably measuring the wrong things. If you're an engineering manager trying to figure out whether AI coding tools are working, here's what the data actually says.

The first thing I'd change is what you're counting. More PRs per developer is a real number. It doesn't mean the team ships better software faster. If review times are climbing and defect rates are flat, the bottleneck moved from writing to reviewing. Measure the bottleneck, not the part that got faster.

Invest in review infrastructure before scaling AI-generated output. The review time increase isn't a tooling problem you fix with a faster CI pipeline. That's a structural problem. If you're rolling out AI coding tools to a team without simultaneously expanding review capacity, you're building pressure on the part of the pipeline least equipped to absorb it.

Set expectations by task type, not tool type. The speedup distribution from that solo dev case study is the most useful number to take away from this data. Boilerplate flies. Architecture doesn't move.

Track the boring metrics. If you're measuring AI tool ROI through surveys and individual PR counts, you're measuring perception. Track cycle time end-to-end. Look at defects per deploy. Pull time-from-commit-to-production.

Don't let the tool lower your refactoring standards. When you're iterating on a feature, the original design sometimes calls for a refactor. When the LLM can work around the existing structure, the willingness to do that refactor drops. Fight that. Leave the codebase better than you found it, same as always.

Review AI output in a fresh session. AI is biased in the code it writes. Common patterns, familiar abstractions, the path of least resistance. The best way to catch those inefficiencies is to review with fresh eyes, outside the context window that produced the code. A thorough human review in a separate session will catch things that in-context review misses.

Don't use AI to bulldoze friction. The friction you feel during development, the code review pushback, the design debate, the test that keeps failing, that friction exists for a reason. Using AI code generation to power through it faster doesn't remove the underlying problem. It just ships the problem to production. These are the same engineering practices we've always applied.

What you're actually measuring

Confusing "more output" with "better outcomes" is how teams make expensive adoption decisions. The teams that get real value from Claude Code won't be the ones that hand it to every developer and watch PR counts climb. They'll redesign their workflow around the new shape of the work — where writing is cheap, reviewing is expensive, judgment calls haven't gotten any easier, and half the features that feel feasible probably aren't worth building.

I've built my personal workflow around these tools. This piece is about whether they're actually working.

Both questions matter. Most teams are only asking the first one.

Context Engineering for AI

Collin Wilkins — Tue, 03 Mar 2026 11:52:58 +0000

Originally published at collinwilkins.com

This one is long and a more detailed follow up to my first article on the topic on this

Two things before we get into it:
first, you'll walk away with at least one thing you can apply this week to get more consistent results from your AI coding tool.

Second, the examples throughout use Claude Code — that's my daily stack, not an endorsement. Every principle here applies to Cursor, Copilot, or whatever you're running.

Let's start with an example that probably happened to you this week...

You asked your AI coding tool to add a new API endpoint. It generated exactly what you need — right naming convention, file location, and imports. You closed the task in 15 minutes.

Next morning, you asked for another endpoint. It used a naming pattern from a framework you dropped three months ago. The file landed in the wrong directory. It imported a library that's no longer in the dependency tree. You spent 40 minutes cleaning it up.

Then a teammate tried the same tool on the same codebase. Their output matched neither of yours.

Same model and codebase produced three completely different results. The variable nobody names: what the AI could actually see. Controlling that is a discipline and many developers aren't doing it.

The Debate Everyone's Having Wrong

A few weeks ago, an HN thread along the lines of "Cursor's context is 10X better than Claude Code's" hit the front page with 150+ points and hundreds of comments. Developers trading war stories about which tool retrieves the right files, which one hallucinates project conventions, which one actually understands a large codebase.

The thread was comparing tool features — how Cursor auto-indexes and retrieves files by semantic similarity versus how Claude Code relies on explicit file reads and instruction routing. Worth knowing.

But none of it explains why the same tool, same codebase, same developer produces solid output on Tuesday and unshippable output on Thursday. That gap is context engineering.

Context engineering is the discipline of controlling what information an AI coding tool has access to, how that information is structured, and what instructions govern its behavior. It's distinct from prompt engineering (what you say in a given session) and model selection (which AI you use). You can write perfect prompts and pick the most capable model and still get inconsistent results if the context is wrong.

I went into this in more detail here, this variability is actually designed into models to help give them reasoning

Developers who understand this produce more consistent work than any tool comparison would predict. The ones still debating Cursor vs. Claude Code are optimizing the wrong variable.

Why Context Quality Determines Output Quality

Every AI coding tool generates predictions from everything in its context window. That's not just your last message. It includes the files the tool read earlier in the session, the instruction files it loaded at startup, documentation it retrieved, and the full conversation history. You're getting a response to everything the model has seen, not just what you typed.

The Primacy Problem. Models recall information near the beginning and end of their context window better than material buried in the middle. The implication is direct: your most important instructions — naming conventions, anti-patterns, what to never modify — belong at the top of your config files, not tucked into section 7 after a wall of boilerplate. Instructions at line 300 of a bloated CLAUDE.md are functionally invisible no matter how well-written they are.

Ask your AI tool: "What naming conventions does this project use?" If it answers correctly without reading a specific file, your context is working. If it asks for clarification or gives you a generic answer, your context engineering needs work.

Garbage in, garbage out applies at the context level, not just the prompt level. A well-crafted prompt can't compensate for context that's missing, outdated, or structurally wrong.

Most developers understand this in the abstract. They just haven't mapped which layer they're actually investing in. That mapping explains almost everything about inconsistent results.

The Four Layers of Context

Most developers treat context as a single thing: what they've said so far this session. It's actually four layers with very different durability and very different impact.

Layer	What It Is	How Long It Lasts
Project structure	Folder names, file naming, co-location of decisions	Permanent
Instruction files	CLAUDE.md, Cursor rules, .github/copilot-instructions.md	Permanent
File-level docs	Comments, type annotations, explicit naming	Permanent
Session context	Files read this session, conversation history	Ephemeral

Most developers spend all their energy on Layer 4. The right prompt for this session. Better instructions in this particular message. Layer 4 resets every session. Everything you engineer there disappears at session end.

Layers 1-3 are permanent. They work whether you're logged in or not. They benefit every session, every developer on the team, every AI tool that touches the codebase.

The math is simple: one hour invested in a CLAUDE.md instruction file compounds across every future session. One hour spent crafting a better prompt compounds across exactly one. Fix the bottom layers and the top layer takes care of itself. The best place to start is your instruction file.

CLAUDE.md Patterns That Actually Work

Every major AI coding tool has an equivalent to CLAUDE.md. Cursor has .cursor/rules. GitHub Copilot has .github/copilot-instructions.md. OpenAI and others use AGENTS.md. Same principle across all of them: a file the AI reads at session start that shapes its behavior for everything that follows.

this file is loaded automatically at the start of every session!

Most teams write this file wrong. They treat it like a README.

A README explains your project. A table of contents tells you where everything else lives. Your instruction file should be a navigation layer: pointers to where conventions are documented, not exhaustive documentation of those conventions. When the agent needs your GraphQL design patterns, it gets routed to the right file. The patterns don't live in the root config. The root config tells the agent where to find them.

Primacy applies here too. Put project overview and critical anti-patterns at the top. This is deliberate architecture, not formatting preference.

A monolithic root CLAUDE.md that's 800 lines long is context bloat with a reading time penalty every session. Move subdirectory-specific conventions into CLAUDE.md files within those subdirectories. The root file stays lean.

Most teams skip the exclusion zone entirely. Naming what the AI should never touch matters as much as naming what it can do. Generated code, migration files, lock files, vendor directories — put them on an explicit list. Relying on the model to infer off-limits territory is how you get PRs that modify auto-generated files.

Add YAML frontmatter to your project documentation. When docs, ADRs, and notes carry structured metadata, they become machine-queryable. Ask the agent for "anything tagged with payment-flow" and it surfaces the right files rather than grepping blindly. That's the closest thing to semantic search without native support.

Here's a minimal skeleton that reflects the structure that works:

# Project Name -- CLAUDE.md

## Overview
[2-3 sentences: what this is, what it does, the core stack]

## Folder Map
src/api/        - Route handlers. One file per domain.
src/services/   - Business logic. Stateless functions only.
src/models/     - Prisma schema and type definitions.
docs/           - Project documentation. Read before making architectural changes.
docs/decisions/ - Architecture Decision Records. One file per major decision.

## Tech Stack
- Node.js 22 / TypeScript 5
- Prisma + PostgreSQL
- Next.js 15 App Router

## Naming Conventions
- Components: PascalCase
- Utils and hooks: camelCase
- Files: kebab-case
- Database tables: snake_case

## Do Not Touch
- /migrations   - auto-generated, never edit manually
- /generated    - prisma client output, run `npx prisma generate` to rebuild
- src/vendor/   - third-party code, not ours

## Anti-Patterns
- No raw SQL -- use Prisma queries
- No `any` types -- use proper types or `unknown`
- No default exports -- named exports only

This is roughly the structure this vault uses, adapted for a software project. It took many iterations to reach something stable. That's the nature of the document. You evolve it, you don't write it once so version control it with your other code changes.

The ROI data on this is concrete. Aakash Gupta's PM OS (news.aakashg.com, Feb 2026) used a well-crafted CLAUDE.md with skills and sub-agents to reduce PRD creation from 4-8 hours to 30 minutes. Harry Zhang called CLAUDE.md the "highest ROI habit" in Claude Code. Faros AI's 2026 measurement of Claude Code usage across engineering teams found roughly 4:1 ROI — cost per PR around $37.50 against 2 hours saved at $75/hour. Not controlled studies. Consistent practitioner reports. The pattern holds across enough setups that dismissing it as anecdote is a mistake.

A tight instruction file is necessary but not sufficient. It works better when your project structure isn't fighting it.

Project Structure as Context Contract

Before the AI reads a single instruction file, it's already forming a model of your codebase from its structure. Folder names are documentation. File names are documentation. The way you organize things tells the agent what belongs where, what relates to what, and what conventions you follow — automatically, without you saying a word.

Most teams don't think about this as context engineering. I've watched this exact disconnect produce mysterious, inconsistent AI output on every large codebase I've touched. The structure is sending signals the team never intended to send.

Consistent naming matters more than you'd expect. If some components are named UserCard, some are user-card, and some are UserCardComponent, the agent is receiving three different signals about the same thing. It can't infer a convention from contradictions. It produces output that matches whichever form it saw most recently, not the correct form. Three inconsistent names is three opportunities for the wrong suggestion.

Keep tests, docs, and decisions next to the code they describe. A test file two directories away from its source module is context the agent might never retrieve. A test file in the same directory gets read automatically when the source gets opened. Don't make the agent hunt. It won't always find what it's looking for, and you'll pay for that in bad output.

A docs/decisions/ folder earns its keep fast. One file per major architectural choice, written when you make the decision. When the agent is working in the payments layer and a relevant ADR exists, it surfaces the reasoning behind how things are built. Without ADRs, the agent sees the what and invents the why.

A good practice is to write the architectural map or code in a lookup table in this section so there's a quick reference for AI to get up to speed on a codebase (every session is 'new').

Deeply nested folder hierarchies are a hidden context tax. Every level of nesting increases the probability that relevant files fall outside the context window when the agent is working on something nearby. Flat structure with clear naming outperforms deep hierarchies for AI-assisted work. If your project is necessarily deep, your instruction file routing has to be precise enough to compensate.

Structure produces consistent context. Even perfect structure can't fix a bloated context window, though. That's where most sessions quietly break down.

Managing Your Context Window

These examples use Claude Code mechanics because that's what I work in daily. Every serious AI coding tool has equivalents. The pattern matters more than the specific command.

Check your window before it checks you. Claude Code's /context command shows token counts for your current session: input tokens used, output tokens, cache status. When input tokens are approaching the model's limit, output quality degrades. Responses get shorter. Suggestions get less precise. Hallucinations increase. By the time you notice the quality drop, you're already in it. Check before starting long tasks, not after.

130k of 200k tokens used (65%). Messages account for 106.4k tokens — over half the window consumed by conversation history alone. Free space: 35k. This is the threshold where output quality starts slipping.

Compact vs. new session. The /compact command summarizes your current session and rebuilds a condensed version. Use it when you're mid-task, need to shed conversation weight, and the working context (decisions made, files read, direction established) is still relevant to where you're going.

A new session starts clean. Use it when the task is complete, when you're switching domains, or when accumulated context has drifted from what you're actually doing. Old context isn't neutral. It's noise that pulls the model toward decisions you've already discarded.

/compact rebuilds the session: Claude re-reads the key files it needs, restores skills, and collapses the conversation into a condensed summary.

After compact: 51k/200k tokens (25%). Messages dropped from 106.4k to 25.4k. Free space jumped from 35k to 116k. The working context survived. The dead weight didn't.

Compact preserves momentum. A new session preserves clarity. Clarity usually wins.

Three modes, three different situations. Plan mode (/plan) makes the AI propose before touching anything. Use it for multi-file changes, anything touching shared infrastructure, or any task where you're not certain what the blast radius is. The proposal step isn't overhead. It's the difference between reviewing a plan and reviewing a broken implementation.

Accept with edits is the default for most sessions. The AI does the work, you verify.

Bypass or auto-approve is appropriate only when Layers 1-3 are solid. When the AI knows your conventions, when it has explicit anti-patterns to follow, when the context is tight — that's when giving it autonomous scope makes sense. Better context engineering is how you earn the right to give your agent autonomy.

/think before complex decisions. This command forces the model to reason explicitly before responding. Use it for architecture decisions, hard debugging, anything where the first answer is likely wrong. You're not changing the response. You're changing the quality of the reasoning that produces it.

These three files are your agent's persistent identity:

~/.claude/CLAUDE.md (user-level): your preferences across all projects — editor config, communication style, how you want code commented
./CLAUDE.md (project-level): project conventions, folder structure, anti-patterns
AGENTS.md (root): behavioral rules for specific agents or workflows

Without them, every session starts from zero. The agent has no memory of what you've built, what you've decided, or what you've told it to avoid. With them, the agent isn't starting from zero. It already knows what you're building and what you've decided. That's the difference between a tool you configure once and a tool you re-brief every morning.

Build that identity well and you'll want to extend it. That's where the context budget comes in.

Context Budget: Skills vs. MCP

This tradeoff applies to any framework that extends an AI agent's capabilities. It shows up in every serious setup. Most developers don't think about it until sessions start degrading.

Skills are lightweight instruction files loaded at session start. They tell the agent how to do something: a workflow, a content pattern, a code review checklist, a task it performs the same way every time. You write it once. The context cost is fixed and paid once at session start.

MCP (Model Context Protocol) connects the agent to real-time external services. The agent calls a tool mid-session: a database query, a live API call, a current data source. The cost is variable, paid per call, and it stacks. Every MCP call loads tool schemas, the call result, and server responses into the context window.

This compounds. Three MCP calls per task, ten tasks in a session — that's 30 discrete context injections on top of everything else. A skill-based equivalent, where the workflow is pre-written and the agent follows it, has a fraction of that pressure.

Use MCP when you genuinely need live data. Current timestamps, a real database query, an API response that changes between calls. The output can't be pre-written because it depends on real-time state.

Use skills when you have a repeatable workflow. Content patterns, review checklists, tasks the agent performs identically every time. Pre-write it once, reference it for as long as the workflow holds.

The decision rule: if you can write it down and have it work 90% of the time, write it as a skill. Every unnecessary MCP call is a context tax paid on every execution. You can even use the MCP first to get something working quickly then create your own skill to get that output.

Auditing Your Context Setup

Your AI should answer these questions from context alone, without reading a specific file:

What naming convention does this project use for components?
What's the tech stack?
What are the top 3 things I should never change in this codebase?

If it can't answer question 3, you don't have explicit anti-patterns documented. That's the biggest gap in most instruction files, and the first thing to fix.

Signs your context is bloated:

The AI asks you to clarify things it should already know
Suggestions don't match project conventions
Errors reference wrong library versions or deprecated APIs

If this is happening you should /compact or start a new session, you aren't getting optimal results anyways.

Signs your context is working:

The AI refers to project conventions without being prompted
Suggestions match naming patterns on first pass
It knows where things live without being told

The bloated context problem is almost always a Layer 2 issue. The instruction file grew to document everything, contradicts itself in places, and buries the most critical rules in the middle where recall degrades. Trim ruthlessly. Move subdirectory-specific rules to their subdirectory. Keep the root file focused on what's true across the whole project.

That's when context engineering stops feeling like maintenance and starts paying for itself.

The Cursor vs. Claude Code debate will be irrelevant within a year. Something will ship that makes both look dated. The debate restarts around whatever that tool is — same framing, same wrong frame.

Context engineering won't be irrelevant. The principles — what the AI can see, how it's organized, what contracts you've written to govern its behavior — apply to whatever ships next. You're building fluency with a discipline, not a product.

Open your instruction file. Find the first section that doesn't exist yet. Anti-patterns list. ADR folder reference. Exclusion zone. Write it this week. One section, one hour, permanent improvement.

Master the harness. The horse will change.

The AI Coding Model Wars: How Open Source Is Closing the Gap on Proprietary Coding Models

Collin Wilkins — Fri, 27 Feb 2026 15:01:17 +0000

Originally published at collinwilkins.com

Four major coding models launched in six days. Two proprietary. Two open source. The benchmark gap between the best and worst? Just 2.6 percentage points.

That number is the story of February 2026. There isn't a single model that is clearly winning. What matters now is which model fits your workflow, your budget, and how much you care about keeping your code off someone else's servers.

The week that broke the leaderboard

On February 5, Anthropic released Claude Opus 4.6 and OpenAI shipped Codex 5.3. Same day. Two very different philosophies, both claiming the top spot in coding performance.

Six days later, Zhipu AI dropped GLM-5. A 744-billion parameter open-source model under an MIT license. It scored within 1.6 points of Opus on SWE-bench. At roughly 1/45th the cost.

Then Kimi K2.5 from Moonshot AI. One trillion parameters, open source, agent swarm architecture that can coordinate 100 sub-agents in parallel.

Here's where things stand:

Model	SWE-bench Verified	Input Cost (per MTok)	License
Claude Opus 4.6	79.4%	~$5.00	Proprietary
GLM-5	77.8%	~$0.11	MIT
Codex 5.3	~77.3% (Terminal-Bench leader)	~$1.75	Proprietary
Kimi K2.5	76.8%	Open weight	Open Source

Sources: aifreeapi.com, Interconnects.ai, Winbuzzer

Razor-thin. Two years ago, the gap between the best and fifth-best model on any coding benchmark was 15+ points. Now the top four sit within a few points of each other and the rankings shuffle depending on which benchmark you pick.

Interconnects.ai put it well: workflow fit matters more than leaderboard position. I'd go further. If you're choosing a coding model based on SWE-bench scores alone, you're optimizing for the wrong thing.

The real differences are in how these models work, what they cost, and what you're allowed to do with them.

The proprietary heavyweights

Claude Opus 4.6

Opus 4.6 is the deep thinker. Its headline feature is Agent Teams, the ability to spin up 16+ parallel agents that coordinate on complex tasks. Anthropic demonstrated this by having agent teams build a 100,000-line C compiler across 2,000 sessions (Interconnects.ai).

The philosophy is autonomous. Give it a complex problem, set guardrails, let it work. A 1-million-token context window means it can hold entire codebases in memory, and deep reasoning chains let it plan multi-step refactors that other models lose track of halfway through.

The tradeoff is cost. At ~$5/MTok input, a heavy agentic session gets expensive fast. That C compiler demo reportedly cost $20,000 in API spend. I've run smaller agent workflows that still burned through $50-100 in an afternoon. For enterprise teams where engineer time costs more than API credits, that math works. For a solo dev, it probably doesn't.

Best for: Complex multi-file refactors, architectural changes, enterprise workflows where correctness matters more than cost.

Codex 5.3

Codex takes the opposite approach. Where Opus goes deep and autonomous, Codex goes fast and collaborative.

It leads Terminal-Bench at 77.3%, which measures terminal-based coding tasks closer to how developers actually work than isolated benchmark problems (Interconnects.ai). The real strength is interactive steering: you can redirect it mid-task without breaking context or restarting the conversation.

At ~$1.75/MTok input, that's about 3x cheaper than Opus. The ecosystem around it is mature, with deep integration into VS Code, GitHub Copilot, and the broader OpenAI toolchain.

Every.to described the split well: Opus is the model you set loose on a problem. Codex is the model you pair-program with.

Best for: Fast iteration, interactive development, teams already invested in the OpenAI ecosystem.

The philosophical split

This matters more than the benchmarks.

Opus says: "Tell me the goal, I'll figure it out." That works when the task is complex enough that you'd spend hours on it yourself. It fails when you need tight feedback loops or when the cost of an autonomous run gone sideways exceeds the cost of doing it manually.

Codex says: "Let's work on this together." That works for the daily grind. Writing functions, debugging, building features incrementally. It fails when you need sustained multi-step reasoning across a large surface area.

The model you want depends on how you work, not how it benchmarks. I keep Opus for architecture-level tasks and reach for Codex-class models when I'm iterating fast on implementation. Most days are implementation days.

But the proprietary debate is only half the story. The open-source models that showed up a week later made the whole conversation more interesting.

The open-source challengers

GLM-5

GLM-5 is the model that changed the math.

744 billion parameters in a Mixture-of-Experts architecture. MIT license. 77.8% on SWE-bench Verified, within 1.6 points of Opus 4.6.

At ~$0.11 per million input tokens through Zhipu's API, that's roughly 45x cheaper than Opus for comparable coding performance.

But cost isn't even the most interesting part.

GLM-5 was trained entirely on Huawei Ascend chips, no NVIDIA dependency. It's self-hostable. Because it's MIT-licensed, you can fine-tune it on your proprietary codebase without worrying about licensing terms.

The tooling ecosystem moved fast. Within days of release, GLM-5 was working with Claude Code, OpenCode, and Roo Code as a drop-in backend. Simon Willison noted that it handled agentic coding workflows comparably to proprietary alternatives. Multi-step tasks with tool use. The stuff that actually matters for real development work.

$0.11/MTok for 77.8% SWE-bench performance, MIT-licensed, self-hostable. Read that sentence again if you're still paying $5/MTok for routine coding tasks.

Best for: Budget-conscious teams, self-hosted environments, privacy-sensitive codebases.

Kimi K2.5

K2.5 from Moonshot AI takes a different angle on open source. One trillion total parameters with 32 billion active (another MoE architecture), but the standout feature is the agent swarm system. It can coordinate up to 100 sub-agents making 1,500 tool calls in parallel.

It scores 76.8% on SWE-bench Verified. Slightly below GLM-5 on pure coding benchmarks. But it has two things the others don't: strong frontend/visual understanding and native agent orchestration at a scale that would require serious custom infrastructure to replicate with other models.

If you're building something that involves UI generation, design-to-code workflows, or massive parallel agent tasks, K2.5 is worth evaluating. I haven't tested it as deeply as GLM-5, but the agent swarm capability is genuinely novel.

Best for: Frontend and visual tasks, large-scale agent orchestration, teams experimenting with multi-agent architectures.

Why open source matters now

The performance argument is settled. Open-source models match proprietary ones on coding benchmarks. The remaining arguments are about everything else.

GLM-5 at $0.11/MTok vs Opus at $5/MTok. For teams processing thousands of coding tasks per day, that's the difference between a rounding error and a budget line item. At that ratio, you could run 45 GLM-5 tasks for the cost of one Opus task. The volume math gets absurd fast.

Self-hosted means your code never leaves your infrastructure. For regulated industries, defense contractors, or anyone with strict data residency requirements, this isn't a nice-to-have. It's a hard requirement. I've talked to teams in healthcare and fintech who won't touch any cloud-hosted model for their core codebase. GLM-5 with an MIT license is the first model that gives them frontier-tier coding capability without that tradeoff.

There's a harder question behind the self-hosting argument, though. GLM-5 and Kimi K2.5 both come from Chinese companies — Zhipu AI and Moonshot AI, respectively. China's 2017 National Intelligence Law requires organizations to cooperate with state intelligence work. Multiple governments have already responded: the US banned Chinese AI models from government devices, Australia followed, Taiwan and Italy took similar action. CrowdStrike found that DeepSeek-R1 produces insecure code when prompted with politically sensitive topics. The scrutiny isn't theoretical. It's policy.

The distinction that matters is hosted API versus self-hosted weights. Using Zhipu's API at $0.11/MTok means your code routes through Chinese servers — a non-starter for most enterprises and outright banned in some jurisdictions. Self-hosting the MIT-licensed weights means your data never leaves your infrastructure, and Chinese intelligence law doesn't apply to weights you downloaded and run locally. This is actually the strongest argument for the open-source license. The MIT license isn't just a cost play. It's the escape valve that makes these models usable for teams that would otherwise never touch them.

Fine-tuning on your own codebase means the model learns your patterns, your conventions, your internal APIs. Proprietary models can't offer this. And if Zhipu raises prices or changes terms, you have the weights. You can host them anywhere. Bitdoze noted this portability as a key factor driving enterprise adoption.

The catch is real though. Self-hosting a 744B parameter model requires serious hardware. You're trading API costs for infrastructure costs. For many teams, the managed API at $0.11/MTok is the pragmatic choice anyway. But the option to self-host is what creates competitive pressure on pricing across the board.

When to use what

Skip the "which is best?" question. Wrong frame. The right question is "which is best for this task?"

Use Case	Recommended Model	Why
Complex multi-file refactors	Opus 4.6	Deepest reasoning, Agent Teams, 1M context
Fast iteration and pair programming	Codex 5.3	Speed, interactive steering, mature ecosystem
Budget-conscious / high-volume	GLM-5	Frontier quality at 1/45th the price
Self-hosted / privacy-first	GLM-5 (self-hosted)	MIT license, self-hostable, avoids Chinese API data routing concerns
Frontend / visual / design-to-code	Kimi K2.5	Strong vision capabilities, UI generation
Large-scale agent orchestration	Kimi K2.5	100 sub-agents, 1,500 parallel tool calls
Simple tasks (formatting, linting, boilerplate)	Haiku / GPT-4.1 mini / Flash	Don't overthink it. Cheap and fast wins here.

I wrote about the model selection framework in more detail. The core principle is matching capability to complexity. Using Opus to format a JSON file is like renting a crane to hang a picture frame.

The table above is a starting point. Your actual workflow will be messier. You'll find tasks that fall between tiers, models that surprise you on tasks they weren't "supposed" to handle, and edge cases where the cheap model is actually better because it doesn't overthink. Test on your workload. The table gives you a starting hypothesis.

The multi-model future

The teams getting the best results aren't picking one model. They're routing.

Simple tasks go to cheap, fast models. Complex tasks go to frontier models. Nobody runs a single EC2 instance type for their entire infrastructure. Same principle applies here.

The tooling supports this now. Claude Code, Cursor, Continue, and OpenCode all support model switching or multi-model configurations. You can set your default to a cost-efficient model and escalate when the task warrants it.

What a practical multi-model workflow looks like:

Scaffolding, boilerplate, simple edits → Haiku or GLM-5 (~$0.10-0.25/MTok)
Feature implementation, debugging, test writing → Codex 5.3 or Sonnet (~$1-3/MTok)
Architecture decisions, complex refactors, multi-file changes → Opus 4.6 (~$5/MTok)
Privacy-sensitive codebases → GLM-5 self-hosted (infrastructure cost only)

The cost difference compounds. A team that routes 80% of tasks to a cheap model and 20% to a frontier model might spend 5-10x less than a team that runs everything through Opus. The quality difference on those routine tasks? Negligible. I've tested this across a mix of refactoring, test generation, and boilerplate tasks. The cheap model handles 80% of them fine. The 20% where you need Opus, you really need Opus. But you don't need it for the other 80%.

GLM-5 at $0.11/MTok makes a great default for routine tasks, with Opus as the escalation path for hard problems. Even if you never self-host, even if you stay fully proprietary for your critical work, the existence of GLM-5 at that price point changes the economics of your entire workflow.

What comes next

The competitive picture will keep shifting. New models will launch. Benchmarks will get closer. Pricing will drop. That trend line isn't changing.

But the lesson from February 2026 is already clear. No single model wins everything. Each has a philosophy. Open source isn't "catching up" anymore; it's competitive, and the cost and privacy arguments seal it for many teams. Multi-model workflows are the pragmatic path forward, and the tooling finally supports them without duct tape.

If you're still defaulting to one model for every coding task, you're either overpaying or underperforming. Probably both.

Pick one task you're currently routing to an expensive model. Try it on GLM-5 or a smaller model. Measure the difference. You might be surprised how little you lose.