If you let an AI agent loose on a non-trivial codebase, two things happen. First, it gets a lot done. Second, it gets a lot done in the style of whatever it last read. Drop it into a file with anemic models and inline authorization checks, and the next thing it writes will be an anemic model with an inline authorization check. Agents are mirrors with momentum.
This post is about how we stopped fighting that and started using it. It covers two iterations of a system we call the harness — the set of files, rules, and workflows that constrain what an agent produces in our Laravel + React monorepo. The first iteration was a scattering of CLAUDE.md files in subdirectories. The second is a .claude/ folder with rules, agents, commands, and skills. The second is dramatically better, and the reasons are worth writing down.
The audience here is engineers who are picking up agentic coding and want a concrete pattern they can copy.
The Problem: Agents Are Only as Good as the Code They Read
The single most important sentence in our harness documentation is this one:
The harness is only as strong as the code it governs.
Agents learn from context. If your codebase is internally inconsistent (three ways to spell the same role check, two patterns for authorizing a request, four flavors of API resource, etc.) the agent will pick one, then another, then another. You will get a polite version of the worst code in the repo, smeared across every new feature.
So the harness has two jobs:
- Tell the agent what good looks like before it writes a line of code.
- Verify mechanically that what came out matches.
Everything else is plumbing.
Iteration One: Subdirectory CLAUDE.md Files
Our first attempt was the obvious one. Claude Code reads a CLAUDE.md from the project root automatically. So we wrote one. Then we noticed it was getting long. There were different rules for API controllers vs. legacy web controllers vs. React components vs. database migrations. So we did what felt natural: we split it up by directory.
```
app/Actions/CLAUDE.md
app/Http/Controllers/Api/CLAUDE.md
app/Http/Controllers/Web/CLAUDE.md
app/Http/Resources/CLAUDE.md
app/Policies/CLAUDE.md
app/Services/CLAUDE.md
database/migrations/CLAUDE.md
resources/js/spa/CLAUDE.md
tests/CLAUDE.md
```
Each file held the rules for the layer it sat in. The root CLAUDE.md had a table pointing at all of them. We shipped it in a PR and it worked, in the sense that the agent stopped writing fat controllers and started using our action-class pattern.
What it didn't do well:
- Loading was all-or-nothing. The agent had no clean way to load only the rules relevant to the file it was editing; we were feeding it documentation through grep and intuition.
- No tool restrictions. A "code reviewer" was just a prompt — nothing stopped it from writing files mid-review.
- No path scoping you could trust. Subdirectory `CLAUDE.md` files were a convention, not a mechanism. They got ignored as often as they got read.
- Workflows lived in our heads. The TDD loop, the pre-commit checks, the review process — all of it was in chat history and tribal memory.
It was scaffolding, not a system.
Iteration Two: The .claude/ Folder
Six weeks later we restructured the whole thing. The new layout looks like this:
```
.claude/
├── rules/               # 19 files — guidance, path-scoped or always-loaded
├── agents/              # 8 files — read-only specialists with tool restrictions
├── commands/            # 1 file — the pre-commit runner
├── skills/              # 2 directories — multi-phase TDD workflows
├── settings.json        # permissions, denylist, project config
└── settings.local.json
```
The migration was mechanical: 10 subdirectory CLAUDE.md files became 10 path-scoped rule files. Eight rules that applied everywhere got pulled out of the root CLAUDE.md into their own always-loaded files. CLAUDE.md itself dropped from 606 to 447 lines, HARNESS.md from 854 to 507. No information was lost! The rules just landed in places where the right tool could find them at the right time.
But the real change wasn't the file layout. It was naming the parts.
The Four-Part Model
Once we had room to think, we noticed the harness was doing four distinct things, and that the parts were getting confused with each other. Naming them helped a lot.
The four parts in detail:
1. Guidance
CLAUDE.md, HARNESS.md, and the .claude/rules/ directory. These tell the agent what to write before it writes anything. Always-loaded rules apply universally (design-principles.md, security.md, prohibited-patterns.md). Path-scoped rules auto-load when the agent touches a matching file (api-controllers.md activates when editing app/Http/Controllers/Api/**).
This is the biggest lever you have. A 50-line rule file that says "thin controllers, business logic in actions, return resources" will reshape every controller the agent writes.
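For concreteness, here is a sketch of what a path-scoped rule file might contain. The frontmatter key and every rule below are illustrative, not the contents of our actual `api-controllers.md`:

```markdown
---
# Illustrative path scope: activates when the agent edits matching files.
paths: app/Http/Controllers/Api/**
---

# API Controllers

- Keep controllers thin: validate the request, delegate to an action class, return a resource.
- Business logic lives in `app/Actions/`, never in the controller.
- Responses go through an `app/Http/Resources/` resource, never a raw model or array.
- Authorization goes through a policy, never an inline role check.
```

Note how little it takes: a handful of declarative sentences, each one a decision the agent no longer has to guess at.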
2. Guardrails
make lint, make test, make test-js. Mechanical checks, wrapped in make targets. The agent runs them; it does not bypass them. We added a deny list to settings.json that blocks direct php and npm invocations. Instead, the agent has to go through the make targets, which means it has to go through the same checks the humans do.
Guardrails are not optional. Guidance shapes what the agent writes; guardrails verify it. Skip either and you've only built half a harness.
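The denylist lives in Claude Code's `settings.json` under `permissions`. A sketch, with illustrative patterns rather than our exact list:

```json
{
  "permissions": {
    "deny": [
      "Bash(php *)",
      "Bash(npm *)",
      "Bash(npx *)"
    ],
    "allow": [
      "Bash(make lint)",
      "Bash(make test)",
      "Bash(make test-js)"
    ]
  }
}
```

The effect is that the only executable paths through the toolchain are the ones humans already use.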
3. Flywheel
The feedback loop. When a reviewer says "no, we don't do it that way here," the response is not "okay, I'll remember." It's:
- Update the relevant file in `.claude/rules/`.
- Reload it into context.
- Re-attempt the change.
Every review improves every future conversation. The harness compounds. After three months, the rule files contain decisions that would otherwise live as oral tradition in Slack threads from December.
This is the part that turns a configuration file into a system.
4. Executable Workflows
.claude/agents/, .claude/commands/, .claude/skills/. These encode multi-step processes that humans used to run from memory.
- Agents are read-only specialists with tool restrictions. `review-security` can read files and grep, but it cannot write: its job is analysis, not action. `review-functional`, `review-risk`, and `review-security` use Opus (deep reasoning); `review-structural`, `review-deployment`, and `ci-diagnose` use Sonnet (fast pattern matching). You can spawn five reviewers in parallel and get back five independent reports.
- Commands are single-step utilities. `/pre-commit` detects what changed, runs the right checks, reports results.
- Skills are multi-phase TDD workflows. `implement-jira-card` walks from requirements through tests through implementation through PR.
The same work used to require remembering "okay, now run lint, now run the right test suite, now spawn a reviewer, now update the PR." Now it's one command.
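An agent definition is just a markdown file with YAML frontmatter. A sketch of what `review-security` might look like — the frontmatter fields follow Claude Code's subagent format, but the body wording here is illustrative, not our actual file:

```markdown
---
name: review-security
description: Read-only security review of a diff. Reports findings; never edits.
tools: Read, Grep, Glob
model: opus
---

You are a security reviewer. Examine the changed files for authorization gaps,
injection risks, and leaked secrets. Report each finding with a file and line
reference and a severity. Do not modify any files; your job is analysis.
```

The `tools` line is the enforcement: with only read tools granted, "report, don't fix" is a property of the agent, not a request in the prompt.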
And the thing that makes them all work: discipline
This is the uncomfortable part. The harness can be perfect and your codebase can still poison the agent. If the most-touched controller in the repo has bad patterns, every new controller will have bad patterns, no matter what the rule file says. The agent reads code more than it reads documentation.
So:
- Write all new code to harness standards. The rules define the target; new code must hit it.
- Refactor when you touch. Leave it better than you found it. Small steps, never big rewrites.
- Delete dead code aggressively. Unused code teaches patterns nobody wants.
- Prioritize cleaning the code the agent sees most often.
Without this, the other three parts are decoration.
One Way to Picture It
It helps to think of the harness as a series of constraints on the space of possible code the agent could produce. Each component narrows the space. What's left at the end is code you actually want to ship.
Drop any one of these and the funnel leaks. Skip guidance and the agent picks patterns at random. Skip workflows and the process becomes tribal knowledge. Skip guardrails and nothing verifies what came out. Skip discipline and the codebase teaches the agent the wrong things faster than the rules can correct them.
Concrete Benefits We Saw
This isn't theoretical. Specific things got better:
Less context-window waste. Path-scoped rules mean the agent only loads react-spa.md when it's editing React. The 869-line file isn't sitting in context while the agent fixes a database migration.
Reviewers that actually behave like reviewers. Tool restrictions on agents enforce read-only behavior. A security reviewer can't accidentally "fix" a finding it spotted — it has to report and let the implementing agent decide.
Parallelism that pays off. Five review agents (functional, structural, security, risk, deployment) run in parallel on every diff. They produce independent reports. They disagree sometimes, and the disagreements are useful.
Workflows that survive turnover. The TDD loop used to be in our heads. Now it's in .claude/skills/implement-change/. A new engineer can trigger the same disciplined process without having absorbed three months of chat history.
Settings that block the wrong thing. The denylist in settings.json enforces "make targets only." This sounds petty until you realize how often npm test and make test-js should be the same command but aren't, because one of them runs additional checks the other skips.
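A sketch of why the make target and the bare npm command diverge; the exact extra steps here are illustrative:

```make
# `make test-js` is the sanctioned entry point. A bare `npm test` would run
# only the test suite and skip the typecheck and lint that CI requires.
test-js:
	npm run typecheck
	npm run lint
	npm test
```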
Documentation that compounds. Every code review now ends with a rule update. The rule file becomes the authoritative answer to "how do we do X here." When the next engineer asks, the rule is already written.
What to Take From This
If you're starting an agentic coding harness in your own codebase, here's the order I'd do it in now:
1. Write a root `CLAUDE.md` first. It can be short. Cover the tech stack, the one or two architectural decisions that matter most, and how to run tests.
2. Add path-scoped rules as you notice patterns the agent gets wrong. Don't write them speculatively. Wait until the agent makes a mistake, then write the rule that prevents it.
3. Wire up your test and lint commands as make targets. Block direct invocation in `settings.json`. The agent should pass through the same gates a human PR does.
4. Add review agents once you trust the rules. Read-only, tool-restricted. Spawn them in parallel on diffs.
5. Treat every review as a chance to update a rule. This is the flywheel. Without it, the harness rots.
6. Clean the code the agent reads most. This is the discipline part. It's also the part everyone wants to skip.
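A starter root `CLAUDE.md` in the spirit of step 1 can be this short (contents illustrative):

```markdown
# Project Notes for Agents

Laravel + React monorepo. Controllers stay thin; business logic lives in
action classes; API responses always go through resources.

## Commands
- `make lint`: static analysis and formatting checks
- `make test`: PHP test suite
- `make test-js`: JS typecheck, lint, and tests

Never run `php`, `npm`, or `npx` directly; use the make targets.
```

Everything else can wait until the agent earns a rule by getting something wrong.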
The harness is not magic. It's a small library of plain markdown files and a permissions config. What makes it work is treating it as a living system. One that gets better every time you use it, and one that depends on the codebase being honest about what good looks like.
The sentence to remember: the harness is only as strong as the code it governs. Build both.

