Zayhan

Posted on May 25

Why Your AI Coder Keeps Inventing Helpers You Already Have

#ai #claude #harnessengineering #harness

You ask Claude Code to add a new endpoint. It writes the endpoint. The endpoint works. The tests pass. You merge it.

Two weeks later you notice the endpoint:

imports lodash (you removed lodash three quarters ago)
throws raw errors (your repo standardised on Result<T, AppError> two years ago)
defines its own formatTimestamp helper (you have one in utils/date.ts)
puts validation in the controller (your team puts it in a middleware layer)
logs with console.log (you use Pino, with structured fields)

None of it is wrong in isolation. It's all plausible code that any reasonable senior engineer would write on a different codebase. On yours, it's wrong in five separate ways at once, and now you're either reviewing it for the fifth round, rewriting it yourself, or merging it and accruing the kind of quiet entropy that turns a clean repo into a tour of every JavaScript trend since 2017.

This is not a model intelligence problem.

The unwritten rules problem

Every codebase older than a quarter accumulates a dense layer of conventions that nobody bothered to write down because everyone on the team already knows them. Things like:

"We don't add helpers to utils/, they go in the module that owns them."
"Don't throw — return Result<T, AppError> so the boundary handler can pattern-match."
"Domain types live in core/, never import from infra/."
"Errors get logged at the boundary; never inside business logic."
"Zod schemas are the source of truth for both validation and types."

These rules are real. They're enforced by every code review. They're the difference between a codebase that compounds and one that decays. And the model has no way to know any of them, because they don't appear in any file the model can read. They live in your team's review comments, in Slack threads from 2023, in the muscle memory of whoever has been here longest.
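To make one of these conventions concrete, here is a minimal sketch of the "don't throw, return Result" rule. This post doesn't define Result<T, AppError>, so the type shapes, the ok/err helpers, findUser, and toHttpStatus below are all hypothetical stand-ins, not code from any real repo.

```typescript
// Hypothetical shapes: the post names Result<T, AppError> but does not show a definition.
type AppError =
  | { kind: "not_found"; resource: string }
  | { kind: "validation"; field: string; message: string };

type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

interface User {
  id: string;
  name: string;
}

// Business logic returns a Result instead of throwing.
function findUser(users: User[], id: string): Result<User, AppError> {
  if (id.trim() === "") {
    return err<AppError>({ kind: "validation", field: "id", message: "id must not be empty" });
  }
  const user = users.find((u) => u.id === id);
  return user ? ok(user) : err<AppError>({ kind: "not_found", resource: `user:${id}` });
}

// The boundary handler pattern-matches on the error kind in one place.
function toHttpStatus<T>(result: Result<T, AppError>): number {
  if (result.ok === false) {
    switch (result.error.kind) {
      case "not_found":
        return 404;
      case "validation":
        return 422;
    }
  }
  return 200;
}
```

A model that has never seen this convention will reach for throw and try/catch, which is reasonable in general and wrong here.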

When the model writes "plausible code that doesn't fit your codebase," it's not making things up. It's drawing on the median pattern across millions of repos. The median pattern is not your pattern. The median pattern cannot be your pattern, because your pattern is a specific accumulation of choices made in your specific context.

So you tell yourself you'll write a CLAUDE.md.

Why CLAUDE.md alone isn't enough

A flat CLAUDE.md is the right instinct in the wrong shape. Three things go wrong:

It's not based on observation. Most CLAUDE.md files are written from memory — the conventions you happen to think of in the moment. The codebase contains many more rules than any one person remembers, and the ones you remember are biased toward the recent and the painful, not the load-bearing.

It's loaded into every request whether relevant or not. Conventions about error handling matter when generating service code. They're noise when generating a migration script. A single context-blob doesn't know which rules to highlight for which task.

Nothing checks that the code actually followed it. You can list every rule in your CLAUDE.md and the model can still ignore it, and you'll only find out at review time. There's no programmatic gate between "code generated" and "code lands in front of you."
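As a rough illustration of what such a gate could look like, the sketch below checks a single rule from the earlier list ("Domain types live in core/, never import from infra/"). The function name, the Violation shape, and the regex are assumptions made for this example, not Espalier's actual hook, and the check only handles single-line import statements.

```typescript
// Hypothetical rule check: files under core/ must not import from infra/.
interface Violation {
  file: string;
  line: number;
  specifier: string;
}

// Matches `import ... from "x"` and bare `import "x"` on one line. A sketch, not a full parser.
const IMPORT_RE = /^\s*import\s+(?:[^'"]*?\s+from\s+)?['"]([^'"]+)['"]/;

function findLayerViolations(file: string, source: string): Violation[] {
  // Only files in the domain layer are subject to this rule.
  if (!file.startsWith("core/")) return [];

  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    const match = IMPORT_RE.exec(text);
    if (!match) return;
    const specifier = match[1];
    // Flags both alias-style ("infra/db") and relative ("../infra/db") specifiers.
    if (/(^|\/)infra(\/|$)/.test(specifier)) {
      violations.push({ file, line: index + 1, specifier });
    }
  });
  return violations;
}
```

A check like this can fail a hook or a CI step with the exact file and line, which is the kind of feedback a prose rule in CLAUDE.md cannot give.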

What you actually want is closer to how a tenured engineer onboards a new hire: spend a week reading the code together to find the patterns, write them down, then hold the new hire accountable to those patterns at review.

The espalier metaphor

Espalier is the ancient horticultural practice of training a fruit tree to grow flat along a wall. The result is pruned, wired, productive, and impossible to mistake for a wild tree. The Romans started it. Medieval Europeans turned it into an art. Done right, you get more fruit in less space, and the tree is healthier for being constrained.

That's the mental model. You're not going to make the AI coder smarter (it is already smarter than your CLAUDE.md gives it credit for). You're going to prune and wire it: discover the shape your codebase wants, then constrain generation to grow along that shape.

I built a Claude Code plugin called Espalier Engineering that automates this. Here is the shape of it.

What gets generated

Run /espalier-init once on your repo. Ten to fifteen minutes later you have a per-project espalier/ directory:

espalier/
├── rules/         # always-loaded: structure, coding standards, dev process
├── skills/        # phase-loaded: coding, review, testing, requirements
├── agents/        # delegated: harness-coder, harness-reviewer (different tool sets)
├── wiki/          # on-demand: architecture, data models, critical paths
├── hooks/         # programmatic gates: layer boundary checks, pre-push gate
├── pipeline.md    # 10-stage workflow with explicit gates and rollback
└── changes/       # typed audit trail per requirement

Three things matter about that structure:

Rules are observed, not prescribed. The init phase fires around ten scouts in parallel — separate sub-agents that go read the architecture, the tests, the CI config, the existing layer boundaries, and the historical conventions. They report back, and what they find becomes the rules. Nothing is imported from a template.

Context is layered, not flattened. Rules live in rules/ and are always loaded. Skills live in skills/ and load only during their phase (coding, review, testing). Agents see only their scope — the coder gets Write/Edit; the reviewer gets only Read/Grep/Glob/Bash. The wiki loads on demand when an agent asks for it.

Gates are programmatic. A pre-push hook blocks pushes at the wrong pipeline stage. The reviewer cannot approve a P0 violation. CI status is checked as ci_status == 'success' AND tests_passed == total_tests, not "check if CI looks okay."
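The layering and tool scoping from the second point can be modelled in a few lines. This is a sketch under assumptions, not the plugin's implementation: the ContextDoc shape, the selectContext function, and the file names in the usage note are hypothetical, and the read tools granted to the coder are a guess (this post only says the coder gets Write/Edit).

```typescript
type Phase = "coding" | "review" | "testing" | "requirements";
type Tool = "Read" | "Grep" | "Glob" | "Bash" | "Write" | "Edit";

// Hypothetical description of one file under espalier/.
interface ContextDoc {
  path: string;
  layer: "rule" | "skill" | "wiki";
  phase?: Phase; // only meaningful for skills
}

// Tool scopes: only the coder can modify files.
// Assumption: the coder also has the read-only tools.
const TOOLS: Record<"coder" | "reviewer", Tool[]> = {
  coder: ["Read", "Grep", "Glob", "Bash", "Write", "Edit"],
  reviewer: ["Read", "Grep", "Glob", "Bash"],
};

// Picks which docs enter the context for a given phase.
function selectContext(docs: ContextDoc[], phase: Phase, requestedWiki: string[] = []): string[] {
  return docs
    .filter((doc) => {
      if (doc.layer === "rule") return true; // rules: always loaded
      if (doc.layer === "skill") return doc.phase === phase; // skills: only in their phase
      return requestedWiki.includes(doc.path); // wiki: only when asked for
    })
    .map((doc) => doc.path);
}
```

With a rule file, a coding skill, a review skill, and a wiki page registered, selectContext(docs, "review") returns only the rule file and the review skill. The wiki page stays out until an agent requests it by path.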

The two pipelines

After init, you get two orchestrators.

/espalier <requirement> is the full 10-stage pipeline for features and refactors:

requirement → reqs review → coding (sub-agent) → code review (different sub-agent)
→ tests → test review → push → CI verify → deploy verify → user confirmation

Every stage has a programmatic gate. Failed gates roll back. Repeated rollbacks escalate to a human.
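One way to picture gates, rollback, and escalation is as a small loop. This is a sketch, not how the plugin is implemented: rolling back exactly one stage and the maxRollbacks threshold of 2 are assumptions, and only the CI condition is taken from this post.

```typescript
interface CiReport {
  ci_status: string;
  tests_passed: number;
  total_tests: number;
}

// The CI condition quoted earlier in this post, as a boolean gate.
const ciGate = (report: CiReport): boolean =>
  report.ci_status === "success" && report.tests_passed === report.total_tests;

interface Stage {
  name: string;
  gate: () => boolean; // true means the stage's output passed its check
}

type Outcome =
  | { status: "done" }
  | { status: "escalated"; stage: string; rollbacks: number };

// maxRollbacks is a hypothetical threshold for handing over to a human.
function runPipeline(stages: Stage[], maxRollbacks = 2): Outcome {
  let index = 0;
  let rollbacks = 0;
  while (index < stages.length) {
    const stage = stages[index];
    if (stage.gate()) {
      index += 1; // gate passed: advance to the next stage
      continue;
    }
    rollbacks += 1;
    if (rollbacks > maxRollbacks) {
      return { status: "escalated", stage: stage.name, rollbacks };
    }
    index = Math.max(0, index - 1); // gate failed: roll back one stage and redo it
  }
  return { status: "done" };
}
```

The property that matters is that a failed gate never falls through to the next stage: the run either repeats earlier work or stops and asks a human.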

/espalier-fix <bug> is a slimmer 5-stage lane for bugs, with one extra trick: a Stage 0 that uses git blame plus a reverse-lookup cache to find the feature change that introduced the bug, and links the fix back to it. Six months later, when someone wonders "why does this feature have four fixes against it" — the audit trail is right there in pipeline-state.md.
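The Stage 0 lookup could be pictured like this. Everything here is an assumption for illustration: the cache shape (commit SHA to change ID), the function name, the change ID format, and the choice to attribute the bug to the change that owns the most blamed lines. Parsing real git blame output is left out.

```typescript
// Hypothetical reverse-lookup cache: commit SHA -> ID of the feature change that produced it.
type ChangeCache = Map<string, string>;

// blamedShas holds one commit SHA per suspicious line, as git blame would report them.
// Returns the change ID that owns the most blamed lines, or undefined if none are known.
function findOriginatingChange(blamedShas: string[], cache: ChangeCache): string | undefined {
  const counts = new Map<string, number>();
  for (const sha of blamedShas) {
    const changeId = cache.get(sha);
    if (changeId === undefined) continue; // commit is not in the audit trail
    counts.set(changeId, (counts.get(changeId) ?? 0) + 1);
  }

  let best: string | undefined;
  let bestCount = 0;
  counts.forEach((count, changeId) => {
    if (count > bestCount) {
      best = changeId;
      bestCount = count;
    }
  });
  return best;
}
```

Linking each fix to a change ID this way is what lets the audit trail later answer how many fixes a given feature has attracted.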

What changes, concretely

This is the part where I'm supposed to show a glossy before/after chart with arbitrary percentages. I'll do something more honest: tell you what shifted in my loop.

Before. I'd describe a feature to Claude Code. It would generate plausible code. I'd notice the lodash import, the wrong error type, the helper that already exists. I'd push back. It would regenerate. Closer, but now it's putting validation in the wrong layer. Push back again. Three to five rounds before code landed in the shape I'd have written it.

After. I describe the feature to /espalier <req>. The coder sub-agent loads my project's rules and writes code that uses Result, imports from the right paths, and reuses the helper. The reviewer sub-agent — a different invocation with no write tools — checks it against the same rules and flags only real things, not style. Most requirements land in one review round. I review business logic, not whether the code "feels like ours."

The token math is a side benefit. Without Espalier, the agent rediscovers your conventions from scratch on every request, reading source files into context and burning tokens on the same exploration each round. With Espalier, conventions load once (around 3K tokens, always cached). A medium-size feature without Espalier consumes 2–4× the tokens of a single /espalier invocation, because the agent does 3–5 rounds and reloads each time. Across a quarter of work, the /espalier-init cost pays for itself many times over.

The /espalier-init run itself costs $2–5 on a medium repo (Opus main + Sonnet scouts + cache hits) and takes 10–15 minutes. It's a one-time tax. Every /espalier and /espalier-fix after that is light.

When not to run it

I want to be honest about where this does not make sense:

Throwaway prototype that won't see 5+ feature requests.
Single-file script with no meaningful conventions to discover.
Solo project where you're happy hand-coding everything.
A codebase still in violent churn — discover-and-encode assumes there is something stable to discover.

If your repo is older than a month, has real conventions, and you're going to keep iterating on it for weeks or longer, the tax pays back inside the first handful of features.

Install

If you're on Claude Code's plugin path:

/plugin marketplace add Junhanliu-dev/espalier-engineering
/plugin install espalier-engineering@espalier-engineering
/espalier-init

Manual clone + symlink is in the README if you'd rather iterate locally.

v0.5.0 added doc-drift detection — a post-merge hook flags rules and wiki entries that have rotted as the codebase evolves, and /espalier-doctor does a periodic re-scout. Nothing is ever auto-overwritten: every refresh is gated.

The philosophy, in one line

The principle that organises everything:

When an agent makes an error, engineer its elimination — not with prompt tweaks, but with files, rules, automated checks, and system structure.

Prompt engineering scales linearly with how much you can hold in your head. Structural engineering scales with the codebase. If you find yourself writing the same "please use our error type" correction for the fourth time, that's not a model problem — that's a missing file.

Project page: junhanliu-dev.github.io/espalier-engineering — full walkthrough, the two pipelines visualised, honest cost breakdown.
Source on GitHub: Junhanliu-dev/espalier-engineering. MIT licensed. Issues and PRs welcome — schema and templates are still moving, and I genuinely want to know what your stack expects that Espalier doesn't yet generate.

If you try /espalier-init on a real codebase, I'd love to hear what it found that surprised you. Half the value of this thing is that the discovery scouts surface conventions even the team had forgotten they had.
