Yuan

Posted on May 29

I stopped prompt-engineering my AI coding agent. I started engineering the repo instead.

#ai #productivity #opensource #programming

The itch

Quick hello first: I'm a developer based in South Korea, and English isn't my
first language — so I'll keep this plain and let the code and the repo do most
of the talking. (If a sentence reads a little stiff, that's the translation tax,
not the idea.)

If you've handed real work to an AI coding agent, you know the pattern.

It edits the wrong file. It re-introduces an approach your team rejected three
months ago. It invents a folder structure nobody agreed to. You correct it in
the chat, it says "You're absolutely right," and then in the next session it
does the exact same thing, because the next session starts with none of that
context.

I spent weeks getting better at prompting. Better system messages, better
examples, sharper instructions. It helped one conversation at a time. But every
new task, every new branch, every teammate's agent started from zero again.

At some point the framing flipped for me: prompt engineering improves one
interaction. It does nothing for the environment that interaction happens in.

The reframe: engineer the repo, not the prompt

So I stopped optimizing prompts and started optimizing the repository. The
idea — I've been calling it harness engineering — is to turn the implicit
context an agent keeps missing into durable artifacts that live in the repo and
survive every session.

It comes down to four components, plus a way to keep them from rotting:

| Component | Job | Typical files |
| --- | --- | --- |
| Instruction document | Tell the agent how to behave | `AGENTS.md`, `CLAUDE.md` |
| Architecture constraints | Block invalid structure before merge | linters, type checks, import rules |
| Feedback loops | Correct behavior fast | tests, CI, pre-commit, examples |
| Knowledge store | Preserve decisions and dead ends | `docs/decisions`, `docs/failures` |

And the part most people skip: garbage collection — drift checks that fail
when docs reference missing files, when temp files sneak in, when the structure
wanders away from what you agreed to.

The operating principle that ties it together:

Every recurring agent failure should become at least one durable artifact — a
clearer instruction, an automated constraint, a test or CI check, a decision
or failure record, or a drift check.

You're not trying to make the agent perfect. You're trying to make the project
easier to understand and harder to damage.
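
To make the garbage-collection idea concrete, here is a minimal sketch of one such drift check: it fails when temp files sneak into the repo. This is an illustration, not the kit's actual script. The function names, the pattern list, and the ignored directories are all assumptions you would adjust for your own project.

```python
"""Sketch of a temp-file drift check (hypothetical; not the kit's actual script)."""
from fnmatch import fnmatch
from pathlib import Path

# Assumed patterns for files that should never be committed; adjust per project.
TEMP_PATTERNS = ("*.tmp", "*.bak", "*.orig", "*~")
# Assumed directories the scan should skip entirely.
IGNORED_DIRS = {".git", ".venv", "node_modules", "__pycache__"}


def find_temp_files(root: Path) -> list[Path]:
    """Return repo-relative paths of files whose name matches a temp pattern."""
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip anything that lives under an ignored directory.
        if any(part in IGNORED_DIRS for part in relative.parts):
            continue
        if path.is_file() and any(fnmatch(path.name, p) for p in TEMP_PATTERNS):
            hits.append(relative)
    return sorted(hits)


def main(root: Path) -> int:
    """Print each offender; a non-zero return value means the check failed."""
    hits = find_temp_files(root)
    for hit in hits:
        print(f"Temp file should not be committed: {hit.as_posix()}")
    return 1 if hits else 0
```

To wire something like this into CI or pre-commit, pass the result of `main(Path.cwd())` to `sys.exit` so a stray file turns the build red.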

The build

I packaged this into a starter kit. Design constraints I gave myself:

Tool-agnostic. It's prompt-first: you hand any agent the repo URL, and it reads the kit and adapts the pattern to your stack. It isn't locked to one vendor's agent.
Boring on purpose. MIT licensed, standard-library Python, plain Markdown. No framework to install, nothing that takes an afternoon to audit.
Conservative. It inspects the target repo first and adds only the missing pieces. It never blindly overwrites your files.

Because I kept tripping over English-only docs myself, I wrote the README in
four languages — English, Korean, Japanese, and Chinese. If you've ever bounced
off a great tool because its docs assumed your first language, you know why that
mattered to me.

A drift check is deliberately tiny — small enough to read in one sitting:

# scripts/check_docs_drift.py (excerpt)
# Fails when a doc links to a path that doesn't exist,
# so your README can't quietly rot.
for doc, reference_value in missing_paths:
    print(f"Missing referenced path in {doc.relative_to(root)}: {reference_value}")
return 1 if missing_paths else 0
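
The excerpt above shows only the reporting step. As a rough sketch of how `missing_paths` could be collected, assume a path reference is any backticked span that contains a forward slash, has no spaces, and is not a URL. That detection rule and the name `collect_missing_paths` are assumptions for illustration; the real script may use a different heuristic.

```python
"""Sketch of collecting `missing_paths` for a docs drift check.

Hypothetical: the reference-detection rule below is an assumption,
not the kit's actual heuristic.
"""
import re
from pathlib import Path

# Assumption: a path reference is inline code such as `docs/decisions`.
INLINE_CODE = re.compile(r"`([^`\n]+)`")


def looks_like_path(reference_value: str) -> bool:
    """Very rough filter: has a slash, no spaces, and is not a URL."""
    return (
        "/" in reference_value
        and " " not in reference_value
        and "://" not in reference_value
    )


def collect_missing_paths(root: Path) -> list[tuple[Path, str]]:
    """Return (doc, reference) pairs whose referenced path does not exist."""
    missing = []
    for doc in sorted(root.rglob("*.md")):
        text = doc.read_text(encoding="utf-8")
        for reference_value in INLINE_CODE.findall(text):
            if looks_like_path(reference_value) and not (root / reference_value).exists():
                missing.append((doc, reference_value))
    return missing


def main(root: Path) -> int:
    """Report every missing reference; a non-zero return value means drift."""
    missing_paths = collect_missing_paths(root)
    for doc, reference_value in missing_paths:
        print(f"Missing referenced path in {doc.relative_to(root)}: {reference_value}")
    return 1 if missing_paths else 0
```

The check stays small because it only asks one question per reference: does this path exist relative to the repo root?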

The instruction file (AGENTS.md) is just as plain: project overview,
directory rules, exact commands, forbidden actions, PR behavior. Nothing magic.
The value isn't in any one file; it's in having all four components present and
enforceable.
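
"Enforceable" can apply to the instruction file too. The sketch below checks that an instruction file has a Markdown heading for each of the five topics listed above. The keyword list and function names are assumptions, and the kit itself may not ship a check like this.

```python
"""Sketch: verify an instruction file covers the five topics listed above.

Hypothetical: the keywords are assumptions; a real AGENTS.md may word its
headings differently.
"""
import re
from pathlib import Path

# Assumed keywords, one per topic, matched case-insensitively against headings.
REQUIRED_TOPICS = ("overview", "directory", "commands", "forbidden", "pull request")

# Matches Markdown ATX headings such as "## Directory rules".
HEADING = re.compile(r"^#{1,6}\s+(.*)$", re.MULTILINE)


def missing_topics(text: str) -> list[str]:
    """Return the required topics that no heading mentions."""
    headings = [heading.lower() for heading in HEADING.findall(text)]
    return [
        topic
        for topic in REQUIRED_TOPICS
        if not any(topic in heading for heading in headings)
    ]


def check_instruction_file(path: Path) -> int:
    """Print each gap; a non-zero return value means the file is incomplete."""
    gaps = missing_topics(path.read_text(encoding="utf-8"))
    for topic in gaps:
        print(f"{path.name} has no heading mentioning: {topic}")
    return 1 if gaps else 0
```

A check like this cannot judge whether the instructions are good, only whether a section went missing, which is the kind of silent drift that is easy to overlook in review.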

The dogfood: building a Django app through the harness

A methodology you only ever describe is a blog post. I wanted to know if it
actually held up, so I took an empty folder and built a small Django app —
a board with post CRUD, ownership permissions, admin user management, search,
pagination, comments — entirely through the harness workflow.

What accumulated in the repo as I went:

8 decision records in docs/decisions/ — why generic-first, why this Django layout, why post-ownership permissions work the way they do.
CI: GitHub Actions runs the same harness checks on every push.
.harness/source.json pinning the exact kit commit the repo absorbed, so "which version of the methodology is this repo on?" has a real answer.
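
A pin file only answers the "which version?" question if something reads it. Here is a minimal sketch of that, assuming the file stores the commit under a `commit` key as a full 40-character SHA-1 hash. The key name and the function name are assumptions; the real `.harness/source.json` may use a different schema.

```python
"""Sketch: read a kit pin file and validate the commit it records.

Hypothetical: the JSON key `commit` is an assumption, not the kit's
documented schema.
"""
import json
import re
from pathlib import Path

# A full Git SHA-1 commit hash is 40 lowercase hex characters.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def read_pinned_commit(pin_file: Path) -> str:
    """Return the pinned commit, or raise ValueError if it is not a full SHA."""
    data = json.loads(pin_file.read_text(encoding="utf-8"))
    commit = str(data.get("commit", ""))
    if not FULL_SHA.match(commit):
        raise ValueError(f"{pin_file} does not pin a full 40-character commit SHA")
    return commit
```

Rejecting branch names and short hashes is the design choice here: a value like `main` moves over time, so it would not identify which version of the kit the repo absorbed.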

The features weren't the point. The point was that every behavior change shipped
with its verification and its durable memory, automatically, because the
harness made that the path of least resistance.

The moment it bit me — and why that's the best part

Here's the beat I didn't plan for.

I added CI. It immediately went red. The failure was inside my own drift
checker:

Missing referenced path in docs/decisions/0002-initialize-django-config-project.md: .\.venv\Scripts\python.exe
Missing referenced path in docs/harness/adoption-report.md: .\.venv\Scripts\python.exe

My docs documented a Windows virtual-env command, .\.venv\Scripts\python.exe.
My drift checker saw the backslashes and "helpfully" decided it was a file
reference — then checked whether that file existed. It existed on my Windows
machine, so it passed locally. It did not exist in Linux CI. Green on my
laptop, red in the cloud. The classic.

The tool I built to catch drift had drifted.

But this is exactly what the methodology is for. I didn't just patch it and
move on. I fixed the checker to recognize venv commands by executable name
instead of path existence — and then I wrote it down, in
docs/failures/0001-docs-drift-windows-venv-command.md: context, symptoms, root
cause, resolution, prevention. Now the next agent (or the next me, at 1 a.m.)
that touches drift logic reads a one-page record instead of rediscovering the
trap.
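
The shape of that fix can be sketched in a few lines: decide whether a reference is a command by looking at the executable name, without touching the filesystem. The function name and the executable list below are assumptions for illustration; the kit's real fix may differ in detail.

```python
"""Sketch: treat interpreter invocations as commands, not file references.

Hypothetical: the function name and executable list are assumptions.
"""
from pathlib import PureWindowsPath

# Assumed executable names that mark a reference as a command.
COMMAND_EXECUTABLES = {"python", "python3", "pip", "pip3", "activate"}


def looks_like_venv_command(reference_value: str) -> bool:
    """True when the first token names a known executable, on any OS."""
    tokens = reference_value.split()
    if not tokens:
        return False
    # PureWindowsPath accepts both "\" and "/" as separators and never
    # touches the filesystem, so .\.venv\Scripts\python.exe and
    # .venv/bin/python get the same answer on Windows and on Linux CI.
    executable = PureWindowsPath(tokens[0]).stem.lower()
    return executable in COMMAND_EXECUTABLES
```

A drift checker would call this before its existence test and skip any reference that returns `True`. Because the decision no longer depends on what is on disk, the result is the same on a Windows laptop and in Linux CI.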

A recurring failure became a durable artifact. The principle, applied to the
tool itself.

The result (and an honest limit)

I also built a quick diagnostic, harness doctor, that scores how ready a repo
is for reliable agent collaboration across five axes. Run against the dogfooded
Django app:

Score: 83/100 (baseline evidence scan)
Grade: B+ (baseline)

Now the honest part, because dev.to readers can smell a pitch: that score is a
baseline evidence scan. It checks that durable files and command patterns
exist — it does not yet prove that adopting the harness measurably reduces
repeated agent mistakes. I have the measurement protocol (wrong-file edits,
first-pass verification, repeated-mistake counts) but not enough before/after
runs to claim a number. That's the next thing I'm working on, and it's the
subject of a follow-up post.

So: take this as a strong qualitative result — a real app, real decision and
failure memory, a tool that caught its own bug — not as a benchmark. Yet.

Takeaway

The shift that actually changed how my agents behave wasn't a cleverer prompt.
It was treating the repository as the thing you engineer:

Every recurring agent failure should become a durable artifact.

If your agent keeps making the same mistake, stop re-explaining it in chat.
Write it into the repo once, in a form the next session can't miss.

The kit is MIT-licensed and on GitHub:
https://github.com/baskduf/harness-starter-kit

To try it, open your repo with your agent and point it at that URL — it'll
inspect your stack and add only the harness pieces you're missing. Stars,
issues, and "this broke on my stack" reports are all genuinely welcome — and if
you read the Korean, Japanese, or Chinese README and something sounds off, tell
me; I maintain all four by hand.

Part 2 will be the measurement: can I show, with numbers, that a harnessed repo
makes agents repeat fewer mistakes? Follow along if that's the post you actually
want.

What stack would you try this on? Tell me in the comments — I'm collecting
real targets for the measurement study.