Tufail Khan

Harness engineering: a self-evolving feature loop in 312 lines of bash

Repo: github.com/tufailkhan45/harness-loop — one bash script, drop into any spec-driven repo.
Originally published on: tufail.dev/blog/harness-engineering-self-evolving-loop

Most posts about Claude Code talk about prompts. This one is about the harness — the wrapper around the model that turns a single claude -p invocation into a system that can ship a backlog of features over hours, survive its own failures, and learn as it goes.

I built harness-loop after watching too many headless Claude runs silently spin on the same broken approach for thirty minutes. This post walks through what a harness actually does, why the design comes down to three load-bearing parts, and what I learned writing one in bash.

What is harness engineering?

The model produces tokens. The spec describes the goal. The harness is everything in between: when to invoke the model, what context to feed it, when to stop, when to halt the whole run, and what to trust as a "done" signal.

If the model is the engine and the spec is the destination, the harness is the chassis, fuel system, and dashboard warning lights. Most AI workflows fail not because the model is wrong but because the harness is missing — the model gets called once, returns something that looks plausible, and the human is left to figure out whether the work actually shipped.

A good harness answers four questions on every iteration:

  1. What is the next unit of work?
  2. What context does the model need that it didn't have last time?
  3. Did anything just happen that requires a human?
  4. Is this feature actually done, or does the model just think it is?

The whole loop is built around answering those four questions, mechanically, in a way that survives crashes, quota windows, and the model's own occasional confidence in incorrect things.

What the loop does

The runner is one bash file (scripts/run-features.sh, 312 lines). Every iteration:

  1. Picks the next feature without a .done marker
  2. Builds a prompt from the spec, the feature's prior attempt log, and a global learnings file
  3. Invokes claude -p under timeout
  4. Inspects the resulting log for halt signals (BLOCKED:, no growth, quota errors)
  5. Loops

It exits 0 when every feature has a marker, or with a halt code (3-6) when something demands a human.

specs/auth-login/spec.md   ──┐
logs/auth-login.log        ──┼──> prompt ──> claude -p ──> append log + maybe .done
logs/learnings.md          ──┘

That is the whole architecture. No queue, no database, no orchestrator. The filesystem is the state machine, and .done markers are the source of truth.
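
The skeleton of that loop fits in a screenful. The sketch below is illustrative, not the script itself: build_prompt and check_halt_signals are stand-ins expanded in later sections, and the paths are placeholders.

#!/usr/bin/env bash
# Minimal sketch of the loop skeleton. Helper names, paths, and the timeout
# are illustrative; the real run-features.sh adds quota detection, a disk
# warning, and richer logging.
set -uo pipefail
shopt -s nullglob

while true; do
  # 1. Pick the next feature without a .done marker.
  next=""
  for spec in specs/*/spec.md; do
    slug=$(basename "$(dirname "$spec")")
    if [[ ! -f "logs/$slug.done" ]]; then
      next="$slug"
      break
    fi
  done

  # Exit 0 once every feature has a marker.
  [[ -z "$next" ]] && exit 0

  # 2-3. Build the prompt and invoke claude under timeout.
  prompt=$(build_prompt "$next")
  timeout 30m claude -p "$prompt" >> "logs/$next.log" 2>&1

  # 4. Inspect the fresh log; any tripped halt signal ends the run
  #    with the matching exit code (3-6).
  check_halt_signals "$next" || exit $?
done

Because the queue is re-derived from the .done markers on every pass, killing and restarting the process costs nothing.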

Self-evolution: three parts that all have to work

"Self-evolving" sounds hand-wavy until you stare at what it actually requires. There are exactly three mechanisms, and breaking any one breaks the loop:

1. Read. Every iteration tails the last 200 lines of two files into the prompt — the feature's own prior log (so the model does not repeat what already failed), and a cross-feature learnings file (so feature D benefits from a discovery made in feature A). Recency is the selection heuristic: head or middle slices would work less well, because the most recent attempt holds the most relevant signal.

2. Write. The prompt explicitly asks the model to do two things at the end of every iteration: append a progress note to the feature log, and append a one-line lesson to learnings.md — but only if the lesson is broadly applicable. The wording is deliberately load-bearing. Soften it ("you may want to add a note...") and the loop's memory degrades within a handful of iterations.

3. Floor. A circuit breaker. If a feature's log does not grow by more than 32 bytes for STUCK_LIMIT iterations in a row, the runner halts that feature with exit code 5. The runner cannot audit what the model writes, only whether it writes anything. Without this floor, a model that has hallucinated its feedback channel will spin forever and burn quota.

The asymmetry matters. Read and Write are model behaviour — both can fail subtly. The Floor is a hard mechanical guardrail that catches the failure mode the model itself cannot self-detect.
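
To make the Floor concrete, here is a sketch of a size-delta check, assuming per-feature state lives in small files next to the log. The 32-byte floor and exit code 5 match the behaviour described above; the helper name, the counter files, and the STUCK_LIMIT default are mine.

# Sketch of the circuit breaker, run once per iteration for the current feature.
STUCK_LIMIT="${STUCK_LIMIT:-3}"          # default is a placeholder

check_growth() {
  local slug="$1"
  local log="logs/$slug.log"
  local size_file="logs/$slug.size"      # last observed log size
  local streak_file="logs/$slug.stuck"   # consecutive no-growth iterations

  local prev cur
  prev=$(cat "$size_file" 2>/dev/null || echo 0)
  cur=$(stat -c %s "$log" 2>/dev/null || echo 0)
  echo "$cur" > "$size_file"

  if (( cur - prev > 32 )); then
    echo 0 > "$streak_file"              # meaningful growth: reset the streak
    return 0
  fi

  local streak=$(( $(cat "$streak_file" 2>/dev/null || echo 0) + 1 ))
  echo "$streak" > "$streak_file"
  if (( streak >= STUCK_LIMIT )); then
    echo "HALT: no log growth on $slug for $streak iterations" >&2
    exit 5                               # silent spin: hand this to a human
  fi
  return 0
}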

The prompt is the API

Most of the prompt is a fixed heredoc, but two blocks are dynamic:

<<<PRIOR_LOG
[last 200 lines of logs/feature-runner/<slug>.log]
PRIOR_LOG

<<<LEARNINGS
[last 200 lines of logs/feature-runner/learnings.md]
LEARNINGS

Followed by a six-step task list that constrains the iteration to one meaningful step — not "finish the feature," not "make as much progress as you can," but pick the next unfinished piece, do it, verify it, log it. The "one step at a time" framing prevents the model from spending a 30-minute timeout on a megacommit it then cannot verify.

Step 6 is the contract: write <slug>.done only if the spec is satisfied AND verification is green. The runner trusts this signal. Weaken the prompt ("write .done when you think you're close enough") and the whole loop loses its meaning — features get marked done that aren't done.
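
A hedged sketch of how that assembly might look. The wording of the task list below is paraphrased from this section, not copied from the real prompt; only the structure (spec, two tail -n 200 windows, one-step task list, .done contract) mirrors the post.

# Sketch of prompt assembly; wording and helper name are illustrative.
build_prompt() {
  local slug="$1"
  cat <<PROMPT
You are implementing the feature described in specs/$slug/spec.md:

$(cat "specs/$slug/spec.md")

<<<PRIOR_LOG
$(tail -n 200 "logs/feature-runner/$slug.log" 2>/dev/null)
PRIOR_LOG

<<<LEARNINGS
$(tail -n 200 "logs/feature-runner/learnings.md" 2>/dev/null)
LEARNINGS

Pick the single next unfinished piece of the spec, implement it, verify it,
and append a progress note to the feature log. Append a one-line lesson to
learnings.md only if it is broadly applicable. Write $slug.done only if the
spec is satisfied AND verification is green.
PROMPT
}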

Four halt codes for four failure modes

Halt categories matter because each one needs a different human response:

Code  Meaning                      What you do
3     HALT file present            Someone paused it; resume with rm HALT
4     BLOCKED: in feature log      Model hit something it can't fix; read the log
5     Circuit breaker tripped      Silent spin; feature spec probably ambiguous
6     Quota / auth / rate limit    External issue; wait or rotate keys

Code 5 is the most interesting. It catches the failure where the model is technically running but producing nothing. Without it, you can lose hours of quota on a feature that has gone silent.
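
The detection itself can stay dumb. A sketch of what the log inspection might look like: the BLOCKED: marker and the HALT file are from the post, while the quota patterns are illustrative guesses.

# Sketch of halt-signal detection. Returns the halt code the runner should
# exit with, or 0 to keep looping. Code 5 comes from the size-delta check
# sketched earlier, so it is not repeated here.
check_halt_signals() {
  local slug="$1"
  local log="logs/$slug.log"

  [[ -f HALT ]] && return 3                # manual pause; rm HALT to resume
  grep -q 'BLOCKED:' "$log" && return 4    # model asked for a human
  grep -Eqi 'rate limit|quota|unauthorized' "$log" && return 6   # patterns are guesses
  return 0
}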

Why bash

I considered Python. Bash won for three reasons:

  1. Zero install friction. Copy one script and a settings file into any repo. No venv, no pip install, no version juggling.
  2. Resumability is trivial. State is files on disk. Kill the process, restart it, it picks up exactly where it left off. .done markers are the source of truth.
  3. Coreutils already does the work. timeout for per-call kills, tail -n 200 for windowed context, stat -c %s for the size-delta circuit breaker, df -Pm for the disk warning. None of this needs a programming language.

set -uo pipefail is on; set -e is intentionally off. The runner must survive a non-zero exit from claude — a failed iteration is data, not a fatal error. With -e, the loop dies on the first model error and you lose the entire run.
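
In practice the per-call error handling is just a captured exit status. A sketch, assuming the prompt and log variables from the loop above; the timeout length and log line are illustrative.

# Failed iterations are data: record the exit status and keep going.
# GNU timeout reports 124 when it had to kill the call.
timeout 30m claude -p "$prompt" >> "$log" 2>&1
status=$?
if (( status != 0 )); then
  echo "[runner] claude exited $status at $(date -u +%FT%TZ)" >> "$log"
fi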

What it isn't

  • Not a planner. Specs are the input, not the output. Decomposition happens inside each iteration, by the model.
  • Not a verifier. Verification is delegated to the model — pytest, npm test, curl, claude-in-chrome MCP for UI smoke tests, whatever fits the feature.
  • Not language-specific. It runs against any repo with a specs/<slug>/spec.md layout. Python, TypeScript, Rust, Go — the runner doesn't care. The model reads the spec and any project-level CLAUDE.md and picks the right tools.
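
Concretely, a repo the runner is happy with might look like this. Slugs, the logs layout, and the .done location are illustrative, not prescribed.

my-project/
  scripts/run-features.sh          the runner
  specs/auth-login/spec.md         one directory per feature
  specs/billing-webhooks/spec.md
  logs/auth-login.log              per-feature attempt log
  logs/auth-login.done             marker written by the model
  logs/learnings.md                cross-feature memory
  CLAUDE.md                        optional project-level guidance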

What I'd do differently

Three things I would change if I rebuilt it:

  1. Make the size-delta threshold configurable per feature (a possible shape is sketched after this list). 32 bytes works on average, but some features have legitimately quiet iterations.
  2. Add a PARALLEL=N flag. Right now it is strictly serial. For independent features, parallelism would 3-4x throughput.
  3. Stream the run log to stderr unconditionally. I added tee later when I realised I couldn't see what was happening without tailing two files at once.
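
For the first item, the override could be as small as an optional file next to the spec. This is hypothetical; nothing like it exists in the repo today.

# Hypothetical per-feature override for the 32-byte floor: read an optional
# file from the feature's spec directory, fall back to the global default.
growth_floor() {
  local slug="$1"
  local override="specs/$slug/stuck-threshold"   # made-up filename
  if [[ -f "$override" ]]; then
    cat "$override"
  else
    echo 32
  fi
}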

The deeper lesson from this build: self-evolving systems don't need to be smart, they need to be honest about their own failure modes. The harness loop has no learning algorithm, no graph, no agent framework. It has three text files and a circuit breaker. That turns out to be enough to ship features overnight without a human in the chair — provided the spec is clear and the model is given a way to remember.

Try it: github.com/tufailkhan45/harness-loop. The README has install steps and a dry-run mode that prints the resolved queue and sample prompt without spending tokens.
