Most "autonomous agents" are one prompt in a while loop. They run, they drift, they repeat yesterday's mistake, and they keep no memory of anything they learned. After a day you don't have an agent that got better — you have the same agent, more tired.
We've been running a different pattern in production at azena for months, and I want to describe it concretely because it's almost embarrassingly simple: no framework, four markdown files, and one discipline. We open-sourced the templates — azena-ai/self-improving-loop — but the idea matters more than the files, so here's the whole thing.
The core idea: the agent runs on a genome
The agent doesn't run on a fixed prompt. It runs on a genome — a versioned strategy file that it both reads and rewrites.
Every cycle ("tick") the loop does one thing: it picks the single highest-value move, ships it, verifies it actually worked, and then folds what it learned back into its own instructions. The genome goes v001 → v002 → v003…, and each bump is an auditable record of the agent changing its own mind.
That last part is the whole game. A static prompt fights reality the moment the mission shifts. A genome absorbs the shift, because the loop is allowed to edit it.
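To make the version bump concrete, here is a minimal sketch. It assumes the genome carries a tag like `v001` somewhere in its text; the function name `bump_version` and that tag placement are illustrative assumptions, not the layout of the published templates.

```python
import re

# Assumption: the genome contains a tag such as "v001"; the first one found is bumped.
VERSION_TAG = re.compile(r"\bv(\d+)\b")


def bump_version(genome_text: str) -> str:
    """Return the genome text with its first vNNN tag incremented by one."""
    match = VERSION_TAG.search(genome_text)
    if match is None:
        raise ValueError("no version tag like v001 found in genome")
    digits = match.group(1)
    # Keep the zero padding so v001 becomes v002, and v099 becomes v100.
    bumped = str(int(digits) + 1).zfill(len(digits))
    return genome_text[:match.start(1)] + bumped + genome_text[match.end(1):]
```

Committing the file after each bump is what gives you the auditable trail: the diff between two versions is the record of what changed.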
The loop in one picture
    flowchart TD
        A([Tick fires]) --> B[Read genome + levers + lessons]
        B --> C[Pick the single highest-value lever]
        C --> D[Build / act: small, shippable]
        D --> E{Gate: verify it really works}
        E -- fail --> C
        E -- pass --> F[Commit]
        F --> G[Self-improve:<br/>rewrite genome, append a lesson, bump version]
        G --> H[Schedule the next tick]
        H --> A
No human in the inner loop. A human sets the mission and reviews the diffs in the morning. That's the deal.
Four files, three of which the loop edits
| File | Role |
|---|---|
| `genome.md` | The evolving strategy + state: mission, current focus, what's proven, what's next. The loop mutates this as it learns. Versioned. |
| `loop-prompt.md` | The orchestrator the agent executes each tick, and improves. Holds the tick cycle, the gate rules, and an append-only lessons log. |
| `levers.md` | The prioritized backlog. A ranked list of moves with a status log. The loop always takes the top open one. |
| lessons | Hard-won rules, written back into the prompt the moment they're learned. This is the "self-improving" part. |
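"Always takes the top open one" is easy to pin down in code. The sketch below assumes `levers.md` is a ranked markdown checklist (`- [ ]` for open, `- [x]` for done). That format is an assumption made for illustration; the real templates may track status differently.

```python
import re
from typing import Optional

# Assumed line format: "- [ ] title" (open) or "- [x] title" (done), ranked top to bottom.
LEVER_LINE = re.compile(r"^\s*-\s*\[(?P<done>[ xX])\]\s*(?P<title>.+?)\s*$")


def highest_value_open(levers_text: str) -> Optional[str]:
    """Return the title of the first unchecked lever, or None if none are open."""
    for line in levers_text.splitlines():
        match = LEVER_LINE.match(line)
        if match and match.group("done") == " ":
            return match.group("title")
    return None
```

Because the file is ranked, "highest value" reduces to "first open line". Re-ranking is then just an edit to the file, which the loop (or a human) can do between ticks.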
The tick itself is just a state machine:
    read(genome, levers, lessons)
    lever = highest_value_open(levers)
    result = act(lever)                  # small, shippable
    if not gate(result):                 # verify the ARTIFACT
        reschedule(); return             # back to the top; never "commit anyway"
    commit(result)
    update(genome.status, lever.status)
    maybe_mutate(genome)                 # bump version if strategy changed
    maybe_append(lessons)                # if something was learned
    schedule_next_tick()
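For readers who want something they can actually run, here is one way the state machine above could look in Python. Everything here is a simplified stand-in: `LoopState`, `tick`, and the `act` / `gate` / `learn` callables are hypothetical names, state lives in memory rather than in files, and a new lesson is treated as a strategy change (the pseudocode keeps those two separate).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopState:
    """In-memory stand-in for genome.md, levers.md, and the lessons log."""
    genome_version: int = 1
    open_levers: List[str] = field(default_factory=list)
    done_levers: List[str] = field(default_factory=list)
    lessons: List[str] = field(default_factory=list)


def tick(
    state: LoopState,
    act: Callable[[str], object],
    gate: Callable[[object], bool],
    learn: Callable[[str, object], Optional[str]],
) -> bool:
    """Run one tick; return True only if a lever passed the gate and was committed."""
    if not state.open_levers:
        return False
    lever = state.open_levers[0]      # the list is ranked, so the top one wins
    result = act(lever)               # small, shippable
    if not gate(result):              # verify the artifact
        return False                  # nothing is committed on a failed gate
    state.open_levers.pop(0)          # "commit": mark the lever as done
    state.done_levers.append(lever)
    lesson = learn(lever, result)
    if lesson is not None:            # simplification: a lesson bumps the version
        state.lessons.append(lesson)
        state.genome_version += 1
    return True
```

The property worth noticing is that a failed gate leaves `state` untouched, so a bad tick costs time but never corrupts the record.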
Why file-based memory beats a context window
A context window is volatile and small. It evaporates on compaction, restart, or a long enough wall-clock gap. So an agent whose "state" lives in context literally forgets where it was.
A genome file is durable and unbounded. The loop can run for days across many ticks, restarts, and summarizations, and still know exactly where it is — because "where it is" is written down, not remembered. When context gets summarized away, the next tick just re-reads the genome and carries on. That single decision — state on disk, not in the window — is what turns a chatty demo into something that survives a week.
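If state lives on disk, a half-written file is the new way to lose it. A minimal sketch of a crash-tolerant save, assuming a single writer and a local filesystem; `save_genome` and `load_genome` are hypothetical helper names, not part of the templates:

```python
import os
import tempfile
from pathlib import Path


def save_genome(path, text: str) -> None:
    """Write to a temp file in the same directory, then atomically swap it in."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())   # push the bytes to disk before the swap
        os.replace(tmp_name, path)      # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)         # don't leave a stray temp file behind
        raise


def load_genome(path) -> str:
    """Read the current genome; every tick starts here, not from context."""
    return Path(path).read_text(encoding="utf-8")
```

A restart in the middle of a save then leaves the previous genome intact, which is exactly the guarantee the next tick relies on when it re-reads the file.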
The non-negotiable: gates
Here's the part everyone skips, and it's the part that makes autonomy safe instead of reckless.
An autonomous loop is only as trustworthy as its verification. A gate is a check that must pass before a commit. A failing gate sends the loop back to pick another lever — never forward to "commit anyway."
The minimum gate is three steps, and the order matters:
1. Typecheck green. Run the real check. Do not pipe it through `head` or `tail`: unless `pipefail` is set, a pipeline reports the exit status of its last command, so it exits `0` and will happily print "OK" over a stack of errors.
2. Build green. Many bundlers strip types and build green despite type errors, so step 1 is not optional.
3. Verify the artifact, not the log. This is the one teams skip. "The deploy succeeded" is not evidence that the page renders, the endpoint responds, or the file is non-empty.
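The pipe problem from the typecheck step is easy to reproduce. This sketch assumes a POSIX shell without `pipefail`; `shell_exit_code` and `check_passes` are illustrative helpers, and the failing "check" is a stand-in for whatever typechecker you really run.

```python
import subprocess
import sys
from typing import Sequence


def shell_exit_code(command: str) -> int:
    """Run a command line through the system shell and return its exit status."""
    completed = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def check_passes(argv: Sequence[str]) -> bool:
    """Run the real check with no pipe, so its own exit status decides the gate."""
    return subprocess.run(list(argv)).returncode == 0


# Stand-in for a failing typecheck: a process that exits with status 1.
failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print(shell_exit_code("false"))              # non-zero: the failure is visible
    print(shell_exit_code("false | head -n 5"))  # 0: head's status masks the failure
    print(check_passes(failing_check))           # False: the gate sees the real status
```

If you need to trim the output for the agent's context, capture it and truncate it in code after you have read the exit status, rather than truncating with a pipe.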
That third point is the single most expensive class of bug in an autonomous loop: a step that reports success while producing garbage. A prerender that silently emits an empty SPA shell. A migration that "ran" but touched zero rows. (We dug into this exact failure mode for production agents, and why they pass demos but fail live, in a separate post.) So assert a concrete property of the real artifact:
    # don't trust "build OK": prove it
    test "$(grep -c '<div id="root"></div>' dist/index.html)" -eq 0   # not an empty shell
    test "$(wc -c < dist/index.html)" -gt 50000                        # has real content
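If your gate lives in Python rather than shell, the same two assertions port directly. A sketch under the same assumptions as the shell version (an empty `root` div means an empty shell, and 50,000 bytes is the floor for "real content"); `artifact_ok` is a hypothetical name, and both thresholds should be tuned to your own build.

```python
from pathlib import Path

# The marker of an un-prerendered SPA shell, as in the shell check above.
EMPTY_ROOT = b'<div id="root"></div>'


def artifact_ok(html_path, min_bytes: int = 50_000) -> bool:
    """True only if the file exists, is large enough, and is not an empty shell."""
    path = Path(html_path)
    if not path.is_file():
        return False                      # a missing artifact is a failed gate
    data = path.read_bytes()
    return len(data) > min_bytes and EMPTY_ROOT not in data
```

A missing file counts as a failure here rather than an exception, so the gate's answer is always a plain pass or fail.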
One caveat so you don't fight ghosts: a transient failure on something you didn't touch (a network blip, a cold start) isn't a regression. Re-run the gate once. If it fails deterministically, it's real.
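The re-run rule is small enough to write down exactly. A sketch, assuming the gate is a zero-argument callable that returns a bool; `gate_with_one_retry` is an illustrative name:

```python
from typing import Callable


def gate_with_one_retry(check: Callable[[], bool]) -> bool:
    """Pass if the check passes on the first try or on a single re-run."""
    if check():
        return True
    # One retry absorbs a network blip or cold start; a second failure is real.
    return check()
```

Keeping the retry count at exactly one is deliberate: a gate that retries until green is no longer a gate.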
Lessons the loop has already taught itself
These are real, generalized from production runs. The point of the pattern is that this list grows by itself — the loop appends to it the moment it gets burned:
- Verify before you ship. A build step can fail silently and leave an empty shell. Assert the output is non-empty and correct before deploying.
- Don't turn one finding into a destructive sweep. A single odd-looking match is not a mandate for a sitewide find-and-replace. Check intent first.
- When reality contradicts the task's premise, report — don't blindly execute. If the job says "small fix" and you find a load-bearing rewrite, surface it instead of plowing ahead.
- Tools that need a server don't start one. Bring the server up, wait for it, then run the check. A flood of `connection refused` means "nothing's listening," not "everything's broken."
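The "wait for it" part of that last lesson is the piece that usually goes missing. A minimal sketch of polling a TCP port before running a check; `wait_for_port` and its default timings are assumptions for illustration, and a real setup might poll a health endpoint instead.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.2) -> bool:
    """Poll until something accepts TCP connections on host:port, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True               # something is listening; safe to run the check
        except OSError:
            time.sleep(interval_s)        # refused or timed out: not up yet, try again
    return False
```

If this returns `False`, the honest report is "the server never came up", which is a very different finding from a hundred failed assertions.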
Notice these aren't AI-specific. They're the operating rules of a careful engineer — except the loop wrote them for itself, after paying for them once.
One lever per tick
Last principle, easy to underrate: one lever per tick. Small, shippable units keep every change reviewable and every failure cheap to roll back. The temptation with an autonomous agent is to let it do five things at once "to save time." Don't. A tick that ships one verified thing and stops is worth more than a tick that ships five unverifiable things. The cadence is the safety rail.
Getting started
If you want to try it, it genuinely is four files and an afternoon:
- Copy the four `*.template.md` files from the repo into your project.
- Fill `genome.md` with your mission and first focus.
- Seed `levers.md` with a ranked backlog.
- Hand `loop-prompt.md` to your agent and tell it to run one tick, then schedule the next.
- Review the diffs each morning. Watch the genome evolve.
It works anywhere an agent can schedule its own next turn and commit to git — we run it inside Claude Code, but nothing in the pattern is Claude-specific.
We build this kind of thing for a living — bespoke, EU-sovereign AI systems for the German Mittelstand — at azena, and we teach the craft at the azena Dev Academy. The loop pattern came out of needing our own automation to be trustworthy enough to leave running overnight. If you build something with it, I'd love to hear how the genome evolved.
The templates, docs, and a few reusable skills are all MIT-licensed here: github.com/azena-ai/self-improving-loop.