I build and maintain several small apps by myself — a focus timer, a vocabulary trainer, a D-day counter, a few others. No cofounder, no employees, no contractors. One person and a flock of AI agents.
People assume the trick is speed: that AI lets me type code faster. That's not it.
The trick is that I almost never let myself — or the AI — call something "done" without proof.
The real bottleneck isn't writing code
When you work alone, the dangerous failure mode isn't slow typing. It's confident wrongness.
A capable model will happily tell you the tests pass when it never ran them. It will report a feature finished when it only wrote the function signature. With no coworker to catch it, a single "looks good to me" can push a broken build into a store review queue — and you find out three days later, in a rejection email.
So the system I've built around my apps isn't tuned for output. It's tuned to make false completion expensive. I call it the harness: specialized agents, orchestrator workflows, and verification gates that sit between "the AI says it's done" and "I believe it."
Here's how the pieces fit.
One specialist beats one generalist
Instead of one do-everything assistant, I dispatch each task to one of more than forty specialized agents, each with a narrow job: a Flutter reviewer, a security reviewer, a test executor whose only purpose is to actually run the suite, a "reality auditor" whose only purpose is to compare what was claimed against the real diff.
Two rules keep this from turning into chaos:
- Specialists are easier to trust. A reviewer that only reviews Dart, against an explicit checklist, gives sharper feedback than a generalist asked to "check everything."
- Parallel work gets isolated. When several agents edit code at the same time, each runs in its own throwaway git worktree, so they can't overwrite each other. Read-only research runs shared; anything that changes files runs isolated. The rule is one line: state-changing parallel work gets a worktree, read-only parallel work doesn't.
For something like a code review, that means a small fan-out: a few reviewers look at the same change from different angles — correctness, security, performance — and then a separate agent is told to try to refute each finding before it ever reaches me. Whatever survives that is usually worth my time. The redundancy is the point. One reviewer misses things; three reviewers with different mandates, plus a skeptic whose job is to disagree, miss far fewer.
The dispatch itself is a single call. The discipline is in which agent, and whether it needs isolation.
Orchestrator skills with honesty gates
The agents don't free-float. They're wired into orchestrator workflows — I call them skills — that run a fixed pipeline for each kind of work.
My code workflow, for instance, runs roughly:
- Plan — break the request into tasks with risk levels and dependencies.
- Implement — write the change. Forbidden here: "while I'm in here" refactors, skipped tests, swallowed exceptions.
- Run it for real — a dedicated executor agent runs the build, tests, and lint, and captures the actual exit code and output. "I think it'll pass" is not accepted; only real output is.
- Reality audit — a separate agent diffs the claim against the actual changes. Comment-only edits, empty stubs dressed up as features, and "I changed X" when X was never touched all get caught here.
- Review — a language-specific reviewer; anything Critical sends the task back to step 2.
- Verify gate — a final pass that confirms there were real changes, maps each requirement to evidence it exists, runs the build once more, and scans for leaked secrets. It returns PASS or REJECT, nothing in between.
Every one of those stages exists because of a specific way I — or the model — used to lie to me:
- "It should work" dies at the run-it-for-real stage. No real output, no pass.
- "Done," on an empty stub, dies at the reality audit.
- "All green," quietly hiding a warning, dies at the static check — and in strict mode a warning is a failure.
This isn't bureaucracy for its own sake. A solo maker has no second pair of eyes, so the second pair of eyes has to be built into the process. And the loop is bounded: if a task can't pass its gate in three rounds, the workflow stops and tells me plainly that it failed, instead of quietly lowering the bar until something slips through.
Memory that survives a fresh start
Context windows end. My projects don't.
So the harness keeps a three-layer memory: a short index that loads at the start of every session, longer notes read on demand, and a full-text search over my past sessions. When I think "how did I set up that signing key last time?", the system can go read what I actually did, instead of guessing — I used that search half a dozen times while writing this very piece.
There's a quieter loop on top of it. A pattern detector watches for work I keep repeating across sessions, and when the same shape of task shows up often enough, it gets flagged as a candidate to become its own skill. The system notices when I'm doing something by hand too often, and nominates it for automation. I review the nomination; I don't let it promote itself.
The lines I refuse to automate
A harness that does everything is a harness that will eventually do something irreversible on your behalf. So the most important parts are the places it is not allowed to act:
- Publishing is manual. Drafts get generated and handed to me to review. I press the publish button. Always.
- Going live needs a human. Flipping a setting from test to live — payments, a public release — takes a separate, explicit approval, never a flag buried inside a batch of other changes.
- Nothing merges on a failing gate. A failed verify — a broken build, a security finding — blocks the merge. It is not auto-overridden because the schedule is tight.
These aren't limitations I'm slowly removing. They're the design. The leverage of an agent harness comes from automating the work; the safety comes from refusing to automate the judgment.
What this actually buys you
If you take one thing from this: when it's just you, your scarce resource isn't code. It's trust that what you just shipped is what you think you shipped.
AI gives a solo maker a lot of hands. A harness gives those hands the thing they otherwise lack — a process that won't tell you a comforting lie. That's what makes running several apps alone survivable, and it's the part no model hands you for free. You have to build it yourself.
None of this needs a big rig to start. Mine began as one agent and one rule — run the tests for real — and grew a new gate every time I caught myself shipping on a guess. If you're a solo builder, that's the whole method in one line: find the lie you tell yourself most often, and build the single check that makes it expensive.
I'm one person, building small apps this way in the hours after everyone else is asleep. Over the next pieces I'll take the harness apart one component at a time — the multi-agent code review, the memory loop, the automation I killed after eighteen days. This was the map. The rest is the territory.


Top comments (0)