Mario Hayashi

Originally published at blog.mariohayashi.com

An autonomous dev pipeline for one

If you’re a solo engineer or a technical founder wearing every hat, the gap between planning and implementation is shrinking. One person can own a product while a harness runs the loop. There’s no magic in the stack: bash, cron, tmux and Claude glued together until the behaviour is reliable enough to trust.

The question I keep coming back to is how much implementation work can be delegated in a way that fails safely and produces code that can go from good to great with human review. Worker agents pick up tasks, validate code, open pull requests and address review feedback. I decide what to build and whether to merge. Everything in between is what I’m trying to automate.

The scripts, prompts and guardrails are all still in flux. I update them whenever something breaks. This post is a snapshot of where the setup is today.

Agent workers waiting for an issue


Ralph loop

The foundation is Geoffrey Huntley’s Ralph loop: an autonomous coding cycle that runs task, implementation, testing, PR creation and context reset on repeat. Each iteration gets a fresh context window by design. Memory lives in git and structured files, not in the model.

I spend a lot of energy on isolating phases: plan, build, test, verify. Each phase is clearly defined, can fail on its own and can be retried without affecting others. That matters because agents tend to do well on short focused work and degrade when scope increases. Structure is the differentiator, not a bigger model.

The same principles I applied to a Xero expense auditing CLI apply here: code and framework first, model second. Rules and the framework handle the main flow; AI steps in for the parts that need judgement, including issue creation, implementation and PR summaries.


Under the hood

There is no special runtime per se. Cron schedules jobs, tmux manages parallel worker shells, shell scripts handle state transitions and Claude runs with streamed JSON so that results can be parsed. GitHub Issues serve as the queue, system of record and state machine. If that sounds fragile, it is! This is why validating the output of each phase matters.
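
The scheduling half of that glue fits in a crontab. A minimal sketch, assuming hypothetical paths and cadences (the real scripts and schedules are the author's own):

```shell
# Hypothetical crontab sketch; paths and schedules are assumptions.
@reboot    tmux new-session -d -s workers '/opt/pipeline/worker-loop.sh'
* * * * *  /opt/pipeline/unblock.sh
* * * * *  sleep 30 && /opt/pipeline/unblock.sh   # cron's floor is one minute; two offset entries give ~30s
0 3 * * 1  /opt/pipeline/refactor-scan.sh
```

The offset pair is a standard trick for sub-minute cadence, since cron itself can't schedule below one minute.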


Architecture overview

      HUMAN                 AUTOMATION                  GITHUB
      ─────                 ──────────                  ──────

    ┌───────┐   manage.sh   ┌────────────┐
    │ Ideas │──────────────▶│ PRD draft  │
    │  .md  │               │    .md     │
    └───────┘               └─────┬──────┘
                                  │
                          prd.sh (Claude)
                          Refine, approve
                                  │
                                  ▼
                            ┌──────────┐    plan.sh    ┌───────────────┐
                            │   PRD    │──────────────▶│ GitHub Issues │
                            └──────────┘               │  ready        │
                                                       │  blocked      │
                                                       │  in-progress  │
                                                       │  done         │
                                                       └──────┬────────┘
                                                              │
                  worker-loop.sh picks up “ready” issues      │
             ┌────────────────────────────────────────────────┘
             │
             ▼
   ┌───────────────────┐
   │  tmux: N workers  │
   │                   │
   │  ┌─────────────┐  │
   │  │  Worker 1   │──┼──▶  Each worker runs in
   │  │ (worktree)  │  │     its own git worktree
   │  ├─────────────┤  │     with a Ralph loop
   │  │  Worker 2   │──┼──▶
   │  │ (worktree)  │  │     branch ──▶ code ──▶ PR
   │  ├─────────────┤  │
   │  │  Worker N   │──┼──▶
   │  │ (worktree)  │  │
   │  └─────────────┘  │
   └─────────┬─────────┘
             │ opens PRs
             ▼
      ┌──────────────┐        ┌──────────────────┐
      │     Pull     │◀───────│    unblock.sh    │
      │   Requests   │        │   (cron ~30s)    │
      └──────┬───────┘        │ relabels issues, │
             │                │  unblocks deps   │
             ▼                └──────────────────┘
      ┌──────────────┐
      │ Human merges │
      └──────────────┘

Issues have labels like “ready”, “in-progress”, “blocked”, “done”, “failed”. Git is the memory and GitHub is the database. I may change parts of this setup completely later. For now, I want cheap iteration and an easy-to-follow trail.


What each part does

1. Backlog capture

“manage.sh” is the entrypoint. When a new backlog item comes in, Claude generates clarifying questions specific to that item rather than asking generic questions. The answers go into a Markdown file and feed into the PRD. So by the time planning starts, the context is there.
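
The question step might look something like this sketch, which writes Claude's item-specific questions out for the human to answer. The file layout, prompt wording and `.result` extraction are assumptions, not the author's exact code:

```shell
# Hypothetical sketch: generate clarifying questions for one backlog item
# and append them to a context file for the human to answer.
capture_questions() {
  item_file="$1"
  claude -p "Read this backlog item and ask 3-5 clarifying questions specific to it: $(cat "$item_file")" \
    --output-format json |
    jq -r '.result' >> "${item_file%.md}-context.md"
}
```

The answered context file then feeds into PRD generation.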

2. PRD generation

Claude turns the backlog item into a PRD in an interactive session. I can approve it, edit it directly or ask for a rewrite. Once approved, it gets committed and pushed. From that point it’s the working source of truth. I update the PRD when the plan diverges from reality.

3. Planning

The planner gets “Read”, “Glob” and “Bash”. No “Edit” and no “Skill”, on purpose. Skills use hooks that nudge the planner toward implementing, which defeats the point of a dedicated planning phase. The prompt requires JSON output; anything else is treated as a failure.
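
The tool restriction can be expressed directly on the CLI invocation. A sketch, assuming the Claude Code CLI's `--allowedTools` flag and a hypothetical prompt variable:

```shell
# Hypothetical sketch: run the planner with a read-only tool set so it
# can inspect the repo but cannot edit files or trigger skills.
run_planner() {
  claude -p "$1" \
    --allowedTools "Read,Glob,Bash" \
    --output-format json
}
```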

Two constraints run at planning time. If a task has too many acceptance criteria or affects too many files, Claude splits it before any issues are created. Large scope is the fastest way to fill up a context window. The planner also checks for blocking dependencies.
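
The size check can run as a plain predicate over the planner's JSON before any `gh issue create`. A sketch with assumed thresholds and an assumed task shape (`acceptance_criteria` and `files` arrays):

```shell
# Hypothetical sketch: flag a planned task as too large if it has too
# many acceptance criteria or touches too many files. Thresholds are
# assumptions, not the author's real limits.
needs_split() {
  criteria=$(printf '%s' "$1" | jq '.acceptance_criteria | length')
  files=$(printf '%s' "$1" | jq '.files | length')
  [ "$criteria" -gt 5 ] || [ "$files" -gt 8 ]
}
```

A task that trips the predicate goes back to Claude to be split before issues are filed.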

4. Workers (Ralph loop)

Each worker iteration runs: pull “main”; claim the next “ready” issue with an atomic lock; skip refactor issues if feature or bug issues are available; check for blocking issues; create a git worktree for isolation; run Claude with streamed JSON; run validation (types, imports, tests); retry if validation fails; then commit, push and open the PR.

The git worktree matters most when workers run in parallel. Without it, they will step on each other’s toes.
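
Condensed into shell, one iteration might look like this sketch. Script names, labels and the validation step are assumptions, not the author's exact code, and the atomic lock and retry logic are elided:

```shell
# Hypothetical sketch of one worker iteration.
run_iteration() {
  git pull --ff-only origin main
  issue=$(gh issue list --label ready --limit 1 --json number --jq '.[0].number')
  [ -n "$issue" ] || return 0                            # queue is empty
  gh issue edit "$issue" --add-label in-progress --remove-label ready
  wt="worktrees/issue-$issue"
  git worktree add "$wt" -b "issue-$issue" origin/main   # isolation per worker
  ( cd "$wt" &&
    claude -p "Implement issue #$issue" --output-format stream-json --verbose &&
    ./validate.sh ) || { gh issue edit "$issue" --add-label failed; return 1; }
  git -C "$wt" push -u origin "issue-$issue"
  gh pr create --head "issue-$issue" --fill
}
```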

5. Model routing

The planner adds a size estimate in every issue body. Workers read it at runtime: small and medium work goes to Haiku, large work goes to Sonnet or Opus for the heavier reasoning. I’ll keep adjusting this as I go.
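
The routing itself is a one-case lookup. A sketch, with the default tier being my assumption rather than the author's stated choice:

```shell
# Hypothetical sketch: map the planner's size estimate to a model tier.
pick_model() {
  case "$1" in
    small|medium) echo "haiku"  ;;
    large)        echo "sonnet" ;;   # or opus for the heaviest reasoning
    *)            echo "sonnet" ;;   # unknown estimates get the safer tier (assumption)
  esac
}
```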

6. Refactor loop

Individual units of work shipped quickly will, over time, pull the codebase in different directions. A script scans for refactor opportunities, scores them by impact and opens issues labelled “refactor” and “ready”. Cron runs it weekly for P0 issues and biweekly for P1. Workers always perform product work (features, bugs) before picking up refactors.

7. Review feedback

A script scans open PRs for “CHANGES_REQUESTED” reviews, pulls both the general comments and the inline line-level notes, and creates a “review-feedback” issue containing the branch name. Workers pick up these review-feedback issues, check out the existing branch, push the fix and close the loop. No second PR is opened; the whole feedback loop lives in one PR.
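
The scan maps cleanly onto the `gh` CLI. A sketch, where the JSON field names (`number`, `headRefName`, `reviewDecision`) follow `gh pr list` and the labels and issue body format are assumptions:

```shell
# Hypothetical sketch: file one review-feedback issue per PR with
# requested changes, carrying the branch name so a worker can resume it.
file_review_feedback() {
  gh pr list --state open --json number,headRefName,reviewDecision \
    --jq '.[] | select(.reviewDecision == "CHANGES_REQUESTED") | "\(.number) \(.headRefName)"' |
  while read -r pr branch; do
    gh issue create \
      --title "Review feedback for PR #$pr" \
      --label review-feedback --label ready \
      --body "Branch: $branch"
  done
}
```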


Design principles

Every task gets a fresh context window. State lives in files or GitHub labels, not in a running conversation. Validation runs after every agent session, before the PR is opened. Three consecutive failures trip the circuit breaker and stop the loop. Where the human (you) steps in is approving the PRD and merging the PR. Everything in between is what I’m trying to run autonomously while not trading off too much quality.
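
The circuit breaker can be as small as a counter file. A minimal sketch; the file name and reset-on-success behaviour are assumptions:

```shell
# Hypothetical sketch: count consecutive failures in a file and stop
# the loop once the threshold is hit. Any success resets the counter.
note_failure() {
  n=$(( $(cat .failures 2>/dev/null || echo 0) + 1 ))
  echo "$n" > .failures
  if [ "$n" -ge 3 ]; then
    echo "circuit breaker tripped after $n failures" >&2
    return 1
  fi
}
note_success() { echo 0 > .failures; }
```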


What broke

Despite restricting the planner to “Read”, “Glob” and “Bash”, it kept writing code. The “Skill” tool was loading hooks that instructed the agent to implement code despite what the rest of the prompt said. Removing “Skill” entirely and prepending a hard system instruction fixed it for now.

PR summaries were always empty. The culprit was “--output-format json” combined with “2>&1”, which mixed hook events and result JSON into one stream the parser couldn’t untangle. Switching to “--output-format stream-json --verbose” and filtering the stream for the final result line fixed the bug.
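
The fixed version amounts to keeping stderr out of the pipe and selecting one event type from the JSONL stream. A sketch, assuming the stream's final event is `{"type":"result","result":…}` as in Claude Code's stream-json output:

```shell
# Hypothetical sketch: stderr goes to a log, not the JSON stream, and
# jq keeps only the final "result" event from the JSONL output.
run_and_summarise() {
  claude -p "$1" --output-format stream-json --verbose 2>/dev/null |
    jq -r 'select(.type == "result") | .result'
}
```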

The planner was also linking PRs as blocker dependencies. GitHub issues and PRs share a number namespace, so without an explicit check the planner would happily link to a PR and create a dependency that would never resolve (as issues were expected instead of PRs). Amateur mistake? Yes.
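
One way to add that explicit check: GitHub's REST issues endpoint returns a `pull_request` key when the number belongs to a PR, which distinguishes the two despite the shared namespace. A sketch with an assumed `GH_REPO` variable:

```shell
# Hypothetical sketch: confirm a dependency number is a real issue, not
# a PR, before the planner links it as a blocker. GH_REPO="owner/repo".
is_real_issue() {
  [ "$(gh api "repos/$GH_REPO/issues/$1" --jq 'has("pull_request") | not')" = "true" ]
}
```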


Metrics

Each run adds a CSV row with timestamp, issue number, event type, outcome, duration, model and size estimate. The goal is to build up enough data to see which estimate tiers fail most often and whether routing large tasks to Sonnet actually pays off. The data isn’t telling me much yet, but I hope to see some insights months down the line.
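
The logger itself is a one-line append. A sketch, where the column order follows the list above and the file name is an assumption:

```shell
# Hypothetical sketch: append one metrics row per run to a CSV.
# Columns: timestamp, issue, event, outcome, duration_s, model, size.
log_run() {
  printf '%s,%s,%s,%s,%s,%s,%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4" "$5" "$6" >> metrics.csv
}
```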


What I’m looking to try next

An interesting extension is specialised sub-agents for test passes, QA (Browserbase? Playwright?) and doc updates that run as part of the merge loop. Right now everything goes through the same worker loop regardless of type. Separating those concerns should improve quality.

I also want tracing per phase: worktree setup, agent run, validation, PR creation. Clustering failures by phase would make it much faster to see where the pipeline is spending time or falling over.

The longer-term experiment is a closely observed LLM-as-judge: a second pass that scores whether code changes match the issue description, whether test coverage is adequate and whether a review comment was fully addressed. It could reduce noise before human review but I’m mindful of how it’s incorporated into the flow.

Workers running


Summary

Model quality is not the bottleneck. State, isolation, validation and human review are. The infrastructure is the differentiator.

Implementing things by hand already feels like the distant past. A clear roadmap and a dependable harness can go a long way toward doing the work of many. I’m still cobbling together the pieces and iterating on the flow. If your setup looks different, I’d love to hear about it!


I write more posts like this at blog.mariohayashi.com. Follow me on X @logicalicy.
