NEE

Posted on Jul 4

BMAD Loop: Handing Control of the Dev Loop Back to Deterministic Code

#ai #automation #devtools #llm

If you read my earlier write-up on Story Automator, you might remember my bottom line:

During the day, running things by hand is still faster. But handing a batch of stories to it before bed and checking the results in the morning — that's the use case where it genuinely shines.

I'd left a question dangling in that piece: why was it slower than doing it by hand? At the time I said I hadn't dug into how it worked yet.

BMAD 6.10 rewrites the whole thing, renames it BMAD Loop, and incidentally answers that question. The answer is a single sentence, and it's the key to understanding the entire design:

The control loop should have no LLM in it.

First, let's clear up the most common misconception

When people first meet BMAD Loop, they assume it's a few new skills: bmad-loop-setup, bmad-loop-sweep, bmad-loop-resolve, bmad-dev-auto.

It isn't. On their own, those skills do nothing.

What actually drives the loop is a Python tool installed from Git via uv — the bmad-loop package (repo: bmad-code-org/bmad-loop). Those skills are just the "primitives" (the official word) the orchestrator calls at different stages of the loop — the most basic, individually-dispatchable units of work:

bmad-dev-auto: develop — turn intent into an artifact that survives review
bmad-loop-sweep: triage — clean up the deferred-work ledger
bmad-loop-resolve: interact — disambiguate things together with a human

In other words, the skills are the muscles; the Python orchestrator is the central nervous system. Once that clicks, every design choice below falls into place.

The core tenet: No LLM in the control loop

The README's subtitle states it bluntly:

A deterministic ralph-loop orchestrator for the BMAD-METHOD implementation phase.

"A deterministic loop orchestrator." Deterministic is the single most important word in this entire article.

It splits the development loop into two very different kinds of work:

| Work | Done by | Why |
| --- | --- | --- |
| Control logic: which story to pick, how many retries, what counts as done, whether to commit | Pure Python | Needs to be deterministic, debuggable, reproducible, and free |
| Creative work: writing code, writing tests, doing adversarial review | An LLM (in a one-shot session) | This is what LLMs are good at, and the only thing only they can do |

Go back and ask why Story Automator was slow: its control loop was stuffed with "ask the LLM by prompt what to do next" steps. Every single ask costs tokens, waits on inference, can drift, and when it drifts you ask again. Handing scheduling to an LLM is like putting a distractible, per-token-billing rookie in charge of dispatching on the assembly line.

BMAD Loop's move: swap the dispatcher for a piece of Python code that doesn't get distracted and doesn't bill by the word. The LLM only does its creative job at each station and leaves.

This buys four things, and they're the foundation for every mechanism that follows:

Deterministic: run the same sprint twice, get the same scheduling path
Debuggable: the flow is code — when something breaks you set a breakpoint or read a log, instead of guessing which prompt was unclear
Reproducible: every run's decisions are recorded as a state machine on disk
Cheap: control logic costs zero tokens
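To make the split concrete, here is a minimal sketch of a control loop with no LLM in it. It is an illustration under stated assumptions, not bmad-loop's actual code: `Story`, `run_loop`, `max_retries`, the status strings, and the `session` / `verify` callables are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Story:
    key: str
    status: str = "ready"   # hypothetical status values: ready / done / escalated
    attempts: int = 0


def run_loop(
    stories: list[Story],
    session: Callable[[Story], None],   # the only place an LLM would run
    verify: Callable[[Story], bool],    # deterministic on-disk checks
    max_retries: int = 2,
) -> list[str]:
    """Pick stories in order, retry on failed verification, then escalate."""
    journal: list[str] = []
    for story in stories:               # scheduling is plain iteration, no prompt
        while story.status == "ready":
            story.attempts += 1
            session(story)              # creative work happens here
            if verify(story):
                story.status = "done"
                journal.append(f"{story.key}: commit after {story.attempts} attempt(s)")
            elif story.attempts > max_retries:
                story.status = "escalated"
                journal.append(f"{story.key}: escalate")
            else:
                journal.append(f"{story.key}: retry")
    return journal
```

Every decision in this loop (next story, retry, escalate, commit) is an ordinary branch you can step through in a debugger. Given the same `verify` results, the journal comes out the same on every run.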

Four mechanisms that make "hands-off" safe

"No LLM in the control loop" sounds clean, but it raises a sharp question: how does the orchestrator know an LLM session finished — and finished correctly? The old answer was to make the orchestrator itself an LLM and have it "watch" the session's output. That was exactly Story Automator's burden.

BMAD Loop sidesteps that burden with four mechanisms.

Mechanism 1: every step is a one-shot session with fresh context

Dev and review are two independent sessions, and the review session never inherits the dev session's context.

This is counter-intuitive but critical. If the review session carried the memory of the dev session writing the code, it would naturally go soft on it. Humans get anchored to code they just wrote (the psychologists call it the anchoring effect), and so do LLMs. Put review in a fresh session that has never seen the dev session's conversation, and it will actually nitpick instead of rubber-stamping.

Analogy: you can't let the person who wrote the code and the person who reviews it be the same brain. Context isolation gives review a pair of eyes that has "never seen this code."
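One way to picture context isolation is to build each session's input only from artifacts on disk. The sketch below is hypothetical (`SessionInput`, `build_dev_input`, `build_review_input`, and the prompt wording are not from bmad-loop), and it assumes the reviewer is given the spec and the resulting diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionInput:
    role: str
    prompt: str


def build_dev_input(spec_text: str) -> SessionInput:
    """Input for the dev session: the spec only."""
    return SessionInput("dev", f"Implement this spec:\n{spec_text}")


def build_review_input(spec_text: str, diff_text: str) -> SessionInput:
    """Input for the review session, built from on-disk artifacts only."""
    # Deliberately no transcript parameter: the reviewer gets the spec and
    # the diff, never the dev session's conversation or reasoning.
    return SessionInput(
        "review",
        f"Review this diff against the spec.\n\nSPEC:\n{spec_text}\n\nDIFF:\n{diff_text}",
    )
```

Because the review builder has no parameter through which the dev conversation could arrive, the isolation is enforced by the function signature and does not depend on discipline.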

Mechanism 2: communicate via hook event files, never scrape panes

How does the orchestrator know a session ended? By registering hooks on the coding CLI (Claude Code / Codex / Gemini): Stop, SessionStart, SessionEnd, PreCompact. These hooks write structured event files to disk at key moments, and the orchestrator just watches those files.

Each skill, when it finishes in automation mode, writes a machine-readable result.json declaring what it produced and its status.

Old way (Story Automator):        BMAD Loop's way:
┌──────────────┐                  ┌──────────────┐
│ Orchestrator │                  │ Orchestrator │
│   (LLM)      │                  │  (Python)    │
│ reads screen │ ←fragile/costly  │  watches     │ ←stable/free/
│              │  /error-prone    │   files      │  structured
└──────────────┘                  └──────────────┘
       ↑                                 ↑
   scrape pane /                    read Stop-hook events
   read the conversation            read the skill's result.json

Pane-scraping was last era's pain: change the terminal output format or have the model say one extra sentence and the orchestrator is lost. Swap to "hooks write files, orchestrator reads files" and the interface drops from natural language down to structured data — robustness jumps a level.

Mechanism 3: Trust nothing, verify everything

This is the most hard-core part. After every LLM session, the orchestrator does not trust the session's own "I'm done" — it independently verifies on disk:

spec frontmatter status: did the story spec's status field actually flip to done?
baseline-commit match: do the files the session claims to have changed line up with git's actual diff? — a cheap "LLM lie detector"
non-empty diff: did it actually change anything?
sprint-status sync: does the status file agree with real progress?
your tests / lint: right before committing, run your own tests and lint

If every check passes, the orchestrator commits. If any check fails, it retries or escalates.

This philosophy is worth remembering on its own: LLMs hallucinate; git doesn't. Move the "is it really done?" judgment from "ask the LLM" to "look at the disk evidence," and the whole system gets solid.

Mechanism 4: the deferred-work ledger + sweep — finally, someone reads it

Loops always hit work that "can't be done right now" — an edge case waiting on another story, a decision a human has to make. You can't force it and you can't drop it, so it goes into a ledger: deferred-work.md.

What's interesting is this ledger's backstory. In earlier BMAD versions it was a famous half-feature: bmad-code-review wrote deferred items into deferred-work.md, but no skill ever read it back (the community even filed an issue about the bug). Debt written down, never repaid.

BMAD Loop's bmad-loop-sweep finally closes that loop. It does read-only triage: take every open entry in the ledger, verify it against the real codebase (grep for the symptom, check git log, read the relevant files), then partition it into five buckets:

| Bucket | Meaning | What the orchestrator does |
| --- | --- | --- |
| `already_resolved` | Later work incidentally fixed it, but nobody marked it | Closes it automatically, with evidence (file:line / commit) |
| `bundles` | Buildable now; group entries sharing a file/subsystem into one dev session | Executes the bundle |
| `blocked` | Needs a future story/epic to land first | Records the blocker and leaves it |
| `skip` | Stale, irrelevant, or explicitly excluded by the project | Skips it |
| `decisions` | A human has to decide (changing a frozen spec, changing an API shape, etc.) | Escalates to a human |
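The five buckets above map naturally onto a lookup table on the orchestrator side. The sketch below is illustrative only: the bucket names come from the table, while `Bucket`, `ACTIONS`, `plan_sweep`, and the shape of the triage dictionary are hypothetical, and bundles are simplified to a flat list of entries.

```python
from enum import Enum


class Bucket(str, Enum):
    ALREADY_RESOLVED = "already_resolved"
    BUNDLES = "bundles"
    BLOCKED = "blocked"
    SKIP = "skip"
    DECISIONS = "decisions"


ACTIONS = {
    Bucket.ALREADY_RESOLVED: "close with evidence",
    Bucket.BUNDLES: "run one dev session per bundle",
    Bucket.BLOCKED: "record blocker and leave open",
    Bucket.SKIP: "skip",
    Bucket.DECISIONS: "escalate to a human",
}


def plan_sweep(triage: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Map each triaged ledger entry to the orchestrator's action."""
    plan = []
    for bucket_name, entries in triage.items():
        bucket = Bucket(bucket_name)   # raises ValueError on an unknown bucket
        for entry in entries:
            plan.append((entry, ACTIONS[bucket]))
    return plan
```

The LLM does the judgment (which bucket an entry belongs in); what happens next is a dictionary lookup, and an unrecognized bucket name is rejected instead of being interpreted.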

"Debt gets repaid" — and repaid with evidence, not by guessing from stale ledger state. This is the mechanism that lets the loop run unattended for a long time without debt snowballing.

The whole loop in one diagram

Stack those four mechanisms together, and a story's full life cycle in BMAD Loop looks like this:

The entire control flow is Python. Only steps ② ③ ④ — the "creative stations" — are LLMs working in one-shot sessions. That's the complete meaning of "deterministic orchestrator."

Multi-model orchestration: three CLIs, mixed per role

BMAD Loop drives three coding CLIs through a generic tmux adapter: claude (default), codex, gemini. And you can mix them per stage — configured in the project's .bmad-loop/policy.toml:

[adapter]
name = "claude"          # default for every stage

[adapter.review]
name = "codex"           # but the review stage runs on codex

Why mix? Because different models are good at different things. A practical combo: have one model write the code and a different model do the adversarial review. Two models from different families scrutinizing each other tend to be harsher than one model reviewing itself. This stacks on top of mechanism 1 (review in fresh context), so review bias gets attacked from two directions.

This isn't "calling a model" anymore — it's model orchestration.

When it stops and asks you: the CRITICAL escalation

Unattended does not mean no human involvement. There's one case where the orchestrator proactively pauses the whole run to wait for you — a CRITICAL escalation.

The trigger is usually: a dev or review session finds the frozen spec (the <frozen-after-approval> block) self-contradictory, or silent on a key case, and can't safely continue. Instead of guessing or forcing, it parks the run and waits for:

bmad-loop resolve --story <story-key>

That starts an interactive session. A human (you) is present, so it asks you questions and offers 2–4 concrete options with a recommendation. Once you decide, it edits the spec itself, not the code, to remove the ambiguity, and the orchestrator then re-drives the story against a corrected, contradiction-free spec.

The design is restrained, with a few hard rules worth applauding:

the resolve session only edits spec content — it writes no feature code, runs no tests, makes no commits
it does not touch sprint-status.yaml and does not set the spec's status field — those are produced deterministically by the orchestrator on resume
if the information isn't there, or the right fix is out of scope for a spec edit (e.g. it needs a PRD/architecture change), it says plainly "I can't resolve this", writes no completion marker, and the run stays parked — the safe default

In one line: when unsure, it stops and waits for you rather than fabricating an answer and charging ahead. That's the most responsible reading of "unattended."
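The park-and-wait behavior reads naturally as a tiny state machine. The sketch below is a hypothetical model of it: `RunState`, `next_state`, and the event names are invented for illustration and are not bmad-loop identifiers.

```python
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    PARKED = "parked"


def next_state(state: RunState, event: str) -> RunState:
    """Deterministic transitions around a CRITICAL escalation."""
    if state is RunState.RUNNING and event == "critical_escalation":
        return RunState.PARKED
    if state is RunState.PARKED and event == "resolve_completed":
        return RunState.RUNNING   # spec edited and completion marker written
    # "resolve_declined" and anything unrecognized leave the state unchanged,
    # so a parked run stays parked: the safe default.
    return state
```

The important property is the fall-through: only an explicit, successful resolve un-parks the run.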

How to use it: three steps

There's one prerequisite: you need a BMAD v6 project where bmad-sprint-planning has already run and produced a sprint-status.yaml. In other words, the PRD / architecture / epics & stories / sprint planning chain has to be done first — otherwise the loop has no stories to grind.

After install (via the bmad-loop-setup skill, which installs the Python tool from Git and runs bmad-loop init to register hooks, lay down skills, and write policy.toml), the core commands are actually few:

bmad-loop init        # install bmad-loop-* skills + hooks + policy.toml + gitignore
bmad-loop validate    # preflight: config / sprint-status / git / tmux / CLI / hooks
bmad-loop run --dry-run   # print the plan without spawning anything
bmad-loop run         # go
bmad-loop tui         # …or drive everything from the dashboard

The full command list covers run / sweep / resume / resolve / decisions / status / attach / stop / clean, but 90% of daily use is the handful above. The bmad-loop tui dashboard is genuinely nice — run picker, sprint tree, deferred-work ledger, a live per-story task table, color-coded journal tail, all on one screen.
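For the "hand it a batch before bed" workflow, the validate, dry-run, run sequence can be wrapped in a short script. This sketch uses only the commands listed above and assumes, without confirmation from the source, that each command exits non-zero on failure. `STEPS` and `run_steps` are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

STEPS: list[list[str]] = [
    ["bmad-loop", "validate"],
    ["bmad-loop", "run", "--dry-run"],
    ["bmad-loop", "run"],
]


def run_steps(
    steps: Sequence[Sequence[str]] = STEPS,
    runner: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> list[str]:
    """Run each command in order; stop at the first non-zero exit code."""
    completed = []
    for cmd in steps:
        if runner(cmd) != 0:   # assumption: non-zero exit means failure
            break
        completed.append(" ".join(cmd))
    return completed
```

Calling `run_steps()` with the defaults requires `bmad-loop` on your `PATH`. Passing a fake `runner` lets you check the stop-on-failure logic without spawning anything.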

One gotcha you only hit once but need to know about: if the coding CLI has never run in the target project (e.g. claude has never been started in that directory), start it once manually first and accept the workspace-trust and hooks-approval dialogs. Sessions spawned by the orchestrator can't click those first-run dialogs for you, and the orchestrator misreads a pending dialog as a "session timeout."

Lineage: from Story Automator to BMAD Loop

Drop BMAD Loop onto the timeline and its position is obvious:

Story Automator          bmad-automator / bmad-auto          BMAD Loop
(early 2026, the          (intermediate form,                (6.10, rewritten as a
 version I tested)         tool name bmad-auto)               deterministic Python orchestrator)
      │                        │                                 │
      └──── LLM in control loop ─┴─── rewrite ───► no LLM in control loop ──┘
              (slow / costly / drifty)                (deterministic / debuggable / free)

The README is candid about it: "Inspired by the original bmad-automator (a separate, legacy project)." It explicitly treats the previous generation as legacy and itself as a ground-up rewrite.

And it positions itself as "a deterministic ralph-loop orchestrator." If you follow the autonomous-dev scene you've heard of Ralph: the technique of running a coding agent such as Claude Code in a repeating loop so that it drives its own dev cycle. BMAD Loop borrows the "ralph-loop" pattern (unattended, repeatedly-iterating small loops) but implements it deterministically on top of BMAD's story system. So it's "Ralph's spirit + BMAD's skeleton + a Python core."

Who it's for, who it's not for

Keeping the honest tone from my Story Automator review, here's a no-hype take.

Good fits:

You've gone through the full BMAD planning chain (PRD → architecture → epics → sprint planning) and have a string of clear, independently-implementable stories
You're fine with async "run it overnight" delivery — check results in the morning, not by staring at it live
Your project has reliable tests and lint (mechanism 3's last gate leans on them), otherwise verify is theater
You want multi-model cross-review without manually switching CLIs

Not for / be careful:

Stories are still vague or dependencies tangled — running those through the loop mostly generates a pile of CRITICAL escalations, which is more tiring than helpful
Projects with no tests — the orchestrator can be as smart as it likes, but the final verify gate spins empty, which is basically running without a net
Expecting "fast + good + fully automatic" — deterministic orchestration makes it more stable and cheaper, but a single story's absolute speed isn't necessarily faster than a practiced human hand-running it. Its value is in batch, async, resumable — not single-point speedup

The biggest difference from the previous generation, and the reason I'm most optimistic about it: because the control loop is deterministic code, it's debuggable, reproducible, and trustworthy. The "why is it so slow" black box from the Story Automator era is finally opened — the flow is Python, and you can see exactly what each step is doing and why it decided that. That alone is enough to graduate it from "experiment" to "something you can seriously use."

Closing

From v6.8's "locking intent" (have AI nail down what you actually want first) to v6.10's BMAD Loop (deterministic code as the dispatcher, LLMs only writing code), BMAD's direction across these releases has been remarkably consistent:

Take back, one job at a time, the work that shouldn't be done by an LLM.

Lock down what should be locked (intent → SPEC). Make deterministic what should be deterministic (scheduling → Python). Hand to a human what a human should decide (park the run and wait). LLMs get progressively narrowed down to the creative work they're actually good at.

This isn't distrust of LLMs — it's respect for them: don't make them do what they're bad at, hallucinate at, and bill you per token for.

If you're using BMAD on a project, I'd genuinely recommend trying BMAD Loop on one sprint for real. Even if only so the deferred-work.md ledger — "the debt no one ever repays" — finally has someone managing it.

References:

BMAD Loop repo: bmad-code-org/bmad-loop
BMAD Method docs: docs.bmad-method.org
Previous generation (my review): BMAD Story Automator
The v6.8 context: BMad v6.8 — the "intent-locking" era

DEV Community