Mixture of Experts

Posted on Jun 9

Build reliable long running agents w/ verification, worktrees, skills, subagents, & HIL/review gates

#ai #programming #productivity #coding

There’s been a lot of buzz around loops and workflows in the past few days. Many people have chimed in to say they’ve been doing this all along, and there are posts about how it works, but few code examples that hold up in production codebases. At the end of the day, the goal is to make long-running agents reliable and steerable inside real codebases. This article shows how to do that and how to build your own, with the actual code provided.

You’ll see how an engine was built to power these loops or workflows. The terms are used interchangeably here because it doesn’t matter which word is used to churn hype; what matters is how it works and what it does for agent reliability. You’ll also see the path to this architecture, why it was built the way it was, and then the code (skip to whatever part you want).

There’s been good work to properly define what a loop is, but how do you actually build one, and how do you build it so that it is reliable and doesn’t burn tokens? @mvanhorn had a good writeup on the history of the concept, how developers are employing it, and why there is still a major gap between the hype and AI in real-world deployments, because you do need a “production” version of a loop:

Which is why every serious 2026 write-up on loops converges on the same three hard stops: a maximum iteration count, no-progress detection, and a token or dollar budget ceiling. The romantic version of loops is that you write the loops and a thousand agents build your company overnight. The production version is that you write the loops, and most of your job is making sure they halt. Gartner puts agentic AI at the peak of inflated expectations, with only about seventeen percent of organizations actually deploying agents. The gap between the timeline and the receipts is the real state of play.

Additionally, @dexhorthy clearly pointed out that if you start looping through everything without a care, you’ll end up with slop and a codebase you no longer understand:

Here’s what’s gonna happen:

you replace your code review with feedback loops (sentry, datadog, support tickets, etc)

you stop reading the code

software factory fixes everything

one day something breaks at 3am, agent can’t fix it

nobody’s read the code in 3 months

you have 3 weeks of downtime trying to re-onboard and fix it

you lose significant % of your contracts and users

your company is now dead.

If you’ve been trying to figure this out, you have probably noticed that coding agents don’t scale when it comes to work that is complex and ambiguous. You may also disagree with the sentiment that all you need is the model and for it to delegate its own orchestration. Real production code needs management across dependencies and teams, and you don’t have infinite tokens to burn. We feel the same.

We started with a harness that orchestrated multiple agents into workflows powered by the Opencode, Claude Code, and GitHub Copilot CLI SDKs. This approach didn't scale well because the SDKs all have slightly different interfaces, each has its own limitations for what you may want control over, and, lastly, as these teams are shipping fast, they frequently introduce bugs.

Then we learned about Pi and started exploring it as a minimal coding agent implementation. Pi is great because it is a simple harness and still fully extensible. The benefit of Pi is that you can leverage its large ecosystem of extensions/tools/MCPs, and it is a well-maintained project. We took Pi and used it as the runtime with control over its functionality and security, so we could focus on building a powerful engine for the coding agent to execute autonomously rather than worry about supporting multiple interfaces, squashing bugs, and managing three separate backends. The migration to Pi was the third rewrite of the system, but well worth it.

The next challenges we ran into were around traditional orchestration, reliability, and observability. The coding agent is an incredibly powerful primitive, but it became clear that it needs proper scaffolding with observability to work. From working with coding agents, you know that you need to be able to pass the proper context, steer, and have an easy way to observe what is going on so you can interrupt if things go wrong.

You also probably know that a lot of coding agent work feels reactive instead of proactive. It gets just close enough to what you want, then falls apart in the last 30% of the work, or worse, goes completely off track. You, like most, probably still spend most of your effort manually prompting, writing markdown programs, or running the same skills over and over. Given all of this, it makes sense to look for a way to automate the process.

After spending many hours thinking about problems around testing, intent alignment, and the software development lifecycle, and speaking to developers about what they saw, experienced, and, very importantly, felt, we ended up with the following features to solve these challenges: review gates, parallel execution, human-in-the-loop gates, pause/resume for any stage in a loop, mid-run steering, verbatim compaction (not what you see in coding agents today), and per-stage model choice so you can measure and manage your costs. Honestly, each part of the system has been tuned specifically for long-running coding-agent loops and deserves a blog post of its own on its architecture and how it was conceived. (Let us know if that is interesting, by the way.)

This is not trivial to construct. There is a reason Claude Code dynamic workflows don’t have any of these features: frankly, it’s hard to build, and you have to go back to software engineering principles. Sorry, but the model won’t save you here. There is also a reason Peter, who shared the “loops heard round the world” tweet, didn’t share his setup. He has likely optimized it heavily around the OpenClaw ecosystem, and it would require serious reworking to work generally on all codebase shapes. So you need a solution that has good DX and works across different codebases and teams.

So that is what we built. It’s called Atomic. You can call it a workflow engine, a loop, a loop engine, or workflows; it doesn’t matter. The point is how it works and how easy it is to build your own.

The inner loop is the traditional ReAct loop, where a model executes until it finishes making tool calls. In contrast, the outer loop organizes more complex, long-horizon tasks like software engineering into atomic units (like GH issues) and delegates each to a separate instance of an agent. You can take it a step further and ensure your inner loop is consistent and verifiable too. This is where static and dynamic workflows come in. This is not done through simple prompts, but rather by giving the model a workflow meta-tool in the inner loop: the ability to define its own subroutines through subagent chaining and tool calling.

Atomic treats the coding agent as the inner loop and the workflow runtime as the outer loop.

The inner loop is still the normal agent loop: model reads context, calls tools, observes results, repeats until it produces an answer. Atomic does not replace that. The workflow engine wraps it with a typed, observable, resumable execution layer that decides which agent session runs, with what context, on what model, with which tools, in what order, and with what validation or human gate before continuing.

A workflow in Atomic is a TypeScript module, not just a prompt. See code at the bottom of the post for what the public API looks like.

The important part is the boundary it creates. Every ctx.task, ctx.chain, ctx.parallel, ctx.stage, ctx.workflow, and ctx.ui.* call creates explicit runtime structure. Atomic can see the graph, persist the state, attach to a stage, pause it, resume it, steer it, kill it, inspect transcripts, and preserve artifacts. That is very different from asking one model to “please do the following steps.”

Atomic supports both static and dynamic workflows.

A static workflow is the versioned TypeScript definition of a workflow. It has declared inputs, declared outputs, stage names, model choices, fallback chains, concurrency limits, human gates, worktree options, and artifact paths. You can commit it to .atomic/workflows, ship it through an Atomic package, or bundle it into the product.

A dynamic workflow is when the agent uses the workflow tool at runtime to create a tracked one-off task, chain, or parallel fan-out without a saved workflow file. That gives the model a workflow meta-tool inside its normal ReAct loop. Instead of merely saying it will ask three agents, it can actually spawn three tracked stage sessions, give each one clean context, collect their outputs, and synthesize them. If the pattern proves useful, you can promote it into a real TypeScript workflow.

The execution model is a DAG, but the developer does not need to manually draw the DAG. Atomic infers it from runtime control flow. Sequential awaits become dependent stages. ctx.parallel or Promise.all creates concurrent branches. Loops can create repeated stage groups. Child workflows called with ctx.workflow(...) are nested under the parent and shown in the same expanded graph. This matters because the graph is not just visualization, it becomes the control plane.
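To make the idea of inferring the graph from control flow concrete, here is a minimal sketch. It is not Atomic's implementation: `GraphTracer`, `StageNode`, and the `frontier` bookkeeping are hypothetical names chosen for illustration. The only idea taken from the text is that a sequential step depends on whatever finished before it, while parallel branches share the same dependencies and do not depend on each other.

```typescript
// Hypothetical tracer, for illustration only. Not Atomic's API.
interface StageNode {
  name: string;
  dependsOn: string[];
}

class GraphTracer {
  readonly nodes: StageNode[] = [];
  private frontier: string[] = []; // stages the next stage must wait for

  // A sequential step depends on everything in the current frontier.
  task(name: string): void {
    this.nodes.push({ name, dependsOn: [...this.frontier] });
    this.frontier = [name];
  }

  // Parallel branches share the same dependencies and are independent of each other.
  parallel(names: string[]): void {
    if (names.length === 0) return; // nothing to record, frontier unchanged
    const deps = [...this.frontier];
    for (const name of names) {
      this.nodes.push({ name, dependsOn: [...deps] });
    }
    this.frontier = [...names];
  }
}

// Recording the same shape as a scout, two reviewers, and a synthesis step.
const tracer = new GraphTracer();
tracer.task("scout");
tracer.parallel(["correctness-review", "maintainability-review"]);
tracer.task("synthesis");
```

After these calls, `tracer.nodes` holds four nodes: both reviewers depend only on `scout`, and `synthesis` depends on both reviewers. A real runtime would record this while the stages actually execute, which is what lets the graph double as a control plane.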

The runtime tracks:

stage name and status
input/output contracts
session file and transcript path
model and reasoning effort
fallback model attempts
errors and warnings
artifacts and output files
live pause/resume/interrupt handles
pending human input
nested workflow boundaries

The core design principle is that large context moves through files and artifacts, not through the model prompt. Small handoffs can use previous or {previous}. Large handoffs should be written to files with outputMode: "file-only" and passed forward with reads. This avoids the common failure mode where every stage inherits the full transcript of every prior stage and token usage explodes.
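A small sketch of why file-based handoffs keep prompts small. Everything here is an assumption made for illustration: the in-memory `artifacts` map stands in for files on disk, and `writeArtifact`, `buildPrompt`, and `Handoff` are hypothetical helpers, not part of Atomic.

```typescript
// In-memory stand-in for files on disk. Hypothetical, for illustration only.
const artifacts = new Map<string, string>();

interface Handoff {
  path: string;
  chars: number;
}

// Stores a stage's large output and returns a small handle to it.
function writeArtifact(path: string, content: string): Handoff {
  artifacts.set(path, content);
  return { path, chars: content.length };
}

// Builds the next stage's prompt from handles only, never from the content itself.
function buildPrompt(instruction: string, reads: Handoff[]): string {
  const refs = reads.map((r) => `- ${r.path} (${r.chars} chars)`).join("\n");
  return `${instruction}\nRead these files before answering:\n${refs}`;
}
```

With this shape, a 50,000-character scout report adds one short line to the downstream prompt. The stage that needs the detail reads the file with its own tools, and stages that do not need it never pay for it.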

Atomic also separates context modes. Implementation stages can use forked context when continuity matters. Reviewer stages should usually use fresh context so they are not biased by the implementation agent’s reasoning. That distinction is one of the simplest ways to make review loops more reliable: the reviewer reads the diff, artifacts, tests, and criteria, not the implementer’s biased thinking.
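The difference between the two context modes can be sketched in a few lines. This is an assumption about what "fresh" and "forked" mean at the message level, not Atomic's internals; `buildStageContext` and `Message` are hypothetical names.

```typescript
type ContextMode = "fresh" | "forked";

interface Message {
  role: "user" | "assistant";
  text: string;
}

// "forked" starts from a copy of the parent's transcript;
// "fresh" starts empty, so the stage sees only its own task.
function buildStageContext(mode: ContextMode, parent: Message[], task: string): Message[] {
  const base: Message[] = mode === "forked" ? parent.map((m) => ({ ...m })) : [];
  const taskMessage: Message = { role: "user", text: task };
  return [...base, taskMessage];
}
```

A reviewer built with `"fresh"` never sees the implementer's reasoning, only the task it is given (which would point at the diff and artifacts). A forked stage gets a copy, so edits to its transcript do not leak back into the parent.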

Self-verification is also a key part of the design. Atomic’s built-in Goal Runner and Ralph loops do not let the worker declare completion by itself: the worker leaves receipts, then fresh-context reviewers inspect the actual diff, spec, implementation notes, artifacts, and validation expectations. Those reviewers can run or delegate focused validation and must return structured review decisions; malformed decisions, reviewer errors, or unresolved validation gaps fail closed instead of approving. The review round is saved as an artifact, and the loop only stops when the structured verdict or reducer says the work is clean. That turns “the model says it checked its work” into a verification stage another agent or a human can inspect.
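Failing closed is easier to see in code than in prose. The sketch below is not the Goal Runner or Ralph implementation: the JSON shape, the `Verdict` values, and the `parseDecision` and `isClean` helpers are all assumptions made for illustration. What it does take from the text is the rule that a malformed decision counts as a rejection and that the final verdict is reduced deterministically.

```typescript
// Hypothetical decision shape. The real structured format is not shown in the post.
type Verdict = "approve" | "request_changes";

interface ReviewDecision {
  verdict: Verdict;
  findings: string[];
}

// Parses raw reviewer output. Anything malformed becomes a rejection (fail closed).
function parseDecision(raw: string): ReviewDecision {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data === "object" && data !== null) {
      const { verdict, findings } = data as { verdict?: unknown; findings?: unknown };
      const validVerdict = verdict === "approve" || verdict === "request_changes";
      const validFindings =
        Array.isArray(findings) && findings.every((f) => typeof f === "string");
      if (validVerdict && validFindings) {
        return { verdict: verdict as Verdict, findings: findings as string[] };
      }
    }
  } catch {
    // Invalid JSON falls through to the rejection below.
  }
  return { verdict: "request_changes", findings: ["malformed reviewer decision"] };
}

// Deterministic reducer: clean only if at least one reviewer ran
// and every reviewer approved with no findings.
function isClean(decisions: ReviewDecision[]): boolean {
  return (
    decisions.length > 0 &&
    decisions.every((d) => d.verdict === "approve" && d.findings.length === 0)
  );
}
```

Note that an empty list of decisions is not clean either. If no reviewer produced output, the loop keeps going (or stops on one of its limits) rather than approving by default.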

Human input is also part of the runtime. A workflow can call ctx.ui.input, ctx.ui.confirm, ctx.ui.select, or ctx.ui.editor at the exact point where a decision is needed. The run enters an awaiting-input state, the prompt appears in the workflow UI, and the answer is routed back to the correct stage. This is how you build approval gates, review gates, release gates, and “stop before destructive action” behavior without relying on the model to remember a markdown instruction.
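Here is a minimal sketch of how a human gate can park a stage and route the answer back to it. It assumes nothing about how `ctx.ui.confirm` is implemented; `HumanGates`, `PendingGate`, and `releaseStage` are hypothetical names, and the sketch allows only one pending question per stage.

```typescript
// Hypothetical gate registry. Atomic's ctx.ui.* internals are not shown in the post.
interface PendingGate {
  question: string;
  resolve: (approved: boolean) => void;
}

class HumanGates {
  private waiting = new Map<string, PendingGate>();

  // Called from a stage. The stage is parked until a human answers.
  confirm(stage: string, question: string): Promise<boolean> {
    return new Promise<boolean>((resolve) => {
      this.waiting.set(stage, { question, resolve });
    });
  }

  // Stages currently awaiting input, for display in a UI.
  pending(): { stage: string; question: string }[] {
    const out: { stage: string; question: string }[] = [];
    this.waiting.forEach((gate, stage) => out.push({ stage, question: gate.question }));
    return out;
  }

  // Routes a human answer to the stage that asked. Returns false if nothing is waiting.
  answer(stage: string, approved: boolean): boolean {
    const gate = this.waiting.get(stage);
    if (gate === undefined) return false;
    this.waiting.delete(stage);
    gate.resolve(approved);
    return true;
  }
}

// Example stage: the destructive step runs only after explicit approval.
async function releaseStage(gates: HumanGates, deploy: () => void): Promise<string> {
  const approved = await gates.confirm("release", "Deploy to production?");
  if (!approved) return "skipped";
  deploy();
  return "deployed";
}
```

The point of the design is that the gate lives in the runtime, not in the prompt. The deploy callback cannot run until `answer("release", true)` is called, no matter what the model remembers or forgets.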

Reliability comes from making the outer loop explicit. You can set a max iteration count. You can require structured reviewer outputs. You can run independent reviewers and reduce their decisions deterministically. You can fail closed when declared outputs do not validate. You can retry provider failures with fallbackModels. You can isolate work in git worktrees. You can cap output size. You can pause or interrupt a runaway stage and resume it with a steering message. These are boring software engineering controls, but they are exactly what makes long-running agent work survivable.
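The three hard stops quoted earlier (a maximum iteration count, no-progress detection, and a budget ceiling) fit in one small function. This is a sketch under stated assumptions, not Atomic's loop: `runGuardedLoop`, `IterationResult`, and `LoopLimits` are hypothetical, progress is detected by comparing a caller-supplied fingerprint (for example a hash of the diff), and the step is synchronous for brevity.

```typescript
// Hypothetical types. None of these names come from Atomic's API.
type StopReason = "done" | "max-iterations" | "no-progress" | "budget-exceeded";

interface IterationResult {
  done: boolean; // the verifier says the work is clean
  fingerprint: string; // e.g. a hash of the working-tree diff
  tokensUsed: number; // tokens consumed by this iteration
}

interface LoopLimits {
  maxIterations: number;
  maxStalledIterations: number; // consecutive iterations with an unchanged fingerprint
  tokenBudget: number;
}

function runGuardedLoop(
  step: (iteration: number) => IterationResult,
  limits: LoopLimits,
): { reason: StopReason; iterations: number; tokensUsed: number } {
  let tokensUsed = 0;
  let stalled = 0;
  let lastFingerprint: string | undefined;

  for (let i = 1; i <= limits.maxIterations; i++) {
    const result = step(i);
    tokensUsed += result.tokensUsed;

    if (result.done) return { reason: "done", iterations: i, tokensUsed };
    if (tokensUsed >= limits.tokenBudget) {
      return { reason: "budget-exceeded", iterations: i, tokensUsed };
    }

    // Count consecutive iterations that left the fingerprint unchanged.
    stalled = result.fingerprint === lastFingerprint ? stalled + 1 : 0;
    lastFingerprint = result.fingerprint;
    if (stalled >= limits.maxStalledIterations) {
      return { reason: "no-progress", iterations: i, tokensUsed };
    }
  }
  return { reason: "max-iterations", iterations: limits.maxIterations, tokensUsed };
}
```

Every exit path returns a named reason, so a halted run can be told apart from a finished one when you inspect it later. The budget is checked after each iteration, which means a single iteration can overshoot it; a stricter version would also cap tokens per step.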

Model choice and cost are controlled per stage as well. A workflow can use a cheap model for classification, a stronger model for implementation, a different model for review, and a fallback chain for critical stages. Model strings can include reasoning effort, e.g. openai/gpt-5.5:high or anthropic/claude-haiku-4-5:off, so cost and latency are set per stage instead of globally. Parallelism is explicit too, so a broad fan-out is a deliberate choice rather than the model deciding a path and burning tokens.
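As an illustration of per-stage model strings and fallback chains, here is a sketch that parses the `provider/model:effort` form and tries models in order. The grammar is inferred from the two examples in the text and may not match Atomic's real parser; `parseModel`, `runWithFallback`, and `ModelSpec` are hypothetical, and the call is synchronous for brevity where a real provider call would be async.

```typescript
// Assumed grammar, inferred from the examples in the text: "provider/model[:effort]".
interface ModelSpec {
  provider: string;
  model: string;
  effort?: string; // e.g. "high" or "off"
}

function parseModel(spec: string): ModelSpec {
  const slash = spec.indexOf("/");
  if (slash <= 0) throw new Error(`expected "provider/model[:effort]", got "${spec}"`);
  const provider = spec.slice(0, slash);
  const rest = spec.slice(slash + 1);
  const colon = rest.lastIndexOf(":");
  const model = colon === -1 ? rest : rest.slice(0, colon);
  if (model.length === 0) throw new Error(`missing model name in "${spec}"`);
  if (colon === -1) return { provider, model };
  const effort = rest.slice(colon + 1);
  if (effort.length === 0) throw new Error(`empty effort in "${spec}"`);
  return { provider, model, effort };
}

// Tries each model in order and records which ones failed.
function runWithFallback<T>(
  models: string[],
  call: (model: ModelSpec) => T,
): { value: T; used: string; failed: string[] } {
  // Parse everything first so a malformed spec is a config error, not a "provider failure".
  const candidates = models.map((spec) => ({ spec, parsed: parseModel(spec) }));
  const failed: string[] = [];
  for (const candidate of candidates) {
    try {
      return { value: call(candidate.parsed), used: candidate.spec, failed };
    } catch {
      failed.push(candidate.spec);
    }
  }
  throw new Error(`all models failed: ${failed.join(", ")}`);
}
```

Returning `used` and `failed` alongside the value keeps the fallback attempts visible, which matters when you later want to know what a stage actually cost and why.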

Observability is also a core mechanism. Atomic gives you /workflow status, /workflow connect, /workflow attach, /workflow pause, /workflow interrupt, /workflow resume, and /workflow kill. You can inspect the graph, attach to a single stage, send a steering message, answer a pending prompt, or resume paused work. The system keeps run history and terminal state around for inspection. A failed workflow is not just a giant lost chat transcript. It is a run with named stages, artifacts, errors, and receipts that you can analyze, measure, and improve.

So the architecture is:

Use TypeScript to define the outer loop.
Use separate agent sessions as atomic stage units.
Use fresh or forked context intentionally.
Use artifacts for large handoffs.
Use typed inputs and outputs as contracts.
Use parallel branches where independence is real.
Use review/human gates where correctness matters.
Use model selection and fallback per stage.
Persist everything needed to inspect, resume, debug, and measure ROI.
Let the model operate inside the loop, but do not let it be the loop.

Why not just skills? Skills are useful, but they are instruction bundles. A skill can teach the agent how to do something: how to review code, how to write tests, how to use Bun, how to follow a release process. But a skill does not give you durable execution state. It does not create a graph. It does not create independent sessions. It does not enforce typed inputs or outputs. It does not give you concurrency, pause/resume, stage-level model choices, human gates, artifacts, fallback models, or run inspection.

In Atomic, skills and workflows are complementary. A workflow decides when and where work happens. A skill improves how a specific stage performs its work. For example, a review stage inside a workflow can invoke a code-review skill. But the skill itself is not the orchestration engine.

Why not just markdown? Markdown is good for instructions, specs, and documentation. It is bad as a runtime. A markdown checklist cannot validate input schemas. It cannot schedule parallel branches. It cannot persist stage state. It cannot enforce output contracts. It cannot pause a running model call and resume it later. It cannot route a human approval answer to the correct stage. It cannot automatically keep large artifacts out of model context. It cannot expose a graph of what is happening.

Most markdown “workflows” are really prompts asking the model to simulate a workflow. That works for short tasks, but it degrades as soon as the task becomes long, ambiguous, or failure-prone. The model forgets steps, over-compresses context, repeats itself, hides uncertainty, or claims completion too early. Markdown can describe the process, but something else needs to execute the process.

Why not any other harness or hooks? You may have seen plenty of lightweight hook examples. They are not sufficient. You can build this on top of any harness, but that would require you to own the provider adapters, tool calling, sessions, transcript persistence, UI, MCP, extension loading, model configuration, auth, permissions, package discovery, and runtime controls. If you instead orchestrate several external CLIs and SDKs, it does not work well because every backend has slightly different semantics, failure modes, streaming behavior, context handling, and bugs.

Atomic builds on Pi because Pi is already a small, extensible coding-agent harness. That means the workflow engine can focus on orchestration instead of the agent runtime. Atomic gets the extension ecosystem, MCP/tools, model/provider plumbing, TUI surfaces, sessions, and package system, then adds the missing outer loop: typed workflow definitions, tracked stages, graph execution, human gates, artifacts, live control, and resumability.

The reason this matters is that production agent work is not just “call a better model.” It is systems engineering. You need boundaries, contracts, state, observability, failure handling, and cost controls. The model is powerful, but the model should be inside a system that constrains and verifies it. Atomic is an attempt to make that system open, inspectable, and easy to modify.

This is not necessary for every feature or bug fix. It’s most useful for large, ambiguous, long-running tasks where the difficulty is in managing the process, validation, and context. A GitHub-issue-to-PR run was recorded for you to review as well, so you can see a canonical example of how you can use it.

It’s not perfect, but it’s a shift toward less prompting and more designing the process for the agent to coordinate, validate, and reduce cognitive load. This is especially important for messy, large codebases. Atomic (bastani-inc) has a public implementation of this style of architecture. You can ask your coding agent to inspect and explain it.

There is no "right way" to build a scaffold or a system as we're all experimenting, but there are principles that are key to making sure that it does scale inside of any codebase shape. You should try it out on a scenario where a coding agent has failed you in the past, whether that be because your context was too large or because the agent misunderstood your intent and generated slop. Don’t take anyone’s word for it. Try it and see if this method can make it better.

Public API:

import { defineWorkflow, Type } from "@bastani/workflows";

export default defineWorkflow("review-change")
  .description("Research, review, and synthesize a change")
  .input("target", Type.String({ description: "Diff, PR, issue, or task" }))
  .output("result", Type.String())
  .run(async (ctx) => {
    // Large handoff: the scout output is written to a file and passed forward via `reads`.
    const scoutPath = ".atomic/workflows/runs/review-change/scout.md";

    await ctx.task("scout", {
      prompt: `Map the relevant context for: ${ctx.inputs.target}`,
      context: "fresh",
      output: scoutPath,
      outputMode: "file-only",
    });

    // Two reviewers run concurrently, each with fresh context and its own model.
    const reviews = await ctx.parallel(
      [
        {
          name: "correctness-review",
          prompt: `Read ${scoutPath} and review correctness, regressions, and tests.`,
          reads: [scoutPath],
          context: "fresh",
          model: "openai/gpt-5.5:high",
        },
        {
          name: "maintainability-review",
          prompt: `Read ${scoutPath} and review maintainability and edge cases.`,
          reads: [scoutPath],
          context: "fresh",
          model: "anthropic/claude-sonnet-4:high",
        },
      ],
      { concurrency: 2 },
    );

    const final = await ctx.task("synthesis", {
      prompt: [
        "Synthesize the reviewer findings.",
        "Keep only evidence-backed issues.",
        "Separate blockers from optional suggestions.",
      ].join("\n"),
      // Small handoff: the reviewer text is passed inline through `previous`.
      previous: reviews.map((r) => r.text).join("\n\n"),
    });

    return { result: final.text };
  })
  .compile();

As with any writing, it’s important to know the background of the authors so you can decide how to weigh their observations and thoughts. Alex is an AI researcher at Microsoft Research, and he's spent the past couple of years building and working on coding agents and developer tools for MSR and Windows, solving challenges in using coding agents inside a codebase of 1 billion+ LoC. Prior to that, he worked on uncertainty estimation at an MIT startup, and he has a research background in 3D vision and world models. Outside of his job, he's an open source builder on Atomic, where he's spent many hours refining, building, and working on coding agents. Norin has worked on AI/ML products at big tech companies and startups in productivity, hospitality, netdevops, and healthcare, shipping to 1B+ users worldwide. Norin is now working on better reliability for coding agents, search, and other interesting projects, including contributing to Atomic. Despite our profiles reading as quite AGI-pilled, more often than not we exclaim how incredibly dumb these models are being. So hopefully we came across as at least somewhat level-headed, but you can decide that.

-Norin & Alex
