The Machine That Builds the Machine, and the Studio That Runs Itself: Two Ways to Organise an Agent Swarm

#ai #automation #devops #architecture

Why I am writing this

I thought people might find this comparison useful. It is rare to get two fully built agent-orchestration systems, designed in complete isolation from each other, solving the same class of problem with enough written detail on both sides to compare them honestly, and rarer still to catch the differences while both are still warm. Shortly after publishing my DAG TOML article I went looking for neighbours and found wpank's write-up, Building the Machine That Builds the Machine, which describes Bardo: a meta-system that takes a 234,657-line specification across 343 files and turns it into 26 compiled Rust crates through coordinated agent swarms. I have my own horse in this race, a system called atelier-studio (roughly 80,000 lines of Rust, built across about five months). Reading his post was the strange experience of recognising my own decisions in a stranger's codebase and then, more usefully, recognising the places where he and I made opposite calls.

I am not a neutral reviewer here: I built one of the two systems being compared, so please take this as nothing more than one practitioner reading another practitioner's work with respect and an honest ruler. Where I describe Bardo I am working from the write-up alone, not the code, and any misreadings are mine.

The factory: Bardo

Bardo is project-shaped. It exists to finish one enormous build: a 26-crate Rust workspace implementing autonomous agents with mortality, dreaming, emotion and economic incentives, specified down to the academic citations (467 of them, Hans Jonas on metabolic freedom and Damasio's somatic markers, to name a few). The orchestrator, bardo-ctl, is 42,744 lines of Rust, and the part I admire most is around 2,000 lines of bash.

The bash is a three-stage context engineering pipeline, and frankly it is the heart of the whole design. Stage one extracts specification sections using a two-source weighted model (inline spec references get double weight over crate-mapped directories). Stage two decomposes a plan into ordered steps under a 102.4KB context cap, with the rule that each step must compile when combined with all previous steps. Stage three distils each step down to a 5 to 15KB context slice, carrying forward a one-line summary of what previous steps accomplished, so the agent implementing step 7 never sees the scaffolding from step 1. The design came, in his words, from watching agents drown in 80KB payloads where maybe 12KB was relevant.
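As a rough illustration of stages one and three, here is a minimal Python sketch (Python for brevity; it is not Bardo's bash, which I have not seen). The double weight for inline references and the 15KB ceiling come from the write-up. The function names, the section ids and the rule of skipping whatever does not fit are my own assumptions.

```python
from collections import Counter

INLINE_WEIGHT = 2      # inline spec references count double (from the write-up)
CRATE_WEIGHT = 1       # sections reached through crate-mapped directories
SLICE_BUDGET = 15_000  # bytes; top of the 5 to 15KB range, read here as 15,000 bytes


def rank_sections(inline_refs, crate_mapped):
    """Score each section id from both sources and return the ids, best first."""
    scores = Counter()
    for section in inline_refs:
        scores[section] += INLINE_WEIGHT
    for section in crate_mapped:
        scores[section] += CRATE_WEIGHT
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda s: (-scores[s], s))


def build_slice(prior_summaries, ranked, section_text, budget=SLICE_BUDGET):
    """Assemble one step's context: one-line summaries first, then sections that fit.

    The budget counts summary and section bytes only, not the joining newlines.
    """
    parts = [f"Step {i + 1} done: {line}" for i, line in enumerate(prior_summaries)]
    used = sum(len(p.encode()) for p in parts)
    for section in ranked:
        body = section_text[section]
        size = len(body.encode())
        if used + size > budget:
            continue  # assumption: skip it and keep trying lower-ranked, smaller sections
        parts.append(body)
        used += size
    return "\n".join(parts)
```

The point of the shape is that the agent on a late step receives a few summary lines plus the highest-scoring sections, and nothing of the earlier scaffolding.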

Above that sits a genuinely complete orchestration layer: around 100 task TOML files declaring files, acceptance criteria, cross-plan dependencies (a task can depend on "17:T1", task T1 of plan 17, which lets the scheduler extract parallelism across plan boundaries) and exclusive file claims; a dual-layer DAG with wave scheduling via Kahn's algorithm; a next_runnable() check that refuses to start any task whose files overlap an in-flight task; 25 agent roles routed to three backends by competence (Codex for refactoring and diagnosis, Cursor for review verdicts, Claude for orchestration and implementation); a gate gauntlet (compile, dependency-deny, test, spec compliance) with a three-failure halt; a parallel three-reviewer panel synthesised by a Critic; git worktrees per plan with a shared sccache so parallel builds cache-hit each other; and a Conductor that nudges silent agents at 300 seconds, restarts stalled ones at 600, and never lets itself starve an Implementer of a spawn slot.
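Two of those pieces are small enough to sketch. The Python below is my own minimal reading of the description, not bardo-ctl's code: schedule_waves is Kahn's algorithm grouped into waves over a single flat graph (the dual layer is not modelled), and this next_runnable mirrors only the file-overlap rule, with a signature I have assumed. The task ids and file paths are invented for the example.

```python
def schedule_waves(deps):
    """Kahn's algorithm, grouped into waves.

    deps maps a task id such as "17:T1" to the set of task ids it depends on.
    Every task in a wave has all of its dependencies in earlier waves.
    """
    indegree = {task: len(needs) for task, needs in deps.items()}
    dependents = {task: [] for task in deps}
    for task, needs in deps.items():
        for dep in needs:
            dependents[dep].append(task)  # a KeyError here means an undeclared dependency
    wave = sorted(task for task, n in indegree.items() if n == 0)
    waves = []
    while wave:
        waves.append(wave)
        unlocked = []
        for task in wave:
            for child in dependents[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    unlocked.append(child)
        wave = sorted(unlocked)
    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("dependency cycle detected")
    return waves


def next_runnable(ready, in_flight, files):
    """Return the first ready task whose files overlap no in-flight task, else None."""
    claimed = set()
    for task in in_flight:
        claimed |= files[task]
    for task in ready:
        if files[task].isdisjoint(claimed):
            return task
    return None


# Hypothetical plans 17 to 19; cross-plan edges are just ordinary edges in the graph.
deps = {
    "17:T1": set(),
    "17:T2": {"17:T1"},
    "18:T1": {"17:T1"},
    "19:T1": {"17:T1"},
    "18:T2": {"17:T2", "18:T1"},
}
files = {
    "17:T2": {"crates/core/src/lib.rs"},
    "18:T1": {"crates/core/src/lib.rs"},
    "19:T1": {"crates/econ/src/lib.rs"},
}
waves = schedule_waves(deps)
pick = next_runnable(["18:T1", "19:T1"], in_flight=["17:T2"], files=files)
```

With 17:T2 in flight, 18:T1 is held back because both claim the same file, so the scheduler picks 19:T1 instead. Wave membership says what may run in parallel; the file check says what is safe to.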

Two smaller mechanisms deserve a nod because they encode real scars. The iteration memory builds cumulative DO NOT RETRY lists from compiler errors and review blockers, born from watching an agent hit the same type mismatch four iterations running, each time "fixing" it differently and wrongly. And the golden-path index records plans that succeeded on the first attempt, categorised, so future decompositions are shown up to two worked examples of the same category. Failure memory and success memory, both fed forward.
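A cumulative do-not-retry list is simple to picture in code. This is a hedged Python sketch of the idea only: the class name, the whitespace normalisation and the rendered format are assumptions of mine, not Bardo's implementation.

```python
class IterationMemory:
    """Cumulative list of things that already failed, rendered into the next prompt."""

    def __init__(self):
        # (source, message) -> times seen; dicts preserve insertion order
        self._seen = {}

    def record(self, source, message):
        """Record a compiler error or review blocker from the latest iteration."""
        key = (source, " ".join(message.split()))  # collapse whitespace so repeats match
        self._seen[key] = self._seen.get(key, 0) + 1

    def render(self):
        """Return the block of text to prepend to the next attempt's context."""
        if not self._seen:
            return ""
        lines = ["DO NOT RETRY:"]
        for (source, message), count in self._seen.items():
            suffix = f" (seen {count}x)" if count > 1 else ""
            lines.append(f"- [{source}] {message}{suffix}")
        return "\n".join(lines)
```

The list only ever grows across iterations, which is the property that stops the fourth attempt from rediscovering the first attempt's mistake.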

The studio: atelier-studio

Atelier-studio is institution-shaped. Where Bardo exists to finish a build, atelier exists to keep running: a set of standing councils (research, engineering, QA, go-to-market, product and operations) that take a product idea through the whole lifecycle, from market analysis and competitive intelligence through work package decomposition, test planning, service level objectives and launch messaging, backed by a local knowledge graph of around 23,000 ingested items (papers, standards, bodies of knowledge, model registries).

The design bet is different, and the difference matters. Bardo diversifies its agents by skill, routing each role to the backend best at that job. Atelier diversifies by perspective: each council runs multiple independent planner "flavours" against the same inputs, a Conservative Analyst worrying about risk and compliance, an Optimistic Explorer chasing emerging technology, a Pragmatic Synthesizer weighing cost against time to market (the engineering council has its own trio along minimalism, scalability and maintainability lines), and the outputs are merged through critique and ranking rather than simple voting. Bardo never argues with itself. Atelier is built to argue with itself, because in business strategy work the failure mode is not a type mismatch, it is a confident plan that nobody stress-tested from a hostile angle.
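To show how merging by critique differs from a show of hands, here is a deliberately small Python sketch. It is one possible scheme, not a description of atelier's merger: each flavour scores every plan, a plan's standing is the mean of the scores it received from the other flavours, and a flavour's opinion of its own plan is ignored. The function name, the score scale and the flavour keys are all illustrative assumptions.

```python
def rank_by_peer_critique(scores):
    """Rank plans by the mean score each received from the *other* flavours.

    scores[critic][author] is the score the critic gave the author's plan
    (higher is better). Self-scores are ignored.
    """
    authors = sorted({author for given in scores.values() for author in given})
    means = {}
    for author in authors:
        peer = [
            given[author]
            for critic, given in scores.items()
            if critic != author and author in given
        ]
        means[author] = sum(peer) / len(peer) if peer else 0.0
    # Best mean first; ties broken by name so the result is deterministic.
    return sorted(authors, key=lambda a: (-means[a], a))
```

Under plain voting each flavour would back its own plan and the result would be a three-way tie; discarding self-scores forces the ranking to come from hostile readers.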

The memory systems differ the same way. Bardo's learning is textual and rule-shaped: DO NOT RETRY lists an agent must read. Atelier's is statistical: an attempt tracker feeding a failure oracle that forecasts the probability the next attempt fails (Dirichlet modelling), and a calibration tracker (isotonic regression and Platt scaling) that keeps the system's confidence honest against its actual hit rate. One remembers what failed; the other models how likely failure is. Atelier also crosses a line Bardo never attempts: a self-improvement subsystem that proposes changes to atelier's own code, which is exactly why it carries a human-approval safety gate and adversarial review, because a system that rewrites itself needs governance in a way a build factory does not.
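For readers who have not met the Dirichlet idea, the core of it fits in a few lines. What follows is the textbook Dirichlet-multinomial posterior predictive in Python, offered as a sketch of the general technique rather than the oracle's actual code; the outcome categories and the uniform prior of 1 per category are assumptions for illustration.

```python
def failure_probability(counts, prior=None, success_key="success"):
    """Posterior predictive probability that the next attempt is not a success.

    counts maps outcome category -> number of past attempts with that outcome.
    With a Dirichlet prior alpha, P(next = k) = (alpha_k + n_k) / (sum(alpha) + N).
    """
    if prior is None:
        prior = {k: 1.0 for k in counts}  # assumed uniform prior, one pseudo-count each
    total = sum(prior[k] + counts[k] for k in counts)
    success = prior[success_key] + counts[success_key]
    return 1.0 - success / total


# Hypothetical history: 6 clean passes, 3 compile failures, 1 review rejection.
history = {"success": 6, "compile_fail": 3, "review_fail": 1}
p_fail = failure_probability(history)
```

With that history the forecast is 6/13, a little under one half. The prior keeps the estimate away from 0 or 1 when there are only a handful of attempts, which matters for a forecast that other components will act on.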

Where two strangers built the same parts

The convergence list is long enough that I stopped finding it spooky and started finding it instructive. Both systems independently arrived at: atomic work units carrying their own acceptance criteria and file sets; explicit dependency DAGs over those units; file-level conflict detection as the precondition for safe parallel agents (Bardo's exclusive-files check is functionally identical to the conflict groups in my DAG TOML runtime); a panel of reviewers with a synthesising verdict; a three-strikes failure budget; failure memory fed forward into the next attempt; success exemplars fed forward as worked examples (his golden paths are, almost word for word, the clean one-pass approvals I used as a negative class when mining my review archive); and isolation of parallel writers via separate working copies.

None of this was copied. I found his write-up after building mine, his post does not reference any of my work, and yet the load-bearing safety mechanisms match almost one for one. When two builders who have never met converge on file-level conflict detection and cumulative do-not-retry memory, that is not fashion, that is the problem itself dictating the shape of the solution, the same way every culture that builds bridges discovers the arch.

Where the philosophies split

Three genuine divergences, and each one traces back to the shape of the work rather than to taste.

First, static distillation versus living retrieval. Bardo can precompute context slices because the specification is frozen; the spec is the territory and the pipeline is a map-making exercise done once. Atelier cannot freeze anything: the knowledge graph keeps growing and the councils query it at run time through a librarian layer with per-council token budgets. Bardo compiles context; atelier retrieves it. His closing line, that context engineering is the whole game, the right 12KB delivered at the right time, is the frozen-world statement of the same conviction that made me build the knowledge graph for the unfrozen one.

Second, skill diversity versus perspective diversity, which I described above and will not repeat, except to note the consequence: Bardo's review panel exists to catch defects, atelier's flavour consensus exists to catch blind spots, and a mature swarm probably needs both.

Third, the cockpit versus the control plane. His attempt at headless operation was, in his words, like driving blindfolded: an agent stuck in a compile-fix loop for 15 of 20 unobserved minutes. His answer was a terminal dashboard with 26 widgets, pause and force-advance controls, and per-role colour coding. My answer to the same pain was structured event streaming and, eventually, an external control plane that evaluates fleet state from data rather than from watching. It is an interactive cockpit against a queryable instrument panel: I suspect his gets a human to a stuck agent faster, whilst mine scales past the number of screens one person can watch.
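Evaluating fleet state from data can be as plain as the Python sketch below. It is illustrative only: the thresholds are borrowed from the 300 and 600 second figures in Bardo's Conductor description, while the function name, the event shape (one last-seen timestamp per agent) and the three action labels are assumptions, not either system's real interface.

```python
NUDGE_AFTER = 300    # seconds of silence before a nudge (figure from the Conductor description)
RESTART_AFTER = 600  # seconds of silence before a restart


def evaluate_fleet(last_event_at, now):
    """Map each agent to 'ok', 'nudge' or 'restart' from its last event timestamp.

    last_event_at maps agent name -> timestamp in seconds of its most recent event.
    """
    actions = {}
    for agent, seen in last_event_at.items():
        silent_for = now - seen
        if silent_for >= RESTART_AFTER:
            actions[agent] = "restart"
        elif silent_for >= NUDGE_AFTER:
            actions[agent] = "nudge"
        else:
            actions[agent] = "ok"
    return actions
```

Because the decision is a pure function of the event stream, it can be run over a thousand agents, or replayed over yesterday's events, without anybody watching a screen.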

What I take from it

The safety mechanisms converge, the strategy layers do not. Conflict detection, acceptance criteria, failure budgets and iteration memory showed up in both systems unprompted, whilst context strategy, diversity strategy and observability strategy split cleanly along the grain of each system's purpose. If you are building an orchestrator, copy the first list with confidence and choose the second list deliberately.
Project-shaped and institution-shaped systems want different memory. A factory can carry its lessons as text; an institution needs calibration, because the institution will still be making forecasts long after any individual lesson has gone stale.
Context engineering keeps winning. Two systems, opposite architectures, same conclusion: not better models, not longer windows, but the right small context at the right moment.
Synchronicity is evidence. When isolated builders keep meeting at the same mechanisms, those mechanisms are probably load-bearing for the whole field, and they are the parts I would now least want to be without.

Credit to wpank for a write-up generous enough with internals to make a real comparison possible; that generosity is rarer than the engineering. Thanks for reading this far. I hope you find some value in my reading of the two machines. If you have built your own orchestrator and recognise these mechanisms (or, better, if you made a third set of choices entirely), I would genuinely like to hear how the wall pushed back on you.