The naive assumption when building with LLMs is that a smarter single agent solves more problems. Just give it better tools, a longer context window, and a more powerful model, and it scales.
It doesn't. Not because the models aren't capable, but because the architecture is wrong.
Human attention is a bottleneck. A single agent running a 30-day engineering task can't be supervised at every step, and if it can't be supervised, failures compound silently until something breaks in production.
Add task complexity to the picture and single-agent systems hit a wall fast: they either hallucinate, lose context, or produce inconsistent output that requires constant human correction.
The fix isn't a better model. It's a different architecture. Here's the one that actually works.
First: understand what multi-agent actually means
'Multi-agent' gets used to describe everything from two LLMs talking to each other to a 50-agent distributed system. The taxonomy matters because the design pattern you choose determines how the system fails.
Five patterns, each solving a different problem:
Most production systems use a combination. The architecture I've found most reliable for complex engineering tasks uses delegation as the spine with creator-verifier embedded at every output boundary.
The 3-role architecture: Orchestrator, Workers, Validators
Every mission has three roles. They are not interchangeable, and they should not share context or prompts.
The Orchestrator
The Orchestrator doesn't write code. It doesn't implement features. Its job is three things: strategic planning, milestone definition, and creating validation contracts.
A validation contract is a pre-agreed specification of what 'done' means for each milestone, written before any Worker starts. This is the most important artifact the Orchestrator produces. Without it, validation becomes subjective and Workers optimize for the wrong outcomes.
The Orchestrator's output is a plan and a set of contracts. If it's writing implementation, something is wrong with your architecture.
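To make the idea of a validation contract concrete, here is a minimal sketch of one as a data structure. The class names, field names, and the example milestone are all hypothetical; the point is only that 'done' is written down as a list of checkable criteria before any Worker runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One checkable statement of what 'done' means (hypothetical shape)."""
    id: str
    description: str


@dataclass(frozen=True)
class ValidationContract:
    """Written by the Orchestrator before any Worker starts."""
    milestone: str
    criteria: tuple[Criterion, ...]

    def unmet(self, passed_ids: set[str]) -> list[str]:
        """Return the ids of criteria that were not reported as passing."""
        return [c.id for c in self.criteria if c.id not in passed_ids]


# Illustrative contract; the milestone and criteria are made up.
contract = ValidationContract(
    milestone="add-login-endpoint",
    criteria=(
        Criterion("c1", "POST /login returns 200 for valid credentials"),
        Criterion("c2", "POST /login returns 401 for invalid credentials"),
    ),
)
```

Because the contract is frozen, a Worker can read it but not quietly redefine success halfway through a run.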
Workers
Workers implement. Each one gets a clean-slate context: no accumulated baggage from previous milestones. This is deliberate. Context bleed from earlier in a task contaminates new work in ways that are hard to debug.
Workers produce two things: feature implementation and structured Git commits. Nothing else. The structured commit format matters: it's how the Orchestrator tracks milestone progress and how Validators know what to test.
Workers don't know about other Workers. They don't know the full plan. They know their milestone, their validation contract, and their tools.
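The exact commit format isn't spelled out here, so the following is only one way it could look: a summary line followed by 'Key: value' lines in the style of Git trailers. The function names and the `Milestone` and `Criteria` keys are assumptions made for illustration.

```python
def format_commit(milestone: str, summary: str, criteria: list[str]) -> str:
    """Build a commit message with machine-readable trailer lines."""
    return (
        f"{summary}\n\n"
        f"Milestone: {milestone}\n"
        f"Criteria: {', '.join(criteria)}"
    )


def parse_commit(message: str) -> dict[str, str]:
    """Extract 'Key: value' lines from the last paragraph of a commit message."""
    trailer_block = message.strip().split("\n\n")[-1]
    trailers = {}
    for line in trailer_block.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value'
            trailers[key] = value
    return trailers


msg = format_commit("add-login-endpoint", "Add POST /login handler", ["c1", "c2"])
parsed = parse_commit(msg)
```

With something like this, the Orchestrator can track progress by reading the `Milestone` line, and a Validator can read `Criteria` to know what the commit claims to satisfy.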
Validators
Validators are adversarial by design. Their job is to find reasons the Worker's output fails the validation contract, not to be helpful, not to suggest improvements, not to give partial credit.
Two types of validation: adversarial verification (does the implementation actually match the contract?) and end-to-end behavior checks (does the system work correctly in the way a real user would encounter it?).
The Validator never sees the Worker's context. It only sees the contract and the output. If a Validator needs to read the Worker's reasoning to evaluate the output, the output isn't complete enough.
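The isolation rule can be read off a function signature. In the sketch below, the validator receives only a contract and an output, so there is nowhere for Worker reasoning to leak in. The checks are plain Python predicates standing in for whatever a real Validator would run; the criterion names are invented.

```python
from typing import Callable

# A check receives only the Worker's output artifact and returns True on pass.
Check = Callable[[str], bool]


def validate(contract: dict[str, Check], output: str) -> list[str]:
    """Return ids of failed criteria. No partial credit, no Worker context."""
    return [cid for cid, check in contract.items() if not check(output)]


# Hypothetical contract for a milestone that must emit a JSON status payload.
contract = {
    "returns_json": lambda out: out.strip().startswith("{"),
    "has_status_field": lambda out: '"status"' in out,
}

failures = validate(contract, '{"ok": true}')
```

An empty list means the contract is met; anything else is a failure, however close the output came.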
Three validators, not one
A single validation pass misses things. The architecture uses three distinct validators at different layers:
Running all three isn't optional. The Validation Contract Validator catches bad specs before they waste a Worker's run. The Scrutiny Validator catches implementation drift. The User Testing Validator catches the gap between 'technically correct' and 'actually works', which is where most bugs live in production.
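As a rough sketch of how the three layers might be ordered, the function below checks the contract first, so a bad spec stops the run before the Worker is ever called. The validation rules are deliberately simplistic stand-ins, and every function name is an assumption rather than part of the architecture described above.

```python
from typing import Callable, Optional


def check_contract(criteria: list[str]) -> Optional[str]:
    """Validation Contract Validator: reject empty or duplicated criteria."""
    if not criteria:
        return "contract has no criteria"
    if len(set(criteria)) != len(criteria):
        return "contract has duplicate criteria"
    return None


def scrutinize(criteria: list[str], implemented: list[str]) -> Optional[str]:
    """Scrutiny Validator: every criterion must be covered by the output."""
    missing = sorted(set(criteria) - set(implemented))
    return f"not implemented: {missing}" if missing else None


def run_milestone(
    criteria: list[str],
    worker: Callable[[list[str]], list[str]],
    e2e_check: Callable[[], bool],
) -> Optional[tuple[str, str]]:
    """Return (stage, reason) for the first failing stage, or None on success."""
    problem = check_contract(criteria)
    if problem:
        return ("contract", problem)  # the Worker never runs on a bad spec
    implemented = worker(criteria)
    problem = scrutinize(criteria, implemented)
    if problem:
        return ("scrutiny", problem)
    if not e2e_check():  # User Testing Validator stand-in
        return ("user_testing", "end-to-end check failed")
    return None


calls: list[list[str]] = []


def lazy_worker(criteria: list[str]) -> list[str]:
    """Toy Worker that records each call and covers only the first criterion."""
    calls.append(criteria)
    return criteria[:1]
```

Calling `run_milestone([], lazy_worker, lambda: True)` fails at the contract stage and leaves `calls` empty, which is the cost saving the first validator exists for.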
Execution strategy: serial by default, parallel only where safe
The instinct is to parallelize everything. More agents running at once = faster output. In practice, unconstrained parallelization creates consistency problems that are harder to debug than the time you saved.
The execution strategy that works:
Serial execution for consistency
Milestones run in sequence by default. Each milestone's output is validated before the next starts. This eliminates the class of bugs where two parallel workers make conflicting assumptions about shared state.
Internal parallelization for read-only tasks
Within a milestone, tasks that don't write state can run in parallel. Research, analysis, context gathering: these are safe. Anything that modifies shared state runs serially.
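Here is a small sketch of that split using Python's standard `concurrent.futures`. Read-only tasks go through a thread pool; state-changing tasks run one at a time afterwards. The task names and the `run_milestone_tasks` helper are illustrative assumptions, and real tasks would be agent calls rather than lambdas.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Task = Callable[[], str]


def run_milestone_tasks(read_only: list[Task], writes: list[Task]) -> list[str]:
    """Run read-only tasks concurrently, then state-changing tasks serially."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map returns results in input order, whatever the finish order.
        results = list(pool.map(lambda task: task(), read_only))
    for task in writes:  # one at a time, in the order given
        results.append(task())
    return results


log: list[str] = []  # stands in for shared state


def write(name: str) -> Task:
    """Build a toy task that mutates shared state."""
    def task() -> str:
        log.append(name)
        return name
    return task


results = run_milestone_tasks(
    read_only=[lambda: "read docs", lambda: "scan repo"],
    writes=[write("edit schema"), write("run migration")],
)
```

The writes always land in `log` in the order they were listed, because nothing that touches shared state ever runs concurrently.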
Structured handoffs for context retention
When a Worker hands off to a Validator, the handoff includes: what was built, what the contract required, and what tests were run. Not the full context window, but a structured summary. The Validator should be able to evaluate from that summary alone.
Self-healing milestone boundaries
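A handoff like this might be a small serializable record with exactly those three parts. The field names and sample values below are hypothetical; the shape is what matters.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Handoff:
    """What a Worker passes to a Validator: a summary, not its context window."""
    milestone: str
    built: list[str]              # what was built
    contract_criteria: list[str]  # what the contract required
    tests_run: list[str]          # what tests were run

    def to_json(self) -> str:
        """Serialize to JSON so the Validator gets data, not conversation."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative values only.
handoff = Handoff(
    milestone="add-login-endpoint",
    built=["POST /login handler", "credential check"],
    contract_criteria=["c1", "c2"],
    tests_run=["test_login_ok", "test_login_rejected"],
)

payload = json.loads(handoff.to_json())
```

There is deliberately no field for the Worker's reasoning, so it cannot be handed over by accident.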
When a milestone fails validation, the system doesn't restart from scratch. It retries the specific milestone with a fresh Worker context and an updated contract that incorporates the failure information. The Orchestrator owns the retry logic. Workers don't know they're retrying.
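The retry loop can be sketched in a few lines. This is an assumption about how such logic might look, not the actual orchestration code: the `worker` and `validator` callables are toy stand-ins, and failure information is folded into the contract as extra 'must fix' criteria.

```python
from typing import Callable


def run_with_retries(
    contract: list[str],
    worker: Callable[[list[str]], str],
    validator: Callable[[list[str], str], list[str]],
    max_attempts: int = 3,
) -> tuple[bool, int, list[str]]:
    """Retry one milestone. Returns (succeeded, attempts used, final contract)."""
    for attempt in range(1, max_attempts + 1):
        # Each call is a fresh Worker: it sees the contract, not prior attempts.
        output = worker(contract)
        failures = validator(contract, output)
        if not failures:
            return True, attempt, contract
        # Fold failure information into the contract for the next attempt.
        contract = contract + [f"must fix: {f}" for f in failures]
    return False, max_attempts, contract


def worker(contract: list[str]) -> str:
    """Toy Worker that 'implements' exactly what the contract lists."""
    return "; ".join(contract)


def validator(contract: list[str], output: str) -> list[str]:
    """Toy Validator that fails any output that ignores empty input."""
    return [] if "empty input" in output else ["crashes on empty input"]


ok, attempts, final = run_with_retries(["parse CSV rows"], worker, validator)
```

The Worker's signature never changes between attempts. From its point of view, a retry is just a contract with one more criterion in it.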
Droid whispering: matching models to roles
Not every role needs the same model. The architecture is model-agnostic by design, and that's not a hedge; it's a deliberate cost and capability decision.
The Orchestrator does strategic reasoning over long context. It benefits from the most capable model available.
Workers do focused implementation over a narrow scope. A smaller, faster, cheaper model often performs as well or better here because the task is bounded and the context is clean.
Validators do adversarial checking with a specific contract. Again, a focused model with a well-structured prompt outperforms a general-purpose large model given a vague instruction.
The prompt-driven orchestration layer is ~700 lines. That's not a sign of over-engineering; it's what makes the architecture model-agnostic. When better models ship, you swap the model string, not the architecture.
Resilience to model updates is a design requirement, not an afterthought. Any architecture that assumes a specific model's behavior will break when that model changes. Prompt-driven orchestration with explicit contracts means the system degrades gracefully rather than catastrophically when model behavior shifts.
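One way to keep the model choice out of the architecture is a role-to-model table that lives apart from the prompts. The model identifiers and prompt paths below are placeholders, not real model names or files.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RoleConfig:
    model: str        # placeholder identifier, not a real model name
    prompt_file: str  # hypothetical path to the role's prompt


ROLES = {
    "orchestrator": RoleConfig("large-model-v1", "prompts/orchestrator.md"),
    "worker": RoleConfig("small-fast-model-v1", "prompts/worker.md"),
    "validator": RoleConfig("small-fast-model-v1", "prompts/validator.md"),
}


def swap_model(
    roles: dict[str, RoleConfig], role: str, new_model: str
) -> dict[str, RoleConfig]:
    """Return a new mapping with one role's model replaced; prompts untouched."""
    updated = dict(roles)
    updated[role] = replace(roles[role], model=new_model)
    return updated


upgraded = swap_model(ROLES, "orchestrator", "large-model-v2")
```

Upgrading a role is then a one-line change, and the prompts and contracts that carry the actual behavior stay exactly as they were.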
What this actually produces
After running this architecture on real engineering tasks over 16- to 30-day missions:
90% code coverage
Asynchronous 'Mission Control' view of active runs: every milestone, every validator, every retry visible
Meaningful productivity gains for engineering teams, not in demo environments but in production tasks
Runs that don't need babysitting because validation contracts mean the system knows when it's done
The 90% coverage number matters because it's not a target; it's a byproduct. When Validators are adversarial and contracts are explicit, coverage follows from the architecture rather than being enforced by a test-writing pass at the end.
What to take from this
The human attention bottleneck is real. You cannot supervise every step of a multi-week agentic run. The architecture has to be self-validating, not just self-executing.
The three things that make this work: validation contracts written before implementation starts, adversarial validators that see only output (not process), and serial execution with targeted internal parallelization.
The thing that almost always gets skipped: the Validation Contract Validator. Teams jump straight to building and discover mid-run that their success criteria were ambiguous. That's not a model problem. It's a spec problem.
What's the biggest failure mode you've hit running multi-agent systems, and was it an architecture problem or a model problem?




