Venkata Manideep Patibandla

Posted on May 31

Why Single Agents Fail at Scale And the 3 Role Architecture That Fixes It

#ai #agentskills #architecture #failures

The naive assumption when building with LLMs is that a smarter single agent solves more problems. Just give it better tools, a longer context window, and a more powerful model, and it scales.

It doesn't. Not because the models aren't capable, but because the architecture is wrong.

Human attention is a bottleneck. A single agent running a 30-day engineering task can't be supervised at every step, and if it can't be supervised, failures compound silently until something breaks in production.

Add task complexity to the picture and single-agent systems hit a wall fast: they either hallucinate, lose context, or produce inconsistent output that requires constant human correction.

The fix isn't a better model. It's a different architecture. Here's the one that actually works.

First: understand what multi-agent actually means

'Multi-agent' gets used to describe everything from two LLMs talking to each other to a 50-agent distributed system. The taxonomy matters because the design pattern you choose determines how the system fails.

Five patterns, each solving a different problem:

Most production systems use a combination. The architecture I've found most reliable for complex engineering tasks uses delegation as the spine with creator-verifier embedded at every output boundary.

The 3-role architecture: Orchestrator, Workers, Validators
Every mission has three roles. They are not interchangeable, and they should not share context or prompts.

The Orchestrator

The Orchestrator doesn't write code. It doesn't implement features. Its job is three things: strategic planning, milestone definition, and creating validation contracts.

A validation contract is a pre-agreed specification of what 'done' means for each milestone, written before any Worker starts. This is the most important artifact the Orchestrator produces. Without it, validation becomes subjective and Workers optimize for the wrong outcomes.

The Orchestrator's output is a plan and a set of contracts. If it's writing implementation, something is wrong with your architecture.
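To make the idea of a validation contract concrete, here is a minimal sketch of one possible shape for it. The class names, field names, and example criteria are all hypothetical; the only thing taken from the architecture is that 'done' is written down as explicit, checkable criteria before a Worker starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One checkable statement of what 'done' means (hypothetical shape)."""
    id: str
    description: str


@dataclass(frozen=True)
class ValidationContract:
    """Written by the Orchestrator before any Worker starts."""
    milestone: str
    criteria: tuple[Criterion, ...]

    def unmet(self, passed_ids: set[str]) -> list[str]:
        """Return the ids of criteria that have not been shown to pass."""
        return [c.id for c in self.criteria if c.id not in passed_ids]


# Illustrative contract for an invented milestone.
contract = ValidationContract(
    milestone="add-login-endpoint",
    criteria=(
        Criterion("C1", "POST /login returns 200 for valid credentials"),
        Criterion("C2", "POST /login returns 401 for invalid credentials"),
    ),
)

print(contract.unmet({"C1"}))  # ['C2']
```

Freezing the dataclasses is a deliberate choice in this sketch: once the contract is agreed, nobody downstream can quietly edit what 'done' means.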

Workers

Workers implement. Each one gets a clean-slate context: no accumulated baggage from previous milestones. This is deliberate. Context bleed from earlier in a task contaminates new work in ways that are hard to debug.

Workers produce two things: feature implementation and structured git commits. Nothing else. The structured commit format matters: it's how the Orchestrator tracks milestone progress and how Validators know what to test.

Workers don't know about other Workers. They don't know the full plan. They know their milestone, their validation contract, and their tools.
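The post doesn't pin down the commit format, so the following is only a sketch of one way a structured commit could work. The "Milestone" and "Covers" trailers and both helper functions are assumptions for illustration, not the format actually used.

```python
def format_commit(summary: str, milestone: str, criteria: list[str]) -> str:
    """Build a commit message with machine-readable trailers (hypothetical format)."""
    trailers = [f"Milestone: {milestone}", f"Covers: {', '.join(criteria)}"]
    return summary + "\n\n" + "\n".join(trailers)


def parse_commit(message: str) -> dict[str, str]:
    """Extract 'Key: value' trailers so other roles can read them."""
    trailers = {}
    for line in message.splitlines()[1:]:  # skip the summary line
        key, sep, value = line.partition(": ")
        if sep:  # ignore blank lines and anything that is not a trailer
            trailers[key] = value
    return trailers


message = format_commit("Add login endpoint", "M3", ["C1", "C2"])
print(parse_commit(message))  # {'Milestone': 'M3', 'Covers': 'C1, C2'}
```

With something like this, the Orchestrator can track progress by reading the milestone trailer, and a Validator can read which contract criteria a commit claims to cover.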

Validators

Validators are adversarial by design. Their job is to find reasons the Worker's output fails the validation contract: not to be helpful, not to suggest improvements, not to give partial credit.

There are two types of validation: adversarial verification (does the implementation actually match the contract?) and end-to-end behavior checks (does the system work correctly in the way a real user would encounter it?).

The Validator never sees the Worker's context. It only sees the contract and the output. If a Validator needs to read the Worker's reasoning to evaluate the output, the output isn't complete enough.
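One way to enforce that isolation is in the function signature itself: the Validator receives the contract and the output, and nothing else. This is a sketch under assumptions; representing the contract as named check functions, and the `Verdict` type, are illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]  # hypothetical: a check inspects the output only


@dataclass(frozen=True)
class Verdict:
    passed: bool
    failures: tuple[str, ...]


def validate(contract: dict[str, Check], output: str) -> Verdict:
    """Pass only if every check passes; a check that raises counts as a failure."""
    failures = []
    for name, check in contract.items():
        try:
            ok = check(output)
        except Exception:
            ok = False  # no benefit of the doubt
        if not ok:
            failures.append(name)
    return Verdict(passed=not failures, failures=tuple(failures))


contract = {
    "defines_login": lambda out: "def login" in out,
    "no_todo_left": lambda out: "TODO" not in out,
}
output = "def login(user, password):\n    pass  # TODO hash password"

print(validate(contract, output))
# Verdict(passed=False, failures=('no_todo_left',))
```

There is no partial credit here: one failing check fails the whole output, and the Worker's reasoning never enters the function.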

Three validators, not one

A single validation pass misses things. The architecture uses three distinct validators at different layers:

Running all three isn't optional. The Validation Contract Validator catches bad specs before they waste a Worker's run. The Scrutiny Validator catches implementation drift. The User Testing Validator catches the gap between 'technically correct' and 'actually works', which is where most production bugs live.
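The ordering of the three validators can be sketched as a simple pipeline. Everything below is a toy stand-in: in a real system each callable would wrap an LLM call, and the function names and string-based contract are assumptions made for illustration.

```python
from typing import Callable

Contract = list[str]  # hypothetical: one string per success criterion


def run_milestone(
    contract: Contract,
    worker: Callable[[Contract], str],
    contract_validator: Callable[[Contract], bool],
    scrutiny_validator: Callable[[Contract, str], bool],
    user_testing_validator: Callable[[str], bool],
) -> str:
    """Run one milestone through all three validation layers, in order."""
    # Layer 1: reject a bad spec before any Worker time is spent on it.
    if not contract_validator(contract):
        return "rejected: contract"
    output = worker(contract)
    # Layer 2: does the implementation match the contract?
    if not scrutiny_validator(contract, output):
        return "rejected: scrutiny"
    # Layer 3: does it behave correctly the way a real user would hit it?
    if not user_testing_validator(output):
        return "rejected: user testing"
    return "accepted"


worker_calls = []

def worker(contract: Contract) -> str:
    worker_calls.append(contract)
    return "implements: " + "; ".join(contract)

def no_vague_criteria(contract: Contract) -> bool:
    return bool(contract) and all("should work" not in c for c in contract)

def covers_every_criterion(contract: Contract, output: str) -> bool:
    return all(c in output for c in contract)

def smoke_test(output: str) -> bool:
    return output.startswith("implements: ")

validators = (no_vague_criteria, covers_every_criterion, smoke_test)
vague = run_milestone(["login should work"], worker, *validators)
clear = run_milestone(["POST /login returns 401 on bad password"], worker, *validators)

print(vague, clear, len(worker_calls))
# rejected: contract accepted 1
```

The vague contract is rejected before the Worker is ever called, which is the point of putting contract validation first.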

Execution strategy: serial by default, parallel only where safe
The instinct is to parallelize everything. More agents running at once = faster output. In practice, unconstrained parallelization creates consistency problems that cost more time to debug than the parallelism saved.

The execution strategy that works:

Serial execution for consistency

Milestones run in sequence by default. Each milestone's output is validated before the next starts. This eliminates the class of bugs where two parallel workers make conflicting assumptions about shared state.

Internal parallelization for read-only tasks

Within a milestone, tasks that don't write state can run in parallel. Research, analysis, and context gathering are safe. Anything that modifies shared state runs serially.
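The serial-outside, parallel-inside rule can be sketched with the standard library's thread pool. The task functions and the dictionary used for shared state are hypothetical; real read-only tasks would be agent calls rather than plain functions.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_milestone(state: dict, read_only_tasks: list, write_tasks: list) -> None:
    """Fan out read-only tasks, then apply state-changing tasks one at a time."""
    snapshot = copy.deepcopy(state)  # readers get a copy, never the live state
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in the order the tasks were given.
        findings = list(pool.map(lambda task: task(snapshot), read_only_tasks))
    for task in write_tasks:  # writers run serially, in order
        task(state, findings)


def count_files(snapshot: dict) -> int:
    return len(snapshot["files"])

def longest_name(snapshot: dict) -> str:
    return max(snapshot["files"], key=len)

def record_findings(state: dict, findings: list) -> None:
    state["report"] = findings


state = {"files": ["app.py", "models.py", "README"]}
milestones = [([count_files, longest_name], [record_findings])]

for read_only_tasks, write_tasks in milestones:  # milestones run in sequence
    run_milestone(state, read_only_tasks, write_tasks)

print(state["report"])  # [3, 'models.py']
```

Only the research-style tasks touch the thread pool; anything that mutates `state` stays in the plain for loop.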

Structured handoffs for context retention

When a Worker hands off to a Validator, the handoff includes what was built, what the contract required, and what tests were run. Not the full context window, but a structured summary. The Validator should be able to evaluate from that summary alone.
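A handoff with exactly those three parts could look like the sketch below. The field names and the JSON serialization are assumptions; the post only specifies what the summary has to contain.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Handoff:
    """Structured summary passed from Worker to Validator (hypothetical fields)."""
    built: str
    contract_requirements: list[str]
    tests_run: list[str]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


handoff = Handoff(
    built="POST /login endpoint with password hashing",
    contract_requirements=["returns 200 for valid credentials",
                           "returns 401 for invalid credentials"],
    tests_run=["test_login_ok", "test_login_bad_password"],
)

payload = handoff.to_json()
print(sorted(json.loads(payload)))
# ['built', 'contract_requirements', 'tests_run']
```

Because the payload has a fixed set of fields, there is no slot where a Worker's full context window could leak through to the Validator.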

Self-healing milestone boundaries

When a milestone fails validation, the system doesn't restart from scratch. It retries the specific milestone with a fresh Worker context and an updated contract that incorporates the failure information. The Orchestrator owns the retry logic. Workers don't know they're retrying.
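The retry loop described above can be sketched in a few lines. The toy Worker and Validator stand in for LLM calls, and representing the contract as a list of strings is an assumption made to keep the example small.

```python
def run_with_retries(contract, make_worker, validate, max_attempts=3):
    """Orchestrator-owned retry: fresh Worker each attempt, contract updated on failure."""
    for attempt in range(1, max_attempts + 1):
        worker = make_worker()  # clean context; the Worker never sees past attempts
        output = worker(contract)
        failures = validate(contract, output)
        if not failures:
            return output, attempt
        # Fold the failure information into a new contract for the next Worker.
        contract = contract + [f"Known failure to fix: {f}" for f in failures]
    raise RuntimeError(f"milestone failed after {max_attempts} attempts")


def make_worker():
    def worker(contract):
        # Stand-in for an LLM call: it only handles what the contract spells out.
        handles_empty = any("empty input" in item for item in contract)
        return "parse()" + (" handles empty input" if handles_empty else "")
    return worker

def validate(contract, output):
    return [] if "handles empty input" in output else ["crashes on empty input"]


output, attempts = run_with_retries(["parse() returns a list"], make_worker, validate)
print(output, attempts)  # parse() handles empty input 2
```

The Worker's signature is the same on every attempt: it just receives a contract, so it has no way of knowing it is a retry.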

Droid whispering: matching models to roles

Not every role needs the same model. The architecture is model-agnostic by design, and that's not a hedge; it's a deliberate cost and capability decision.

The Orchestrator does strategic reasoning over long context. It benefits from the most capable model available.

Workers do focused implementation over a narrow scope. A smaller, faster, cheaper model often performs as well or better here because the task is bounded and the context is clean.

Validators do adversarial checking with a specific contract. Again, a focused model with a well-structured prompt outperforms a general-purpose large model given a vague instruction.

The prompt-driven orchestration layer is ~700 lines. That's not a sign of over-engineering; it's what makes the architecture model-agnostic. When better models ship, you swap the model string, not the architecture.

Resilience to model updates is a design requirement, not an afterthought. Any architecture that assumes a specific model's behavior will break when that model changes. Prompt-driven orchestration with explicit contracts means the system degrades gracefully rather than catastrophically when model behavior shifts.

What this actually produces

After running this architecture on real engineering tasks over 16-30 day missions:

90% code coverage
Asynchronous 'Mission Control' view of active runs: every milestone, every validator, every retry visible
Meaningful productivity gains for engineering teams, not in demo environments but in production tasks
Runs that don't need babysitting, because validation contracts mean the system knows when it's done

The 90% coverage number matters because it's not a target; it's a byproduct. When Validators are adversarial and contracts are explicit, coverage follows from the architecture rather than being enforced by a test-writing pass at the end.

What to take from this

The human attention bottleneck is real. You cannot supervise every step of a multi-week agentic run. The architecture has to be self-validating, not just self-executing.

The three things that make this work: validation contracts written before implementation starts, adversarial validators that see only output (not process), and serial execution with targeted internal parallelization.

The thing that almost always gets skipped: the Validation Contract Validator. Teams jump straight to building and discover mid-run that their success criteria were ambiguous. That's not a model problem. It's a spec problem.

What's the biggest failure mode you've hit running multi-agent systems, and was it an architecture problem or a model problem?

Top comments (5)

Gilder Miller • May 31

Yeah, this is almost never a model issue.
It’s usually the boundaries; agents assume things that were never explicitly passed forward, so small gaps turn into compounded wrong decisions.
Once you force strict input/output contracts and stop relying on implicit context, most of the weird behavior disappears.
What failure shows up first in your setup - silent bad outputs or mismatched assumptions between steps?

Venkata Manideep Patibandla • Jun 1

Exactly, boundaries are usually where the system starts breaking, not the model itself.

In my setup, the first failure is usually mismatched assumptions between steps. Silent bad outputs come later, after one worker makes an assumption that was never explicitly passed forward and the validator doesn’t have a strong enough contract to catch it.

That’s why I’ve started treating input/output contracts almost like APIs between agents. If the handoff is vague, the whole run becomes fragile.

Gilder Miller • Jun 3

Exactly, treating those handoffs like APIs is key. Once each step has a strict contract, you catch mismatches before they cascade. Silent outputs only sneak in when a contract is missing or weak, so tightening that boundary is the real win.

Harjot Singh • May 31

The single-agent-fails-at-scale observation is right. One agent juggling everything (planning, doing, checking) degrades because the roles have conflicting objectives: the doer wants to ship, the checker needs to be skeptical, and one model can't hold both at once. Splitting into roles fixes it by giving each a clean objective and creating productive tension: a critic that actually pushes back instead of rubber-stamping its own work. 3 roles is a sensible cut. I lean on exactly this separate-the-objectives design in Moonshift. What are your 3 roles: planner/executor/verifier, or a different split?

Venkata Manideep Patibandla • Jun 1

Yes, that’s exactly the split I’m using, with slightly different naming:

Orchestrator -> handles planning, milestones, and validation contracts.

Workers -> execute bounded implementation tasks with clean context.

Validators -> check the output adversarially against the contract.

The biggest unlock for me has been separating “shipping energy” from “skeptical checking.” When one agent owns both, it tends to justify its own work instead of actually challenging it.