This post was originally published on Towards AI on Medium.
Your $4M agent project just failed.
Not because the LLM wasn't smart enough. Not because the prompts were wrong.
Because you built a god-agent.
One LLM handling routing, validation, tool calling, synthesis, formatting, and error recovery. Ten responsibilities. Zero supervision. Infinite loops guaranteed.
God-agents don't scale. They collapse.
I've watched three production systems die this way. Same pattern: works perfectly in demo (3 steps, happy path), breaks catastrophically in production (12 steps, edge cases, retries).
The God-Agent Failure Mode
Here's what breaks when one LLM does everything:
Scenario: Insurance claim processing
Your agent needs to classify claim type, validate policyholder, check coverage limits, calculate deductible, verify provider credentials, cross-reference diagnosis codes, check prior authorizations, determine approval/denial, generate explanation, and format output.
God-agent approach: One LLM loops through all 10 steps. Maintains full conversation history. Re-summarizes context at each step.
The math is brutal:
- 5 steps: 85% success rate
- 10 steps: 41% success rate
- 15 steps: 12% success rate
God-agents are structurally unstable beyond step 7.
Total tokens: 89,000 | Cost: $2.37 | Time: 14.3s | Result: Timeout
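The step-count falloff above is just compounding error. A toy model (my illustration, not the author's measured data): if each step independently succeeds with probability p, end-to-end success is p^n. Note the article's own figures imply p itself degrades as context bloats, which is worse than this model:

```python
# Toy model: end-to-end success of an n-step chain with independent
# per-step reliability p. The value p=0.92 is an assumed illustrative
# figure; real god-agents degrade faster because context bloat lowers
# p itself as the chain grows.
def chain_success(p: float, n: int) -> float:
    return p ** n

for n in (5, 10, 15):
    print(n, round(chain_success(0.92, n), 2))  # 0.66, 0.43, 0.29
```

Even a 92%-reliable step, chained 15 times, lands under 30% end-to-end. That is the structural instability, independent of model quality.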
The Supervisor Pattern: Decompose Before Execution
Stop giving one agent ten jobs. Give ten agents one job each. Put a supervisor in charge.
Worker agents are dumb and fast. Supervisor agent is smart and decisive.
class SupervisorAgent:
    """Orchestrates the workflow. Never executes tasks directly."""

    def __init__(self):
        self.supervisor = Llama31_8B()  # Small, fast model
        self.workers = {
            'classifier': ClaimClassifier(),       # 3B model
            'validator': PolicyValidator(),        # 3B model
            'calculator': DeductibleCalculator(),  # Deterministic
            'verifier': ProviderVerifier(),        # 8B model
            'approver': ApprovalEngine(),          # 70B model
            'formatter': OutputFormatter()         # 3B model
        }

    def process_claim(self, claim_data):
        # Decomposition happens exactly once, up front
        plan = self.supervisor.decompose(claim_data)
        results = {}
        for step in plan:
            worker = self.workers[step]
            task_input = self._extract_input_for_step(
                step, claim_data, results
            )
            result = worker.execute(task_input)
            results[step] = result
            # Early exit: a failed step never cascades
            if result.get('status') == 'FAILED':
                return self._handle_failure(step, result)
        return self.supervisor.aggregate(results)
Why this doesn't loop:
- Decomposition happens once
- Workers are stateless
- Linear execution - no worker decides "what next"
- Structured handoffs - typed objects, not conversation
- Early exits - failures stop immediately
Results:
| Metric | God-Agent | Supervisor |
|---|---|---|
| Tokens | 89,000 | 5,900 |
| Cost | $2.37 | $0.18 |
| Time | 14.3s | 2.1s |
| Success | 41% | 94% |
15x fewer tokens. 13x cheaper. 7x faster. 2.3x more reliable.
System 2 Thinking: The Critique-and-Refine Loop
Before any high-stakes decision reaches the user, a second agent audits it.
class CriticAgent:
    def __init__(self):
        self.critic = Llama31_70B()  # Larger model for deeper reasoning

    def critique(self, worker_output, original_input, policy_rules):
        # Audit checklist:
        # 1. Does reasoning cite correct policy sections?
        # 2. Are there logical contradictions?
        # 3. Does decision match cited policy?
        # 4. Are there hallucinated facts?
        critique = self.critic.audit(
            output=worker_output,
            original=original_input,
            rules=policy_rules
        )
        if not critique['approved']:
            return {'status': 'FLAGGED', 'requires': 'HUMAN_REVIEW'}
        if critique['confidence'] < 0.85:
            return {'status': 'UNCERTAIN', 'requires': 'HUMAN_REVIEW'}
        return {'status': 'APPROVED', 'decision': worker_output}
Before critic: 87% accuracy, 8% false approvals
After critic: 96% accuracy, 1.2% false approvals
ROI: Spend $380/month on critic agents, save $163,200/month on fraud prevention. 430x return.
The Handoff Protocol: Stop Re-Summarizing
Don't pass conversation history between workers. Pass typed data structures.
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class TaskContext:
    claim_id: str
    claim_type: str
    classification_result: Dict[str, Any]
    validation_result: Dict[str, Any]

    def to_worker_input(self, worker_name: str) -> str:
        # Each worker gets only the fields it needs, not the full history
        if worker_name == 'verifier':
            return f"""Verify provider credentials.
Claim ID: {self.claim_id}
Provider ID: {self.classification_result['provider_id']}
Return: valid/invalid + reason"""
At 10,000 workflows/day:
- Conversational: 382M tokens/day = $11,460/day
- Structured: 54M tokens/day = $1,620/day
- Savings: $295,200/month
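The arithmetic is easy to verify. Back-calculating from the daily figures gives a blended rate of roughly $0.03 per 1K tokens (my inference, not a quoted price):

```python
# Back-of-envelope check of the handoff savings. The $0.03/1K blended
# rate is inferred from the article's own daily figures; 30-day month.
RATE_PER_1K = 0.03

conversational_tokens = 382_000_000  # tokens/day
structured_tokens = 54_000_000       # tokens/day

conv_cost = conversational_tokens / 1000 * RATE_PER_1K   # $/day
struct_cost = structured_tokens / 1000 * RATE_PER_1K     # $/day
monthly_savings = (conv_cost - struct_cost) * 30
print(conv_cost, struct_cost, monthly_savings)
```

Both daily costs and the $295,200/month figure fall straight out, which is the point: structured handoffs cut tokens by ~7x at identical pricing.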
The 3-7 Rule
3 workers minimum - below this, supervisor overhead isn't worth it
7 workers maximum - above this, communication tax kills efficiency
Sweet spot: 5-7 specialized workers. Peak success rate (93-94%), acceptable latency (< 3s), reasonable cost (< $0.45).
Instead of 12 hyper-specialized workers, group related tasks:
- Classifier (claim + subtype classification)
- Validator (policy + coverage + limits)
- Calculator (deductible + coinsurance - deterministic)
- Verifier (provider + credentials + diagnosis codes)
- Approver (approval decision engine)
- Formatter (output generation)
Only use LLMs for ambiguity. The rest is code.
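That last point deserves a sketch. A deductible calculation has no ambiguity, so routing it through an LLM only adds cost and failure modes. A plain function (illustrative rules only, not real policy logic) does it deterministically:

```python
# Deterministic worker: no LLM needed. The rules here are illustrative
# assumptions — real deductible/coinsurance logic comes from the policy
# system of record.
def calculate_patient_share(claim_amount: float,
                            deductible_remaining: float,
                            coinsurance_rate: float = 0.20) -> dict:
    # Patient pays remaining deductible first, then coinsurance on the rest
    deductible_paid = min(claim_amount, deductible_remaining)
    remainder = claim_amount - deductible_paid
    coinsurance = remainder * coinsurance_rate
    return {
        'deductible_paid': deductible_paid,
        'coinsurance': coinsurance,
        'patient_total': deductible_paid + coinsurance,
        'status': 'OK'
    }

print(calculate_patient_share(1000.0, 300.0))
# deductible_paid=300.0, coinsurance=140.0, patient_total=440.0
```

Zero tokens, zero latency, zero hallucination risk. The calculator worker in the supervisor's roster is exactly this kind of code.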
Implementation Checklist
- Week 1: Map your god-agent's responsibilities. If 8+ distinct jobs, split it.
- Week 2: Build supervisor with 3-5 workers. Test on 10% traffic.
- Week 3: Replace conversational context with typed data structures.
- Week 4: Deploy critic for high-stakes decisions.
- Week 5: Optimize worker count (stay under 7).
- Week 6: Validate - token-to-action ratio < 2,500, latency < 3s, success > 90%.
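The Week 6 gates are easy to automate. A minimal sketch (the metric names and the telemetry feeding them are my assumptions; the thresholds are the article's):

```python
# Week 6 validation gates from the checklist, expressed as a guard.
# Thresholds match the article; the metrics dict is assumed to come
# from your own telemetry pipeline.
GATES = {
    'token_to_action_ratio': lambda v: v < 2500,
    'p95_latency_s': lambda v: v < 3.0,
    'success_rate': lambda v: v > 0.90,
}

def validate_deployment(metrics: dict) -> list:
    """Return the gates that fail; an empty list means ship it."""
    return [name for name, ok in GATES.items() if not ok(metrics[name])]

print(validate_deployment(
    {'token_to_action_ratio': 1900, 'p95_latency_s': 2.1, 'success_rate': 0.94}
))  # → []
```

Run it on every deploy; a non-empty list blocks rollout rather than letting a regressed supervisor reach 100% of traffic.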
Production Results (8 Deployments)
- 72-86% token reduction
- 65-83% latency improvement
- 2-2.3x success rate increase
- 70-88% cost reduction
Stop building god-agents. Build supervisor patterns.
Piyoosh Rai builds AI infrastructure at The Algorithm where orchestration is deterministic, not probabilistic. 8 deployments across healthcare and financial services.