Pratik Dhanave

Posted on Jun 4

Building Production Multi-Agent Systems: Real-World Lessons from Genie

#agents #ai #architecture #systemdesign

I shipped a 15-agent financial assistant on Microsoft's Multi-Agent Reference Architecture (MARA). It processes 40M requests per month, orchestrates complex workflows, and costs 40% less than alternatives.

This isn't a "how to build agents" tutorial. This is what I learned when agents broke in production.

The Problem With Single-Agent Systems

You start with one agent: "Give me a financial recommendation."

It works. Then users ask for:

Account analysis + fraud detection + investment advice in the same request
Real-time responses (P99 latency < 2 seconds)
Cost control (don't spend more on LLM calls than we make in profit)

A single agent trying to do all three becomes a bottleneck. Enter: multi-agent orchestration.

The Architecture That Worked

Customer Request
    ↓
Supervisor Agent ← (routes intelligently)
    ├→ Analyst Agent ← (account data)
    ├→ Risk Agent ← (fraud detection)
    ├→ Recommender Agent ← (investment logic)
    └→ Compliance Agent ← (regulatory checks)
    ↓
Aggregator (merge results)
    ↓
Response to customer (< 2 seconds)

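Key insight: Each agent is independent. A slow Risk Agent doesn't stop the Analyst Agent from finishing its own work, although the aggregated response still waits for the slowest agent unless a timeout cuts it off (see Pattern 4).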

Pattern 1: Hierarchical Routing

The Supervisor agent doesn't run LLM logic. It's dumb and fast:

import asyncio

class SupervisorAgent:
    def __init__(self, specialists: list["Agent"], budget: float):
        self.specialists = specialists
        self.budget = budget  # spending cap for this supervisor
        self.spent = 0.0

    async def route(self, request: dict) -> dict:
        # Determine which agents to call based on request type.
        # classify_request is a rule-based lookup (no LLM call); not shown here.
        required_agents = self.classify_request(request)

        # Call them in parallel
        tasks = [
            agent.process(request)
            for agent in required_agents
        ]

        # gather returns results in the same order as required_agents
        results = await asyncio.gather(*tasks)

        # Aggregate results (aggregate is not shown here)
        return self.aggregate(results)

Why this works:

Supervisor is lightweight (no LLM call)
Specialists run in parallel (fast)
Each specialist is reusable (easy to test, modify, scale)
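
The supervisor above leaves classify_request and aggregate undefined. A minimal sketch of what they could look like, assuming routing is a plain dictionary lookup on the request type. The ROUTES table, the request type names, and fake_specialist are hypothetical stand-ins, not Genie's actual code:

```python
import asyncio

# Hypothetical mapping from request type to the specialists it needs.
ROUTES = {
    "account_review": ["analyst", "risk"],
    "investment_advice": ["analyst", "recommender", "compliance"],
}

def classify_request(request: dict) -> list[str]:
    """Pick specialists with a dictionary lookup, so no LLM call is needed."""
    return ROUTES.get(request.get("type"), ["analyst"])

def aggregate(names: list[str], results: list[dict]) -> dict:
    """Merge specialist outputs, keyed by agent name."""
    return dict(zip(names, results))

async def fake_specialist(name: str, request: dict) -> dict:
    # Stand-in for a real agent call.
    await asyncio.sleep(0)
    return {"agent": name, "request_type": request["type"]}

async def route(request: dict) -> dict:
    names = classify_request(request)
    # gather preserves input order, so results line up with names.
    results = await asyncio.gather(*(fake_specialist(n, request) for n in names))
    return aggregate(names, results)
```

Calling asyncio.run(route({"type": "account_review"})) returns one entry per specialist, keyed "analyst" and "risk". Unknown request types fall back to the analyst alone, which is one possible default rather than a recommendation.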

Pattern 2: Cost-Aware Routing

This is where the 40% savings came from. Route by value, not by capability.

class CostGuardian:
    """Pick a model tier for every agent call based on the request's value."""

    async def execute_with_budget(self, agent: "Agent", request: dict):
        # Estimated cost of this call; check it against the remaining
        # budget before dispatching (budget check not shown here).
        cost = agent.estimated_cost(request)

        # High-value decision (customer asking about $50K portfolio)?
        # Use expensive model (GPT-4)
        if request["value"] > 10000:
            return await agent.execute_with_model("gpt-4")

        # Low-value decision (quick lookup)?
        # Use a cheap local model (served via Ollama)
        if request["value"] < 100:
            return await agent.execute_with_model("ollama:3b")

        # Medium value? Tradeoff model
        return await agent.execute_with_model("gpt-3.5-turbo")

Result: 30-40% cost reduction without sacrificing quality.
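
The routing above picks a model tier but never compares the estimated cost with a budget. A minimal sketch of both pieces as small testable units. The thresholds and model labels are copied from the code above. The Budget class and BudgetExceeded exception are hypothetical names introduced for illustration:

```python
from dataclasses import dataclass

def pick_model(request_value: float) -> str:
    """Map the monetary value of a request to a model tier."""
    if request_value > 10_000:
        return "gpt-4"          # high-value decision
    if request_value < 100:
        return "ollama:3b"      # low-value lookup
    return "gpt-3.5-turbo"      # everything in between

class BudgetExceeded(Exception):
    """Raised when a call would push spending past the limit."""

@dataclass
class Budget:
    limit: float
    spent: float = 0.0

    def charge(self, estimated_cost: float) -> None:
        # Reject the call before it runs, so spent never exceeds limit.
        if self.spent + estimated_cost > self.limit:
            raise BudgetExceeded(
                f"{estimated_cost} would exceed remaining {self.limit - self.spent}"
            )
        self.spent += estimated_cost
```

Note the boundary behavior: a request worth exactly 10000 or exactly 100 lands in the middle tier, because both comparisons are strict. Charging before the call uses estimates, so a real system would likely reconcile against actual cost afterwards.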

Pattern 3: Structured Message Passing

Don't let agents communicate via free-form strings. Use schemas:

from dataclasses import asdict, dataclass

@dataclass
class AnalystResult:
    account_id: str
    balance: float
    monthly_income: float
    recent_transactions: list[dict]
    confidence: float  # 0-1, how confident in this data?

# Supervisor validates before passing to next agent
def validate_and_forward(result: AnalystResult) -> dict:
    if result.confidence < 0.8:
        return {"error": "low confidence, retry"}

    # Dataclasses have no .dict() method; asdict converts recursively
    return asdict(result)

Why: Catch errors early. Agents can validate input before processing.
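
Validation can also live in the schema itself, so a malformed message fails at construction time instead of deep inside the next agent. A minimal sketch, assuming a RiskResult message whose field names are hypothetical and not taken from Genie:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskResult:
    account_id: str
    fraud_score: float   # 0-1, higher means more suspicious
    confidence: float    # 0-1

    def __post_init__(self) -> None:
        # Range-check both scores as soon as the message is built.
        for name in ("fraud_score", "confidence"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

def parse_risk_result(payload: dict) -> RiskResult:
    """Build a RiskResult from an untrusted dict.

    Raises KeyError on a missing field and ValueError on a bad value.
    """
    return RiskResult(
        account_id=str(payload["account_id"]),
        fraud_score=float(payload["fraud_score"]),
        confidence=float(payload["confidence"]),
    )
```

The receiving agent calls parse_risk_result on whatever arrives and handles the two exception types in one place. frozen=True keeps a validated message from being mutated later.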

Pattern 4: Timeout & Fallback Chains

In production, agents time out. Have a plan:

import asyncio

async def call_with_fallback(primary_agent, fallback_agents, request, timeout=2.0):
    """Try primary. If timeout, try fallback 1. If timeout, try fallback 2."""

    for agent in [primary_agent] + fallback_agents:
        try:
            result = await asyncio.wait_for(
                agent.process(request),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            continue  # Try next agent

    # All failed
    return {"error": "all agents timed out"}

Real example: the Risk Agent is slow (database query backlog). Fall back to the Risk Agent's cached result. The user gets a response instead of a timeout.
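
The cached-result fallback can be sketched as a wrapper that remembers the last successful answer per key. This is an illustration under stated assumptions: the CachedFallback class, the per-account key, and the "stale" flag are hypothetical, and a real cache would also need expiry:

```python
import asyncio

class CachedFallback:
    """Wrap an async agent call and keep the last good result per key."""

    def __init__(self, call, timeout: float = 2.0):
        self._call = call
        self._timeout = timeout
        self._cache: dict[str, dict] = {}

    async def process(self, key: str, request: dict) -> dict:
        try:
            result = await asyncio.wait_for(self._call(request), timeout=self._timeout)
        except asyncio.TimeoutError:
            if key in self._cache:
                # Serve the previous answer, marked so callers know it is old.
                return {**self._cache[key], "stale": True}
            return {"error": "timed out, no cached result"}
        self._cache[key] = result
        return result
```

Marking the response as stale lets the aggregator decide whether an old fraud score is acceptable for a given request. For a high-value decision it may be safer to return the error than an outdated result.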

Observability: The Game Changer

You need to see every agent call:

import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def agent_execute(agent_id: str, agent: "Agent", request: dict):
    with tracer.start_as_current_span(f"agent.{agent_id}") as span:
        span.set_attribute("request.type", request["type"])
        span.set_attribute("request.value", request["value"])

        start = time.perf_counter()
        result = await agent.process(request)

        # perf_counter returns seconds; convert to milliseconds
        span.set_attribute("latency_ms", (time.perf_counter() - start) * 1000)
        span.set_attribute("cost", agent.last_cost)
        span.set_attribute("success", result.get("error") is None)

        return result

In production, I can see:

Which agent is slow (latency heatmap)
Which agent is expensive (cost per call)
Which agent fails most (error rate by agent)

This data drove our optimization decisions.
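
The three views above reduce to a group-by over span attributes. A minimal sketch using only the standard library, assuming the spans have been exported as dicts with hypothetical keys agent, latency_ms, cost, and success. In practice a tracing backend would compute this:

```python
from collections import defaultdict

def summarize(spans: list[dict]) -> dict[str, dict]:
    """Compute per-agent call count, average latency, average cost, error rate."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["agent"]].append(span)

    summary = {}
    for agent, items in grouped.items():
        n = len(items)
        summary[agent] = {
            "calls": n,
            "avg_latency_ms": sum(s["latency_ms"] for s in items) / n,
            "avg_cost": sum(s["cost"] for s in items) / n,
            # not True == False, which sums as 1 for each failed call
            "error_rate": sum(not s["success"] for s in items) / n,
        }
    return summary
```

Averages hide tail latency, so a real dashboard would track percentiles such as P99 per agent as well.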

The Real Numbers

Before multi-agent:

P99 latency: 5 seconds (single agent doing everything)
Cost per request: $0.032
Error rate: 2.1%

After multi-agent + patterns above:

P99 latency: 1.2 seconds (76% reduction)
Cost per request: $0.018 (about 44% savings)
Error rate: 0.3% (better due to fallbacks)

Scale: 40M requests/month × $0.014 savings = $560K/month saved.
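
The arithmetic behind these figures can be checked directly from the before and after values listed above. A small sketch, where relative_reduction is a helper name introduced here:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before

latency_cut = relative_reduction(5.0, 1.2)      # P99 latency, seconds
cost_cut = relative_reduction(0.032, 0.018)     # cost per request, dollars
error_cut = relative_reduction(2.1, 0.3)        # error rate, percent

requests_per_month = 40_000_000
monthly_savings = requests_per_month * (0.032 - 0.018)

print(f"latency: {latency_cut:.0%}, cost: {cost_cut:.0%}, errors: {error_cut:.0%}")
print(f"monthly savings: ${monthly_savings:,.0f}")
```

From the listed values, latency drops by 76%, cost per request by 43.75%, and the per-request saving of $0.014 across 40M requests comes to $560,000 per month.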

The Trap: Premature Multi-Agent

Don't build this until you need it. Signals you need multi-agent:

Single agent request handling time > 1 second
One request needs >3 different LLM calls
Cost per request > $0.01
Error rate > 0.5%

Before that? Single agent is simpler and faster to build.
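
These signals are easy to turn into a check that runs against your own metrics. A minimal sketch with the thresholds copied from the list above. The function and argument names are hypothetical:

```python
def multi_agent_signals(
    latency_s: float,
    llm_calls_per_request: int,
    cost_per_request: float,
    error_rate: float,
) -> list[str]:
    """Return the names of the thresholds a single-agent system has crossed."""
    signals = []
    if latency_s > 1.0:
        signals.append("latency")
    if llm_calls_per_request > 3:
        signals.append("llm_calls")
    if cost_per_request > 0.01:
        signals.append("cost")
    if error_rate > 0.005:  # 0.5%, expressed as a fraction
        signals.append("error_rate")
    return signals
```

An empty list suggests staying with one agent. How many signals should trigger a redesign is a judgment call that the thresholds alone don't settle.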

What I'd Do Differently

Start with one agent, add observability immediately — then you'll see where multi-agent helps
Cost budgets from day 1 — not an afterthought
Test with mocks first — multi-agent systems are hard to test with real LLM calls
Document processor agreements — especially important if agents share data

The Hardest Part

Not the architecture. The hardest part is testing. You can't mock everything. You need:

Unit tests (mock all LLM calls)
Integration tests (real agents, fake data)
Load tests (does it handle 30K RPS?)
Cost audits (are we actually saving money?)
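
The first layer, unit tests with every LLM call mocked, can be sketched with unittest.mock.AsyncMock from the standard library. The recommend function and the llm.complete method are hypothetical stand-ins for a specialist agent and its LLM client:

```python
import asyncio
from unittest.mock import AsyncMock

async def recommend(llm, request: dict) -> dict:
    """Hypothetical specialist: asks the LLM client for advice text."""
    text = await llm.complete(f"Advise on a portfolio worth {request['value']}")
    return {"advice": text, "value": request["value"]}

def test_recommend_uses_llm_once():
    # AsyncMock makes llm.complete awaitable and records how it was awaited.
    llm = AsyncMock()
    llm.complete.return_value = "Hold."

    result = asyncio.run(recommend(llm, {"value": 50_000}))

    assert result == {"advice": "Hold.", "value": 50_000}
    llm.complete.assert_awaited_once()
```

A test like this checks the agent's plumbing (prompt construction, result shape, number of calls) without spending money or depending on model output, which is why it can't replace the integration and cost layers.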

Next Steps

Multi-agent systems aren't magic. They're cost-aware, observable, fault-tolerant orchestration.

If you're building agents:

Start with one
Add observability (OpenTelemetry)
Scale to multiple agents only when you hit limits
Enforce budgets from the start

Questions? I detail implementation patterns for compliance-driven multi-agent systems (GDPR, PSD2) on my portfolio.

Tags: #architecture #python #ai #production #observability
