DEV Community: Deep Mehta

Mixture-of-Agents: Making LLMs Collaborate Instead of Compete

Deep Mehta — Fri, 20 Feb 2026 18:55:22 +0000

What if instead of picking the best model for your prompt, you made all models collaborate on the answer?

That's the core idea behind Mixture-of-Agents (MoA) — a technique from a 2024 research paper that showed LLMs produce better outputs when they can see and improve upon each other's responses. The paper demonstrated that even weaker models can boost the quality of stronger ones through this iterative refinement.

I implemented MoA as a production API endpoint. This post covers the architecture, the six strategies I built, the engineering decisions that weren't obvious, and the parts that surprised me.

The Problem With "Just Pick the Best Model"

Most developers approach multi-model setups with a simple question: which model is best for this task? But the answer changes depending on the prompt, the domain, the time of day, and honestly a bit of luck.

I noticed something while building a Compare mode that runs the same prompt through multiple models simultaneously. When I looked at the side-by-side outputs, the best answer was rarely from a single model. One model would nail the structure. Another would have a better code example. A third would catch an edge case the others missed.

The insight: the best response doesn't exist yet — it's a synthesis of what each model does well.

How MoA Works: The Two-Phase Architecture

Every MoA request follows the same skeleton:

Phase 1: Source Generation
  └── N models answer the prompt independently

Phase 2: Synthesis
  └── A synthesizer model combines the best parts

Phase 1 is embarrassingly parallel — all models run concurrently. Phase 2 is where the strategy matters.

import asyncio

async def blend(models, synthesizer, messages, strategy):
    # Phase 1: Get source responses (concurrent)
    tasks = [call_model(m, messages) for m in models]
    source_responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter failures
    successes = [r for r in source_responses if not isinstance(r, Exception)]

    if len(successes) == 0:
        raise AllSourcesFailedError()

    # Phase 2: Synthesize based on strategy
    return await synthesize(synthesizer, messages, successes, strategy)

This looks simple, but the synthesis step is where the engineering complexity lives.

Six Strategies, Six Different Behaviors

I didn't build just one synthesis approach. Different use cases need different synthesis behaviors.

Strategy 1: Consensus (Default)

The synthesizer gets all source responses and one instruction: combine the strongest points while resolving contradictions.

CONSENSUS_PROMPT = """You are a synthesis expert. You have received multiple 
responses to the same question from different AI models. 

Your job:
1. Identify the strongest points from each response
2. Resolve any contradictions by weighing the majority view
3. Produce one definitive answer that's better than any individual response

Do not mention that multiple models were consulted.
"""

This is the workhorse strategy. For most prompts, consensus produces noticeably better answers than any single model. The synthesizer naturally picks the best explanation from one model, the best code from another, and structures it coherently.
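
Here's a minimal sketch of how the source responses might be handed to the synthesizer. The function name build_synthesis_messages, the "Response N" labels, and the message layout are assumptions for illustration, not the production format. Only the idea of passing every source response alongside a synthesis instruction comes from the description above.

```python
def build_synthesis_messages(system_prompt, user_messages, source_responses):
    """Assemble a chat payload for the synthesizer (hypothetical layout)."""
    # Label responses by position rather than by model name, so the
    # synthesizer weighs the content instead of the brand.
    numbered = [
        f"Response {i}:\n{text.strip()}"
        for i, text in enumerate(source_responses, start=1)
    ]
    reference_block = "\n\n".join(numbered)

    return [
        {"role": "system", "content": f"{system_prompt.strip()}\n\n{reference_block}"},
        *user_messages,  # the original conversation, unchanged
    ]


messages = build_synthesis_messages(
    "You are a synthesis expert.",
    [{"role": "user", "content": "What is a mutex?"}],
    ["A lock.", "A mutual-exclusion primitive."],
)
print(messages[0]["content"])
```

Keeping the user's conversation untouched and putting the references in the system message means the synthesizer still answers the original question rather than summarizing the references.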

Strategy 2: Council

Same input, but the synthesis output is structured differently:

{
  "final_answer": "The synthesized conclusion",
  "agreement_points": ["Where all models aligned"],
  "disagreement_points": ["Where they diverged + analysis"],
  "follow_up_questions": ["Areas needing exploration"]
}

Council mode is invaluable when you need transparency about model consensus. If you're using LLMs for research or decision support, knowing where models agree vs. disagree is often more useful than a single blended answer.
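
Structured output from a synthesizer is worth validating before it reaches the caller. The sketch below assumes the synthesizer returns the JSON shape shown above as a raw string. The helper name parse_council_output and the choice to raise ValueError are my own illustration, not part of the API.

```python
import json

# Expected top-level fields and their JSON types, taken from the shape above.
REQUIRED_FIELDS = {
    "final_answer": str,
    "agreement_points": list,
    "disagreement_points": list,
    "follow_up_questions": list,
}


def parse_council_output(raw):
    """Parse and validate a Council response; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"synthesizer did not return valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return data


good = parse_council_output(json.dumps({
    "final_answer": "Use a token bucket.",
    "agreement_points": ["Token bucket fits bursty traffic"],
    "disagreement_points": [],
    "follow_up_questions": ["What is the target burst size?"],
}))
print(good["final_answer"])
```

On a validation failure, a caller could retry the synthesis or fall back to a plain Consensus answer.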

Strategy 3: Best-Of

The synthesizer picks the single best response and enhances it with useful additions from the others. Minimal rewriting — focused on augmentation.

This is the fastest synthesis approach and works well when one model clearly dominates but the others have minor additions worth incorporating.

Strategy 4: Chain

The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution.

Step 1: Start with Model A's response as base
Step 2: Read Model B's response, integrate new points
Step 3: Read Model C's response, integrate new points
Step 4: Final coherence pass

Chain produces the most thorough output but tends to be longer. Use it when completeness matters more than conciseness.

Strategy 5: MoA (The Real Thing)

This is where it gets interesting. The previous strategies are all single-pass synthesis. True MoA adds refinement layers where models iterate on each other's work.

Here's how it works:

Layer 0: Each model answers independently
         GPT → Response A₀
         Claude → Response B₀  
         Gemini → Response C₀

Layer 1: Each model sees Layer 0's answers as "references"
         GPT sees [B₀, C₀] → produces A₁ (improved)
         Claude sees [A₀, C₀] → produces B₁ (improved)
         Gemini sees [A₀, B₀] → produces C₁ (improved)

Layer 2: Each model sees Layer 1's answers
         GPT sees [B₁, C₁] → produces A₂
         Claude sees [A₁, C₁] → produces B₂
         Gemini sees [A₁, B₁] → produces C₂

Final: Synthesizer combines Layer 2 outputs
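
The "each model sees everyone's answer but its own" rule from the diagram can be sketched as below. The helper names, the "[Reference N]" labels, and the short placeholder template are assumptions for illustration. In a real request the template would be whatever reference prompt you use.

```python
def references_for(model, prev_layer):
    """Previous-layer answers visible to `model`: everyone's except its own."""
    return [text for name, text in prev_layer.items() if name != model]


def build_layer_messages(model, user_messages, prev_layer, template):
    """Prepend a system message carrying the references to the conversation."""
    numbered = [
        f"[Reference {i}]\n{text}"
        for i, text in enumerate(references_for(model, prev_layer), start=1)
    ]
    system = {
        "role": "system",
        "content": template.format(references="\n\n".join(numbered)),
    }
    return [system, *user_messages]


# prev_layer maps model name -> that model's answer from the previous layer.
layer0 = {"gpt": "A0", "claude": "B0", "gemini": "C0"}
question = [{"role": "user", "content": "Explain backpressure."}]
msgs = build_layer_messages(
    "gpt", question, layer0, template="Reference answers:\n\n{references}"
)
print(msgs[0]["content"])
```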

Each layer's responses are injected as reference material via system message:

REFERENCE_INJECTION = """Below are responses from other AI assistants 
for the same question. Use them as references to improve your answer.
Identify what's strong, correct any errors, and expand where needed.

{references}

Now provide your improved response to the original question.
"""

The Engineering Decisions That Mattered

Reference budget management. You can't just dump three 4,000-token responses into the context of every model at every layer. I set a total reference budget of 12,000 characters across all references, with a 3,200-character cap per individual answer. Anything longer gets truncated. This keeps costs sane while preserving the most useful content.

MAX_TOTAL_CHARS = 12_000
MAX_PER_ANSWER = 3_200

def prepare_references(responses):
    truncated = [r[:MAX_PER_ANSWER] for r in responses]

    total = sum(len(r) for r in truncated)
    if total > MAX_TOTAL_CHARS:
        # Proportionally reduce each
        ratio = MAX_TOTAL_CHARS / total
        truncated = [r[:int(len(r) * ratio)] for r in truncated]

    return truncated
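
To see how the two caps interact, here is the same function run on made-up inputs. The function is repeated so the snippet runs on its own.

```python
MAX_TOTAL_CHARS = 12_000
MAX_PER_ANSWER = 3_200


def prepare_references(responses):
    truncated = [r[:MAX_PER_ANSWER] for r in responses]
    total = sum(len(r) for r in truncated)
    if total > MAX_TOTAL_CHARS:
        ratio = MAX_TOTAL_CHARS / total
        truncated = [r[:int(len(r) * ratio)] for r in truncated]
    return truncated


# Four 4,000-character answers: the per-answer cap leaves 4 x 3,200 = 12,800,
# which still exceeds the total budget, so each is scaled by 12,000 / 12,800.
four = prepare_references(["x" * 4_000] * 4)
print([len(r) for r in four])   # [3000, 3000, 3000, 3000]

# Three answers never reach the second step: 3 x 3,200 = 9,600 < 12,000.
three = prepare_references(["x" * 4_000] * 3)
print([len(r) for r in three])  # [3200, 3200, 3200]
```

With these constants the total budget only comes into play from four references upward.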

Early stopping. If a layer produces zero successful responses (all models hit rate limits or errors), the system keeps the previous layer's successes and skips to synthesis. This prevents total failure when one bad layer would cascade.

async def run_moa_layers(models, messages, num_layers):
    prev_responses = None

    for layer in range(num_layers):
        layer_responses = await run_layer(
            models, messages, prev_responses
        )

        successes = [r for r in layer_responses if r is not None]

        if len(successes) == 0 and prev_responses:
            # Early stop: keep previous layer's results
            break

        if len(successes) > 0:
            prev_responses = successes

    return prev_responses

Layer count sweet spot. The paper tested up to 3 layers. In practice, I found that 1-2 layers give the best quality-to-cost ratio. Layer 0 to Layer 1 produces the biggest quality jump. Layer 1 to Layer 2 is a marginal improvement for twice as many refinement calls. I default to layers: 1 and let users override.

Strategy 6: Self-MoA

What if you trust one model but want to hedge against its variance? Self-MoA generates multiple diverse candidates from a single model by varying the temperature and system prompt.

TEMPERATURE_OFFSETS = [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]

AGENT_PROMPTS = [
    "Focus on technical accuracy and precision.",
    "Prioritize practical examples and real-world applications.",
    "Emphasize clarity and make the explanation accessible.",
    "Be thorough and cover edge cases others might miss.",
    "Challenge assumptions and flag potential weaknesses.",
    "Focus on brevity and directness.",
]

For a request with temperature: 0.7 and 4 samples:

Candidate 1: temp 0.45, prompt "accuracy"     → conservative
Candidate 2: temp 0.70, prompt "practical"     → baseline
Candidate 3: temp 0.95, prompt "clarity"       → creative
Candidate 4: temp 1.15, prompt "edge cases"    → exploratory

The synthesizer then combines these four perspectives into one answer. It's surprisingly effective — you get diversity without paying for multiple model providers.
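
A minimal sketch of how the candidate settings in that example could be derived. The function name build_candidates, the modulo wrap-around when samples exceed the list lengths, and the clamp to the range 0 to 2.0 are assumptions. The offsets and prompts are the ones listed above.

```python
TEMPERATURE_OFFSETS = [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]

AGENT_PROMPTS = [
    "Focus on technical accuracy and precision.",
    "Prioritize practical examples and real-world applications.",
    "Emphasize clarity and make the explanation accessible.",
    "Be thorough and cover edge cases others might miss.",
    "Challenge assumptions and flag potential weaknesses.",
    "Focus on brevity and directness.",
]


def build_candidates(base_temperature, samples, max_temperature=2.0):
    """Return one {temperature, system_prompt} config per requested sample."""
    candidates = []
    for i in range(samples):
        # Wrap around if more samples are requested than entries exist
        # (assumed behavior).
        offset = TEMPERATURE_OFFSETS[i % len(TEMPERATURE_OFFSETS)]
        prompt = AGENT_PROMPTS[i % len(AGENT_PROMPTS)]
        # Keep the result inside a range providers typically accept
        # (assumed bounds).
        temperature = min(max(base_temperature + offset, 0.0), max_temperature)
        candidates.append(
            {"temperature": round(temperature, 2), "system_prompt": prompt}
        )
    return candidates


for candidate in build_candidates(0.7, 4):
    print(candidate["temperature"], candidate["system_prompt"])
```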

What Surprised Me

Weaker models genuinely improve stronger ones. I was skeptical, but the data backs the paper's finding. When Gemini Flash (a fast, cheap model) is included alongside GPT and Claude in MoA, the final synthesized answer is often better than a 2-model blend of just GPT + Claude. The weaker model catches things the stronger ones miss or phrases things differently enough to trigger better synthesis.

The synthesizer model matters more than the source models. If I had to pick where to spend my budget, I'd put the best model as the synthesizer and use cheaper models as sources. The synthesis step is where quality is won or lost.

Consensus beats MoA for simple prompts. Full MoA with refinement layers is overkill for straightforward questions. The extra API calls and latency aren't worth it. I use MoA for high-value outputs — technical architecture decisions, long-form content, complex code generation — where the quality improvement justifies 3-4x the cost.

Streaming MoA is a UX challenge. In Compare mode, you can stream each model's response as it arrives. In MoA, the user sees nothing until Phase 2 starts. I solved this by streaming status events during Phase 1 so the user knows progress is happening:

{"event": "source", "model": "gpt-5.2", "status": "complete", "tokens": 847}
{"event": "source", "model": "claude-sonnet-4.5", "status": "complete", "tokens": 1203}
{"event": "source", "model": "gemini-3-flash", "status": "complete", "tokens": 692}
{"event": "synthesis", "status": "starting", "strategy": "consensus"}
{"event": "chunk", "content": "The key difference between..."}
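
One way to produce those Phase 1 status lines is to emit an event as each source call finishes, using asyncio.as_completed. This is a sketch under assumptions: stream_phase_one, the results dict, and the stubbed fake_call are hypothetical, and the token counts shown in the events above are left out.

```python
import asyncio
import json


async def stream_phase_one(models, call_model, messages, results):
    """Yield a JSON status line per source model, in completion order.

    Successful response text is stored in `results` for the synthesis phase.
    """

    async def run(model):
        try:
            results[model] = await call_model(model, messages)
            return {"event": "source", "model": model, "status": "complete"}
        except Exception as exc:  # report the failure instead of aborting the stream
            return {"event": "source", "model": model, "status": "failed",
                    "error": str(exc)}

    tasks = [asyncio.create_task(run(m)) for m in models]
    for finished in asyncio.as_completed(tasks):
        yield json.dumps(await finished)


async def fake_call(model, messages):
    # Stand-in for a real provider call; the delays are arbitrary.
    await asyncio.sleep({"slow-model": 0.05, "fast-model": 0.01}[model])
    return f"answer from {model}"


async def main():
    results = {}
    stream = stream_phase_one(["slow-model", "fast-model"], fake_call, [], results)
    events = [line async for line in stream]
    return events, results


events, results = asyncio.run(main())
print("\n".join(events))
```

Events arrive in completion order, not request order, so the fastest model reports first.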

When to Use What

Here's my decision framework after running thousands of requests through each strategy:

Strategy	Best For	Cost	Latency
Consensus	General-purpose blending	4 credits	Moderate
Council	Research, decision support	4 credits	Moderate
Best-Of	When one model usually wins	4 credits	Fast
Chain	Maximum thoroughness	4 credits	Moderate
MoA (1 layer)	High-value outputs	4 credits	Higher
Self-MoA	Single model, want diversity	4 credits	Moderate

All strategies cost the same from a billing perspective because the credit cost is fixed per Blend request. The real cost difference is in the underlying API calls: MoA with 2 layers and 3 models makes 10 API calls (3 models × 3 rounds for Layers 0-2, plus 1 synthesis call), while Consensus makes 4 (3 source + 1 synthesis).
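
The call count is easy to work out for any configuration. The helper below is a sketch that assumes the Layer 0/1/2 numbering from the MoA diagram: Layer 0 is its own round of source calls, each refinement layer reruns every model, and synthesis is one additional call. The function name is hypothetical.

```python
def api_calls(num_models, refinement_layers=0):
    """Total provider calls for one Blend request under the assumptions above."""
    source_rounds = 1 + refinement_layers  # Layer 0 plus each refinement layer
    return num_models * source_rounds + 1  # +1 for the synthesizer


print(api_calls(3))      # Consensus: 3 source + 1 synthesis = 4
print(api_calls(3, 1))   # MoA, 1 layer: 3 + 3 + 1 = 7
print(api_calls(3, 2))   # MoA, 2 layers: 3 + 3 + 3 + 1 = 10
```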

Try It Yourself

If you want to experiment with these strategies, the full API is at LLMWise. A Blend request looks like:

curl -X POST https://llmwise.ai/api/v1/blend \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
    "synthesizer": "claude-sonnet-4.5",
    "strategy": "moa",
    "layers": 1,
    "messages": [
      {"role": "user", "content": "Design a rate limiter for a distributed system"}
    ],
    "stream": true
  }'

The complete technical documentation covering all six strategies, the scoring algorithms, and the reference injection system is at llmwise.ai/llms-full.txt.

The Bigger Picture

MoA represents a shift in how we think about LLMs. Instead of asking "which model is best?", we ask "how can models collaborate?" The answer turns out to be: surprisingly well, when you give them the right architecture.

The techniques here aren't theoretical. They're running in production, handling real requests, and consistently producing better outputs than any single model alone. The cost overhead is real, but for high-value use cases, the quality improvement is worth it.

If you're running multi-model setups in production, I'd love to hear your approach. Are you blending outputs or just routing to the best model? What's working?

Circuit Breakers for LLM APIs: Applying SRE Patterns to AI Infrastructure

Deep Mehta — Sat, 14 Feb 2026 20:53:29 +0000

You've shipped your AI feature. Users love it. Then OpenAI returns a 429 Too Many Requests at 2pm on a Tuesday and your entire product goes down.

If you've built anything on LLM APIs, you've felt this pain. The irony is that we solved this exact problem in distributed systems fifteen years ago. Circuit breakers, health checks, failover chains — these patterns are standard in every microservice architecture. But most LLM integrations today are just raw API calls with maybe a try/except block and a prayer.

I'm an SRE who spent the last decade building reliability into production systems at scale. When I started building AI products, I was shocked at how fragile the infrastructure layer was. So I applied what I knew.

This post covers the patterns that actually work.

The Problem: LLM APIs Are Unreliable by Design

LLM providers aren't like your typical REST API. They have unique failure modes:

Rate limits are aggressive and unpredictable. OpenAI's 429s can hit mid-conversation with no warning.
Latency variance is extreme. The same prompt might take 800ms or 12 seconds depending on load.
Outages are frequent. Every major provider has had multi-hour outages in the past year.
Degraded performance is silent. A model might respond but with noticeably worse quality during high load.

Most developers handle this with retry logic:

# The "prayer-based reliability" approach
for attempt in range(3):
    try:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        return response
    except Exception:
        time.sleep(2 ** attempt)
raise Exception("All retries failed")

This is better than nothing, but it has serious problems. You're waiting through all retries before failing (potentially 15+ seconds of user-facing latency). You're hammering a provider that's already struggling. And you're not learning anything from the failures.

Pattern 1: The Circuit Breaker

The circuit breaker pattern comes from electrical engineering — when current exceeds safe levels, the breaker trips and stops the flow. In software, it means: if a service is failing, stop sending it traffic immediately instead of waiting for each request to time out.

Here's the state machine:

CLOSED (healthy)
  │
  ├── Request succeeds → stay CLOSED, reset failure count
  │
  └── Request fails → increment failure count
        │
        └── Failures >= threshold? → trip to OPEN
                                       │
OPEN (broken)                          │
  │                                    │
  ├── All requests → instant fail      │
  │   (no API call made)               │
  └── Cool-down timer expires?         │
        │                              │
        └── Move to HALF-OPEN ─────────┘
                │
HALF-OPEN (testing)
  │
  ├── Send ONE probe request
  │     │
  │     ├── Succeeds → back to CLOSED
  │     └── Fails → back to OPEN (reset timer)
  │
  └── All other requests → instant fail

In practice, this means:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None

    def can_execute(self):
        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # Check if cool-down period has passed
            elapsed = time.time() - self.last_failure_time
            if elapsed >= self.cool_down:
                self.state = "HALF_OPEN"
                return True  # Allow one probe request
            return False  # Fail fast

        if self.state == "HALF_OPEN":
            return False  # Only one probe at a time

        return False

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

The key insight: when a circuit is open, you fail instantly — zero latency wasted. Your user gets a fast fallback instead of watching a spinner for 30 seconds while retries pile up against a dead endpoint.
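
Here is a short walkthrough of the state machine using the same logic as the class above. The one change is an injectable clock, which is my addition so the cool-down can be simulated without sleeping.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30,
                 clock=time.monotonic):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None
        self.clock = clock  # injectable so the demo can control time

    def can_execute(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if self.clock() - self.last_failure_time >= self.cool_down:
                self.state = "HALF_OPEN"
                return True   # the single probe request
            return False      # fail fast
        return False          # HALF_OPEN: a probe is already in flight

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = self.clock()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"


now = [0.0]                       # fake clock, advanced by hand
breaker = CircuitBreaker(clock=lambda: now[0])

for _ in range(3):                # three consecutive failures trip the breaker
    breaker.record_failure()
states = [breaker.state]          # OPEN

now[0] = 10.0
blocked = breaker.can_execute()   # False: still cooling down

now[0] = 31.0
probe = breaker.can_execute()     # True: cool-down elapsed, probe allowed
states.append(breaker.state)      # HALF_OPEN
second = breaker.can_execute()    # False: only one probe at a time

breaker.record_success()          # the probe succeeded
states.append(breaker.state)      # CLOSED
print(states, blocked, probe, second)
```

Note that can_execute changes state when the cool-down has elapsed, so it should only be called when you actually intend to send the request it permits.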

Pattern 2: Failover Chains

A circuit breaker alone just tells you something is broken. You still need somewhere to send the traffic. That's where failover chains come in.

Instead of one model, you define an ordered list:

failover_chain = [
    {"model": "gpt-5.2", "provider": "openai"},
    {"model": "claude-sonnet-4.5", "provider": "anthropic"},
    {"model": "gemini-3-flash", "provider": "google"},
]

The routing logic walks the chain:

async def route_with_failover(messages, chain):
    route_trace = []

    for i, target in enumerate(chain):
        model = target["model"]
        breaker = get_circuit_breaker(model)

        # Skip if circuit is open
        if not breaker.can_execute():
            route_trace.append({
                "model": model,
                "action": "skipped",
                "reason": "circuit_open"
            })
            continue

        try:
            start = time.monotonic()
            response = await call_model(model, messages)
            latency = time.monotonic() - start

            breaker.record_success()
            update_latency(model, latency * 1000)  # smoothing helper takes milliseconds

            route_trace.append({
                "model": model,
                "action": "success",
                "latency_ms": int(latency * 1000)
            })

            return response, route_trace

        except RateLimitError:
            breaker.record_failure()
            route_trace.append({
                "model": model,
                "action": "failed",
                "reason": "rate_limit_429"
            })
            continue

        except Exception as e:
            breaker.record_failure()
            route_trace.append({
                "model": model,
                "action": "failed",
                "reason": str(e)
            })
            continue

    raise AllProvidersFailedError(route_trace)

The route trace is important — it gives you observability into what happened. A typical trace might look like:

{
  "trace": [
    {"model": "gpt-5.2", "action": "failed", "reason": "rate_limit_429"},
    {"model": "claude-sonnet-4.5", "action": "success", "latency_ms": 1847}
  ],
  "saved_time": "~12.4s vs waiting for rate limit reset"
}

Your user got a response in under 2 seconds instead of waiting for OpenAI's rate limit to reset (typically 20-60 seconds).

Pattern 3: Latency Tracking With Exponential Smoothing

Static failover chains aren't enough. Model performance changes throughout the day. You need to track which models are fast right now and route accordingly.

Simple averaging doesn't work because it weighs a latency spike from yesterday the same as current performance. Instead, use exponential smoothing:

def update_latency(model, new_latency_ms):
    alpha = 0.2  # Weight for new measurement
    current_avg = get_avg_latency(model)

    # Recent measurements matter more
    smoothed = (current_avg * (1 - alpha)) + (new_latency_ms * alpha)

    set_avg_latency(model, smoothed)
    return smoothed

With alpha = 0.2, each new measurement closes 20% of the remaining gap between the smoothed average and the model's current latency, so a sustained change in either direction is about half reflected after 3 requests and about three-quarters reflected after 6. This is much more responsive than a simple moving average over a long window.
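
A quick worked example of how the smoothed value follows a sustained change. The starting average and the new latency are made-up numbers, and smooth is a standalone version of the update formula without the storage helpers.

```python
def smooth(current_avg, new_latency_ms, alpha=0.2):
    """One exponential-smoothing step: blend the old average with a new sample."""
    return current_avg * (1 - alpha) + new_latency_ms * alpha


# A model averaging 1,000 ms suddenly starts answering in 3,000 ms.
avg = 1000.0
history = []
for _ in range(6):
    avg = smooth(avg, 3000.0)
    history.append(round(avg))

print(history)  # [1400, 1720, 1976, 2181, 2345, 2476]
```

The same curve applies in reverse when a slow model recovers, because the formula treats increases and decreases identically.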

You can use this to make routing decisions:

def select_fastest_healthy_model(chain):
    candidates = []
    for target in chain:
        model = target["model"]
        breaker = get_circuit_breaker(model)
        if breaker.can_execute():
            candidates.append({
                "model": model,
                "avg_latency": get_avg_latency(model)
            })

    # Sort by latency, pick fastest
    candidates.sort(key=lambda x: x["avg_latency"])
    return candidates[0]["model"] if candidates else None

Pattern 4: Multi-Strategy Routing

In production, you don't always want the fastest model. Sometimes you want the cheapest. Sometimes you need the most reliable. The routing strategy should be configurable per request or per use case.

Define weight vectors for different goals:

ROUTING_STRATEGIES = {
    "balanced":    {"success_rate": 0.4, "latency": 0.3, "cost": 0.3},
    "speed":       {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":        {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
    "reliability": {"success_rate": 0.7, "latency": 0.2, "cost": 0.1},
}

def score_model(model, strategy="balanced"):
    weights = ROUTING_STRATEGIES[strategy]
    stats = get_model_stats(model)

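    # Normalize metrics (higher = better). max_latency and max_cost are assumed
    # to be normalization ceilings, e.g. the worst values across candidate models.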
    inv_latency = 1.0 - (stats["avg_latency"] / max_latency)
    inv_cost = 1.0 - (stats["avg_cost"] / max_cost)

    score = (
        weights["success_rate"] * stats["success_rate"] +
        weights["latency"] * inv_latency +
        weights["cost"] * inv_cost
    )

    return score

This lets you set "optimization_goal": "cost" for a background batch job and "optimization_goal": "speed" for a real-time chat interface — using the same infrastructure.
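
A self-contained sketch of the same scoring idea with the normalization filled in. Two things here are assumptions: max_latency and max_cost are taken as the worst values among the candidates being ranked, and the model names and stats are invented to make the trade-offs visible.

```python
ROUTING_STRATEGIES = {
    "balanced":    {"success_rate": 0.4, "latency": 0.3, "cost": 0.3},
    "speed":       {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":        {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
    "reliability": {"success_rate": 0.7, "latency": 0.2, "cost": 0.1},
}

# Hypothetical stats; in practice these would come from your own request logs.
MODEL_STATS = {
    "model-a": {"success_rate": 0.99, "avg_latency": 2000, "avg_cost": 0.030},
    "model-b": {"success_rate": 0.75, "avg_latency": 600,  "avg_cost": 0.020},
    "model-c": {"success_rate": 0.85, "avg_latency": 2400, "avg_cost": 0.002},
}


def rank_models(stats_by_model, strategy="balanced"):
    """Return model names sorted from best to worst for the given strategy."""
    weights = ROUTING_STRATEGIES[strategy]
    # Normalize against the worst candidate, so the slowest and the most
    # expensive model each score 0 on that axis (assumed normalization).
    max_latency = max(s["avg_latency"] for s in stats_by_model.values())
    max_cost = max(s["avg_cost"] for s in stats_by_model.values())

    def score(stats):
        inv_latency = 1.0 - stats["avg_latency"] / max_latency
        inv_cost = 1.0 - stats["avg_cost"] / max_cost
        return (
            weights["success_rate"] * stats["success_rate"]
            + weights["latency"] * inv_latency
            + weights["cost"] * inv_cost
        )

    return sorted(stats_by_model, key=lambda m: score(stats_by_model[m]), reverse=True)


for strategy in ("speed", "cost", "reliability"):
    print(strategy, rank_models(MODEL_STATS, strategy)[0])
```

With these numbers, "speed" picks model-b, "cost" picks model-c, and "reliability" picks model-a. Because success rates usually cluster near 1.0 while latency and cost are spread across the full 0 to 1 range, the success-rate weight has less pull than its number suggests.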

Pattern 5: Replay Testing

Here's the pattern most people skip, and it's arguably the most valuable: testing routing policy changes against historical traffic before deploying them.

The concept is simple. You log every request (model chosen, latency, cost, success/failure). When you want to change your routing policy — say, switching from "balanced" to "speed" or adding a new model to your chain — you replay recent traffic through the new policy and compare:

def replay_test(historical_requests, old_policy, new_policy):
    old_results = simulate(historical_requests, old_policy)
    new_results = simulate(historical_requests, new_policy)

    return {
        "cost_delta": new_results["total_cost"] - old_results["total_cost"],
        "latency_delta": new_results["avg_latency"] - old_results["avg_latency"],
        "success_rate_delta": (
            new_results["success_rate"] - old_results["success_rate"]
        ),
    }

This prevents the "I changed the routing and now everything is slower" surprise. You get confidence in the change before it hits production.
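
The simulate step is where the real work hides. The sketch below is one possible version under strong assumptions: a policy is an ordered failover chain of model names, and each logged request carries per-model outcomes (for example, gathered from shadow traffic). That log shape is hypothetical. If you only log the model that actually served the request, you would need estimated per-model stats instead.

```python
def simulate(historical_requests, policy):
    """Replay logged requests through a failover chain (`policy` is a model list)."""
    total_cost = 0.0
    total_latency = 0.0
    successes = 0
    for request in historical_requests:
        for model in policy:
            outcome = request["outcomes"].get(model)
            if outcome is None:
                continue  # no recorded data for this model on this request
            # Failed attempts still add their latency (and any cost) to the total.
            total_cost += outcome["cost"]
            total_latency += outcome["latency_ms"]
            if outcome["ok"]:
                successes += 1
                break  # the chain stops at the first success
    n = len(historical_requests)
    return {
        "total_cost": total_cost,
        "avg_latency": total_latency / n if n else 0.0,
        "success_rate": successes / n if n else 0.0,
    }


def replay_test(historical_requests, old_policy, new_policy):
    old = simulate(historical_requests, old_policy)
    new = simulate(historical_requests, new_policy)
    return {
        "cost_delta": new["total_cost"] - old["total_cost"],
        "latency_delta": new["avg_latency"] - old["avg_latency"],
        "success_rate_delta": new["success_rate"] - old["success_rate"],
    }


logs = [
    {"outcomes": {"a": {"ok": True,  "latency_ms": 2000, "cost": 0.03},
                  "b": {"ok": True,  "latency_ms": 500,  "cost": 0.01}}},
    {"outcomes": {"a": {"ok": False, "latency_ms": 100,  "cost": 0.0},
                  "b": {"ok": True,  "latency_ms": 700,  "cost": 0.01}}},
]
report = replay_test(logs, old_policy=["a", "b"], new_policy=["b", "a"])
print(report)
```

In this toy log, putting model "b" first lowers both cost and average latency without changing the success rate.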

Putting It All Together

Here's what a production-ready LLM routing layer looks like when you combine these patterns:

Request comes in
    │
    ├── Check optimization goal (speed/cost/balanced/reliability)
    │
    ├── Score available models using strategy weights
    │   (filtered by circuit breaker state)
    │
    ├── Route to top-scored model
    │     │
    │     ├── Success → update latency tracker, return response
    │     │
    │     └── Failure → record failure, update circuit breaker
    │           │
    │           └── Try next model in failover chain
    │                 │
    │                 ├── Success → return with route trace
    │                 └── All failed → return error with full trace
    │
    └── Log everything for replay testing

Each request produces a route trace showing exactly what happened. Your monitoring dashboard can show which models are healthy, which circuits are open, and where your latency is going.

The Practical Takeaway

If you're building on LLM APIs and you're not using these patterns, you're one provider outage away from a bad day. The good news is that none of this is complicated — these are well-understood patterns that have been battle-tested in distributed systems for over a decade.

Start with the circuit breaker. It's the highest-impact, lowest-effort change. Then add a two-model failover chain. Then add latency tracking. Each layer compounds the reliability.

I built LLMWise specifically to give developers this entire reliability stack out of the box — circuit breakers, failover chains, latency-optimized routing, and replay testing — without building it from scratch. But whether you use a managed solution or build your own, these patterns should be part of any production LLM integration.

The full technical documentation covering all the algorithms in detail is at llmwise.ai/llms-full.txt if you want to dig deeper.

What reliability patterns are you using in your LLM integrations? I'd love to hear what's working (or breaking) in your setup.