Deep Mehta
Mixture-of-Agents: Making LLMs Collaborate Instead of Compete

What if instead of picking the best model for your prompt, you made all models collaborate on the answer?

That's the core idea behind Mixture-of-Agents (MoA) — a technique from a 2024 research paper that showed LLMs produce better outputs when they can see and improve upon each other's responses. The paper demonstrated that even weaker models can boost the quality of stronger ones through this iterative refinement.

I implemented MoA as a production API endpoint. This post covers the architecture, the six strategies I built, the engineering decisions that weren't obvious, and the parts that surprised me.

The Problem With "Just Pick the Best Model"

Most developers approach multi-model setups with a simple question: which model is best for this task? But the answer changes depending on the prompt, the domain, the time of day, and honestly a bit of luck.

I noticed something while building a Compare mode that runs the same prompt through multiple models simultaneously. When I looked at the side-by-side outputs, the best answer was rarely from a single model. One model would nail the structure. Another would have a better code example. A third would catch an edge case the others missed.

The insight: the best response doesn't exist yet — it's a synthesis of what each model does well.

How MoA Works: The Two-Phase Architecture

Every MoA request follows the same skeleton:

Phase 1: Source Generation
  └── N models answer the prompt independently

Phase 2: Synthesis
  └── A synthesizer model combines the best parts

Phase 1 is embarrassingly parallel — all models run concurrently. Phase 2 is where the strategy matters.

import asyncio

async def blend(models, synthesizer, messages, strategy):
    # Phase 1: Get source responses (concurrent)
    tasks = [call_model(m, messages) for m in models]
    source_responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter failures
    successes = [r for r in source_responses if not isinstance(r, Exception)]

    if len(successes) == 0:
        raise AllSourcesFailedError()

    # Phase 2: Synthesize based on strategy
    return await synthesize(synthesizer, messages, successes, strategy)

This looks simple, but the synthesis step is where the engineering complexity lives.

Six Strategies, Six Different Behaviors

I didn't build just one synthesis approach. Different use cases need different synthesis behaviors.

Strategy 1: Consensus (Default)

The synthesizer gets all source responses and one instruction: combine the strongest points while resolving contradictions.

CONSENSUS_PROMPT = """You are a synthesis expert. You have received multiple 
responses to the same question from different AI models. 

Your job:
1. Identify the strongest points from each response
2. Resolve any contradictions by weighing the majority view
3. Produce one definitive answer that's better than any individual response

Do not mention that multiple models were consulted.
"""

This is the workhorse strategy. For most prompts, consensus produces noticeably better answers than any single model. The synthesizer naturally picks the best explanation from one model, the best code from another, and structures it coherently.
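As a rough sketch of how the synthesis call can be assembled (the message shape and the `build_consensus_messages` helper are illustrative assumptions, not the endpoint's actual internals):

```python
# Sketch: attach each source response as numbered reference material
# ahead of the user's original messages. CONSENSUS_PROMPT is the system
# prompt shown above (abbreviated here).
CONSENSUS_PROMPT = "You are a synthesis expert. Combine the strongest points..."

def build_consensus_messages(user_messages, source_responses):
    sources = "\n\n".join(
        f"--- Response {i + 1} ---\n{text}"
        for i, text in enumerate(source_responses)
    )
    system = f"{CONSENSUS_PROMPT}\n\nSource responses:\n{sources}"
    return [{"role": "system", "content": system}, *user_messages]
```

The synthesizer then answers the original question with every source response in context.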

Strategy 2: Council

Same input, but the synthesis output is structured differently:

{
  "final_answer": "The synthesized conclusion",
  "agreement_points": ["Where all models aligned"],
  "disagreement_points": ["Where they diverged + analysis"],
  "follow_up_questions": ["Areas needing exploration"]
}

Council mode is invaluable when you need transparency about model consensus. If you're using LLMs for research or decision support, knowing where models agree vs. disagree is often more useful than a single blended answer.
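Because the synthesizer occasionally ignores the JSON format, the output needs defensive parsing. A minimal sketch (the `parse_council_output` helper and its fallback behavior are assumptions; only the field names come from the structure above):

```python
import json

# Fields from the council output structure above.
COUNCIL_FIELDS = {
    "final_answer", "agreement_points",
    "disagreement_points", "follow_up_questions",
}

def parse_council_output(raw: str) -> dict:
    """Parse the synthesizer's JSON, falling back to a plain answer."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Synthesizer ignored the format: treat the whole text as the answer
        return {"final_answer": raw, "agreement_points": [],
                "disagreement_points": [], "follow_up_questions": []}
    # Fill any missing fields so downstream code can rely on the shape
    for field in COUNCIL_FIELDS - data.keys():
        data[field] = "" if field == "final_answer" else []
    return data
```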

Strategy 3: Best-Of

The synthesizer picks the single best response and enhances it with useful additions from the others. Minimal rewriting — focused on augmentation.

This is the fastest synthesis approach and works well when one model clearly dominates but the others have minor additions worth incorporating.

Strategy 4: Chain

The synthesizer works through each response sequentially, building a comprehensive answer by incrementally incorporating each model's contribution.

Step 1: Start with Model A's response as base
Step 2: Read Model B's response, integrate new points
Step 3: Read Model C's response, integrate new points
Step 4: Final coherence pass

Chain produces the most thorough output but tends to be longer. Use it when completeness matters more than conciseness.
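One possible implementation of those steps is a fold: repeated synthesizer calls that integrate each response into a running draft. This is a sketch, not the production code; `call_synthesizer` and the prompt wording are stand-ins:

```python
import asyncio

# Fold each source response into a running draft, then do a final
# coherence pass — mirroring Steps 1-4 above.
async def chain_synthesize(call_synthesizer, responses):
    draft = responses[0]                    # Step 1: first response as base
    for extra in responses[1:]:             # Steps 2..N: integrate new points
        draft = await call_synthesizer(
            f"Current draft:\n{draft}\n\nIntegrate any new points from:\n{extra}"
        )
    # Final step: coherence pass over the accumulated draft
    return await call_synthesizer(f"Polish this answer for coherence:\n{draft}")
```

With three sources, this costs three synthesizer calls instead of one, which is part of why Chain trades latency for thoroughness.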

Strategy 5: MoA (The Real Thing)

This is where it gets interesting. The previous strategies are all single-pass synthesis. True MoA adds refinement layers where models iterate on each other's work.

Here's how it works:

Layer 0: Each model answers independently
         GPT → Response A₀
         Claude → Response B₀  
         Gemini → Response C₀

Layer 1: Each model sees Layer 0's answers as "references"
         GPT sees [B₀, C₀] → produces A₁ (improved)
         Claude sees [A₀, C₀] → produces B₁ (improved)
         Gemini sees [A₀, B₀] → produces C₁ (improved)

Layer 2: Each model sees Layer 1's answers
         GPT sees [B₁, C₁] → produces A₂
         Claude sees [A₁, C₁] → produces B₂
         Gemini sees [A₁, B₁] → produces C₂

Final: Synthesizer combines Layer 2 outputs

Each layer's responses are injected as reference material via system message:

REFERENCE_INJECTION = """Below are responses from other AI assistants 
for the same question. Use them as references to improve your answer.
Identify what's strong, correct any errors, and expand where needed.

{references}

Now provide your improved response to the original question.
"""
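A sketch of how a refinement layer's messages might be assembled, following the diagram above where each model sees only the *other* models' previous answers (the message shape and the `build_layer_messages` helper are assumptions):

```python
# REFERENCE_INJECTION is the prompt shown above (abbreviated here).
REFERENCE_INJECTION = (
    "Below are responses from other AI assistants for the same question. "
    "Use them as references to improve your answer.\n\n{references}\n\n"
    "Now provide your improved response to the original question."
)

def build_layer_messages(messages, prev_by_model, model):
    """Build one model's messages for a refinement layer."""
    refs = "\n\n".join(
        f"[{name}]\n{answer}"
        for name, answer in prev_by_model.items()
        if name != model                    # exclude the model's own answer
    )
    system = REFERENCE_INJECTION.format(references=refs)
    return [{"role": "system", "content": system}, *messages]
```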

The Engineering Decisions That Mattered

Reference budget management. You can't just dump three 4,000-token responses into the context of every model at every layer. I set a total reference budget of 12,000 characters across all references, with a 3,200-character cap per individual answer. Anything longer gets truncated. This keeps costs sane while preserving the most useful content.

MAX_TOTAL_CHARS = 12_000
MAX_PER_ANSWER = 3_200

def prepare_references(responses):
    truncated = [r[:MAX_PER_ANSWER] for r in responses]

    total = sum(len(r) for r in truncated)
    if total > MAX_TOTAL_CHARS:
        # Proportionally reduce each
        ratio = MAX_TOTAL_CHARS / total
        truncated = [r[:int(len(r) * ratio)] for r in truncated]

    return truncated

Early stopping. If a layer produces zero successful responses (all models hit rate limits or errors), the system keeps the previous layer's successes and skips to synthesis. This prevents total failure when one bad layer would cascade.

async def run_moa_layers(models, messages, num_layers):
    prev_responses = None

    for layer in range(num_layers):
        layer_responses = await run_layer(
            models, messages, prev_responses
        )

        successes = [r for r in layer_responses if r is not None]

        if len(successes) == 0 and prev_responses:
            # Early stop: keep previous layer's results
            break

        if len(successes) > 0:
            prev_responses = successes

    return prev_responses

Layer count sweet spot. The paper tested up to 3 layers. In practice, I found that 1-2 layers give the best quality-to-cost ratio. Layer 0 to Layer 1 produces the biggest quality jump. Layer 1 to Layer 2 is marginal improvement for double the API calls. I default to layers: 1 and let users override.

Strategy 6: Self-MoA

What if you trust one model but want to hedge against its variance? Self-MoA generates multiple diverse candidates from a single model by varying the temperature and system prompt.

TEMPERATURE_OFFSETS = [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]

AGENT_PROMPTS = [
    "Focus on technical accuracy and precision.",
    "Prioritize practical examples and real-world applications.",
    "Emphasize clarity and make the explanation accessible.",
    "Be thorough and cover edge cases others might miss.",
    "Challenge assumptions and flag potential weaknesses.",
    "Focus on brevity and directness.",
]

For a request with temperature: 0.7 and 4 samples:

Candidate 1: temp 0.45, prompt "accuracy"     → conservative
Candidate 2: temp 0.70, prompt "practical"     → baseline
Candidate 3: temp 0.95, prompt "clarity"       → creative
Candidate 4: temp 1.15, prompt "edge cases"    → exploratory

The synthesizer then combines these four perspectives into one answer. It's surprisingly effective — you get diversity without paying for multiple model providers.
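The candidate grid above can be derived by cycling through the offset and prompt lists. A minimal sketch, assuming the offsets and personas are applied in order and the clamp range is [0.0, 2.0]:

```python
TEMPERATURE_OFFSETS = [-0.25, 0.0, +0.25, +0.45, +0.15, +0.35, -0.1, +0.3]
AGENT_PROMPTS = [
    "Focus on technical accuracy and precision.",
    "Prioritize practical examples and real-world applications.",
    "Emphasize clarity and make the explanation accessible.",
    "Be thorough and cover edge cases others might miss.",
]

def candidate_configs(base_temp, n_samples):
    """Pair each sample with a temperature offset and a persona prompt."""
    return [
        {
            "temperature": min(max(
                base_temp + TEMPERATURE_OFFSETS[i % len(TEMPERATURE_OFFSETS)],
                0.0), 2.0),
            "system": AGENT_PROMPTS[i % len(AGENT_PROMPTS)],
        }
        for i in range(n_samples)
    ]
```

For `candidate_configs(0.7, 4)` this reproduces the 0.45 / 0.70 / 0.95 / 1.15 spread shown above.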

What Surprised Me

Weaker models genuinely improve stronger ones. I was skeptical, but the data backs the paper's finding. When Gemini Flash (a fast, cheap model) is included alongside GPT and Claude in MoA, the final synthesized answer is often better than a 2-model blend of just GPT + Claude. The weaker model catches things the stronger ones miss or phrases things differently enough to trigger better synthesis.

The synthesizer model matters more than the source models. If I had to pick where to spend my budget, I'd put the best model as the synthesizer and use cheaper models as sources. The synthesis step is where quality is won or lost.

Consensus beats MoA for simple prompts. Full MoA with refinement layers is overkill for straightforward questions. The extra API calls and latency aren't worth it. I use MoA for high-value outputs — technical architecture decisions, long-form content, complex code generation — where the quality improvement justifies 3-4x the cost.

Streaming MoA is a UX challenge. In Compare mode, you can stream each model's response as it arrives. In MoA, the user sees nothing until Phase 2 starts. I solved this by streaming status events during Phase 1 so the user knows progress is happening:

{"event": "source", "model": "gpt-5.2", "status": "complete", "tokens": 847}
{"event": "source", "model": "claude-sonnet-4.5", "status": "complete", "tokens": 1203}
{"event": "source", "model": "gemini-3-flash", "status": "complete", "tokens": 692}
{"event": "synthesis", "status": "starting", "strategy": "consensus"}
{"event": "chunk", "content": "The key difference between..."}
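A rough sketch of that Phase 1 status stream as an async generator (the `stream_blend_events` helper, `call_model` stand-in, and the length-based token estimate are all assumptions; only the event names match the stream above):

```python
import asyncio
import json

async def stream_blend_events(models, messages, call_model):
    """Yield a JSON status event as each source model finishes."""
    async def run(model):
        return model, await call_model(model, messages)

    pending = [asyncio.ensure_future(run(m)) for m in models]
    sources = []
    for fut in asyncio.as_completed(pending):
        model, text = await fut
        sources.append(text)
        yield json.dumps({"event": "source", "model": model,
                          "status": "complete",
                          "tokens": len(text) // 4})  # rough token estimate
    # Phase 1 done: hand off to synthesis
    yield json.dumps({"event": "synthesis", "status": "starting",
                      "strategy": "consensus"})
```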

When to Use What

Here's my decision framework after running thousands of requests through each strategy:

Strategy        Best For                      Cost        Latency
Consensus       General-purpose blending      4 credits   Moderate
Council         Research, decision support    4 credits   Moderate
Best-Of         When one model usually wins   4 credits   Fast
Chain           Maximum thoroughness          4 credits   Moderate
MoA (1 layer)   High-value outputs            4 credits   Higher
Self-MoA        Single model, want diversity  4 credits   Moderate

All strategies cost the same from a billing perspective because the credit cost is fixed per Blend request. The real cost difference is in the underlying API calls — MoA with 2 refinement layers and 3 models makes 10 API calls (3 initial + 3 per refinement layer + 1 synthesis), while Consensus makes 4 (3 source + 1 synthesis).
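Under that accounting — every strategy runs one source call per model plus one synthesis call, and each MoA refinement layer re-runs all models — the arithmetic is easy to sanity-check:

```python
def api_call_count(n_models, refinement_layers=0):
    """Total underlying API calls for one Blend request."""
    source = n_models                       # Layer 0: independent answers
    refine = n_models * refinement_layers   # Layers 1..k: refinement passes
    synthesis = 1                           # Final combining call
    return source + refine + synthesis
```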

Try It Yourself

If you want to experiment with these strategies, the full API is at LLMWise. A Blend request looks like:

curl -X POST https://llmwise.ai/api/v1/blend \
  -H "Authorization: Bearer mm_sk_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["gpt-5.2", "claude-sonnet-4.5", "gemini-3-flash"],
    "synthesizer": "claude-sonnet-4.5",
    "strategy": "moa",
    "layers": 1,
    "messages": [
      {"role": "user", "content": "Design a rate limiter for a distributed system"}
    ],
    "stream": true
  }'

The complete technical documentation covering all six strategies, the scoring algorithms, and the reference injection system is at llmwise.ai/llms-full.txt.

The Bigger Picture

MoA represents a shift in how we think about LLMs. Instead of asking "which model is best?", we ask "how can models collaborate?" The answer turns out to be: surprisingly well, when you give them the right architecture.

The techniques here aren't theoretical. They're running in production, handling real requests, and consistently producing better outputs than any single model alone. The cost overhead is real, but for high-value use cases, the quality improvement is worth it.


If you're running multi-model setups in production, I'd love to hear your approach. Are you blending outputs or just routing to the best model? What's working?
