Mike

Fowler's GenAI Patterns Are Missing the Orchestration Layer — Here's What I Built

Rubber ducks orchestrating a multi-model debate

Last year, Martin Fowler's team published one of the best pattern catalogs for GenAI systems I've read. Nine patterns. Real production experience. Honest about what works and what doesn't. If you're building anything with LLMs and haven't read it yet — stop here and go read it. I'll wait.

But after applying these patterns in my own work, I kept running into a problem they don't address. There's a pattern-shaped hole right in the middle of the catalog.

Their patterns describe how to get better answers from one model. But what happens when one model isn't enough?

What Fowler got right

First, credit where it's due. The article maps a clear pipeline from proof-of-concept to production:

Direct Prompting gets you started. Embeddings and RAG ground the model in your actual data. Hybrid Retrieval and Query Rewriting improve what you retrieve. Rerankers filter out noise before the model sees it. Guardrails enforce safety. Evals measure quality. Fine-Tuning is the last resort when nothing else works.

The "Realistic RAG" pipeline they describe — input guardrails, query rewriting, parallel hybrid retrieval, reranking, generation, output guardrails — is genuinely useful. It's the kind of diagram you can hand to a team and say "build this."

I especially like their framing of an LLM as a "junior researcher — articulate, well-read in general, but not well-informed on the details of the topic." That's honest. Most LLM marketing pretends the junior researcher is a senior partner.

The authors also note they intend to revise and expand. Good. Because there's a pattern family they haven't written about yet.

The missing layer

Look at the pipeline again:

Input → Guardrails → RAG → Rerank → [One LLM] → Guardrails → Output

To be fair, this pipeline already uses multiple models — a reranker here, an LLM-based guardrail there. But they all serve a single generator. One model produces the answer. Every other model in the pipeline exists to feed it better input or catch its worst output.

What's missing are patterns for coordinating multiple generators — and treating their disagreement as a signal. In higher-stakes settings, teams are increasingly adding this layer.

Guardrails are optimized for safety, not correctness. Fowler's guardrails — LLM-based, embedding-based, rule-based — are designed to prevent harmful or off-topic output. They don't reliably catch "this architectural recommendation is subtly wrong because the model conflated Redis Streams with Redis Pub/Sub." For that, you need a second opinion.

The gap is coordination across models. Specifically:

  • Verification: when two models agree, confidence goes up. When they disagree, that disagreement is information.
  • Adversarial testing: one model generates, another attacks the weaknesses. Catches blind spots no single-model guardrail can.
  • Structured consensus: not just "ask twice and compare" — quantified voting with confidence scores across multiple models.

This isn't a future concern. Teams are already building multi-model systems. The pattern just hasn't been named.

Here's what the pipeline looks like when you add the missing layer:

Input → Guardrails → RAG → Rerank → [LLM A] ─┐
                                    [LLM B] ─┤→ Orchestrate → Output
                                    [LLM C] ─┘
                                       ↕
                                  Consensus / Debate / Judge

I'd like to name five patterns that belong in that orchestration layer. I've been building and using all of them. I'll number them 10–14 — not to be presumptuous, but because I genuinely think they extend Fowler's catalog.

Pattern 10: Parallel Comparison

Ask the same question to multiple models. Compare the outputs side by side.

When to use it: you need confidence that an answer isn't one model's hallucination.

I asked three models: "Does DynamoDB support transactions across multiple tables?" One confidently said no — "DynamoDB is a key-value store, transactions are single-table only." Another correctly explained that TransactWriteItems works across multiple tables with up to 100 items. The third hedged. Two out of three agreeing on cross-table support gave me confidence — and saved me from trusting a confidently wrong answer.

A caveat upfront: models share training data and can converge on the same mistake. Multi-model agreement isn't proof — it's a signal, strongest when models are diverse and you combine their output with deterministic checks (tests, schemas, retrieved sources). But even with that limitation, this is the simplest orchestration pattern and the one with the highest ROI. If you do nothing else, do this.
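The mechanics are simple enough to sketch. Below, `MODELS` maps names to hypothetical callables standing in for real API clients (the lambdas are canned stubs, not actual model output); the same fan-out-and-tally shape works with any client you swap in. In practice you'd normalize answers or cluster them by embedding before counting, since free-form text rarely matches exactly.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real model clients; replace with actual API calls.
MODELS = {
    "model-a": lambda q: "yes, via TransactWriteItems",
    "model-b": lambda q: "yes, via TransactWriteItems",
    "model-c": lambda q: "no, single-table only",
}

def compare(question: str) -> dict:
    # Fan the same question out to every model concurrently.
    with ThreadPoolExecutor() as pool:
        answers = dict(zip(MODELS, pool.map(lambda ask: ask(question), MODELS.values())))
    # Tally identical answers; the agreement ratio is the signal, not proof.
    majority, count = Counter(answers.values()).most_common(1)[0]
    return {"answers": answers, "majority": majority, "agreement": count / len(MODELS)}

result = compare("Does DynamoDB support cross-table transactions?")
# result["agreement"] of 1.0 means unanimous; anything lower means investigate.
```

A low agreement ratio is exactly the "disagreement is information" signal: it tells you where to spend your own verification time.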

Pattern 11: Consensus Voting

Models vote on options with reasoning and confidence scores. The result includes a consensus level.

When to use it: multi-option decisions where you want a quantified signal, not vibes.

I asked four models to vote on a caching strategy for a read-heavy API: Redis TTL, CDN edge cache, application-level memoization, or PostgreSQL materialized views. Redis won 3–1. But Gemini dissented — argued materialized views handle the read pattern better at our scale, and cut an entire infrastructure dependency.

The consensus came back as "majority, not unanimous." That dissent made me benchmark both. Gemini was right.

The value isn't the winner. It's the structured disagreement.
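A minimal sketch of the tallying logic, with the caching vote as stub data (the votes, confidences, and reasoning strings below are illustrative, not real model output). The point is the structured result: a confidence-weighted winner, a consensus level, and the dissents preserved rather than discarded.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    model: str
    option: str
    confidence: float   # self-reported, 0.0-1.0
    reasoning: str

def tally(votes: list) -> dict:
    # Confidence-weighted score per option.
    scores: dict = {}
    for v in votes:
        scores[v.option] = scores.get(v.option, 0.0) + v.confidence
    winner = max(scores, key=scores.get)
    backers = sum(1 for v in votes if v.option == winner)
    level = ("unanimous" if backers == len(votes)
             else "majority" if backers * 2 > len(votes)
             else "split")
    # Keep the dissents: they are the interesting part of the result.
    dissents = [v for v in votes if v.option != winner]
    return {"winner": winner, "consensus": level, "scores": scores, "dissents": dissents}

votes = [
    Vote("model-a", "redis-ttl", 0.8, "standard read-heavy answer"),
    Vote("model-b", "redis-ttl", 0.7, "simple invalidation story"),
    Vote("model-c", "redis-ttl", 0.6, "battle-tested"),
    Vote("model-d", "materialized-views", 0.9, "drops an infra dependency"),
]
result = tally(votes)
```

A "majority, not unanimous" result with a high-confidence dissent is the cue to benchmark before committing, which is exactly what happened above.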

Pattern 12: Adversarial Debate

Models argue opposing positions in structured rounds. A synthesizer draws conclusions.

When to use it: high-stakes decisions where you need failure modes surfaced.

Oxford-format debate: "Should we migrate from REST to GraphQL mid-project?" Three rounds. The pro side made a compelling case for query flexibility and reduced over-fetching. But in round 2, the con side raised a point none of us had considered: our existing monitoring dashboards and CDN caching rules all assume REST path-based routing. Migrating the API means migrating the entire observability stack.

The synthesis called it "a 6-month migration disguised as a weekend refactor." We stayed on REST.

Single-model advice would have said "it depends." The debate told us exactly what it depends on.
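The debate loop itself is small; the work is in the prompting. A sketch, with stub speakers where real ones would be LLMs prompted into fixed positions and a synthesizer model drawing conclusions from the transcript (all strings below are illustrative placeholders):

```python
def debate(question, pro, con, synthesize, rounds=3):
    # Oxford-style: sides alternate, each seeing the full transcript so far,
    # so round 2 can attack what round 1 actually said.
    transcript = []
    for rnd in range(1, rounds + 1):
        for side, speak in (("pro", pro), ("con", con)):
            transcript.append((rnd, side, speak(question, transcript)))
    return synthesize(question, transcript)

# Stub speakers; real ones are model calls with assigned positions.
pro = lambda q, t: "GraphQL cuts over-fetching"
con = lambda q, t: "observability stack assumes REST path routing"
synthesize = lambda q, t: f"{len(t)} turns; con raised unaddressed operational risk"

verdict = debate("Migrate from REST to GraphQL mid-project?", pro, con, synthesize)
```

Passing the transcript into each turn is what makes the format adversarial rather than two parallel monologues.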

Pattern 13: Iterative Refinement

Two models take turns improving an output — one generates, the other critiques, repeat.

When to use it: code generation, technical writing, anything where quality compounds with iteration.

One model wrote a sliding window rate limiter. The other critiqued it: "This leaks memory — you never clean up expired entries." Round 2: fixed, but the critic found an edge case with concurrent requests mutating the window simultaneously. Round 3: both converged on the same thread-safe implementation.

Three rounds, and I got code I'd trust in production. A single model would have given me the leaky version and called it done.
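The loop is a generate/critique ping-pong with a convergence condition: the critic either returns an issue or signals approval. A sketch with canned stubs (the two issues mirror the rate-limiter story above; real `generate` and `critique` would be prompted model calls):

```python
def refine(task, generate, critique, max_rounds=3):
    # Generator drafts; critic returns an issue string or None (approval).
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:          # critic satisfied: converged
            break
        draft = generate(task, feedback=feedback)
    return draft

# Stubs: the critic surfaces two issues, then approves.
pending = ["expired entries are never evicted", "window mutation is not thread-safe"]
generate = lambda task, feedback: f"limiter fixing: {feedback}" if feedback else "naive limiter"
critique = lambda task, draft: pending.pop(0) if pending else None

final = refine("sliding window rate limiter", generate, critique)
```

The `max_rounds` cap matters: without it, two models can trade cosmetic "improvements" forever, which is the failure mode to watch for.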

Pattern 14: Model-as-Judge

One model evaluates and ranks other models' outputs against explicit criteria.

When to use it: structured quality assessment beyond "which answer looks longer."

Three models implemented a circuit breaker pattern. I had a fourth judge them on correctness, error handling, and readability — playing the role of a senior backend engineer. The winner wasn't the longest answer. It was the only one that handled the half-open state correctly — the subtle part where the circuit tentatively allows a single request through to test if the downstream service has recovered.

The judge's per-criterion breakdown told me exactly why. Scores, reasoning, ranked. Not "they're all pretty good."

This is Fowler's Evals pattern applied to multi-model output — systematic quality assessment, but across competing implementations instead of just one.
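Structurally, the judge is one model call scored against an explicit weighted rubric. A sketch: the rubric weights are illustrative, the `judge` lambda is a stub rewarding half-open handling, and a real judge would be an LLM given the rubric and a persona in its prompt.

```python
# Hypothetical weighted rubric; the judge scores each criterion 0-10.
CRITERIA = {"correctness": 0.5, "error_handling": 0.3, "readability": 0.2}

def rank(candidates: dict, judge) -> list:
    results = []
    for name, code in candidates.items():
        scores = {c: judge(code, c) for c in CRITERIA}
        total = sum(scores[c] * w for c, w in CRITERIA.items())
        results.append({"name": name, "scores": scores, "total": round(total, 2)})
    # Ranked, with per-criterion breakdowns: the "why", not just the winner.
    return sorted(results, key=lambda r: r["total"], reverse=True)

# Stub judge; a real one is a model call with rubric + persona in the prompt.
judge = lambda code, criterion: 9 if "half_open" in code else 5
candidates = {
    "model-a": "class Breaker: ... half_open probe request ...",
    "model-b": "class Breaker: ... open/closed states only ...",
}
ranking = rank(candidates, judge)
```

Keeping the per-criterion scores in the output is what separates this from a bare "pick the best one" prompt: you can audit the judge, and spot the known biases (verbosity, position) when its reasoning doesn't match its scores.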

None of these patterns are novel academically — self-consistency, ensemble methods, LLM-as-judge, and iterative refinement all exist in research. The contribution isn't the idea; it's packaging them as reusable, named production patterns that teams can discuss and adopt. And to be clear: for simple Q&A or low-stakes tasks, a single model with good RAG is still the right call. These patterns earn their cost when the stakes justify a second opinion.

| Pattern | Goal | Cost | Failure mode |
| --- | --- | --- | --- |
| Parallel Comparison | sanity check | 2–4x calls | correlated hallucination |
| Consensus Voting | discrete decision | Nx calls | bad options / rubric |
| Adversarial Debate | surface risks | many round-trips | performative rhetoric |
| Iterative Refinement | quality convergence | 2x per round | infinite loop / local optimum |
| Model-as-Judge | structured ranking | +1 judge call | judge bias (verbosity, position) |

Why MCP makes this possible

All five patterns above are tools in MCP Rubber Duck — an open-source MCP server I built that implements multi-model orchestration.

If you haven't heard of it: Model Context Protocol is an open standard for connecting AI tools to external services. Every article calls it "USB-C for AI" and I'm not going to be the one to break the streak — one protocol, any tool, any host.

MCP is what makes orchestration composable rather than bespoke. These patterns show up as native tools inside Claude Desktop, Cursor, VS Code — wherever MCP is supported. You don't build custom integrations per model. You build one server, and every MCP-capable host gets access to multi-model consensus, debate, voting, iteration, and evaluation.

Guardrails (Fowler's guardrails pattern) run across the whole system — rate limiting, token budgets, pattern blocking, and PII redaction apply to every model, not just one. And through the MCP Bridge, ducks can call external tools — documentation servers, databases, APIs — with approval-gated security.

The protocol is the leverage. Without it, multi-model orchestration is a pile of bespoke API calls. With it, it's a composable layer any tool can use.

What's still missing

Five patterns isn't the end. There are more emerging that nobody has fully named yet:

  • Reasoning-time branching — the Tree-of-Thoughts paper (NeurIPS 2023) showed that exploring multiple reasoning paths beats linear chain-of-thought. But doing this across different models in parallel — where each branch is explored by a different LLM — is still research-grade, not production-ready.
  • Calibrated uncertainty — LLMs are notoriously overconfident. Knowing when a model doesn't know and escalating to a stronger model or human review is the holy grail of multi-model orchestration. Today's best proxy is sampling the same question multiple times and measuring divergence. A real confidence signal would change everything.
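The sampling-divergence proxy mentioned above is simple enough to sketch. Here `ask` is a hypothetical sampler (a canned stub below; in reality, the same model queried repeatedly at temperature > 0), and the agreement ratio is the crude confidence stand-in:

```python
from collections import Counter
from itertools import cycle

def sampled_confidence(ask, question, n=5):
    # Ask the same question n times; answer spread approximates uncertainty.
    answers = [ask(question) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n   # low ratio -> escalate to a stronger model or a human

# Stub sampler cycling through canned answers to simulate divergence.
canned = cycle(["A", "A", "B", "A", "C"])
top, confidence = sampled_confidence(lambda q: next(canned), "tricky question")
```

It's a blunt instrument, and expensive at n samples per question, which is why a real calibrated confidence signal would change everything.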

Fowler's 9 patterns gave us a shared language for single-model GenAI systems. The best pattern catalogs don't close conversations — they open them.

These next patterns are being written in production code right now. It's time to name them.


MCP Rubber Duck implements all five orchestration patterns as MCP tools. Open source, works with any OpenAI-compatible API plus CLI agents like Claude Code and Codex.
