Florent C

Why I make LLMs argue with each other before I make architecture decisions

The problem with asking one model

You ask Claude about your API design. It gives you a confident, well-structured answer. You move on. Two weeks later, during code review, someone spots the thing the model didn't mention — the thing you would have caught if you'd thought about it from a different angle.

This happens because LLMs are agreement machines. Ask one model a question and you get one perspective wrapped in confidence. The model won't naturally play devil's advocate against its own answer. It'll give you the best answer it can produce, not the best answer the problem deserves.

I started doing something simple: same prompt, same codebase context, two different models. And I noticed that the interesting part was never where they agreed — it was where they disagreed.

Structured disagreement as a design tool

The idea isn't new. Adversarial review exists in every serious engineering culture: red teams, architecture review boards, RFC processes. What's new is that you can now run a lightweight version of this with LLMs, grounded in your actual code, in minutes instead of days.

The setup I converged on uses two roles:

Critic — opens each round by pressure-testing the thesis. It looks for fragilities, unexamined assumptions, missing invariants. Its job is to break the argument, not improve it.

Builder — responds with implementation choices, sequencing, and safeguards. Its job is to defend or adapt the approach while staying concrete.

Both see the full transcript at every turn. This matters — it forces each model to actually address the other's points instead of talking past them.

After the rounds, a third model (or one of the two) produces a synthesis. Not a "both sides have good points" summary — a structured recommendation that incorporates the strongest objections.
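The loop itself is simple. Here's a minimal sketch of the alternating structure; `runDebate`, `AskModel`, and `Turn` are illustrative names, not the tool's actual API, and the provider call is abstracted behind a callback:

```typescript
type Role = "critic" | "builder";

interface Turn {
  role: Role;
  content: string;
}

// Stand-in for any provider call (OpenRouter, OpenAI, Anthropic):
// given a role and the full transcript so far, return that role's reply.
type AskModel = (role: Role, transcript: Turn[]) => string;

function runDebate(ask: AskModel, rounds: number): Turn[] {
  const transcript: Turn[] = [];
  for (let i = 0; i < rounds; i++) {
    // Critic opens each round; both roles always see the full transcript.
    transcript.push({ role: "critic", content: ask("critic", transcript) });
    transcript.push({ role: "builder", content: ask("builder", transcript) });
  }
  return transcript;
}
```

Passing the full transcript on every call is what prevents the talking-past-each-other failure mode: each model's prompt contains the exact claims it's expected to rebut.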

The hard part: grounding the debate in code

A debate between two models about abstract architecture is just two blog posts arguing. The value comes from grounding.

This is the technical problem I spent the most time on. When you point the tool at a codebase, it needs to:

  1. Build a file tree — from a local repo, uploaded files, or a GitHub repository (public or private). This means handling auth tokens, ignoring node_modules/.git/dist, and respecting size limits.

  2. Resolve excerpts — you can't dump an entire repo into a prompt. The tool scores and selects the most relevant excerpts from each selected file, given the debate topic and objective. The default limits are 3 excerpts per source, 18 excerpts max per pack, 2 MB per text file, 10 MB per PDF.

  3. Inject as an evidence pack — the selected excerpts are injected into both the debate prompts and the synthesis prompt, with [SRC-x] markers. This forces the models to cite specific files when making claims, instead of hand-waving about "your codebase."
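The cap logic in step 2 reduces to a greedy selection over pre-scored excerpts. This is a hypothetical sketch: `selectExcerpts` and the `Excerpt` shape are illustrative, and the relevance scoring itself (the hard part) is assumed to have already happened upstream. The default limits are the ones stated above: 3 excerpts per source, 18 per pack.

```typescript
interface Excerpt {
  sourceId: string; // maps to a [SRC-x] marker in the prompts
  score: number;    // relevance to the debate topic, computed upstream
  text: string;
}

// Greedy selection under the per-source and per-pack caps.
function selectExcerpts(
  candidates: Excerpt[],
  perSource = 3,
  perPack = 18
): Excerpt[] {
  const bySource = new Map<string, number>();
  const picked: Excerpt[] = [];
  // Highest-scoring excerpts first.
  for (const e of [...candidates].sort((a, b) => b.score - a.score)) {
    if (picked.length >= perPack) break;
    const used = bySource.get(e.sourceId) ?? 0;
    if (used >= perSource) continue; // this file already hit its cap
    bySource.set(e.sourceId, used + 1);
    picked.push(e);
  }
  return picked;
}
```

The per-source cap is what keeps one verbose file from crowding everything else out of the pack.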

The citation system is what makes the output actually useful. When the Critic says "your pagination contracts are inconsistent across 6 endpoints [SRC-2]", you can go straight to the file and verify.

What a debate actually looks like

Say you're debating whether to migrate a REST API to GraphQL. You select the relevant route files, the API client code, and the schema definitions.

The Critic opens by pointing out that migration won't fix existing inconsistencies — it just relocates them from endpoint logic to resolver logic. It flags the risk of nested query over-fetching before rate-limiting is in place.

The Builder responds: the inconsistencies are scoped to a handful of endpoints that can be normalized as a pre-migration step. Depth limiting and query cost analysis are standard tools. The migration unlocks typed schema sharing across client teams that are currently maintaining hand-written API wrappers.

The synthesis doesn't split the difference. It says: proceed with a scoped migration, normalize the problematic endpoints first, gate production access on query cost limits, and prioritize the two client teams that benefit most.

That's more useful than either model's answer alone. Not because the models are smarter together, but because the structure forced the second-order questions to surface.

When this works (and when it doesn't)

This works well for decisions that are:

  • Reversible but expensive — you're not sure, and being wrong costs weeks
  • Multi-dimensional — there are real tradeoffs, not a single correct answer
  • Groundable — there's actual code or documentation to anchor the discussion

It doesn't work well when:

  • The answer is obvious — no need for a debate if best practices clearly apply
  • The context is too large — an entire monorepo won't fit, and excerpt selection can miss things
  • You need domain-specific expertise — the models are limited by what they know about your specific business logic

Sometimes the debate is noise. Two models going back and forth without surfacing anything you didn't already know. That's fine — it takes a few minutes and costs a few cents. The signal-to-noise ratio improves a lot when you give it well-scoped files and a clear question.

Implementation choices

A few technical decisions worth mentioning:

Provider-agnostic routing. The tool supports OpenRouter (single key for any model), direct OpenAI/Anthropic keys, or a mix. Credential resolution happens server-side with a clear priority: UI-provided key > env variable > direct provider routing. This means you can use Claude as the Critic and GPT as the Builder, or any other combination.
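That priority order reduces to a short resolution function. A minimal sketch, assuming keys arrive as optional strings; `resolveApiKey` and the return shape are illustrative names, not the tool's actual implementation:

```typescript
interface ResolvedCredential {
  key: string;
  source: "ui" | "env" | "direct";
}

// Resolution priority: UI-provided key > env variable > direct provider key.
function resolveApiKey(
  uiKey: string | undefined,
  envKey: string | undefined,
  directProviderKey: string | undefined
): ResolvedCredential | null {
  if (uiKey) return { key: uiKey, source: "ui" };
  if (envKey) return { key: envKey, source: "env" };
  if (directProviderKey) return { key: directProviderKey, source: "direct" };
  return null; // no credential available for this provider
}
```

Keeping this server-side means keys never have to round-trip through the browser, which pairs with the no-persistence decision below.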

Strict round structure. Exactly 2 participants, strict alternating rounds, one synthesis pass. I tried freeform multi-turn and it degenerates quickly — models start being polite instead of precise. The constraint makes the output better.

Format-aware synthesis. The synthesis step has format presets: tech/architecture, decision/strategy, factual/practical, proof/validation, or auto-detect. This shapes the synthesis prompt to produce the right kind of output rather than a generic summary.

No persistence. No database, no localStorage for keys, no saved sessions. Each run is self-contained. The output is a Markdown or JSON export you take with you. This was deliberate — the tool is for decision-making, not conversation management.

The stack

Next.js App Router, TypeScript, React, Tailwind. Provider adapters for OpenRouter, OpenAI, and Anthropic. unpdf for PDF text extraction. Nothing exotic — the complexity is in the prompt engineering and the evidence pack resolution, not the framework.

What I'd like to figure out next

The biggest open question is evaluation. How do you measure whether a debate-produced synthesis is actually better than a single-model answer to the same question? I have intuitions from using it, but no systematic way to prove it.

The second question is excerpt selection. Right now it's scoring-based, but there's probably room for a retrieval step that's more context-aware.

If you've tried similar multi-model approaches — or if you think this whole idea is flawed — I'm interested in hearing why.

Repo: github.com/CommonLayer/model-debate
