I Made 4 LLMs Argue With Each Other to Write Better Runbooks. Here's What Happened.

#ai #devops #llm #sre

A single LLM writing a production runbook is like asking one engineer to design, review, and approve their own code. It works. Sometimes. But the failure mode is silent: confident-sounding instructions that miss edge cases, skip the rollback step, or hallucinate a flag that doesn't exist.

I spent the last few months building an alternative for RunDoc, the SaaS I run for generating runbooks and SOPs. I call it the AI Council: four LLMs each generate the runbook independently and cross-review each other's output, then a fifth model (the Chairman) synthesizes the final version.

This post is about why cross-review turned out to be the part that actually matters — more than the model choice, more than the prompting, more than the synthesis step. If you're building anything with multiple LLMs, the lesson here probably applies to you too.

The naive version: just ask 4 models and pick the best one

My first attempt was embarrassingly simple. Send the same prompt to GPT-4o, Gemini 2.5 Flash, Claude Sonnet, and Grok 3 Mini. Get four runbooks back. Have a Chairman model pick the best parts and stitch them together.
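A minimal sketch of that fan-out, not RunDoc's actual code. The `call_model` function and the model labels are placeholders I made up for illustration; swap in whatever client wrappers you already have.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical labels for illustration; not real API model identifiers.
MODELS = ["gpt-4o", "gemini-2.5-flash", "claude-sonnet", "grok-3-mini"]


def fan_out(prompt: str, call_model: Callable[[str, str], str]) -> Dict[str, str]:
    """Send the same prompt to every model in parallel and collect the drafts."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in MODELS}
        return {m: f.result() for m, f in futures.items()}


def fake_call(model: str, prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"[{model}] runbook for: {prompt}"


drafts = fan_out("Restart the payments service", fake_call)
print(len(drafts))  # one draft per model
```

The Chairman call would then receive all four drafts in a single prompt.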

The output was… fine. Slightly better than any single model, but not dramatically. The Chairman tended to favor whichever response was longest or most confidently worded, which is exactly the wrong heuristic for technical documentation.

The deeper problem: the four models were making the same kinds of mistakes. They all skipped rollback steps when the prompt didn't explicitly ask for one. They all invented plausible-but-wrong CLI flags. They all preferred narrative prose over checklist format unless told otherwise. Aggregating four similar failures doesn't fix them.

What changed everything: making them critique each other

The second iteration added a cross-review step. After each model generates its draft, every other model reviews it with a structured prompt that boils down to:

"Here's a runbook written by another AI. Find the errors. Specifically: missing prerequisites, hallucinated commands or flags, missing rollback steps, unsafe ordering, missing verification steps. Be specific. Cite line numbers."

This is where it got interesting.

Models are much better at finding errors than avoiding them. When Claude generated a runbook, GPT-4o caught hallucinated kubectl flags that Claude had written confidently. When GPT-4o wrote one, Gemini noticed when a rollback step assumed a backup that hadn't been verified to exist. They didn't catch the same errors in their own output, but they caught each other's reliably.

I think this is because critique mode forces a different kind of attention. When you're generating, you're optimizing for fluency and completeness. When you're reviewing, you're optimizing for finding what's wrong. These pull in opposite directions, and a single model trying to do both at once tends to be optimistic about its own output.
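The cross-review step described above can be sketched roughly like this. It assumes drafts are held in a dict keyed by model name (my assumption, not a RunDoc detail) and numbers the draft's lines so the reviewer has something to cite. Self-review pairs are excluded on purpose.

```python
from itertools import permutations
from typing import Dict, List

REVIEW_INSTRUCTIONS = (
    "Here's a runbook written by another AI. Find the errors. Specifically: "
    "missing prerequisites, hallucinated commands or flags, missing rollback "
    "steps, unsafe ordering, missing verification steps. Be specific. "
    "Cite line numbers."
)


def number_lines(draft: str) -> str:
    """Prefix each line with its number so reviewers can cite it."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(draft.splitlines(), start=1)
    )


def build_review_jobs(drafts: Dict[str, str]) -> List[Dict[str, str]]:
    """One job per ordered (reviewer, author) pair; no model reviews itself."""
    jobs = []
    for reviewer, author in permutations(drafts, 2):
        prompt = f"{REVIEW_INSTRUCTIONS}\n\n{number_lines(drafts[author])}"
        jobs.append({"reviewer": reviewer, "author": author, "prompt": prompt})
    return jobs


# Placeholder drafts keyed by hypothetical model names.
example_drafts = {name: f"step one\nstep two ({name})" for name in "abcd"}
jobs = build_review_jobs(example_drafts)
print(len(jobs))  # 4 models x 3 other reviewers each
```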

The Chairman's actual job

Once you have four drafts plus twelve cross-reviews (each model reviewed by the other three), the Chairman's role becomes clearer. It's not "pick the best draft" — it's "synthesize a draft that survives all the critiques."

The Chairman prompt now looks roughly like:

You have 4 candidate runbooks and 12 peer reviews.

For each step in the final runbook, you must:
1. Include only steps that appear in at least 2 candidates OR
   are uniquely justified by a critique
2. Apply every correction that has not been refuted by another reviewer
3. Default to the most conservative version when candidates disagree
   (e.g., add the verification step, include the rollback)
4. Flag any disagreement that you couldn't resolve
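Rule 1 is the easiest one to reason about outside the prompt. Here is a small sketch of it as plain code, assuming steps have already been normalized to comparable strings (in practice that matching is the hard part, and in my setup the Chairman model does it, not code like this). The function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def select_steps(
    candidates: Dict[str, List[str]], justified_by_critique: Set[str]
) -> List[str]:
    """Keep steps that appear in at least 2 candidates or that a critique justified.

    Steps are returned in order of first appearance across the candidates.
    """
    # Count each step at most once per candidate.
    counts = Counter(step for steps in candidates.values() for step in set(steps))
    selected: List[str] = []
    for steps in candidates.values():
        for step in steps:
            if step in selected:
                continue
            if counts[step] >= 2 or step in justified_by_critique:
                selected.append(step)
    return selected


candidates = {
    "a": ["check backup", "drain node", "restart service"],
    "b": ["drain node", "restart service", "verify health"],
    "c": ["drain node", "restart service", "send email"],
    "d": ["check backup", "restart service"],
}
final_steps = select_steps(candidates, justified_by_critique={"verify health"})
print(final_steps)
# ['check backup', 'drain node', 'restart service', 'verify health']
```

"send email" appears in only one candidate and no critique argued for it, so it is dropped. "verify health" also appears once but survives because a reviewer asked for it.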

That last point matters. The Chairman doesn't pretend to know everything — it surfaces disagreements as warnings in the final runbook. "Two models suggested --force, two recommended against it. Verify your cluster's policy before using."
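As a toy illustration of surfacing a split instead of hiding it: the sketch below assumes each model's position on an option has been reduced to a yes/no vote, which is a simplification I'm introducing here. Anything short of unanimous produces a warning.

```python
from typing import Dict, Optional


def resolve_or_flag(option: str, votes: Dict[str, bool]) -> Optional[str]:
    """Return None when the models agree, otherwise a warning for the runbook."""
    in_favor = sum(votes.values())
    against = len(votes) - in_favor
    if in_favor == 0 or against == 0:
        return None  # unanimous either way, nothing to flag
    return (
        f"WARNING: {in_favor} of {len(votes)} models suggested {option}, "
        f"{against} recommended against it. "
        "Verify your cluster's policy before using."
    )


# Hypothetical model names and votes.
split = {"model_a": True, "model_b": True, "model_c": False, "model_d": False}
print(resolve_or_flag("--force", split))
```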

What the output looks like in practice

I haven't run formal benchmarks yet — honest disclosure. What I see consistently when comparing AI Council output to single-model output is that the runbooks are denser. More verification steps. More explicit prerequisites. More rollback detail. Fewer assumptions about the reader's environment.

Whether "denser" equals "better" depends on what you're using runbooks for. For a senior SRE running a familiar procedure, the extra detail is noise. For a junior on-call engineer at 3am, it's the difference between fixing the incident and making it worse. The AI Council output is calibrated for the second case, which is the case that actually matters.

What didn't work

A few things I tried that I'd skip if I started over:

More models is not better. I tested with 6 and 8 models. The marginal gain past 4 was tiny and the latency and cost got rough. Four seems to be the sweet spot, probably because that's enough diversity to surface different failure modes before diminishing returns set in.
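Part of why it gets rough is simple arithmetic. If every model reviews every other model, the number of review calls grows quadratically with the number of models. A quick sketch of the call count, assuming full pairwise cross-review and one Chairman call:

```python
def council_calls(n_models: int) -> int:
    """Total API calls: one draft per model, full pairwise reviews, one synthesis."""
    drafts = n_models
    reviews = n_models * (n_models - 1)  # each model reviews every other model
    chairman = 1
    return drafts + reviews + chairman


for n in (4, 6, 8):
    print(n, council_calls(n))
# 4 17
# 6 37
# 8 65
```

Call count is not the same thing as cost, since token counts and per-model pricing vary, but it shows the shape of the curve.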

Self-review is mostly useless. Asking a model to review its own draft produces polite, surface-level corrections. The critique only works when it comes from a different model with different priors.

Forcing consensus is worse than surfacing disagreement. Early versions of the Chairman tried to force a single "correct" answer. The output was worse than when the Chairman simply admitted the models disagreed and let the user decide.

Latency is real. A single-model runbook generates in ~5 seconds. The full AI Council takes 30-45 seconds. For a SaaS, this means you need clear UI signaling so the user knows something is happening. I made the mistake of treating this as a backend problem when it was actually a UX problem.
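One way to treat it as a UX problem is to have the pipeline report which stage it is in, so the frontend has something to show. This is a sketch under assumptions: `call` and `on_progress` are hypothetical hooks, and the calls within each stage run concurrently with `asyncio.gather`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_council(
    models: List[str],
    call: Callable[[str, str], Awaitable[str]],
    on_progress: Callable[[str], None],
) -> str:
    """Run the three stages in order, reporting progress before each one."""
    on_progress("Generating drafts (stage 1 of 3)")
    drafts = await asyncio.gather(*(call(m, "draft") for m in models))

    on_progress("Cross-reviewing (stage 2 of 3)")
    reviews = await asyncio.gather(
        *(call(r, f"review:{a}") for r in models for a in models if r != a)
    )

    on_progress("Synthesizing final runbook (stage 3 of 3)")
    return await call("chairman", f"{len(drafts)} drafts, {len(reviews)} reviews")


async def fake_call(model: str, task: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"{model}:{task}"


events: List[str] = []
final = asyncio.run(run_council(["a", "b", "c", "d"], fake_call, events.append))
print(events)
```

In a real app, `on_progress` would push each message to the browser instead of appending to a list.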

The honest tradeoff

The AI Council costs about 5x more in API spend than a single-model approach. For a free tier, that's not viable. RunDoc only runs the Council on the Pro plan precisely because the economics force it.

Whether the quality gain justifies the cost depends entirely on what the runbook is for. Generating a docs page about how to restart a service? Overkill. Generating the runbook your on-call engineer is going to follow when production is on fire at 3am? Probably worth it.

If you're building something similar

The pattern generalizes beyond runbooks. Any time you need an LLM to produce output where errors are expensive and silent — legal drafting, medical summaries, financial analysis, security reviews — the cross-review pattern is worth trying. The key insight is that diversity of priors does the work, not the synthesis step. Pick models that were trained differently. Make them critique structurally, not just summarize.

If you want to try RunDoc, there's a free tier: 5 runbooks per month, no card needed. The Council itself is on the Pro plan, but the architecture is what this post is about, not the product.

Happy to answer questions in the comments. Especially curious to hear from anyone who's tried multi-model architectures for other domains — what worked, what didn't.