TL;DR
- I run two different AIs (Claude and Gemini) against each other, with myself as a human router carrying messages between them. No auto-orchestration framework.
- The setup is a complement to dev-centric AI automation tools, not a replacement. Right tool for the right job.
- Operations background suggested treating this as a two-layer design: internal diversity (prompts within one model) and external diversity (cross-vendor models). Both layers contribute, neither alone is enough.
- Five practices and five caveats below — drawn from one engineer's one-month operation, framed as a hypothesis, not a recipe.
- Each part of this series stands alone — Part 1 is the entity resolution case study, Part 2 is the AI collaboration patterns, this is the architecture-level view.
1. The Problem: When Multi-AI Becomes an Echo Chamber
Multi-agent AI setups have been everywhere for the last year. AutoGen, CrewAI, LangGraph, and others let you spin up several agents with role prompts (planner, reviewer, executor) and let them talk to each other. Useful, fast, automated.
There's a quiet failure mode, though. When all the agents share the same underlying model, "multi-agent" can drift into "multi-prompt-on-one-model." The agents end up reasoning from the same training distribution, hitting the same blind spots, agreeing too quickly. You get the appearance of diverse perspectives without actual diversity.
I noticed this on my own setup. I was using Claude Code for development and asking the same Claude to play "devil's advocate" against its own proposals. Most of the time the devil's advocate was thoughtful, but on hard questions it tended to go quiet — the model couldn't really argue against itself when the alternatives lived in the same prompt context.
That's where the cross-vendor experiment started.
2. An Operations View: Internal vs External Diversity
I come from an operations / systems engineering background. When I look at an AI workflow, I tend to ask the questions an SRE would ask: where are the single points of failure? What happens on Day 2? What does the audit trail look like?
Through that lens, multi-AI setups have two layers of diversity, and they're not interchangeable.
Internal diversity is what you get from prompts inside a single model. "You are now arguing the opposite case." "Critique this design as a security reviewer." "List three reasons to reject this proposal." The model switches voices but keeps the same underlying reasoning substrate. Useful, cheap, fast.
External diversity is what you get from a different vendor's model. Claude and Gemini don't share weights. Their training data overlaps but isn't identical. Their bias profiles differ. When Claude proposes a design and Gemini critiques it, the critique comes from a different statistical posture, not a different mood from the same speaker.
These are not equivalent. They're substantially comparable for the surface task (producing diverse views), but they differ on operational properties:
| Property | Internal diversity | External diversity |
|---|---|---|
| Audit independence | Single context log | Independent logs per model |
| Vendor lock-in | High (one vendor's model) | Low (cross-vendor) |
| Failure isolation | Context corruption affects all roles | Independent failure domains |
| Real parallelism | Sequential within one context | True parallel calls possible |
So the design I ended up with is two-layer: internal diversity inside each model, plus external diversity across vendors. The two layers compound. Neither alone reproduces what both together produce.
The transport between models, in my setup, is email. Claude proposes; I copy the proposal into a script that emails it to a Gemini-readable inbox; Gemini reads, critiques, and responds; I paste the response back to Claude. Slow on purpose. We'll come back to that.
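For the curious, the transport script is nothing exotic. Here's a minimal sketch in Python; everything specific (the addresses, the local relay, carrying the header in the subject line) is a placeholder assumption, not my actual configuration:

```python
import smtplib
from email.message import EmailMessage

# Placeholder addresses -- the real inbox names aren't part of this article.
SENDER = "claude-proposals@example.com"
GEMINI_INBOX = "gemini-review@example.com"

def send_proposal(topic: str, seq: int, re_seq: int, body: str) -> None:
    """Wrap a proposal in the MAAR header and hand it to an SMTP relay."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = GEMINI_INBOX
    # Carrying the header in the subject is one choice; the body works too.
    msg["Subject"] = f"[MAAR-Session: {topic} | Seq: {seq} | Re: {re_seq}]"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as relay:  # assumes a local relay
        relay.send_message(msg)
```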
Part 1: Five Practices
Practice 1 — Force adversarial framing into prompts explicitly
When you ask a model "what do you think of this?" the default answer is usually a polite agreement. That's not what cross-review is for.
In every cross-vendor exchange, I include explicit phrasing that requires the responder to argue against the proposal. The wording I actually use, more or less verbatim:
- Identify the weakest link in this design.
- Give three reasons to reject this proposal.
- Where would this hypothesis fail?
- If you find yourself agreeing, name the assumption you are least sure about.
Without that prompt-level push, the second AI tends to mirror the first, which is the failure mode the whole setup is supposed to prevent.
This sounds obvious until you watch a model answer a soft prompt and realize how much critical signal the polite default filters out.
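If you template your cross-review prompts, the framing can be enforced mechanically instead of remembered. A minimal sketch; the function name and wrapper text are my own, only the four questions are verbatim:

```python
ADVERSARIAL_FRAMING = [
    "Identify the weakest link in this design.",
    "Give three reasons to reject this proposal.",
    "Where would this hypothesis fail?",
    "If you find yourself agreeing, name the assumption you are least sure about.",
]

def cross_review_prompt(proposal: str) -> str:
    """Every cross-review request ships with mandatory pushback questions."""
    questions = "\n".join(f"- {q}" for q in ADVERSARIAL_FRAMING)
    return (
        "Review the proposal below. You must answer all of the following:\n"
        f"{questions}\n\n---\n{proposal}"
    )
```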
Practice 2 — Adopt an audit-first protocol (Seq / Re style) with hard-stop on gap
Every message between Claude and Gemini in my system carries a header: [MAAR-Session: <topic> | Seq: N | Re: M]. Sequence number from the sender, reply-to number for the message being answered. Borrowed straight from TCP and email threading.
The point is not the format. The point is that any message is self-documenting about which conversation thread it belongs to, which message it answers, and what came before. When something goes wrong — a misrouted reply, a missing message, an out-of-order paste — the gap is visible.
When the gap shows up, the protocol response is hard-stop: pause the conversation, request retransmission, do not paper over the missing piece. That's the operations-side instinct showing up — a missing log line stops the change window, period. "Best-effort continue" is the failure mode that produces silent data loss in production; it's no better in AI workflows.
In a fully automated multi-agent system, this kind of bookkeeping is implicit in the framework. In a human-routed system, you need it explicitly. The cost is small (a few characters in each header). The benefit is the audit trail you actually get, plus the explicit stop signal when reality drifts from the protocol.
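If you want the gap check to be mechanical rather than eyeballed, it's a few lines. The header layout below matches what I actually use; the parser and the exception are illustrative scaffolding, not part of the real system:

```python
import re

HEADER = re.compile(
    r"\[MAAR-Session: (?P<topic>[^|]+?) \| Seq: (?P<seq>\d+) \| Re: (?P<re>\d+)\]"
)

class ProtocolGap(Exception):
    """Hard-stop signal: pause the conversation and request retransmission."""

def check_header(line: str, last_seq: int) -> int:
    """Validate one incoming message header against the last seen Seq."""
    m = HEADER.match(line)
    if m is None:
        raise ProtocolGap(f"unparseable header: {line!r}")
    seq = int(m.group("seq"))
    if seq != last_seq + 1:
        # No best-effort continue -- that's how silent data loss happens.
        raise ProtocolGap(f"expected Seq {last_seq + 1}, got Seq {seq}")
    return seq
```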
Practice 3 — Use vendor-neutral persona prompts
The prompts I send to Claude and Gemini for the same role (e.g., "critique this design") are kept as close to identical as possible. Same wording, same structure, same evaluation criteria. Differences in the responses come from differences in the models, not from differences in how I asked.
This matters because it lets me actually compare the two outputs. If Claude pushes back hard and Gemini agrees, I know that's a real signal about the proposal — not an artifact of having asked Gemini in a softer way.
The temptation is to tune each prompt to the strengths of the model. Resist it. Vendor-neutral prompts give you a comparable signal across vendors, which is the whole point.
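The cheapest way I know to enforce this is to make vendor-specific wording impossible by construction: one template, with the vendor appearing only in routing metadata, never in the rendered text. A sketch, with names invented for illustration:

```python
CRITIQUE_TEMPLATE = (
    "Critique this design. Evaluate failure modes, hidden assumptions, "
    "and operational cost.\n\n{design}"
)

def render_critique(design: str) -> str:
    """Render the review prompt. Deliberately: no vendor parameter exists."""
    return CRITIQUE_TEMPLATE.format(design=design)

# Vendor choice lives only in routing; both models get byte-identical text.
design_doc = "(design document goes here)"
prompts = {vendor: render_critique(design_doc) for vendor in ("claude", "gemini")}
assert prompts["claude"] == prompts["gemini"]
```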
Practice 4 — Switch between polite mode and adversarial mode
Most of my cross-reviews run in what I call "polite mode": measured language, restrained framing, "please consider" phrasing. That's the right default for normal review.
But sometimes the second AI agrees with everything I say. That's when I switch to "adversarial mode" deliberately: explicit framing that this is a thought experiment, instructions to drop politeness, demand for a forced disagreement, even (carefully) raising questions about whether the model has biases that explain its agreement.
The mode switch is intentional, time-boxed, and announced inside the prompt. It's not the default — overdoing it produces performative dissent (more on that in the caveats). But used sparingly, it's the mechanism that breaks the over-agreement spiral when it shows up.
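The announcement can live in the prompt scaffolding itself, so the mode is never switched silently. A sketch; the exact wording is mine and varies per exchange in practice:

```python
from enum import Enum

class Mode(Enum):
    POLITE = (
        "Please consider the proposal below and share any concerns "
        "in measured terms.\n\n"
    )
    ADVERSARIAL = (
        "Thought experiment, adversarial mode, this exchange only: drop the "
        "politeness. Find a substantive disagreement with the proposal below, "
        "and state whether your earlier agreement could reflect a bias of "
        "your own.\n\n"
    )

def framed(proposal: str, mode: Mode = Mode.POLITE) -> str:
    """The mode is announced inside the prompt itself, never implied."""
    return mode.value + proposal
```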
Practice 5 — Hold to hypothesis discipline (sample-size humility)
This whole setup is N=1. One engineer, one month, one workflow. That doesn't make it wrong — but it doesn't make it a recipe either.
In every external description of this setup, including the article you're reading, I try to keep the framing as "individual observation, presented as a hypothesis." I avoid words like "paradigm," "grand theory," and "the right way." The discipline is partly about honesty (these claims aren't tested at scale yet) and partly about staying open to evidence that breaks the model.
If the hypothesis is right, it'll show its strength against contrary cases over time. If it's wrong, the framing makes it easier to walk it back without losing face.
Part 2: Five Caveats
Caveat 1 — Sample size of one
This is one engineer's one-month experiment. Patterns described as "practices" here might fail to generalize, might depend on my specific workflow or domain, might not survive contact with a different setup. Read the practices as hypotheses, not recommendations.
Caveat 2 — Right tool for the right job
Auto-orchestration frameworks (AutoGen, CrewAI, LangGraph) exist for good reasons. They're faster, cheaper, and better-suited to many use cases. The human-routed setup described here is a complement, not a replacement. If your task fits an autonomous loop, use one. The two-layer design is most useful where audit independence and cross-vendor signal matter more than throughput.
Caveat 3 — Internal and external diversity are not fully equivalent
Internal diversity (prompt-based persona switching) covers a substantial portion of what external diversity provides — but not all of it. Audit independence, vendor lock-in resistance, failure isolation, and real parallelism are properties that internal diversity simply cannot match. Claiming "we got the same effect with one model" is a category error.
Caveat 4 — Performative dissent risk
If you push the second model into adversarial mode too often, it learns to manufacture disagreement. You get pushback that's syntactically critical but substantively empty. The mode switch only works because it's the exception, not the default. Used as a routine technique, it produces noise instead of signal.
Caveat 5 — Maintenance overhead is real
Audit trails, mode switching, header conventions, hypothesis framing — this is more discipline than a casual workflow. The overhead is justified for high-stakes decisions and design reviews, less so for everyday tasks. If you adopt the practices, calibrate which ones are worth the cost in your context.
Wrap-Up
Three things I've taken from running this for a month:
- Internal and external diversity compose. Either alone is a partial defense against single-model bias. Together they cover more of the failure surface than the sum of the parts would suggest.
- The transport doesn't have to be fancy. Email and copy-paste are slow on purpose; if that reads as primitive, that's the design. The friction isn't an inability to automate the transport but a deliberate cognitive checkpoint between the two models. Slowness is the feature: it gives me time to actually read the response before forwarding it, and it forces the audit trail to be something I look at, not something the framework hides from me.
- Discipline beats automation for governance work. Auto-orchestration is faster. Auto-orchestration is also harder to audit, harder to debug, and harder to explain to a compliance reviewer. For governance-shaped tasks, the slow path wins on the metrics that matter.
The whole construction is a hypothesis. I'd trade it for a tool that does the same job better tomorrow.
What's Next
Part 1 of this series — the entity resolution case study — is the concrete build that prompted this whole reflection.
Part 2 — the AI collaboration patterns from an operations lens — sits alongside this one and covers the session-time discipline.
A future part will cover the protocol design itself: the Seq/Re headers, the TTL-based session control, the gap detection and hard-stop rules. That's where the human-routed design earns its keep.
Comments welcome — particularly:
- Cross-vendor multi-AI patterns you've tried, and what surprised you.
- Cases where internal diversity was enough (counter-evidence to the two-layer claim).
- Audit and governance experiences with auto-orchestration frameworks (where they shone, where they didn't).