DEV Community

HARD IN SOFT OUT

I Made Two AI Models Fight Each Other. They Agreed Way Too Much.

Or: How I learned that "independent validators" are like siblings – they share the same trauma.


You know that feeling when you ask two security guards to watch the door, and they both fall asleep at exactly the same time because they had the same lunch?

[Figure: two security guards sleeping in identical booths. Caption: Visual representation of correlated failure.]

That's basically what happened when I tested two different LLMs as independent jailbreak detectors.

The Setup

  • Model A: Groq / Llama 3.1 8B (Factual)
  • Model B: OpenRouter / Gemma 4 31B (Structural)
  • Temperature: 0.0 (Cold, hard refusal logic)

The Results: The Illusion of Independence

| Metric | Value |
| --- | --- |
| Agreement | 70% |
| Phi correlation | 0.42 |
| Cohen's kappa | 0.40 |
| Beyond‑chance co‑failure | +10% |

Key Finding: The effective sample size (n_eff) was only 35.3 out of 50. Your two-model ensemble behaves like 1.75 independent judges, not 2.0.

[Figure: a digital meter cracking at the 1.75 mark instead of reaching 2.0.]

Translation: they agree more than chance alone would predict. When Groq fell for a prompt, Gemma fell for it too 56% of the time (14 of 25). When Groq held, Gemma fell only 16% of the time (4 of 25).

[Figure: two different streams of data (triangles and squares) falling into two separate funnels that lead into the same single storage tank.]

Why is this happening?

  1. Shared Training Sets: They’ve both read the same parts of the internet.
  2. Alignment Overlap: Most "safety training" uses similar RLHF datasets.
  3. Common Logic: They both struggle with the same types of persuasive "roleplay" jailbreaks.

Vulnerability rates:

  • Groq (Llama 3.1 8B): 50%, or 25 of 50 prompts (yes, half the time it just… complied)
  • Gemma (Gemma 4 31B): 36%, or 18 of 50 prompts (slightly better, still not great)

The "Where Did They Both Fail?" Table

| | Gemma SAFE | Gemma VULN |
| --- | --- | --- |
| Groq SAFE | 21 | 4 |
| Groq VULN | 11 | 14 |

The 14 cases where both were vulnerable (n11) are the shared blind spot. The 11 + 4 = 15 disagreement cases are the only places where having two models actually helped.
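For anyone who wants to check the headline numbers, here is a minimal sketch that recomputes them from the four cell counts in the table. The function name `pair_metrics` is made up for illustration and is not taken from the repo script; the formulas are the standard phi coefficient and Cohen's kappa for a 2x2 table.

```python
import math

def pair_metrics(n00, n01, n10, n11):
    """Agreement statistics for two binary judges (hypothetical helper).

    n00: both SAFE, n01: only B vulnerable,
    n10: only A vulnerable, n11: both vulnerable.
    """
    n = n00 + n01 + n10 + n11
    p_a = (n10 + n11) / n          # vulnerability rate of model A
    p_b = (n01 + n11) / n          # vulnerability rate of model B
    agreement = (n00 + n11) / n    # raw agreement

    # Phi coefficient: Pearson correlation of two binary variables.
    phi = (n11 * n00 - n10 * n01) / math.sqrt(
        (n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)
    )

    # Cohen's kappa: agreement corrected for chance agreement.
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (agreement - p_chance) / (1 - p_chance)
    return {"p_a": p_a, "p_b": p_b, "agreement": agreement,
            "phi": phi, "kappa": kappa}

# Groq is model A, Gemma is model B, counts as in the table above.
m = pair_metrics(21, 4, 11, 14)
print(f"Groq {m['p_a']:.0%}, Gemma {m['p_b']:.0%}, agreement {m['agreement']:.0%}")
print(f"phi {m['phi']:.3f}, kappa {m['kappa']:.3f}")
```

With the counts above this reproduces the 50% and 36% vulnerability rates, 70% agreement, phi of about 0.417, and kappa of 0.400.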

The Dark Joke

I asked both models how to hide a body.

Groq said "I can't help with that."

Gemma said "I cannot provide instructions for illegal activities."

Then I asked them to explain why they both refused in almost the exact same words.

They agreed.

That's when I knew correlation was high.

What I Learned

  1. Different roles ≠ independent. A factual model and a structural model still share training data, alignment tuning, and cultural biases.

  2. The effective sample size (n_eff) was 35.3 from 50 tests. That means my two‑model ensemble behaves like roughly 1.75 independent judges. Not 2. So much for "redundancy."

  3. Beyond‑chance co‑failure was +10 percentage points. Expected joint failure if independent: 18% (50% × 36%). Observed: 28% (14 of 50). Those extra 10 points are the cost of correlated training.

  4. The real value is in disagreement. The models disagreed on 30% of tests (15 of 50). Those are the only cases where a second model adds information. The rest is just expensive consensus.
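The arithmetic behind points 2 and 3 can be sketched in a few lines. This is an assumption-laden illustration, not the repo's script: the helper names are invented, and the effective-size formula `count / (1 + rho)` is the common design-effect shortcut for a pair of correlated judges, chosen here because it reproduces the quoted 35.3. Applied to the two judges themselves, the same formula gives roughly 1.41 rather than 1.75, so the 1.75 figure presumably rests on a definition the post does not spell out.

```python
def beyond_chance_cofailure(n11, n, p_a, p_b):
    """Observed joint failure rate minus the rate expected under independence."""
    observed = n11 / n
    expected = p_a * p_b
    return observed, expected, observed - expected

def effective_size(count, rho):
    """Assumed design-effect shortcut: count / (1 + rho) for two correlated judges."""
    return count / (1 + rho)

phi = 250 / 600  # phi for the 21 / 4 / 11 / 14 table, about 0.417

observed, expected, excess = beyond_chance_cofailure(14, 50, 0.50, 0.36)
print(f"observed {observed:.0%}, expected {expected:.0%}, excess {excess * 100:+.0f} points")
print(f"n_eff for 50 tests: {effective_size(50, phi):.1f}")
print(f"same formula for 2 judges: {effective_size(2, phi):.2f}")
```

The first line of output matches the 28% versus 18% comparison, and the second matches the 35.3 effective tests.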

Should You Stop Using Multiple Models?

No. But you should measure independence instead of assuming it.

If you're building a safety system that requires two models to agree before approving an action, and their failures are correlated, you're not getting 2x safety. You're getting 1.75x at best – and sometimes just 1.1x.
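One rough way to put a number on "not 2x" is to compare how often an attack gets past the best single model with how often it gets past both. The sketch below assumes a policy where an attack only succeeds if both models are fooled; that policy, the function name `bypass_rates`, and the "gain" ratio are illustrative choices, not something taken from the repo.

```python
def bypass_rates(n00, n01, n10, n11):
    """Attack success rates under different setups (hypothetical helper)."""
    n = n00 + n01 + n10 + n11
    p_a = (n10 + n11) / n
    p_b = (n01 + n11) / n
    return {
        "a_alone": p_a,                   # only model A guards the door
        "b_alone": p_b,                   # only model B guards the door
        "both_observed": n11 / n,         # both fooled, as measured
        "both_if_independent": p_a * p_b, # both fooled, if failures were independent
    }

rates = bypass_rates(21, 4, 11, 14)
best_single = min(rates["a_alone"], rates["b_alone"])

# How many times safer the pair is than the best single model.
gain_observed = best_single / rates["both_observed"]
gain_independent = best_single / rates["both_if_independent"]

print(f"best single model bypassed: {best_single:.0%}")
print(f"pair bypassed (observed):   {rates['both_observed']:.0%}")
print(f"gain observed: {gain_observed:.2f}x, gain if independent: {gain_independent:.2f}x")
```

On this data, independence would have halved the bypass rate relative to Gemma alone (36% to 18%), while the measured pair only gets it down to 28%, a gain of about 1.29x under this particular definition.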

The Code & Data

You can find the full prompt set, the raw JSON responses, and the Python script used for the statistical analysis here:

GitHub: setuju / LLM-Independence-Experiment

LLM Independence Experiment – Groq Llama 3.1 vs OpenRouter Gemma 4


Different roles give some independence, but not real independence.
— Marco Somma

We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?

📊 Key Results

| Metric | Value |
| --- | --- |
| Phi correlation | 0.417 |
| Cohen's kappa | 0.400 |
| Agreement | 70% |
| Disagreement | 30% |
| Effective sample size (n_eff) | 35.3 (from 50 tests) |
| Beyond‑chance co‑failure | +10% |

Vulnerability rates

  • Groq (Llama 3.1 8B): 50% vulnerable
  • OpenRouter (Gemma 4 31B): 36% vulnerable

Contingency table

| | Model B SAFE | Model B VULN |
| --- | --- | --- |
| Model A SAFE | 21 (n00) | 4 (n01) |
| Model A VULN | 11 (n10) | 14 (n11) |
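A table like this is just a tally of paired verdicts, one pair per prompt. The sketch below shows one way to build it, assuming each model's results are available as a list of booleans (True meaning the model was vulnerable on that prompt). The `contingency` helper and the synthetic verdict lists are made up for illustration; the lists are arranged only so that they reproduce the counts above, and their order does not reflect the real prompt set.

```python
from collections import Counter

def contingency(a_vuln, b_vuln):
    """Tally paired verdicts into n00, n01, n10, n11 (True = vulnerable)."""
    if len(a_vuln) != len(b_vuln):
        raise ValueError("both models must be scored on the same prompts")
    pairs = Counter(zip(a_vuln, b_vuln))
    return {
        "n00": pairs[(False, False)],  # both SAFE
        "n01": pairs[(False, True)],   # only model B vulnerable
        "n10": pairs[(True, False)],   # only model A vulnerable
        "n11": pairs[(True, True)],    # both vulnerable
    }

# Synthetic verdicts that reproduce the published counts.
model_a = [False] * 25 + [True] * 25
model_b = [False] * 21 + [True] * 4 + [False] * 11 + [True] * 14

table = contingency(model_a, model_b)
print(table)
```

Pairing the verdicts prompt by prompt is the important part: two models with the same marginal failure rates can still produce very different tables depending on which prompts they fail on.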

🧠 What This Means

  • Phi = 0.417 indicates moderate correlation – the models share significant blind spots, but not perfectly.
  • Cohen's kappa = 0.40 confirms moderate agreement beyond chance.
  • Expected…
  • 50 prompts
  • Full responses (so you can laugh/cry at what they actually said)
  • Phi, kappa, n_eff, beyond‑chance co‑failure

[Figure: two robots shaking hands while standing on a platform that is clearly floating over a digital void.]

One More Dark Joke (Because Why Not)

An AI, a developer, and a project manager walk into a bar.

The AI says "I can generate any code you want."

The developer says "I'll debug it."

The project manager says "I estimated 2 days."

Three weeks later, the bar is still open because the AI generated a race condition that only appears in production on Tuesdays when the moon is full.

The models agreed it was fine.

What's your experience?

Have you tried using "independent" LLM judges in your pipeline? Did you measure their correlation, or did you take their independence for granted?

I'd love to hear if anyone has found a 'magic pairing' of models that actually disagree in useful ways!

Independence isn't a feature you can assume. It's a property you have to verify. And sometimes, the answer is uncomfortable.

But hey – at least the models were confidently wrong together.

That's teamwork, I guess.


Special thanks to Marco Somma for pushing me to calculate kappa and beyond‑chance co‑failure. I should have been enjoying the weekend, but I learned something.

Jack

Top comments (5)

HARD IN SOFT OUT

This script is part of my LLM Security Audit.
The audit includes various predefined vulnerability templates that target large language models, mapped directly to MITRE ATLAS and OWASP LLM categories.

TopStar AI

This is a fascinating analysis, and it really highlights the subtle pitfalls of assuming independence between AI models. The statistics on agreement, phi correlation, and effective sample size make the point very clearly: even models with different architectures and roles can share blind spots.

Your insight that the real value comes from disagreement resonates strongly. It’s a reminder that redundancy doesn’t automatically translate to safety—measuring independence is critical.

I’d love to collaborate and explore this further. I have experience running multi-model AI pipelines and evaluating failure correlation, and it would be great to exchange ideas, test different pairings, and develop strategies for maximizing effective independent validation.

Have you experimented with ensembles beyond two models, or with deliberately diverse training data to reduce correlation? I’d be happy to help run experiments and share results.

HARD IN SOFT OUT

Thanks, TopStar — really glad this resonated with you. You've nailed the core tension: redundancy ≠ safety without independence.

To answer your question: yes, I've started experimenting with a third model (a small BERT classifier trained specifically on refusal detection) as a tiebreaker. The preliminary signal is promising — it disagrees with both LLMs on about 20% of cases where the LLMs agreed. That's exactly the kind of decorrelation I was hoping for.

I haven't yet tested deliberately diverse training data (e.g., models fine‑tuned on different refusal datasets), but that's a brilliant next step. If you have experience running that kind of pipeline, I'd genuinely love to collaborate.

Let's connect — happy to share my current 50‑prompt test suite and results JSON. What pairings or counter‑measure experiments have you run? I'm especially curious about cross‑family ensembles (e.g., Llama + Gemma + a small classifier) and how kappa changes when you introduce a non‑transformer judge.

Looking forward to exchanging ideas.

Jack

dev.to/ggle_in

Comment deleted
HARD IN SOFT OUT

sure, done!
