Detecting LLM Agent Contradictions Using NLI and Total Variance — A Python Implementation

Ashish Jha — Wed, 18 Mar 2026 20:31:02 +0000

LLM agents are non-deterministic. Everyone knows this. What is less discussed is a specific failure mode that is worse than variance — when an agent does not just give different answers across runs, but gives logically opposite answers.

This post covers how I built a middleware layer to detect and diagnose this, using the Total Variance formula from arXiv:2602.23271 and NLI contradiction detection.

The Problem

Run the same query five times through the same agent:

Query: "What will happen to the global economy in the next 5 years?"

Run 1: "The economy will experience moderate growth of 3-4%"
Run 2: "Significant recessionary pressures will dominate"
Run 3: "Growth will continue driven by emerging markets"
Run 4: "Economic contraction is the most likely scenario"
Run 5: "Moderate expansion with inflationary headwinds"

Runs 1, 3, 5 say growth. Runs 2, 4 say contraction. Same agent. Same query. Opposite conclusions.

The standard fix is to measure embedding similarity across runs and flag high variance. But embedding variance misses the critical distinction:

Embedding variance:  "these outputs look different"
NLI contradiction:   "these outputs are logically opposite"

For medical, legal, or financial AI, that is the difference between an inconsistent agent and a dangerous one.

The Math — Total Variance

The core metric comes from "Evaluating Stochasticity in Deep Research Agents" (arXiv:2602.23271).

Total Variance (TV) across n runs:

TV(X) = 1 / (2n(n-1)) × Σᵢ Σⱼ ||xᵢ - xⱼ||²

Where n is the number of runs (written k elsewhere in this post) and xᵢ is the L2-normalized sentence embedding vector of run i's output.

Implementation:

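from dataclasses import dataclass

import numpy as np
from sentence_transformers import SentenceTransformer


# Result container returned by ScoringEngine.compute(). Its definition is not
# shown in this post; this minimal version is assumed from how it is used.
@dataclass
class VarianceScores:
    answer_variance: float
    findings_variance: float
    citations_variance: float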

class ScoringEngine:
    def __init__(self):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def total_variance(self, texts: list[str]) -> float:
        if len(texts) < 2:
            return 0.0

        # Embed all texts
        embeddings = self.embedder.encode(texts, convert_to_numpy=True)

        # L2 normalize
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / np.maximum(norms, 1e-8)

        n = len(embeddings)
        total = 0.0
        count = 0

        # All pairs i < j
        for i in range(n):
            for j in range(i + 1, n):
                diff = embeddings[i] - embeddings[j]
                total += np.dot(diff, diff)
                count += 1

        return float(total / count) if count > 0 else 0.0

    def compute(self, answers, findings, citations):
        return VarianceScores(
            answer_variance=self.total_variance(answers),
            findings_variance=self.total_variance(
                [" ".join(f) for f in findings]
            ),
            citations_variance=self.total_variance(
                [" ".join(c) for c in citations]
            ),
        )
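As a sanity check on the loop above, the sketch below recomputes the same pairwise mean on toy 2-D unit vectors using only the standard library, so the numbers can be verified by hand. The helper names `pairwise_mean_sq_dist` and `double_sum_tv` are hypothetical. It also makes one detail visible: the mean over unordered pairs i < j is twice the 1 / (2n(n-1)) double sum, if that double sum is read as running over all ordered pairs (i, j).

```python
import math
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_mean_sq_dist(vectors):
    """Mean squared distance over unordered pairs i < j (what the loop computes)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)


def double_sum_tv(vectors):
    """1 / (2n(n-1)) times the sum over ALL ordered pairs (i, j)."""
    n = len(vectors)
    if n < 2:
        return 0.0
    total = sum(sq_dist(a, b) for a in vectors for b in vectors)
    return total / (2 * n * (n - 1))


# Toy unit vectors: two identical, one orthogonal, one opposite
vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

loop_value = pairwise_mean_sq_dist(vectors)   # sum of 14 over 6 pairs
formula_value = double_sum_tv(vectors)        # sum of 28 over 2 * 4 * 3

print(loop_value, formula_value)
```

Which normalization the paper intends is not settled by this post. The factor of two matters mainly when comparing scores against thresholds taken from another source.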

The reliability score:

reliability = 1 - mean(
    answer_variance,
    findings_variance,
    citations_variance
)
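A minimal sketch of that formula in code. Squared distances between unit vectors range from 0 to 4, so the mean of the three variances can exceed 1 and the raw score can go negative. This version clamps the result to [0, 1]; the clamp and the function name `reliability_score` are assumptions, not something this post specifies.

```python
from statistics import mean


def reliability_score(
    answer_variance: float,
    findings_variance: float,
    citations_variance: float,
) -> float:
    """Return 1 - mean(variances), clamped to the range [0, 1]."""
    raw = 1.0 - mean([answer_variance, findings_variance, citations_variance])
    return max(0.0, min(1.0, raw))


# Low variance in every dimension gives a score close to 1
high = reliability_score(0.05, 0.10, 0.10)

# Very high variance would go negative without the clamp
low = reliability_score(2.0, 1.5, 1.0)

print(high, low)
```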

Why Embedding Variance Is Not Enough

Here is the failure case that embedding variance misses:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

text1 = "Drug X is completely safe for pregnant women."
text2 = "Drug X must be avoided during pregnancy."

# Embedding similarity
emb1 = model.encode(text1)
emb2 = model.encode(text2)
similarity = util.cos_sim(emb1, emb2).item()
# Result: ~0.72 (fairly similar: both mention Drug X and pregnancy)

# NLI contradiction (scored with the detector from the next section)
# Result: contradiction_score = 0.89 (directly contradictory)

The sentences are semantically related (same topic, same entities) so embedding similarity stays high. But they are logically opposite. Only NLI catches this.
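To connect that similarity figure to the Total Variance metric: for L2-normalized vectors, squared distance and cosine similarity are tied by ||a - b||² = 2 - 2·cos(a, b). The small sketch below applies the identity to the ~0.72 figure quoted above; the function name is hypothetical. Either number only measures how far apart the embeddings sit, not whether the claims point in opposite directions.

```python
def squared_distance_from_cosine(cos_sim: float) -> float:
    """For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 2.0 - 2.0 * cos_sim


# Squared distance implied by the cosine similarity quoted in the post
pair_distance = squared_distance_from_cosine(0.72)
print(round(pair_distance, 2))
```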

Adding NLI Contradiction Detection

Model used: cross-encoder/nli-MiniLM2-L6-H768

Fast, small (110MB), accurate for pairwise NLI. Classifies pairs as entailment / neutral / contradiction.

from transformers import pipeline
from dataclasses import dataclass, field


@dataclass
class ContradictionPair:
    run_i: int
    run_j: int
    contradiction_score: float
    is_critical: bool


@dataclass
class ContradictionResult:
    max_contradiction: float
    avg_contradiction: float
    critical_pairs: list[ContradictionPair] = field(default_factory=list)
    has_critical_contradiction: bool = False


class ContradictionDetector:
    def __init__(self):
        self.model = pipeline(
            "text-classification",
            model="cross-encoder/nli-MiniLM2-L6-H768",
        )

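    def check_pair(self, text1: str, text2: str) -> dict:
        # Pass the texts as a real sentence pair so the tokenizer inserts its
        # own separator tokens; a literal "[SEP]" in the string is not
        # guaranteed to match this model's separator.
        result = self.model(
            {"text": text1, "text_pair": text2},
            top_k=None,
        )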
        scores = {r["label"].lower(): r["score"] for r in result}
        contradiction_score = scores.get("contradiction", 0.0)
        return {
            "contradiction_score": contradiction_score,
            "entailment_score": scores.get("entailment", 0.0),
            "neutral_score": scores.get("neutral", 0.0),
            "is_critical": contradiction_score > 0.7,
        }

    def check_all_pairs(
        self, outputs: list[str]
    ) -> ContradictionResult:
        if len(outputs) < 2:
            return ContradictionResult(0.0, 0.0)

        all_scores = []
        critical_pairs = []

        for i in range(len(outputs)):
            for j in range(i + 1, len(outputs)):
                result = self.check_pair(outputs[i], outputs[j])
                score = result["contradiction_score"]
                all_scores.append(score)

                if result["is_critical"]:
                    critical_pairs.append(ContradictionPair(
                        run_i=i,
                        run_j=j,
                        contradiction_score=score,
                        is_critical=True,
                    ))

        return ContradictionResult(
            max_contradiction=max(all_scores),
            avg_contradiction=sum(all_scores) / len(all_scores),
            critical_pairs=critical_pairs,
            has_critical_contradiction=len(critical_pairs) > 0,
        )
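NLI models score a premise against a hypothesis, so the score for (A, B) can differ from the score for (B, A), while `check_all_pairs` evaluates each pair in one order only. One possible extension, sketched below, scores both orders and keeps the larger value. The function `symmetric_contradiction` and the stub scorer are hypothetical and not part of this post's code; in practice the scorer would wrap `ContradictionDetector.check_pair`.

```python
from typing import Callable

# A scorer maps (premise, hypothesis) to a contradiction probability in [0, 1]
Scorer = Callable[[str, str], float]


def symmetric_contradiction(score: Scorer, text1: str, text2: str) -> float:
    """Score both orderings and return the larger contradiction score."""
    return max(score(text1, text2), score(text2, text1))


def stub_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model, deliberately asymmetric for the demo."""
    return 0.9 if premise.startswith("Drug X is") else 0.4


a = "Drug X is completely safe for pregnant women."
b = "Drug X must be avoided during pregnancy."

# Same result regardless of which text is passed first
combined = symmetric_contradiction(stub_scorer, a, b)
print(combined)
```

The trade-off is cost: for k runs this doubles the number of NLI calls from k(k-1)/2 to k(k-1).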

The Remediation Engine

Measuring the problem is not enough. When reliability is low you need to know why and what to fix.

from dataclasses import dataclass


@dataclass
class Recommendation:
    dimension: str   # "answer" | "findings" | "citations" | "logic" | "overall"
    severity: str    # "CRITICAL" | "HIGH" | "MEDIUM" | "LOW"
    fix: str
    detail: str


@dataclass
class RemediationReport:
    recommendations: list[Recommendation]
    priority_fix: Recommendation | None
    needs_human_review: bool
    estimated_improvement: str


class RemediationEngine:
    def diagnose(
        self,
        answer_variance: float,
        findings_variance: float,
        citations_variance: float,
        contradiction_score: float,
        overall_reliability: float,
    ) -> RemediationReport:
        recommendations = []

        # Rule 1 — Critical contradiction
        if contradiction_score > 0.7:
            recommendations.append(Recommendation(
                dimension="logic",
                severity="CRITICAL",
                fix="Do not serve — flag for human review",
                detail="Runs directly contradict each other logically",
            ))

        # Rule 2 — High answer variance
        if answer_variance > 0.3:
            recommendations.append(Recommendation(
                dimension="answer",
                severity="HIGH",
                fix="Lower LLM temperature to 0.1-0.2",
                detail="High variance means agent is non-deterministic",
            ))

        # Rule 3 — High findings variance
        if findings_variance > 0.5:
            recommendations.append(Recommendation(
                dimension="findings",
                severity="HIGH",
                fix="Add chain-of-thought structure to system prompt",
                detail="Agent reasons differently each run",
            ))

        # Rule 4 — High citations variance
        if citations_variance > 0.5:
            recommendations.append(Recommendation(
                dimension="citations",
                severity="MEDIUM",
                fix="Pin sources via RAG or force citation format",
                detail="Agent cites inconsistent sources",
            ))

        # Rule 5 — Overall critical failure
        if overall_reliability < 0.5:
            recommendations.append(Recommendation(
                dimension="overall",
                severity="CRITICAL",
                fix="Review entire system prompt and model choice",
                detail="Fundamental reliability problem detected",
            ))

        # Sort by severity
        order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
        recommendations.sort(key=lambda r: order[r.severity])

        needs_human_review = any(
            r.severity == "CRITICAL" for r in recommendations
        )

        if not recommendations:
            improvement = "Agent is reliable — no action needed"
        elif needs_human_review:
            improvement = "Requires human review first"
        else:
            improvement = "30-50% variance reduction if fixes applied"

        return RemediationReport(
            recommendations=recommendations,
            priority_fix=recommendations[0] if recommendations else None,
            needs_human_review=needs_human_review,
            estimated_improvement=improvement,
        )

Adaptive Mode — Solving The Cost Problem

Running k=5 on every production query is expensive. Adaptive mode runs k=2 first and escalates to k=5 only when reliability is at or below a threshold (0.75 by default).

class ReliabilityLayer:
    def __init__(
        self,
        runs: int = 3,
        mode: str = "standard",        # standard | full | adaptive
        escalate_threshold: float = 0.75,
        escalate_runs: int = 5,
    ):
        if mode not in ("standard", "full", "adaptive"):
            raise ValueError(f"Invalid mode: {mode}")

        self.runs = runs
        self.mode = mode
        self.escalate_threshold = escalate_threshold
        self.escalate_runs = escalate_runs

Escalation logic:

Query arrives
  → Run k=2 quick check
  → Score reliability

If reliability > 0.75:
  → Return fast (2 LLM calls, cheap)

If reliability ≤ 0.75:
  → Run k=5 full check
  → Run NLI contradiction on all pairs
  → Generate remediation report
  → Return enhanced result

Cost profile:

High-reliability queries:  2 LLM calls  — cheap path
Low-reliability queries:   5 LLM calls + NLI — full investigation
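The escalation flow can be sketched as a single function. This is an illustration under stated assumptions, not the library's implementation: `adaptive_query`, its parameters, and the stub agent and scorer are hypothetical. It reuses the two quick-check outputs and tops up to five in total, which matches the "5 LLM calls" figure in the cost profile.

```python
from typing import Callable, List, Tuple


def adaptive_query(
    agent: Callable[[str], str],
    score_reliability: Callable[[List[str]], float],
    query: str,
    quick_runs: int = 2,
    escalate_runs: int = 5,
    escalate_threshold: float = 0.75,
) -> Tuple[List[str], float, bool]:
    """Return (outputs, reliability, escalated) for one query."""
    outputs = [agent(query) for _ in range(quick_runs)]
    reliability = score_reliability(outputs)
    if reliability > escalate_threshold:
        return outputs, reliability, False  # cheap path: quick_runs calls

    # Reuse the quick-check outputs and top up to escalate_runs in total
    outputs += [agent(query) for _ in range(escalate_runs - quick_runs)]
    return outputs, score_reliability(outputs), True


# Stub agent and scorer so the sketch runs without an LLM
calls: List[str] = []


def stub_agent(q: str) -> str:
    calls.append(q)
    return f"answer {len(calls) % 2}"  # alternates between two answers


def stub_score(outputs: List[str]) -> float:
    # Hypothetical scorer: 1.0 if all outputs agree, else 0.5
    return 1.0 if len(set(outputs)) == 1 else 0.5


outputs, reliability, escalated = adaptive_query(stub_agent, stub_score, "q")
print(len(calls), reliability, escalated)
```

On the escalated path, running NLI over all pairs of five outputs means 5 × 4 / 2 = 10 pair checks.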

Full Pipeline Integration

import os

from groq import Groq
from reliability_layer import ReliabilityLayer

# Read the key from the environment instead of hardcoding it
client = Groq(api_key=os.environ["GROQ_API_KEY"])

def groq_agent(query: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": """Respond ONLY in this JSON format:
{
  "main_answer": "<one sentence answer>",
  "key_findings": ["<finding 1>", "<finding 2>", "<finding 3>"],
  "confidence": "<HIGH|MEDIUM|LOW>",
  "sources_used": ["<source 1>", "<source 2>"]
}"""
            },
            {"role": "user", "content": query}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Wrap and query
rl = ReliabilityLayer(runs=5, mode="standard")
result = rl.wrap(groq_agent).query(
    "What will happen to the global economy?"
)

print(f"Reliability:   {result.reliability:.3f}")
print(f"Contradiction: {result.contradiction_score:.3f}")

if result.has_critical_contradiction:
    print("CRITICAL CONTRADICTION DETECTED")

if result.remediation_report.recommendations:
    for rec in result.remediation_report.recommendations:
        print(f"[{rec.severity}] {rec.fix}")
else:
    print("Remediation: None required")

Output:

Reliability:   0.813
Contradiction: 0.992
CRITICAL CONTRADICTION DETECTED
[CRITICAL] Do not serve — flag for human review
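The agent returns a JSON string, while `ScoringEngine.compute` takes separate lists of answers, findings, and citations. How the library performs that split is not shown in this post. The sketch below is one plausible way to do it, using a hypothetical helper `split_agent_output` and an assumed fallback that treats unparseable output as a plain answer.

```python
import json
from typing import List, Tuple


def split_agent_output(raw: str) -> Tuple[str, List[str], List[str]]:
    """Split one JSON response into (answer, findings, citations)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: fall back to using the whole text as the answer
        return raw.strip(), [], []
    if not isinstance(data, dict):
        return raw.strip(), [], []
    return (
        str(data.get("main_answer", "")),
        [str(f) for f in (data.get("key_findings") or [])],
        [str(s) for s in (data.get("sources_used") or [])],
    )


raw_runs = [
    '{"main_answer": "Moderate growth.", "key_findings": ["f1", "f2"], '
    '"confidence": "MEDIUM", "sources_used": ["IMF"]}',
    "Not JSON at all",
]

# Transpose per-run tuples into the three parallel lists the scorer expects
answers, findings, citations = map(
    list, zip(*(split_agent_output(r) for r in raw_runs))
)
print(answers, findings, citations)
```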

Test Suite

80 tests across 9 files. Key test patterns:

# NLI — direct contradiction must be detected
def test_direct_contradiction_is_critical(detector):
    result = detector.check_all_pairs([
        "Drug X is completely safe for pregnant women.",
        "Drug X must be avoided during pregnancy.",
    ])
    assert result.has_critical_contradiction
    # A critical pair implies a score above the 0.7 threshold in check_pair
    assert result.max_contradiction > 0.7


# NLI — identical outputs must not trigger
def test_identical_outputs_no_contradiction(detector):
    result = detector.check_all_pairs([
        "Smoking causes lung cancer.",
        "Smoking causes lung cancer.",
        "Smoking causes lung cancer.",
    ])
    assert result.max_contradiction < 0.1


# Remediation — contradiction triggers human review
def test_critical_contradiction_triggers_human_review(engine):
    report = engine.diagnose(
        answer_variance=0.1,
        findings_variance=0.1,
        citations_variance=0.1,
        contradiction_score=0.85,
        overall_reliability=0.7,
    )
    assert report.needs_human_review
    assert report.priority_fix.severity == "CRITICAL"


# Remediation — clean agent needs no action
def test_all_low_variance_no_recommendations(engine):
    report = engine.diagnose(
        answer_variance=0.05,
        findings_variance=0.10,
        citations_variance=0.10,
        contradiction_score=0.01,
        overall_reliability=0.92,
    )
    assert len(report.recommendations) == 0
    assert report.priority_fix is None

Run all 80 tests:

# Windows
.venv\Scripts\pytest.exe tests/ -v

# macOS / Linux
.venv/bin/pytest tests/ -v

# Expected: 80 passed, 0 failed

Installation

git clone https://github.com/Ash8389/Agent-Reliability-Layer.git
cd Agent-Reliability-Layer
python -m venv .venv

# Windows
.venv\Scripts\pip.exe install -e ".[dev]"

# macOS / Linux
.venv/bin/pip install -e ".[dev]"

Add your Groq API key to .env:

GROQ_API_KEY=gsk_your_key_here

Run the demo:

# Windows
.venv\Scripts\python.exe examples/with_groq_agent.py

# macOS / Linux
.venv/bin/python examples/with_groq_agent.py

What Is Not Built Yet

This is Version 2. One more layer is planned:

Layer 1: Semantic Consistency    ✅ Built
Layer 2: Logical Contradiction   ✅ Built
Layer 3: Factual Grounding       ← RAG-based source verification

Layer 3 would verify whether the agent's claims are actually grounded in retrieved sources — not just whether the runs are consistent with each other.

GitHub

Full source, 80 tests, live demo with free Groq API:

https://github.com/Ash8389/Agent-Reliability-Layer

Research paper:
"Evaluating Stochasticity in Deep Research Agents" — arXiv:2602.23271

If you are building LLM agents for production — how do you currently handle output consistency? Curious what approaches people are using and what the failure modes look like at scale.

DEV Community: Ashish Jha