LLM agents are non-deterministic. Everyone knows this. What is less discussed is a specific failure mode that is worse than variance — when an agent does not just give different answers across runs, but gives logically opposite answers.
This post covers how I built a middleware layer to detect and diagnose this, using the Total Variance formula from arXiv:2602.23271 and NLI contradiction detection.
The Problem
Run the same query five times through the same agent:
Query: "What will happen to the global economy in the next 5 years?"
Run 1: "The economy will experience moderate growth of 3-4%"
Run 2: "Significant recessionary pressures will dominate"
Run 3: "Growth will continue driven by emerging markets"
Run 4: "Economic contraction is the most likely scenario"
Run 5: "Moderate expansion with inflationary headwinds"
Runs 1, 3, 5 say growth. Runs 2, 4 say contraction. Same agent. Same query. Opposite conclusions.
The standard fix is to measure embedding similarity across runs and flag high variance. But embedding variance misses the critical distinction:
Embedding variance: "these outputs look different"
NLI contradiction: "these outputs are logically opposite"
For medical AI, legal AI, or financial AI — that difference is the difference between an inconsistent agent and a dangerous one.
The Math — Total Variance
The core metric comes from "Evaluating Stochasticity in Deep Research Agents" (arXiv:2602.23271).
Total Variance (TV) across n runs is the mean squared pairwise distance between run embeddings:

TV(X) = (1 / (n(n-1))) × Σᵢ Σⱼ≠ᵢ ||xᵢ - xⱼ||²

Where the xᵢ are L2-normalized sentence embedding vectors of each run's output. Summing only over unordered pairs i < j and dividing by n(n-1)/2 gives the same value, which is what the implementation below does.
Implementation:
```python
import numpy as np
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer

@dataclass
class VarianceScores:
    answer_variance: float
    findings_variance: float
    citations_variance: float

class ScoringEngine:
    def __init__(self):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def total_variance(self, texts: list[str]) -> float:
        if len(texts) < 2:
            return 0.0
        # Embed all texts
        embeddings = self.embedder.encode(texts, convert_to_numpy=True)
        # L2 normalize
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        embeddings = embeddings / np.maximum(norms, 1e-8)
        n = len(embeddings)
        total = 0.0
        count = 0
        # All pairs i < j
        for i in range(n):
            for j in range(i + 1, n):
                diff = embeddings[i] - embeddings[j]
                total += np.dot(diff, diff)
                count += 1
        return float(total / count) if count > 0 else 0.0

    def compute(self, answers, findings, citations):
        return VarianceScores(
            answer_variance=self.total_variance(answers),
            findings_variance=self.total_variance(
                [" ".join(f) for f in findings]
            ),
            citations_variance=self.total_variance(
                [" ".join(c) for c in citations]
            ),
        )
```
The reliability score:
```
reliability = 1 - mean(
    answer_variance,
    findings_variance,
    citations_variance
)
```
Why Embedding Variance Is Not Enough
Here is the failure case that embedding variance misses:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

text1 = "Drug X is completely safe for pregnant women."
text2 = "Drug X must be avoided during pregnancy."

# Embedding similarity
emb1 = model.encode(text1)
emb2 = model.encode(text2)
similarity = util.cos_sim(emb1, emb2)
# Result: ~0.72 (fairly similar — both about Drug X and pregnancy)

# NLI contradiction
# Result: contradiction_score = 0.89 (directly contradictory)
```
The sentences are semantically related (same topic, same entities) so embedding similarity stays high. But they are logically opposite. Only NLI catches this.
Adding NLI Contradiction Detection
Model used: cross-encoder/nli-MiniLM2-L6-H768
Fast, small (110MB), accurate for pairwise NLI. Classifies pairs as entailment / neutral / contradiction.
```python
from transformers import pipeline
from dataclasses import dataclass, field

@dataclass
class ContradictionPair:
    run_i: int
    run_j: int
    contradiction_score: float
    is_critical: bool

@dataclass
class ContradictionResult:
    max_contradiction: float
    avg_contradiction: float
    critical_pairs: list[ContradictionPair] = field(default_factory=list)
    has_critical_contradiction: bool = False

class ContradictionDetector:
    def __init__(self):
        self.model = pipeline(
            "text-classification",
            model="cross-encoder/nli-MiniLM2-L6-H768",
        )

    def check_pair(self, text1: str, text2: str) -> dict:
        # Pass the sentences as an explicit text pair so the tokenizer
        # builds the premise/hypothesis input the NLI head expects.
        result = self.model(
            {"text": text1, "text_pair": text2},
            top_k=None,
        )
        scores = {r["label"].lower(): r["score"] for r in result}
        contradiction_score = scores.get("contradiction", 0.0)
        return {
            "contradiction_score": contradiction_score,
            "entailment_score": scores.get("entailment", 0.0),
            "neutral_score": scores.get("neutral", 0.0),
            "is_critical": contradiction_score > 0.7,
        }

    def check_all_pairs(
        self, outputs: list[str]
    ) -> ContradictionResult:
        if len(outputs) < 2:
            return ContradictionResult(0.0, 0.0)
        all_scores = []
        critical_pairs = []
        for i in range(len(outputs)):
            for j in range(i + 1, len(outputs)):
                result = self.check_pair(outputs[i], outputs[j])
                score = result["contradiction_score"]
                all_scores.append(score)
                if result["is_critical"]:
                    critical_pairs.append(ContradictionPair(
                        run_i=i,
                        run_j=j,
                        contradiction_score=score,
                        is_critical=True,
                    ))
        return ContradictionResult(
            max_contradiction=max(all_scores),
            avg_contradiction=sum(all_scores) / len(all_scores),
            critical_pairs=critical_pairs,
            has_critical_contradiction=len(critical_pairs) > 0,
        )
```
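One practical note: check_all_pairs is quadratic in the number of runs. Scoring k runs takes k(k-1)/2 NLI forward passes, which is the cost pressure that motivates the adaptive mode later in this post. A trivial sketch of how the call count grows:

```python
def nli_call_count(k: int) -> int:
    """Number of unordered run pairs scored by the NLI model."""
    return k * (k - 1) // 2

for k in (2, 3, 5, 10):
    print(f"k={k}: {nli_call_count(k)} NLI calls")
# k=2: 1, k=3: 3, k=5: 10, k=10: 45
```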
The Remediation Engine
Measuring the problem is not enough. When reliability is low you need to know why and what to fix.
```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    dimension: str  # "answer" | "findings" | "citations" | "logic" | "overall"
    severity: str   # "CRITICAL" | "HIGH" | "MEDIUM" | "LOW"
    fix: str
    detail: str

@dataclass
class RemediationReport:
    recommendations: list[Recommendation]
    priority_fix: Recommendation | None
    needs_human_review: bool
    estimated_improvement: str

class RemediationEngine:
    def diagnose(
        self,
        answer_variance: float,
        findings_variance: float,
        citations_variance: float,
        contradiction_score: float,
        overall_reliability: float,
    ) -> RemediationReport:
        recommendations = []

        # Rule 1 — Critical contradiction
        if contradiction_score > 0.7:
            recommendations.append(Recommendation(
                dimension="logic",
                severity="CRITICAL",
                fix="Do not serve — flag for human review",
                detail="Runs directly contradict each other logically",
            ))

        # Rule 2 — High answer variance
        if answer_variance > 0.3:
            recommendations.append(Recommendation(
                dimension="answer",
                severity="HIGH",
                fix="Lower LLM temperature to 0.1-0.2",
                detail="High variance means agent is non-deterministic",
            ))

        # Rule 3 — High findings variance
        if findings_variance > 0.5:
            recommendations.append(Recommendation(
                dimension="findings",
                severity="HIGH",
                fix="Add chain-of-thought structure to system prompt",
                detail="Agent reasons differently each run",
            ))

        # Rule 4 — High citations variance
        if citations_variance > 0.5:
            recommendations.append(Recommendation(
                dimension="citations",
                severity="MEDIUM",
                fix="Pin sources via RAG or force citation format",
                detail="Agent cites inconsistent sources",
            ))

        # Rule 5 — Overall critical failure
        if overall_reliability < 0.5:
            recommendations.append(Recommendation(
                dimension="overall",
                severity="CRITICAL",
                fix="Review entire system prompt and model choice",
                detail="Fundamental reliability problem detected",
            ))

        # Sort by severity
        order = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}
        recommendations.sort(key=lambda r: order[r.severity])

        needs_human_review = any(
            r.severity == "CRITICAL" for r in recommendations
        )

        if not recommendations:
            improvement = "Agent is reliable — no action needed"
        elif needs_human_review:
            improvement = "Requires human review first"
        else:
            improvement = "30-50% variance reduction if fixes applied"

        return RemediationReport(
            recommendations=recommendations,
            priority_fix=recommendations[0] if recommendations else None,
            needs_human_review=needs_human_review,
            estimated_improvement=improvement,
        )
```
Adaptive Mode — Solving The Cost Problem
Running k=5 on every production query is expensive. Adaptive mode runs k=2 first and only escalates when reliability drops below a threshold.
```python
class ReliabilityLayer:
    def __init__(
        self,
        runs: int = 3,
        mode: str = "standard",  # standard | full | adaptive
        escalate_threshold: float = 0.75,
        escalate_runs: int = 5,
    ):
        if mode not in ("standard", "full", "adaptive"):
            raise ValueError(f"Invalid mode: {mode}")
        self.runs = runs
        self.mode = mode
        self.escalate_threshold = escalate_threshold
        self.escalate_runs = escalate_runs
```
Escalation logic:
```
Query arrives
  → Run k=2 quick check
  → Score reliability

If reliability > 0.75:
  → Return fast (2 LLM calls, cheap)

If reliability ≤ 0.75:
  → Run k=5 full check
  → Run NLI contradiction on all pairs
  → Generate remediation report
  → Return enhanced result
```
Cost profile:
High-reliability queries: 2 LLM calls — cheap path
Low-reliability queries: 5 LLM calls + NLI — full investigation
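The escalation flow can be sketched in a few lines. Everything below is illustrative: the stub agent and the score callback stand in for real LLM calls and the real scoring pipeline:

```python
from typing import Callable

def adaptive_query(
    agent: Callable[[str], str],
    score: Callable[[list[str]], float],  # reliability for a set of runs
    query: str,
    escalate_threshold: float = 0.75,
    quick_runs: int = 2,
    full_runs: int = 5,
) -> dict:
    # Cheap path: a k=2 quick check first.
    outputs = [agent(query) for _ in range(quick_runs)]
    reliability = score(outputs)
    if reliability > escalate_threshold:
        return {"reliability": reliability, "runs": quick_runs, "escalated": False}
    # Escalate: top up to k=5 and re-score (NLI and remediation
    # would run here in the real layer).
    outputs += [agent(query) for _ in range(full_runs - quick_runs)]
    return {"reliability": score(outputs), "runs": full_runs, "escalated": True}

# Stub agent that alternates answers, and a stub scorer that treats
# any disagreement between runs as low reliability.
answers = ["growth", "contraction", "growth", "contraction", "growth"]
agent = lambda q, it=iter(answers): next(it)
score = lambda outs: 1.0 if len(set(outs)) == 1 else 0.4

result = adaptive_query(agent, score, "economy outlook?")
print(result)  # escalated to 5 runs because the k=2 check disagreed
```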
Full Pipeline Integration
```python
from reliability_layer import ReliabilityLayer
from groq import Groq

client = Groq(api_key="your_key")

def groq_agent(query: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {
                "role": "system",
                "content": """Respond ONLY in this JSON format:
{
  "main_answer": "<one sentence answer>",
  "key_findings": ["<finding 1>", "<finding 2>", "<finding 3>"],
  "confidence": "<HIGH|MEDIUM|LOW>",
  "sources_used": ["<source 1>", "<source 2>"]
}"""
            },
            {"role": "user", "content": query}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Wrap and query
rl = ReliabilityLayer(runs=5, mode="standard")
result = rl.wrap(groq_agent).query(
    "What will happen to the global economy?"
)

print(f"Reliability: {result.reliability:.3f}")
print(f"Contradiction: {result.contradiction_score:.3f}")

if result.has_critical_contradiction:
    print("CRITICAL CONTRADICTION DETECTED")

if result.remediation_report.recommendations:
    for rec in result.remediation_report.recommendations:
        print(f"[{rec.severity}] {rec.fix}")
else:
    print("Remediation: None required")
```
Output:
```
Reliability: 0.813
Contradiction: 0.992
CRITICAL CONTRADICTION DETECTED
[CRITICAL] Do not serve — flag for human review
```
Test Suite
80 tests across 9 files. Key test patterns:
```python
# NLI — direct contradiction must be detected
def test_direct_contradiction_is_critical(detector):
    result = detector.check_all_pairs([
        "Drug X is completely safe for pregnant women.",
        "Drug X must be avoided during pregnancy.",
    ])
    assert result.has_critical_contradiction == True
    assert result.max_contradiction > 0.5

# NLI — identical outputs must not trigger
def test_identical_outputs_no_contradiction(detector):
    result = detector.check_all_pairs([
        "Smoking causes lung cancer.",
        "Smoking causes lung cancer.",
        "Smoking causes lung cancer.",
    ])
    assert result.max_contradiction < 0.1

# Remediation — contradiction triggers human review
def test_critical_contradiction_triggers_human_review(engine):
    report = engine.diagnose(
        answer_variance=0.1,
        findings_variance=0.1,
        citations_variance=0.1,
        contradiction_score=0.85,
        overall_reliability=0.7,
    )
    assert report.needs_human_review == True
    assert report.priority_fix.severity == "CRITICAL"

# Remediation — clean agent needs no action
def test_all_low_variance_no_recommendations(engine):
    report = engine.diagnose(
        answer_variance=0.05,
        findings_variance=0.10,
        citations_variance=0.10,
        contradiction_score=0.01,
        overall_reliability=0.92,
    )
    assert len(report.recommendations) == 0
    assert report.priority_fix is None
```
Run all 80 tests:
```shell
# Windows
.venv\Scripts\pytest.exe tests/ -v

# macOS / Linux
pytest tests/ -v

# Expected: 80 passed, 0 failed
```
Installation
```shell
git clone https://github.com/Ash8389/Agent-Reliability-Layer.git
cd Agent-Reliability-Layer
python -m venv .venv

# Windows
.venv\Scripts\pip.exe install -e ".[dev]"

# macOS / Linux
pip install -e ".[dev]"
```
Add your Groq API key to .env:
GROQ_API_KEY=gsk_your_key_here
Run the demo:
```shell
# Windows
.venv\Scripts\python.exe examples/with_groq_agent.py

# macOS / Linux
python examples/with_groq_agent.py
```
What Is Not Built Yet
This is Version 2. Two more layers are planned:
Layer 1: Semantic Consistency ✅ Built
Layer 2: Logical Contradiction ✅ Built
Layer 3: Factual Grounding ← RAG-based source verification
Layer 3 would verify whether the agent's claims are actually grounded in retrieved sources — not just whether the runs are consistent with each other.
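One possible shape for Layer 3, purely illustrative since none of this exists in the repo yet: treat each claim as an NLI hypothesis against each retrieved source passage, and call a claim grounded if at least one source entails it. A minimal sketch with a pluggable entailment function (the toy_entails stub is a stand-in; a real version would reuse the cross-encoder and read its entailment score):

```python
from typing import Callable

def grounding_score(
    claims: list[str],
    sources: list[str],
    entails: Callable[[str, str], float],  # entails(premise, hypothesis) -> [0, 1]
    threshold: float = 0.7,
) -> float:
    """Fraction of claims entailed by at least one retrieved source."""
    if not claims:
        return 1.0
    grounded = sum(
        1 for claim in claims
        if any(entails(src, claim) >= threshold for src in sources)
    )
    return grounded / len(claims)

# Toy entailment stub for illustration only: substring match, not NLI.
def toy_entails(premise: str, hypothesis: str) -> float:
    return 1.0 if hypothesis.lower() in premise.lower() else 0.0

score = grounding_score(
    claims=["growth will slow"],
    sources=["Analysts expect growth will slow next year."],
    entails=toy_entails,
)
print(score)  # 1.0
```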
GitHub
Full source, 80 tests, live demo with free Groq API:
https://github.com/Ash8389/Agent-Reliability-Layer
Research paper:
"Evaluating Stochasticity in Deep Research Agents" — arXiv:2602.23271
If you are building LLM agents for production — how do you currently handle output consistency? Curious what approaches people are using and what the failure modes look like at scale.