Ayush Singh

Posted on May 20

The Scariest LLM Failure Isn't a Crash. It's a Confident Wrong Answer. What Do You Think?

#showdev #opensource #ai #discuss

The most dangerous LLM failure isn't the obvious one.
It is not a crash. It is not an error message. It is a model that sounds completely sure of itself and is completely wrong.
Your user reads it. Believes it. Acts on it. You find out later.

I built a system to catch this before it happens.

The Problem With "Just Check the Output"

Most developers think hallucination detection means checking if the answer looks right.
That doesn't work. The model sounds right even when it is wrong, and that is the whole problem.
You need a different approach. Instead of asking "is this answer correct?" you ask:
"Do multiple independent models agree on this answer?"

If they do, it is probably reliable.
If they don't, something is wrong, even if you can't tell what.

This is called ensemble disagreement. It is the core idea behind how FIE detects hallucinations.

How It Works — The Shadow Jury

When your primary model gives an answer, FIE quietly sends the same prompt to 3 independent shadow models running in parallel.

User Prompt
    │
    ├──► Your Primary LLM        ──► "Thomas Edison invented the telephone."
    ├──► Shadow Model 1 (Llama)  ──► "Alexander Graham Bell invented the telephone."
    ├──► Shadow Model 2 (DeepSeek) ► "Alexander Graham Bell, in 1876."
    └──► Shadow Model 3 (Qwen)   ──► "Bell patented the telephone in 1876."

Primary model is the outlier. Three shadows agree. That is a hallucination signal.

FIE computes three signals from this:
Entropy Score — how spread out are the answers?

0.0 = all models said the same thing
1.0 = every model said something different
Above 0.75 = high failure risk

Agreement Score — what fraction of outputs cluster together?

1.0 = perfect consensus
Below 0.80 = models are disagreeing

Ensemble Disagreement — did any pair of outputs fall below 65% semantic similarity?

True = models gave meaningfully different answers

When the primary model is the outlier AND entropy is high — FIE flags it.
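The three signals above can be sketched in a few lines. This is an illustration only, not FIE's implementation: the post does not give the exact formulas, so the greedy clustering, the entropy normalisation, and the `similarity` function (a character-level ratio from `difflib` standing in for real semantic similarity) are all assumptions. Only the 0.65 similarity threshold comes from the post.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity (assumption): character ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(answers: list[str], threshold: float = 0.65) -> list[list[int]]:
    """Greedy clustering: an answer joins the first cluster whose first member it resembles."""
    clusters: list[list[int]] = []
    for i, answer in enumerate(answers):
        for members in clusters:
            if similarity(answer, answers[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def jury_signals(answers: list[str], threshold: float = 0.65) -> dict:
    """answers[0] is the primary model's output; the rest are shadow outputs."""
    clusters = cluster(answers, threshold)
    n = len(answers)
    probs = [len(c) / n for c in clusters]
    # Shannon entropy over cluster sizes, divided by log(n) so that
    # 0.0 means one cluster and 1.0 means every answer is its own cluster.
    entropy = sum(-p * math.log(p) for p in probs) / math.log(n) if n > 1 else 0.0
    largest = max(clusters, key=len)
    return {
        "entropy": entropy,
        "agreement": len(largest) / n,
        "disagreement": any(
            similarity(a, b) < threshold for a, b in combinations(answers, 2)
        ),
        "primary_is_outlier": 0 not in largest,
    }


signals = jury_signals(["edison", "bell", "bell", "bell"])
print(signals["agreement"], signals["primary_is_outlier"])  # 0.75 True
```

With a real embedding model in place of `similarity`, the same structure would apply to full-sentence answers like the ones in the diagram.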

It Doesn't Just Flag — It Diagnoses

Most monitoring tools tell you something failed.

FIE tells you what kind of failure it is — because different failures need different fixes.

HALLUCINATION_RISK
Models disagree, entropy is high, primary is the outlier. The model invented an answer.
→ Fix: replace with shadow consensus or escalate to human review.

OVERCONFIDENT_FAILURE
High failure risk but low entropy. The model is confidently wrong — and so are the shadows.
→ Fix: verify against external ground truth (Wikidata or live search).

TEMPORAL_KNOWLEDGE_CUTOFF
The question asks about current data — prices, scores, news. The model's training is outdated.
→ Fix: inject today's date as context or run a live search.

UNSTABLE_OUTPUT
High entropy but no clear outlier. The model gives different answers every time you ask.
→ Fix: lower temperature, run self-consistency, or flag as uncertain.

CONTEXT_DEPENDENT
High entropy caused by missing conversation history — not a real hallucination.
→ Fix: pass prior conversation turns to shadow models.
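The five failure types above can be read as a small decision procedure. The sketch below is one possible ordering of the checks, not FIE's actual classifier: the `Signals` fields other than entropy and the outlier flag are hypothetical inputs, and only the 0.75 entropy threshold is taken from the post.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_ENTROPY = 0.75  # "high failure risk" threshold quoted in the post


@dataclass
class Signals:
    entropy: float
    primary_is_outlier: bool
    high_failure_risk: bool   # hypothetical: verdict of some risk classifier
    asks_current_data: bool   # hypothetical: prompt is about prices, scores, news
    missing_history: bool     # hypothetical: shadows did not see prior turns


def diagnose(s: Signals) -> Optional[str]:
    """Map signals to a failure type, or None when nothing looks wrong."""
    if s.asks_current_data:
        return "TEMPORAL_KNOWLEDGE_CUTOFF"
    if s.entropy > HIGH_ENTROPY:
        if s.missing_history:
            return "CONTEXT_DEPENDENT"
        if s.primary_is_outlier:
            return "HALLUCINATION_RISK"
        return "UNSTABLE_OUTPUT"
    if s.high_failure_risk:
        # Low entropy but still risky: the shadows may share the same error.
        return "OVERCONFIDENT_FAILURE"
    return None


print(diagnose(Signals(0.9, True, True, False, False)))  # HALLUCINATION_RISK
```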

The Fix Engine

Detection is only half the problem.

Once FIE knows what failed and why, it decides what to do:

High confidence failure
    │
    ├── Factual hallucination?     → Replace with shadow consensus
    ├── Temporal question?         → Inject live context (today's date + search result)
    ├── All models disagree?       → Escalate to human review
    └── Confidence too low?        → Return original + warning, don't guess

The key rule: FIE never auto-corrects when it isn't sure.

A wrong correction is worse than no correction. If the evidence is weak, it escalates instead.
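A rough sketch of that decision tree, with the "never auto-correct when unsure" rule checked first. The function name, the `confidence` score, and the 0.8 cutoff are assumptions made for illustration; the post does not say how FIE measures confidence.

```python
from enum import Enum


class Fix(Enum):
    REPLACE_WITH_CONSENSUS = "replace with shadow consensus"
    INJECT_LIVE_CONTEXT = "inject today's date and a search result"
    ESCALATE_TO_HUMAN = "escalate to human review"
    RETURN_WITH_WARNING = "return original answer with a warning"


def choose_fix(
    failure_type: str,
    confidence: float,
    shadows_agree: bool,
    min_confidence: float = 0.8,  # hypothetical cutoff
) -> Fix:
    # Weak evidence: do not guess, hand back the original with a warning.
    if confidence < min_confidence:
        return Fix.RETURN_WITH_WARNING
    if failure_type == "TEMPORAL_KNOWLEDGE_CUTOFF":
        return Fix.INJECT_LIVE_CONTEXT
    # A consensus answer only exists when the shadows agree with each other.
    if failure_type == "HALLUCINATION_RISK" and shadows_agree:
        return Fix.REPLACE_WITH_CONSENSUS
    return Fix.ESCALATE_TO_HUMAN


print(choose_fix("HALLUCINATION_RISK", confidence=0.95, shadows_agree=True).name)
# REPLACE_WITH_CONSENSUS
```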

Real Numbers

Evaluated on 2,477 labeled examples from TruthfulQA, HaluEval, and MMLU:

Method	Recall	False Positive Rate	AUC-ROC
Rule-based baseline	56.4%	38.7%	—
XGBoost v3	63.6%	38.6%	0.677
XGBoost v4 (FIE)	68.2%	8.4%	0.840

The big win isn't recall — it's the false positive rate dropping from 38% to 8%.

A hallucination detector that flags 38% of clean answers gets turned off by every developer who tries it. That's worse than nothing.
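To make the false positive rate concrete, here are the standard definitions of the two metrics, plus the number of false alarms per 1,000 clean answers at the rates from the table. The helper names are illustrative and not part of the FIE SDK.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real hallucinations that the detector caught."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of clean answers that the detector flagged anyway."""
    return false_positives / (false_positives + true_negatives)


def expected_false_alarms(clean_answers: int, fpr: float) -> int:
    """How many clean answers get flagged at a given false positive rate."""
    return round(clean_answers * fpr)


print(expected_false_alarms(1000, 0.386))  # 386 (XGBoost v3 rate)
print(expected_false_alarms(1000, 0.084))  # 84  (XGBoost v4 rate)
```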

Try It

pip install fie-sdk

from fie import monitor

@monitor(
    fie_url="https://your-fie-server.com",
    api_key="your-api-key",
    mode="correct",  # waits and returns corrected answer if hallucination detected
)
def ask_ai(prompt: str) -> str:
    return your_llm(prompt)

Non-blocking mode — check in background, return answer immediately:

@monitor(mode="monitor")  # returns original answer, checks in background
def ask_ai(prompt: str) -> str:
    return your_llm(prompt)

GitHub: github.com/AyushSingh110/Failure_Intelligence_System
PyPI: pypi.org/project/fie-sdk

The One Thing To Remember

Your LLM doesn't know when it is wrong.
It speaks with the same confidence whether the answer is correct or hallucinated. That is not a bug you can patch — it is how these models work.

The only reliable signal is disagreement. When independent models diverge, something is uncertain. When your primary model is the outlier, something is wrong.
That is the idea. Everything else is engineering around it.

Top comments (7)

TxDesk • May 21

Ensemble disagreement is a strong signal for the trivia category your benchmarks cover (TruthfulQA, HaluEval), and the false-positive drop from 38% to 8% is the real win. Worth flagging where this approach hits a ceiling though, because it shapes how the next layer of fixes has to look.

Ensemble agreement assumes a fallback exists. For general knowledge questions, "who invented the telephone," the fallback is solid: shadow consensus or Wikidata lookup. But for domains where the answer is user state rather than world knowledge, there's no ground truth to fall back to. "What's the current health factor of wallet 0xabc on Aave V3 Arbitrum at block N" has exactly one correct answer and it isn't in any model's training data. Three confidently-agreeing shadows are still confidently wrong, because the question can't be answered by any LLM directly.

The pattern I ended up building for this is stricter than ensemble agreement: refuse to answer at all unless the answer came from a deterministic verified source. Every factual claim the agent makes has to trace back to a tool call (RPC, on-chain read, indexed query) that returned a real value. If the tool call returns null, the agent returns null. It cannot fill the gap with model knowledge, can't infer "it's probably around X", can't say "based on similar positions." Null is a first-class answer that propagates up.

Sounds restrictive but it's the only thing that holds up when user assets are on the line. The Bankr exploit a few days ago is a clean illustration: the agent confidently understood what the user wanted, no model would have flagged disagreement, the failure mode was the confidence itself. Ensemble agreement wouldn't have caught it. What would have caught it is the agent never being in the signing path to begin with, plus a refusal to make factual claims about state it couldn't verify.

Two layers worth thinking about for FIE in domains where state matters more than knowledge: (1) treat tool-call provenance as a first-class signal alongside ensemble agreement, the answer should be flagged if it isn't traceable to a verified source even if all models agree, and (2) make null a valid output that gets surfaced to the user as "I don't have access to this" rather than silently filled in. Confident wrong answers are bad. Confident-and-disagreeing answers are catchable by your system. The dangerous third category is confident-and-agreeing answers about state nobody actually checked.
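A minimal sketch of the null-propagation idea from this comment, assuming a hypothetical `fetch` callable that wraps the tool call (RPC, on-chain read, indexed query) and returns `None` when it has no value. The names are illustrative and do not come from any real agent framework.

```python
from typing import Callable, Optional


def answer_from_tool(
    fetch: Callable[[], Optional[float]], label: str
) -> Optional[str]:
    """Build an answer only from a tool result; never fall back to model knowledge."""
    value = fetch()
    if value is None:
        return None  # null propagates up instead of being filled with a guess
    return f"{label}: {value} (source: tool call)"


def render(answer: Optional[str]) -> str:
    """Surface null to the user explicitly."""
    return answer if answer is not None else "I don't have access to this."


print(render(answer_from_tool(lambda: None, "health factor")))
# I don't have access to this.
print(render(answer_from_tool(lambda: 1.82, "health factor")))
# health factor: 1.82 (source: tool call)
```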

Ayush Singh • May 22

That’s a really good point, and I agree with the ceiling you are pointing out. Ensemble disagreement is useful when the answer lives in shared world knowledge, like trivia or stable factual QA. But for stateful domains such as wallets, balances, health factors, or block-specific values, model agreement does not prove truth. It can just mean multiple models made the same unsupported guess.

I really like your idea of treating null as a first-class answer. In high-stakes systems, if the answer cannot be traced back to a deterministic source like an RPC call, database lookup, indexed query, or verified tool result, the safest behavior is to say the system does not have access to that value instead of filling the gap with model confidence.

This is a useful direction for FIE too. Ensemble agreement should not be treated as verified truth on its own. A stronger reliability layer would include tool-call provenance as a signal: where did this value actually come from? If the answer came from a model inference rather than a verified source, it should still be flagged even when all models agree.

Your framing separates the problem really well: confident wrong answers, confident but disagreeing answers, and the most dangerous category, confident and agreeing answers about state nobody actually checked. I am definitely going to look into adding provenance-gated answers and null propagation as part of the next layer for FIE. How would you usually design the boundary between “safe to answer from model knowledge” and “must require a verified tool call”?

TxDesk • May 22

Good question, and the honest answer is that the boundary is about what's being claimed, not what's being asked.

Three categories I think about:
General knowledge (what is impermanent loss, how does liquidation work, what's the difference between Optimism and Arbitrum) → answer from model knowledge, no tool call needed. World knowledge that won't be different next week.

Specific state about real entities (what's the current Aave USDC APY on Arbitrum, what's gas right now, what's BTC price) → requires a verified source because the answer is time-varying. Model can give a "training-data ballpark" with explicit acknowledgement it's stale, but anything with action consequences must be tool-fetched.

User-specific state (what's my wallet's health factor, what positions do I have, why did my last tx fail) → strict tool-call requirement. The answer is user-unique, unverifiable without the user's actual on-chain data, and almost always carries action consequences. This is the category where null-propagation matters most. If the tool returns nothing, the agent says "I don't have access to that" instead of inferring from similar positions.

A second principle that's helped me operationally: stakes scale the strictness. "Gas is about 30 gwei" being slightly wrong doesn't lose anyone money. "Your USDC balance is X" being slightly wrong might. So the threshold for "must verify" gets tighter as the action consequences grow. For TxDesk specifically the calibration is: anything involving a specific wallet, address, transaction hash, or position is strict-verify. Anything about general DeFi concepts is fair game from training data.
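The three categories and the stakes rule could be encoded roughly like this. It is a sketch of the boundary as described in this comment; the enum names and the single `has_action_consequences` flag are simplifications chosen for illustration.

```python
from enum import Enum


class ClaimKind(Enum):
    GENERAL_KNOWLEDGE = "general knowledge"
    TIME_VARYING_STATE = "specific state about real entities"
    USER_SPECIFIC_STATE = "user-specific state"


class Policy(Enum):
    MODEL_KNOWLEDGE_OK = "answer from model knowledge"
    BALLPARK_WITH_STALENESS_NOTE = "training-data ballpark, labeled as stale"
    TOOL_CALL_REQUIRED = "must come from a verified tool call"


def verification_policy(kind: ClaimKind, has_action_consequences: bool) -> Policy:
    """Decide how strictly a claim must be verified before it is shown."""
    if kind is ClaimKind.USER_SPECIFIC_STATE:
        return Policy.TOOL_CALL_REQUIRED
    if kind is ClaimKind.TIME_VARYING_STATE:
        # Stakes scale the strictness: a ballpark is fine only if nobody acts on it.
        if has_action_consequences:
            return Policy.TOOL_CALL_REQUIRED
        return Policy.BALLPARK_WITH_STALENESS_NOTE
    return Policy.MODEL_KNOWLEDGE_OK


print(verification_policy(ClaimKind.TIME_VARYING_STATE, False).name)
# BALLPARK_WITH_STALENESS_NOTE
```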

The harder edge case I'm still working through: synthesis questions where the answer requires combining verified state with model reasoning. Example: user has a verified Aave position (tool-fetched, ground truth), and asks "is this risky?" The position data is verified, but the risk assessment is model judgment on top of it. The pattern I've landed on is provenance-tagged: the agent can synthesize but must show its work, citing which numbers came from tools and which from model reasoning. Curious if you've thought about how FIE could mark answers as partially-provenanced vs fully-provenanced.

Ayush Singh • May 22

Thanks for such a thoughtful reply. I was hoping to get exactly this kind of practical feedback so I can understand the gaps in FIE and improve it in the right direction.
That distinction makes a lot of sense, especially the idea that the boundary is about the claim, not just the prompt. A user may ask one question, but the answer can contain different kinds of claims: general explanation, live market/state data, user-specific state, and then model judgment on top of that. Treating all of those with the same verification rule would be too blunt.

I really like the three-category split: general knowledge, specific time-varying state, and user-specific state. For FIE, this makes me think provenance should probably exist at the claim level, not just the response level. So instead of saying that the whole answer is “verified” or “not verified,” FIE could mark parts of the answer as tool-backed, model-inferred, stale/general knowledge, or unverified.

The synthesis case is the most interesting one. For something like “is my Aave position risky?”, the raw numbers should be strict-verified through tools, but the risk interpretation is still reasoning. So a partially provenanced answer might be the right label: “these values came from verified tool calls, and this conclusion is model reasoning based on those values.” That feels much safer than pretending the entire answer is fully proven.

I haven’t implemented that distinction yet, but this is a really useful direction. I am going to think about adding a provenance layer where FIE can separate verified facts from model judgment, and score the response based on how much of it is actually traceable. Do you think claim-level provenance is enough, or would you also require the final recommendation/action itself to pass a separate verification or policy gate?

TxDesk • May 23

Claim-level provenance is necessary but not sufficient. The recommendation/action layer needs its own gate, and the right design is to scale gate strictness with consequence severity.

Three tiers I think about:
Read claims (the agent displays data, "your USDC balance is X"). Claim-level provenance is enough. If the source is verified, showing the user is safe even if the model adds light interpretation around it.

Soft actions (recommendations, "you should consider rebalancing"). Claim provenance plus explicit labeling that the recommendation is model judgment, not a directive. The user sees both the verified inputs and the line "this is my reasoning on top of those inputs," and decides whether to act.

Hard actions (executing a swap, sending a transaction, modifying on-chain state). Claim provenance plus explicit user confirmation plus a policy gate. The policy gate checks: is the action within user-configured bounds (max slippage, max transaction value, whitelisted contracts)? Even if every claim is verified and the user confirms, the policy gate catches the "user confirmed but the action is outside their normal risk envelope" case. This is where I think the strongest reliability gain lives, because human confirmation is the failure mode of last resort and is itself prone to fatigue and rushed clicks.

The interesting design choice is what happens when a hard action would require unverifiable claims. Example: "swap 50% of my USDC for ETH" requires knowing the USDC balance (verified) and the ETH price (verified), but the user might also be implicitly relying on "this is a good time" (model judgment, not verified). I currently surface that gap explicitly before confirmation: "this swap is based on your current balance and the current price. I cannot verify whether timing is favorable." Friction, but honest friction.

For FIE specifically, the policy gate is probably the layer that benefits most from your provenance work. If the gate has access to which inputs were verified vs which were model-reasoned, it can refuse to execute actions whose justification depends on unverified judgment. That's a strong reliability primitive.
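For the hard-action tier, the policy gate described above might look something like this. It is a sketch under assumptions: the field names, the USD value limit, and the order of the checks are hypothetical; the three bounds (max slippage, max transaction value, whitelisted contracts) are the ones named in this comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """User-configured risk envelope."""
    max_slippage: float
    max_value_usd: float
    whitelisted_contracts: frozenset[str]


@dataclass(frozen=True)
class ProposedAction:
    contract: str
    value_usd: float
    slippage: float
    user_confirmed: bool
    all_claims_verified: bool


def policy_gate(action: ProposedAction, bounds: Bounds) -> tuple[bool, str]:
    """Return (allowed, reason). Confirmation alone is never enough."""
    if not action.all_claims_verified:
        return False, "justification depends on an unverified claim"
    if not action.user_confirmed:
        return False, "user confirmation missing"
    if action.contract not in bounds.whitelisted_contracts:
        return False, "contract is not whitelisted"
    if action.value_usd > bounds.max_value_usd:
        return False, "value exceeds configured maximum"
    if action.slippage > bounds.max_slippage:
        return False, "slippage exceeds configured maximum"
    return True, "within bounds"


bounds = Bounds(0.01, 5_000.0, frozenset({"0xRouter"}))
swap = ProposedAction("0xRouter", 20_000.0, 0.005, True, True)
print(policy_gate(swap, bounds))  # (False, 'value exceeds configured maximum')
```

The example covers the "user confirmed but the action is outside their normal risk envelope" case: every claim is verified and the user clicked confirm, yet the gate still refuses.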

Ayush Singh • May 23

The three-tier framing is exactly right and maps cleanly to what FIE now outputs.
After your last message I implemented a provenance layer that assigns one of five labels to every response after verification runs:
FULLY_PROVENANCED - claim traced to Wikidata or live search, source confirmed
PARTIALLY_PROVENANCED - shadow consensus agrees but no external source confirmed it
UNVERIFIED_MODEL_INFERENCE - model answered from training, nothing verified
REQUIRES_TOOL_VERIFICATION - question type requires live data, no tool call was made
NULL_REQUIRED_BUT_MISSING - the answer depends on state the model cannot know, and the model answered anyway instead of returning null
That last one is your honest friction case. When FIE detects it, the correct downstream behavior is exactly what you described: surface the gap before the user acts, not after.
The policy gate is the layer FIE does not have yet and your framing clarifies why it matters. Right now FIE flags "requires tool verification" and "null required but missing" but does not enforce anything based on those labels; it surfaces them and stops. The enforcement has to happen upstream in whatever agent holds the action capability.

Your point about human confirmation being a failure mode is the one worth building around. If the policy gate has access to provenance_label per claim, it can refuse hard actions whose justification path includes any UNVERIFIED_MODEL_INFERENCE — independent of whether the user clicked confirm. That is a stronger gate than confirmation alone.
Are you building that gate as part of the agent or as a separate layer that sits between the agent and the signing path?

TxDesk • May 28

Separate layer, and the reason is blast radius. If the gate lives inside the agent, then a prompt-injection or a reasoning bug in the agent can route around it. The agent is the thing you don't fully trust, so it can't also be the thing that authorizes irreversible actions.

The shape I'd build: the agent produces a proposed action plus its justification path (the claims and their provenance labels). The gate sits between the agent and the signing/execution capability, and it only holds the policy. It never reasons. It just checks: does any claim in this justification path carry UNVERIFIED_MODEL_INFERENCE or NULL_REQUIRED_BUT_MISSING? If yes, refuse, regardless of what the agent concluded or whether a human clicked confirm.

The key property is that the gate is dumb on purpose. The moment the gate starts doing its own inference to decide whether to allow an action, it inherits the same failure mode it's supposed to catch. Provenance labels in, allow/deny out, no model in the middle.

One thing your five-label scheme gets right that's easy to miss: NULL_REQUIRED_BUT_MISSING and UNVERIFIED_MODEL_INFERENCE need different gate behavior. The first means "the model should have abstained and didn't" which is always a hard refuse. The second might be acceptable for low-stakes reads but never for actions that move state. The gate policy should be able to branch on the label, not treat all unverified the same.
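A sketch of a gate that is "dumb on purpose" and branches on the label. The five label names are the ones listed earlier in the thread; the function name and the `moves_state` flag are assumptions. Only the two rules stated in this comment are encoded. How the remaining labels should be treated is not specified in the thread, so this sketch lets them through and leaves that decision to the policy author.

```python
from enum import Enum


class Label(Enum):
    FULLY_PROVENANCED = "FULLY_PROVENANCED"
    PARTIALLY_PROVENANCED = "PARTIALLY_PROVENANCED"
    UNVERIFIED_MODEL_INFERENCE = "UNVERIFIED_MODEL_INFERENCE"
    REQUIRES_TOOL_VERIFICATION = "REQUIRES_TOOL_VERIFICATION"
    NULL_REQUIRED_BUT_MISSING = "NULL_REQUIRED_BUT_MISSING"


def gate_allows(justification: list[Label], moves_state: bool) -> bool:
    """Labels in, allow/deny out. No model call and no reasoning in here."""
    # The model should have abstained and did not: always a hard refuse.
    if Label.NULL_REQUIRED_BUT_MISSING in justification:
        return False
    # Unverified inference may be tolerable for reads, never for state changes.
    if moves_state and Label.UNVERIFIED_MODEL_INFERENCE in justification:
        return False
    return True


path = [Label.FULLY_PROVENANCED, Label.UNVERIFIED_MODEL_INFERENCE]
print(gate_allows(path, moves_state=False))  # True
print(gate_allows(path, moves_state=True))   # False
```

Note that user confirmation is not an input at all, which matches the point that the gate should refuse regardless of whether a human clicked confirm.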