How I Built a System to Detect When LLMs Don't Know Something
The Problem I Was Trying to Solve
I've been deploying LLMs to production for a while, and I kept running into the same issue: the model's confidence scores don't tell you when it lacks knowledge.
For example, an LLM might give you "90% confidence" for both:
- "What is the capital of France?" (factual)
- "What will the stock market do tomorrow?" (impossible to know)
The confidence score is the same, but one is safe to trust and the other isn't.
Understanding Two Types of Uncertainty
After researching uncertainty quantification, I learned there are two fundamental types:
Aleatory Uncertainty (Q1):
- Inherent randomness in the data
- Example: "Will this coin flip be heads?"
- Irreducible - no amount of data will eliminate it
Epistemic Uncertainty (Q2):
- Uncertainty due to lack of knowledge
- Example: "What happened on Mars yesterday?"
- Reducible - could be resolved with more data
The key insight: standard LLM confidence scores mix these together.
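One standard way to make this split concrete (this is from the uncertainty-quantification literature, not the Q1/Q2 method described later in this post): given several independent predictive distributions over the same answer, the entropy of their mean splits into the average entropy of each distribution (aleatory) plus what's left over, which measures disagreement between them (epistemic).

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose_uncertainty(predictions):
    """Split total uncertainty into aleatory and epistemic parts.

    predictions: list of probability distributions (one per model/sample)
    over the same set of candidate answers.
    """
    n = len(predictions)
    k = len(predictions[0])
    # Mean predictive distribution across models
    mean_p = [sum(p[i] for p in predictions) / n for i in range(k)]
    total = entropy(mean_p)                               # total uncertainty
    aleatory = sum(entropy(p) for p in predictions) / n   # expected entropy
    epistemic = total - aleatory                          # disagreement term
    return total, aleatory, epistemic

# Models agree but each is unsure -> all aleatory, no epistemic
t, a, e = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Models are individually confident but disagree -> mostly epistemic
t2, a2, e2 = decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]])
```

Note how the two cases have identical total uncertainty but opposite decompositions; that is exactly the distinction a single confidence score hides.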
My Solution: A Tri-Brain Architecture
I built a system that uses three independent LLM agents to analyze each query:
Query → [Alpha Brain] → [Beta Brain] ∥ [Gamma Brain] → Consensus → Q1/Q2 Scores
Each brain has a different role:
Alpha (Processor):
- Generates the initial response
- Identifies what it knows and doesn't know
- Proposes initial confidence
Beta (Validator):
- Critiques Alpha's response
- Looks for contradictions
- Identifies knowledge gaps
Gamma (Improver):
- Suggests improvements
- Proposes alternative approaches
- Assesses context sufficiency
The three outputs are then combined using a consensus mechanism to calculate separate Q1 and Q2 scores.
Implementation Details
Here's a simplified version of the core logic:
```python
class TriBrainSystem:
    def analyze(self, query: str) -> dict:
        # Get independent analyses
        alpha_response = self.alpha_brain.process(query)
        beta_response = self.beta_brain.validate(query, alpha_response)
        gamma_response = self.gamma_brain.improve(query, alpha_response)

        # Calculate epistemic state
        q1 = self.calculate_aleatory(alpha_response, beta_response, gamma_response)
        q2 = self.calculate_epistemic(alpha_response, beta_response, gamma_response)

        # Generate recommendation
        if q2 > 0.6:
            return {"recommendation": "DEFER", "reason": "High knowledge gap"}
        elif q1 > 0.7:
            return {"recommendation": "REVIEW", "reason": "High randomness"}
        else:
            return {"recommendation": "TRUST", "reason": "Low uncertainty"}
```
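The decision policy is easiest to sanity-check in isolation. The helper below restates the same thresholds as a pure function (my own extraction for illustration, not part of the original class):

```python
def recommend(q1: float, q2: float) -> dict:
    """Same decision policy as TriBrainSystem.analyze, isolated for testing.

    Note the ordering: q2 (knowledge gap) is checked first, so a query that
    is both random and unknown is deferred rather than merely flagged.
    """
    if q2 > 0.6:
        return {"recommendation": "DEFER", "reason": "High knowledge gap"}
    elif q1 > 0.7:
        return {"recommendation": "REVIEW", "reason": "High randomness"}
    return {"recommendation": "TRUST", "reason": "Low uncertainty"}

print(recommend(0.2, 0.8)["recommendation"])  # DEFER: epistemic gap dominates
print(recommend(0.8, 0.3)["recommendation"])  # REVIEW: randomness, but answerable
print(recommend(0.3, 0.3)["recommendation"])  # TRUST
```

Checking q2 before q1 is a deliberate choice: a knowledge gap should always win over mere randomness, because no amount of review fixes a question the model cannot know.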
Calculating Q2 (Epistemic Uncertainty)
The Q2 calculation uses several heuristics:
- Knowledge boundary detection: Does the model explicitly state "I don't know"?
- Contradiction analysis: Do the three brains disagree?
- Context sufficiency: Is there enough information to answer?
- Domain recognition: Is this query outside training data?
```python
def calculate_epistemic(self, alpha, beta, gamma):
    score = 0.0

    # Check for explicit uncertainty markers
    if any(marker in alpha.text.lower() for marker in [
        "i don't know", "unclear", "insufficient information"
    ]):
        score += 0.3

    # Check for disagreement between brains
    agreement = self.calculate_agreement(alpha, beta, gamma)
    score += (1 - agreement) * 0.4

    # Check context sufficiency
    if not beta.context_sufficient:
        score += 0.3

    return min(score, 1.0)
```
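The post doesn't show `calculate_agreement`, so here is one plausible sketch: average pairwise Jaccard similarity over the responses' word sets. The `Response` stub and the token-overlap metric are my assumptions; the real system could just as well use embedding cosine similarity or another judge call.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Response:
    text: str

def calculate_agreement(*responses: Response) -> float:
    """Mean pairwise Jaccard similarity of the responses' word sets, in [0, 1]."""
    token_sets = [set(r.text.lower().split()) for r in responses]
    sims = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sum(sims) / len(sims)

alpha = Response("Paris is the capital of France")
beta = Response("Paris is the capital of France")
gamma = Response("I am not sure about tomorrow")
# Identical answers score 1.0 with each other; one outlier drags the mean down.
```

Token overlap is crude (paraphrases score poorly), which is why an embedding-based similarity would likely be the first upgrade.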
Real-World Results
I tested this across three domains:
Medical Diagnosis:
- Baseline error rate: 15%
- With tri-brain system: 6%
- Reduction: 60%
Content Generation:
- Baseline fact-check failures: 23%
- With tri-brain system: 5%
- Reduction: 78%
Financial Predictions:
- Successfully defers on speculative queries
- Prevents overconfident predictions
Technical Stack
- Backend: Python + FastAPI
- LLM: Claude Opus 4.5 (but architecture is model-agnostic)
- Deployment: Railway.app
- API: RESTful with OpenAPI docs
Open Challenges
There are still several areas I'm working on improving:
Q2 Calibration: The epistemic scores aren't perfectly calibrated yet; I'm working on better validation methods.
Computational Cost: Running three LLM calls is expensive. I'm exploring ways to use a lightweight model for most queries and reserve the tri-brain for critical ones.
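The cost-reduction idea above is a cascade, and its shape is simple enough to sketch (purely illustrative: the post describes this as an open direction, and the threshold, callables, and dict keys here are my own):

```python
def analyze_with_cascade(query, cheap_brain, tri_brain, escalate_above=0.4):
    """Run a single cheap model first; escalate to the full tri-brain
    pipeline only when its self-reported uncertainty crosses a threshold.

    cheap_brain: callable returning {"answer": ..., "uncertainty": float}
    tri_brain:   callable returning the full tri-brain analysis dict
    """
    quick = cheap_brain(query)
    if quick["uncertainty"] <= escalate_above:
        return {"recommendation": "TRUST", "source": "cheap", **quick}
    return {**tri_brain(query), "source": "tri-brain"}

# Stubs stand in for real model calls
cheap_confident = lambda q: {"answer": "Paris", "uncertainty": 0.1}
cheap_unsure = lambda q: {"answer": "?", "uncertainty": 0.9}
full_pipeline = lambda q: {"recommendation": "DEFER"}
```

The open question, of course, is whether the cheap model's self-reported uncertainty is reliable enough to gate escalation, which is the same calibration problem one level down.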
Generalization: Testing whether this approach works across different model architectures.
Edge Cases: Some queries don't fit neatly into Q1/Q2 categories.
What I Learned
1. Multi-agent systems work:
Having three independent perspectives really does improve detection of knowledge gaps.
2. Heuristics matter:
While the LLM does most of the work, adding domain-specific heuristics for Q2 calculation significantly improves accuracy.
3. Consensus is powerful:
Even when individual agents are uncertain, their collective disagreement is a strong signal of epistemic uncertainty.
4. Production deployment is different:
What works in testing doesn't always work in production. I had to add a lot of error handling and fallback logic.
Next Steps
I'm planning to:
- Open source the core architecture
- Write a research paper with more rigorous evaluation
- Add support for other LLM providers
- Improve Q2 calibration
Try It Yourself
I built a simple API for this if you want to experiment:
- Free tier: 100 API calls
- No credit card required
- Full Q1/Q2 breakdown
You can test it here: https://atic.consulting/trial
Discussion
I'd love to hear your thoughts:
- Have you tackled uncertainty quantification in your work?
- Are there better frameworks than Q1/Q2 for this problem?
- Any ideas for improving epistemic detection?
Drop a comment below!
About the Author:
I'm Felipe M. Muniz, working on epistemic AI systems.