How I Built a System to Detect When LLMs Don't Know Something
The Problem I Was Trying to Solve
I've been deploying LLMs to production for a while, and I kept running into the same issue: the model's confidence scores don't tell you when it lacks knowledge.
For example, an LLM might give you "90% confidence" for both:
- "What is the capital of France?" (factual)
- "What will the stock market do tomorrow?" (impossible to know)
The confidence score is the same, but one is safe to trust and the other isn't.
Understanding Two Types of Uncertainty
After researching uncertainty quantification, I learned there are two fundamental types:
Aleatory Uncertainty (Q1):
- Inherent randomness in the data
- Example: "Will this coin flip be heads?"
- Irreducible - no amount of data will eliminate it
Epistemic Uncertainty (Q2):
- Uncertainty due to lack of knowledge
- Example: "What happened on Mars yesterday?"
- Reducible - could be resolved with more data
The key insight: standard LLM confidence scores mix these together.
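One standard way to make this split concrete (this is from the uncertainty-quantification literature, not the Q1/Q2 method described later in this post): given several independent predictive distributions over the same answer, the entropy of their mean splits into the average entropy of each distribution (aleatory) plus what's left over, which measures disagreement between them (epistemic).

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def decompose_uncertainty(predictions):
    """Split total uncertainty into aleatory and epistemic parts.

    predictions: list of probability distributions (one per model/sample)
    over the same set of candidate answers.
    """
    n = len(predictions)
    k = len(predictions[0])
    # Mean predictive distribution across models
    mean_p = [sum(p[i] for p in predictions) / n for i in range(k)]
    total = entropy(mean_p)                               # total uncertainty
    aleatory = sum(entropy(p) for p in predictions) / n   # expected entropy
    epistemic = total - aleatory                          # disagreement term
    return total, aleatory, epistemic

# Models agree but each is unsure -> all aleatory, no epistemic
t, a, e = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Models are individually confident but disagree -> mostly epistemic
t2, a2, e2 = decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]])
```

Note how the two cases have identical total uncertainty but opposite decompositions; that is exactly the distinction a single confidence score hides.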
My Solution: A Tri-Brain Architecture
I built a system that uses three independent LLM agents to analyze each query:
Query → [Alpha Brain] → [Beta Brain] ∥ [Gamma Brain] → Consensus → Q1/Q2 Scores
Each brain has a different role:
Alpha (Processor):
- Generates the initial response
- Identifies what it knows and doesn't know
- Proposes initial confidence
Beta (Validator):
- Critiques Alpha's response
- Looks for contradictions
- Identifies knowledge gaps
Gamma (Improver):
- Suggests improvements
- Proposes alternative approaches
- Assesses context sufficiency
The three outputs are then combined using a consensus mechanism to calculate separate Q1 and Q2 scores.
Implementation Details
Here's a simplified version of the core logic:
```python
class TriBrainSystem:
    def analyze(self, query: str) -> dict:
        # Get independent analyses
        alpha_response = self.alpha_brain.process(query)
        beta_response = self.beta_brain.validate(query, alpha_response)
        gamma_response = self.gamma_brain.improve(query, alpha_response)

        # Calculate epistemic state
        q1 = self.calculate_aleatory(alpha_response, beta_response, gamma_response)
        q2 = self.calculate_epistemic(alpha_response, beta_response, gamma_response)

        # Generate recommendation
        if q2 > 0.6:
            return {"recommendation": "DEFER", "reason": "High knowledge gap"}
        elif q1 > 0.7:
            return {"recommendation": "REVIEW", "reason": "High randomness"}
        else:
            return {"recommendation": "TRUST", "reason": "Low uncertainty"}
```
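The decision policy is easiest to sanity-check in isolation. The helper below restates the same thresholds as a pure function (my own extraction for illustration, not part of the original class):

```python
def recommend(q1: float, q2: float) -> dict:
    """Same decision policy as TriBrainSystem.analyze, isolated for testing.

    Note the ordering: q2 (knowledge gap) is checked first, so a query that
    is both random and unknown is deferred rather than merely flagged.
    """
    if q2 > 0.6:
        return {"recommendation": "DEFER", "reason": "High knowledge gap"}
    elif q1 > 0.7:
        return {"recommendation": "REVIEW", "reason": "High randomness"}
    return {"recommendation": "TRUST", "reason": "Low uncertainty"}

print(recommend(0.2, 0.8)["recommendation"])  # DEFER: epistemic gap dominates
print(recommend(0.8, 0.3)["recommendation"])  # REVIEW: randomness, but answerable
print(recommend(0.3, 0.3)["recommendation"])  # TRUST
```

Checking q2 before q1 is a deliberate choice: a knowledge gap should always win over mere randomness, because no amount of review fixes a question the model cannot know.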
Calculating Q2 (Epistemic Uncertainty)
The Q2 calculation uses several heuristics:
- Knowledge boundary detection: Does the model explicitly state "I don't know"?
- Contradiction analysis: Do the three brains disagree?
- Context sufficiency: Is there enough information to answer?
- Domain recognition: Is this query outside training data?
```python
def calculate_epistemic(self, alpha, beta, gamma):
    score = 0.0

    # Check for explicit uncertainty markers
    if any(marker in alpha.text.lower() for marker in [
        "i don't know", "unclear", "insufficient information"
    ]):
        score += 0.3

    # Check for disagreement between brains
    agreement = self.calculate_agreement(alpha, beta, gamma)
    score += (1 - agreement) * 0.4

    # Check context sufficiency
    if not beta.context_sufficient:
        score += 0.3

    return min(score, 1.0)
```
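The post doesn't show `calculate_agreement`, so here is one plausible sketch: average pairwise Jaccard similarity over the responses' word sets. The `Response` stub and the token-overlap metric are my assumptions; the real system could just as well use embedding cosine similarity or another judge call.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Response:
    text: str

def calculate_agreement(*responses: Response) -> float:
    """Mean pairwise Jaccard similarity of the responses' word sets, in [0, 1]."""
    token_sets = [set(r.text.lower().split()) for r in responses]
    sims = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sum(sims) / len(sims)

alpha = Response("Paris is the capital of France")
beta = Response("Paris is the capital of France")
gamma = Response("I am not sure about tomorrow")
# Identical answers score 1.0 with each other; one outlier drags the mean down.
```

Token overlap is crude (paraphrases score poorly), which is why an embedding-based similarity would likely be the first upgrade.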
Real-World Results
I tested this across three domains:
Medical Diagnosis:
- Baseline error rate: 15%
- With tri-brain system: 6%
- Reduction: 60%
Content Generation:
- Baseline fact-check failures: 23%
- With tri-brain system: 5%
- Reduction: 78%
Financial Predictions:
- Successfully defers on speculative queries
- Prevents overconfident predictions
Technical Stack
- Backend: Python + FastAPI
- LLM: Claude Opus 4.5 (but architecture is model-agnostic)
- Deployment: Railway.app
- API: RESTful with OpenAPI docs
Open Challenges
There are still several areas I'm working on improving:
Q2 Calibration: The epistemic scores aren't perfectly calibrated yet; I'm working on better validation methods.
Computational Cost: Running three LLM calls is expensive. I'm exploring ways to use a lightweight model for most queries and reserve the tri-brain for critical ones.
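The cost-reduction idea above is a cascade, and its shape is simple enough to sketch (purely illustrative: the post describes this as an open direction, and the threshold, callables, and dict keys here are my own):

```python
def analyze_with_cascade(query, cheap_brain, tri_brain, escalate_above=0.4):
    """Run a single cheap model first; escalate to the full tri-brain
    pipeline only when its self-reported uncertainty crosses a threshold.

    cheap_brain: callable returning {"answer": ..., "uncertainty": float}
    tri_brain:   callable returning the full tri-brain analysis dict
    """
    quick = cheap_brain(query)
    if quick["uncertainty"] <= escalate_above:
        return {"recommendation": "TRUST", "source": "cheap", **quick}
    return {**tri_brain(query), "source": "tri-brain"}

# Stubs stand in for real model calls
cheap_confident = lambda q: {"answer": "Paris", "uncertainty": 0.1}
cheap_unsure = lambda q: {"answer": "?", "uncertainty": 0.9}
full_pipeline = lambda q: {"recommendation": "DEFER"}
```

The open question, of course, is whether the cheap model's self-reported uncertainty is reliable enough to gate escalation, which is the same calibration problem one level down.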
Generalization: Testing whether this approach works across different model architectures.
Edge Cases: Some queries don't fit neatly into Q1/Q2 categories.
What I Learned
1. Multi-agent systems work:
Having three independent perspectives really does improve detection of knowledge gaps.
2. Heuristics matter:
While the LLM does most of the work, adding domain-specific heuristics for Q2 calculation significantly improves accuracy.
3. Consensus is powerful:
Even when individual agents are uncertain, their collective disagreement is a strong signal of epistemic uncertainty.
4. Production deployment is different:
What works in testing doesn't always work in production. I had to add a lot of error handling and fallback logic.
Next Steps
I'm planning to:
- Open source the core architecture
- Write a research paper with more rigorous evaluation
- Add support for other LLM providers
- Improve Q2 calibration
Try It Yourself
I built a simple API for this if you want to experiment:
- Free tier: 100 API calls
- No credit card required
- Full Q1/Q2 breakdown
You can test it here: https://atic.consulting/trial
Discussion
I'd love to hear your thoughts:
- Have you tackled uncertainty quantification in your work?
- Are there better frameworks than Q1/Q2 for this problem?
- Any ideas for improving epistemic detection?
Drop a comment below!
About the Author:
I'm Felipe M. Muniz, working on epistemic AI systems.