I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python

#machinelearning #python #datascience #ai

Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API. Tested on 10,000 real examples from the HaluEval dataset.
Soft flag result: precision 0.71, recall 0.96.
Strict flag result: precision 1.00, recall 0.38.
Here’s how it works.

Why Not Just Use a Model?

Approaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed.
I wanted something where every flag has a reason you can actually read.

The Core Idea

Hallucinated answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things.
Four signals ended up doing most of the work.

Signal 1: Length Ratio
When a model does not know the answer, it pads. It generates more text to sound convincing instead of staying close to the facts.

df['answer_len'] = df['answer'].str.split().str.len()
df['knowledge_len'] = df['knowledge'].str.split().str.len()
df['length_ratio'] = df['answer_len'] / df['knowledge_len']

Average length ratio: hallucinated 0.22 vs not hallucinated 0.05

Signal 2: Unknown Word Rate
Grounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.

def unknown_word_rate(row):
    knowledge_words = set(str(row['knowledge']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not answer_words:
        return 0
    unknown = answer_words - knowledge_words
    return len(unknown) / len(answer_words)

Average unknown word rate: hallucinated 0.46 vs not hallucinated 0.30
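One thing to watch with plain whitespace splitting: punctuation stays attached to words, so "2006." and "2006" count as different tokens. Below is a minimal sketch of a punctuation-stripping variant. The helper names tokenize and unknown_word_rate_normalized are hypothetical and not part of the original code, and the example text is made up. The averages above come from the plain split() version, so the thresholds would need rechecking if you switch.

```python
import string

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (hypothetical helper)."""
    words = (w.strip(string.punctuation) for w in str(text).lower().split())
    return {w for w in words if w}

def unknown_word_rate_normalized(knowledge, answer):
    """Share of unique answer words that never appear in the knowledge text."""
    knowledge_words = tokenize(knowledge)
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words - knowledge_words) / len(answer_words)

# Made-up example text, not taken from HaluEval
knowledge = "The series first aired in 2006."
answer = "In 2006"

# Plain split(): "2006" is treated as unknown because the source token is "2006."
plain_unknown = set(answer.lower().split()) - set(knowledge.lower().split())

# Normalized tokens: both answer words are found in the source
rate = unknown_word_rate_normalized(knowledge, answer)
```

Whether this helps or hurts on your data is an empirical question, so compare both versions against labeled examples before adopting either.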

Signal 3: Question-Answer Overlap
When a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.

def question_answer_overlap(row):
    question_words = set(str(row['question']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not question_words:
        return 0
    overlap = question_words & answer_words
    return len(overlap) / len(question_words)

Average overlap: hallucinated 0.39 vs not hallucinated 0.02

Signal 4: Numeric Inconsistency
Numbers are where models hallucinate most confidently. The general concept might be right but the date, quantity, or statistic is just wrong.

import re

def numeric_inconsistency(row):
    # Whole-number tokens found in the source and in the answer
    knowledge_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    answer_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    if not answer_nums:
        return 0
    # Numbers the answer states that the source never mentions
    inconsistent = answer_nums - knowledge_nums
    return len(inconsistent) / len(answer_nums)

Average numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001
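To make the signal concrete, here is a small sketch that runs the same regex on a wrong-year answer. The function takes two strings instead of a row, and the knowledge passage is written for this illustration rather than copied from HaluEval. The last line shows a side effect of the pattern worth knowing about: it matches digit runs only, so formatted numbers are split into pieces.

```python
import re

def numeric_inconsistency(knowledge, answer):
    """Share of distinct integers in the answer that are absent from the knowledge text."""
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))
    if not answer_nums:
        return 0.0
    return len(answer_nums - knowledge_nums) / len(answer_nums)

# Illustrative knowledge passage (an assumption, not dataset text)
knowledge = "Dua Lipa's self-titled debut album was released in 2017."

wrong = numeric_inconsistency(knowledge, "The album was released in 2018.")
right = numeric_inconsistency(knowledge, "It came out in 2017.")

# Thousands separators and decimals are broken into separate digit runs
parts = re.findall(r'\b\d+\b', "1,000 copies and 3.5 million streams")
```

Here wrong is 1.0 and right is 0.0, and parts comes out as ['1', '000', '3', '5']. If your sources contain many decimals or comma-formatted figures, a stricter number pattern may be worth testing.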

Combining Into a Score

Each signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.

df['score'] = (
    (df['length_ratio'] > 0.1).astype(int) +
    (df['unknown_word_rate'] > 0.4).astype(int) +
    (df['qa_overlap'] > 0.2).astype(int) +
    (df['numeric_inconsistency'] > 0.5).astype(int)
)

Non-hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.
Average score: hallucinated 2.18 vs not hallucinated 0.39
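The scoring expression assumes the four signal columns already exist. The text does not show that step; with pandas it would presumably be something like df['unknown_word_rate'] = df.apply(unknown_word_rate, axis=1) for each row-wise function. The sketch below runs the same logic on plain dicts so it has no dependencies. The names signals, THRESHOLDS, and score are hypothetical, and the two rows are made up for illustration.

```python
import re

def signals(row):
    """Compute the four signals for one row with 'knowledge', 'question', 'answer' keys."""
    k_words = set(str(row['knowledge']).lower().split())
    q_words = set(str(row['question']).lower().split())
    a_words = set(str(row['answer']).lower().split())
    k_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    a_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    k_len = len(str(row['knowledge']).split())
    return {
        # Total answer words over total knowledge words, as in Signal 1
        'length_ratio': len(str(row['answer']).split()) / k_len if k_len else 0.0,
        'unknown_word_rate': len(a_words - k_words) / len(a_words) if a_words else 0.0,
        'qa_overlap': len(q_words & a_words) / len(q_words) if q_words else 0.0,
        'numeric_inconsistency': len(a_nums - k_nums) / len(a_nums) if a_nums else 0.0,
    }

# Thresholds copied from the scoring expression above
THRESHOLDS = {'length_ratio': 0.1, 'unknown_word_rate': 0.4,
              'qa_overlap': 0.2, 'numeric_inconsistency': 0.5}

def score(row):
    """One point per signal that exceeds its threshold, giving 0 to 4."""
    s = signals(row)
    return sum(int(s[name] > limit) for name, limit in THRESHOLDS.items())

# Made-up rows for illustration, not HaluEval data
source = "The bridge opened in 1932 and spans the harbour between the two city districts."
rows = [
    {'knowledge': source, 'question': "When did the bridge open?",
     'answer': "1932"},
    {'knowledge': source, 'question': "When did the bridge open?",
     'answer': "The bridge did open in 1958 after a long delay."},
]
scores = [score(r) for r in rows]
```

The short grounded answer scores 0 and the padded wrong-year answer trips all four signals for a score of 4.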

Two Thresholds Depending on Your Risk Tolerance

Soft flag (score >= 1): precision 0.71, recall 0.96.
Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.
Strict flag (score >= 3): precision 1.00, recall 0.38.
Use this when your review capacity is limited and you only want the obvious cases.
You can tune the threshold without retraining anything. That matters in production.

Plugging It In

import re

def score_answer(knowledge, question, answer):
    # Lowercased unique-word sets for each text
    knowledge_words = set(str(knowledge).lower().split())
    answer_words = set(str(answer).lower().split())
    question_words = set(str(question).lower().split())
    # Whole-number tokens in the source and in the answer
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))

    # Note: these lengths count unique words, while the DataFrame version
    # of Signal 1 counts all words, so the ratios can differ slightly
    answer_len = len(answer_words)
    knowledge_len = len(knowledge_words) if knowledge_words else 1

    length_ratio = answer_len / knowledge_len
    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0
    qa_overlap = len(question_words & answer_words) / len(question_words) if question_words else 0
    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0

    # One point per signal that crosses its threshold
    score = (
        int(length_ratio > 0.1) +
        int(unknown_word_rate > 0.4) +
        int(qa_overlap > 0.2) +
        int(numeric_inconsistency > 0.5)
    )
    return score

score = score_answer(knowledge, question, answer)
if score >= 3:
    action = "block"
elif score >= 1:
    action = "flag"
else:
    action = "pass"

This runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.
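One way to get a readable reason for every flag, and a log record at the same time, is to return the signal values alongside the score. This is a sketch under assumptions: explain_answer and the record layout are hypothetical, not part of the original code, and the example inputs are made up. It uses the same unique-word counts and thresholds as score_answer.

```python
import json
import re

# Thresholds as used in score_answer
THRESHOLDS = {'length_ratio': 0.1, 'unknown_word_rate': 0.4,
              'qa_overlap': 0.2, 'numeric_inconsistency': 0.5}

def explain_answer(knowledge, question, answer):
    """Return signal values, the signals that fired, the score, and the action."""
    k_words = set(str(knowledge).lower().split())
    q_words = set(str(question).lower().split())
    a_words = set(str(answer).lower().split())
    k_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    a_nums = set(re.findall(r'\b\d+\b', str(answer)))
    values = {
        # Unique-word counts, matching score_answer
        'length_ratio': len(a_words) / (len(k_words) or 1),
        'unknown_word_rate': len(a_words - k_words) / len(a_words) if a_words else 0.0,
        'qa_overlap': len(q_words & a_words) / len(q_words) if q_words else 0.0,
        'numeric_inconsistency': len(a_nums - k_nums) / len(a_nums) if a_nums else 0.0,
    }
    # Names of the signals over their thresholds: the human-readable reason
    fired = [name for name, limit in THRESHOLDS.items() if values[name] > limit]
    score = len(fired)
    action = "block" if score >= 3 else "flag" if score >= 1 else "pass"
    return {'signals': values, 'fired': fired, 'score': score, 'action': action}

# Made-up inputs for illustration
record = explain_answer(
    "The bridge opened in 1932.",
    "When did the bridge open?",
    "It opened in 1958.",
)
log_line = json.dumps(record)  # one JSON line per scored answer
```

In this example three signals fire (length_ratio, unknown_word_rate, numeric_inconsistency), so the action is "block". Writing log_line to whatever logging you already have gives you the per-signal history needed for recalibration later.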

Real Examples

Hallucinated, score 3/4
Question: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail? Answer: It's actually Zilpo Road that is known as Midland Trail, not US 60.
The model deflected and contradicted the source instead of answering. Caught.
Hallucinated, score 3/4
Question: Dua Lipa's debut album spawned "New Rules" — in what year was it released? Answer: The album was released in 2018.
The correct year is 2017. Confident, wrong, numeric flag caught it.
Not hallucinated, score 0/4
Question: The Dutch-Belgian series "House of Anubis" was based on — first aired in what year? Answer: 2006.
Correct, grounded, one word. Score zero.

Limitations Worth Knowing

This only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source. Best fit is RAG pipelines and QA systems.
It uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through. The thresholds were tuned on HaluEval, so if you are working in a specialized domain, recalibrate on your own data first.
Precision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.
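Monitoring and recalibration both come down to recomputing precision and recall at each score threshold on labeled data. Here is a minimal sketch, assuming you have logged scores plus a reviewed label for each answer. The function name precision_recall and the sample data are made up for illustration and are not results from HaluEval.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every answer with score >= threshold.

    labels: True means the answer was confirmed as a hallucination.
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))          # flagged, hallucinated
    fp = sum(f and not y for f, y in zip(flagged, labels))      # flagged, actually fine
    fn = sum((not f) and y for f, y in zip(flagged, labels))    # missed hallucinations
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up scores and review labels, purely illustrative
scores = [0, 0, 1, 1, 2, 3, 3, 4]
labels = [False, False, False, True, True, True, True, True]

# Precision and recall for each possible flag threshold
sweep = {t: precision_recall(scores, labels, t) for t in range(1, 5)}
```

On this toy data the sweep shows the same shape as the tradeoff described earlier: threshold 1 catches every hallucination with some false alarms, while threshold 3 has no false alarms but misses the lower-scoring cases.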

Final Thought

AI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.
Full code on GitHub: github.com/ritikade2/llm-hallucination-detector
