<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ritika</title>
    <description>The latest articles on DEV Community by Ritika (@ritika_2603).</description>
    <link>https://dev.to/ritika_2603</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951174%2F1d8f21e2-e61c-4d5c-b1fe-08a212b4a9a8.png</url>
      <title>DEV Community: Ritika</title>
      <link>https://dev.to/ritika_2603</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ritika_2603"/>
    <language>en</language>
    <item>
      <title>[Boost]</title>
      <dc:creator>Ritika</dc:creator>
      <pubDate>Mon, 25 May 2026 20:16:12 +0000</pubDate>
      <link>https://dev.to/ritika_2603/-31g4</link>
      <guid>https://dev.to/ritika_2603/-31g4</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6" class="crayons-story__hidden-navigation-link"&gt;I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/ritika_2603" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951174%2F1d8f21e2-e61c-4d5c-b1fe-08a212b4a9a8.png" alt="ritika_2603 profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/ritika_2603" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Ritika
            &lt;/a&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6" id="article-link-3752108"&gt;
          I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/machinelearning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;machinelearning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python</title>
      <dc:creator>Ritika</dc:creator>
      <pubDate>Mon, 25 May 2026 20:15:46 +0000</pubDate>
      <link>https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6</link>
      <guid>https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6</guid>
      <description>&lt;p&gt;Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API. Tested on 10,000 real examples from the HaluEval dataset.&lt;br&gt;
Soft flag result: precision 0.71, recall 0.96. &lt;br&gt;
Strict flag result: precision 1.00, recall 0.38.&lt;br&gt;
Here’s how it works.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Not Just Use a Model?
&lt;/h2&gt;

&lt;p&gt;Approaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed.&lt;br&gt;
I wanted something where every flag has a reason you can actually read.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;Hallucinated answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things.&lt;br&gt;
Four signals ended up doing most of the work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 1: Length Ratio&lt;/strong&gt;&lt;br&gt;
When a model does not know the answer, it pads. It generates more text to sound convincing instead of staying close to the facts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Word counts for each answer and its reference knowledge text
df['answer_len'] = df['answer'].str.split().str.len()
df['knowledge_len'] = df['knowledge'].str.split().str.len()
df['length_ratio'] = df['answer_len'] / df['knowledge_len']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average length ratio: hallucinated 0.22 vs not hallucinated 0.05&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 2: Unknown Word Rate&lt;/strong&gt;&lt;br&gt;
Grounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def unknown_word_rate(row):
    # Share of distinct answer words that never appear in the knowledge text
    knowledge_words = set(str(row['knowledge']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not answer_words:
        return 0
    unknown = answer_words - knowledge_words
    return len(unknown) / len(answer_words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average unknown word rate: hallucinated 0.46 vs not hallucinated 0.30&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 3: Question-Answer Overlap&lt;/strong&gt;&lt;br&gt;
When a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def question_answer_overlap(row):
    # Share of distinct question words that are repeated in the answer
    question_words = set(str(row['question']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not question_words:
        return 0
    overlap = question_words &amp;amp; answer_words
    return len(overlap) / len(question_words)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average overlap: hallucinated 0.39 vs not hallucinated 0.02&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 4: Numeric Inconsistency&lt;/strong&gt;&lt;br&gt;
Numbers are where models hallucinate most confidently. The general concept might be right but the date, quantity, or statistic is just wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def numeric_inconsistency(row):
    # Share of numbers in the answer that are absent from the knowledge text
    knowledge_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    answer_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    if not answer_nums:
        return 0
    inconsistent = answer_nums - knowledge_nums
    return len(inconsistent) / len(answer_nums)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Average numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining Into a Score
&lt;/h2&gt;

&lt;p&gt;Each signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assumes each signal is already stored as a column, for example via
# df['unknown_word_rate'] = df.apply(unknown_word_rate, axis=1)
df['score'] = (
    (df['length_ratio'] &amp;gt; 0.1).astype(int) +
    (df['unknown_word_rate'] &amp;gt; 0.4).astype(int) +
    (df['qa_overlap'] &amp;gt; 0.2).astype(int) +
    (df['numeric_inconsistency'] &amp;gt; 0.5).astype(int)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.&lt;br&gt;
Average score: hallucinated 2.18 vs not hallucinated 0.39&lt;/p&gt;
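
&lt;p&gt;To check whether the same clustering holds on your own labeled data, you can tally scores per label. The following is a minimal sketch, not code from the original project: the helper name score_distribution and the toy lists are hypothetical, and the numbers are made up for illustration rather than taken from HaluEval.&lt;/p&gt;

```python
from collections import Counter


def score_distribution(scores, labels):
    """Count how many answers land on each score, split by label.

    scores: iterable of ints in 0..4
    labels: iterable of bools, True meaning the answer is hallucinated
    """
    dist = {True: Counter(), False: Counter()}
    for score, is_hallucinated in zip(scores, labels):
        dist[bool(is_hallucinated)][score] += 1
    return dist


# Hypothetical toy data, not HaluEval results
toy_scores = [0, 1, 0, 3, 2, 4, 1, 2]
toy_labels = [False, False, False, True, True, True, False, True]

dist = score_distribution(toy_scores, toy_labels)
for label in (False, True):
    name = "hallucinated" if label else "not hallucinated"
    # One count per score value from 0 to 4
    print(name, [dist[label][s] for s in range(5)])
```

&lt;p&gt;If the two rows of counts overlap heavily on your data, that is a hint the thresholds need recalibrating before you rely on the score.&lt;/p&gt;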
&lt;h2&gt;
  
  
  Two Thresholds Depending on Your Risk Tolerance
&lt;/h2&gt;

&lt;p&gt;Soft flag (score &amp;gt;= 1): precision 0.71, recall 0.96. Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.&lt;br&gt;
Strict flag (score &amp;gt;= 3): precision 1.00, recall 0.38. Use this when your review capacity is limited and you only want the obvious cases.&lt;br&gt;
You can tune the threshold without retraining anything. That matters in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plugging It In&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def score_answer(knowledge, question, answer):
    knowledge_words = set(str(knowledge).lower().split())
    answer_words = set(str(answer).lower().split())
    question_words = set(str(question).lower().split())
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))

    # Total word counts, matching the Signal 1 definition the 0.1 threshold was tuned on
    answer_len = len(str(answer).split())
    knowledge_len = len(str(knowledge).split()) or 1  # avoid division by zero

    length_ratio = answer_len / knowledge_len
    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0
    qa_overlap = len(question_words &amp;amp; answer_words) / len(question_words) if question_words else 0
    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0

    # One point per signal that crosses its threshold
    score = (
        int(length_ratio &amp;gt; 0.1) +
        int(unknown_word_rate &amp;gt; 0.4) +
        int(qa_overlap &amp;gt; 0.2) +
        int(numeric_inconsistency &amp;gt; 0.5)
    )
    return score

# knowledge, question, and answer are the strings from your own pipeline
score = score_answer(knowledge, question, answer)
if score &amp;gt;= 3:
    action = "block"
elif score &amp;gt;= 1:
    action = "flag"
else:
    action = "pass"

&lt;/div&gt;



&lt;p&gt;This runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.&lt;/p&gt;
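
&lt;p&gt;Logging the individual signals could look something like the sketch below. It assumes you already have the four signal values for an answer and want one JSON line per answer. The function name build_log_line, the answer_id field, and the record layout are hypothetical choices for illustration, not part of the original code. The thresholds are the ones used in the scoring step.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone

# Thresholds from the scoring step in this post
THRESHOLDS = {
    "length_ratio": 0.1,
    "unknown_word_rate": 0.4,
    "qa_overlap": 0.2,
    "numeric_inconsistency": 0.5,
}


def build_log_line(answer_id, signals):
    """Serialize one answer's signal values, fired flags, and score as JSON."""
    # A signal fires when its value is strictly above its threshold
    fired = {name: signals[name] > limit for name, limit in THRESHOLDS.items()}
    record = {
        "answer_id": answer_id,  # hypothetical identifier from your pipeline
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "signals": {name: signals[name] for name in THRESHOLDS},
        "fired": fired,
        "score": sum(fired.values()),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values for a single answer
line = build_log_line("example-001", {
    "length_ratio": 0.25,
    "unknown_word_rate": 0.5,
    "qa_overlap": 0.0,
    "numeric_inconsistency": 1.0,
})
print(line)
```

&lt;p&gt;Keeping the fired flags next to the raw values means each logged score can be traced back to the signals that produced it, and the raw values let you re-score old answers if you later change a threshold.&lt;/p&gt;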

&lt;h2&gt;
  
  
  Real Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hallucinated, score 3/4&lt;/strong&gt;&lt;br&gt;
Question: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail? Answer: It's actually Zilpo Road that is known as Midland Trail, not US 60.&lt;br&gt;
The model deflected and contradicted the source instead of answering. Caught.&lt;br&gt;
&lt;strong&gt;Hallucinated, score 3/4&lt;/strong&gt;&lt;br&gt;
Question: Dua Lipa's debut album spawned "New Rules" — in what year was it released? Answer: The album was released in 2018.&lt;br&gt;
The correct year is 2017. Confident and wrong, and the numeric flag caught it.&lt;br&gt;
&lt;strong&gt;Not hallucinated, score 0/4&lt;/strong&gt;&lt;br&gt;
Question: The Dutch-Belgian series "House of Anubis" was based on — first aired in what year? Answer: 2006.&lt;br&gt;
Correct, grounded, one word. Score zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations Worth Knowing
&lt;/h2&gt;

&lt;p&gt;This only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source. Best fit is RAG pipelines and QA systems.&lt;br&gt;
It uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through. The thresholds were tuned on HaluEval, so if you are working in a specialized domain, recalibrate on your own data first.&lt;br&gt;
Precision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.&lt;/p&gt;
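
&lt;p&gt;One way to recalibrate and monitor the tradeoff is to recompute precision and recall at each score cutoff on a labeled sample from your own domain. The sketch below is an illustration under that assumption: the helper name precision_recall and the toy lists are hypothetical, and the printed numbers describe the toy data only, not HaluEval.&lt;/p&gt;

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every answer with score >= threshold.

    labels: True means the answer is actually hallucinated.
    """
    flagged = [score >= threshold for score in scores]
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y)
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and not y)
    false_neg = sum(1 for f, y in zip(flagged, labels) if not f and y)
    # Guard against empty denominators when nothing is flagged or nothing is positive
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical labeled sample, not HaluEval numbers
toy_scores = [0, 1, 0, 3, 2, 4, 1, 2]
toy_labels = [False, False, False, True, True, True, True, False]

for threshold in range(1, 5):
    p, r = precision_recall(toy_scores, toy_labels, threshold)
    print(f"score >= {threshold}: precision {p:.2f}, recall {r:.2f}")
```

&lt;p&gt;Rerunning a sweep like this on fresh labeled samples shows whether the soft and strict cutoffs still sit where you want them.&lt;/p&gt;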

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.&lt;br&gt;
Full code on GitHub: github.com/ritikade2/llm-hallucination-detector&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
