<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AyushkhatiDev's Org</title>
    <description>The latest articles on DEV Community by AyushkhatiDev's Org (@ayushkhatidev).</description>
    <link>https://dev.to/ayushkhatidev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3939250%2F49881d90-445a-4251-8e1d-8b3a4898863f.png</url>
      <title>DEV Community: AyushkhatiDev's Org</title>
      <link>https://dev.to/ayushkhatidev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayushkhatidev"/>
    <language>en</language>
    <item>
      <title>I built an open-source LLM eval framework as a BCA student — hallucination detection, red-teaming, regression tracking</title>
      <dc:creator>AyushkhatiDev's Org</dc:creator>
      <pubDate>Tue, 19 May 2026 03:51:11 +0000</pubDate>
      <link>https://dev.to/ayushkhatidev/i-built-an-open-source-llm-eval-framework-as-a-bca-student-hallucination-detection-red-teaming-5hj5</link>
      <guid>https://dev.to/ayushkhatidev/i-built-an-open-source-llm-eval-framework-as-a-bca-student-hallucination-detection-red-teaming-5hj5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27eo6z5u934g89ov5x4f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27eo6z5u934g89ov5x4f.jpeg" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Every company building AI products needs to know whether its LLM is actually working or quietly getting worse over time. This is harder than it sounds.&lt;/p&gt;

&lt;p&gt;I built an open-source evaluation framework to solve this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Runs a 27-test suite covering factual accuracy, safety refusals, 
hallucination resistance, adversarial prompts, and reasoning&lt;/li&gt;
&lt;li&gt;Scores outputs using a 3-tier judge chain:
semantic similarity → LLM judge → regex fallback&lt;/li&gt;
&lt;li&gt;Auto-generates adversarial prompt attacks to red-team any endpoint&lt;/li&gt;
&lt;li&gt;Tracks regressions across model versions&lt;/li&gt;
&lt;li&gt;Shows pass/fail rates and per-test inspection in a live dashboard&lt;/li&gt;
&lt;/ul&gt;
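
&lt;p&gt;The three-tier judge chain listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the &lt;code&gt;judge&lt;/code&gt; function, the &lt;code&gt;PASS_AT&lt;/code&gt; threshold, and the escalation rule are assumptions, and &lt;code&gt;difflib&lt;/code&gt; character overlap stands in for real embedding-based semantic similarity so the example runs on the standard library alone.&lt;/p&gt;

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Hypothetical threshold; the post does not state the real value.
PASS_AT = 0.85


def similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity: character overlap, not embeddings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def judge(
    output: str,
    expected: str,
    pattern: str,
    llm_judge: Optional[Callable[[str, str], Optional[bool]]] = None,
) -> Tuple[bool, str]:
    """Return (passed, tier) from the first tier that reaches a verdict."""
    # Tier 1: accept outright when the output closely matches the reference.
    if similarity(output, expected) >= PASS_AT:
        return True, "similarity"

    # Tier 2: otherwise ask an LLM judge, if one is configured.
    if llm_judge is not None:
        try:
            verdict = llm_judge(output, expected)
        except Exception:
            verdict = None  # treat an API failure as "no verdict"
        if verdict is not None:
            return verdict, "llm"

    # Tier 3: regex fallback when the earlier tiers gave no answer.
    matched = re.search(pattern, output, re.IGNORECASE) is not None
    return matched, "regex"
```

&lt;p&gt;In this sketch, &lt;code&gt;llm_judge&lt;/code&gt; is any callable that returns &lt;code&gt;True&lt;/code&gt;, &lt;code&gt;False&lt;/code&gt;, or &lt;code&gt;None&lt;/code&gt;. Returning the tier name next to the verdict makes it easy to see how often the cheaper tiers were enough.&lt;/p&gt;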

&lt;h2&gt;
  
  
  Research Finding
&lt;/h2&gt;

&lt;p&gt;The hallucination scorer reached &lt;strong&gt;86% classification accuracy&lt;/strong&gt; on a 50-case benchmark, against a &lt;strong&gt;50% random baseline&lt;/strong&gt;.&lt;/p&gt;
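
&lt;p&gt;For readers who want to check the arithmetic behind a figure like this, here is a small sketch of how classification accuracy is computed. The &lt;code&gt;accuracy&lt;/code&gt; helper and the data are synthetic assumptions, not the benchmark itself: the example treats the task as two-class (hallucinated or not), where coin-flip guessing is expected to score 50%, and simply flips 7 of 50 labels to reproduce 43 correct out of 50.&lt;/p&gt;

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of cases where the predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one case")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Synthetic stand-in for a 50-case benchmark (NOT the author's data):
# True means "hallucinated", False means "grounded".
labels = [i % 2 == 0 for i in range(50)]

# Pretend the scorer gets the first 7 cases wrong and the rest right.
predictions = [(not y) if 7 > i else y for i, y in enumerate(labels)]

print(f"accuracy: {accuracy(predictions, labels):.0%}")
```

&lt;p&gt;With only 50 cases, each case moves the score by two percentage points, so a larger benchmark would give a tighter estimate.&lt;/p&gt;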

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Flask backend → PostgreSQL → Groq API → Next.js dashboard&lt;/p&gt;

&lt;p&gt;Deployed at no cost on Render, Vercel, Neon, and Upstash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Live demo: &lt;a href="https://llm-eval-silk.vercel.app/" rel="noopener noreferrer"&gt;https://llm-eval-silk.vercel.app/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/AyushkhatiDev/llm-eval" rel="noopener noreferrer"&gt;https://github.com/AyushkhatiDev/llm-eval&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Research note: &lt;a href="https://github.com/AyushkhatiDev/llm-eval/blob/main/FINDINGS.md" rel="noopener noreferrer"&gt;https://github.com/AyushkhatiDev/llm-eval/blob/main/FINDINGS.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;API: &lt;a href="https://llm-eval-55pg.onrender.com/api/health" rel="noopener noreferrer"&gt;https://llm-eval-55pg.onrender.com/api/health&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stack
&lt;/h2&gt;

&lt;p&gt;Flask, SQLAlchemy, Groq SDK, PostgreSQL, Next.js, Framer Motion, Render, Vercel&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I'm a BCA student from Siliguri, India. I built this in a few weeks because I wanted a portfolio project that solves a real problem, not another todo app.&lt;/p&gt;

&lt;p&gt;Would love feedback on the scoring approach and architecture.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxcx9ba035emogk7min2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxcx9ba035emogk7min2.jpeg" alt=" " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
