
AyushkhatiDev's Org


I built an open-source LLM eval framework as a BCA student — hallucination detection, red-teaming, regression tracking

The Problem

Every company building AI products needs to know whether its LLM is
actually working, or getting worse over time. This is harder than
it sounds: outputs are free-form text, and a model or prompt update
can break behavior that used to pass.

I built an open-source evaluation framework to address this.

What It Does

  • Runs a 27-test suite covering factual accuracy, safety refusals, hallucination resistance, adversarial prompts, and reasoning
  • Scores outputs using a 3-tier judge chain: semantic similarity → LLM judge → regex fallback
  • Auto-generates adversarial prompt attacks to red-team any endpoint
  • Tracks regressions across model versions
  • Shows a live dashboard with pass/fail rates and per-test inspection
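
To make the judge chain and regression tracking above more concrete, here is a minimal sketch of how they could be wired up. It is not the framework's actual code. It assumes each tier either returns a verdict or abstains, and that the first verdict wins. The function names, the 0.8 threshold, and the `llm_judge` callable are all hypothetical, and `difflib` character matching stands in for a real embedding-based similarity score.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional

# A tier returns True/False as a verdict, or None to abstain and defer.
Verdict = Optional[bool]


def similarity_tier(output: str, expected: str, threshold: float = 0.8) -> Verdict:
    """Stand-in for semantic similarity: character-level ratio, not embeddings."""
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    # Assumption: only a close match is treated as a confident pass.
    return True if ratio >= threshold else None


def llm_tier(output: str, expected: str,
             llm_judge: Optional[Callable[[str, str], bool]]) -> Verdict:
    """Ask a hypothetical LLM judge; abstain if none is set or the call fails."""
    if llm_judge is None:
        return None
    try:
        return llm_judge(output, expected)
    except Exception:
        return None


def regex_tier(output: str, pattern: str) -> bool:
    """Last resort: a regex match always produces a verdict."""
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


def judge(output: str, expected: str, pattern: str,
          llm_judge: Optional[Callable[[str, str], bool]] = None) -> tuple[str, bool]:
    """Return (tier_name, passed) from the first tier that gives a verdict."""
    verdict = similarity_tier(output, expected)
    if verdict is not None:
        return "similarity", verdict
    verdict = llm_tier(output, expected, llm_judge)
    if verdict is not None:
        return "llm", verdict
    return "regex", regex_tier(output, pattern)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Tests that passed on the baseline version but fail on the candidate."""
    return sorted(t for t, ok in baseline.items() if ok and candidate.get(t) is False)


if __name__ == "__main__":
    expected = "Paris is the capital of France."
    print(judge(expected, expected, r"\bparis\b"))
    print(judge("The answer: Paris.", expected, r"\bparis\b"))
    print(find_regressions({"t1": True, "t2": True}, {"t1": True, "t2": False}))
```

Passing the LLM judge in as a plain callable keeps the sketch independent of any particular SDK, and it means a failed or missing API call degrades to the regex tier instead of failing the whole test run.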

Research Finding

The hallucination scorer reached 86% classification accuracy, against
a 50% random baseline, on a 50-case benchmark.
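
To put those numbers in context, here is a small sketch of how accuracy against a chance baseline can be checked. It assumes a balanced binary task (which a 50% random baseline suggests) and that 86% of 50 cases means 43 correct verdicts. The labels and predictions below are made-up placeholders, not the real benchmark data.

```python
from math import comb


def accuracy(labels: list[bool], predictions: list[bool]) -> float:
    """Fraction of cases where the scorer's verdict matches the true label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal length")
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def chance_tail(correct: int, total: int, p: float = 0.5) -> float:
    """P(at least `correct` hits out of `total`) for a guesser with hit rate p."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical data: 25 hallucinated and 25 faithful cases, 43 scored correctly.
labels = [True] * 25 + [False] * 25
predictions = [True] * 22 + [False] * 3 + [False] * 21 + [True] * 4

acc = accuracy(labels, predictions)
print(f"accuracy = {acc:.2f}")
print(f"chance of >= 43/50 by guessing = {chance_tail(43, 50):.1e}")
```

With only 50 cases the estimate is still coarse, so a larger benchmark would be needed before reading much into small differences between scorer versions.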

Architecture

Flask backend → PostgreSQL → Groq API → Next.js dashboard

Everything is deployed on the free tiers of Render, Vercel, Neon, and Upstash.

Links

Stack

Flask, SQLAlchemy, Groq SDK, PostgreSQL, Next.js, Framer Motion,
Render, Vercel

About Me

I'm a BCA student from Siliguri, India. I built this in a few weeks
because I wanted a portfolio project that solves a real problem —
not another todo app.

Would love feedback on the scoring approach and architecture.
