Every company building AI products needs to know whether its LLM is
actually working or getting worse over time. This is harder than
it sounds.
I built an open-source evaluation framework to solve this.
What It Does
- Runs a 27-test suite covering factual accuracy, safety refusals, hallucination resistance, adversarial prompts, and reasoning
- Scores outputs using a 3-tier judge chain: semantic similarity → LLM judge → regex fallback
- Auto-generates adversarial prompt attacks to red-team any endpoint
- Tracks regressions across model versions
- Shows pass/fail rates and per-test inspection in a live dashboard
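To make the judge chain concrete, here is a minimal sketch of a three-tier fallback in Python. It is not the framework's actual code: the function names, the 0.85 threshold, and the `llm_judge` callable are hypothetical, and `difflib.SequenceMatcher` is only a character-level stand-in for real semantic similarity, which would normally compare embeddings.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Hypothetical type: an LLM judge returns True/False, or None if it cannot decide.
LLMJudge = Callable[[str, str], Optional[bool]]


def similarity_tier(output: str, expected: str, threshold: float = 0.85) -> Optional[bool]:
    """Tier 1: pass on a confident match, otherwise defer to the next tier.

    SequenceMatcher is a string-level stand-in, not true semantic similarity.
    """
    ratio = SequenceMatcher(None, output.lower(), expected.lower()).ratio()
    return True if ratio >= threshold else None


def judge_chain(
    output: str,
    expected: str,
    pattern: str,
    llm_judge: Optional[LLMJudge] = None,
) -> Tuple[str, bool]:
    """Return (tier_name, passed), trying each tier in order."""
    verdict = similarity_tier(output, expected)
    if verdict is not None:
        return ("similarity", verdict)

    # Tier 2: ask the LLM judge, if one is configured and it responds.
    if llm_judge is not None:
        try:
            verdict = llm_judge(output, expected)
        except Exception:
            verdict = None  # e.g. rate limit or network error: fall through
        if verdict is not None:
            return ("llm_judge", verdict)

    # Tier 3: the regex fallback always produces a verdict.
    matched = re.search(pattern, output, re.IGNORECASE) is not None
    return ("regex", matched)


print(judge_chain("Paris is the capital of France.",
                  "Paris is the capital of France.",
                  r"\bparis\b"))
# ('similarity', True)
```

Because the last tier always answers, the chain still returns a verdict when the judge model is unavailable. Returning the tier name alongside the result makes it easy to see how often each tier ends up deciding.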
Research Finding
The hallucination scorer hit 86% classification accuracy (43 of 50
cases) versus a 50% random baseline on a 50-case benchmark.
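The comparison above comes down to simple arithmetic, sketched here in Python. The labels and predictions are synthetic placeholders, not the benchmark's data; they are chosen only to mirror a 50-case set. Reading the 50% baseline as "two balanced classes, guessed uniformly at random" is an assumption, and the helper names are hypothetical.

```python
from typing import Sequence


def accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Fraction of cases where the prediction equals the label."""
    if len(labels) == 0 or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    correct = sum(l == p for l, p in zip(labels, predictions))
    return correct / len(labels)


def random_baseline(labels: Sequence[bool]) -> float:
    """Expected accuracy of uniform random guessing over the observed classes."""
    return 1 / len(set(labels))


# Synthetic placeholder data: 25 hallucinated (True) and 25 faithful (False) cases.
labels = [True] * 25 + [False] * 25

# Start from perfect predictions, then flip 7 of them to simulate errors.
predictions = list(labels)
for i in (0, 5, 10, 15, 30, 35, 40):
    predictions[i] = not predictions[i]

print(f"accuracy: {accuracy(labels, predictions):.2f}")  # accuracy: 0.86
print(f"baseline: {random_baseline(labels):.2f}")        # baseline: 0.50
```

With only 50 cases, each case moves accuracy by 2 percentage points, so a larger benchmark would give a tighter estimate.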
Architecture
Flask backend → PostgreSQL → Groq API → Next.js dashboard
Deployed entirely on the free tiers of Render, Vercel, Neon, and Upstash.
Links
- Live demo: https://llm-eval-silk.vercel.app/
- GitHub: https://github.com/AyushkhatiDev/llm-eval
- Research note: https://github.com/AyushkhatiDev/llm-eval/blob/main/FINDINGS.md
- API: https://llm-eval-55pg.onrender.com/api/health
Stack
Flask, SQLAlchemy, Groq SDK, PostgreSQL, Next.js, Framer Motion,
Render, Vercel
About Me
I'm a BCA student from Siliguri, India. I built this in a few weeks
because I wanted a portfolio project that solves a real problem —
not another todo app.
Would love feedback on the scoring approach and architecture.

