I spent the last few days building an open-source hallucination
benchmark for local LLMs. Here's what I found.
## The setup
- 50 factual questions across 5 categories
- 3 models: llama3.2, mistral, phi3
- Running 100% locally using Ollama - no API keys needed
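The harness boils down to sending each question to a local model and grading the reply. Here's a minimal sketch of the grading step, assuming a loose substring match against the expected answer (the helper names and grading rule are my own, not necessarily what the repo does; the actual model call would go through Ollama's local REST API, e.g. `POST http://localhost:11434/api/generate`):

```python
def is_correct(model_reply: str, expected: str) -> bool:
    """Loose grading: the expected answer must appear somewhere
    in the model's reply, ignoring case and surrounding whitespace."""
    return expected.strip().lower() in model_reply.strip().lower()

def score(results: list[tuple[str, str]]) -> float:
    """results: (model_reply, expected_answer) pairs.
    Returns accuracy as a fraction in [0, 1]."""
    correct = sum(is_correct(reply, exp) for reply, exp in results)
    return correct / len(results)
```

Substring grading keeps the harness free of an LLM judge, at the cost of some brittleness on formatting differences (more on that below in the failures).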
## The leaderboard
| Model | Accuracy | Correct/Total | Avg Latency |
|---|---|---|---|
| llama3.2 | 94% | 47/50 | 5141ms |
| phi3 | 88% | 44/50 | 12780ms |
| mistral | 86% | 43/50 | 11218ms |
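The accuracy column is just correct/total; you can sanity-check the table in a couple of lines (counts copied from the table above):

```python
# Reproduce the accuracy column from the correct/total counts.
rows = [("llama3.2", 47, 50), ("phi3", 44, 50), ("mistral", 43, 50)]
for model, correct, total in rows:
    print(f"{model}: {correct / total:.0%}")  # llama3.2: 94%, phi3: 88%, mistral: 86%
```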
## The failures
llama3.2 failed on:
- "What is the speed of light in km/s?" → expected 299792
- "What is the capital of Brazil?" → expected Brasilia
- "What is the closest star to Earth?" → expected Sun
## What I tested next
I ran four prompting techniques on a 20-question subset to test
whether smarter prompting reduces hallucinations:
- Baseline (plain question)
- Chain-of-thought (think step by step)
- Self-consistency (ask 5 times, take majority answer)
- RAG grounding (attach Wikipedia context before answering)
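Of these, self-consistency is the only one that changes the harness itself: ask the same question several times and keep the majority answer. A minimal sketch (the `ask` callable is a hypothetical wrapper around a local Ollama call):

```python
from collections import Counter

def self_consistency(ask, question: str, n: int = 5) -> str:
    """Ask the model the same question n times and return the
    most common answer. `ask` is any callable question -> answer."""
    answers = [ask(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting only helps when the model's errors are inconsistent; if it confidently repeats the same wrong answer five times, the vote just ratifies the hallucination.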
Result: all four scored 95%. llama3.2 is near-ceiling on
structured factual QA, so prompting strategy doesn't move the needle
when the model already knows the facts; the bottleneck is question
difficulty, not prompting.
## The code + dataset
GitHub: github.com/sekumohamed/AI_reliability_lab
Dataset: huggingface.co/datasets/sekumohamed/AI_reliability_benchmark
The dataset has 50 questions you can use to benchmark any LLM.
## What's next
Expanding to 200 medical-domain questions and testing
reliability on high-stakes use cases.