I spent the last few days building an open-source hallucination
benchmark for local LLMs. Here's what I found.
## The setup
- 50 factual questions across 5 categories
- 3 models: llama3.2, mistral, phi3
- Running 100% locally using Ollama - no API keys needed
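The harness boils down to sending each question to a local model and grading the reply. Here's a minimal sketch of the grading step, assuming a loose substring match against the expected answer (the helper names and grading rule are my own, not necessarily what the repo does; the actual model call would go through Ollama's local REST API, e.g. `POST http://localhost:11434/api/generate`):

```python
def is_correct(model_reply: str, expected: str) -> bool:
    """Loose grading: the expected answer must appear somewhere
    in the model's reply, ignoring case and surrounding whitespace."""
    return expected.strip().lower() in model_reply.strip().lower()

def score(results: list[tuple[str, str]]) -> float:
    """results: (model_reply, expected_answer) pairs.
    Returns accuracy as a fraction in [0, 1]."""
    correct = sum(is_correct(reply, exp) for reply, exp in results)
    return correct / len(results)
```

Substring grading keeps the harness free of an LLM judge, at the cost of some brittleness on formatting differences (more on that below in the failures).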
## The leaderboard
| Model | Accuracy | Correct/Total | Avg Latency |
|---|---|---|---|
| llama3.2 | 94% | 47/50 | 5141ms |
| phi3 | 88% | 44/50 | 12780ms |
| mistral | 86% | 43/50 | 11218ms |
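The accuracy column is just correct/total; you can sanity-check the table in a couple of lines (counts copied from the table above):

```python
# Reproduce the accuracy column from the correct/total counts.
rows = [("llama3.2", 47, 50), ("phi3", 44, 50), ("mistral", 43, 50)]
for model, correct, total in rows:
    print(f"{model}: {correct / total:.0%}")  # llama3.2: 94%, phi3: 88%, mistral: 86%
```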
## The failures
llama3.2 failed on:
- "What is the speed of light in km/s?" → expected 299792
- "What is the capital of Brazil?" → expected Brasilia
- "What is the closest star to Earth?" → expected Sun
## What I tested next
I ran four prompting techniques on a 20-question subset to test
whether smarter prompting reduces hallucinations:
- Baseline (plain question)
- Chain-of-thought (think step by step)
- Self-consistency (ask 5 times, take majority answer)
- RAG grounding (attach Wikipedia context before answering)
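Of these, self-consistency is the only one that changes the harness itself: ask the same question several times and keep the majority answer. A minimal sketch (the `ask` callable is a hypothetical wrapper around a local Ollama call):

```python
from collections import Counter

def self_consistency(ask, question: str, n: int = 5) -> str:
    """Ask the model the same question n times and return the
    most common answer. `ask` is any callable question -> answer."""
    answers = [ask(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting only helps when the model's errors are inconsistent; if it confidently repeats the same wrong answer five times, the vote just ratifies the hallucination.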
Result: all four scored 95%. llama3.2 is near-ceiling on
structured factual QA, so prompting strategy doesn't move the needle
when the model already knows the facts; the bottleneck is question
difficulty, not prompting.
## The code + dataset
GitHub: github.com/sekumohamed/AI_reliability_lab
Dataset: huggingface.co/datasets/sekumohamed/AI_reliability_benchmark
The dataset has 50 questions you can use to benchmark any LLM.
## What's next
Expanding to 200 medical-domain questions and testing
reliability on high-stakes use cases.