sekumohamed
I benchmarked 3 local LLMs on 50 factual questions - here's what failed

I spent the last few days building an open-source hallucination
benchmark for local LLMs. Here's what I found.

The setup

  • 50 factual questions across 5 categories
  • 3 models: llama3.2, mistral, phi3
  • Running 100% locally using Ollama - no API keys needed
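
If you want to reproduce the setup: Ollama exposes a REST endpoint on localhost by default, so asking one question is a single POST. A minimal sketch of the ask-and-grade loop (the loose substring grading here is a simplification, not necessarily the exact harness in the repo):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def ask(model: str, question: str) -> str:
    """Send one prompt to a locally running Ollama model and return its reply."""
    payload = json.dumps({"model": model, "prompt": question, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def is_correct(answer: str, expected: str) -> bool:
    """Loose grading: count it correct if the expected string appears anywhere."""
    return expected.casefold() in answer.casefold()
```

Substring grading is forgiving of phrasing ("The capital is Brasilia.") while still catching outright wrong answers.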

The leaderboard

| Model | Accuracy | Correct/Total | Avg latency |
| --- | --- | --- | --- |
| llama3.2 | 94% | 47/50 | 5141 ms |
| phi3 | 88% | 44/50 | 12780 ms |
| mistral | 86% | 43/50 | 11218 ms |
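
Avg latency is just wall-clock time around each call. Measuring it is a small wrapper around whatever `ask` function you use (a sketch, not the repo's exact code):

```python
import time

def timed_ask(ask, question):
    """Call ask(question) and return (answer, elapsed milliseconds)."""
    t0 = time.perf_counter()
    answer = ask(question)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    return answer, elapsed_ms
```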

The failures

llama3.2 failed on:

  • "What is the speed of light in km/s?" → expected 299792
  • "What is the capital of Brazil?" → expected Brasilia
  • "What is the closest star to Earth?" → expected Sun

What I tested next

I ran 4 prompting techniques on all 20 questions to test
whether smarter prompting reduces hallucinations:

  • Baseline (plain question)
  • Chain-of-thought (think step by step)
  • Self-consistency (ask 5 times, take majority answer)
  • RAG grounding (attach Wikipedia context before answering)
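
Self-consistency is the only one of the four that changes the sampling loop: same prompt, several samples, majority vote. A sketch, assuming any `ask(question) -> str` callable:

```python
from collections import Counter

def self_consistency(ask, question, n=5):
    """Ask the same question n times and return the most common answer."""
    answers = [ask(question).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice you want to normalize answers (casing, trailing punctuation) before voting, otherwise "Brasilia" and "Brasilia." split the vote.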

Result: all 4 scored 95% - llama3.2 is near-ceiling on
structured factual QA. When the model already knows the facts,
prompting strategy doesn't move the needle; the bottleneck is
question difficulty, not prompting.

The code + dataset

GitHub: github.com/sekumohamed/AI_reliability_lab
Dataset: huggingface.co/datasets/sekumohamed/AI_reliability_benchmark

The dataset has 50 questions you can use to benchmark any LLM.
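
To benchmark your own model against it, all you need is a grading loop over the rows. A sketch that reports per-category accuracy (the field names `question`/`expected`/`category` are illustrative - check the dataset card for the exact schema):

```python
from collections import defaultdict

def run_benchmark(rows, ask):
    """Grade every question with loose substring matching, grouped by category.

    Expects rows shaped like {"question": ..., "expected": ..., "category": ...}
    (illustrative field names) and any ask(question) -> str callable.
    """
    stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for row in rows:
        answer = ask(row["question"])
        if row["expected"].casefold() in answer.casefold():
            stats[row["category"]][0] += 1
        stats[row["category"]][1] += 1
    return {cat: correct / total for cat, (correct, total) in stats.items()}
```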

What's next

Expanding to 200 medical domain questions and testing
reliability on high-stakes use cases.
