Everyone asks: which LLM is the best?
Wrong question.
The right question: which LLM is best for your use case, at your scale, at your budget?
I ran 100 identical prompts across 7 major LLMs. Here's what the data actually showed.
The Numbers Nobody Shows You
| Model | Accuracy | Cost per 1K tokens | Latency |
|---|---|---|---|
| GPT-4o | 88.2% | $0.0080 | 892ms |
| Claude 3.5 Sonnet | 87.6% | $0.0090 | 1240ms |
| GPT-4o-mini | 78.4% | $0.0003 | 432ms |
| Gemini 1.5 Flash | 76.8% | $0.0001 | 380ms |
| Claude 3 Haiku | 74.2% | $0.0010 | 410ms |
| Mistral Small | 71.0% | $0.0010 | 520ms |
| Llama 3 8B | 64.4% | $0.0002 | 680ms |
GPT-4o vs Gemini Flash:
- 11.4-point accuracy gap
- 80x cost gap
- 2.3x speed gap
For most production apps, Gemini 1.5 Flash wins.
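The three gaps above can be rechecked from the two table rows. This is only the arithmetic, with the figures copied from the table; the variable names are mine.

```python
# Figures copied from the GPT-4o and Gemini 1.5 Flash rows of the table.
gpt4o = {"accuracy": 88.2, "cost_per_1k": 0.0080, "latency_ms": 892}
flash = {"accuracy": 76.8, "cost_per_1k": 0.0001, "latency_ms": 380}

# Accuracy gap in percentage points (a difference, not a ratio).
accuracy_gap_points = round(gpt4o["accuracy"] - flash["accuracy"], 1)

# Cost and latency gaps as ratios (how many times more expensive / slower).
cost_ratio = round(gpt4o["cost_per_1k"] / flash["cost_per_1k"])
latency_ratio = round(gpt4o["latency_ms"] / flash["latency_ms"], 1)

print(accuracy_gap_points, cost_ratio, latency_ratio)
```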
Why I Built This
Every AI leaderboard ranks by accuracy.
Your production AWS bill ranks by cost per request.
They are not the same list.
I built an open-source LLM Evaluation Framework that benchmarks any model across five dimensions in a single run:
5 Metrics in One Run
1. Accuracy
Four-strategy cascade scorer:
- Exact string match (case-normalized)
- Prefix normalization (strips "The answer is...")
- Multiple-choice letter extraction
- Fuzzy Levenshtein match at 0.85 threshold
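Here is a minimal sketch of how a four-step cascade like this could work, trying each strategy in order and stopping at the first match. It is not the framework's actual code: the regular expressions, the trailing-punctuation handling, and the function names are assumptions for illustration, and only the 0.85 threshold comes from the list above.

```python
import re

# Assumed patterns; the real scorer may recognise different phrasings.
PREFIX = re.compile(r"^(?:the\s+)?(?:correct\s+)?answer\s+is[:\s]*", re.IGNORECASE)
LETTER = re.compile(r"^\(?([a-d])\)?(?:[.:\s]|$)")


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score(response: str, expected: str, threshold: float = 0.85) -> bool:
    resp = response.strip().lower()
    exp = expected.strip().lower()
    # 1. Exact match after case normalisation.
    if resp == exp:
        return True
    # 2. Strip a leading "The answer is ..." and trailing punctuation.
    stripped = PREFIX.sub("", resp).rstrip(".!").strip()
    if stripped == exp:
        return True
    # 3. Multiple-choice: compare only the leading option letter.
    match = LETTER.match(stripped)
    if len(exp) == 1 and match and match.group(1) == exp:
        return True
    # 4. Fuzzy match on normalised Levenshtein similarity.
    return similarity(stripped, exp) >= threshold


print(score("The answer is Paris.", "Paris"), score("London", "Paris"))
```

Ordering the cheap, strict checks first means the fuzzy comparison only runs on answers that nothing else could settle.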
2. Latency (full percentile breakdown)
- p50, p75, p90, p95, p99
- SLA violation rate against configurable threshold
- Async parallel evaluation via LiteLLM
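As a rough illustration of the percentile breakdown and SLA check, the sketch below uses the nearest-rank definition of a percentile. The framework may well interpolate instead; the function names and the sample data are hypothetical.

```python
def percentile(ordered, pct):
    """Nearest-rank percentile of an ascending list, for an integer pct in 1..100."""
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_report(latencies_ms, sla_ms):
    """Return p50..p99 plus the share of requests slower than the SLA threshold."""
    ordered = sorted(latencies_ms)
    report = {f"p{p}": percentile(ordered, p) for p in (50, 75, 90, 95, 99)}
    violations = sum(1 for value in ordered if value > sla_ms)
    report["sla_violation_rate"] = violations / len(ordered)
    return report


# Hypothetical sample: 100 requests taking 1ms, 2ms, ..., 100ms.
report = latency_report(list(range(1, 101)), sla_ms=90)
print(report)
```

Tail percentiles matter here because a model with a good median can still blow an SLA on its slowest 5% of requests.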
3. Cost per 1K tokens
- From real token counts in API responses
- Computed from the usage the API actually reports, not from estimated token counts
- Supports 15+ model providers
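A sketch of the idea, assuming each response reports prompt and completion token counts the way OpenAI-style `usage` objects do. The price table is a hypothetical placeholder, not real pricing, and the helper names are mine rather than the framework's.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per 1M tokens; substitute your provider's current rates.
PRICE_PER_MILLION = {"example-model": {"input": 0.15, "output": 0.60}}


def request_cost(model: str, usage: Usage) -> float:
    """Cost of one request from its reported token counts."""
    price = PRICE_PER_MILLION[model]
    return (usage.prompt_tokens * price["input"]
            + usage.completion_tokens * price["output"]) / 1_000_000


def cost_per_1k_tokens(model: str, usages: list) -> float:
    """Blended cost per 1K tokens across a whole benchmark run."""
    total_cost = sum(request_cost(model, u) for u in usages)
    total_tokens = sum(u.prompt_tokens + u.completion_tokens for u in usages)
    return 1000 * total_cost / total_tokens


run = [Usage(prompt_tokens=800, completion_tokens=200)]
print(cost_per_1k_tokens("example-model", run))
```

Because input and output tokens are usually priced differently, the blended figure depends on how long your prompts and answers are, which is one more reason to measure on your own data.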
4. Hallucination Rate
- Linguistic signal analysis
- Weighs hedging phrases, uncertainty markers, and ungrounded claims against grounding signals
- Runs entirely locally, zero extra API cost
- Score: 0.0 (grounded) to 1.0 (heavily hallucinating)
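To make the idea concrete, here is a deliberately simple sketch of a local, phrase-based score. The phrase lists and the scoring formula are my assumptions for illustration; the framework's real signals and weighting are not described above and are likely richer.

```python
# Hypothetical signal lists; a real detector would use many more patterns.
RISK_PHRASES = (
    "i think", "probably", "might be", "not sure",   # hedging / uncertainty
    "everyone knows", "it is well known",            # ungrounded claims
)
GROUNDING_PHRASES = ("according to", "based on the", "as stated in")


def hallucination_score(text: str) -> float:
    """Return 0.0 (grounded) to 1.0 (only risk signals present)."""
    lowered = text.lower()
    risk = sum(lowered.count(phrase) for phrase in RISK_PHRASES)
    grounding = sum(lowered.count(phrase) for phrase in GROUNDING_PHRASES)
    if risk == 0:
        return 0.0  # no risk signals, so nothing to flag
    # Share of all detected signals that point towards hallucination.
    return risk / (risk + grounding)


print(hallucination_score("According to the report, it is probably 42."))
```

A heuristic like this costs nothing to run, but it measures how an answer is phrased, not whether it is true, so treat the score as a screening signal.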
5. Reasoning Quality
- Chain-of-thought depth scoring
- Counts reasoning markers, grounding signals, response calibration
- Score: 1 (one-word answer) to 10 (structured multi-step reasoning)
Quick Start
pip install llm-evaluation-framework
export OPENAI_API_KEY=your-key
export GEMINI_API_KEY=your-key
llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100
Output:
| Model | Accuracy | Latency p95 | Cost/1K | Hallucination | Reasoning |
|---|---|---|---|---|---|
| gpt-4o-mini | 78.4% | 891ms | $0.000300 | 0.12 | 7.2 |
| gemini-1.5-flash | 76.8% | 743ms | $0.000100 | 0.15 | 6.8 |
What's Under the Hood
- Async parallel evaluation — runs all samples concurrently with configurable semaphore
- MMLU benchmark — 57 subjects, ~14K questions (Massive Multitask Language Understanding)
- TruthfulQA benchmark — 817 questions designed to expose common misconceptions
- Custom benchmarks: bring your own JSON in the form [{"prompt": "...", "expected": "..."}]
- FastAPI REST API: 12 endpoints, OpenAPI docs
- Streamlit dashboard — radar charts, scatter plots, histograms
- CLI — 7 subcommands
- SQLite persistence — all runs stored, queryable
- PDF report generation — shareable evaluation reports
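Two of these pieces, the custom JSON format and the semaphore-limited async evaluation, fit together roughly as in the sketch below. It is an illustration rather than the framework's code: `fake_model_call` is a stand-in for a real async API call (for example through LiteLLM), and the sample data and function names are made up.

```python
import asyncio
import json

# A custom benchmark in the [{"prompt": ..., "expected": ...}] shape.
BENCHMARK_JSON = """
[{"prompt": "2+2", "expected": "4"},
 {"prompt": "capital of France", "expected": "Paris"}]
"""


async def fake_model_call(prompt: str) -> str:
    """Stand-in for a real model request; replies from a canned lookup."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "")


async def evaluate(samples, max_concurrency: int = 5):
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(sample):
        async with semaphore:
            answer = await fake_model_call(sample["prompt"])
        # Case-normalised exact match, the simplest scoring strategy.
        return answer.strip().lower() == sample["expected"].strip().lower()

    # gather() returns results in the same order as the input samples.
    return await asyncio.gather(*(run_one(s) for s in samples))


samples = json.loads(BENCHMARK_JSON)
results = asyncio.run(evaluate(samples, max_concurrency=2))
accuracy = sum(results) / len(results)
print(results, accuracy)
```

The concurrency limit is what keeps a large run from tripping provider rate limits while still finishing far faster than a sequential loop.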
The Uncomfortable Truth About AI Benchmarks
Leaderboards rank models by accuracy on standardized tests.
Production systems rank models by accuracy per dollar.
Those are very different rankings.
GPT-4o vs Gemini Flash:
- On a leaderboard: GPT-4o wins by 11.4 points
- At 10M requests/month (assuming roughly 1K tokens per request): $80,000 vs $1,000
For most apps, being right 11.4 points more often is NOT worth an extra $79,000/month.
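The monthly figures work out if each request uses roughly 1K tokens. That per-request size is my assumption to make the arithmetic explicit; scale it to your own traffic.

```python
REQUESTS_PER_MONTH = 10_000_000
TOKENS_PER_REQUEST = 1_000  # assumption: about 1K tokens per request


def monthly_cost(cost_per_1k_tokens: float) -> float:
    """Monthly spend in USD for a given price per 1K tokens."""
    thousands_of_tokens = REQUESTS_PER_MONTH * TOKENS_PER_REQUEST / 1000
    return round(thousands_of_tokens * cost_per_1k_tokens, 2)


gpt4o_monthly = monthly_cost(0.0080)   # price from the table
flash_monthly = monthly_cost(0.0001)   # price from the table
difference = round(gpt4o_monthly - flash_monthly, 2)

print(gpt4o_monthly, flash_monthly, difference)
```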
Stop picking LLMs from leaderboards. Start picking them from your data.
Live Demo
Try the accuracy scorer and hallucination detector live — no API key needed:
https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
Links
- GitHub: https://github.com/vignesh2027/LLM-Evaluation-Framework
- HuggingFace Space: https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
- Dataset (1,200 benchmark samples): https://huggingface.co/datasets/vigneshwar234/llm-eval-benchmark
71 tests. 82% coverage. Full CI/CD on GitHub Actions. Open source. Free forever.
If this helped — drop a star on GitHub. Building in public, feedback welcome.