vigneshwar

Posted on Jun 8 • Originally published at github.com

I benchmarked 7 LLMs on 100 identical prompts. The cost gap shocked me.

#python #llm #machinelearning #opensource

Everyone asks: which LLM is the best?

Wrong question.

The right question: which LLM is best for your use case, at your scale, at your budget?

I ran 100 identical prompts across 7 major LLMs. Here's what the data actually showed.

The Numbers Nobody Shows You

Model	Accuracy	Cost/1K tokens	Latency
GPT-4o	88.2%	$0.0080	892ms
Claude 3.5 Sonnet	87.6%	$0.0090	1240ms
GPT-4o-mini	78.4%	$0.0003	432ms
Gemini 1.5 Flash	76.8%	$0.0001	380ms
Claude 3 Haiku	74.2%	$0.0010	410ms
Mistral Small	71.0%	$0.0010	520ms
Llama 3 8B	64.4%	$0.0002	680ms

GPT-4o vs Gemini Flash:

11.4-point accuracy gap (88.2% vs 76.8%)
80x cost gap
2.3x speed gap

For most production apps, Gemini wins.
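The three gaps come straight from the table rows. A quick sketch of the arithmetic, using only the numbers above (the variable names are just for illustration):

```python
# Rows copied from the results table above.
gpt4o = {"accuracy": 88.2, "cost_per_1k": 0.0080, "latency_ms": 892}
flash = {"accuracy": 76.8, "cost_per_1k": 0.0001, "latency_ms": 380}

# Accuracy gap is a difference in percentage points, not a ratio.
accuracy_gap_points = gpt4o["accuracy"] - flash["accuracy"]

# Cost and latency gaps are ratios (how many times more expensive / slower).
cost_ratio = gpt4o["cost_per_1k"] / flash["cost_per_1k"]
latency_ratio = gpt4o["latency_ms"] / flash["latency_ms"]

print(f"accuracy gap: {accuracy_gap_points:.1f} points")
print(f"cost gap:     {cost_ratio:.0f}x")
print(f"latency gap:  {latency_ratio:.1f}x")
```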

Why I Built This

Every AI leaderboard ranks by accuracy.

Your production AWS bill ranks by cost per request.

They are not the same list.

I built an open-source LLM Evaluation Framework that benchmarks any model across five dimensions in a single run:

5 Metrics in One Run

1. Accuracy
Four-strategy cascade scorer:

Exact string match (case-normalized)
Prefix normalization (strips "The answer is...")
Multiple-choice letter extraction
Fuzzy Levenshtein match at 0.85 threshold
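Here's a minimal sketch of how a four-step cascade like this could work. It is not the framework's actual code: the function names, the regexes, and the choice to normalize similarity as 1 - distance / longest length are all assumptions made for illustration. Each strategy is tried in order and the first one that matches wins.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical patterns; a real scorer would likely cover more phrasings.
PREFIX = re.compile(r"^(the\s+)?(correct\s+)?answer\s+is[:\s]*", re.IGNORECASE)
CHOICE = re.compile(r"^\(?([A-D])\)?(?:[.:)\s]|$)", re.IGNORECASE)

def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def score(response: str, expected: str, threshold: float = 0.85):
    """Return the name of the first strategy that matches, or None."""
    want = normalize(expected)
    if normalize(response) == want:
        return "exact"
    body = PREFIX.sub("", response.strip())  # drop "The answer is ..."
    if normalize(body) == want:
        return "prefix"
    letter = CHOICE.match(body)
    if letter and letter.group(1).lower() == want:
        return "choice"
    if similarity(normalize(body), want) >= threshold:
        return "fuzzy"
    return None

print(score("The answer is Paris.", "Paris"))
print(score("B) Mitochondria", "B"))
print(score("photosynthesys", "photosynthesis"))
```

Ordering the cheap, strict checks first means the fuzzy match only runs on responses nothing else could score.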

2. Latency (full percentile breakdown)

p50, p75, p90, p95, p99
SLA violation rate against configurable threshold
Async parallel evaluation via LiteLLM
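As a rough sketch, the percentile breakdown and SLA violation rate can be computed from a list of per-request latencies like this. The nearest-rank percentile definition and the function names are assumptions; the framework may interpolate differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(latencies_ms, sla_ms):
    """Percentile breakdown plus the share of requests slower than the SLA threshold."""
    report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 75, 90, 95, 99)}
    violations = sum(1 for value in latencies_ms if value > sla_ms)
    report["sla_violation_rate"] = violations / len(latencies_ms)
    return report

# Synthetic latencies of 1..100 ms, purely to show the shape of the result.
print(latency_report(list(range(1, 101)), sla_ms=90))
```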

3. Cost per 1K tokens

From real token counts in API responses
Not estimates: costs are computed from the token usage each API response reports
Supports 15+ model providers
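A minimal sketch of that calculation, assuming each response reports OpenAI-style `prompt_tokens` and `completion_tokens` counts. The model name and the prices in the table are placeholders, not real rates, and the helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Token counts as reported in one API response."""
    prompt_tokens: int
    completion_tokens: int

# Placeholder USD prices per 1M tokens as (input, output). Not real rates.
PRICES_PER_MILLION = {"example-model": (0.15, 0.60)}

def request_cost(model: str, usage: Usage) -> float:
    """Dollar cost of a single request from its reported token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (usage.prompt_tokens * in_price
            + usage.completion_tokens * out_price) / 1_000_000

def cost_per_1k_tokens(model: str, usages: list[Usage]) -> float:
    """Total spend divided by total tokens, scaled to 1K tokens."""
    total_cost = sum(request_cost(model, u) for u in usages)
    total_tokens = sum(u.prompt_tokens + u.completion_tokens for u in usages)
    return 1000 * total_cost / total_tokens

run = [Usage(800, 200), Usage(600, 400)]
print(f"${cost_per_1k_tokens('example-model', run):.6f} per 1K tokens")
```

Because input and output tokens are usually priced differently, the blended per-1K figure depends on how chatty the model is on your prompts, which is why it is worth measuring rather than reading off a price sheet.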

4. Hallucination Rate

Linguistic signal analysis
Detects hedging phrases, uncertainty markers, ungrounded claims vs grounding signals
Runs entirely locally, zero extra API cost
Score: 0.0 (grounded) to 1.0 (heavily hallucinating)
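To give a feel for the idea, here is a deliberately simple sketch of a phrase-based score. The phrase lists, the ratio formula, and the rule that a response with no signals scores 0.0 are all assumptions for illustration; the framework's real signal set and weighting are not shown in this post.

```python
# Hypothetical signal lists; a real detector would use many more phrases.
RISK_PHRASES = ("i think", "i believe", "probably", "might be",
                "not sure", "as far as i know")
GROUNDING_PHRASES = ("according to", "based on the", "as stated in",
                     "the document says")

def count_hits(text: str, phrases) -> int:
    """Count case-insensitive occurrences of each phrase in the text."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)

def hallucination_score(text: str) -> float:
    """0.0 = only grounding signals, 1.0 = only uncertainty signals."""
    risk = count_hits(text, RISK_PHRASES)
    grounded = count_hits(text, GROUNDING_PHRASES)
    if risk + grounded == 0:
        return 0.0  # assumption: no signals either way is treated as grounded
    return risk / (risk + grounded)

print(hallucination_score("According to the report, it is probably 42."))
```

Nothing here calls a model, which is what keeps this kind of check local and free to run.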

5. Reasoning Quality

Chain-of-thought depth scoring
Counts reasoning markers, grounding signals, response calibration
Score: 1 (one-word answer) to 10 (structured multi-step reasoning)
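A toy version of the marker-counting part might look like the sketch below. It only counts reasoning markers and maps the count onto the 1 to 10 scale; the marker list and the "1 plus one point per marker" mapping are assumptions, and the grounding and calibration signals mentioned above are left out.

```python
import re

# Hypothetical marker list, matched as whole words.
REASONING_MARKERS = ("because", "therefore", "first", "second",
                     "then", "thus", "hence", "finally")

def reasoning_score(text: str) -> int:
    """Map the number of reasoning markers onto a 1..10 scale."""
    lowered = text.lower()
    markers = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", lowered))
        for marker in REASONING_MARKERS
    )
    return max(1, min(10, 1 + markers))

print(reasoning_score("Paris"))
print(reasoning_score("First, 12 * 3 = 36. Then 36 + 4 = 40. "
                      "Therefore the answer is 40 because both steps check out."))
```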

Quick Start

pip install llm-evaluation-framework

export OPENAI_API_KEY=your-key
export GEMINI_API_KEY=your-key

llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100

Output:

Model              Accuracy  Latency p95  Cost/1K    Hallucination  Reasoning
gpt-4o-mini        78.4%     891ms        $0.000300  0.12           7.2
gemini-1.5-flash   76.8%     743ms        $0.000100  0.15           6.8

What's Under the Hood

Async parallel evaluation — runs all samples concurrently with configurable semaphore
MMLU benchmark — 57 subjects, ~14K questions (Massive Multitask Language Understanding)
TruthfulQA benchmark — 817 questions designed to expose common misconceptions
Custom benchmarks — bring your own JSON: [{"prompt": "...", "expected": "..."}]
FastAPI REST API — 12 endpoints, OpenAPI docs
Streamlit dashboard — radar charts, scatter plots, histograms
CLI — 7 subcommands
SQLite persistence — all runs stored, queryable
PDF report generation — shareable evaluation reports
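Two of these pieces are easy to picture in code: the custom benchmark format and semaphore-limited concurrency. The sketch below loads items in the JSON shape shown above and evaluates them with `asyncio`. The `fake_model` coroutine is a stand-in invented for this example; in a real run it would be an async API call (for instance through LiteLLM's `acompletion`), and the framework's own internals may look different.

```python
import asyncio
import json

# Custom benchmark in the documented shape: a list of prompt/expected pairs.
BENCHMARK_JSON = """[
  {"prompt": "2+2?", "expected": "4"},
  {"prompt": "Capital of France?", "expected": "Paris"}
]"""

async def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; one answer is wrong on purpose."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"2+2?": "4", "Capital of France?": "Lyon"}[prompt]

async def evaluate(items, max_concurrency: int = 5) -> float:
    """Run all items concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item) -> bool:
        async with semaphore:  # blocks while max_concurrency calls are active
            answer = await fake_model(item["prompt"])
        return answer.strip().lower() == item["expected"].strip().lower()

    results = await asyncio.gather(*(run_one(item) for item in items))
    return sum(results) / len(results)

accuracy = asyncio.run(evaluate(json.loads(BENCHMARK_JSON)))
print(f"accuracy: {accuracy:.0%}")
```

The semaphore is what lets you fire off every sample at once without tripping provider rate limits: raise `max_concurrency` for speed, lower it if you start seeing 429s.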

The Uncomfortable Truth About AI Benchmarks

Leaderboards rank models by accuracy on standardized tests.

Production systems rank models by accuracy per dollar.

Those are very different rankings.

GPT-4o vs Gemini Flash:

On a leaderboard: GPT-4o wins by 11.4 points
At 10M requests/month (about 1K tokens per request): $80,000 vs $1,000

For most apps, 11 more correct answers per 100 requests are NOT worth $79,000/month.
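Here is the back-of-the-envelope math, plus an "accuracy per dollar" ranking built from the results table. The 1,000 tokens per request figure is an assumption (it is the request size at which the per-1K-token prices produce the monthly totals above); plug in your own traffic numbers.

```python
# (accuracy %, cost in USD per 1K tokens), copied from the results table.
MODELS = {
    "GPT-4o": (88.2, 0.0080),
    "Claude 3.5 Sonnet": (87.6, 0.0090),
    "GPT-4o-mini": (78.4, 0.0003),
    "Gemini 1.5 Flash": (76.8, 0.0001),
    "Claude 3 Haiku": (74.2, 0.0010),
    "Mistral Small": (71.0, 0.0010),
    "Llama 3 8B": (64.4, 0.0002),
}

REQUESTS_PER_MONTH = 10_000_000
TOKENS_PER_REQUEST = 1_000  # assumption: average request size

def monthly_cost(cost_per_1k: float) -> float:
    """Monthly spend in USD at the assumed traffic and request size."""
    return REQUESTS_PER_MONTH * TOKENS_PER_REQUEST / 1000 * cost_per_1k

def accuracy_per_dollar(name: str) -> float:
    """Accuracy points bought per dollar of 1K-token cost."""
    accuracy, cost_per_1k = MODELS[name]
    return accuracy / cost_per_1k

ranking = sorted(MODELS, key=accuracy_per_dollar, reverse=True)

for name in ranking:
    accuracy, cost = MODELS[name]
    print(f"{name:<18} {accuracy:>5.1f}%  ${monthly_cost(cost):>9,.0f}/month")
```

Sorted this way, the order looks nothing like the accuracy column: the two most accurate models land at the bottom.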

Stop picking LLMs from leaderboards. Start picking them from your data.

Live Demo

Try the accuracy scorer and hallucination detector live — no API key needed:

https://huggingface.co/spaces/vigneshwar234/llm-eval-demo
