vigneshwar

Originally published at github.com

I benchmarked 7 LLMs on 100 identical prompts. The cost gap shocked me.

Everyone asks: which LLM is the best?

Wrong question.

The right question: which LLM is best for your use case, at your scale, at your budget?

I ran 100 identical prompts across 7 major LLMs. Here's what the data actually showed.


The Numbers Nobody Shows You

Model              Accuracy  Cost/1K tokens  Latency
GPT-4o             88.2%     $0.0080         892ms
Claude 3.5 Sonnet  87.6%     $0.0090         1240ms
GPT-4o-mini        78.4%     $0.0003         432ms
Gemini 1.5 Flash   76.8%     $0.0001         380ms
Claude 3 Haiku     74.2%     $0.0010         410ms
Mistral Small      71.0%     $0.0010         520ms
Llama 3 8B         64.4%     $0.0002         680ms

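GPT-4o vs Gemini 1.5 Flash:

  • 11.4-point accuracy gap (88.2% vs 76.8%)
  • 80x cost gap ($0.0080 vs $0.0001 per 1K tokens)
  • 2.3x latency gap (892ms vs 380ms)

For most production apps, where cost and latency matter more than the last few points of accuracy, Gemini wins.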
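The three gaps follow directly from the table rows for GPT-4o and Gemini 1.5 Flash. A quick way to reproduce them, using only the published figures (the dictionary layout and variable names are mine, not the framework's):

```python
# Figures copied from the benchmark table for the two models being compared.
gpt4o = {"accuracy": 88.2, "cost_per_1k": 0.0080, "latency_ms": 892}
flash = {"accuracy": 76.8, "cost_per_1k": 0.0001, "latency_ms": 380}

# Accuracy is compared in percentage points, cost and latency as ratios.
accuracy_gap_points = gpt4o["accuracy"] - flash["accuracy"]
cost_ratio = gpt4o["cost_per_1k"] / flash["cost_per_1k"]
latency_ratio = gpt4o["latency_ms"] / flash["latency_ms"]

print(f"accuracy gap: {accuracy_gap_points:.1f} points")
print(f"cost ratio: {cost_ratio:.0f}x")
print(f"latency ratio: {latency_ratio:.1f}x")
```

Note that the accuracy gap is in percentage points. Relative to Gemini's own score, GPT-4o is about 15% more accurate, so it is worth being explicit about which of the two you mean.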


Why I Built This

Every AI leaderboard ranks by accuracy.

Your production AWS bill ranks by cost per request.

They are not the same list.

I built an open-source LLM Evaluation Framework that benchmarks any model on five metrics in a single run:

5 Metrics in One Run

1. Accuracy
Four-strategy cascade scorer:

  • Exact string match (case-normalized)
  • Prefix normalization (strips "The answer is...")
  • Multiple-choice letter extraction
  • Fuzzy Levenshtein match at 0.85 threshold
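To make the cascade concrete, here is a minimal sketch of how the four strategies could be chained. It is not the framework's actual code: the regex patterns, the `normalize` helper, and the choice to normalize edit distance by the longer string are all assumptions for illustration.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Hypothetical patterns; the real prefix and letter rules are not shown in the post.
PREFIX = re.compile(r"^(the\s+)?(correct\s+)?answer\s+is[:\s]*", re.IGNORECASE)
LETTER = re.compile(r"^\(?([A-D])\)?(?:[.:)\s]|$)")

def normalize(text: str) -> str:
    """Lowercase and drop surrounding whitespace and trailing periods."""
    return text.strip().strip(".").strip().lower()

def is_correct(response: str, expected: str, threshold: float = 0.85) -> bool:
    # 1. Exact string match, case-normalized.
    if normalize(response) == normalize(expected):
        return True
    # 2. Strip a leading "The answer is ..." and compare again.
    stripped = PREFIX.sub("", response.strip())
    if normalize(stripped) == normalize(expected):
        return True
    # 3. Multiple choice: extract a leading option letter such as "B)" or "(B)".
    if expected.strip().upper() in {"A", "B", "C", "D"}:
        match = LETTER.match(stripped)
        if match and match.group(1) == expected.strip().upper():
            return True
    # 4. Fuzzy match: accept near-misses such as small typos.
    return similarity(normalize(stripped), normalize(expected)) >= threshold

print(is_correct("The answer is Paris.", "Paris"))
print(is_correct("B) Paris", "B"))
print(is_correct("London", "Paris"))
```

Ordering the checks from strictest to loosest means the fuzzy step only runs when nothing cheaper has matched.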

2. Latency (full percentile breakdown)

  • p50, p75, p90, p95, p99
  • SLA violation rate against configurable threshold
  • Async parallel evaluation via LiteLLM
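As an illustration of the percentile breakdown and SLA check, here is a small sketch using the nearest-rank method. The function names and the sample latencies are made up, and the framework may well use a different percentile definition (for example, linear interpolation).

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(latencies_ms, sla_ms):
    """Percentile breakdown plus the share of requests slower than the SLA threshold."""
    report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 75, 90, 95, 99)}
    violations = sum(1 for value in latencies_ms if value > sla_ms)
    report["sla_violation_rate"] = violations / len(latencies_ms)
    return report

# Made-up per-request latencies in milliseconds.
latencies = [380, 410, 395, 432, 520, 680, 405, 1240, 415, 892]
report = latency_report(latencies, sla_ms=800)
print(report)
```

With only a handful of samples the high percentiles collapse onto the slowest request, which is one reason to run more than a few dozen prompts before trusting p95 or p99.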

3. Cost per 1K tokens

  • From real token counts in API responses
  • Not estimates — actual billing data
  • Supports 15+ model providers
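A rough sketch of the cost calculation, assuming each response reports prompt and completion token counts (as OpenAI-style `usage` objects do). The `Usage` class, the model name, and the prices below are placeholders, not the framework's data model or any provider's real pricing.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int

# Placeholder prices in USD per 1K tokens; NOT real provider pricing.
PRICES = {"example-model": {"input": 0.0005, "output": 0.0015}}

def request_cost(model: str, usage: Usage) -> float:
    """Cost of one request from its reported token counts."""
    price = PRICES[model]
    return (usage.prompt_tokens / 1000 * price["input"]
            + usage.completion_tokens / 1000 * price["output"])

def cost_per_1k_tokens(model: str, usages: list) -> float:
    """Blended cost per 1K tokens across a whole benchmark run."""
    total_cost = sum(request_cost(model, u) for u in usages)
    total_tokens = sum(u.prompt_tokens + u.completion_tokens for u in usages)
    return total_cost / total_tokens * 1000

usages = [Usage(800, 200), Usage(600, 400)]
blended = cost_per_1k_tokens("example-model", usages)
print(f"${blended:.6f} per 1K tokens")
```

Because input and output tokens are usually priced differently, the blended figure depends on how long the answers are, so it will shift between benchmarks.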

4. Hallucination Rate

  • Linguistic signal analysis
  • Weighs hedging phrases, uncertainty markers, and ungrounded claims against grounding signals
  • Runs entirely locally, zero extra API cost
  • Score: 0.0 (grounded) to 1.0 (heavily hallucinating)
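The idea can be sketched with simple phrase counting. Everything specific here is hypothetical: the phrase lists, the formula, and the decision to score a response with no signals as 0.0 are stand-ins, since the post does not show the real scoring rules.

```python
# Hypothetical phrase lists; the framework's real signals are not shown in the post.
HEDGING = ("i think", "probably", "might be", "i'm not sure", "as far as i know")
UNGROUNDED = ("studies show", "experts agree", "it is well known", "everyone knows")
GROUNDING = ("according to", "as stated in", "the document says", "source:")

def count_hits(text: str, phrases) -> int:
    """Count case-insensitive occurrences of each phrase in the text."""
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)

def hallucination_score(response: str) -> float:
    """0.0 = grounded, 1.0 = only risk signals; a toy ratio, not a calibrated metric."""
    risk = count_hits(response, HEDGING) + count_hits(response, UNGROUNDED)
    grounding = count_hits(response, GROUNDING)
    if risk + grounding == 0:
        return 0.0  # no signals either way; this sketch treats that as grounded
    return risk / (risk + grounding)

print(hallucination_score("According to the report, it is probably 42."))
```

A heuristic like this measures how a response is phrased, not whether it is true, so a confidently worded wrong answer can still score 0.0.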

5. Reasoning Quality

  • Chain-of-thought depth scoring
  • Counts reasoning markers, grounding signals, response calibration
  • Score: 1 (one-word answer) to 10 (structured multi-step reasoning)
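A toy version of marker-based scoring might look like the following. The marker list and the weighting (2 base points plus 2 per marker, capped at 10) are invented for illustration, and the sketch ignores grounding signals and response calibration entirely.

```python
import re

# Hypothetical reasoning markers; matched as whole words, case-insensitively.
MARKERS = re.compile(
    r"\b(because|therefore|thus|first|second|next|then|finally|step \d+)\b",
    re.IGNORECASE,
)

def reasoning_score(response: str) -> int:
    """Map a response to a 1-10 score based on how many reasoning markers it contains."""
    if len(response.split()) <= 1:
        return 1  # one-word answer: lowest score
    markers = len(MARKERS.findall(response))
    # Invented weighting: 2 base points, 2 per marker, capped at 10.
    return min(10, 2 + 2 * markers)

print(reasoning_score("Paris"))
print(reasoning_score("First, France is in Europe. Then its capital is Paris."))
```

Counting markers rewards the appearance of structure, so a score like this is best read alongside accuracy and not on its own.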

Quick Start

pip install llm-evaluation-framework

export OPENAI_API_KEY=your-key
export GEMINI_API_KEY=your-key

llm-eval compare --models gpt-4o-mini --models gemini/gemini-1.5-flash --benchmark mmlu --samples 100

Output:

Model              Accuracy  Latency p95  Cost/1K    Hallucination  Reasoning
gpt-4o-mini        78.4%     891ms        $0.000300  0.12           7.2
gemini-1.5-flash   76.8%     743ms        $0.000100  0.15           6.8

What's Under the Hood

  • Async parallel evaluation — runs all samples concurrently with configurable semaphore
  • MMLU benchmark — 57 subjects, ~14K questions (Massive Multitask Language Understanding)
  • TruthfulQA benchmark — 817 questions designed to expose common misconceptions
  • Custom benchmarks — bring your own JSON: [{"prompt": "...", "expected": "..."}]
  • FastAPI REST API — 12 endpoints, OpenAPI docs
  • Streamlit dashboard — radar charts, scatter plots, histograms
  • CLI — 7 subcommands
  • SQLite persistence — all runs stored, queryable
  • PDF report generation — shareable evaluation reports
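The semaphore-bounded evaluation over a custom JSON benchmark can be sketched with plain asyncio. `fake_model_call` is a stand-in so the example runs without API keys; in a real run that is where an async client call (for instance through LiteLLM) would go. The function names and the exact-match check are assumptions, not the framework's internals.

```python
import asyncio
import json

async def fake_model_call(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs without keys."""
    await asyncio.sleep(0.01)
    return prompt.upper()

async def evaluate_all(items, max_concurrency: int = 5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def evaluate_one(item):
        async with semaphore:  # at most max_concurrency calls in flight
            response = await fake_model_call(item["prompt"])
        return {"prompt": item["prompt"], "correct": response == item["expected"]}

    # gather returns results in input order, even if calls finish out of order.
    return await asyncio.gather(*(evaluate_one(item) for item in items))

# Custom benchmark in the documented format: a JSON list of prompt/expected pairs.
benchmark = json.loads(
    '[{"prompt": "abc", "expected": "ABC"}, {"prompt": "def", "expected": "XYZ"}]'
)
results = asyncio.run(evaluate_all(benchmark))
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"accuracy: {accuracy:.0%}")
```

Capping concurrency with a semaphore keeps a large run from tripping provider rate limits while still overlapping the network waits.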

The Uncomfortable Truth About AI Benchmarks

Leaderboards rank models by accuracy on standardized tests.

Production systems rank models by accuracy per dollar.

Those are very different rankings.

GPT-4o vs Gemini 1.5 Flash:

  • On a leaderboard: GPT-4o wins by 11.4 points
  • At 10M requests/month (assuming about 1K tokens per request): $80,000 vs $1,000

For most apps, getting the correct answer on 11 more requests out of every 100 is NOT worth an extra $79,000/month.
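The monthly figures only work out if each request uses about 1K tokens, so that assumption is made explicit below. The sketch also assumes benchmark accuracy carries over to production traffic, which it may not. With both assumptions in place, you can put a price on each additional correct answer:

```python
requests_per_month = 10_000_000
tokens_per_request = 1_000  # assumption: the post does not state request size

def monthly_cost(cost_per_1k_tokens: float) -> float:
    """Monthly spend for a given price per 1K tokens."""
    return requests_per_month * tokens_per_request / 1000 * cost_per_1k_tokens

gpt4o_cost = monthly_cost(0.0080)
flash_cost = monthly_cost(0.0001)
extra_cost = gpt4o_cost - flash_cost

# Extra correct answers GPT-4o would deliver, using the 88.2% vs 76.8% figures.
extra_correct = requests_per_month * (0.882 - 0.768)
cost_per_extra_correct = extra_cost / extra_correct

print(f"GPT-4o: ${gpt4o_cost:,.0f}  Gemini 1.5 Flash: ${flash_cost:,.0f}")
print(f"cost per extra correct answer: ${cost_per_extra_correct:.3f}")
```

Under these assumptions each extra correct answer costs roughly seven cents. Whether that is cheap or expensive depends entirely on what a wrong answer costs your product, which is the number to plug in from your own data.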

Stop picking LLMs from leaderboards. Start picking them from your data.


Live Demo

Try the accuracy scorer and hallucination detector live — no API key needed:

https://huggingface.co/spaces/vigneshwar234/llm-eval-demo


Links

71 tests. 82% coverage. Full CI/CD on GitHub Actions. Open source. Free forever.


If this helped — drop a star on GitHub. Building in public, feedback welcome.
