DEV Community

OpenMark

I Built a Tool to Benchmark 100+ LLMs on My Actual Use Case — Here's What I Learned

Static leaderboards rank LLMs on generic benchmarks like MMLU and HumanEval. But when I needed to pick a model for my specific task — extracting structured data from messy legal documents — those scores were useless.

So I built OpenMark, an open benchmarking platform that lets you test 100+ AI models on your actual prompt.

The Problem with Static Benchmarks

Every week there's a new "State of the Art" model. MMLU scores keep climbing. But here's the thing:

  • A model that tops coding benchmarks might be terrible at your specific domain
  • Pricing varies 100x between providers — and leaderboards don't show real API costs
  • Response quality can vary between runs — stability matters as much as peak performance

I was tired of switching models based on hype, only to find they performed worse on my actual workload.

What I Built

OpenMark lets you:

  1. Write your real prompt (or describe your task and let AI generate a benchmark YAML)
  2. Select models across providers — GPT-5.2, Claude 4.5 Sonnet, Gemini 3.0 Flash, DeepSeek chat, Llama, Mistral, and 100+ more
  3. Run a benchmark that hits real APIs and scores responses deterministically
  4. Compare results with actual latency, token costs, and consistency metrics
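To make step 1 concrete, here's a rough sketch of what a benchmark definition could look like. The field names and model identifiers below are my own illustration of the idea, not OpenMark's actual YAML schema:

```yaml
# Hypothetical benchmark definition -- illustrative, not OpenMark's real schema
name: legal-extraction
prompt: |
  Extract the party names, effective date, and governing law
  from the contract below. Respond with JSON only.
models:
  - openai/gpt-4.1-mini
  - google/gemini-2.0-flash
  - anthropic/claude-4.5-sonnet
iterations: 5          # repeat each model to measure stability
scoring:
  type: deterministic
  checks:
    - valid_json
    - contains_field: party_names
```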

Here's what makes it different from playgrounds and arenas:

| Feature | Static Leaderboard | Playground/Arena | OpenMark |
| --- | --- | --- | --- |
| Your actual task | ❌ | ✅ (manual) | ✅ (automated) |
| Real API costs | ❌ | ❌ | ✅ |
| Deterministic scoring | ✅ | ❌ (vibes) | ✅ |
| 100+ models at once | ✅ | ❌ (2-4) | ✅ |
| Stability metrics | ❌ | ❌ | ✅ |
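"Deterministic scoring" just means the same response always gets the same score, so rankings can't drift with a judge's mood. A minimal sketch of the idea, using a hypothetical rubric for the legal-extraction task (the `party_name` field is my invented example, not a real OpenMark check):

```python
import json
import re


def _parses_as_json(text: str) -> bool:
    """True if the text is valid JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def score_response(text: str) -> float:
    """Deterministic rubric: identical input always yields an identical score.

    Each check is a pure function of the response text, so there is
    no LLM-as-judge variance ("vibes") in the result.
    """
    checks = [
        lambda t: t.strip().startswith("{"),            # looks like a JSON object?
        _parses_as_json,                                # actually valid JSON?
        lambda t: bool(re.search(r'"party_name"', t)),  # required field present?
    ]
    return sum(check(text) for check in checks) / len(checks)
```

Because every check is a pure text function, re-scoring the same saved responses later reproduces the leaderboard exactly.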

Surprising Things I Learned

After running hundreds of benchmarks, some patterns emerged:

1. Expensive ≠ Better (for your task)

For straightforward extraction tasks, GPT-4.1 Mini and Gemini 2.0 Flash consistently matched or beat models costing 10-50x more. The expensive models shine on complex reasoning — but most production prompts don't need that.
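One way to make "expensive ≠ better" measurable is to compare cost per correct answer rather than raw accuracy. A small sketch, with made-up prices (these are not real API rates):

```python
def cost_per_correct(runs: list[tuple[float, bool]]) -> float:
    """runs: (api_cost_usd, was_correct) for each benchmark iteration.

    A cheap model that is right 9/10 times can beat a pricey model
    that is right 10/10 times on this metric.
    """
    total_cost = sum(cost for cost, _ in runs)
    correct = sum(1 for _, ok in runs if ok)
    return float("inf") if correct == 0 else total_cost / correct


# Illustrative numbers only: a $0.001/run model at 90% accuracy
# vs. a $0.02/run model at 100% accuracy.
cheap = [(0.001, True)] * 9 + [(0.001, False)]
pricey = [(0.02, True)] * 10
```

Here the cheap model wins on cost per correct answer by roughly 18x, even though the pricey model has the higher accuracy.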

2. Stability Varies Wildly

Some models give you a perfect answer 9/10 times and garbage the 10th. If you're building production systems, that 10% failure rate matters more than the peak score. Running multiple iterations revealed which models you can actually trust.
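The "9/10 perfect, 1/10 garbage" pattern is easy to surface once you score repeated runs of the same prompt. A minimal sketch of a stability summary (the 0.5 failure threshold is my arbitrary choice):

```python
from statistics import mean, pstdev


def stability_report(scores: list[float]) -> dict:
    """Summarize repeated runs of one prompt against one model.

    A high mean paired with a low min and nonzero failure_rate is
    exactly the pattern a single-run benchmark hides.
    """
    return {
        "mean": mean(scores),
        "min": min(scores),
        "stdev": pstdev(scores),
        "failure_rate": sum(s < 0.5 for s in scores) / len(scores),
    }
```

For a model that scores 1.0 nine times and 0.0 once, the mean of 0.9 looks fine, but the 10% failure rate is what actually matters in production.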

3. The "Best" Model Changes With Your Prompt

I tested the same concept with three different prompt phrasings. The model rankings reshuffled each time. This is why static benchmarks are misleading — they test one phrasing, once.

4. Newer Isn't Always Better

The latest release often has rough edges. Models that have been available for a few months tend to be more stable and better optimized for cost.

How to Try It

You can run a benchmark in under 60 seconds:

  1. Go to openmark.ai
  2. Describe your task (e.g., "Which LLM is best at summarizing medical research papers?")
  3. Click Quick Benchmark — it auto-generates a task, picks diverse models, and starts running
  4. Watch results stream in with scores, costs, and latency

This is the "quick" flow. You can also go fully hands-on: create everything from scratch and pick exactly the configuration and models you want to run.

The free tier gives you 100 credits to start. A typical benchmark across 8 models costs ~4-8 credits.

What's Next

I'm working on:

  • Programmatic content pages for common comparisons (best LLM for coding, best LLM for writing, etc.)
  • Benchmark history so you can track model improvements over time
  • Team sharing for collaborative evaluation

If you've been picking models based on Twitter hype or static leaderboards, give it a shot with your actual use case. The results might surprise you, and they can be genuinely valuable. For me, this is how I decided which models should power my RAG pipeline's agentic flows.


🔗 OpenMark — Benchmark AI Models on Your Actual Task

What's your experience been with model selection? Have you found that benchmark scores match your real-world results? Drop a comment 👇
