Robin

How to Run an AI Benchmark That Doesn't Lie to You

We're about to publish a comparison page that benchmarks 4 AI setups against 10 real developer tasks. Before we do, here's every design decision we made to make sure the results can't be gamed — including by us.


The problem with most AI benchmarks

Most "AI comparison" content has at least one of these problems:

  1. Cherry-picked prompts — tasks chosen because one model happens to shine on them
  2. Proprietary scoring — a company scoring its own outputs
  3. No raw outputs — you see scores but not what the models actually said
  4. Dynamic data — results that change over time, making past claims unverifiable
  5. Wrong comparison baseline — comparing a fine-tuned model against a base model

Our compare page will have all of these problems if we're not careful. Here's what we're doing about each.


Design decision 1: 10 tasks, chosen before we ran anything

The 10 tasks were finalized before a single API call was made:

  1. Python function with unit tests
  2. Debug a real bug (provided)
  3. Explain async/await in JavaScript
  4. Write unit tests for a given function
  5. Refactor a function for readability
  6. Summarize a 500-word document
  7. Write a git commit message for a real diff
  8. Optimize a slow SQL query
  9. Architecture recommendation for a real problem
  10. Design a REST API for given requirements

We didn't run any of these and then swap in different prompts after seeing bad results. The prompts are locked. If the output for task 6 is embarrassing for one tier, we show it anyway.

Why this matters: The temptation to "just swap one prompt that didn't work well" is how benchmarks quietly become marketing. We locked the prompts first.
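One way to make "the prompts are locked" externally verifiable is to publish a commitment hash of the prompt set before any API call is made. The task list below is from this post; the hashing scheme itself is our sketch, not necessarily what the team does:

```python
# Sketch: make "the prompts are locked" verifiable by publishing a hash
# of the prompt set before running anything. Swapping a prompt after the
# fact would change the hash.
import hashlib
import json

TASKS = [
    "Python function with unit tests",
    "Debug a real bug (provided)",
    "Explain async/await in JavaScript",
    "Write unit tests for a given function",
    "Refactor a function for readability",
    "Summarize a 500-word document",
    "Write a git commit message for a real diff",
    "Optimize a slow SQL query",
    "Architecture recommendation for a real problem",
    "Design a REST API for given requirements",
]

def lock_prompts(prompts: list[str]) -> str:
    """Return a commitment hash; publish it before the benchmark runs.

    Canonical JSON (compact separators, no re-ordering) keeps the hash
    stable across machines and serializer versions.
    """
    canonical = json.dumps(prompts, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

lock = lock_prompts(TASKS)
print(lock)  # anyone can recompute this from the published prompt file
```

Anyone holding the published hash can later recompute it from the prompt file and confirm nothing was quietly swapped.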


Design decision 2: Static human scoring, not AI judging AI

Each output is scored on 2-3 dimensions by us, once, and locked in with a date.

We considered dynamic scoring — running a separate model (like Gemini Pro) on each page load to score outputs. It's technically impressive. We didn't do it because:

  1. AI scoring AI is circular. The model doing the scoring has its own biases. A Gemini-scored benchmark favors Gemini; a Claude-scored benchmark favors Claude.
  2. It hides the scoring logic. If a model scores itself 4.8/5 and we don't show the scoring prompt, you can't verify it.
  3. It adds noise. Scores change between page loads. A snapshot benchmark should be a snapshot.

Static human scoring means you can disagree with us. The score is ours, dated, signed. If you think we scored Task 3 wrong, tell us.
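A "static, dated, signed" score might look like the record below. The field names and the sample date are our invention, not the compare page's real schema; the point is that the record is immutable once written:

```python
# Sketch of a static, dated score record (field names and date are
# placeholders). Once written, a score is never mutated — disagreements
# become new annotations, not silent edits.
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)  # frozen: a score, once written, cannot be changed
class Score:
    task_id: int
    setup: str                    # e.g. "balanced", "premium", "opus-direct"
    dimensions: dict[str, float]  # the 2-3 scored dimensions per output
    scored_by: str
    scored_on: str                # ISO date: the snapshot's timestamp

record = Score(
    task_id=3,
    setup="balanced",
    dimensions={"correctness": 4.0, "clarity": 4.5},
    scored_by="Robin",
    scored_on="2026-01-01",  # placeholder date
)
print(json.dumps(asdict(record), indent=2))
```

The `frozen=True` flag is the enforcement mechanism: any attempt to edit a score after the fact raises an error instead of silently rewriting history.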


Design decision 3: Every full output is visible

Most comparison pages show a summary table or a curated excerpt. We're showing every full response, unedited, with a copy button and a JSON download.

This is the only way a benchmark is honest. If Premium's architecture recommendation is 18,000 words of genuinely useful content, show that. If Frugal's commit message is "Add feature" with no context, show that too.

The response that looks bad is as important as the one that looks good.
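For the JSON download, a per-task export could look like the sketch below. The structure is our assumption, not the page's real schema; the placeholder strings stand in for full responses:

```python
# Sketch of a per-task JSON download: every setup's full, unedited
# response side by side. Field names are assumptions, and the "..."
# strings are placeholders for full outputs.
import json

export = {
    "task": 7,
    "prompt": "Write a git commit message for a real diff",
    "outputs": {
        "frugal": {"text": "Add feature", "truncated": False},
        "balanced": {"text": "...", "truncated": False},
        "premium": {"text": "...", "truncated": False},
        "opus-direct": {"text": "...", "truncated": False},
    },
}

# "Unedited" means a lossless round trip: what you download is exactly
# what the model returned.
blob = json.dumps(export, indent=2, ensure_ascii=False)
assert json.loads(blob) == export
```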


Design decision 4: The competitor column is a direct API call, not ours

Our "Competitor: Opus direct" column calls Claude Opus 4.6 directly via the Anthropic SDK — not through our own endpoint.

This matters because if we route the competitor column through Komilion, any routing overhead, prompt modification, or API quirk contaminates the competitor result. The baseline needs to be genuinely independent to be meaningful.

Practically: Niobe runs a separate script for this column with no Komilion code in the path.
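A standalone runner for that column can be a plain HTTP call to Anthropic's Messages API, with nothing else in the path. The endpoint and header format below follow Anthropic's public API documentation; the model id is a placeholder (check Anthropic's docs for the current Opus identifier), and the function names are ours:

```python
# Sketch of the competitor-column script: a direct call to Anthropic's
# Messages API with no routing layer in the path. Model id is a
# placeholder; headers follow Anthropic's documented format.
import json
import os
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"

def build_request(prompt: str, model: str = "claude-opus-placeholder") -> urllib.request.Request:
    """Assemble the raw HTTP request — no routing, no prompt rewriting."""
    body = json.dumps({
        "model": model,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "x-api-key": os.environ.get("ANTHROPIC_API_KEY", ""),
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )

req = build_request("Optimize a slow SQL query: ...")
# To actually send it: urllib.request.urlopen(req) — omitted from this sketch.
print(json.loads(req.data)["model"])
```

Because the request is assembled in one place, it's easy to audit that the prompt reaching the competitor model is byte-identical to the locked benchmark prompt.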


Design decision 5: Benchmarks are a snapshot, not a permanent claim

The outputs are dated. They'll get stale as models improve. We'll re-run and update — but we won't quietly update old results. Old results stay visible with their dates.

This is the "no retroactive edits" principle. A benchmark that silently improves over time is marketing. A benchmark that ages visibly is honest.
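The "no retroactive edits" principle maps naturally onto an append-only log: re-runs add a new dated snapshot and never overwrite an old one. A minimal sketch (function names and dates are ours):

```python
# Sketch of "no retroactive edits" as a data structure: re-runs append a
# new dated snapshot; nothing is overwritten. Names and dates are
# illustrative, not from the compare page.
import datetime

def add_snapshot(history: list[dict], scores: dict, date: str) -> list[dict]:
    """Append a dated snapshot. Old entries stay visible, with their dates."""
    datetime.date.fromisoformat(date)  # validate the date; fail loudly if bad
    return history + [{"date": date, "scores": scores}]  # never mutate in place

history: list[dict] = []
history = add_snapshot(history, {"task_3": 4.0}, "2026-01-01")  # first run
history = add_snapshot(history, {"task_3": 4.5}, "2026-02-01")  # re-run

# Both results remain: the benchmark ages visibly instead of silently improving.
print([h["date"] for h in history])
```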


What we're actually testing

We're running 4 setups against the same 10 prompts:

  • Frugal tier — cheapest capable model for each task (~$0.006/call)
  • Balanced tier — recommended tier, balance of cost and quality (~$0.10/call)
  • Premium tier (council mode) — multi-model orchestration, our claim is it beats single-model Opus on complex tasks (~$0.55+/call)
  • Opus 4.6 direct — the gold standard comparison, called via Anthropic's API with no routing layer
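For scale, the per-call figures above put a full 10-task run within reach of anyone who wants to reproduce it. A back-of-envelope sketch (the premium figure uses its quoted $0.55 floor; the Opus-direct column has no quoted price, so it's omitted):

```python
# Rough cost of one full benchmark run (10 tasks per setup), using the
# per-call figures quoted above. "premium" uses its quoted $0.55 floor;
# no figure was quoted for Opus direct.
COST_PER_CALL = {"frugal": 0.006, "balanced": 0.10, "premium": 0.55}
TASKS = 10

for setup, cost in COST_PER_CALL.items():
    print(f"{setup}: ~${cost * TASKS:.2f} for {TASKS} tasks")
```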

The result we're most curious about ourselves: does council mode actually beat direct Opus on the architecture and API design tasks? We don't know yet. The benchmark will tell us, and we'll publish whatever it says.


When it ships

Compare page v2 goes live once:

  1. The benchmark data is in (40 API calls, a few hours of compute)
  2. Scores are written and reviewed
  3. The page passes QA

We're targeting this week.

If you want to see the outputs the moment it's live: komilion.com/compare
