OpenMark
Benchmarking the Model Is the Wrong Abstraction

Benchmarking the workflow is the right one.

I've spent over a year benchmarking AI models. Thousands of evaluations across 100+ models, dozens of task types, multiple scoring modes. And the single biggest thing I've learned is something most people in this space haven't internalized yet:

Model performance is not a number. It's a function.


```
performance = f(
    model,
    task_type,
    task_theme,
    prompt_structure,
    output_constraints,
    decoding_parameters,
    dataset_distribution
)
```

Change any one of these variables, and the rankings reshuffle. Sometimes dramatically. The model that wins on your classification task might lose on mine, not because one of us is wrong, but because the task/model pairing is different.

This has massive implications for how we should think about evaluation, routing, and cost.

Prompt structure reshuffles winners

One of the most consistent patterns I've observed: changing the prompt style (not the question itself, just the syntax and framing) can completely reorder which model comes out on top.

Rephrase a sentiment classification prompt from "Classify as positive/negative/neutral" to "What is the sentiment? Reply with one word," and you'll get different winners. Same task. Same intent. Different leaderboard.

There's one consolation: the worst models tend to stay the worst regardless of how you phrase things. Prompt engineering mostly reshuffles the top-tier competitors. Lower-capability models saturate early and no amount of prompt craft saves them.

But for anyone choosing between the top 5-10 models for a production task, this means your prompt is part of your evaluation, not separate from it.
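A minimal sketch of what "the prompt is part of the evaluation" looks like in practice: score every model under every prompt variant and keep one leaderboard per variant. The model names, variants, and accuracy numbers below are illustrative stand-ins, not real results.

```python
# Sketch: treat the prompt template as an evaluation axis, not a constant.
# Models, variants, and scores are hypothetical illustrations.

PROMPT_VARIANTS = {
    "labels": "Classify as positive/negative/neutral: {text}",
    "one_word": "What is the sentiment? Reply with one word: {text}",
}

# Illustrative accuracy per (model, variant) pair.
RESULTS = {
    ("model-a", "labels"): 0.91, ("model-a", "one_word"): 0.84,
    ("model-b", "labels"): 0.88, ("model-b", "one_word"): 0.90,
}

def leaderboard(variant):
    """Rank models on a single prompt variant, best first."""
    models = {m for m, _ in RESULTS}
    return sorted(models, key=lambda m: RESULTS[(m, variant)], reverse=True)

# Same task, same intent, different winner per phrasing:
print(leaderboard("labels"))    # model-a leads
print(leaderboard("one_word"))  # model-b leads
```

The point of the structure is that there is no single `leaderboard()` call: any ranking you report is conditioned on a prompt variant.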

Task type alone doesn't predict performance

There's a common mental model that goes something like:

  • Reasoning tasks → reasoning models
  • Extraction tasks → smaller instruction models
  • Creative tasks → large frontier models

It sounds logical. It's also wrong more often than you'd expect.

I've run benchmarks where non-reasoning models outperform dedicated reasoning models on reasoning tasks. Where a "Medium" pricing tier model ties with a "Very High" tier flagship. Where the cheapest model in the roster co-leads with the most expensive one.

Performance depends on task theme, prompt syntax, output formatting constraints, and dataset characteristics in ways that broad categories simply can't capture. "Classification" is not one task. It's thousands of tasks that happen to share a label.

Smaller models win more often than people think

In production workflows (RAG pipelines, agent chains, extraction flows), smaller models frequently outperform frontier models on individual steps. They're faster, cheaper, more deterministic, and often better at following rigid output constraints.

The insight that changed how I build systems:

```
optimal system ≠ best model
optimal system = best model per step
```

Most pipelines only need a frontier model for a small minority of steps. The rest can run on models that cost 10-25x less with equal or better results on that specific sub-task.

But you'll never discover this by looking at a leaderboard. You'll only see it by benchmarking each step individually.
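One way to operationalize per-step selection is a routing rule: for each step, pick the cheapest model whose accuracy on that step's own benchmark clears a quality bar. The step names, models, scores, costs, and the 0.85 threshold below are all illustrative assumptions.

```python
# Sketch: cheapest model per pipeline step that clears a quality bar,
# using step-level benchmark results. All numbers are made up.

STEP_BENCHMARKS = {
    # step -> {model: (accuracy, cost per 1K calls in $)}
    "extract": {"small-model": (0.93, 0.10), "frontier-model": (0.94, 2.00)},
    "reason":  {"small-model": (0.61, 0.10), "frontier-model": (0.90, 2.00)},
}

def route(step, min_accuracy=0.85):
    """Return the cheapest model whose step-level accuracy clears the bar."""
    candidates = [
        (cost, model)
        for model, (acc, cost) in STEP_BENCHMARKS[step].items()
        if acc >= min_accuracy
    ]
    return min(candidates)[1]  # min by cost

print(route("extract"))  # the small model: near-equal accuracy at 1/20th the cost
print(route("reason"))   # the frontier model: the small one misses the bar
```

Note that the routing decision falls out of step-level data; a whole-pipeline leaderboard could never produce it.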

Model capability is a vector, not a score

Every leaderboard reduces a model to a single number. But model capability is multidimensional:

  • Reasoning depth
  • Extraction precision
  • Format obedience
  • Hallucination resistance
  • Instruction following
  • Long-context handling
  • Tool use reliability
  • Latency efficiency

Different tasks project onto different parts of this capability space. A model can be exceptional at reasoning and terrible at format obedience. It can handle 100K context windows flawlessly and still fail at single-label classification because it can't resist adding an explanation.

When you flatten all of this into one score, you lose the information that actually matters for your decision.
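One way to make "capability is a vector" concrete: score a model for a task by projecting its capability vector onto the task's requirement weights, instead of flattening it to one number. The axes kept here are a subset of the list above, and every value and weight is a made-up illustration.

```python
# Sketch: task fit as a weighted projection of a capability vector.
# Axis values and task weights are hypothetical illustrations.

AXES = ["reasoning", "format_obedience", "hallucination_resistance"]

MODELS = {
    "deep-reasoner":  [0.95, 0.40, 0.80],  # brilliant, but adds explanations
    "strict-labeler": [0.55, 0.95, 0.85],  # modest depth, obeys formats
}

def task_fit(weights, capabilities):
    """Weighted sum of capabilities along the task's requirement axes."""
    return sum(w * c for w, c in zip(weights, capabilities))

# Single-label classification weights format obedience, not reasoning depth:
classification = [0.1, 0.7, 0.2]
best = max(MODELS, key=lambda m: task_fit(classification, MODELS[m]))
print(best)  # the format-obedient model wins despite weaker reasoning
```

A reasoning-heavy task would simply use a different weight vector, and the ranking would flip, which is exactly the information a single score destroys.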

Variance follows capability boundaries

Here's something I didn't expect to find: model variance is not strongly correlated with model size or price. It follows a capability boundary pattern.

  • Capability far exceeds task difficulty → stable success
  • Capability roughly matches task difficulty → high variance
  • Capability far below task difficulty → stable failure

The most dangerous zone is the middle one. A model near the edge of its capability for a task will give you brilliant output sometimes and garbage other times. Single-run benchmarks can't detect this. You need multiple passes with stability tracking to see it.

This is why consistency metrics matter as much as accuracy. A model that scores 75% with perfect stability is often more valuable in production than one that scores 82% but fluctuates wildly.
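A minimal sketch of multi-pass evaluation with stability tracking: re-run the same task set several times and report spread alongside accuracy. The per-run scores below are illustrative, chosen to mirror the 75%-stable vs 82%-volatile comparison above.

```python
# Sketch: stability tracking over repeated benchmark passes.
# Per-run accuracies are hypothetical illustrations.
from statistics import mean, pstdev

RUNS = {  # model -> accuracy on five repeated passes of the same task set
    "stable-75":   [0.75, 0.75, 0.74, 0.76, 0.75],
    "volatile-82": [0.95, 0.70, 0.91, 0.68, 0.86],
}

def summarize(scores):
    """Mean accuracy plus spread; a single-run benchmark only sees the mean."""
    return {"accuracy": round(mean(scores), 3), "spread": round(pstdev(scores), 3)}

for model, scores in RUNS.items():
    print(model, summarize(scores))
# The volatile model has the higher mean but an order-of-magnitude larger spread.
```

Any single pass over `volatile-82` could have reported anywhere from 68% to 95%, which is the capability-boundary variance a one-shot benchmark cannot detect.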

Models regress silently

Another pattern that doesn't get enough attention: capability drift.

I've observed models regress on tasks even when the model name stays the same and prompts remain unchanged. A model scores 82% in January, you retest in March, it scores 71%. Same API endpoint. Same prompt. Different results.

Possible causes: alignment layer adjustments, silent model updates, decoding policy changes, backend routing changes. The providers don't announce these. Most developers never detect it because they don't run controlled evaluations on a schedule.

This is why I treat benchmark results as perishable data. If you're routing production traffic based on an evaluation you ran three months ago, you might already be misrouting.
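Treating results as perishable can be as simple as a scheduled re-run plus a regression check against the stored baseline. The 0.05 alert threshold below is an assumption, not a standard, and the scores echo the January/March example above.

```python
# Sketch: drift detection for scheduled re-evaluations.
# `threshold` is an assumed tolerance for run-to-run noise, not a standard.

def drift_alert(baseline, current, threshold=0.05):
    """Flag a silent regression when the score drops beyond the tolerance."""
    return (baseline - current) > threshold

print(drift_alert(0.82, 0.71))  # True: the January-to-March drop in the text
print(drift_alert(0.82, 0.80))  # False: within normal run-to-run noise
```

In a real setup the baseline would come from the stored evaluation record, and an alert would trigger re-routing rather than just a boolean.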

The prompt that generates the benchmark can fail the benchmark

One of the more interesting things I've noticed: when a model generates evaluation prompts and expected answers, it doesn't necessarily perform well on those tasks itself.

A model can write a perfectly valid classification test with correct expected labels, then fail that exact test when evaluated. The asymmetry between generating instructions and following them is real, and it means you can't trust a model to evaluate itself.
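The practical consequence is to keep test generation and grading decoupled: a model may author the test, but grading stays deterministic and model-free. The stubs below are hypothetical stand-ins for real model calls, with the generator's own (explanation-laden) answer failing its own exact-match test.

```python
# Sketch: a model-authored test graded by deterministic code, never by the
# model itself. `generate_case` and `model_answer` are hypothetical stubs.

def generate_case():
    """Stand-in for a model-authored test: prompt plus expected label."""
    return {"prompt": "Sentiment, one word: 'I loved it'", "expected": "positive"}

def model_answer(prompt):
    """Stand-in for the same model answering its own test."""
    return "Positive. The reviewer clearly enjoyed it."  # can't resist explaining

def grade(case, answer):
    """Exact-match grading, independent of any model."""
    return answer.strip().lower() == case["expected"]

case = generate_case()
print(grade(case, model_answer(case["prompt"])))  # False: a valid test, failed by its author
```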

The real question

The AI industry is obsessed with: "Which model is best?"

After a year of benchmarking, I'm convinced this is the wrong question.

The right question is: "Which model is best for this specific task, with this specific prompt structure, in this specific workflow?"

That question can only be answered by benchmarking the workflow, not the model.

Static leaderboards answer the first question. Custom, task-specific, repeatable benchmarking answers the second. The gap between these two approaches is where most teams are silently overpaying, underperforming, or both.


Bio: Marc Kean Paker is the founder of OpenMark, an AI model benchmarking platform for deterministic, cost-aware model selection across 100+ models.
