DEV Community

Phil Rentier Digital

Originally published at rentierdigital.xyz

The AI Model You Chose Was Picked by a Server, Not a Score

Everything moves so fast.

Every week, I run a test across 7 to 10 of the latest image models. Same prompt, same characters, same instructions. Flux 2 Dev, Ideogram, Nano Banana, and a few others. My agent scores each output automatically — character consistency, text legibility, prompt adherence — and produces a ranked list.

That's how I choose my models. Not from leaderboards (does anyone actually know how those are built?).

From my own workload. My own criteria. My own scoring.

Last week, Anthropic published an engineering paper explaining why that's probably the only method that holds. I'd arrived at the same conclusion: when a score becomes a target, it stops measuring what it's supposed to measure. AI benchmarks are no exception.

TL;DR: Server configuration alone can swing agentic benchmarks by 6 points — bigger than most leaderboard gaps. Test on your own workload or don't bother.



AI benchmarks: Where marketing meets mathematical magic show

The Paper Nobody Talked About Enough

Anthropic's engineering team ran Terminal-Bench 2.0 with six different server setups. Same Claude model. Same tasks. Same everything. The only variable: how much RAM and CPU each run got, and how strictly those limits were applied.

The gap between the most- and least-resourced setups: 6 percentage points.

To put that in context: the difference between #1 and #5 on most AI coding leaderboards is somewhere between 2 and 4 points. The server setup (which nobody discloses) can be bigger than the actual capability gap you're making decisions on.

Here's what happens under tight limits. The AI agent starts a task, hits a memory spike — installing a library, running tests, launching a subprocess — and the server kills the process. Task failed. Not because the model couldn't do it. Because the box ran out of room. Cool.

The paper shows error rates going from 5.8% on strict setups down to 0.5% with generous ones. That's not the model getting better. That's the server policy changing.

The more interesting part: above a certain resource level, extra power doesn't just fix crashes. It lets the agent try approaches it couldn't attempt before. Install heavier libraries. Run things in parallel. The benchmark starts measuring a different thing depending on how much RAM you gave it.
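A toy simulation makes the mechanism concrete. Every number below is invented for illustration (the task list, the caps, all of it), but the shape is the one the paper describes: the model never changes, only the box does.

```python
# Toy model of resource-dependent benchmark scores. All numbers invented.

def pass_rate(task_peak_mem_gb, cap_gb):
    """A task 'fails' whenever its peak memory exceeds the cap,
    regardless of whether the model could have solved it."""
    passed = sum(1 for peak in task_peak_mem_gb if peak <= cap_gb)
    return passed / len(task_peak_mem_gb)

# Same model, same ten tasks: the peak memory each one happens to need.
tasks = [0.5, 1.2, 2.8, 3.5, 0.9, 4.1, 1.8, 2.2, 3.9, 0.7]

strict = pass_rate(tasks, cap_gb=2.0)    # stingy container
generous = pass_rate(tasks, cap_gb=8.0)  # roomy container

print(f"strict: {strict:.0%}, generous: {generous:.0%}")
# strict: 50%, generous: 100%
```

Nothing about the "model" changed between the two runs. The swing is pure server policy.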

Anthropic's own words: "A few-point lead might signal a real capability difference — or it might just be a bigger VM."

This Is Older Than Kubernetes

This problem didn't start with server configs.

In 1975, British economist Charles Goodhart made the observation now known as Goodhart's Law: when a measure becomes a target, it stops being a good measure. AI got there eventually, and the consequences have been piling up ever since.

ImageNet scores looked like real progress until researchers tested the same models with shifted backgrounds and poses. Accuracy dropped 40 points. The models had learned to recognize context clues, not the actual objects. The GLUE benchmark was declared "solved" at superhuman levels. Those same models couldn't pass a basic language comprehension test designed to check whether they actually understood anything. (They didn't. They were very fast at not understanding anything.)

MMLU scores sat above 86% while the same questions, slightly reworded, exposed huge gaps.

SWE-Bench is the freshest example. An OpenAI audit in February 2026 found that every major model had been trained on data from the benchmark's own repos. The models had basically seen the answers before the test. Remove the scaffolding (the hints and templates that guide models toward solutions) and scores fall from 80% down to around 45%. Same model. Different conditions. 35 points of "capability," gone.

Server noise is just Goodhart's Law applied to Kubernetes. The benchmark was already measuring something slightly off from real capability. Now it's also measuring how generous your server was that day.

Why This Won't Get Fixed

Write this down.

benchmark_score = f(model_capability, server_config)

Both variables affect the score. Server setup is never standardized, never disclosed, and changes between runs. Two labs can run the exact same benchmark on the exact same model and get different numbers because their containers were set up differently. The Terminal-Bench leaderboard uses a provider that lets containers temporarily use more memory than their limit. Anthropic's setup cut them off the moment they hit the limit. Same benchmark. Different rules. Different numbers.
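The two enforcement styles can be sketched in a few lines. This is my own hypothetical encoding of the difference, not either provider's actual logic:

```python
# Two enforcement policies applied to the same memory trace (GB,
# sampled once per second). Hypothetical illustration only.

def hard_limit(trace, limit):
    """Kill the moment usage crosses the limit."""
    return all(m <= limit for m in trace)

def soft_limit(trace, limit, grace_seconds=3):
    """Tolerate brief bursts over the limit; kill only if sustained."""
    over = 0
    for m in trace:
        over = over + 1 if m > limit else 0
        if over > grace_seconds:
            return False
    return True

# One task: a two-second spike while installing a heavy library.
trace = [1.0, 1.5, 4.2, 4.5, 1.8, 1.2]

print(hard_limit(trace, limit=4.0))  # False, task "failed"
print(soft_limit(trace, limit=4.0))  # True, same task "passed"
```

Same trace, same limit, opposite verdicts. That's the whole "different rules, different numbers" problem in miniature.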

The commercial pressure makes this hard to fix. Labs want their numbers to look good. Getting competitors to agree on a shared server standard means coordinating with the exact people you're trying to beat. Not impossible. But nobody's rushing.

Anthropic's proposed fix: set both a minimum resource guarantee and a separate hard cutoff per task, close enough together that the score difference is basically noise. Solid recommendation. Requires every benchmark maintainer to adopt it, every eval runner to implement it right, and everyone citing scores to start asking "what was the server setup?"
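As I read the proposal (this encoding is mine, not Anthropic's), it amounts to a two-number contract per task, plus a rule that the two numbers stay close:

```python
# Sketch of the floor/ceiling convention as I understand it.
# The max_ratio threshold is my invented placeholder, not a spec value.

def valid_policy(floor_gb, ceiling_gb, max_ratio=1.25):
    """Floor is always guaranteed; ceiling is a hard kill line.
    The band between them must be narrow enough to read as noise."""
    return floor_gb <= ceiling_gb <= floor_gb * max_ratio

print(valid_policy(4.0, 4.5))   # True: tight band
print(valid_policy(2.0, 16.0))  # False: score now measures the server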

Until that happens: a 2-point lead on a leaderboard is a guess, not a fact.

When a benchmark becomes a leaderboard, it becomes a target. When it becomes a target, it stops being a benchmark.

How I Actually Choose

I didn't get to testing on my own tasks through some smart strategy. I got there by burning money the wrong way first. (Classic RPG mistake: grinding the wrong area for three hours before checking the map.)

First try: self-hosted open-source models. Full control, no API bills, no external dependency. Reality: too slow, needed machines bigger than I was willing to pay for, and the output quality wasn't there for real work. I rebuilt my entire model setup after one of those experiments went sideways. The lesson wasn't which model I picked. It was that running AI seriously has a real cost, and pretending otherwise wastes time.

Second phase: find the middle. Not Opus at $5 per million input tokens. Not local inference with its slowness and setup pain. Something on API, cheap enough to run constantly, good enough for the tasks I actually have.

Kimi K2.5 as primary on OpenClaw, MiniMax M2.5 as fallback. Neither came from a leaderboard. Both came from running actual tasks, looking at what came out, and checking what it cost.

The cover test makes this concrete. Every week, same prompt, 7 to 10 models. My agent checks each output: did it follow the instructions, is the text readable, are the characters consistent. It ranks the results. I look at the ranking next to the price per image and make a call. That's the whole process. No benchmark involved at any step.
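The ranking step is small enough to sketch. The criteria names are mine and the scores are placeholders; in the real pipeline an agent fills them in per output:

```python
# Minimal sketch of the weekly ranking step. Scores and prices are
# placeholder values, filled in by an agent in the actual pipeline.

def rank(results):
    """Sort models by average criterion score, best first."""
    def avg(r):
        return (r["adherence"] + r["legibility"] + r["consistency"]) / 3
    return sorted(results, key=avg, reverse=True)

results = [
    {"model": "model-a", "adherence": 8, "legibility": 9, "consistency": 7, "usd_per_image": 0.04},
    {"model": "model-b", "adherence": 9, "legibility": 7, "consistency": 9, "usd_per_image": 0.08},
    {"model": "model-c", "adherence": 6, "legibility": 7, "consistency": 6, "usd_per_image": 0.01},
]

# Ranking next to price per image: the final call stays human.
for r in rank(results):
    print(r["model"], r["usd_per_image"])
```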



Automated evaluation of AI image models comparing performance, cost, and consistency.

Other developers doing similar tests land in similar places. Kimi gets to a working result faster on most tasks and costs 8 to 10 times less than Opus. Claude holds up better when something breaks and the agent needs to fix itself over several rounds. Neither of those insights came from SWE-Bench. Both came from running real tasks and reading the outputs.

The right way to evaluate a model for agent work is closer to just running your agent. Swap the model, run the task, compare what comes out.

Your tasks will give you different answers. That's not a flaw in the method. That's the point.

Three Things That Actually Work

Run your own tasks. Not benchmark tasks. Your tasks. The ones you'll actually run next week. Score the outputs yourself or with an agent. Ten minutes of that beats two hours reading leaderboard comments.

Count the full cost, not the price per token. Kimi at $0.60/M looks cheaper than Claude at $5/M until you count how many rounds it takes to get something usable. Three correction loops and you've already lost the math. Track what it costs to get one output you'd actually ship.
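Here's that math as a sketch, with toy numbers throughout. The review cost per round (your time fixing the output) is the term the per-token price hides:

```python
# Cost per usable output, not per token. All numbers are toy values.

def cost_per_usable_output(usd_per_m_tokens, tokens_per_round, rounds, usd_review_per_round):
    """Total spend to get one output you'd actually ship."""
    api = usd_per_m_tokens * tokens_per_round / 1_000_000 * rounds
    return api + usd_review_per_round * rounds

# Cheap model: three correction loops. Pricier model: usable first try.
cheap = cost_per_usable_output(0.60, 50_000, rounds=3, usd_review_per_round=0.10)
pricey = cost_per_usable_output(5.00, 50_000, rounds=1, usd_review_per_round=0.10)

print(f"cheap: ${cheap:.2f}  pricey: ${pricey:.2f}")
# cheap: $0.39  pricey: $0.35
```

With these numbers, the 8x per-token advantage turns into a loss. Swap in your own rounds and review time; the ordering can flip.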

Ignore gaps under 5 points. Given what we know about server variance, contamination, and benchmark scaffolding, anything under 5 points is noise. If two models are that close on a public leaderboard, the only way to know which one is better for you is to test it yourself. (Spoiler: the answer will surprise you.) 😅

The labs will improve this. Terminal-Bench 2.0 already specifies recommended resources per task. SWE-Bench Pro was built specifically because the contamination issue got too obvious to ignore. Progress is real.

What won't change: the incentive structure. Cleaner benchmarks become the new target. Models get better at the cleaner benchmarks. The gap between benchmark score and your real workload stays roughly the same, just with better-looking numbers in the footnotes. Neat.

Builders who ship don't wait for the leaderboard to stabilize. They run the test. They read the output. They push.

The eval that matters is the one you wrote.



If you build with AI agents and want more of this — tested, priced, production-grade — subscribe. The hype is free everywhere. The numbers cost something to get.

(*) The cover is AI-generated. The irony of using an automated model ranking system to pick the cover for an article about automated model ranking systems is not lost on me.
