Agent Benchmark Scores Are Measuring the Harness, Not the Model | Focused Labs

The difference between the leading agentic coding models is much smaller than the difference between two distinct configurations of a single model on the same benchmark. Anthropic just quantified it: a six-percentage-point gap on Terminal-Bench 2.0 between the most- and least-resourced setups, p < 0.01. Same model. Same task set. Same harness. The only variable was the resource budget given to the pod.

This is larger than the spread between most frontier models on the public leaderboard.

The number the enterprise picked as "the best agent model" is mostly the amount of CPU and RAM that the eval team assigned to the pod for the test. Welcome to production.

The benchmark is not what the benchmark claims to measure

Static evals score a model's output directly. Agentic coding evals score a model in a runtime, and the runtime itself decides whether a container gets OOM-killed for a transient memory spike, whether a pip install command finishes, whether a test subprocess ever returns a result. Two agents at different resource budgets will be taking different tests.

Anthropic ran Terminal-Bench 2.0 across six resource configurations, from strict enforcement of the per-task specs all the way to completely uncapped. They observed 5.8% of tasks failing on pod errors unrelated to model capability at strict enforcement, compared to 0.5% at uncapped. Success scores at 1x through 3x were largely within noise (p=0.40), since the tasks that crashed under the tighter limits were mostly ones the agent was going to fail anyway. Past 3x, however, success scores climbed faster than infra errors declined. The extra headroom let the agent attempt approaches that only work with more generous allocations, such as installing several large packages at once, running memory-hungry test suites, or spawning subprocesses that take extra time to complete.
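One way to keep those two failure modes apart in your own runs is to tag every task outcome and report the infra error rate next to the pass rate. What follows is a minimal sketch, not Anthropic's or Terminal-Bench's tooling; the `Outcome` labels and the `summarize` helper are hypothetical names I made up for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    MODEL_FAILED = "model_failed"  # the run completed, but the task's checks did not pass
    INFRA_ERROR = "infra_error"    # OOM kill, pod eviction, sandbox crash, and similar


def summarize(outcomes: list[Outcome]) -> dict[str, float]:
    """Report the pass rate two ways, alongside the infra error rate."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Tasks that ended without an infrastructure error.
    scored = total - counts[Outcome.INFRA_ERROR]
    return {
        "infra_error_rate": counts[Outcome.INFRA_ERROR] / total,
        "pass_rate_all_tasks": counts[Outcome.PASSED] / total,
        "pass_rate_excluding_infra": counts[Outcome.PASSED] / scored if scored else 0.0,
    }
```

If the two pass rates diverge noticeably, the headline score is telling you about the pod as much as the model.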

The benchmark shifted. Previously it was measuring how capable the model was. Now it is measuring how much budget the harness gives the agent to brute-force the answer.

This is not a bug in Terminal-Bench. It is the nature of agentic evaluation: the runtime is not a passive container, it is an active part of the problem-solving process.

When a benchmark report omits the exact hardware and resource configuration, it ships a number that can't be compared to anyone else's number. Nobody is measuring the same thing.
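A score becomes comparable once the configuration behind it travels with it. Here is a rough sketch of what that could look like: a manifest that records the harness settings and hashes them, so two numbers are only put side by side when the fingerprints match. The field names and the `comparable` helper are assumptions for illustration, not a published standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessManifest:
    scaffold: str                # agent framework and version
    tools: tuple[str, ...]       # tool names exposed to the agent
    max_retries: int
    context_budget_tokens: int
    cpu_limit: float             # cores per task container
    memory_limit_mb: int
    sandbox_provider: str

    def fingerprint(self) -> str:
        """Stable short hash of the full configuration."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(a: HarnessManifest, b: HarnessManifest) -> bool:
    """Treat two scores as comparable only if they came from the same harness."""
    return a.fingerprint() == b.fingerprint()
```

Publishing the fingerprint next to the score is cheap, and it makes "same benchmark, different pod" visible at a glance.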

The model is mostly plumbing

Harrison Chase has been making a variant of this argument for about a year. The agent is not the model. The agent is the harness, memory, tools, prompts, retries, state machines, guardrails, and context windows, with a model call buried somewhere in there.

The Anthropic data is experimental confirmation that the harness sits at the heart of the agent. Change the pod resource limits and the "same" agent is a different agent inhabiting a wildly different reality. Change the sandbox provider and the same leaderboard score means a completely different thing. The vast majority of the decisions that go into building an agent are harness decisions.

Anna Bernad posted a Twitter thread last week after looking at 36 production agent harnesses. Her take is far sharper than mine.

"Every harness I studied that actually ships does the same underlying move, and guess, it's not separation. It's making the context describe a different room."

If the context reads as "teammate shipped work, I'm the reviewer, pipeline wants green," the agent soft-approves with a minor note. Not because the model is bad. The agent is trying to fit the response to the context, and soft approval is the only way to complete the pattern.

The harness is the room. The model is the tenant.

What this does to enterprise procurement

By the time a client engages us, agent performance has consistently drifted away from what the benchmark led them to expect. The model they selected is sound. The harness the model has to operate through is what holds the application back. The runtime may not give the tools enough compute to act effectively. The retry mechanism built to improve throughput masks critical errors until it is far too late. The context window is being consumed by boilerplate system prompts the procurement team didn't know existed.
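The retry problem is the easiest of these to illustrate. A retry wrapper that swallows exceptions looks healthy right up until it isn't. The sketch below keeps a record of every failed attempt and hands it back with the result, so the errors show up in traces even when the call eventually succeeds. `RetryReport` and `call_with_retries` are hypothetical names, not part of any library mentioned here.

```python
from dataclasses import dataclass, field
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class RetryReport:
    attempts: int = 0
    errors: list[str] = field(default_factory=list)  # every failure, in order


def call_with_retries(fn: Callable[[], T], max_attempts: int = 3) -> tuple[T, RetryReport]:
    """Retry fn, but record each failure instead of silently discarding it."""
    report = RetryReport()
    for _ in range(max_attempts):
        report.attempts += 1
        try:
            return fn(), report
        except Exception as exc:  # broad on purpose: this is only a sketch
            report.errors.append(f"{type(exc).__name__}: {exc}")
    # Out of attempts: surface everything that went wrong, not just the last error.
    raise RuntimeError(f"gave up after {report.attempts} attempts: {report.errors}")
```

A call that succeeds on the third try still returns a report with two errors in it, which is exactly the signal a throughput-oriented retry loop tends to hide.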

The enterprise then concludes "AI doesn't work for us" and abandons the effort. The model vendor is blamed. Nobody audits the scaffold.

We don't dismiss vendor benchmark claims out of hand, but they become pure marketing once they are boiled down to a single "eval score" that buyers are meant to use to compare vendors. If the eval score is only reproducible on the vendor's Kubernetes cluster, with their sandboxing solution and their machine resources, the score has no procurement value.

The LangSmith Signal report this week puts billions of agent runs behind the month's trends. Anthropic grew 73% in users, gaining 39% of share. Gemini rose after the release of Gemini 3. OpenAI remained the largest at around 80% of volume but didn't move up or down. Those are usage numbers, not capability numbers. People are moving around based on what actually works in their harness, not based on what a leaderboard says.

How to read a benchmark

Three questions, in order.

First: what was the harness, actually? If the eval team doesn't publish the scaffold, retry policy, context budget, tool set, and resource configuration, the number is a snapshot of one run on their box and is not comparable to anything.

Second: what is the infra error rate? Anthropic reported 5.8% of Terminal-Bench 2.0 tasks failing on pod errors at strict enforcement, roughly five times the spread between most frontier models. An eval that doesn't separate "model failed" from "container got killed" folds that noise straight into the headline number.

Third: does my production environment resemble the eval environment? If the eval runs uncapped on a data-center GPU cluster and my agent runs in a sandboxed environment such as a Lambda function with a 512MB memory cap, the score has almost no predictive value for me. An agent can win the benchmark by brute-forcing its way through heavyweight installs like scikit-learn and then fail silently at ship time because it needs more memory than production allows. A lean, efficient agent that loses the benchmark may ship just fine.
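The third question can be turned into a mechanical check. As a rough sketch, assuming you can get the eval's resource limits at all, you can describe both environments with the same small structure and list every dimension where the eval gave the agent more room than production will. `ResourceEnvelope` and `headroom_gaps` are hypothetical names, and the three dimensions are just the ones this post talks about.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceEnvelope:
    memory_mb: Optional[int]        # None means uncapped
    cpu_cores: Optional[float]
    task_timeout_s: Optional[int]


def headroom_gaps(eval_env: ResourceEnvelope, prod_env: ResourceEnvelope) -> list[str]:
    """List each dimension where the eval was more generous than production."""
    gaps = []
    for name in ("memory_mb", "cpu_cores", "task_timeout_s"):
        eval_limit = getattr(eval_env, name)
        prod_limit = getattr(prod_env, name)
        if prod_limit is None:
            continue  # production is uncapped here, so the eval cannot exceed it
        if eval_limit is None:
            gaps.append(f"{name}: eval uncapped, production capped at {prod_limit}")
        elif eval_limit > prod_limit:
            gaps.append(f"{name}: eval {eval_limit} vs production {prod_limit}")
    return gaps
```

An empty list does not prove the score transfers, but a non-empty one is a concrete reason to rerun the eval under your own limits before trusting it.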

What to do instead

Build the harness first. Run the model last.

The evaluation has to reflect production. Production tools. Production retry budget (or lack thereof). Production memory store. Production prompt scaffolding. Production runtime limits. Wire it up with observability that traces trajectories through the system, not individual LLM calls. Then swap different models in and see what changes.


# Shape of an internal model bake-off in 2026.
# LangChain 1.x, LangGraph 1.1.9, LangSmith.

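from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, PIIMiddleware
from langsmith import Client
from langsmith.evaluation import evaluate

# PRODUCTION_TOOLS, PRODUCTION_SYSTEM_PROMPT, ProductionContext, the middleware
# configs, and the three evaluators below are assumed to come from your own codebase.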

CANDIDATES = [
    "anthropic:claude-opus-4-7",
    "openai:gpt-5.1-pro",
    "google:gemini-3-pro",
]

def build_agent(model: str):
    # Same tools, same prompt, same retry budget, same memory store.
    # The ONLY variable is the model string.
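    return create_agent(
        model=model,
        tools=PRODUCTION_TOOLS,
        system_prompt=PRODUCTION_SYSTEM_PROMPT,  # LangChain 1.x create_agent takes system_prompt
        middleware=[
            # Constructor arguments are placeholders; check them against the
            # middleware signatures in your installed LangChain version.
            PIIMiddleware(**PROD_PII_CONFIG),  # e.g. {"pii_type": "email", "strategy": "redact"}
            HumanInTheLoopMiddleware(interrupt_on=PROD_POLICY),
        ],
        context_schema=ProductionContext,
    )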

client = Client()
dataset = client.read_dataset(dataset_name="production-trajectories-q2")

for model_id in CANDIDATES:
    agent = build_agent(model_id)
    evaluate(
        lambda inputs: agent.invoke(inputs),
        data=dataset,
        evaluators=[
            trajectory_match,       # compares actual tool-call path to reference
            tool_call_precision,    # did the agent use the right tool at the right time
            final_output_rubric,    # LLM-as-judge on the end state
        ],
        experiment_prefix=f"harness-bakeoff-{model_id}",
        max_concurrency=8,
    )

All tests run using the same harness, the same tools, one variable at a time. The goal is to select the model that actually works within the production stack, not the one that earned points on a public leaderboard running on a Kubernetes cluster someone else had tuned.
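The `trajectory_match` and `tool_call_precision` evaluators in that snippet are placeholders for whatever you write yourself. As one possible starting point, and only a sketch, here is the scoring logic as plain functions over lists of tool names, using the standard library's `difflib`. The function names are hypothetical, and wrapping them in the evaluator signature your LangSmith version expects is left out on purpose.

```python
from difflib import SequenceMatcher


def trajectory_similarity(actual: list[str], reference: list[str]) -> float:
    """Similarity in [0, 1] between two ordered lists of tool names."""
    if not actual and not reference:
        return 1.0
    # autojunk=False keeps difflib from discarding frequently repeated tool names.
    return SequenceMatcher(a=actual, b=reference, autojunk=False).ratio()


def tool_precision(actual: list[str], reference: list[str]) -> float:
    """Fraction of the agent's tool calls that also appear in the reference path."""
    if not actual:
        return 0.0
    allowed = set(reference)
    return sum(1 for tool in actual if tool in allowed) / len(actual)
```

Order-sensitive similarity and set-based precision answer different questions, so it is worth tracking both rather than averaging them into one number.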

This is where the engineering work lives now, and it is why a lot of clients call us. The model picker is not the problem. The harness design is the problem. The eval infrastructure is the problem. The trajectory observability is the problem.

The harder truth

Tight resource limits tend to reward agents that are simple and efficient, the ones that write lean code quickly. Generous limits tend to reward agents that can exploit every resource they are given. Both are worth testing for, and both correspond to realistic scenarios. Neither can fairly be collapsed into a single number on a leaderboard.

Many of the agents we deploy to enterprises run on some sort of strict budget for resources such as memory and CPU. Beyond these general limits, there are often specific restrictions on things like subprocess runtime and the number of times an API can be called within a window, largely because of cost. The model that wins with unlimited resources is a different model than the one that wins under strict limits.
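If those are the limits production imposes, the bake-off harness should impose them too. Below is a small sketch of two such budgets: a sliding-window cap on API calls and a wall-clock cap on subprocesses. `CallBudget` and `run_with_time_budget` are hypothetical helpers built on the standard library, and the specific limits are whatever your production environment dictates.

```python
import subprocess
import sys
import time
from collections import deque
from typing import Callable


class CallBudget:
    """Sliding-window limit: at most max_calls within any window_s seconds."""

    def __init__(self, max_calls: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._stamps = deque()  # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        now = self._clock()
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()  # drop calls that have aged out of the window
        if len(self._stamps) >= self.max_calls:
            return False
        self._stamps.append(now)
        return True


def run_with_time_budget(cmd: list[str], timeout_s: float) -> tuple[bool, str]:
    """Run a subprocess under a wall-clock budget and report timeouts explicitly."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    return done.returncode == 0, done.stdout


if __name__ == "__main__":
    print(run_with_time_budget([sys.executable, "-c", "print('ok')"], timeout_s=5.0))
```

Returning an explicit "timed out" result, rather than letting the exception vanish into a retry loop, keeps budget exhaustion distinguishable from a model mistake in the trajectory.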

Pick the model that performs in the harness. Own the harness. Measure the trajectory. The benchmark is not the product.

The harness is the product.