DEV Community

Harry Floyd

Posted on • Originally published at harryfloyd.substack.com

Your Benchmark Measures a Sprint. Your Agent Runs a Marathon.

You gave the overnight job to the cheaper model, and in the morning the work was half done. Not broken in a way you'd catch at a glance: the agent slipped at step nine, built three more steps on top of what it broke, and never noticed. Half done, and confident about it.

You had a good reason to trust it. GLM-5.2 had shipped with open MIT-licensed weights and a score that beat GPT-5.5 on SWE-bench Pro, the coding benchmark every team quotes. Frontier-grade, yours to host, at a fraction of the price.

The number that would have warned you shipped in the same release. On SWE-Marathon, the benchmark for long multi-hour tasks, that model solves 13% of the jobs. Opus 4.8 solves 26%. Close on the short tasks, half the work on the long ones. Both numbers went out together, one scroll apart.

The sprint board hides the gap

Most benchmarks are sprints: SWE-bench, Terminal-Bench, the coding boards that set the market's sense of who leads. Single-session, bounded tasks that a model finishes in one push. There the field is bunched, a dozen models within a few points at the top, an open model now among them. Read only that board and the race looks over.

SWE-Marathon measures longer work: 20 tasks, each multi-hour, each run in its own environment and graded against a human reference and a multi-layer test suite. The average attempt burns 27 million tokens. There the field stops being bunched. Opus 4.8 solves roughly a quarter of tasks (26%). Everyone else sits far back: Opus 4.7 at 16%, then GLM-5.2 at 13% and GPT-5.5 at 12%, half of Opus 4.8's rate or less. No agent clears 30%.

The open model beats GPT-5.5 on the sprint and edges it on the marathon too, yet both land at about half of what Opus 4.8 solves. GPT-5.5 is proprietary and frontier, and it caves on the long task just the same. The divide runs between sprint and marathon, not between open weights and closed.

Stretch a task out far enough and you see how it breaks: weak self-verification, calling a half-finished job done, never recovering after one wrong step. On nearly one attempt in seven, a run fakes its way past the verifier instead of doing the work. A short task rarely leaves room for any of it. A long one leaves room for all of it.

The gap is arithmetic

A long task succeeds only if its steps survive in sequence. So small per-step gaps stop adding and start multiplying. Two models at 96% and 93% per-step reliability look close on a five-step task (about 82% versus 70%, a ratio near 1.2) and finish over three times farther apart on a forty-step one (about 20% versus 5%, a ratio near 3.6).
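The multiplication is easy to check by hand. Here is a minimal sketch, assuming every step succeeds independently with the same probability, which is a simplification and not how any benchmark scores a run. The function names are mine, chosen for illustration.

```python
def run_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumes steps are independent and equally reliable (a simplification).
    """
    return per_step ** steps


def gap_ratio(p_a: float, p_b: float, steps: int) -> float:
    """How many times more often model A finishes than model B."""
    return run_success(p_a, steps) / run_success(p_b, steps)


# Compare the two per-step reliabilities from the text at both task lengths.
for steps in (5, 40):
    a = run_success(0.96, steps)
    b = run_success(0.93, steps)
    print(f"{steps:>2} steps: {a:.1%} vs {b:.1%}, ratio {gap_ratio(0.96, 0.93, steps):.2f}x")
```

At five steps the ratio comes out near 1.2; at forty it is near 3.6, about three times wider, with no change to either model.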

Two forces bend that curve without repealing it. Recovery softens it: a good harness catches mistakes before they compound. Correlated failure sharpens it: one wrong step poisons the steps after it. The odds still fall as the run gets longer; a good harness only raises where they start.
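Recovery can be folded into the same arithmetic. A rough sketch, assuming a harness that catches and repairs a fixed fraction of failed steps (the `catch_rate` parameter is hypothetical, not a measured figure), and leaving correlated failure unmodelled:

```python
def effective_step(per_step: float, catch_rate: float) -> float:
    """Per-step reliability once the harness repairs a share of failures.

    catch_rate is the assumed fraction of failed steps that get caught and fixed.
    """
    return per_step + (1.0 - per_step) * catch_rate


def run_success_with_recovery(per_step: float, catch_rate: float, steps: int) -> float:
    """Chance the whole chain survives, steps still treated as independent."""
    return effective_step(per_step, catch_rate) ** steps


# Same 93% model, with and without a harness that catches half its mistakes.
for steps in (40, 80):
    bare = run_success_with_recovery(0.93, 0.0, steps)
    wrapped = run_success_with_recovery(0.93, 0.5, steps)
    print(f"{steps} steps: bare {bare:.1%}, with recovery {wrapped:.1%}")
```

Catching half the mistakes lifts the forty-step odds from about 5% to about 24%, but doubling the run to eighty steps drops them to about 6% again. The curve moves up; it does not flatten.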

Why the marathon stays scarce

The model layer has commoditised: open weights ship at a fraction of frontier price, and sprint-grade coding is broadly available. Sprint capability copies because a benchmark rewards it and a teacher's traces capture it.

Marathon reliability resists. It is not one thing in the weights. It is the model, the harness around it, the verifier, and the context discipline holding together across hundreds of steps, where any single link breaks the run.

The bet here is that the durable scarcity is the system: a verifier that checks the agent's own work, a planner that holds the goal across hours, a run that can checkpoint and recover. The scaffolding is portable, and it lifts a cheap model's marathon odds further than bigger weights would. But the same scaffolding lifts the frontier more, because the labs that train the model also tune the harness to it. You can copy the scaffolding. You cannot copy that co-design.

Price the whole run

A marathon does not bill by the sticker. It bills by tokens times length times retries. Cost per finished job = cost per attempt ÷ solve rate. At the open model's 13% finish rate that's roughly eight attempts per clean success; at the frontier's 26%, closer to four.
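That formula fits in a few lines. A sketch under two assumptions of mine: attempts are independent, and a failed attempt costs as much as a successful one. No real prices are used; the break-even helper only compares solve rates.

```python
def attempts_per_success(solve_rate: float) -> float:
    """Expected attempts per clean success, assuming independent attempts."""
    if not 0.0 < solve_rate <= 1.0:
        raise ValueError("solve_rate must be in (0, 1]")
    return 1.0 / solve_rate


def cost_per_finished_job(cost_per_attempt: float, solve_rate: float) -> float:
    """Cost per finished job = cost per attempt / solve rate."""
    return cost_per_attempt * attempts_per_success(solve_rate)


def break_even_price_ratio(cheap_rate: float, frontier_rate: float) -> float:
    """Fraction of the frontier's per-attempt cost below which the cheap model
    is cheaper per finished job."""
    return cheap_rate / frontier_rate


print(f"open model: {attempts_per_success(0.13):.1f} attempts per success")
print(f"frontier:   {attempts_per_success(0.26):.1f} attempts per success")
print(f"break-even price ratio: {break_even_price_ratio(0.13, 0.26):.2f}")
```

With the solve rates above, the break-even ratio is 0.5: the open model is cheaper per finished job only if a full attempt on it, tokens and retries included, costs less than half a frontier attempt.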

So count the steps before you pick the model. A bounded edit a model finishes in one pass is a sprint: route it to the cheap model. A job running unattended across many steps and tool calls is a marathon: route it to the frontier, or wrap the cheap one in scaffolding. Or cut the marathon into checkpointed chunks shorter than the model's coin-flip step count (the chain length at which its odds of finishing fall to 50%), so a run that dies as one long chain can finish as a string of short ones. Splitting the job is the cheapest reliability you can buy.
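Both the coin-flip step count and the payoff from chunking can be estimated from per-step reliability alone. A sketch, assuming independent steps, a checkpoint verifier that always detects a failed chunk, and a failed chunk billed in full before it is retried. All three are simplifications, and the function names are illustrative.

```python
import math


def coin_flip_steps(per_step: float) -> int:
    """Longest chain whose odds of finishing are still at least 50%."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    return math.floor(math.log(0.5) / math.log(per_step))


def expected_step_cost(per_step: float, total_steps: int, chunk_size: int) -> float:
    """Expected steps executed to finish the job when each chunk is retried
    from its checkpoint until it passes."""
    cost = 0.0
    remaining = total_steps
    while remaining > 0:
        k = min(chunk_size, remaining)
        # k steps per attempt, and 1 / p**k attempts expected per chunk.
        cost += k / per_step ** k
        remaining -= k
    return cost


print(coin_flip_steps(0.96), coin_flip_steps(0.93))
print(f"one 40-step chain: {expected_step_cost(0.93, 40, 40):.0f} steps expected")
print(f"five 8-step chunks: {expected_step_cost(0.93, 40, 8):.0f} steps expected")
```

For the 93% model the coin-flip point is nine steps. Run as one forty-step chain, the job costs around 730 executed steps on average before it finishes clean; cut into eight-step chunks, around 71.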

The Marathon Calculator runs the numbers in your browser — enter your per-step reliability and task length, and it marks where the sprint benchmark stops predicting and the job becomes a reliability problem.

The falsifier is clean: an open-weight model that matches the frontier on a mature long-horizon benchmark while sitting at sprint parity would show the gap is a lag, not a structure. Through the end of 2026 I expect the best open-weight model to keep trailing by double-digit resolve-rate points on SWE-Marathon. The way that turns out wrong is a scaffolding story, not a base-weights one.

Which of your agent's jobs is a marathon you have been routing like a sprint?
