Originally published on NextFuture
Frontier AI agents keep scoring much lower in published evaluations than vendor demos suggest. Across ten reports released between May 22 and May 27, 2026 (from IBM and Artificial Analysis, from ArXiv preprints by teams at OpenAI, Anthropic, and academic labs, and from independent practitioners on Dev.to), the median agent score on production-style tasks sits between 50 and 65 percent. Coding is the exception: Codex CLI clears 82 percent on terminal tasks, and Claude Code reaches 81.3 percent accepted specifications on CLEVER. Outside coding, the headline number is below the line a deployment review would approve.
TL;DR: the numbers
| Benchmark | Best score | Task scale | Source |
| --- | --- | --- | --- |
| ITBench-AA (agentic enterprise IT) | under 50% | Frontier models, multiple ops domains | IBM + Artificial Analysis, May 27 |
| OSV-Bench (kernel spec generation) | 55.10% Pass@1 | 245 Hyperkernel tasks | BODHI, ArXiv May 26 |
| HealthBench Professional | 0.6272 (62.7%) | n=525, non-fine-tuned LLM | MDIA, ArXiv May 26 |
| Terminal-Bench 2.0 (Codex CLI Goal mode) | 82.7% | Multi-hour unattended terminal tasks | Owen Fox, Dev.to May 25 |
| CLEVER (Lean 4 verifiable code, Claude Code) | 98.8% valid specs / 81.3% accepted | Theorem-proving framework | Agentic Proving, ArXiv May 25 |
| Long-context reasoning audit | 0 of 11 benchmarks control position | 11 long-context suites audited | Positional Failures, ArXiv May 25 |
| Multi-LLM spec generation | 13 LLMs tested, 6 local-capable | Real codebase (excalidraw) | thlandgraf, Dev.to May 25 |
| Persona-scaled RL agents | 17x above chance, 22x faster than LLM baseline | 300-persona life-sim benchmark | One Policy Infinite NPCs, ArXiv May 25 |
Eight rows, drawn from independent reports published in a six-day window. Methodology and the two additional benchmarks reviewed appear below.
How this comparison was assembled
This post aggregates measurement-bearing reports published between May 22 and May 27, 2026. Each source had to report a specific score, a Pass@k number, a task-count denominator, or a controlled comparison. Demo writeups, syndicated press, and capability claims without a denominator were excluded.
Inclusion: original benchmark, named dataset, numeric result, or audit of N prior benchmarks; published in the window above.
Exclusion: vendor marketing pages, single-anecdote threads, unreplicated single-task wins, papers with a Pass@k but no baseline.
Normalization: scores left in source units. HealthBench's 0.6272 is reported alongside the percent equivalent. "Frontier models" in ITBench-AA refers to the top closed-weight tier the authors evaluated.
Two additional benchmarks reviewed but not tabled: FastKernels (GPU kernel generation, argues current benchmarks reward replicating known optimizations rather than discovering new ones), and Energy per Successful Goal (proposes that the right denominator for agentic systems is the user goal, not the model invocation). Both reshape how the headline numbers should be read.
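To make the goal-versus-invocation denominator concrete, here is a minimal Python sketch. The `Attempt` record, the cost values, and both function names are illustrative assumptions and are not taken from the Energy per Successful Goal paper; the cost unit (energy or dollars) is left open.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    goal_id: str     # the user goal this invocation was working toward
    cost: float      # cost of this single invocation (unit is up to you)
    succeeded: bool  # whether this invocation completed the goal


def cost_per_invocation(attempts):
    """Average cost with the model invocation as the denominator."""
    return sum(a.cost for a in attempts) / len(attempts)


def cost_per_successful_goal(attempts):
    """Total cost divided by the number of distinct goals that were met."""
    total = sum(a.cost for a in attempts)
    met = {a.goal_id for a in attempts if a.succeeded}
    if not met:
        raise ValueError("no goal succeeded; the ratio is undefined")
    return total / len(met)


# Hypothetical log: one goal needs six attempts, another succeeds first time.
log = [Attempt("goal-a", 1.0, False) for _ in range(5)]
log.append(Attempt("goal-a", 1.0, True))
log.append(Attempt("goal-b", 1.0, True))

print(cost_per_invocation(log))       # cost seen per billed call
print(cost_per_successful_goal(log))  # cost seen per user outcome
```

In this made-up log every call costs the same, so the per-invocation figure looks flat, while the per-goal figure also counts the failed attempts that a user outcome had to absorb.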
Production task scores: why nothing clears 70 percent
The three benchmarks that came closest to a production deployment scenario (enterprise IT operations in ITBench-AA, kernel specification in OSV-Bench, clinical reasoning in HealthBench Professional) all landed at or below 63 percent for the strongest published configuration, from under 50 percent on ITBench-AA to 62.7 percent on HealthBench Professional. The spread is narrower than the underlying tasks suggest, because none of the suites awards partial credit on multi-step trajectories: a single failed tool call or a hallucinated intermediate spec drops the whole task to zero.
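All-or-nothing scoring compounds quickly. As a rough illustration, assume each step of a trajectory succeeds independently with the same probability p, so a task of n steps succeeds with probability p^n. The independence assumption and the function names below are mine, not something any of the three suites states.

```python
def task_success_rate(step_success_rate: float, n_steps: int) -> float:
    """Chance that every step succeeds, assuming independent, equally reliable steps."""
    if not 0.0 <= step_success_rate <= 1.0:
        raise ValueError("step_success_rate must be in [0, 1]")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return step_success_rate ** n_steps


def required_step_rate(target_task_rate: float, n_steps: int) -> float:
    """Per-step reliability needed to hit a target task-level success rate."""
    if not 0.0 < target_task_rate <= 1.0:
        raise ValueError("target_task_rate must be in (0, 1]")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return target_task_rate ** (1.0 / n_steps)


for steps in (1, 5, 10, 20):
    print(steps, round(task_success_rate(0.95, steps), 3))
```

Under these assumptions, a model that is right 95 percent of the time per step finishes a ten-step task only about 60 percent of the time, which is the range the tabled production scores fall in.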
OSV-Bench is the clearest read. The benchmark contains 245 specification-generation tasks derived from the Hyperkernel OS, and the strongest LLM reaches 55.10 percent Pass@1. That is the ceiling among the models evaluated, not an average. Real OS deployment requires Pass@1 above 95 percent or human review on every output, which is what the BODHI paper effectively concedes by adding a domain-knowledge layer.
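For readers less familiar with the metric, here is a sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether BODHI computes Pass@1 this way is not stated in this post, so treat the function names and the example counts as assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Mean pass@k over tasks; results is a list of (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per task, 135 of 245 tasks solved.
one_shot = [(1, 1)] * 135 + [(1, 0)] * 110
print(round(100 * benchmark_pass_at_k(one_shot, k=1), 2))
```

With one sample per task, Pass@1 reduces to the fraction of tasks solved. The 135-of-245 split is an inference from the reported percentage and task count, not a figure quoted from the paper.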
HealthBench Professional shows the same shape. MDIA, a seven-node specialty-routed pipeline, reaches 0.6272 under OpenAI's GPT grading on the full n=525. The architecture matters more than the prompt — but even with architecture, the ceiling sits below two-thirds.
Coding agents: the only category clearing the bar
Coding agents are the outlier. Codex CLI's Goal mode reports 82.7 percent on Terminal-Bench 2.0, an unattended multi-hour task suite. Claude Code's agentic proving framework on CLEVER hits 98.8 percent valid specifications and 81.3 percent accepted under isomorphism checks — the highest absolute number in the corpus. The same week, an independent test gave 13 LLMs the same real codebase (excalidraw) and asked each for a specification tree; six ran on a laptop, hinting that the local-model side of the gap is closing.
Why does coding outperform every other agentic category? Three reasons surface across the reports. Code has a compiler, so the reward signal is sharper than the human-graded scores used in healthcare and enterprise IT. The task surface is mature — Terminal-Bench is on version 2.0, CLEVER builds on Lean 4 tooling — so vendors have had cycles to tune. And the user is technical, so partial successes still ship value while the trajectory recovers. Inside the coding category, the eight-way terminal CLI ecosystem roundup we published this month shows unattended-mode wins do not translate cleanly to supervised pair-programming throughput.
When the headline number lies
The 82.7 percent on Terminal-Bench 2.0 will be quoted everywhere this quarter. It is real, and it is also narrower than it reads. Codex CLI's Goal mode is the unattended-runtime configuration tuned for multi-hour terminal tasks, not a general developer-day workload. The same agent in supervised pair-programming mode trades the unattended autonomy for tighter oversight and a different score profile. Worse, an ArXiv paper from the same week, Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks, demonstrates that single-process, asyncio-driven benchmarking utilities introduce client-side queuing bottlenecks that distort reported throughput and latency numbers under load. The Positional Failures audit makes a parallel argument for reasoning: 0 of 11 long-context benchmarks jointly control task position, filler content, and context length, which means quoted long-context scores routinely overstate the model's actual reach.
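What "jointly control" means in practice can be sketched as a full factorial grid: every combination of needle position, filler type, and context length gets its own test condition. The helper names, depth values, lengths, and filler sentences below are hypothetical and are not the audit's actual protocol.

```python
from itertools import product

DEPTHS = (0.0, 0.5, 1.0)    # relative position of the key fact in the context
LENGTHS = (10, 100, 1000)   # context length, counted in filler sentences
FILLERS = {
    "repetitive": ["The sky was grey that morning."],
    "varied": [
        "Invoices are due monthly.",
        "The server rack hums.",
        "Tea cools quickly.",
    ],
}


def build_context(needle: str, filler: list[str], n_sentences: int, depth: float) -> list[str]:
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) among the filler."""
    if not filler:
        raise ValueError("filler must not be empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    body = [filler[i % len(filler)] for i in range(n_sentences)]
    index = int(depth * n_sentences)
    return body[:index] + [needle] + body[index:]


def condition_grid():
    """Every (depth, filler kind, length) combination, so no factor is confounded."""
    return list(product(DEPTHS, FILLERS, LENGTHS))


for depth, kind, length in condition_grid():
    context = build_context("The access code is 4417.", FILLERS[kind], length, depth)
    # A real harness would send " ".join(context) plus a question to the model here.
```

A benchmark that varies only one of the three factors cannot tell whether a miss comes from where the fact sits, what surrounds it, or how long the context is.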
Verdict by builder profile
Solo dev shipping side projects: Pick a coding agent — Codex CLI for unattended terminal work (82.7% Terminal-Bench 2.0), Claude Code where verifiability matters (98.8% on CLEVER). Outside coding, do not trust the headline number; run your own 20-task spot check before committing.
Team of 5-20 with budget pressure: Treat agentic-ops claims as marketing until you see Pass@k on your own task distribution. ITBench-AA's sub-50 percent ceiling on enterprise IT is the realistic prior, not the vendor demo. Pair that with the nine production failure modes catalogued from May engineering blogs before you sign a seat-based contract.
Cost-sensitive batch workload: The Energy per Successful Goal paper argues invocation-level pricing misrepresents agentic cost — six retries on one goal is one user outcome but six billed completions. Price your workload at the goal denominator.
Latency-critical user-facing app: Long-context reasoning is the weakest link in current evaluations. Until benchmarks control task position, assume the model can lose material at any context length beyond what you have validated yourself.
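A note on the 20-task spot check recommended for solo developers: a sample that small gives a wide error bar, and it helps to see how wide. The sketch below uses the Wilson score interval at roughly 95 percent confidence; the choice of interval and the example counts are my assumptions, not part of any cited report.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical spot check: 13 of 20 tasks pass.
low, high = wilson_interval(13, 20)
print(f"observed 65%, plausible range {low:.0%} to {high:.0%}")
```

With 13 of 20 passing, the interval runs from roughly 43 to 82 percent. That is enough to separate a 65 percent agent from a 95 percent one, but not from a 75 percent one, so treat the spot check as a sanity filter and not a ranking.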
Sources reviewed
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — IBM + Artificial Analysis on Hugging Face, May 27, contributed the sub-50 percent ceiling on agentic IT.
BODHI: Precise OS Kernel Specification Inference — ArXiv, May 26, contributed the 55.10% Pass@1 ceiling on OSV-Bench's 245 tasks.
MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional — ArXiv, May 26, contributed the 0.6272 score on n=525.
Agentic Coding in 2026: Claude Code vs Codex CLI vs Gemini CLI vs Cursor Agent — Owen Fox, Dev.to, May 25, contributed the Codex CLI 82.7% on Terminal-Bench 2.0.
Agentic Proving for Program Verification — ArXiv, May 25, contributed Claude Code's 98.8% / 81.3% on CLEVER.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks — ArXiv, May 25, contributed the 11-benchmark audit on long-context evaluation.
I Gave 13 LLMs the Same Codebase and Asked for a Specification. Six Ran on My Laptop. — Dev.to, May 25, contributed the 13-LLM multi-model spec comparison.
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies — ArXiv, May 25, contributed the 17x-above-chance and 22x-faster numbers on the 300-persona life-sim benchmark.
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks — ArXiv, May 26, contributed the measurement-bias argument against asyncio benchmarking utilities.
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems — ArXiv, May 25, contributed the goal-level cost denominator.
FAQ
Did anyone run these benchmarks here?
No. This post aggregates ten published reports from May 22 to May 27, 2026. Each row in the TL;DR table cites the original source. The synthesis is the contribution — no claim in this post comes from a private benchmark or a re-run.
Why aggregate instead of running one definitive benchmark?
Single benchmarks lie. The Positional Failures audit and the Production LLM Measurement Bias paper from the same week make the case explicitly: benchmark utilities, position controls, and task framing each introduce errors large enough to flip a ranking. Aggregating ten independent reports surfaces the median behavior and the spread, which is more decision-useful than one heroic run.
How current are these numbers?
All ten sources were published between May 22 and May 27, 2026. Tool versions cited: Terminal-Bench 2.0, Lean 4 (CLEVER), OSV-Bench (Hyperkernel), HealthBench Professional. Expect the coding-agent leaders to move 3-8 percentage points within 90 days; the agentic-ops ceiling will move slower, because the dataset and grading work is harder.
What's missing from this cut?
Cost-per-task numbers in dollar terms. The May 2026 corpus reports task-count denominators and energy denominators but rarely a clean dollar-per-successful-goal figure. Aggregating that gap is the next post in this series.