DEV Community

Itay Maman

Why I Wouldn't Act on SkillsBench

I came across SkillsBench (paper, Feb 2026) while watching Theo, and was genuinely excited. It asks two critical questions: do curated procedural documents ("Skills") actually help coding agents, and which coding agent utilizes them best? The headline number, +16.2pp from curated Skills, felt immediately actionable.

Then I started pulling at the methodology, and things unraveled.


Setup

SkillsBench is ambitious in scope: 84 tasks, 11 domains, 7 coding agents, 7,308 trajectories. It evaluates tasks under three conditions: no Skills, curated (expert-written) Skills, and self-generated Skills. Each task ships with a fixed Skill package (markdown instructions, sometimes with scripts or templates) provided to the agent alongside the task.


The leaderboard

In every benchmark the central outcome is the leaderboard. Here, that's Finding 2 (§4.1.1), which crowns Gemini CLI + Flash for best raw performance (48.7%) and Claude Code + Opus 4.5 for largest uplift (+23.3pp). This is a legitimate result, though Flash beating Opus 4.5/4.6 is a bit surprising.

The more interesting question is what the leaderboard actually measures. To answer that, consider the actual mechanism behind Skills: they are prompt pieces loaded into the context on demand. So the pass rate shown in the leaderboard doesn't tell us which coding agent uses Skills best. It tells us which agent performed best, but not whether the Skills mechanism made any difference, or whether placing the same content directly in the prompt would have achieved the same result.

The experiment that would settle this: inject the same Skill content directly into the prompt (baseline) vs. let the harness load Skills through its native discovery mechanism. That experiment isn't in the paper — and it's the one that would justify a benchmark titled "SkillsBench."

The leaderboard aside, the paper makes several claims about how and why Skills help.


Here things get complicated

Skill design findings are confounded by task identity

Two of the paper's design-oriented findings sound practical:

  • 2–3 Skills are optimal (+18.6pp); 4+ Skills show diminishing returns (+5.9pp). (Finding 5, §4.2.1)
  • Moderate-length Skills outperform comprehensive ones — detailed (+18.8pp) and compact (+17.1pp) beat comprehensive (–2.9pp). (Finding 6, §4.2.2)

The problem: In the experiment design, each task ships with a fixed Skill package, so Skill count and Skill complexity are properties of the task. Hence, the experiment presented in the paper cannot isolate the effect of "number of Skills" from the effect of "which task this is." A task that happens to need 4+ Skills is a different task than one that needs 1. The paper stratifies post-hoc by Skill count and draws causal language ("optimal," "diminishing returns"), but the design doesn't support that inference.

The same applies to complexity. The N=140 "comprehensive" bucket that shows –2.9pp could simply contain harder tasks. Without controlling for task difficulty — or better, varying Skill count/complexity within a task — these are correlational observations dressed as design guidelines.

The domain-level claims rest on tiny sample sizes

The paper's most striking result is the domain breakdown (Table 4): Healthcare leads at +51.9pp, Manufacturing at +41.9pp. These numbers anchor the paper's claim that domains with knowledge "underrepresented in model pretraining" benefit most from Skills (Finding 4, §4.1.3).

But Healthcare has 2 tasks and Manufacturing has 3. A single outlier task — and several individual tasks swing by 70–85pp — can dominate an entire domain's aggregate. With N=2, you're not measuring a domain effect; you're measuring two tasks. The paper reports these figures without confidence intervals at the domain level and without flagging the sample size issue.
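To make the fragility concrete, here's a toy calculation (all numbers invented, not taken from the paper): with N=2, a single outlier task moves the domain mean by half of its own swing.

```python
# Toy illustration (numbers are made up, not from the paper):
# with only 2 tasks in a domain, one outlier dominates the mean.

def domain_uplift(per_task_uplifts_pp):
    """Domain-level uplift = mean of per-task uplifts (percentage points)."""
    return sum(per_task_uplifts_pp) / len(per_task_uplifts_pp)

typical_task = 20.0   # a modest per-task uplift
outlier_task = 80.0   # one large swing, in the 70-85pp range noted above

with_outlier = domain_uplift([typical_task, outlier_task])
without_outlier = domain_uplift([typical_task, typical_task])

print(f"with outlier:    +{with_outlier:.1f}pp")   # +50.0pp
print(f"without outlier: +{without_outlier:.1f}pp")  # +20.0pp
```

One task flips the domain from "modest effect" to "headline result," which is exactly why per-domain confidence intervals (or at least a per-task breakdown) are needed before reading Table 4 as a domain effect.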

For comparison, Software Engineering (N=16) shows +4.5pp — a much more defensible estimate, but also a much less exciting one.

The other findings restate what we already know about prompting

We noted that Skills are lazily loaded prompt pieces. With that in mind, try the thought experiment of replacing "Skills" with "prompt" in the remaining findings:

  • Finding 1 (§4.1.1): curated Skills improve performance → curated, expert-written prompts improve performance.
  • Finding 7 (§4.2.3): smaller model + Skills can exceed larger model without Skills → a smaller model with a good prompt can outperform a larger model with a mediocre prompt.

Neither of these is surprising. The prompting literature has established both points.

Finding 3 (§4.1.1) — self-generated Skills provide no benefit — is slightly more interesting. Meta-prompting (using a model to generate its own prompts) is a real technique that works in some settings, so this finding could have been novel.

But the likely dynamic here is more mundane: for tasks where the model lacks domain knowledge, it can't write effective Skills because it lacks the knowledge. For tasks where the model already has the domain knowledge, the marginal contribution of a Skill is minimal. Either way, performance doesn't improve when the model writes its own Skills. Do the same substitution exercise again and you get "performance doesn't improve when the model provides its own context" — which is not surprising.


What would make this credible

The paper asks the right questions but doesn't yet have the experiments to answer them. Some of these came up above; here they are in one place.

Isolate the mechanism. A benchmark called "SkillsBench" should measure whether the Skills machinery matters — not just whether the Skills content helps. The cleanest test: take the same Skill content and inject it directly into the prompt (baseline) vs. let the harness load it through its native discovery mechanism. If native loading wins, the Skills architecture is doing real work. If the results are equivalent, Skills are just a packaging format for prompt content — useful, but not what the paper claims to measure.

Isolate the content. A harder but complementary experiment would inject the same token count of topically relevant non-procedural text (API docs, reference material) to test whether procedural structure specifically drives the gains.

Vary Skills within tasks, not across them. The Skill design findings (count, complexity) currently can't be separated from task identity. Run the same task multiple times, each time with a different number of Skills, and measure the delta within each task. Same goes for complexity — give the agent a compact Skill vs. an exhaustive one for the same task, and see what happens. This turns correlational observations into actual design guidance.
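A toy demonstration of why the within-task design matters (all numbers here are invented): if tasks that ship with 4+ Skills are simply harder, a cross-task bucket comparison reproduces a "diminishing returns" pattern even when Skill count has no causal effect at all.

```python
# Toy confound demonstration (all numbers are invented).
# Suppose per-task uplift is driven by task difficulty, not Skill count:
# easy tasks leave headroom for Skills to help, hard tasks don't.
tasks = [
    # (skill_count, uplift_pp)
    (2, 20), (2, 18), (3, 19),   # easier tasks, which happen to ship 2-3 Skills
    (4, 6),  (5, 5),  (6, 7),    # harder tasks, which happen to ship 4+ Skills
]

def avg(xs):
    return sum(xs) / len(xs)

few_bucket = avg([u for n, u in tasks if n <= 3])
many_bucket = avg([u for n, u in tasks if n >= 4])

print(f"mean uplift, 2-3 Skills: +{few_bucket:.1f}pp")   # looks "optimal"
print(f"mean uplift, 4+ Skills:  +{many_bucket:.1f}pp")  # looks "diminishing"
# The bucketed comparison mimics the paper's +18.6pp vs +5.9pp pattern,
# yet in this toy data Skill count has zero causal effect. Only varying
# Skill count within the same task would break the confound.
```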

Test with a fixed Skill library. In the current setup each task gets its own hand-picked Skill package — the agent always has exactly the right Skills for the job. In practice, you write a set of Skills once and they sit there for every task. The interesting experiment is: give the agent a fixed library of, say, 20–30 Skills across all tasks and see if it can discover and apply the right ones. That tests Skill selection, not just Skill consumption — which is the harder and more realistic problem.


Bottom line

My recommendation: don't act on this paper in its current form. If you're investing in Skills for your agents today, calibrate that investment based on your own trial and error, not on this study's findings.
