A massive new study titled SKILLSBENCH has just been released, and it’s a must-read for anyone building or using AI agents. As LLMs evolve into autonomous agents, the industry is racing to find the best way to help them handle complex, domain-specific tasks without the high cost of fine-tuning.
The answer? Agent Skills: modular packages of procedural knowledge (instructions, code templates, and heuristics) that augment agents at inference time.
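To make the mechanism concrete, here is a minimal sketch of inference-time augmentation, assuming each skill is a folder containing a SKILL.md instruction file (the convention Claude Code's Agent Skills use; the exact layout and the `build_prompt` helper here are illustrative, not from the paper):

```python
from pathlib import Path

def build_prompt(task: str, skill_dirs: list[Path]) -> str:
    """Prepend each skill's SKILL.md instructions to the task prompt.

    No fine-tuning involved: the procedural knowledge is injected
    as plain text into the context at inference time.
    """
    sections = []
    for skill in skill_dirs:
        doc = (skill / "SKILL.md").read_text()
        sections.append(f"## Skill: {skill.name}\n{doc}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)
```

The composed string would then be sent as (part of) the agent's context; swapping skill folders in and out changes the agent's capabilities without touching model weights.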
📊 The Study at a Glance
Researchers tested 7 agent-model configurations (including Claude Code, Gemini CLI, and Codex) across 84 tasks spanning 11 domains. They compared three conditions:
No Skills: The agent flies solo with just the task instructions.
Curated Skills: Human-authored, high-quality procedural guides.
Self-Generated Skills: The agent is asked to write its own guide before starting.
💡 Key Takeaways
Curated Skills are a Game Changer: Adding human-curated Skills boosted average pass rates by 16.2 percentage points. In specialized fields like Healthcare and Manufacturing, the gains were massive (up to +51.9pp).
AI Cannot Grade Its Own Homework: "Self-generated" Skills provided zero benefit on average. Models often fail to recognize when they need specialized knowledge, or they produce vague, unhelpful procedures.
Smaller Models Can "Punch Up": A smaller model (like Haiku 4.5) equipped with Skills can actually outperform a much larger model (like Opus 4.5) that doesn't have them.
Less is More: Focused Skills with only 2-3 modules outperformed massive, "comprehensive" documentation. Too much info creates "cognitive overhead" for the agent.
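The "less is more" finding implies a selection step: rather than loading every available module into context, pick only the few most relevant to the task. The paper doesn't prescribe a mechanism for this; the toy keyword-overlap scorer below is purely illustrative, with made-up module names:

```python
def select_modules(task: str, modules: dict[str, str], k: int = 3) -> list[str]:
    """Rank skill modules by word overlap with the task; keep the top k.

    `modules` maps a module name to its description text. A real system
    might use embeddings instead; plain word overlap keeps the sketch simple.
    """
    task_words = set(task.lower().split())
    scored = sorted(
        modules,
        key=lambda name: -len(task_words & set(modules[name].lower().split())),
    )
    return scored[:k]
```

Capping `k` at 2-3 mirrors the study's observation that a small, focused set of modules beats dumping comprehensive documentation into the context.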
🏆 Top Performer
The combination of Gemini CLI + Gemini 3 Flash achieved the highest raw performance, hitting a 48.7% pass rate when equipped with Skills.
🛠 Why This Matters
For developers and enterprise teams, this suggests that human expertise is still the bottleneck. Building a library of high-quality, modular "Skills" is currently a more effective (and cheaper) way to scale AI agent performance than waiting for bigger models or spending a fortune on fine-tuning.
Reference: https://arxiv.org/abs/2602.12670