AI coding agents have a structural weakness. Claude Code, Codex, and OpenHands are good at general problem solving, but they lack domain-specific know-how: how to correctly extract numbers from 89,000 pages of US Treasury documents, or how to find accurate facts in noisy search results. That kind of expertise does not live inside the model.
The current fix is to write "skills" by hand: a SKILL.md file with step-by-step instructions plus helper scripts. Claude Code's skill spec made this format a de facto standard. But writing a new skill every time a new task appears does not scale.
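For readers unfamiliar with the format, a skill is just a directory holding a Markdown file and optional scripts. The example below is invented for illustration; the frontmatter fields follow Claude Code's convention, but the skill name, steps, and script path are hypothetical:

```markdown
---
name: treasury-table-extraction
description: Extract numeric values from scanned Treasury Bulletin tables.
---

# Treasury table extraction

1. Locate the target table by its caption text, not by page position.
2. Cross-check every extracted number against its row and column totals.
3. If the totals disagree, re-read the suspect cell before answering.
```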
In March 2026, Sentient Labs and Virginia Tech released EvoSkill (arXiv:2603.02766). It is a framework that analyzes an agent's failures and generates reusable skills automatically. No model retraining needed. Only the skills evolve.
Why skills are the right level of optimization
Google's AlphaEvolve evolves code. GEPA/DSPy evolves prompts. EvoSkill evolves skills.
Code optimization is tightly bound to a specific model and task; you cannot move it to another environment. Prompt optimization can shift priorities, but it cannot encode multi-step procedures. Skills sit in the middle: they can hold branching logic and verification steps inside a Markdown file, and humans can read them, edit them, or hand them to a different agent as-is.
EvoSkill runs an evolution loop with three specialized agents. The Executor runs tasks and collects failures. The Proposer analyzes failure traces, finds repeated patterns, and suggests new skills. The Skill Builder turns those suggestions into a SKILL.md file plus helper scripts inside .claude/skills/. Candidates are tested on a held-out validation set, and the top-N candidate skill sets survive as the Frontier.
The base model (Claude Opus 4.5) stays frozen throughout. No weight updates at all. Only the skill layer changes.
The Proposer keeps a running history of past suggestions. It knows what worked and what got rejected. This prevents the loop from going in circles. The paper calls this mechanism Textual Feedback Descent.
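The loop described above can be sketched in a few dozen lines. Everything here is a stand-in for illustration: the stub Executor, Proposer, and scoring logic are invented, and the paper's actual implementation is far richer. What the sketch does show is the shape of the loop: execute, propose from failures while consulting a history of past proposals, build a candidate, and keep only the top-scoring skill sets on a held-out validation split.

```python
import random

def run_tasks(skills, tasks):
    """Executor stub: score each task, collect failure traces.
    Here a 'skill' simply whitelists tasks it can solve."""
    failures = [t for t in tasks if t not in skills]
    score = (len(tasks) - len(failures)) / len(tasks)
    return score, failures

def propose(failures, history):
    """Proposer stub: suggest a skill for a repeated failure,
    skipping suggestions already tried (the textual feedback history)."""
    for f in failures:
        if f not in history:
            return f
    return None

def evolve(tasks, val_tasks, iterations=10, frontier_size=3):
    frontier = [set()]   # surviving skill sets: the Frontier
    history = set()      # past proposals: what worked, what got rejected
    for _ in range(iterations):
        parent = random.choice(frontier)
        _, failures = run_tasks(parent, tasks)   # Executor
        suggestion = propose(failures, history)  # Proposer
        if suggestion is None:
            break
        history.add(suggestion)
        candidate = parent | {suggestion}        # Skill Builder
        # Validate on held-out tasks; keep only the top-N skill sets.
        frontier.append(candidate)
        frontier.sort(key=lambda s: run_tasks(s, val_tasks)[0], reverse=True)
        frontier = frontier[:frontier_size]
    return frontier[0]

best = evolve(tasks=["t1", "t2", "t3"], val_tasks=["t1", "t2"])
```

The `history` set is the crude analogue of Textual Feedback Descent: in the real system the Proposer carries natural-language notes about why past candidates succeeded or failed, not just a set of names.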
What this loop produced
OfficeQA is a grounded reasoning benchmark built on US Treasury Bulletin archives (about 89,000 pages going back to 1939). It has 246 questions. Human solvers spend an average of 50 minutes per question (Databricks).
EvoSkill used only 10% of the data for training and raised exact-match accuracy from 60.6% to 67.9% (+7.3 points). The skills it discovered include a strict verification protocol for extracting numbers from financial tables and a Python-based method for CPI inflation adjustment.
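The CPI-adjustment skill presumably reduces to the standard inflation formula: real value = nominal value × (CPI in target year / CPI in source year). A minimal sketch, with index values that are illustrative rather than taken from the paper:

```python
def adjust_for_inflation(nominal, cpi_source_year, cpi_target_year):
    """Convert a nominal dollar amount into target-year dollars
    using the ratio of CPI index values."""
    return nominal * (cpi_target_year / cpi_source_year)

# Approximate CPI-U annual averages, used here only for illustration.
cpi_1939, cpi_2024 = 13.9, 313.7
real_value = adjust_for_inflation(100.0, cpi_1939, cpi_2024)
print(round(real_value, 2))
```

The point of wrapping this in a skill is not the arithmetic but the protocol around it: the agent is told exactly which index to use and to compute the ratio in Python rather than estimate it in its head.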
SealQA showed an even bigger gain. This benchmark tests fact-finding when web search returns noisy and contradictory results. Accuracy went from 26.6% to 38.7%, a jump of +12.1 points. The system discovered a skill called search-persistence-protocol. It lists all reasonable interpretations of ambiguous terms and searches each one separately. It requires at least three independent sources before giving an answer. It tries three or more query phrasings before reporting "not found."
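The three rules of search-persistence-protocol can be sketched as plain control flow. Everything below is hypothetical (the function name, the toy search backend, the phrasing templates); it only illustrates the logic the article describes: enumerate interpretations, retry with alternative phrasings, and refuse to answer without enough independent sources.

```python
def persistent_search(interpretations, search_fn, min_sources=3, max_phrasings=3):
    """Sketch of the skill's rules: search every interpretation of an
    ambiguous term, try several query phrasings, and only keep an
    answer backed by at least `min_sources` independent sources."""
    answers = {}
    for interp in interpretations:
        sources = []
        phrasings = [interp, f'"{interp}"', f"{interp} facts"]
        for query in phrasings[:max_phrasings]:
            sources.extend(search_fn(query))
            if len(set(sources)) >= min_sources:
                break
        if len(set(sources)) >= min_sources:
            answers[interp] = sorted(set(sources))
    return answers or "not found"  # report "not found" only after retries

# Toy search backend returning canned source URLs.
index = {
    "jaguar": ["a.com", "b.com"],
    '"jaguar"': ["c.com"],
    "jaguar (animal)": ["x.org", "y.org", "z.org"],
}
result = persistent_search(
    interpretations=["jaguar", "jaguar (animal)"],
    search_fn=lambda q: index.get(q, []),
)
```

Note how the first interpretation only reaches three sources after the second phrasing; a naive single-query agent would have stopped at two.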
The most striking result came next. The skills evolved on SealQA transferred to BrowseComp (an OpenAI benchmark) with zero modification and improved accuracy by +5.3 points. Skill-level optimization produces transferable capabilities, not task-specific heuristics.
Caveats
There is an author overlap concern. Two of the five EvoSkill authors (Weiyuan Chen and Tu Vu) also authored the SealQA benchmark. The +12.1 point improvement on SealQA was partly evaluated by the people who designed the test. The BrowseComp transfer result (+5.3 points) is on an independent benchmark built by OpenAI, which adds some objectivity.
The official blog says "up to 50% accuracy improvement." The largest gain in the paper is +12.1 points in absolute terms (SealQA, 26.6% to 38.7%). In relative terms, that is about 45%. When citing this work, quote the absolute gain in percentage points to avoid inflating the claim.
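The two framings are easy to conflate, so it is worth spelling out the arithmetic:

```python
before, after = 26.6, 38.7  # SealQA accuracy, percent

absolute_points = after - before           # percentage-point gain
relative_gain = (after - before) / before  # fractional improvement
```

An absolute gain of 12.1 points on a 26.6% baseline is a roughly 45% relative improvement; the blog's "up to 50%" evidently rounds the relative figure.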
The paper is an arXiv preprint as of March 2026. No peer-reviewed publication has been confirmed. GitHub has 160 stars and 12 forks. No independent reproduction by third parties has been reported yet.
Training beyond 10% of the data showed diminishing returns. The 15% split scored lower than 10%, suggesting mild overfitting. However, merging skills discovered in separate runs produced the strongest configuration overall.
Conclusion
Improving AI agents used to mean two things: retrain the model, or write skills by hand. EvoSkill offers a third option. The agent learns from its own failures, generates skills automatically, and those skills transfer to tasks it has never seen.
OfficeQA +7.3 points. SealQA +12.1 points. BrowseComp +5.3 points with zero-shot transfer. The model weights were never touched. Only the skill layer, made of Markdown instructions and helper scripts, evolved.
Prompt engineering is hitting its ceiling. Code optimization does not transfer. Skills are the optimization layer in between, and they are human-readable. The paper is still pre-review, but the idea that agents could share skill libraries the way developers share packages is worth watching.



