Kimi-K2.5 scores 21.4% on a difficult benchmark. GPT-5.2 scores 41.1%. A team gave Kimi-K2.5 a system that rewrites its own skill files after every failure. No fine-tuning. No new training data. After a few rounds, it hit 40.6%. That is GPT-5.2 territory.
The paper is MetaClaw (arXiv:2603.17187). It is one of five papers published between March 15 and March 20 that all arrived at the same idea: let agents evolve their own skills.
A second team from UCL and HKUST Guangzhou published Memento-Skills (arXiv:2603.18743). They froze Gemini-3.1-Flash and only let it edit external Markdown files. On Humanity's Last Exam (HLE), a benchmark of 2,500 expert-level questions, accuracy went from 17.9% to 38.7%. More than double. The model never changed. The skill files did.
EvoSkill, Automating Skill Acquisition, and AgentFactory shipped the same week. Five teams. Different methods. Same direction.
Why rewriting skill files works better than fine-tuning
Fine-tuning is expensive. You need data, GPUs, and a pipeline. The result is a new model checkpoint that is hard to inspect, hard to edit, and locked to one provider. Skill files are the opposite. They are plain text. A developer can read them, edit them, and move them to a different model.
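Neither paper publishes its exact file format, so the following is a hypothetical sketch of what one evolved skill file might look like. Every name and section heading here is an assumption, not taken from the papers:

```markdown
# Skill: unit-conversion-chemistry
## When to use
Questions that mix molar masses with SI prefixes.
## Steps
1. Convert every quantity to base SI units before computing.
2. Carry units through each arithmetic step and cancel them explicitly.
## Learned from failures
- Two wrong answers came from mixing mmol and mol. Normalize first.
```

Because the file is plain text, a developer can diff each revision and see exactly what the agent changed after a failure.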
Memento-Skills keeps the model frozen and puts all learning into a loop called Read-Write Reflective Learning. The agent picks a skill file, runs the task, and checks the result. If it fails, the system finds the file that caused the failure and rewrites it. If it succeeds, the skill gets a higher score.
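The loop above can be sketched in a few lines. This is a minimal, hypothetical rendering of Read-Write Reflective Learning, not the paper's actual code; `select`, `run`, and `rewrite` stand in for the router, the task executor, and the LLM-driven file rewriter:

```python
# Hypothetical sketch of the Read-Write Reflective Learning loop
# described in Memento-Skills. All names are assumptions.
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    body: str           # Markdown instructions the agent reads
    score: float = 0.0  # success count, used to rank skills at retrieval


def reflective_step(task, skills, select, run, rewrite):
    """One iteration: read a skill, attempt the task, write back."""
    skill = select(task, skills)       # Read: router picks a skill file
    result = run(task, skill)          # Act: attempt the task with it
    if result.success:
        skill.score += 1.0             # Reinforce the skill that worked
    else:
        # Write: revise the file blamed for the failure
        skill.body = rewrite(skill, result.trace)
    return result
```

The key property is that all learning lands in `skill.body` and `skill.score`, both plain data, while the model behind `run` and `rewrite` stays frozen.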
The hard part is picking the right skill. With 235 skills in the library, a keyword search (BM25) retrieves the correct one only 32% of the time (Recall@1). Memento-Skills instead trains a small router model with offline RL: rather than matching words, it learns which skill leads to task success. Recall@1 rose to 60%.
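The paper's router is a learned model trained with offline RL; a much simpler stand-in makes the idea concrete. The sketch below (all names are assumptions) estimates each skill's empirical success rate from logged episodes and routes to the best one, which is the degenerate tabular case of learning "which skill leads to success" instead of matching keywords:

```python
# Toy stand-in for an offline-trained skill router. The real router is
# a learned model; this version just tabulates success rates per
# (query category, skill) pair from logged episodes.
from collections import defaultdict


class OfflineRouter:
    def __init__(self):
        # (query_key, skill_name) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def fit(self, episodes):
        """episodes: iterable of (query_key, skill_name, success_bool)."""
        for key, skill, ok in episodes:
            entry = self.stats[(key, skill)]
            entry[0] += int(ok)
            entry[1] += 1

    def route(self, query_key, skill_names):
        """Pick the skill with the highest estimated success rate."""
        def rate(skill):
            ok, n = self.stats[(query_key, skill)]
            return ok / n if n else 0.0
        return max(skill_names, key=rate)
```

A real offline-RL router generalizes across unseen queries, which a lookup table cannot, but the objective is the same: maximize downstream success, not lexical overlap.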
MetaClaw goes further. It rewrites skill files like the others, but it also updates model weights during idle time. When the user is away (sleep mode, keyboard silence, empty calendar slots), it runs RL in the background. The authors call this the Opportunistic Meta-Learning Scheduler. This is what pushed Kimi-K2.5 to GPT-5.2 level. Skill evolution alone was not enough. The weight updates closed the last gap.
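The scheduling logic reduces to an idle check. This is a hedged sketch in the spirit of MetaClaw's Opportunistic Meta-Learning Scheduler, with the paper's several triggers (sleep mode, keyboard silence, empty calendar slots) collapsed into a single activity timestamp; the class name and threshold are assumptions:

```python
# Hypothetical idle-time trigger for background RL updates.
import time


class IdleScheduler:
    def __init__(self, idle_threshold_s=300.0, clock=time.monotonic):
        self.idle_threshold_s = idle_threshold_s
        self.clock = clock
        self.last_activity = clock()

    def record_activity(self):
        """Call on any user signal (keystroke, calendar event, wake)."""
        self.last_activity = self.clock()

    def should_train(self):
        """True once the user has been idle long enough to run RL."""
        return self.clock() - self.last_activity >= self.idle_threshold_s
```

Injecting `clock` makes the trigger testable without waiting; in a real system the training job would also need to checkpoint and yield the moment activity resumes.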
What happened when skills grew from 5 to 235
Memento-Skills started with 5 basic skills (web search, terminal, file operations). After running through HLE training questions, the library grew to 235. Clusters formed on their own: quantum physics, math, chemistry, clinical science, chess, code. No one designed these categories. The agent created them by failing and fixing.
Training accuracy went from 30.8% to 54.5%. Test accuracy hit 38.7%. The gap between training and test shows how far the skills generalize. Skills learned on biology questions transferred to unseen biology questions. Skills learned on quantum physics transferred to unseen quantum physics questions. But a skill for chemistry rarely helped with history.
On GAIA (165 multi-step reasoning questions), the picture was different. Test accuracy was 66.0%, up 13.7 points from 52.3%. But transfer was weaker. GAIA questions are diverse by design. A skill for one question type almost never applies to another.
MetaClaw's numbers point to a different finding. Weaker models benefit more from skill evolution. Kimi-K2.5 jumped 19.2 percentage points. A stronger model would likely see smaller gains. Skill evolution does not replace model quality. It compensates for it.
Only the direction converged
All numbers are self-reported. No independent replication exists for any of these papers.
MetaClaw and Memento-Skills use different benchmarks. You cannot compare their numbers directly. MetaClaw's 40.6% and Memento-Skills' 38.7% are on completely different tests.
MetaClaw updates model weights. The other four keep weights frozen. These are fundamentally different approaches. MetaClaw gets bigger gains but needs GPU access during idle time. Memento-Skills works with any API model because it never touches the weights.
The skill router still fails 40% of the time. As libraries grow, a bad router wastes the improvements. This is the bottleneck that none of the five papers fully solved.
Agent learning became readable
The standard options for making a model smarter have been "train a better model" or "fine-tune with more data." Five papers in ten days showed a third option. Give the agent a folder of skill files and let it rewrite them after every failure.
Memento-Skills doubled HLE accuracy with frozen Gemini-3.1-Flash. MetaClaw pushed a free model to GPT-5.2 level. The skill files are Markdown. A developer can open them, read them, and understand what the agent learned.
That is the real point. Not the numbers. The fact that agent learning is now readable.