<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nasuy</title>
    <description>The latest articles on DEV Community by nasuy (@n_asuy).</description>
    <link>https://dev.to/n_asuy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3688104%2Fca4952f1-b749-450e-81a0-1c11b4c83d34.jpg</url>
      <title>DEV Community: nasuy</title>
      <link>https://dev.to/n_asuy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/n_asuy"/>
    <language>en</language>
    <item>
      <title>A free model matched GPT-5.2. No fine-tuning. It rewrote its own skill files until it got there</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 09:52:11 +0000</pubDate>
      <link>https://dev.to/n_asuy/a-free-model-matched-gpt-52-no-fine-tuning-it-rewrote-its-own-skill-files-until-it-got-there-5c4c</link>
      <guid>https://dev.to/n_asuy/a-free-model-matched-gpt-52-no-fine-tuning-it-rewrote-its-own-skill-files-until-it-got-there-5c4c</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcov1r1jcp8alyzvyy6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcov1r1jcp8alyzvyy6b.png" alt="Agent Self Evolution" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscw0jc9bpo4d423yi25i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscw0jc9bpo4d423yi25i.png" alt="Agent Self Evolution Diagram" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kimi-K2.5 scores 21.4% on a difficult benchmark. GPT-5.2 scores 41.1%. A team gave Kimi-K2.5 a system that rewrites its own skill files after every failure. No fine-tuning. No new training data. After a few rounds, it hit 40.6%. That is GPT-5.2 territory.&lt;/p&gt;

&lt;p&gt;The paper is MetaClaw (&lt;a href="https://arxiv.org/abs/2603.17187" rel="noopener noreferrer"&gt;arXiv:2603.17187&lt;/a&gt;). It is one of five papers published between March 15 and March 20 that all arrived at the same idea: let agents evolve their own skills.&lt;/p&gt;

&lt;p&gt;A second team from UCL and HKUST Guangzhou published Memento-Skills (&lt;a href="https://arxiv.org/abs/2603.18743" rel="noopener noreferrer"&gt;arXiv:2603.18743&lt;/a&gt;). They froze Gemini-3.1-Flash and only let it edit external Markdown files. On Humanity's Last Exam (HLE), 2,500 expert-level questions, accuracy went from 17.9% to 38.7%. More than double. The model never changed. The skill files did.&lt;/p&gt;

&lt;p&gt;EvoSkill, Automating Skill Acquisition, and AgentFactory shipped the same week. Five teams. Different methods. Same direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphjd312w1yh0f3r8mef2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphjd312w1yh0f3r8mef2.png" alt="Agent Self Evolution Flow" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why rewriting skill files works better than fine-tuning
&lt;/h2&gt;

&lt;p&gt;Fine-tuning is expensive. You need data, GPUs, and a pipeline. The result is a new model checkpoint that is hard to inspect, hard to edit, and locked to one provider. Skill files are the opposite. They are plain text. A developer can read them, edit them, and move them to a different model.&lt;/p&gt;

&lt;p&gt;Memento-Skills keeps the model frozen and puts all learning into a loop called Read-Write Reflective Learning. The agent picks a skill file, runs the task, and checks the result. If it fails, the system finds the file that caused the failure and rewrites it. If it succeeds, the skill gets a higher score.&lt;/p&gt;
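
&lt;p&gt;The loop is small enough to sketch. Below is a minimal, hypothetical Python version of the read-write cycle; the names (reflective_step, run_task, rewrite_skill) are illustrative, not from the Memento-Skills code.&lt;/p&gt;

```python
# Minimal sketch of a read-write reflective learning loop.
# All names here are illustrative; Memento-Skills' real implementation differs.

def reflective_step(library, task, run_task, rewrite_skill):
    """Pick a skill, run the task, then update the library in place."""
    # Read: choose the highest-scoring skill for this task.
    skill = max(library, key=lambda s: s["score"])
    success = run_task(task, skill["text"])
    if success:
        # Reinforce: successful skills are picked more often next time.
        skill["score"] += 1.0
    else:
        # Write: rewrite the file blamed for the failure.
        skill["text"] = rewrite_skill(skill["text"], task)
        skill["score"] -= 0.5
    return success

# Toy usage: a "skill" succeeds only if it mentions the task topic.
library = [
    {"name": "web-search", "text": "use the search tool", "score": 1.0},
    {"name": "math", "text": "solve math step by step", "score": 2.0},
]
ran = reflective_step(
    library,
    task="math: integrate x^2",
    run_task=lambda task, text: task.split(":")[0] in text,
    rewrite_skill=lambda text, task: text + " for " + task.split(":")[0],
)
```

&lt;p&gt;The scores double as the selection signal: success raises a skill's chance of being read next time, failure rewrites it in place.&lt;/p&gt;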

&lt;p&gt;The hard part is picking the right skill. With 235 skills in the library, a keyword search (BM25) picks the correct one 32% of the time. Memento-Skills trains a small router model with offline RL. Instead of matching words, it learns which skill leads to success. Recall@1 rose to 60%.&lt;/p&gt;
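
&lt;p&gt;The contrast between the two routing strategies can be reduced to a few lines. This is a toy illustration, not the paper's router: BM25 is collapsed into raw word overlap, and the learned router into a table of past successes per topic.&lt;/p&gt;

```python
# Toy contrast between keyword routing and outcome-based routing.
# Both routers are drastic simplifications for illustration only.

def keyword_router(query, skills):
    """BM25-style idea reduced to raw word overlap with the skill name."""
    def overlap(skill):
        return len(set(query.lower().split()).intersection(skill.lower().split()))
    return max(skills, key=overlap)

def learned_router(query, skills, success_counts):
    """Outcome-based idea: pick whatever succeeded most on this topic."""
    topic = query.split()[0]
    return max(skills, key=lambda s: success_counts.get((topic, s), 0))

skills = ["quantum physics notes", "organic chemistry steps"]

# Keyword routing matches surface words in the query.
kw = keyword_router("organic reaction steps", skills)

# Learned routing uses which skill actually solved past "physics" tasks.
counts = {("physics", "quantum physics notes"): 7,
          ("physics", "organic chemistry steps"): 1}
rl = learned_router("physics bond angles", skills, counts)
```

&lt;p&gt;The real router is trained with offline RL rather than a count table, but the shift is the same: from matching words to predicting success.&lt;/p&gt;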

&lt;p&gt;MetaClaw goes further. It rewrites skill files like the others, but it also updates model weights during idle time. When the user is away (sleep mode, keyboard silence, empty calendar slots), it runs RL in the background. The authors call this the Opportunistic Meta-Learning Scheduler. This is what pushed Kimi-K2.5 to GPT-5.2 level. Skill evolution alone was not enough. The weight updates closed the last gap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63p7j6locur9htlcvlb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63p7j6locur9htlcvlb6.png" alt="Agent Self Evolution Comparison" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened when skills grew from 5 to 235
&lt;/h2&gt;

&lt;p&gt;Memento-Skills started with 5 basic skills (web search, terminal, file operations). After running through HLE training questions, the library grew to 235. Clusters formed on their own: quantum physics, math, chemistry, clinical science, chess, code. No one designed these categories. The agent created them by failing and fixing.&lt;/p&gt;

&lt;p&gt;Training accuracy went from 30.8% to 54.5%. Test accuracy hit 38.7%. The gap between training and test shows how far the skills generalize. Skills learned on biology questions transferred to unseen biology questions. Skills learned on quantum physics transferred to unseen quantum physics questions. But a skill for chemistry rarely helped with history.&lt;/p&gt;

&lt;p&gt;On GAIA (165 multi-step reasoning questions), the picture was different. Test accuracy was 66.0%, up 13.7 points from 52.3%. But transfer was weaker. GAIA questions are diverse by design. A skill for one question type almost never applies to another.&lt;/p&gt;

&lt;p&gt;MetaClaw's numbers point to a different finding. Weaker models benefit more from skill evolution. Kimi-K2.5 jumped 19.2 percentage points. A stronger model would likely see smaller gains. Skill evolution does not replace model quality. It compensates for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Only the direction converged
&lt;/h2&gt;

&lt;p&gt;All numbers are self-reported. No independent replication exists for either paper.&lt;/p&gt;

&lt;p&gt;MetaClaw and Memento-Skills use different benchmarks. You cannot compare their numbers directly. MetaClaw's 40.6% and Memento-Skills' 38.7% are on completely different tests.&lt;/p&gt;

&lt;p&gt;MetaClaw updates model weights. The other four keep weights frozen. These are fundamentally different approaches. MetaClaw gets bigger gains but needs GPU access during idle time. Memento-Skills works with any API model because it never touches the weights.&lt;/p&gt;

&lt;p&gt;The skill router still fails 40% of the time. As libraries grow, a bad router wastes the improvements. This is the bottleneck that none of the five papers fully solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent learning became readable
&lt;/h2&gt;

&lt;p&gt;Until now, making a model smarter meant "train a better model" or "fine-tune with more data." Five papers in ten days showed a third option: give the agent a folder of skill files and let it rewrite them after every failure.&lt;/p&gt;

&lt;p&gt;Memento-Skills doubled HLE accuracy with frozen Gemini-3.1-Flash. MetaClaw pushed a free model to GPT-5.2 level. The skill files are Markdown. A developer can open them, read them, and understand what the agent learned.&lt;/p&gt;

&lt;p&gt;That is the real point. Not the numbers. The fact that agent learning is now readable.&lt;/p&gt;

</description>
      <category>reinforcementlearning</category>
    </item>
    <item>
      <title>Your AI agent works alone. No plan, no tests, no review. 118,000 developers found a fix.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Mon, 30 Mar 2026 02:53:29 +0000</pubDate>
      <link>https://dev.to/n_asuy/your-ai-agent-works-alone-no-plan-no-tests-no-review-118000-developers-found-a-fix-2b7g</link>
      <guid>https://dev.to/n_asuy/your-ai-agent-works-alone-no-plan-no-tests-no-review-118000-developers-found-a-fix-2b7g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favtybdcje6j0z309fhw4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favtybdcje6j0z309fhw4.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A single AI coding agent can write code. But give it a big task, and it drifts. It skips tests. It forgets the plan. It says "done" when it is not. This is not a model problem. It is a workflow problem.&lt;/p&gt;

&lt;p&gt;Two open-source frameworks on GitHub found the same answer. Split the work across multiple agents. Make them check each other. Force a loop: plan, implement, review, fix. The bigger one has 118,624 stars.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfcyoiz4bo90610xdpl5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfcyoiz4bo90610xdpl5.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why splitting the work stops the drift
&lt;/h2&gt;

&lt;p&gt;When one agent does everything, its context grows with every step. After 30 minutes, it loses track of the original plan. After an hour, it invents new requirements that nobody asked for. The failure patterns are predictable. The agent tries to build everything at once and the structure falls apart. It declares "done" halfway through. It marks features complete without writing tests. It leaves the environment in a broken state at the end of a session.&lt;/p&gt;

&lt;p&gt;Multi-agent orchestration fixes this by keeping each step small. A planning agent breaks the task into pieces. An implementation agent handles one piece at a time. A review agent checks if the code matches the plan. If it does not, the work goes back to implementation. Each agent starts with a clean, focused context. The drift stops. Superpowers makes this explicit. It requires plans to be "clear enough for a junior engineer with enthusiasm but no sense, no judgment, no project context, and reluctance to test."&lt;/p&gt;
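
&lt;p&gt;The shared loop can be sketched as a handful of role functions. This is a hypothetical outline of the pattern, not code from either framework; plan, implement, and review stand in for full agents, each called with a fresh, focused context.&lt;/p&gt;

```python
# Sketch of the plan, implement, review, fix loop shared by both
# frameworks. Each "agent" here is a plain function standing in for an
# LLM call with its own fresh context.

def orchestrate(task, plan, implement, review, max_fix_rounds=3):
    results = []
    for step in plan(task):            # planning agent: break into small pieces
        work = implement(step)         # implementation agent: one piece at a time
        for _ in range(max_fix_rounds):
            ok, feedback = review(step, work)   # review agent: does it match the plan?
            if ok:
                break
            work = implement(step + " | fix: " + feedback)
        results.append(work)
    return results

# Toy run: the reviewer rejects work that lacks a "tested" marker.
out = orchestrate(
    "build login",
    plan=lambda t: [t + ": schema", t + ": handler"],
    implement=lambda s: s + " [tested]" if "fix" in s else s,
    review=lambda s, w: ("tested" in w, "add tests"),
)
```

&lt;p&gt;The point of the structure is visible even in the toy: rejected work loops back to implementation instead of being declared done.&lt;/p&gt;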

&lt;p&gt;This is not a new idea. It is the same structure as CI/CD pipelines in software development. The difference is that agents now fill the roles that humans used to fill. The handoff between stages is automatic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepwgymbca763ncbvl22y.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepwgymbca763ncbvl22y.webp" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Discipline or flexibility. Two frameworks, two designs.
&lt;/h2&gt;

&lt;p&gt;Install &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt; (118,624 stars) and the agent changes how it works. When it detects you want to build something, it stops. It does not start writing code. It asks what you actually want. After it pulls out the spec from the conversation, it shows the design in short chunks you can read and understand. You approve the design. The agent writes a plan. You say "execute" and sub-agent-driven development starts.&lt;/p&gt;

&lt;p&gt;Each task in the plan is 2-5 minutes long. Every task has exact file paths, full code, and verification steps. After each sub-agent finishes a task, a two-stage review runs. Stage one checks if the code matches the spec. Stage two checks code quality. TDD is strict RED-GREEN-REFACTOR: code written before a test exists gets deleted. 14 built-in skills enforce this 7-step workflow as "mandatory, not a suggestion." It works with Claude Code, Cursor, Codex, OpenCode, and Gemini CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Yeachan-Heo/oh-my-claudecode" rel="noopener noreferrer"&gt;oh-my-claudecode&lt;/a&gt; (13,996 stars) takes the team approach. Its Team mode runs a staged pipeline: team-plan, team-prd, team-exec, team-verify, team-fix. Since v4.4.0, it can run Claude, Codex, and Gemini at the same time using tmux workers. You can send code reviews to Codex, UI tasks to Gemini, and integration work to Claude in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;omc team 2:codex "Review auth module for security issues"
omc team 2:gemini "Redesign UI components for accessibility"
omc team 1:claude "Implement payment flow"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It has 32 specialized agents and routes tasks to Haiku or Opus based on difficulty. You can also run a cross-model PR review in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/ccg Review this PR — architecture (Codex) and UI components (Gemini)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standout feature is automatic skill learning. When you debug a problem, OMC extracts the fix as a skill file. For example, if you fix an aiohttp proxy crash, it saves this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .omc/skills/fix-proxy-crash.md&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Fix Proxy Crash&lt;/span&gt;
&lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aiohttp"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disconnected"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extracted&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="s"&gt;Wrap the handler at server.py:42 with try/except ClientDisconnectedError...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next time the same error shows up, this skill gets injected automatically. No manual call needed. When requirements are unclear, &lt;code&gt;/deep-interview "I want to build a task management app"&lt;/code&gt; starts Socratic questioning. It finds hidden assumptions and measures clarity on weighted dimensions before any code is written.&lt;/p&gt;
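
&lt;p&gt;A rough approximation of that mechanism, not OMC's actual code: scan each skill's triggers against the error text and prepend matching skills to the prompt.&lt;/p&gt;

```python
# Hypothetical sketch of trigger-based skill injection. OMC's real
# implementation reads skill files from disk; here skills are inline dicts.

skills = [
    {"name": "Fix Proxy Crash",
     "triggers": ["proxy", "aiohttp", "disconnected"],
     "body": "Wrap the handler with try/except ClientDisconnectedError..."},
    {"name": "Retry Flaky Test",
     "triggers": ["flaky", "timeout"],
     "body": "Re-run the failing test three times before reporting..."},
]

def inject_skills(error_text, skills):
    """Prepend every skill whose triggers appear in the error text."""
    text = error_text.lower()
    matched = [s for s in skills if any(t in text for t in s["triggers"])]
    preamble = "\n".join("[skill] " + s["name"] + ": " + s["body"] for s in matched)
    return preamble + "\n" + error_text if matched else error_text

prompt = inject_skills("aiohttp server crashed: client disconnected", skills)
```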

&lt;p&gt;Superpowers enforces discipline to ensure quality. oh-my-claudecode offers multi-provider flexibility to widen your options. Both use the same core loop: plan, implement, review, fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model is not the bottleneck. The workflow is.
&lt;/h2&gt;

&lt;p&gt;Harness engineering showed that one agent needs a good environment to work well over time. Multi-agent orchestration goes further. It splits the work across agents and makes them inspect each other. The quality comes from the structure, not from any single model.&lt;/p&gt;

&lt;p&gt;Both frameworks arrived at the same core loop: plan, implement, review, fix. The next time your AI agent produces bad code, try adding structure before switching models.&lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>18,883 MCP servers. Five Chinese tech giants joined this week. Zero security audits.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Thu, 26 Mar 2026 16:25:18 +0000</pubDate>
      <link>https://dev.to/n_asuy/18883-mcp-servers-five-chinese-tech-giants-joined-this-week-zero-security-audits-fco</link>
      <guid>https://dev.to/n_asuy/18883-mcp-servers-five-chinese-tech-giants-joined-this-week-zero-security-audits-fco</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7p4zdsy0rp7qo4m4w3.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l7p4zdsy0rp7qo4m4w3.webp" alt="MCP Supply Chain Security" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxyiw4nty2i1vgjmbfhm.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxyiw4nty2i1vgjmbfhm.webp" alt="MCP Supply Chain Risk Diagram" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On March 24, someone put malware in litellm, a popular Python library for calling LLM APIs. Versions 1.82.7 and 1.82.8 on PyPI stole API keys for OpenAI, Anthropic, and Gemini. Two Hacker News posts about it got 1,159 points total and 364 comments.&lt;/p&gt;

&lt;p&gt;The same week, five Chinese tech companies appeared on the MCP.so trending page. Tencent, Zhipu AI, Amap (owned by Alibaba), Baidu, and MiniMax. They all published MCP servers between March 23 and 25. The total number of MCP servers has now reached 18,883.&lt;/p&gt;

&lt;p&gt;Everyone noticed litellm. Nobody noticed the Chinese MCP servers. No security review. No public discussion. Nothing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyvuur8lslnlhzc112kl.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyvuur8lslnlhzc112kl.webp" alt="npm vs MCP structural comparison" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP has the same supply chain problem as npm, but worse
&lt;/h2&gt;

&lt;p&gt;npm's supply chain attacks are well known. event-stream in 2018. ua-parser-js in 2021. MCP is heading down the same path, but with two differences that make it worse.&lt;/p&gt;

&lt;p&gt;The first is data direction. When you install an npm package, code runs on your machine. It stays local. When you connect an MCP server, your data leaves your machine. A map server gets your location queries. A search server gets your search queries. A deploy server gets your HTML source code. npm packages do not send data out by default. MCP servers do. That is what they are built for.&lt;/p&gt;

&lt;p&gt;The second is geopolitics. npm was mostly neutral. No government had a special interest in controlling JavaScript module registries. MCP is different. Five Chinese tech companies now provide servers that handle location data, search queries, and deployment workflows. Data that flows through these servers may fall under China's &lt;a href="https://en.wikipedia.org/wiki/Cybersecurity_Law_of_the_People%27s_Republic_of_China" rel="noopener noreferrer"&gt;Cybersecurity Law&lt;/a&gt; (2017), which requires data to stay in China, and China's &lt;a href="https://en.wikipedia.org/wiki/National_Intelligence_Law_of_the_People%27s_Republic_of_China" rel="noopener noreferrer"&gt;National Intelligence Law&lt;/a&gt; (2017), which requires organizations to cooperate with intelligence work. npm never had this kind of issue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64m2l44aoe1gjbemdfcs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F64m2l44aoe1gjbemdfcs.webp" alt="npm vs MCP timeline comparison" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three events happened in one week. Nobody connected them.
&lt;/h2&gt;

&lt;p&gt;March 19. Security firm Qualys published a blog post: "&lt;a href="https://blog.qualys.com/product-tech/2026/03/19/mcp-servers-shadow-it-ai-qualys-totalai-2026" rel="noopener noreferrer"&gt;MCP Servers: The Shadow IT of the AI Agent Era&lt;/a&gt;." They warned that developers install MCP servers without approval from IT departments. The same pattern as Shadow IT in the 2010s, but with AI tools.&lt;/p&gt;

&lt;p&gt;March 24. litellm versions 1.82.7 and 1.82.8 on PyPI were &lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;found to carry malware&lt;/a&gt;. A file called litellm_init.pth stole LLM API credentials. litellm is widely used in production because it lets you call OpenAI, Anthropic, and Gemini through one API. Hacker News gave the story &lt;a href="https://news.ycombinator.com/item?id=47501426" rel="noopener noreferrer"&gt;1,159 points&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;March 23-25. Five Chinese tech companies showed up on &lt;a href="https://mcp.so" rel="noopener noreferrer"&gt;MCP.so&lt;/a&gt;'s trending page. Tencent EdgeOne deploys HTML and returns a public URL. Zhipu AI offers web search with intent recognition. Amap and Baidu provide map and location services. MiniMax provides text-to-speech, image generation, and video generation through MCP.&lt;/p&gt;

&lt;p&gt;Nobody has connected these three events. litellm was reported. Qualys published their warning. But no one has asked: who is auditing the five Chinese MCP servers that appeared the same week?&lt;/p&gt;

&lt;h2&gt;
  
  
  This is not about China being dangerous
&lt;/h2&gt;

&lt;p&gt;This article does not say Chinese MCP servers are dangerous. There is no evidence that any of them contain malware. Some publish their source code on GitHub.&lt;/p&gt;

&lt;p&gt;The problem is structural. 18,883 MCP servers exist, and there is no standard audit process. After the event-stream attack in 2018, it took npm years to add package signing and scoped registries. MCP will need server verification and data flow monitoring. The difference is that MCP sends live data, not just code. The attack surface is wider.&lt;/p&gt;

&lt;p&gt;Our previous article, "MCP tool spoofing: 100% success rate," covered protocol-level risks. This is a different kind of problem. Not how MCP is designed, but who publishes servers, where data goes, and who checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ten years of npm lessons. One year to learn them.
&lt;/h2&gt;

&lt;p&gt;MCP is becoming the npm of AI. Same supply chain problem, but higher stakes. Live data instead of static code. National interests instead of neutral registries. npm needed ten years to learn its lessons. MCP needs to learn them in one. The litellm hack was the first warning sign. The next one might come from an MCP server itself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>mcp</category>
      <category>supplychainsecurity</category>
    </item>
    <item>
      <title>"The human might be asleep." One line in Karpathy's program.md started 100 automatic experiments per night.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Tue, 24 Mar 2026 13:09:57 +0000</pubDate>
      <link>https://dev.to/n_asuy/the-human-might-be-asleep-one-line-in-karpathys-programmd-started-100-automatic-experiments-e1</link>
      <guid>https://dev.to/n_asuy/the-human-might-be-asleep-one-line-in-karpathys-programmd-started-100-automatic-experiments-e1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sz6dpey3gnvbluxp2ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sz6dpey3gnvbluxp2ey.png" alt="autoresearch architecture diagram" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest bottleneck in code optimization is the human in the loop. You think of an idea, implement it, test it, check results, then think again. In March 2026, Andrej Karpathy removed that bottleneck. He released &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch&lt;/a&gt;, a tool that lets an AI agent edit code, run experiments, evaluate results, and keep or discard changes automatically. It hit 42,921 GitHub stars in under two weeks (GitHub API, 2026-03-19 11:56 UTC).&lt;/p&gt;

&lt;p&gt;The surprising part is where it spread. Shopify CEO Tobi Lutke applied the pattern to Liquid, a template engine running in production for 20 years. He reported a 53% reduction in parse+render time in &lt;a href="https://github.com/Shopify/liquid/pull/2056" rel="noopener noreferrer"&gt;PR #2056&lt;/a&gt;. LangChain CEO hwchase17 used it to optimize agent quality scores. Ole Lehmann reported raising a Claude Code skill eval score from 56% to 92%. This is not an ML research tool anymore. It is a pattern for any task with a measurable metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmg1sd1y45of3md14u8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmg1sd1y45of3md14u8q.png" alt="autoresearch loop diagram" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why three files are enough
&lt;/h2&gt;

&lt;p&gt;The architecture is stripped to the minimum. There are three core files.&lt;/p&gt;

&lt;p&gt;program.md is the instruction file. A human writes it. It defines what to optimize, how to run experiments, and what must not break. train.py is the only file the agent edits. prepare.py is the evaluation harness. Nobody touches it.&lt;/p&gt;

&lt;p&gt;This separation works because the boundary between "what changes" and "what stays fixed" is clear. The agent edits train.py, runs a 5-minute experiment, checks the metric. If it improved, git commit. If not, git reset. About 12 experiments per hour. Leave it running overnight and you get about 100.&lt;/p&gt;
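
&lt;p&gt;The keep-or-discard loop fits in a dozen lines. This is a hedged sketch of the pattern, not Karpathy's code: run_experiment and the commit/reset callbacks are stand-ins, and the metric is "lower is better," as with val_bpb.&lt;/p&gt;

```python
# Hedged sketch of the autoresearch keep-or-discard loop. The real tool
# shells out to git and runs train.py; commit/reset are stand-ins here.

def autoresearch_loop(run_experiment, commit, reset, rounds):
    """Keep a change when the metric drops, discard it otherwise."""
    best = run_experiment()            # baseline (lower is better, like val_bpb)
    kept = 0
    for _ in range(rounds):
        metric = run_experiment()      # agent edits train.py, re-runs the eval
        # improved means: metric is strictly lower than best
        improved = metric != best and metric == min(metric, best)
        if improved:
            commit()                   # would be: git commit
            best = metric
            kept += 1
        else:
            reset()                    # would be: git reset --hard
    return best, kept

# Deterministic toy run: three of five edits improve the metric.
seq = iter([0.9979, 0.9950, 0.9990, 0.9900, 0.9910, 0.9850])
best, kept = autoresearch_loop(lambda: next(seq), lambda: None, lambda: None, 5)
```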

&lt;p&gt;The 5-minute cap is what makes this work. It forces every experiment into the same budget. You can compare results fairly. Without a fixed budget, a slow-converging change looks just as good as a fast one. The cap makes comparison possible.&lt;/p&gt;

&lt;p&gt;program.md includes this line: "The human might be asleep, or gone from a computer and expects you to continue working indefinitely until you are manually stopped." That single instruction removes the human bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfie42hsryqkwlxu59g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfie42hsryqkwlxu59g5.png" alt="autoresearch applications diagram" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  126 experiments from Karpathy, 974 tests from Shopify
&lt;/h2&gt;

&lt;p&gt;Karpathy ran 126 experiments on a single H100 in about 10.5 hours. He published the full log in &lt;a href="https://github.com/karpathy/autoresearch/discussions/43" rel="noopener noreferrer"&gt;Discussion #43&lt;/a&gt;. Out of 126 experiments, 23 were kept. That is about 18%. Most experiments fail or make things worse. But the ones that improve stack up. val_bpb went from 0.9979 to 0.9697.&lt;/p&gt;

&lt;p&gt;The biggest win was halving the batch size (524K to 262K), which gave -0.0119. The biggest failure was weight tying (sharing embed and unembed layers), which added +2.24 BPB and completely broke the model. The dead-end log is valuable too. Knowing what does not work saves future experiments from going in the wrong direction.&lt;/p&gt;

&lt;p&gt;Shopify took a different approach. The target was a Ruby library (lib/liquid/*.rb), not ML training code. The metric was combined_us (parse + render time), not val_bpb. The critical difference was a 3-gate validation system. Every change had to pass 974 unit tests, then a liquid-spec compliance check, then a performance benchmark. Only changes that passed all three gates and improved the metric were kept. About 120 experiments produced 93 commits. Parse time dropped 61%. Render time dropped 20%. Total dropped 53%.&lt;/p&gt;
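
&lt;p&gt;The gate logic is worth making concrete. The sketch below is a hypothetical reduction of the 3-gate idea, not Shopify's harness: each gate is a boolean check, and a microsecond figure stands in as the metric (lower is better).&lt;/p&gt;

```python
# Toy version of 3-gate validation: a change is kept only if it passes
# every gate and strictly improves the metric. The gates stand in for
# the unit tests, spec compliance check, and performance benchmark.

def three_gate_keep(change, gates, metric, best):
    """Return True when all gates pass and the metric strictly improves."""
    if not all(gate(change) for gate in gates):
        return False
    m = metric(change)
    # strictly lower than best means improvement
    return m != best and m == min(m, best)

gates = [lambda c: c["tests_pass"], lambda c: c["spec_ok"], lambda c: c["bench_ok"]]
fast_but_wrong = {"tests_pass": False, "spec_ok": True, "bench_ok": True, "us": 40.0}
real_win = {"tests_pass": True, "spec_ok": True, "bench_ok": True, "us": 45.0}
keep_a = three_gate_keep(fast_but_wrong, gates, lambda c: c["us"], best=50.0)
keep_b = three_gate_keep(real_win, gates, lambda c: c["us"], best=50.0)
```

&lt;p&gt;A change that speeds things up but breaks a test never reaches the metric comparison. That ordering is what keeps 20-year-old production code safe to mutate.&lt;/p&gt;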

&lt;p&gt;The key insight was that garbage collection was consuming 74% of CPU time. Focusing on reducing object allocations drove most of the improvement. Allocations went from 62,620 to 24,530, a 61% reduction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;Shopify PR #2056 was still OPEN as of 2026-03-19. It has not been merged. Comments on the PR mention test failures. The 53% figure is self-reported and has not been independently verified.&lt;/p&gt;

&lt;p&gt;Metrics gaming is a known issue. After 30+ iterations, agents start finding ways to improve the metric without real improvement. Random seed engineering is one example. Karpathy's log includes fragile improvements like "seed 137 effect" that may not reproduce.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/mutable-state-inc/autoresearch-at-home" rel="noopener noreferrer"&gt;autoresearch-at-home&lt;/a&gt; (440+ stars) extends the pattern to distributed collaboration in a SETI@home style. &lt;a href="https://github.com/zkarimi22/autoresearch-anything" rel="noopener noreferrer"&gt;autoresearch-anything&lt;/a&gt; (by zkarimi22) generates setup files for any project with &lt;code&gt;npx autoresearch-anything&lt;/code&gt;. The MLX port for Apple Silicon found that a depth=4 model beats depth=8 under the 5-minute budget. Smaller models that run more optimizer steps win. The optimal setup depends on the hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;autoresearch success does not depend on model capability. It depends on three design choices. Metric: what you measure. Scope: what the agent is allowed to change. Verify: what tests and constraints protect the things that must not break.&lt;/p&gt;

&lt;p&gt;Shopify's 53% improvement happened because they built a 3-gate Verify system with 974 tests, spec compliance, and benchmarks. If you want to apply this pattern, start by asking two questions. Do you have a measurable metric? Do you have a test suite that protects what matters? If the answer to both is yes, you can let an AI run 100 experiments while you sleep.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>The most valuable AI agent skills are buried in GitHub repos</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Fri, 20 Mar 2026 12:20:18 +0000</pubDate>
      <link>https://dev.to/n_asuy/the-most-valuable-ai-agent-skills-are-buried-in-github-repos-1jgg</link>
      <guid>https://dev.to/n_asuy/the-most-valuable-ai-agent-skills-are-buried-in-github-repos-1jgg</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1zy3swajxpzz1hzps70.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1zy3swajxpzz1hzps70.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh91zzaaaubltlz4s3f4b.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh91zzaaaubltlz4s3f4b.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most agent skills people create are a rewrite of what the LLM already knows. Claude Code's skill-creator, or a "save this as a skill" request at the end of a chat. These tools are easy to use, but the output rarely adds real capability. The LLM can already do what the skill describes.&lt;/p&gt;

&lt;p&gt;Skills become valuable when they encode procedural knowledge. Not facts. Not prompts. Procedures that someone found through debugging and iteration. That kind of knowledge is buried in open-source repositories on GitHub. In March 2026, a research team at East China Normal University proposed a 3-stage pipeline that extracts skills from OSS repos and converts them to standard SKILL.md format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c4c8253eang18knuwfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0c4c8253eang18knuwfu.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm169rhxfanmdv0f5vl2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm169rhxfanmdv0f5vl2.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why writing skills by hand does not scale
&lt;/h2&gt;

&lt;p&gt;There are three ways to build agent skills.&lt;/p&gt;

&lt;p&gt;Manual creation by experts produces high quality. Anthropic adopted this approach officially. OpenAI Codex is also compatible with the SKILL.md format. But it does not scale. Each skill requires domain knowledge and testing time. When an agent needs hundreds of skills, manual creation breaks down.&lt;/p&gt;

&lt;p&gt;Autonomous discovery by the agent is the second path. EvoSkill, covered in a previous post, takes this approach. It scales, but semantic consistency is hard to maintain. The quality of auto-generated skills varies widely.&lt;/p&gt;

&lt;p&gt;OSS mining is the third path and the focus of this paper. Agent repositories on GitHub contain procedures that someone spent time debugging and iterating on. The framework finds those procedures automatically and converts them to standard format. It reuses existing human knowledge, so semantic consistency is higher than generating from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7z6sbrx73y1ksz7b6w7.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7z6sbrx73y1ksz7b6w7.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the pipeline works, and what it found
&lt;/h2&gt;

&lt;p&gt;The pipeline has three stages.&lt;/p&gt;

&lt;p&gt;Repository structure analysis comes first. Tools like repo2AI convert the entire codebase to Markdown and map core scripts and helper modules in a hierarchy.&lt;/p&gt;

&lt;p&gt;Semantic skill identification follows. Code modules are converted to dense vectors. A bi-encoder calculates cosine similarity to narrow down candidates. A cross-encoder then refines the ranking. Only modules that pass four promotion criteria become skill candidates. The criteria are recurrence (appears in multiple contexts), verified (works and is documented), non-obviousness (required domain expertise to discover), and generalizability (can be parameterized for other contexts). Modules that fail any criterion are not promoted. This filter prevents both extremes of "make everything a skill" and "make nothing a skill."&lt;/p&gt;
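&lt;p&gt;A toy version of the two-stage retrieval plus the promotion filter makes the flow concrete. The hand-made 2-D vectors stand in for real embeddings, and the function names are illustrative, not from the paper:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def find_skill_candidates(query_vec, modules, rerank, top_k=3):
    """Cheap bi-encoder pass narrows candidates by cosine similarity;
    a (stand-in) cross-encoder 'rerank' then refines the ordering."""
    scored = sorted(modules, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    shortlist = scored[:top_k]
    return sorted(shortlist, key=rerank, reverse=True)

CRITERIA = ("recurrence", "verified", "non_obvious", "generalizable")

def promote(module):
    # A module becomes a skill candidate only if it passes all four criteria.
    return all(module.get(c, False) for c in CRITERIA)

mods = [
    {"name": "pdf_table_extract", "vec": [0.9, 0.1],
     "recurrence": True, "verified": True, "non_obvious": True, "generalizable": True},
    {"name": "hello_world", "vec": [0.1, 0.9],
     "recurrence": True, "verified": True, "non_obvious": False, "generalizable": True},
]
top = find_skill_candidates([1.0, 0.0], mods, rerank=lambda m: m["vec"][0])
print([m["name"] for m in top if promote(m)])
```

&lt;p&gt;Note that the obvious-but-common module is filtered at the promotion step, not the retrieval step; retrieval finds what is similar, the four criteria decide what is worth keeping.&lt;/p&gt;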

&lt;p&gt;SKILL.md conversion is the final stage. Identified patterns are standardized into three layers. YAML frontmatter for metadata. Markdown body for procedures. An assets directory for scripts and templates. Hardcoded paths and API keys are removed to make the skill portable.&lt;/p&gt;
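&lt;p&gt;A skeleton of the three-layer output looks like this. The frontmatter fields follow the published SKILL.md convention; the skill name and steps are invented for illustration:&lt;/p&gt;

```markdown
---
name: manim-scene-layout
description: Lay out Manim scenes so that formulas and labels never overlap.
---

# Manim scene layout

1. Compute bounding boxes for every object before placement.
2. Place the formula first, then position labels relative to it.
3. Re-render at low resolution and check for overlaps before the final pass.

Assets in `assets/` hold reusable layout helpers. No hardcoded paths or API keys.
```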

&lt;p&gt;The team tested this on two repositories. TheoremExplainAgent from TIGER AI Lab generates Manim animations that explain STEM theorems using a 2-agent system (Planner and Coding Agent). Code2Video from Show Lab at the National University of Singapore generates educational videos using a 3-agent system (Planner, Coder, Critic). Code2Video was accepted at the NeurIPS 2025 Deep Learning for Code Workshop. Its paper reports a 40-point improvement in knowledge transfer efficiency when comparing its full pipeline to a baseline code generation model. TeachQuiz scores went from about 40 to about 80. (&lt;a href="https://arxiv.org/abs/2510.01174" rel="noopener noreferrer"&gt;arXiv:2510.01174&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;For managing large skill libraries, the paper proposes SkillNet. It is an ontology-based structure that connects skills through relationships like "is a subset of" and "requires output from." The paper cites a 30% reduction in execution steps and 40% improvement in task reward, though the experimental conditions behind these numbers are limited.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;This paper is a preprint, submitted March 12, 2026. It has not been peer-reviewed. The 40-point improvement in knowledge transfer efficiency comes from the Code2Video paper (&lt;a href="https://arxiv.org/abs/2510.01174" rel="noopener noreferrer"&gt;arXiv:2510.01174&lt;/a&gt;), not from the mining framework itself. The framework was tested on only two education-focused repositories. Its applicability to other domains has not been verified. The framework code has not been open-sourced.&lt;/p&gt;

&lt;p&gt;On security, a survey paper (&lt;a href="https://arxiv.org/abs/2602.12430" rel="noopener noreferrer"&gt;arXiv:2602.12430&lt;/a&gt;) found vulnerabilities in 26.1% of community-distributed skills. Data theft accounts for 13.3% and privilege escalation for 11.8%. OSS mining amplifies this risk. The paper proposes a 4-stage verification pipeline from static analysis (G1) to permission verification (G4), but no production deployment has been reported.&lt;/p&gt;

&lt;p&gt;The SKILL.md specification was published as an open standard by Anthropic on December 18, 2025. OpenAI has documented compatibility in both Codex and its API. The output format is becoming an industry standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Skill mines are on GitHub. Manual creation does not scale. Autonomous discovery is inconsistent. OSS mining reuses existing human knowledge, making it a credible third path for skill acquisition.&lt;/p&gt;

&lt;p&gt;The four promotion criteria from this paper (recurrence, verified, non-obviousness, and generalizability) work as a practical filter for deciding what should become a skill. You do not need the full pipeline to use them. Start by looking at the OSS repositories your team already uses. There may be procedural knowledge buried in the code that is worth extracting.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>[Meta-RL] We told an AI agent 'you can fail 3 times.' Accuracy went up 19%.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Thu, 19 Mar 2026 04:30:29 +0000</pubDate>
      <link>https://dev.to/n_asuy/meta-rl-we-told-an-ai-agent-you-can-fail-3-times-accuracy-went-up-19-2dpj</link>
      <guid>https://dev.to/n_asuy/meta-rl-we-told-an-ai-agent-you-can-fail-3-times-accuracy-went-up-19-2dpj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19fztgzbbzkupxn2yip7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19fztgzbbzkupxn2yip7.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w361opu8xm9qh8s8a6q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4w361opu8xm9qh8s8a6q.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most AI agents get one shot. They take a question, run a search or plan, give an answer, and move on. If the answer is wrong, that failure is lost. The agent starts fresh next time with no memory of what went wrong.&lt;/p&gt;

&lt;p&gt;Humans do not work this way. We fail, think about why, and try again with a better plan. From December 2025 to March 2026, three independent research teams at AI2, EPFL, and Tsinghua University arrived at the same idea. Give the agent multiple tries. Make it reflect on each failure. Feed that reflection into the next attempt. They call it Meta-Reinforcement Learning with Self-Reflection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single-shot agents fall short
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hnfluguksx7fr262krs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hnfluguksx7fr262krs.jpg" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqpwtuevpdel0tc98xrk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqpwtuevpdel0tc98xrk.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standard RL-trained agents treat each attempt as independent. They cannot carry lessons from one try to the next. Three problems come together here.&lt;/p&gt;

&lt;p&gt;Sparse rewards make it hard to learn. The agent only gets a signal at the end (right or wrong), so it cannot tell which steps were good and which were bad. Independent tries mean the agent repeats the same mistakes. And as RL training continues, the agent converges to a fixed behavior and stops exploring new strategies. &lt;a href="https://arxiv.org/abs/2512.16848" rel="noopener noreferrer"&gt;LaMer&lt;/a&gt; showed this with trajectory diversity analysis. After RL training, agents had much lower entropy in their action patterns compared to the base model.&lt;/p&gt;

&lt;p&gt;Meta-RL with Self-Reflection solves all three. The design is simple. Allow three attempts per problem. After each attempt, the agent writes what went wrong and what to try next. That reflection text goes into the context for the next attempt. During training, the system optimizes cross-episode rewards, so the model learns how to write useful reflections.&lt;/p&gt;

&lt;p&gt;The key point is that at test time, there are no weight updates. The agent adapts by adding past episodes and reflection text to its context window. LaMer calls this in-context policy adaptation. It means you do not need online learning after deployment.&lt;/p&gt;
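&lt;p&gt;The test-time loop fits in a few lines. In this sketch, &lt;code&gt;attempt&lt;/code&gt;, &lt;code&gt;reflect&lt;/code&gt;, and &lt;code&gt;check&lt;/code&gt; are hypothetical stand-ins for the model call, the reflection prompt, and the environment's reward signal:&lt;/p&gt;

```python
def solve_with_reflection(problem, attempt, reflect, check, max_tries=3):
    """Test-time loop: no weight updates. Each failed try appends a short
    reflection to the context, and the next attempt sees it."""
    context = [problem]
    for i in range(max_tries):
        answer = attempt(context)
        if check(answer):
            return answer, i + 1
        # Reflection-only memory: keep the lesson, not the full trajectory.
        context.append(reflect(answer))
    return answer, max_tries

# Toy demo: the 'model' improves as reflections accumulate in its context.
answer, tries = solve_with_reflection(
    "toy problem",
    attempt=lambda ctx: len(ctx),
    reflect=lambda ans: f"try a larger value than {ans}",
    check=lambda ans: ans > 2,
)
print(answer, tries)  # prints: 3 3
```

&lt;p&gt;The reflection-only memory choice mirrors LaMer's finding below: the loop stores the lesson text, not the whole failed trajectory.&lt;/p&gt;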

&lt;h2&gt;
  
  
  What three teams found
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjimaq28oi75e52ng95k8.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjimaq28oi75e52ng95k8.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three teams tested this pattern in different task domains. Their results show it works across search, games, web tasks, and multi-agent environments.&lt;/p&gt;

&lt;p&gt;AI2’s MR-Search targets search QA. Using Qwen2.5-7B, it improved QA benchmark average accuracy by 9.3% relative. With a smaller 3B model, the gain reached 19.3%. MR-Search uses turn-level advantage estimation to assign credit to each intermediate step, not just the final answer. It also scales beyond training. Even though the model trained with 3 attempts, performance keeps improving with 5 or 7 attempts at test time. (&lt;a href="https://arxiv.org/abs/2603.11327" rel="noopener noreferrer"&gt;arXiv:2603.11327&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;EPFL’s LaMer works on games and web tasks. Using Qwen3-4B, it improved pass@3 success rates by 11.8 points on Sokoban, 19.3 points on MineSweeper, and 13.9 points on Webshop versus the best RL baseline. One finding stands out. Keeping only reflection text in memory works better than the default setting of keeping both trajectory and reflection. On MineSweeper, reflection-only scored 80.5% versus 74.4% for full history. Reflections are shorter and carry more useful information per token. (&lt;a href="https://arxiv.org/abs/2512.16848" rel="noopener noreferrer"&gt;arXiv:2512.16848&lt;/a&gt;, ICLR 2026)&lt;/p&gt;

&lt;p&gt;Tsinghua’s MAGE extends this to multi-agent settings. It focuses on strategic exploitation, finding and using an opponent’s weaknesses. MAGE reached 100% success rate on Webshop (versus 75.2% for GiGPO) and 67.2% on Tic-Tac-Toe against MCTS-100 (versus 60.2% for LaMer). Against MCTS-1000, a near-perfect opponent in Tic-Tac-Toe, MAGE achieved a 100% draw rate through zero-shot adaptation. (&lt;a href="https://arxiv.org/abs/2603.03680" rel="noopener noreferrer"&gt;arXiv:2603.03680&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The three frameworks differ in some design choices. MR-Search uses no discount between episodes (gamma=1.0), while LaMer and MAGE use 0.6. MAGE uses differential return, which rewards improvement over the previous episode rather than total score. MAGE’s ablation study showed differential return produces more stable learning than cumulative return. The three papers also use different metrics (Exact Match vs. pass@k), so direct number comparisons between them are not valid.&lt;/p&gt;
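&lt;p&gt;The difference between the two return definitions is easiest to see in code. This is a schematic reading of the idea, not MAGE's exact formula; in particular, treating the first episode's raw score as its own delta is an assumption:&lt;/p&gt;

```python
def cumulative_return(episode_scores, gamma=0.6):
    """Discounted sum of raw per-episode scores."""
    return sum(s * gamma**t for t, s in enumerate(episode_scores))

def differential_return(episode_scores, gamma=0.6):
    """Differential-return idea: reward the improvement over the previous
    episode rather than the raw score, so only getting better is reinforced."""
    deltas = [episode_scores[0]] + [
        b - a for a, b in zip(episode_scores, episode_scores[1:])
    ]
    return sum(d * gamma**t for t, d in enumerate(deltas))

improving = [0.2, 0.5, 0.9]   # an agent that gets better each try
print(cumulative_return(improving), differential_return(improving))
```

&lt;p&gt;Under the differential framing, an agent that plateaus earns nothing for later episodes, which is the pressure toward cross-episode improvement that the ablation credits with more stable learning.&lt;/p&gt;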

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;All results come from the authors’ own experiments. Large-scale independent reproduction is still limited. LaMer has been peer-reviewed at ICLR 2026. MR-Search and MAGE are preprints. MR-Search code is expected on March 21, 2026. LaMer and MAGE code is already public.&lt;/p&gt;

&lt;p&gt;Base models are small, 4B to 7B parameters. No one has tested this on 70B+ models yet. Training takes about twice as long as standard RL because episodes must be generated one after another, not in parallel. LaMer reported this cost.&lt;/p&gt;

&lt;p&gt;Reflection quality is a risk. LLM hallucinations can creep into reflections. A wrong reflection may hurt performance more than no reflection at all. None of the three papers propose a direct fix for this. Context length is another limit. Episodes and reflections pile up fast, and long tasks will lose information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The shift is from “get it right the first time” to “fail, reflect, and improve.” Three independent teams converged on this pattern at the same time. That convergence itself is a signal. For agent builders, the takeaway is practical. Question the assumption that your agent must finish in one try. Build in room for exploration and reflection. The mechanism is lightweight. No weight updates at test time. Just context.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>reinforcementlearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>[EvoSkill] An AI agent learned from its own failures and got 12 points more accurate.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Wed, 18 Mar 2026 04:17:47 +0000</pubDate>
      <link>https://dev.to/n_asuy/evoskill-an-ai-agent-learned-from-its-own-failures-and-got-12-points-more-accurate-1ak8</link>
      <guid>https://dev.to/n_asuy/evoskill-an-ai-agent-learned-from-its-own-failures-and-got-12-points-more-accurate-1ak8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvaxjj9rh1yusnn187ri.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvaxjj9rh1yusnn187ri.jpg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07zjuwueftw037gxmkg4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07zjuwueftw037gxmkg4.jpg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI coding agents have a structural weakness. Claude Code, Codex, and OpenHands are good at general problem solving. But they lack domain-specific know-how. How to correctly extract numbers from 89,000 pages of US Treasury documents. How to find accurate facts in noisy search results. That kind of expertise does not live inside the model.&lt;/p&gt;

&lt;p&gt;The current fix is to write "skills" by hand. A SKILL.md file with step-by-step instructions and helper scripts. Claude Code's skill spec made this format standard. But writing a new skill every time a new task appears does not scale.&lt;/p&gt;

&lt;p&gt;In March 2026, Sentient Labs and Virginia Tech released EvoSkill (&lt;a href="https://arxiv.org/abs/2603.02766" rel="noopener noreferrer"&gt;arXiv:2603.02766&lt;/a&gt;). It is a framework that analyzes an agent's failures and generates reusable skills automatically. No model retraining needed. Only the skills evolve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1a014ejqstxl028aauy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1a014ejqstxl028aauy.jpg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why skills are the right level of optimization
&lt;/h2&gt;

&lt;p&gt;Google's AlphaEvolve evolves code. GEPA/DSPy evolves prompts. EvoSkill evolves skills.&lt;/p&gt;

&lt;p&gt;Code optimization is tightly bound to a specific model and task. You cannot move it to another environment. Prompt optimization can shift priorities, but it cannot encode multi-step procedures. Skills sit in the middle. They can hold branching logic and verification steps inside a Markdown file. Humans can read them. Edit them. Hand them to a different agent as-is.&lt;/p&gt;

&lt;p&gt;EvoSkill runs an evolution loop with three specialized agents. The Executor runs tasks and collects failures. The Proposer analyzes failure traces, finds repeated patterns, and suggests new skills. The Skill Builder turns those suggestions into a SKILL.md file plus helper scripts inside &lt;code&gt;.claude/skills/&lt;/code&gt;. Candidates are tested on a held-out validation set, and the top N candidates survive as the Frontier.&lt;/p&gt;

&lt;p&gt;The base model (Claude Opus 4.5) stays frozen throughout. No weight updates at all. Only the skill layer changes.&lt;/p&gt;

&lt;p&gt;The Proposer keeps a running history of past suggestions. It knows what worked and what got rejected. This prevents the loop from going in circles. The paper calls this mechanism Textual Feedback Descent.&lt;/p&gt;
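&lt;p&gt;The loop structure can be sketched as follows. Every name here is a hypothetical stand-in; the real system runs agents against benchmark tasks, not toy lambdas:&lt;/p&gt;

```python
def evolve_skills(execute, propose, build, validate, n_rounds=5, frontier_size=4):
    """Sketch of the EvoSkill-style loop. Each round: run tasks, mine the
    failures for a skill proposal, build it, score it on held-out data, and
    keep only the top-N skill sets (the Frontier). Model weights never change."""
    frontier = [([], validate([]))]      # (skill_set, validation_score)
    history = []                         # Proposer memory: what was already tried
    for _ in range(n_rounds):
        skills, _score = frontier[0]     # expand the current best skill set
        failures = execute(skills)
        proposal = propose(failures, history)
        history.append(proposal)
        candidate = skills + [build(proposal)]
        frontier.append((candidate, validate(candidate)))
        frontier.sort(key=lambda fs: fs[1], reverse=True)
        frontier = frontier[:frontier_size]
    return frontier[0]

# Toy run where each discovered skill adds a fixed accuracy gain:
best_skills, best_score = evolve_skills(
    execute=lambda skills: ["timeout"] if len(skills) == 0 else ["parse-error"],
    propose=lambda failures, hist: f"fix-{failures[0]}",
    build=lambda proposal: proposal.replace("fix-", "skill-"),
    validate=lambda skills: 60.6 + 1.5 * len(skills),
)
print(len(best_skills), best_score)
```

&lt;p&gt;The &lt;code&gt;history&lt;/code&gt; list is the piece that corresponds to Textual Feedback Descent: the Proposer sees its own past proposals and avoids circling.&lt;/p&gt;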

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bqn8m7cdmr4a7ti49q1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bqn8m7cdmr4a7ti49q1.jpg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What this loop produced
&lt;/h2&gt;

&lt;p&gt;OfficeQA is a grounded reasoning benchmark built on US Treasury Bulletin archives (about 89,000 pages going back to 1939). It has 246 questions. Human solvers spend an average of 50 minutes per question (&lt;a href="https://www.databricks.com/blog/introducing-officeqa-benchmark-end-to-end-grounded-reasoning" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;EvoSkill used only 10% of the data for training and raised exact-match accuracy from 60.6% to 67.9% (+7.3 points). The skills it discovered include a strict verification protocol for extracting numbers from financial tables and a Python-based method for CPI inflation adjustment.&lt;/p&gt;

&lt;p&gt;SealQA showed an even bigger gain. This benchmark tests fact-finding when web search returns noisy and contradictory results. Accuracy went from 26.6% to 38.7%, a jump of +12.1 points. The system discovered a skill called search-persistence-protocol. It lists all reasonable interpretations of ambiguous terms and searches each one separately. It requires at least three independent sources before giving an answer. It tries three or more query phrasings before reporting "not found."&lt;/p&gt;
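&lt;p&gt;The logic of that skill reads almost like pseudocode. A hedged sketch of it in Python, with &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;interpretations&lt;/code&gt; as stand-ins for the agent's tool calls:&lt;/p&gt;

```python
def persistent_search(term, interpretations, search, min_sources=3, max_phrasings=3):
    """Sketch of the search-persistence-protocol idea: enumerate every
    reasonable interpretation, try several query phrasings for each, and
    only report a finding backed by enough independent sources."""
    findings = {}
    for interp in interpretations(term):
        sources = set()
        for phrasing in range(max_phrasings):
            sources.update(search(interp, phrasing))
            if len(sources) >= min_sources:
                break
        if len(sources) >= min_sources:
            findings[interp] = sorted(sources)
    return findings if findings else "not found"

# Toy demo: each phrasing of each interpretation surfaces two new sources.
result = persistent_search(
    "Mercury",
    interpretations=lambda t: [f"{t} (planet)", f"{t} (element)"],
    search=lambda interp, p: {f"{interp}-src{p}a", f"{interp}-src{p}b"},
)
print(sorted(result))
```

&lt;p&gt;The structure encodes the three rules from the discovered skill directly: enumerate interpretations, require at least three sources, and retry phrasings before giving up.&lt;/p&gt;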

&lt;p&gt;The most striking result came next. The skills evolved on SealQA transferred to BrowseComp (an OpenAI benchmark) with zero modification and improved accuracy by +5.3 points. Skill-level optimization produces transferable capabilities, not task-specific heuristics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;There is an author overlap concern. Two of the five EvoSkill authors (Weiyuan Chen and Tu Vu) also authored the SealQA benchmark. The +12.1 point improvement on SealQA was partly evaluated by the people who designed the test. The BrowseComp transfer result (+5.3 points) is on an independent benchmark built by OpenAI, which adds some objectivity.&lt;/p&gt;

&lt;p&gt;The official blog says "up to 50% accuracy improvement." The largest gain in the paper is +12.1 points in absolute terms (SealQA 26.6% to 38.7%). In relative terms, that is about 45%. When citing this work, use absolute points for accuracy.&lt;/p&gt;
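&lt;p&gt;The two framings differ only in arithmetic:&lt;/p&gt;

```python
before, after = 26.6, 38.7
absolute_points = after - before                 # 12.1 points
relative_pct = 100 * (after - before) / before   # about 45 percent
print(f"{absolute_points:.1f} points, {relative_pct:.0f}% relative")
```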

&lt;p&gt;The paper is an arXiv preprint as of March 2026. No peer-reviewed publication has been confirmed. GitHub has 160 stars and 12 forks. No independent reproduction by third parties has been reported yet.&lt;/p&gt;

&lt;p&gt;Training beyond 10% of the data showed diminishing returns. The 15% split scored lower than 10%, suggesting mild overfitting. However, merging skills discovered in separate runs produced the strongest configuration overall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Improving AI agents used to mean two things: retrain the model, or write skills by hand. EvoSkill offers a third option. The agent learns from its own failures, generates skills automatically, and those skills transfer to tasks it has never seen.&lt;/p&gt;

&lt;p&gt;OfficeQA +7.3 points. SealQA +12.1 points. BrowseComp +5.3 points with zero-shot transfer. The model weights were never touched. Only the skill layer, made of Markdown instructions and helper scripts, evolved.&lt;/p&gt;

&lt;p&gt;Prompt engineering is hitting its ceiling. Code optimization does not transfer. Skills are the optimization layer in between, and they are human-readable. The paper is still pre-review, but the idea that agents could share skill libraries the way developers share packages is worth watching.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aiagentsclaudecode</category>
    </item>
    <item>
      <title>Codified Context: A chemist wrote 100K lines of game code alone. The secret was the context architecture.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Tue, 17 Mar 2026 04:18:09 +0000</pubDate>
      <link>https://dev.to/n_asuy/codified-context-a-chemist-wrote-100k-lines-of-game-code-alone-the-secret-was-the-context-4f4a</link>
      <guid>https://dev.to/n_asuy/codified-context-a-chemist-wrote-100k-lines-of-game-code-alone-the-secret-was-the-context-4f4a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiji274fc395ane6048a4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiji274fc395ane6048a4.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The biggest weakness of AI coding agents is that they forget everything when a session ends. Project rules, past mistakes, all gone. Developers started writing rules in files like CLAUDE.md and .cursorrules, but a study of 253 CLAUDE.md files (&lt;a href="https://arxiv.org/abs/2509.14744" rel="noopener noreferrer"&gt;Agentic Coding Manifests&lt;/a&gt;) found that a single file cannot cover a 100K-line codebase.&lt;/p&gt;

&lt;p&gt;The answer to this problem is Codified Context. In February 2026, Aristidis Vasilopoulos formalized this approach in a paper (&lt;a href="https://arxiv.org/abs/2602.20478" rel="noopener noreferrer"&gt;arXiv:2602.20478&lt;/a&gt;). It structures project knowledge inside the codebase so that agents do not start from scratch every session. The central idea is to treat documentation as infrastructure by design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fportc1irz9iyrpxd3lz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fportc1irz9iyrpxd3lz5.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Structure Works: The 3-Tier Architecture
&lt;/h3&gt;

&lt;p&gt;What makes Codified Context different from a simple rules file is that it separates knowledge into three layers by access frequency.&lt;/p&gt;

&lt;p&gt;Tier 1 is the Constitution. A single Markdown file of about 660 lines containing coding standards, build commands, and a trigger table. It is auto-loaded at the start of every session as hot memory. It defines what the agent should know before it writes a single line of code.&lt;/p&gt;

&lt;p&gt;Tier 2 is Specialized Agents. Domain-specific agent specs. In the paper's case study, 19 agents totaling about 9,300 lines. A trigger table controls routing. Change a network sync file and &lt;code&gt;network-protocol-designer&lt;/code&gt; is called automatically. By predefining which knowledge is needed for which files, agents retrieve the right context without being told.&lt;/p&gt;

&lt;p&gt;Tier 3 is the Knowledge Base. 34 spec documents totaling about 16,250 lines, searchable on demand via an MCP server. Cold memory that does not consume the context window until needed.&lt;/p&gt;
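&lt;p&gt;The Tier 2 trigger table amounts to a pattern-to-agent routing map. A minimal sketch; the file patterns and two of the agent names are invented, while &lt;code&gt;network-protocol-designer&lt;/code&gt; comes from the paper's case study:&lt;/p&gt;

```python
import fnmatch

# Hypothetical trigger table: file patterns route to the specialized
# agent (Tier 2) whose spec should be loaded for that change.
TRIGGER_TABLE = {
    "src/net/*_sync.cs": "network-protocol-designer",
    "src/save/*.cs": "save-system-specialist",
    "src/ui/*.cs": "ui-sync-router",
}

def route(changed_file):
    for pattern, agent in TRIGGER_TABLE.items():
        if fnmatch.fnmatch(changed_file, pattern):
            return agent
    return None   # no trigger: fall back to Tier 1 knowledge only

print(route("src/net/player_sync.cs"))  # prints: network-protocol-designer
```

&lt;p&gt;The point of the table is that routing is declarative: adding a domain means adding a row, not retraining or re-prompting the agent.&lt;/p&gt;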

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c6xmwnxp2xilbnzicq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c6xmwnxp2xilbnzicq2.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This design solves a fundamental dilemma: load everything and the context overflows; load nothing and the agent loses track. By separating always-loaded knowledge, trigger-invoked knowledge, and search-retrieved knowledge, the agent keeps the most relevant context available within a limited context window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe2yzd10dxh6cpsiz6mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe2yzd10dxh6cpsiz6mg.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Structure Produced: 100K Lines by One Person
&lt;/h3&gt;

&lt;p&gt;Vasilopoulos is a chemist, not a software engineer. Using Claude Code as his only code generation tool, he built a 108,256-line C# multiplayer game in 70 days. 405 files, 283 sessions, 2,801 human prompts. The knowledge-to-code ratio was 24.2%. For every 4 lines of code, there was about 1 line of context documentation.&lt;/p&gt;

&lt;p&gt;The 3-tier structure showed concrete results in the save system. A 283-line spec (Tier 3) was refined over 74 sessions. Zero save-related bugs were reported. Because past failures and correct patterns accumulated in the spec, the agent did not repeat the same mistakes. For UI sync routing, a 126-line spec collected lessons learned, and the next similar feature was built correctly on the first try.&lt;/p&gt;

&lt;p&gt;Maintenance cost was about 5 minutes per session for spec updates, plus a 30-45 minute review every two weeks. About 1-2 hours per week total.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats and Tool Support
&lt;/h3&gt;

&lt;p&gt;This research has clear limits. It is an observational report from a single developer on a single project, not a controlled experiment comparing results with and without Codified Context. The implementation is specific to Claude Code, so direct applicability to other tools remains at the principle level. Other research has reported that context files can actually lower task success rates in some cases. The effect depends on conditions.&lt;/p&gt;

&lt;p&gt;That said, the principle of structuring knowledge for agents has been adopted across vendors. Claude Code has four scope levels (organization, project, user, local) with CLAUDE.md, .claude/rules/, and Auto Memory. Cursor uses Project, Team, and User Rules plus AGENTS.md, with &lt;code&gt;.cursorrules&lt;/code&gt; now legacy. GitHub Copilot supports &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;, &lt;code&gt;.github/instructions/&lt;/code&gt; directory files, and AGENTS.md. Context files are not configuration. The AI reads them and tries to follow them, but there is no guarantee. Conflicting instructions are resolved arbitrarily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The central insight of Codified Context is practical: if you have explained something to an AI agent twice, it should be a spec document. And those specs should be designed by deciding what to always load, what to invoke by trigger, and what to search on demand. Treating project context as infrastructure is shaping the next phase of AI-assisted development.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>MCP tool spoofing succeeds 100% of the time. A new paper maps 12 security risks across 4 agent protocols.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Sun, 15 Mar 2026 05:08:11 +0000</pubDate>
      <link>https://dev.to/n_asuy/mcp-tool-spoofing-succeeds-100-of-the-time-a-new-paper-maps-12-security-risks-across-4-agent-6i3</link>
      <guid>https://dev.to/n_asuy/mcp-tool-spoofing-succeeds-100-of-the-time-a-new-paper-maps-12-security-risks-across-4-agent-6i3</guid>
      <description>&lt;p&gt;[Edit] I just shipped youtube about this topic, check it out!&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/mIRgd7Flrec"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzufr5rpa6a6jup2aiaa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzufr5rpa6a6jup2aiaa.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP now has over 10,000 public servers. More than 50 companies are building A2A. AI agent protocols are growing fast.&lt;/p&gt;

&lt;p&gt;But security research is not keeping up. For Agora and ANP, almost no security analysis existed before this paper.&lt;/p&gt;

&lt;p&gt;In February 2026, researchers from the Canadian Institute for Cybersecurity and Mastercard published a paper that organizes 12 risks across 4 protocols (&lt;a href="https://arxiv.org/abs/2602.11327" rel="noopener noreferrer"&gt;arXiv:2602.11327&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Protocols
&lt;/h3&gt;

&lt;p&gt;AI agent communication has different layers for different jobs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;, released by Anthropic in November 2024, connects AI to external tools and data using OAuth 2.1. It is already in production.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/a2aproject/A2A" rel="noopener noreferrer"&gt;A2A&lt;/a&gt;, announced by Google in April 2025, handles agent-to-agent communication with OAuth 2.0+JWT and is currently in draft stage.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2410.11905" rel="noopener noreferrer"&gt;Agora&lt;/a&gt;, proposed by Marro et al. in October 2024, is a meta-protocol that dynamically generates and negotiates communication rules using hash-based authentication. It remains at the research stage.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2508.00007" rel="noopener noreferrer"&gt;ANP&lt;/a&gt;, proposed by Chang et al. in July 2025, provides the network layer for large-scale agent networks using W3C DID for authentication. It is also at the research stage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These protocols do not compete. They stack on top of each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  12 Protocol-Level Risks
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeddpqg3o1rs57dgnfjr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkeddpqg3o1rs57dgnfjr.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper sorts risks by lifecycle: creation, operation, and update.&lt;/p&gt;

&lt;p&gt;The creation phase has four risks. MCP and Agora have weak identity checks. MCP and ANP do not protect registration data integrity. MCP has no namespace separation, so same-name tool spoofing works. Agora and ANP have no security policy defined at all.&lt;/p&gt;

&lt;p&gt;The operation phase also has four risks. MCP does not verify what actually runs. MCP and A2A have no control over data exchange. MCP and A2A give permissions that are too broad. All four protocols lack rate limiting and backpressure.&lt;/p&gt;

&lt;p&gt;The update phase has four more. MCP and A2A never cancel old credentials. Agora and ANP have no rollback protection. MCP does not sign or verify update packages. All protocols are vulnerable to dependency drift.&lt;/p&gt;

&lt;p&gt;In short, the creation phase cannot verify who registered what. The operation phase cannot control what runs or how much access it has. The update phase leaves old credentials active and applies unsigned packages without checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Experiment: Tool Spoofing
&lt;/h3&gt;

&lt;p&gt;The researchers also ran an experiment on MCP. They set up a real server and a fake server, both using the same tool name (&lt;code&gt;authorize_payment&lt;/code&gt;). Then they checked which one the AI called.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F893cg9aukhk8omjoflon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F893cg9aukhk8omjoflon.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In first-match mode, the AI always called the wrong server (Violation Rate = 1.0). In best-match mode, it was wrong about half the time (Violation Rate = 0.52). Without cryptographic signatures, tool spoofing works reliably.&lt;/p&gt;
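&lt;p&gt;The failure mode behind that result can be sketched with a toy registry. The server names and registry shape below are illustrative, not MCP's actual API:&lt;/p&gt;

```javascript
// Minimal sketch of why name-based tool resolution is spoofable.
const registry = [];

function registerTool(server, name) {
  registry.push({ server, name });
}

// First-match resolution: whichever server registered the name first wins.
function resolveFirstMatch(name) {
  return registry.find((t) => t.name === name);
}

// A malicious server registers "authorize_payment" before the real one.
registerTool("fake-payments", "authorize_payment");
registerTool("real-payments", "authorize_payment");

// The caller asks by name and silently gets the fake server.
console.log(resolveFirstMatch("authorize_payment").server); // "fake-payments"
```

&lt;p&gt;Nothing in the lookup distinguishes the two servers, which is exactly what a cryptographic identity check would add.&lt;/p&gt;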

&lt;h3&gt;
  
  
  Reported Incidents
&lt;/h3&gt;

&lt;p&gt;These incidents were reported on X and security news sites.&lt;/p&gt;

&lt;p&gt;OpenClaw was shown to be vulnerable to indirect prompt injection, enabling backdoor installation and C2 deployment. A supply chain attack on Cline CLI v2.3.0 also led to approximately 4,000 unauthorized installations. A vulnerability in MCP Inspector (CVE-2025-49596, CVSS 9.4) allowed remote code execution simply by visiting a malicious web page (now patched). On ClawHub, Snyk's Agent Scan analysis of 3,984 skills found 76 confirmed malware packages; VirusTotal scanning has since been added in response. A Cursor MCP server leaked login credentials through indirect prompt injection.&lt;/p&gt;

&lt;p&gt;Many of these reports come from X posts and have not been verified independently.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Do Now
&lt;/h3&gt;

&lt;p&gt;Use cryptographic signatures to verify identity, not just names. Add supply chain checks like signature verification, code scanning, and version pinning. Watch MITRE ATLAS. They are adding attack techniques specific to AI agents. Set the default to least privilege and enforce token scope at the protocol level.&lt;/p&gt;

&lt;p&gt;Over 10,000 MCP servers. Over 50 companies building A2A. Protocol adoption is outpacing security. The shift from "call tools by name" to "verify tools by signature" is the first step.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>mcp</category>
      <category>security</category>
    </item>
    <item>
      <title>AI Agents Don't Need to Touch the UI. WebMCP Is the Third Way.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Mon, 09 Mar 2026 17:48:57 +0000</pubDate>
      <link>https://dev.to/n_asuy/ai-agents-dont-need-to-touch-the-ui-webmcp-is-the-third-way-4fhp</link>
      <guid>https://dev.to/n_asuy/ai-agents-dont-need-to-touch-the-ui-webmcp-is-the-third-way-4fhp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5yq899g9g3uad8g96jz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5yq899g9g3uad8g96jz.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI agents can interact with websites in two ways.&lt;br&gt;
UI actuation — simulating clicks and keystrokes — is slow and fragile. Backend integration via MCP servers or OpenAPI demands significant server-side implementation and maintenance.&lt;/p&gt;

&lt;p&gt;WebMCP is a third way that is different from both.&lt;br&gt;
The web page registers JavaScript functions as "tools." AI agents in the browser call those tools directly. Existing frontend code can be reused as-is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4gm19tqtzgb7p7j975q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4gm19tqtzgb7p7j975q.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The W3C Web Machine Learning Community Group (&lt;a href="https://webmachinelearning.github.io/webmcp/" rel="noopener noreferrer"&gt;https://webmachinelearning.github.io/webmcp/&lt;/a&gt;) published it as a Draft Community Group Report on February 27, 2026. It is not an official W3C standard yet. You can try it through the Chrome Early Preview Program. The editors are from Microsoft (Brandon Walderman) and Google (Khushal Sagar, Dominic Farolino).&lt;/p&gt;
&lt;h3&gt;
  
  
  Minimal Code Example
&lt;/h3&gt;

&lt;p&gt;Setting up WebMCP is surprisingly simple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;modelContext&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;provideContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;add-item&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Add a new item to the list&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;addItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`"&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" added!`&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The page registers tools through &lt;code&gt;navigator.modelContext&lt;/code&gt;. The AI agent discovers available tools, calls them, and receives results. When a potentially destructive action is involved, &lt;code&gt;requestUserInteraction()&lt;/code&gt; prompts the human user for confirmation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Differs from MCP
&lt;/h3&gt;

&lt;p&gt;People often confuse MCP and WebMCP, but they complement each other rather than compete.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP runs on backend servers, communicates via JSON-RPC over stdio or HTTP, requires a transport layer, offers three primitives (Tools, Resources, Prompts), and needs server-side implementation.&lt;/li&gt;
&lt;li&gt;WebMCP runs inside the browser, executes page JavaScript directly, needs no transport layer and no server, and offers Tools only (aligned with MCP's tool format). It reuses existing frontend code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebMCP aligns with MCP's "tools" primitive but does not include resources or prompts. The browser acts as the mediator and handles the data layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop Design
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6s95auovrkodo6tyi5o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6s95auovrkodo6tyi5o.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before any destructive action (purchasing, sending messages, etc.), &lt;code&gt;requestUserInteraction()&lt;/code&gt; asks the user for permission. The design assumes human oversight, not full autonomy.&lt;/p&gt;
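&lt;p&gt;A destructive tool gated on confirmation might look like the sketch below. The draft spec's exact confirmation API is still in flux, so &lt;code&gt;confirmWithUser&lt;/code&gt; here is only a stand-in for a &lt;code&gt;requestUserInteraction()&lt;/code&gt;-style prompt, injected so the logic can run outside a browser:&lt;/p&gt;

```javascript
// Hedged sketch of a destructive WebMCP tool that asks the user first.
// confirmWithUser and deleteItem are injected stand-ins, not spec APIs.
function makeDeleteTool(confirmWithUser, deleteItem) {
  return {
    name: "delete-item",
    description: "Delete an item from the list (asks the user first)",
    inputSchema: {
      type: "object",
      properties: { name: { type: "string" } },
      required: ["name"],
    },
    async execute({ name }) {
      // Gate the destructive action on an explicit human decision.
      const ok = await confirmWithUser(`Delete "${name}"?`);
      if (!ok) {
        return { content: [{ type: "text", text: "Cancelled by user." }] };
      }
      deleteItem(name);
      return { content: [{ type: "text", text: `"${name}" deleted.` }] };
    },
  };
}
```

&lt;p&gt;The agent can still call the tool freely; it just cannot complete the destructive step without the human saying yes.&lt;/p&gt;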

&lt;h3&gt;
  
  
  Open Problems
&lt;/h3&gt;

&lt;p&gt;WebMCP still has several unresolved issues. The security specification remains marked as TODO. Headless execution (no browser UI) is unsupported. Tools are only discoverable after navigating to the page. Firefox and Safari do not support it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why You Should Watch This
&lt;/h3&gt;

&lt;p&gt;The way AI agents interact with the web is shifting — from "auto-clicking on screen" to "calling official tools the site provides." As this trend continues, designing business logic so it can be invoked from both UI and AI becomes an advantage. Per-tool permission and audit design will be necessary, and "AI-friendly websites" will become a competitive edge.&lt;/p&gt;

&lt;p&gt;Note: A separate project also named "WebMCP" exists (MiguelsPizza/WebMCP). This article covers only the W3C proposal at &lt;a href="https://github.com/webmachinelearning/webmcp" rel="noopener noreferrer"&gt;webmachinelearning/webmcp&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Until now, agents could either scrape or integrate through the backend. WebMCP adds a third option: the page hands over its own tools. It is still a draft, but it could fundamentally change how agents and the web connect.&lt;/p&gt;

</description>
      <category>webmcp</category>
      <category>mcp</category>
      <category>aiagents</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The secret isn't the model. It's the harness.</title>
      <dc:creator>nasuy</dc:creator>
      <pubDate>Sat, 07 Mar 2026 18:04:44 +0000</pubDate>
      <link>https://dev.to/n_asuy/the-secret-isnt-the-model-its-the-harness-587a</link>
      <guid>https://dev.to/n_asuy/the-secret-isnt-the-model-its-the-harness-587a</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0kn996obi5jo0w6asr5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0kn996obi5jo0w6asr5.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Getting AI agents to write code is not new anymore. The real problem is not how smart the model is. The real problem is that agents lack good environments for long-running work.&lt;/p&gt;

&lt;p&gt;Harness Engineering is the field that works on this problem. In November 2025, Anthropic (&lt;a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents" rel="noopener noreferrer"&gt;https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents&lt;/a&gt;) published a blog post about it. In February 2026, OpenAI (&lt;a href="https://openai.com/index/harness-engineering/" rel="noopener noreferrer"&gt;https://openai.com/index/harness-engineering/&lt;/a&gt;) did the same. OpenAI said a team of 7 people made about 1 million lines of code in 1,500 PRs over 5 months. They wrote zero lines by hand (self-reported).&lt;/p&gt;

&lt;p&gt;On X, the post "the 10x skill of 2026 is Evaluation Engineering" went viral. The engineer's job is changing. From "writing code" to "building environments where agents write good code."&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Parts of Harness Engineering
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0r584w9jxb3mkz2i71p.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0r584w9jxb3mkz2i71p.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Agent Harness is the execution side. The setup that lets agents work well over long sessions. It automates environment setup, passes progress between sessions using progress files and Git, builds one feature at a time, and runs E2E tests automatically.&lt;/p&gt;

&lt;p&gt;Evaluation Harness is the quality side. How you score AI output with numbers, not feelings. EleutherAI has 60+ benchmarks. Inspect AI has 100+ pre-built evaluations. LLM-as-a-judge lets AI grade AI. These connect to CI/CD gates and safety testing (MLCommons AILuminate has 59,624 test prompts).&lt;/p&gt;

&lt;h2&gt;
  
  
  Anthropic's Approach: Session Handoff
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz31p8lw7t5ah3uzqturh.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz31p8lw7t5ah3uzqturh.webp" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic uses a two-step system. First, a setup agent makes init.sh and a feature list (JSON). Then a coding agent builds one feature at a time: code, test, commit, repeat. Between sessions, claude-progress.txt and Git history carry the work forward.&lt;/p&gt;
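&lt;p&gt;The handoff loop can be sketched as a function that picks the next unfinished feature. The file names follow the post; the in-memory shapes below are illustrative:&lt;/p&gt;

```javascript
// Sketch of session handoff: a feature list (features.json) plus a
// progress record (claude-progress.txt) decide what the next session does.
function nextFeature(features, progressLog) {
  const done = new Set(
    progressLog
      .filter((line) => line.startsWith("DONE "))
      .map((line) => line.slice(5))
  );
  return features.find((f) => !done.has(f.id)) ?? null;
}

const features = [
  { id: "auth", description: "Login and session handling" },
  { id: "feed", description: "Home feed rendering" },
];

// The progress file carries state between sessions; here as an array of lines.
const progress = ["DONE auth"];

console.log(nextFeature(features, progress).id); // "feed"
```

&lt;p&gt;Each session appends to the progress log and commits, so the next session can resume without any memory of the last one.&lt;/p&gt;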

&lt;h2&gt;
  
  
  OpenAI's Approach: Repo-Wide Environment
&lt;/h2&gt;

&lt;p&gt;AGENTS.md (about 100 lines) sets the rules for the whole repo. Custom linters and CI enforce those rules automatically. Instead of asking the AI nicely in a prompt, they make the tools force the rules.&lt;/p&gt;
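&lt;p&gt;A toy version of such a lint gate, with an invented rule that is not one of OpenAI's actual rules:&lt;/p&gt;

```javascript
// Sketch of "enforce rules with tools, not prompts": a custom lint check
// that fails CI when a repo rule is violated. The banned pattern is invented.
function lint(files, bannedPattern) {
  const violations = [];
  for (const [path, source] of Object.entries(files)) {
    source.split("\n").forEach((line, i) => {
      if (bannedPattern.test(line)) {
        violations.push(`${path}:${i + 1}: banned pattern ${bannedPattern}`);
      }
    });
  }
  return violations; // a non-empty array fails the CI gate
}
```

&lt;p&gt;Unlike a prompt instruction, a check like this cannot be forgotten or ignored: the agent's PR simply does not merge until the violation is gone.&lt;/p&gt;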

&lt;p&gt;The methods are different, but both companies reached the same conclusion. Put knowledge in the repo, enforce rules with tools, break work into small steps and leave a trail. Both approaches have limits, though. Anthropic's method is optimized for full-stack web development and has not been tested on scientific research or financial modeling. OpenAI's environment is highly customized for one repo and cannot be copied directly to other projects.&lt;/p&gt;

&lt;p&gt;Models will keep getting smarter. But even the smartest model cannot sustain long-running development without a well-designed environment. The difference is not which model you pick. It is how you build the harness.&lt;/p&gt;




&lt;p&gt;I cover AI agent designs, skills, and context engineering from the perspective of bringing AI into real teams and workflows. Analysis grounded in primary sources. Follow for more.&lt;br&gt;
&lt;a href="https://x.com/n_asuy" rel="noopener noreferrer"&gt;https://x.com/n_asuy&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
