The agentskills.io spec recommends two things in every description: start with an action verb, and include a trigger phrase like "use when..." that tells the routing layer when to fire the skill. They take five seconds to add and they're the difference between a skill an agent picks up and a skill that sits unused in the catalog.
I sampled 500 skills at random from a 1,436-skill public corpus and measured both. 5.8% follow both recommendations. 61.8% follow neither.
Here's the full breakdown of what the SKILL.md ecosystem actually looks like in production as of late April 2026.
Methodology
Corpus: sickn33/antigravity-awesome-skills at HEAD on April 29, 2026. This is the largest publicly bundled SKILL.md collection in a single repo (1,436 indexed skills with metadata for category, source, and risk classification).
Sample: 500 skills, random with seed 42 for reproducibility.
Tool: skillcheck v1.2.0 from PyPI.
Per-skill features captured: every skillcheck diagnostic (rule, severity, message), description quality score, body line count, body and metadata token estimates, activation entropy and top-hypothesis score from --activation-hypotheses, structural features computed locally (description length in chars and words, action verb in first position, trigger-phrase presence, presence of resources/, scripts/, and references/ subdirectories, frontmatter field count and which fields), plus the antigravity-supplied category, source, and risk metadata.
Caveat one: skillcheck's description quality score is a heuristic that includes action-verb and trigger-phrase detection as positive signals. So the correlation between these two features and the score is partly mechanical. The headline finding is not "we discovered these patterns predict quality." It's "the spec recommends these patterns, the linter that encodes the spec rewards them, and almost nobody is using them."
Caveat two: antigravity's bundler injects risk, source, date_added, and category fields into the SKILL.md frontmatter when packaging skills. The author-original frontmatter analysis below excludes these injected fields.
Reproduce in five commands:
pip install skillcheck==1.2.0
git clone --depth 1 https://github.com/sickn33/antigravity-awesome-skills.git
cd antigravity-awesome-skills
# Then sample from skills_index.json with seed 42 and run skillcheck against each
# Full analysis script: see the dataset link at the bottom
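For the last two steps, here's a minimal sketch of what the sampling and per-skill run can look like. It assumes skills_index.json is a JSON array whose entries carry a path field pointing at each skill directory; the exact index schema is an assumption, so adjust to what the repo actually ships.

```python
import json
import random
import subprocess

# Assumption: skills_index.json is a JSON array and each entry has a "path"
# field pointing at the skill directory. Adjust to the real index schema.
with open("skills_index.json") as f:
    index = json.load(f)

random.seed(42)
sample = random.sample(index, 500)

results = []
for entry in sample:
    # Same invocation described in the methodology section below.
    proc = subprocess.run(
        ["skillcheck", entry["path"], "--format", "json", "--skip-ref-check"],
        capture_output=True,
        text=True,
    )
    results.append(json.loads(proc.stdout))
```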
The two-pattern adoption gap
Every skill description was classified on two binary features: does it start with an action verb (Generates, Validates, Creates, Builds, Analyzes, etc., from a 90-verb allowlist), and does it contain a trigger phrase (use when, use this skill when, when the user, when working with, whenever, etc.)?
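For concreteness, here's a rough sketch of that classification. The verb set and trigger patterns below are small illustrative subsets, not the 90-verb allowlist or the nine production regexes.

```python
import re

# Illustrative subsets only; the study used a 90-verb allowlist (gerund and
# base forms) and nine trigger-phrase regexes.
ACTION_VERBS = {"generates", "validates", "creates", "builds", "analyzes"}
TRIGGER_PATTERNS = [
    r"\buse (this skill )?when\b",
    r"\bwhen the user\b",
    r"\bwhen working with\b",
    r"\bwhenever\b",
]

def classify(description: str) -> tuple[bool, bool]:
    """Return (starts_with_action_verb, has_trigger_phrase) for one description."""
    words = description.strip().split()
    first = words[0].lower().strip(".,:;") if words else ""
    has_verb = first in ACTION_VERBS
    has_trigger = any(re.search(p, description, re.I) for p in TRIGGER_PATTERNS)
    return has_verb, has_trigger
```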
| Pattern | Count | % |
|---|---|---|
| Has both action verb and trigger phrase | 29 | 5.8% |
| Action verb only | 108 | 21.6% |
| Trigger phrase only | 54 | 10.8% |
| Neither | 309 | 61.8% |
The same four groups, scored against skillcheck's description quality metric:
| Group | n | Median score | % scoring 70+ |
|---|---|---|---|
| Has both | 29 | 90.0 | 100.0% |
| Action verb only | 108 | 70.0 | 72.2% |
| Trigger phrase only | 54 | 70.0 | 94.3% |
| Neither | 309 | 50.0 | 8.4% |
The 100% rate in the both-features group isn't magic. It reflects that skillcheck's heuristic was designed around the spec's recommendations and rewards skills that follow them. What's actually striking is the bottom line: 309 of 500 published skills skip both recommendations. That's the working majority of the ecosystem leaving easy quality on the floor.
What authors actually fill in
Outside name and description, frontmatter is mostly empty. The median author-original frontmatter (excluding the bundler's injected fields) has just two fields. Two.
| Field | Adoption |
|---|---|
| name | 99.6% |
| description | 99.6% |
| author | 10.8% |
| tags | 10.6% |
| tools | 8.8% |
| license | 3.8% |
| allowed-tools | 2.8% |
| version | 2.2% |
| triggers | 0.6% |
| user-invokable | 0.6% |
| capabilities | 0.2% |
The spec offers version, author, tags, allowed-tools, model, agent, hooks, user-invocable, disable-model-invocation, skills, mode. Almost none of them are being used. 80% of authors stop after name and description. There's an entire optional metadata layer the spec defines and the ecosystem ignores.
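For reference, here's what a skill using a few of those optional fields could look like. The skill name and every value are invented for illustration, and which fields a host agent honors varies.

```yaml
---
# Hypothetical example; name, tool names, and values are illustrative only.
name: csv-report-builder
description: Generates summary reports from CSV exports. Use when the user asks to aggregate, pivot, or chart tabular data files.
version: 0.2.0
author: example-author
tags: [csv, reporting, data]
allowed-tools: [read_file, run_command]
---
```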
Progressive disclosure adoption is 16%
The spec's load-bearing concept is progressive disclosure: keep metadata tiny so the routing layer scans it cheaply, keep the body lean so it fits the agent's context window, push heavy material into resources/, scripts/, or references/ subdirectories that load only when needed.
| Subdirectory | Adoption |
|---|---|
| resources/ | 6.4% |
| scripts/ | 4.4% |
| references/ | 8.2% |
| Any of the three | 16.0% |
84% of skills inline everything in SKILL.md. The whole architectural promise of progressive disclosure (multiple skills can sit in the agent's catalog without overwhelming context) requires authors to actually use the pattern. Most don't.
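For anyone who hasn't seen the pattern used, a skill that actually applies it looks something like this on disk (directory names from the spec, file names invented for illustration):

```
csv-report-builder/
├── SKILL.md           # small frontmatter + lean body, always scanned
├── resources/         # heavy reference material, loaded only when needed
│   └── report-templates.md
├── scripts/
│   └── build_report.py
└── references/
    └── csv-dialect-notes.md
```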
Body bloat is real
23% of skills triggered disclosure.body-bloat warnings, meaning they contain code blocks over 50 lines or tables over 20 rows in the SKILL.md body itself. These are exactly the things the progressive disclosure pattern was designed to push out into references/.
13.6% exceeded the spec's 500-line soft cap on body length. 8.4% exceeded the 5,000-token body budget when skillcheck's tokenizer flagged them (the rest weren't measured because they didn't trip the warning threshold).
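If you want to catch this locally before a linter does, here's a naive sketch of the same two checks. The thresholds come from the warning description above; real markdown parsing is messier than this, so treat it as a rough pre-flight, not skillcheck's implementation.

```python
FENCE = "`" * 3  # avoids writing a literal markdown fence inside this snippet

def bloat_warnings(body: str, max_code_lines: int = 50, max_table_rows: int = 20) -> list[str]:
    """Rough local check mirroring the body-bloat thresholds described above."""
    warnings = []
    in_code, code_lines, table_rows = False, 0, 0
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(FENCE):  # a fence opens or closes a code block
            if in_code and code_lines > max_code_lines:
                warnings.append(f"code block with {code_lines} lines")
            in_code, code_lines = not in_code, 0
        elif in_code:
            code_lines += 1
        elif stripped.startswith("|"):  # markdown table row
            table_rows += 1
            if table_rows == max_table_rows + 1:
                warnings.append(f"table exceeding {max_table_rows} rows")
        else:
            table_rows = 0
    return warnings
```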
Description length sweet spot
Quality scores rise with description length up to about 175-225 characters, then plateau:
| Length range (chars) | n | Median quality |
|---|---|---|
| 25-49 | 16 | 50.0 |
| 50-99 | 90 | 50.0 |
| 100-149 | 158 | 60.0 |
| 150-199 | 131 | 70.0 |
| 200-249 | 62 | 67.5 |
| 250-299 | 38 | 60.0 |
The spec's character cap is 1,024. Almost nobody's pushing it. The ecosystem clusters between 100 and 200 chars (median 145), which is roughly the bottom edge of the quality plateau. Authors writing 150+ char descriptions get noticeably better routing signal density.
Cross-source patterns
Antigravity's index classifies each skill's source. Quality patterns by source class:
| Source class | n | Median quality | % action verb | % trigger | % progressive disclosure |
|---|---|---|---|---|---|
| community | 394 | 60.0 | 26.6% | 17.5% | 16.2% |
| external_repo | 38 | 65.0 | 34.2% | 31.6% | 18.4% |
| official_org | 9 | 60.0 | 77.8% | 0.0% | 33.3% |
| personal | 14 | 50.0 | 0.0% | 0.0% | 0.0% |
Three observations. Skills from official org repos (Anthropic, Hugging Face, etc.) hit 77.8% action-verb adoption, miles above the community baseline, but zero trigger-phrase use; their descriptions are direct and verb-led without the "use when" preamble. Skills from individual external repos (someone's personal GitHub project) actually hit the highest trigger-phrase rate (31.6%), suggesting individual maintainers writing for their own activation problem think harder about it than community contributors writing for a shared list. Skills tagged "personal" (someone's curated set of their own work) hit 0% on both patterns, which is the cleanest signal that "I made this for me" doesn't translate to "an agent will pick this up."
Skillcheck v1.2.0 against the corpus
The new version was released April 28, 2026. The skillcheck rule set found:
- 1 of 500 skills produced an actual ERROR (0.2%): `android_ui_verification`, which has invalid characters in its name.
- 499 of 500 produced WARNINGs (99.8%).
- 0 skills passed completely clean.
Most-fired rules:
| Rule | Count |
|---|---|
| frontmatter.field.unknown | 500 |
| description.quality-score | 499 |
| disclosure.body-bloat | 115 |
| compat.unverified | 81 |
| disclosure.metadata-budget | 70 |
| sizing.body.line-count | 68 |
| disclosure.body-budget | 42 |
| frontmatter.description.person-voice | 27 |
| frontmatter.field.ecosystem | 19 |
| sizing.body.token-estimate | 14 |
| frontmatter.name.reserved-word | 11 |
The frontmatter.field.unknown warning fires on every file because antigravity injects bundler-only fields into the frontmatter (risk, source, date_added); strip those and the genuine unknown-field rate drops dramatically. Worth knowing if you're running skillcheck against bundled corpora versus author-original repos.
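If you're running against a bundled corpus and want the author-original view, here's a small sketch of stripping those fields first. It assumes PyYAML and that the frontmatter is the leading --- delimited block; both are assumptions on my part, not documented skillcheck behavior.

```python
import yaml  # PyYAML; an assumption here, not a documented skillcheck dependency

BUNDLER_FIELDS = {"risk", "source", "date_added", "category", "id"}

def author_original_frontmatter(skill_md_text: str) -> dict:
    """Parse SKILL.md frontmatter and drop the bundler-injected fields."""
    # Assumes the file starts with a "---" delimited YAML block.
    _, frontmatter, _body = skill_md_text.split("---", 2)
    data = yaml.safe_load(frontmatter) or {}
    return {k: v for k, v in data.items() if k not in BUNDLER_FIELDS}
```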
What this means if you publish skills
Four things, each fixable in a single commit per skill:
1. Start the description with an action verb (`Generates`, `Validates`, `Creates`, `Analyzes`, `Refactors`, `Audits`, etc.). Not `Expert in`, not `Comprehensive`, not `One-stop`. The verb tells the routing layer what the skill does in two syllables.
2. Include a trigger phrase (`Use when ...`, `Trigger when ...`, `Use this skill when the user ...`). The agent's routing decision is "should I activate this." A trigger phrase answers it directly.
3. Aim for 175-225 characters in the description. Short descriptions don't carry enough routing signal; long ones bury it.
4. Push large code blocks (>50 lines), large tables (>20 rows), and detailed reference material out of `SKILL.md` and into `resources/`, `scripts/`, or `references/`. The body should describe the work; the reference files should hold the work.
That's it. Four changes that move a skill from the 61.8% of the ecosystem ignoring spec recommendations to the 5.8% following them.
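For illustration (invented skill, invented wording), here's what the first three changes do to a description:

```yaml
# Before: no action verb, no trigger phrase, ~50 chars
description: Comprehensive helper for database schema migrations.

# After: verb-led, explicit trigger phrase, ~180 chars
description: >-
  Reviews and generates SQL schema migrations, flagging destructive operations
  and missing rollbacks. Use when the user asks to write, review, or debug a
  database migration.
```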
Methodology, for anyone who wants to push back
- Tool: `skillcheck` v1.2.0 from PyPI (released April 28, 2026)
- Corpus: `sickn33/antigravity-awesome-skills` at HEAD on April 29, 2026 (1,436 indexed skills)
- Sample: 500 skills, drawn with `random.seed(42)` then `random.sample`
- Per-skill processing: `skillcheck path --format json --skip-ref-check` plus `skillcheck path --activation-hypotheses --format json`
- Feature extraction: action-verb match against a 90-verb allowlist (gerund and base forms); trigger-phrase match against 9 regex patterns; structural facts computed from filesystem and parsed frontmatter
- Quality score: pulled from skillcheck's `description.quality-score` info diagnostic (a published heuristic whose source is at `src/skillcheck/rules/description.py` in the skillcheck repo)
- Frontmatter analysis: bundler-injected fields (`risk`, `source`, `date_added`, `category`, `id`) excluded from the author-original counts above
The full dataset (500 skills, all features, all diagnostics) and the analysis output are in the skillcheck repo under docs/. Anyone who wants to verify a finding, slice it differently, or run the same pipeline against a different corpus has everything they need.
What's next
This study used skillcheck's symbolic mode and the activation-hypotheses generator. The agent-native critique mode (--ingest-critique) and capability graph extraction (--ingest-graph) weren't run here because they require a real agent in the loop and would have made the corpus run significantly longer. A follow-up study using those modes on a smaller subset (50-100 skills) would tell us what an agent actually sees in a skill versus what a static linter can measure. That's the next post.
moonrunnerkc/skillcheck: Cross-agent skill quality gate for SKILL.md files. Validates frontmatter, scores description discoverability, checks file references, enforces three-tier token budgets, and flags compatibility issues across Claude Code, VS Code/Copilot, Codex, and Cursor.