Sayed Ali Alkamel

Posted on Jun 12 • Originally published at github.com

skillscore: a CLI that scores your AI agent's SKILL.md 0–100

#ai #cli #opensource #showdev

A vague AI agent skill is worse than no skill at all — because the agent pays for it in context budget on every single turn, whether it uses it or not. Yet most of us write SKILL.md files by feel and ship them with zero feedback.

So I built skillscore: a command-line tool that statically analyzes any SKILL.md and gives it a 0–100 quality score, a letter grade, and a list of fix-it findings — each one citing the official authoring guide it comes from.

skillscore is an open-source Dart CLI that lints and scores AI agent skills (SKILL.md files) against the Claude, Codex, and Antigravity authoring guides. It runs fully offline, is deterministic, and exits with CI-friendly status codes.

TL;DR

🎯 Scores any SKILL.md 0–100 with a letter grade and actionable findings across 7 categories.
📚 Rules are drawn from the official Anthropic (Claude), OpenAI (Codex), Google (Antigravity), and Flutter skill-authoring guides — and every finding cites its source.
🔌 Offline, deterministic, zero network calls. Same input → same score, every time.
🚦 Built for CI: --min-score 80, JSON output, and SARIF 2.1.0 that annotates pull requests.
⚡ Install: dart pub global activate skillscore → pub.dev/packages/skillscore

Why I built this

AI agent skills are quietly becoming a standard. A skill is just a folder with a SKILL.md — YAML frontmatter (a name and a description) plus a Markdown body of instructions — and optional references/, examples/, scripts/, and assets/ subfolders. Claude Code, Codex, Antigravity, Gemini CLI, and Cursor all read the same format.
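To make that layout concrete, here is a minimal sketch of a manifest and a frontmatter reader. The skill name, description, and the `frontmatter_field` helper are all hypothetical, written for illustration; this is not how skillscore itself parses files.

```shell
# Write a hypothetical minimal SKILL.md to a temp file (contents are made up).
skill_file="$(mktemp)"
cat > "$skill_file" <<'EOF'
---
name: resize-images
description: Resizes PNG and JPEG files in a folder. Use when asked to shrink or scale images. Do not use for format conversion.
---
# Steps
1. Locate the image folder.
EOF

# frontmatter_field FILE KEY
# Prints the value of KEY from the block between the first two '---' lines.
frontmatter_field() {
  awk -v key="$2" '
    /^---[[:space:]]*$/ { fence++; next }
    fence == 1 && index($0, key ":") == 1 {
      sub("^" key ":[[:space:]]*", ""); print; exit
    }
    fence >= 2 { exit }
  ' "$1"
}

frontmatter_field "$skill_file" name
# Word count of the description, the part an agent carries around all the time.
frontmatter_field "$skill_file" description | wc -w
```

The helper only handles single-line `key: value` pairs, which is enough for `name` and `description` in this sample; real YAML needs a real parser.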

Here's the catch that most people miss: an agent keeps every skill's name and description in its context window permanently, so it can decide when to reach for one. A skill with a fuzzy description doesn't just fail to get used — it taxes every prompt and occasionally fires on the wrong request.

The vendors all published authoring guides telling you how to avoid this: front-load triggers, write in the third person, state when not to use the skill, keep the body short, document your scripts. Good advice — scattered across four different documents, none of them enforceable. There was no eslint for skills. So I wrote one.

What is skillscore?

skillscore is a skill linter and SKILL.md validator that turns those authoring guides into 24 concrete, checkable rules. Point it at a file, a skill folder, or a whole monorepo, and it produces a score per skill:

# Install (it's on pub.dev)
dart pub global activate skillscore

# Score a single skill — any name, any location
skillscore path/to/SKILL.md

# Score every skill in a tree
skillscore path/to/skills/

The rules live in 7 weighted categories:

Category	What it checks
A — Frontmatter validity	`---` delimiters, `name` format, `description` present
B — Description quality	states what + when, third person, front-loaded triggers, boundary clause
C — Conciseness	body length, no explainer bloat, no endless "or" chains
D — Structure	progressive disclosure, links one level deep, TOCs on long references
E — Instruction quality	anti-patterns, workflow checklist, feedback loop, code examples
F — Content hygiene	no rotting date references, forward-slash paths, consistent terms
G — Safety & scripts	a penalty (up to −15) when bundled scripts lack docs or a Safety section

100 points are distributed across A–F; category G only bites if your skill ships scripts or terminal commands. Profiles that exclude a vendor-specific rule are normalized back to 0–100, so a score means the same thing on every target.
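As a rough illustration of what "normalized back to 0–100" could mean, the sketch below scales earned points by the points available in a profile and then subtracts the category G penalty. The formula, the rounding, and the `normalized_score` name are assumptions for illustration, not skillscore's documented algorithm.

```shell
# normalized_score EARNED AVAILABLE [PENALTY]
# Assumed formula: round(earned / available * 100) - penalty, clamped at 0.
normalized_score() {
  earned=$1
  available=$2
  penalty=${3:-0}
  # Integer arithmetic; adding available/2 rounds to the nearest whole point.
  score=$(( (earned * 100 + available / 2) / available - penalty ))
  if [ "$score" -lt 0 ]; then score=0; fi
  echo "$score"
}

normalized_score 90 100      # full rubric, no penalty
normalized_score 76 85       # hypothetical profile with 15 points of rules excluded
normalized_score 90 100 15   # full rubric with the maximum G penalty
```

Under these assumptions the three calls print 90, 89, and 75: a profile with fewer available points is stretched back onto the same 0–100 scale before any penalty applies.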

Does it actually work? Let's score a real one

Here's skillscore run against a genuine skill from the Flutter team's public repo — flutter-add-widget-test/SKILL.md:

flutter-add-widget-test  (SKILL.md)
  Score: 90/100  Grade: A

  A  Frontmatter validity                     15/15  ██████████
  B  Description quality                      21/25  ████████░░
  C  Conciseness & token economy              15/15  ██████████
  D  Structure & progressive disclosure       15/15  ██████████
  E  Instruction quality                      14/20  ███████░░░
  F  Content hygiene                          10/10  ██████████
  G  Safety & scripts                    no penalty

  WARNING E1_anti_patterns  line 8
          Body contains no explicit anti-patterns (no "do not", "never", or "avoid").
          fix: Add explicit prohibitions, e.g. "Never share a WidgetTester across tests."

  INFO    B5_boundary_clause  line 3
          Description has no boundary clause saying when NOT to use the skill.
          fix: Append a boundary, e.g. "Do not use for multi-screen integration tests."

A genuinely good skill, and skillscore says so — but it also pinpoints the two things keeping it off a perfect score: it never tells the model what not to do, and its description doesn't state where the skill stops. Both are real, both are fixable in one line, and both come straight from the published guides.
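The E1 finding above is the kind of check that can be fully deterministic. Here is a minimal sketch of that idea, based only on the wording of the finding (it looks for "do not", "never", or "avoid"). The `has_anti_pattern` and `check` helpers are hypothetical, and the real rule may look at more than this.

```shell
# has_anti_pattern FILE
# Succeeds if FILE contains "do not", "never", or "avoid" as whole words.
has_anti_pattern() {
  grep -E -i -q -w 'do not|never|avoid' "$1"
}

# Two tiny sample bodies: one with an explicit prohibition, one without.
with_rule="$(mktemp)"
without_rule="$(mktemp)"
printf '%s\n' '1. Pump the widget.' '2. Never share a WidgetTester across tests.' > "$with_rule"
printf '%s\n' '1. Pump the widget.' '2. Assert on the finder.' > "$without_rule"

# Print "ok" or a warning label that mirrors the finding shown above.
check() {
  if has_anti_pattern "$1"; then
    echo "ok"
  else
    echo "WARNING E1_anti_patterns"
  fi
}

check "$with_rule"
check "$without_rule"
```

The first call prints `ok` and the second prints `WARNING E1_anti_patterns`. The same input always yields the same answer, which is the property that makes a rule like this usable as a gate.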

Want the rationale behind any finding? Ask:

skillscore explain E1_anti_patterns

It prints why the rule exists, the exact fix, and the source guide it's from.

Built for CI

A score you have to eyeball isn't a gate. skillscore is designed to live in a pipeline:

# .github/workflows/skills.yml
- name: Lint agent skills
  run: |
    dart pub global activate skillscore
    skillscore skills/ --min-score 80 --no-color

--min-score 80 → the job exits non-zero if any skill dips below the bar.
--format json → structured output for dashboards.
--format sarif → valid SARIF 2.1.0 that uploads to GitHub code scanning, so findings annotate the exact lines in a pull request.

Exit codes are pipeline-grade: 0 means everything passed, 1 means a gate failed, and 2 means a usage error. There is no flaky LLM in the loop and no network, so the same skill always scores the same.
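If a pipeline should treat a failed gate differently from a misconfigured command, a small wrapper can branch on those three codes. This is a sketch: the `explain_exit` helper and its messages are hypothetical, and only the meaning of codes 0, 1, and 2 comes from the text above.

```shell
# explain_exit CODE
# Maps an exit status to a short CI log message.
explain_exit() {
  case "$1" in
    0) echo "all skills passed" ;;
    1) echo "gate failed" ;;
    2) echo "usage error: check the command line" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Example wiring (commented out so the sketch runs without the tool installed):
#   skillscore skills/ --min-score 80 --no-color
#   explain_exit "$?"
explain_exit 1
```

Separating code 1 from code 2 lets a job report "a skill needs work" differently from "the workflow itself is broken".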

How is this different from just asking an LLM to review my skill?

	skillscore	Vendor schema check	Markdown linter	"Ask an LLM"
Validates frontmatter	✅	✅	❌	⚠️
Scores quality (discoverability, structure, instructions)	✅	❌	❌	✅
Cites a source guide per finding	✅	❌	❌	❌
Deterministic / reproducible	✅	✅	✅	❌
Safe for a CI gate	✅	✅	✅	❌
Offline	✅	✅	✅	❌

An LLM review is great for nuance but non-deterministic — you can't gate a build on it. A schema check tells you the file is valid, not whether it's any good. skillscore fills the gap in the middle, and it pairs nicely with the other two.

It's a library too

The CLI is a thin wrapper over a public Dart API, so you can embed scoring in your own tooling:

import 'package:skillscore/skillscore.dart';

void main() {
  final doc = SkillParser().parseFile('my-skill/SKILL.md');
  final result = Scorer(RuleRegistry()).score(doc, Target.universal);
  print('${result.score}/100 ${result.grade}');
}

FAQ

What is an AI agent skill?
A folder with a SKILL.md manifest (YAML frontmatter + Markdown instructions) that teaches an AI agent a repeatable task. Optional subfolders hold references, examples, scripts, and assets. The format is shared across Claude Code, Codex, Antigravity, Gemini CLI, and Cursor.

Which agents does skillscore support?
All of them — the SKILL.md format is shared. Score against one vendor with --target claude|codex|antigravity, or use the default universal profile, which a portable skill should pass everywhere.
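To check portability explicitly, one option is to run each vendor profile in turn. The sketch below uses only the flags mentioned in this post and assumes `--target` and `--min-score` can be combined in one call. The `score_all_targets` function and the `SKILLSCORE` override variable are hypothetical helpers, not part of the tool.

```shell
# score_all_targets DIR
# Runs the linter once per vendor profile; returns 1 if any run fails.
# SKILLSCORE (hypothetical) overrides the command name, e.g. for dry runs.
score_all_targets() {
  dir=$1
  failed=0
  for target in claude codex antigravity; do
    echo "== target: $target =="
    "${SKILLSCORE:-skillscore}" "$dir" --target "$target" --min-score 80 --no-color || failed=1
  done
  return "$failed"
}

# Example (commented out so the sketch runs without the tool installed):
#   score_all_targets skills/
```

The loop keeps going after a failure so that a single run reports every profile that falls below the bar.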

Is it really offline?
Completely. No network calls at runtime, local files only, fully deterministic — the same input always produces the same score and the same finding order.

Does my skill have to be named a particular way?
No. skillscore is name-agnostic: the frontmatter name, the folder name, and the file name are independent, and even non-ASCII folder names work. Rule A2 will still tell you if the name field breaks the official format.

What happens with malformed frontmatter?
No crash. The relevant frontmatter errors are reported, every other rule that can still run does, and you always get a score.

What's next

v0.1.0 is live and the rubric is stable, but it's early. The roadmap: more vendor targets as new guides land, an autofix mode for the mechanical findings (forward slashes, missing TOCs), and a GitHub Action wrapper so CI setup is one line. The rule engine is deliberately simple — a new rule is one class plus one registration — so contributions are welcome, and every rule must cite the published guide it enforces.

Try it

dart pub global activate skillscore
skillscore your-skill/

📦 pub.dev: pub.dev/packages/skillscore
💻 GitHub: github.com/sayed3li97/skillscore

If you maintain skills, run it against your SKILL.md and tell me what score you get — and what it got wrong. I want the rules to reflect how people actually author skills, so findings you disagree with are the most useful feedback I can get. And if it saves you a context-budget headache, a ⭐ helps it reach other people building agents.

Top comments (6)

Hossain Foysal • Jun 13

Love it

Sayed Ali Alkamel • Jun 13

Thank you

HARD IN SOFT OUT • Jun 13

Hey Sayed, this is genuinely one of the most practical things I've seen in the AI tooling space this year. A deterministic, offline linter for SKILL.md files that cites which guide each rule comes from? That's not just useful — it's how you build trust in a CI pipeline.

A product manager asks: "Can we just use an LLM to review our SKILL.md files?"

The engineer says: "That's non‑deterministic."

The PM says: "So?"

The engineer runs skillscore twice, gets the same 67 both times, then runs an LLM twice and gets 85 and 42.

The PM says: "Why is the AI grading itself like a generous professor?"

The engineer says: "Because it read the rubric you wrote."

Add a "context cost estimator" alongside the score. You mention that vague descriptions waste tokens on every turn. What if skillscore also estimated the annual token waste of a poorly written description? Something like: "Description is 50 tokens of fluff → at 100 calls/day → costs ~$X/year." That turns abstract quality into a number finance will understand.

Rule versioning per vendor. Vendor guides evolve. A skill that passed against Claude's v1 guide might fail against v2. It'd be powerful to pin --target claude@2025-03 or something similar, so teams can upgrade rules intentionally rather than waking up to a broken build because Anthropic changed their advice.

The CLI output is clean, but the SARIF integration is buried a bit. Consider a one‑line example in the README of how to upload to GitHub code scanning — that's the difference between "nice tool" and "installed in every skills repo tomorrow."

Great work. Starred it.

Tyson Cung • Jun 13 • Edited

This is exactly the kind of tool the agent ecosystem needed. I author skills daily for Hermes Agent, and the biggest pain point is exactly what you described: a vague description taxes context on every turn without the agent knowing when to use it. The CI-ready scoring and SARIF output seal it. One question: does it handle skills with multiple file_path entries in scripts, or is it single-file only? Would love to add this to our build pipeline.

Sayed Ali Alkamel • Jun 13

It walks the full scripts/ folder, so multiple files are handled automatically. The check fires if any bundled script is not mentioned or documented in the manifest body. Just dropped 0.2.0, which also lets you gate your whole skill library in one shot:

skillscore skill-a/ skill-b/

Happy to help wire it into your pipeline if you run into anything.

Alex Shev • Jun 13

I like that this treats skills as code, not just prose. Once a SKILL.md becomes part of an agent's operating surface, it needs linting, scoring, and regression checks like any other artifact. Offline and deterministic is especially important because otherwise the evaluation becomes another model opinion.