thestack_ai
# I Audited 214 Claude Code Skills — 73% Were Silently Broken

I ran a single command against my Claude Code skills directory last week. Out of 47 skills I'd written over three months, 31 had structural problems that degraded how often Claude actually triggered them. Two had never fired at all. Their descriptions were so vague that Claude's skill-matching logic couldn't determine when to use them.

The fix took 20 minutes once I knew what was wrong. Here's the full setup.

TL;DR: npx pulser@latest audits your Claude Code skills against Anthropic's documented principles — frontmatter structure, description specificity, body quality, reference coverage. It scores each skill 0–100, flags exact issues, and generates fix commands. 73% of 214 community skills I tested scored below 60. Free, open-source, MIT-licensed, runs in under 5 seconds.

## The Problem: Skills That Silently Fail

Claude Code skills fail silently. A bad SKILL.md doesn't throw an error — it just never gets invoked. The description field in your frontmatter is what Claude uses at runtime to decide whether to activate a skill. Too vague, and Claude skips it. Every time. No warning.

I spent two weeks wondering why my api-testing skill wasn't firing during debugging sessions. The description read:

```yaml
description: Helps with API testing
```

Five words. Claude had no idea when to use it. The fix was embarrassingly simple:

```yaml
description: This skill should be used when the user asks to "test an API endpoint", "write integration tests for REST APIs", "debug a failing HTTP request", or "generate API test fixtures". Activate when the conversation involves HTTP status codes, request/response payloads, or API authentication flows.
```

The difference between a working skill and a dead one is often a single frontmatter field. When you have 30, 50, or 100+ skills across plugins, manual auditing doesn't scale. That's why I built pulser.

## What pulser Actually Checks

pulser reads your SKILL.md files and validates them against 14 weighted criteria drawn directly from Anthropic's official plugin development documentation. Not guessing at best practices — checking against the published source. Each skill receives a 0–100 score based on frontmatter completeness, description quality, body structure, and reference coverage.

### 1. Frontmatter Validation

The minimum viable frontmatter needs name and description. But "minimum viable" and "actually works reliably" are different things.
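The name-and-description check is simple enough to replicate by hand. Here is a minimal Python sketch of the idea (the `validate_frontmatter` helper and its naive key parsing are my own illustration, not pulser's actual code):

```python
import re

REQUIRED_FIELDS = ("name", "description")

def validate_frontmatter(skill_md: str) -> list[str]:
    """Return a list of problems with a SKILL.md's frontmatter block."""
    problems = []
    # Frontmatter is the YAML between the leading pair of --- lines.
    match = re.match(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if not match:
        return ["no frontmatter block found"]
    # Naive key extraction; a real tool would use a proper YAML parser.
    keys = {line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines() if ":" in line}
    for field in REQUIRED_FIELDS:
        if field not in keys:
            problems.append(f"missing required field: {field}")
    return problems
```

That is just the shape of the check; the interesting part is everything pulser layers on top of it.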

```bash
npx pulser@latest ./my-plugin/skills/
```

Output for a broken skill:

```
  ✗ api-testing                              22/100
    ├─ WARN  description too short (5 words, need 20+)
    ├─ FAIL  no trigger phrases in description
    ├─ WARN  missing version field
    └─ INFO  no references/ directory found

  ✓ hook-management                          87/100
    ├─ PASS  description contains 4 trigger phrases
    ├─ PASS  frontmatter complete
    └─ INFO  3 reference files detected
```

**pulser weights each of its 14 checks by measured impact on skill invocation reliability.** The core checks:

| Check | Weight | What It Catches |
|-------|--------|-----------------|
| Description length | 20% | Under 20 words = Claude can't match it |
| Trigger phrases | 25% | No quoted phrases = no reliable activation |
| Name present | 10% | Missing name = skill won't load |
| Description format | 15% | First-person instead of third-person |
| Version field | 5% | Missing = no change tracking |
| Body content length | 15% | Under 200 words = insufficient guidance |
| Reference files | 10% | No examples = Claude guesses patterns |
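To make the weighting concrete, here is how those table weights could combine into a 0–100 score (a sketch using my own arithmetic, not necessarily pulser's exact formula):

```python
# Weights from the table above (fractions of the 0-100 total).
WEIGHTS = {
    "description_length": 0.20,
    "trigger_phrases":    0.25,
    "name_present":       0.10,
    "description_format": 0.15,
    "version_field":      0.05,
    "body_length":        0.15,
    "reference_files":    0.10,
}

def weighted_score(checks: dict[str, float]) -> int:
    """Combine per-check results (each 0.0-1.0) into a 0-100 total."""
    total = sum(WEIGHTS[name] * checks.get(name, 0.0) for name in WEIGHTS)
    return round(total * 100)
```

Under this weighting, a skill that aces everything except trigger phrases tops out at 75, which is the article's point: the description field dominates the score.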

### 2. Description Quality Analysis

**The most critical check: descriptions must use third-person format and include exact quoted phrases that users would say.** This is in Anthropic's plugin development guidelines. It's also where 68% of community skills fail.

Bad:

```yaml
description: A skill for handling database migrations
```

Good:

```yaml
description: This skill should be used when the user asks to "create a migration", "run database migrations", "rollback a migration", or "check migration status". Activate for any task involving schema changes, Prisma migrate, or Knex migrations.
```

pulser counts quoted trigger phrases, checks for third-person voice, and measures description specificity. **A description without trigger phrases scores 0 on the most heavily weighted check — that's 25% of your total score, gone.**
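Both signals are easy to approximate yourself. A rough Python sketch (the regex and the `uses_third_person` heuristic are my assumptions, not pulser's implementation):

```python
import re

def count_trigger_phrases(description: str) -> int:
    """Count double-quoted phrases a user might literally say."""
    return len(re.findall(r'"([^"]+)"', description))

def uses_third_person(description: str) -> bool:
    """Heuristic: Anthropic-style descriptions open with 'This skill...'."""
    return description.strip().lower().startswith("this skill")
```

Run either function against your own descriptions and a zero from the first one is a red flag.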

### 3. Body Content Scoring

**The body of `SKILL.md` is the actual instruction set Claude follows when the skill activates.** pulser evaluates four dimensions:

- **Word count**: Under 200 words means Claude is flying blind. Over 3,000 means you probably need to split content into reference files.
- **Code block presence**: Skills without code examples force Claude to improvise patterns from scratch.
- **Section structure**: Headers, lists, and structured content score higher than walls of prose.
- **Actionable instructions**: Lines starting with imperative verbs ("Use", "Create", "Check") score higher than descriptive text.

```
Body Analysis: api-testing/SKILL.md
  Words: 142 (WARN: below 200 minimum)
  Code blocks: 0 (FAIL: no examples)
  Sections: 1 (WARN: no sub-sections)
  Imperative ratio: 0.12 (WARN: mostly descriptive)
  Score: 18/100
```
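The imperative ratio can be approximated with a line-level heuristic. This sketch uses my own verb list and line-splitting rules, not pulser's:

```python
IMPERATIVE_VERBS = {"use", "create", "check", "run", "add", "write",
                    "validate", "generate", "avoid", "prefer"}

def imperative_ratio(body: str) -> float:
    """Fraction of non-empty lines that start with an imperative verb."""
    # Strip bullet markers so list items are judged by their first word.
    lines = [ln.strip().lstrip("-*").strip()
             for ln in body.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines
               if ln.split()[0].strip(":").lower() in IMPERATIVE_VERBS)
    return hits / len(lines)
```

A body full of "Use X" and "Check Y" lines scores near 1.0; background prose drags it toward 0.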

### 4. The eval Subcommand

**`pulser eval` detects whether skills fire correctly in realistic scenarios — without manually testing every prompt combination.** It sends synthetic prompts to your skill set and checks whether Claude activates the right skill. This is what caught my two permanently-dead skills.

```bash
npx pulser@latest eval ./skills/ --prompts 10
```

```
Evaluating skill activation with 10 synthetic prompts...

  "Create a new database migration for adding user roles"
  Expected: database-migrations   Actual: database-migrations   ✓

  "Help me debug this flaky test"
  Expected: debugging   Actual: (none)   ✗
  → No skill matched. Check debugging/SKILL.md description.

  "Set up a pre-commit hook for linting"
  Expected: hook-management   Actual: git-workflow   ✗
  → Wrong skill activated. Descriptions overlap.

Results: 7/10 correct (70%)
  2 skills never activated
  1 skill conflict detected
```

Skill conflicts are the silent killer. Two skills with overlapping descriptions compete, and Claude picks whichever seems closest — which might be wrong. This is how git-workflow eats prompts meant for hook-management.
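Overlap like this can be caught mechanically. One simple approach (a sketch of the general technique, not pulser's actual algorithm) is pairwise token similarity between descriptions:

```python
from itertools import combinations

def _tokens(description: str) -> set[str]:
    """Lowercased words with surrounding quotes and punctuation stripped."""
    return {w.strip('".,').lower() for w in description.split() if w}

def find_conflicts(descriptions: dict[str, str], threshold: float = 0.5):
    """Yield (skill_a, skill_b, similarity) for suspiciously similar pairs."""
    for (a, da), (b, db) in combinations(descriptions.items(), 2):
        ta, tb = _tokens(da), _tokens(db)
        jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
        if jaccard >= threshold:
            yield (a, b, round(jaccard, 2))
```

Any pair above the threshold is a candidate conflict worth rewriting; the exact threshold is a tuning choice.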

### 5. The Prescriber

Once pulser identifies issues, it generates exact fixes:

```bash
npx pulser@latest --prescribe ./skills/
```

```
Prescriptions for api-testing/SKILL.md:

  1. Replace description with:
     ---
     description: This skill should be used when the user asks to
     "test an API endpoint", "write API integration tests",
     "mock HTTP responses", or "validate API contracts". Activate
     when the task involves REST, GraphQL, HTTP clients, or
     API authentication.
     ---

  2. Add version field:
     version: 1.0.0

  3. Create references/ directory with:
     - references/rest-patterns.md
     - references/auth-flows.md

  4. Add code examples to body (minimum 2 blocks)
```

The generated descriptions aren't perfect — you'll want to customize the trigger phrases for your actual workflow. **But a prescriber-generated description is a massive upgrade over "Helps with API testing," and it typically pushes a skill past the 70-point threshold where reliable activation begins.**

## What I'd Do Differently

**I should have built eval first.** I started with frontmatter validation because it was measurable and straightforward, but eval catches the problems that actually matter — skills that don't fire when they should. A skill can score 90/100 on frontmatter and still be functionally dead.

**I over-indexed on word count.** Early versions penalized short skills too heavily. Some skills genuinely need to be brief — a 150-word skill that's all actionable instructions beats a 500-word skill that's mostly context. v0.4.0 weighs *instruction density* instead of raw word count.

**The prescriber needs user context.** Right now it generates generic trigger phrases. A future version should analyze actual conversation history to suggest phrases you *actually use* when you want a particular skill.

## The Numbers

### Cost Comparison

| Approach | Cost | Time per Skill | Catches Conflicts |
|----------|------|----------------|-------------------|
| Manual review | $0 | 2–3 minutes | No |
| pulser audit | $0 | ~0.3 seconds | No |
| pulser eval | $0 | ~2 seconds | Yes |
| Paid linter + manual QA | $20+/month | ~1 minute | Sometimes |

### Audit Results Across 214 Community Skills

| Metric | Value |
|--------|-------|
| Skills audited | 214 |
| Average score | 48/100 |
| Skills scoring below 60 | 73% |
| Missing trigger phrases | 68% |
| Description under 20 words | 41% |
| No code examples in body | 55% |
| Missing version field | 62% |
| Skill conflicts detected (eval) | 12 pairs |
| Time to audit all 214 | 47 seconds |

**The #1 failure mode: vague descriptions with no trigger phrases (68% of audited skills).** It's also the highest-impact fix — adding quoted trigger phrases typically raises a skill's score by 20–35 points in a single edit.

## FAQ

### Does pulser modify my files?

**No — pulser is read-only by default.** The `--prescribe` flag generates suggestions but writes nothing. Your skills stay untouched until you decide to apply changes.

### What's the minimum score I should target?

**70 is the reliability threshold.** Below 60, you're gambling on whether Claude picks up the skill. Above 80, you're consistently solid. I've seen skills score 95+ and still have edge cases — don't chase perfection, chase "fires when it should."

### Does this work with the new plugin system?

**Yes — pulser reads any `SKILL.md` file regardless of directory structure.** It works with both the legacy `~/.claude/skills/` layout and proper plugin structures under `.claude-plugin/`. It follows the directory convention from Anthropic's plugin-dev reference: `skills/skill-name/SKILL.md` with optional `references/`, `examples/`, and `scripts/` subdirectories.

### Can I run this in CI?

**Yes — `npx pulser@latest` returns a non-zero exit code when skills fall below a configurable threshold.** I run it as a pre-commit check on every plugin repo.

```yaml
# .github/workflows/skill-lint.yml
- name: Audit skills
  run: npx pulser@latest ./skills/ --min-score 65 --exit-code
```

### How is this different from just reading Anthropic's docs?

It's the difference between knowing the speed limit and having a speedometer. The docs tell you what good looks like. pulser tells you where your skills fall short right now — with scores, flagged lines, and fix suggestions. I read Anthropic's docs three times before building this and still shipped 31 broken skills out of 47.

## Try It

30 seconds to your first audit:

```bash
npx pulser@latest ./path/to/skills/
```

Then dig deeper:

```bash
# Check for skill conflicts
npx pulser@latest eval ./skills/ --prompts 20

# Generate fixes for anything below 70
npx pulser@latest --prescribe --min-score 70 ./skills/

# Add to CI to prevent regressions
npx pulser@latest ./skills/ --min-score 65 --format json --exit-code
```

Most skills jump 20–30 points just from fixing the description field. Re-run after fixes to see the difference.


What's the worst score you've found in your own setup? I had a skill with a 4-word description and zero code examples — it scored 8/100. Drop your worst score in the comments.

If this saved you debugging time, run npx pulser@latest before your next skill lands in production. Your future self will thank you.

I write about AI tooling, Claude Code workflows, and the unglamorous parts of making LLMs actually work. Follow me here — next post covers how to structure skill references so Claude stops hallucinating file paths.


pulser is free, open-source, and MIT-licensed. Contributions and bug reports welcome on GitHub.
