# I Built a Diagnostic CLI for Claude Code Skills — Here's What 8 Rules Caught That I Missed

Most of my Claude Code skills were broken and I had no idea. I had 23 skill files, felt productive, and assumed Claude was using all of them. Then I built a diagnostic tool, ran it on my own setup, and 14 of those 23 skills had structural issues that silently degraded how Claude interpreted them. That's a 61% failure rate on files I had personally written and considered finished.

TL;DR: pulser is a CLI that scans Claude Code skill files, classifies them by type, runs 8 diagnostic rules, generates prescriptions, and auto-fixes issues with backup and rollback. I ran it on 23 skills and found 14 had problems — missing frontmatter fields, ambiguous trigger conditions, conflicting instructions. One command. Zero config. npx pulser-cli and you're done.

## The Problem: Claude Code Skills Have No Validation Layer

Claude Code skills are markdown files in ~/.claude/skills/ that change how Claude behaves — and there is no built-in way to verify they're structured correctly. You write a markdown file, drop it in the skills directory, and Claude is supposed to route to it based on the frontmatter description and body content. Nothing enforces that the file is valid.

I spent two weeks building a skill that was supposed to trigger whenever I said "debug this." It had detailed instructions, code examples, a careful system prompt. One problem: a typo in the frontmatter description field made the trigger condition ambiguous. Claude matched it maybe 30% of the time. The other 70%, it fell through to default behavior and I blamed the model.

Anthropic published skill quality principles, but they're guidelines, not tools. You read them, nod, and go back to writing skills the same way. I needed something that would read my skills, tell me what's wrong, and fix it.

## Step 1: Define What "Broken" Actually Means

Before writing any code, I spent a day cataloging every skill failure mode I'd encountered. Not theoretical ones — actual bugs from my own skill files and issues reported by other Claude Code users. I landed on 8 diagnostic rules split into three tiers:

| Tier | Rule | What It Catches |
|------|------|-----------------|
| Core | `frontmatter-required` | Missing `name`, `description`, or `model` fields |
| Core | `description-quality` | Descriptions too vague to route ("Use this for stuff") |
| Core | `trigger-clarity` | Ambiguous or missing trigger conditions |
| Recommended | `instruction-structure` | No clear sections, wall-of-text body |
| Recommended | `conflict-detection` | Two skills claiming the same trigger space |
| Recommended | `example-coverage` | Skills without input/output examples |
| Recommended | `scope-boundaries` | Skills that try to do everything (>500 lines, 5+ responsibilities) |
| Experimental | `dependency-chain` | Skills referencing other skills that don't exist |

The 3 core rules alone caught issues in 60% of my skill files. The frontmatter rule sounds trivial until you realize that a missing `description` field means Claude has to guess what your skill does from the body text. Sometimes it guesses right. Sometimes it routes to a completely unrelated skill.
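As an illustration of how cheap that check is to run, here's a minimal sketch of a frontmatter-required rule. This is illustrative code, not pulser's actual implementation (the real parser uses gray-matter), and `missingFrontmatterFields` is a made-up name:

```typescript
// Minimal frontmatter-required check: grab the block between the first
// pair of "---" lines and report which required fields are missing.
// Illustrative sketch only -- pulser's real parser uses gray-matter.

const REQUIRED_FIELDS = ["name", "description", "model"] as const;

function missingFrontmatterFields(skillSource: string): string[] {
  const match = skillSource.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return [...REQUIRED_FIELDS]; // no frontmatter at all

  const frontmatter = match[1];
  // A field counts as present if a "field:" line exists in the block.
  return REQUIRED_FIELDS.filter(
    (field) => !new RegExp(`^${field}\\s*:`, "m").test(frontmatter)
  );
}
```

Feed it a skill file's raw text and it returns the missing field names, e.g. a file with only `name:` set comes back as `["description", "model"]`.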

## Step 2: Build a Multi-Signal Classifier

Not all skills are the same, and treating them identically produces bad diagnostics. A coding skill needs different validation than a writing skill or a workflow automation skill. pulser classifies each skill using 4 signals before running any diagnostic rules, so the prescriptions it generates are type-appropriate rather than generic.

```typescript
interface ClassificationResult {
  type: SkillType;
  confidence: number;
  signals: Signal[];
}

type SkillType =
  | "coding"
  | "writing"
  | "workflow"
  | "diagnostic"
  | "integration"
  | "meta";
```

The classifier looks at frontmatter fields, body keywords, code block density, and structural patterns. A skill with many fenced `bash` blocks and words like "test", "build", "deploy" gets classified as `coding` with high confidence. A skill mentioning "tone", "audience", "draft" lands in `writing`.
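A toy version of that keyword signal could look like this. The keyword lists and scoring are invented for the example; the real classifier combines four weighted signals, not one:

```typescript
// Toy keyword-based classifier: count type-specific keywords and fenced
// code blocks, then pick the highest-scoring type. Invented scoring,
// not pulser's actual implementation.

const KEYWORDS: Record<string, string[]> = {
  coding: ["test", "build", "deploy", "refactor"],
  writing: ["tone", "audience", "draft", "edit"],
};

function classify(body: string): { type: string; score: number } {
  const text = body.toLowerCase();
  // Each pair of fence markers is one code block.
  const codeBlocks = (text.match(/`{3}/g) ?? []).length / 2;

  let best = { type: "unknown", score: 0 };
  for (const [type, words] of Object.entries(KEYWORDS)) {
    let score = words.filter((w) => text.includes(w)).length;
    if (type === "coding") score += codeBlocks; // code density favors coding
    if (score > best.score) best = { type, score };
  }
  return best;
}
```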

**Why classification matters: each type gets different prescriptions.** A coding skill missing examples is a critical failure — Claude needs to understand exact input/output transformations. A workflow skill missing examples is a warning. Same rule, different severity, different fix.
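One way to express that type-dependent severity is a small lookup table. Hypothetical shape and names, not pulser's internals:

```typescript
// Type-dependent severity: the same rule maps to a different severity
// depending on the skill's classified type. Hypothetical shape; only
// the example-coverage row reflects the behavior described above.

type Severity = "error" | "warn" | "info";

const SEVERITY_BY_TYPE: Record<string, Record<string, Severity>> = {
  "example-coverage": { coding: "error", workflow: "warn" },
};

function severityFor(rule: string, skillType: string): Severity {
  // Fall back to a warning when no type-specific entry exists.
  return SEVERITY_BY_TYPE[rule]?.[skillType] ?? "warn";
}
```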

```bash
$ npx pulser-cli --skill my-tdd-skill.md

  ┌─────────────────────────────────────────┐
  │  PULSER v0.3.1   Skill Diagnostic       │
  │  Diagnose. Prescribe. Fix.              │
  └─────────────────────────────────────────┘

  Scanning: my-tdd-skill.md
  Classification: coding (confidence: 0.92)

  ✓ frontmatter-required    PASS
  ⚠ description-quality     WARN  Description is 8 chars; too short for reliable routing
  ✗ trigger-clarity         FAIL  No trigger condition found in frontmatter or body
  ✓ instruction-structure   PASS
  ✓ conflict-detection      PASS
  ✗ example-coverage        FAIL  0 examples found; coding skills need 2
  ✓ scope-boundaries        PASS
  ○ dependency-chain        SKIP  (experimental)

  2 issues found. Run with --fix to auto-repair.
```

## Step 3: Prescriptions, Not Just Pass/Fail

**pulser generates type-specific repair suggestions rather than generic error messages; this is what separates a diagnostic tool from a linter.** Every competitor I evaluated takes the same approach: scan, report pass or fail, stop. That's a linter. I didn't want a linter. I wanted a doctor.

When pulser finds an issue, it generates a prescription showing the current state, the proposed fix, and why the fix is appropriate for that skill's type. A `coding` skill with a vague description gets different language than a `writing` skill with the same problem:

```bash
$ npx pulser-cli --fix

  Prescription for my-tdd-skill.md:

  1. description-quality (WARN)
     Current:  "TDD stuff"
     Proposed: "Use when starting any coding task — enforces red-green-refactor
                cycle with test-first development. Triggers on: 'write tests',
                'TDD', 'test first'."
     → Auto-fix available. Backup will be created.

  2. trigger-clarity (FAIL)
     Current:  (none)
     Proposed: Add trigger block to frontmatter:
               triggers: ["TDD", "test first", "write tests", "red green refactor"]
     → Auto-fix available. Backup will be created.

  Apply fixes? [y/N]
```

**Every fix creates a timestamped backup before writing anything.** The files go to `.pulser/backups/{timestamp}/` and you can roll them back with one command. Nothing is modified without showing you the exact diff first.
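As a sketch of how that backup step might work, using the directory layout described above (the `backupFiles` helper is hypothetical, not pulser's actual API):

```typescript
// Sketch: copy each file into a timestamped backup directory before
// any fix touches it. Layout follows .pulser/backups/{timestamp}/;
// the helper name is illustrative, not pulser's real API.
import { mkdirSync, copyFileSync } from "node:fs";
import { join, basename } from "node:path";

function backupFiles(files: string[], root: string): string {
  // ISO timestamp with ":" and "." stripped so it is a safe dir name.
  const stamp = new Date().toISOString().replace(/[:.]/g, "");
  const dir = join(root, ".pulser", "backups", stamp);
  mkdirSync(dir, { recursive: true });
  for (const file of files) {
    copyFileSync(file, join(dir, basename(file)));
  }
  return dir; // caller records this path so `undo` can find it later
}
```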

## Step 4: The Undo System

The fix engine uses atomic writes with full rollback support. I learned this lesson from deploying database migrations: **if you can't undo it, you shouldn't automate it.**

```bash
$ npx pulser-cli undo

  Found 1 backup set:
  [1] 2026-03-15T14:32:00Z — 2 files modified
      my-tdd-skill.md (description + triggers added)
      debug-workflow.md (examples section added)

  Restore backup [1]? [y/N] y

  ✓ Restored my-tdd-skill.md
  ✓ Restored debug-workflow.md
  Backup retained at .pulser/backups/2026-03-15T143200/
```

The undo system reads the backup, validates the file still exists at the expected path, writes to a temp file, then renames atomically. **If the process dies mid-write, you don't end up with a half-written skill file.** This sounds paranoid for markdown files, but partial writes in shell scripts have corrupted enough of my config files that I now treat atomic writes as non-negotiable.
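The temp-file-then-rename pattern that paragraph describes takes only a few lines. A sketch, assuming POSIX rename semantics (the `atomicWrite` name is mine, not pulser's API):

```typescript
// Atomic write: write the full content to a temp file in the same
// directory, then rename over the target. rename(2) is atomic on the
// same filesystem, so readers see either the old file or the new one,
// never a partial write. Helper name is illustrative.
import { writeFileSync, renameSync } from "node:fs";
import { dirname, join } from "node:path";

function atomicWrite(target: string, content: string): void {
  // The temp file must live on the same filesystem as the target,
  // so place it alongside rather than in os.tmpdir().
  const tmp = join(dirname(target), `.${process.pid}.tmp`);
  writeFileSync(tmp, content);
  renameSync(tmp, target); // atomic replace
}
```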

## Step 5: The TUI That Nobody Asked For (But Everyone Remembers)

pulser displays an EtCO2-style patient monitor animation while scanning: a waveform tracing across the terminal in real time, exactly like a hospital vital signs monitor. Was this necessary? No. **Did it make the tool memorable enough that three people asked about the animation before asking what the tool actually does? Yes.**

```bash
$ npx pulser-cli --all

  ╭──────────────────────────────────────────╮
  │  ♥ PULSER — Skill Vitals                 │
  │  ╱╲    ╱╲    ╱╲    ╱╲                    │
  │ ╱  ╲__╱  ╲__╱  ╲__╱  ╲__                │
  │                                          │
  │  Skills: 23  Healthy: 9  Warning: 8      │
  │  Critical: 6  Fixable: 12                │
  ╰──────────────────────────────────────────╯
```

The `--no-anim` flag disables the animation for CI pipelines and terminals that don't support ANSI escape codes. I demoed the tool in a Discord server for Claude Code developers, and the animation generated more immediate questions than the diagnostic output. Memorable UI is a distribution strategy.

## Step 6: Output Formats for Every Workflow

pulser supports three output formats because different workflows require different data shapes. The default is human-readable terminal output. `--format json` emits a strict schema suitable for piping into other tools. `--format md` generates a markdown report you can commit to your repo.

```bash
# Human-readable (default)
$ npx pulser-cli

# JSON for piping into other tools
$ npx pulser-cli --format json | jq '.issues[] | select(.severity == "error")'

# Markdown for documentation
$ npx pulser-cli --format md > skill-health-report.md
```

The JSON schema is stable across patch versions. I use it in a pre-commit hook that blocks commits if any core-tier rules fail:

```bash
#!/bin/bash
# .git/hooks/pre-commit
ISSUES=$(npx pulser-cli --format json --strict 2>/dev/null | jq '.summary.errors')
if [ "$ISSUES" -gt 0 ]; then
  echo "pulser: $ISSUES skill errors found. Run 'npx pulser-cli' to see details."
  exit 1
fi
```

**Running pulser as a pre-commit hook means broken skills never reach your main branch.** The scan completes in under 4 seconds for 20 skills, fast enough that it doesn't noticeably slow commits.

## Step 7: The Build Pipeline

The whole project is TypeScript, bundled with tsup into a single ESM file. Four runtime dependencies total:

```json
{
  "name": "pulser-cli",
  "version": "0.3.1",
  "type": "module",
  "bin": { "pulser": "./dist/index.js" },
  "dependencies": {
    "commander": "^12.0.0",
    "gray-matter": "^4.0.3",
    "chalk": "^5.3.0",
    "boxen": "^7.1.1"
  }
}
```

**gray-matter parses frontmatter, commander handles CLI args, chalk and boxen handle terminal formatting.** Everything else is standard library. The total bundle size is 847KB including dependencies. Node 18+ is the only runtime requirement.

```bash
# Install globally
npm i -g pulser-cli

# Or run without installing
npx pulser-cli
```

## What I'd Do Differently

**I should have shipped with 3 rules instead of 8.** I launched with 8 rules because I wanted to feel comprehensive. In practice, the 3 core rules catch 80% of real problems. The recommended tier adds nuance but also adds noise for users who just want the basics working. I'd ship core-only and gate the rest behind a `--full` flag.

**The classifier needs more training data.** My confidence scores are calibrated against my own skill files plus a small set of open source examples — roughly 50 skills total. The classifier works well for common patterns (TDD skills, writing skills, workflow automations) but produces low-confidence scores on unusual skill types. I need at minimum 200 diverse skills to make the confidence values trustworthy.

**I over-invested in the TUI animation before the fix engine was solid.** The waveform animation took a full day to build. During that time, the prescription engine had a bug where it would suggest adding a `triggers` field to skills that already had trigger keywords embedded in the body — a false positive that would have confused early users. Animation is memorable, but correctness ships first.

## The Numbers

### Cost Comparison

| Approach | Cost | Time to First Result |
|----------|------|---------------------|
| Manual skill review | $0 | 2–3 hours for 20 skills |
| pulser scan | $0 | 4 seconds for 20 skills |
| Asking Claude to review skills | ~$0.03/skill | 30 seconds per skill |
| Building custom validation | $0 + 8–16 hours dev time | Varies |

### Scan Performance

| Metric | Value |
|--------|-------|
| Skills scanned per second | ~5 |
| Average fix generation time | 200ms |
| Backup + atomic write | <50ms per file |
| Total bundle size | 847KB |
| Runtime dependencies | 4 |
| Diagnostic rules | 8 (3 core + 4 recommended + 1 experimental) |

## FAQ

### Does pulser modify my skill files without asking?

No. The `--fix` flag shows you every proposed change with a before/after diff and requires explicit confirmation before writing. Every modification creates a timestamped backup at `.pulser/backups/{timestamp}/` before touching the original. You can restore any change with `pulser undo`. Nothing is destructive by default.

### How does pulser know what a "good" skill looks like?

It implements Anthropic's published skill quality principles as executable rules. The frontmatter checks follow the documented Claude Code skill schema. The description quality rule uses measurable heuristics: minimum character length, presence of trigger keywords, and specificity scoring based on action verb density. It's opinionated but grounded in published documentation, not personal taste.
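As an illustration, those heuristics could be sketched like this. The threshold and the verb list are invented for the example; the rule's actual values may differ:

```typescript
// Toy description-quality score: a minimum-length check plus action-verb
// density. Threshold and verb list are invented for illustration, not
// pulser's actual values.

const ACTION_VERBS = ["use", "enforce", "trigger", "generate", "validate", "fix"];

function descriptionQuality(desc: string): "pass" | "warn" {
  if (desc.length < 20) return "warn"; // too short to route reliably

  // At least one action verb suggests the description says what to *do*.
  const words = desc.toLowerCase().split(/\W+/);
  const verbHits = words.filter((w) => ACTION_VERBS.includes(w)).length;
  return verbHits >= 1 ? "pass" : "warn";
}
```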

### Does pulser work with skills in subdirectories or custom locations?

Yes. By default it scans `~/.claude/skills/` and any project-level `.claude/skills/` directories it finds. You can point it at any path: `npx pulser-cli /path/to/my/skills/`. The `--skill` flag scans a single file: `npx pulser-cli --skill my-skill.md`.

### Can I use pulser in CI/CD?

Yes. Use `--format json` for machine-readable output and `--strict` to exit with code 1 if any errors are found. The `--no-anim` flag disables the TUI for non-interactive environments. See the pre-commit hook example above.

### What's the difference between pulser and a generic markdown linter?

A markdown linter checks syntax. **pulser checks semantics — whether your skill's structure, description, and trigger conditions will actually work with Claude Code's routing logic.** It understands the difference between skill types, generates context-aware prescriptions, and auto-fixes issues with rollback. markdownlint will catch a broken heading. pulser will tell you your description is too generic to route reliably, and rewrite it.

## Try It Yourself

1. Run `npx pulser-cli` in any directory with Claude Code skills
2. Read the output — fix core-tier failures before anything else
3. Run `npx pulser-cli --fix` to review proposed repairs
4. Accept the fixes and verify Claude routes to your skills more reliably
5. Add the pre-commit hook if you want ongoing enforcement
6. Star the repo if it helped: [whynowlab/pulser](https://github.com/whynowlab/pulser)

## What's Next

If you've written Claude Code skills and wondered why Claude sometimes ignores them — run the scan. **The most common failure across 50+ skill files I've analyzed is a description field that's too generic for Claude to route reliably.** It takes 4 seconds to find out if yours has the same problem.

Have you run into skill reliability issues with Claude Code? What failure patterns have you seen? Drop a comment — I'm building the next rule set from real failure modes, not hypothetical ones.

---

*I'm a developer building AI-powered infrastructure tools. pulser started as a debugging script for my own Claude Code skill files and grew into an open source CLI that has now diagnosed over 200 skill files across early adopter setups. I write about building developer tools and the engineering mistakes that make them better.*
