A missing name field in one skill file silently disabled 14 skills across our shared repository. Nobody noticed for a week — users just assumed Claude "didn't know how to do that." I built pulser to make sure that never happens again, then wrapped it in a GitHub Action so CI catches breakage before merge.
Here's the full setup.
## TL;DR

`pulser eval` is a CLI that checks Claude Code skill files for structural correctness, frontmatter validity, and common antipatterns. Run it locally in under a second or add the GitHub Action to your CI pipeline. We went from "manually eyeball the YAML" to "CI rejects broken skills automatically" — catching 23 issues in the first week that would have shipped silently. Zero dependencies, sub-200ms execution for 40+ skills.
## The Problem: Skills Break Silently
Claude Code skills are markdown files with YAML frontmatter — and they fail silently when malformed. A skill file looks like this:
```markdown
---
name: my-skill
description: Use when the user asks to refactor a function into smaller units
---

Instructions for Claude...
```
Simple enough. But here is what actually goes wrong in practice.
**A missing `name` field makes the skill invisible.** Claude doesn't load it. No error, no warning, no stack trace. The file just doesn't exist from Claude's perspective.
**A vague description means Claude never triggers the skill.** If your description says "useful for various tasks," Claude has no signal for when to activate it. The skill sits there gathering dust while users wonder why their custom workflow stopped working.
**Malformed YAML frontmatter breaks the file silently.** Forget a closing `---`, use a tab instead of spaces, or put an unquoted colon in a value — the file loads as raw markdown with no frontmatter at all. The skill body becomes invisible.
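For instance, this frontmatter looks harmless in a diff but is invalid YAML because of the unquoted colon in the description (a contrived example, not from the actual incident):

```yaml
---
name: deploy-helper
# The bare "when:" inside the value below is a YAML mapping indicator.
# Most parsers reject the whole block, so the skill loads with no frontmatter.
description: Use when: the user asks to deploy
---
```

Quoting the value (`description: "Use when: the user asks to deploy"`) fixes it.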
I found this out the hard way. We had a shared skill repository with 40+ skills across our team. Someone edited a skill, introduced a YAML syntax error in a multi-line description, and it passed code review because the diff looked fine to human eyes. That one change silently broke 14 skills. We didn't catch it for a week.
The kicker: `git blame` showed the exact commit. The fix took 3 seconds. The debugging took 2 hours.
That's when I decided to build a linter.
## What pulser eval Does
`pulser eval` is a zero-dependency CLI that scans Claude Code skill files and reports structural problems before they reach production. It runs 5 checks per skill file and produces binary pass/fail output in under 200ms for 40+ skills.
Under the hood, it runs a battery of checks:
1. **YAML frontmatter parsing** — catches syntax errors, missing delimiters, type mismatches
2. **Required field validation** — `name` and `description` must exist and be non-empty
3. **Description quality scoring** — flags vague descriptions that won't help Claude decide when to activate
4. **File structure analysis** — detects orphaned files, empty skill bodies, naming convention violations
5. **Cross-reference checking** — finds skills that reference files or paths that don't exist
Each check produces a clear, actionable message:
```text
FAIL .claude/commands/deploy.md
  ✗ Missing required field: name
  ✗ Description too vague (score: 0.2/1.0): "handles deployment"

PASS .claude/commands/review-code.md
  ✓ Frontmatter valid
  ✓ Required fields present
  ✓ Description specific (score: 0.8/1.0)
  ✓ All references resolved

22 skills scanned · 3 failed · 19 passed
```
No guessing. No "looks fine to me." Binary pass/fail with reasons.
## Step 1: Install and Run Locally
```bash
npm install -g pulser
```

Or run without installing:

```bash
npx pulser eval
```
By default, it scans `.claude/commands/` and `.claude/skills/` in your current directory. Override with a path argument:
```bash
npx pulser eval ./custom-skills
```
The first run will probably surprise you. When I ran it against our 40-skill repository for the first time, 8 skills had issues I'd never noticed. Three had YAML errors. Two had descriptions so generic they might as well have been blank. One referenced a helper script that was deleted months ago.
## Step 2: Read the Exit Codes
pulser uses standard exit codes that play well with CI:
| Exit Code | Meaning |
|---|---|
| 0 | All checks passed |
| 1 | One or more checks failed |
| 2 | Configuration or runtime error |
Exit code 1 means "your skills have problems" — exit code 2 means "pulser itself couldn't run." Most CI systems treat any non-zero exit as failure, but if you need to distinguish, the codes are there.
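If you do want to distinguish them, a tiny wrapper is enough. A minimal sketch — `describe_status` is my own name for the helper, not part of pulser:

```shell
#!/bin/sh
# describe_status: translate a pulser exit code into a human-readable message.
# (Hypothetical helper; pulser itself only sets the exit code.)
describe_status() {
  case "$1" in
    0) echo "all checks passed" ;;
    1) echo "skill failures found" ;;
    2) echo "pulser itself could not run" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage: run the eval, then report based on its exit code:
#   npx pulser eval; describe_status $?
```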
## Step 3: Add the GitHub Action

The GitHub Action integrates in under 5 minutes and adds ~15 seconds to your pipeline. Create `.github/workflows/skills.yml`:
```yaml
name: Lint Skills

on:
  pull_request:
    paths:
      - '.claude/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Evaluate Claude Code Skills
        uses: pulserin/pulser@v1
```
That's the minimal setup. The Action installs pulser, runs `eval` against your repo, and fails the check if any skill has structural issues.
**The `paths` filter is critical.** You don't want to run skill linting on every PR — only when someone actually changes files under `.claude/`. This keeps your CI fast and your Action minutes low. A typical eval run adds about 15 seconds to your pipeline, most of which is the checkout step.
## Step 4: Handle the First CI Failure
Your first PR after adding the Action will probably fail. That's the point.
Here's what the Action output looks like in a GitHub check run:
```text
pulser eval v1.0.0
Scanning .claude/commands/ ...
Scanning .claude/skills/ ...

FAIL .claude/commands/old-deploy.md
  ✗ Empty skill body — frontmatter present but no instructions

FAIL .claude/commands/analyze.md
  ✗ name field contains spaces (use kebab-case)

PASS .claude/commands/review-code.md
PASS .claude/commands/test-runner.md
... (18 more passed)

22 skills scanned · 2 failed · 20 passed
```
Fix the failures, push again, watch it go green. **The feedback loop from push to CI result is under 30 seconds** for the eval step itself.
## Step 5: Build the Full Workflow
The pattern I've settled on after a few weeks of iteration:
1. **Local check before commit** — `npx pulser eval` as a pre-commit hook or manual habit
2. **CI check on PR** — GitHub Action catches anything missed locally
3. **Periodic full scan** — weekly cron that reports on skill health
```yaml
# .github/workflows/skills-weekly.yml
name: Weekly Skill Health

on:
  schedule:
    - cron: '0 9 * * 1' # Monday 9 AM UTC

jobs:
  health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Evaluate Skills
        uses: pulserin/pulser@v1
```
Wire up a notification on failure and you have a complete skill health monitoring system.
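For example, a notification step appended to the weekly job — this sketch assumes you've stored an incoming-webhook URL in a `SLACK_WEBHOOK_URL` repository secret; adapt it to whatever chat tool you use:

```yaml
      - name: Notify on failure
        if: failure()
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          curl -sf -X POST -H 'Content-Type: application/json' \
            -d '{"text":"Weekly skill health check failed"}' \
            "$SLACK_WEBHOOK_URL"
```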
The pre-commit hook catches ~90% of issues before they reach CI. The remaining ~10% are mostly YAML merge conflicts that look fine in the diff but produce invalid syntax after git resolves them.
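The hook itself can be very small. A sketch — save it as `.git/hooks/pre-commit` and make it executable, or wire it through your hook manager:

```shell
#!/bin/sh
# Pre-commit hook sketch: lint skills only when staged changes touch .claude/.
# touches_skills reads a list of changed paths on stdin and succeeds
# if any of them is under .claude/.
touches_skills() {
  grep -q '^\.claude/'
}

if git diff --cached --name-only 2>/dev/null | touches_skills; then
  npx pulser eval || exit 1  # a lint failure blocks the commit
fi
```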
## What I'd Do Differently

**Add CI from day one.** Retrofitting linting onto 40 existing skills cost a full afternoon. We accumulated those skills over three months before adding linting. If we'd started with the Action on day one, each broken skill would have been a 2-minute fix at PR time instead of a batch remediation project.

**Description quality scoring needs more context.** The current scorer flags short or generic descriptions, which is usually right. But it also occasionally flags perfectly adequate descriptions that happen to be concise. I'm looking at using the skill body content to calibrate what counts as "specific enough" relative to the skill's scope — a narrow skill can get away with a shorter description than a broad one.

**GitHub Marketplace enforces a hard 125-character limit on action descriptions.** My first submission was rejected. Small detail, but it cost me 30 minutes of rewording to hit the limit while keeping the description useful. If you're building Actions: check the character limits before you write the copy.
## The Numbers
| Metric | Before pulser | After pulser |
|---|---|---|
| Broken skills shipped to main | ~3/month | 0 in 6 weeks |
| Time to detect broken skill | 1–7 days | < 30 seconds |
| Skill review confidence | "looks fine to me" | Pass/fail with specifics |
| Onboarding new skill authors | Trial and error | CI guides corrections |

| Component | Details |
|---|---|
| npm package size | < 50 KB installed, zero dependencies |
| Eval execution time | ~200ms for 40 skills |
| GitHub Action overhead | ~15 seconds total (mostly checkout) |
| Checks per skill | 5 structural + description quality |
| Supported paths | `.claude/commands/`, `.claude/skills/`, or custom |
## FAQ
**Does `pulser eval` actually run the skills or just lint them?**
Static analysis only — pulser does not execute skill instructions against Claude. It parses structure, validates frontmatter, and checks references. Think of it as eslint for skill files, not an integration test suite. It catches the mechanical problems that are trivial for a parser to detect but easy to miss in code review.
**Can I use pulser with skill directories outside `.claude/`?**
Yes. Pass a custom path and pulser scans that directory for markdown files with YAML frontmatter matching the Claude Code skill format. Some teams keep shared skills in a monorepo under a top-level skills/ directory and point both pulser and their Claude Code configuration at the same path.
**What happens with WIP skills that have intentionally empty bodies?**
pulser reports them but doesn't block CI by default. If you need stricter enforcement, the exit code behavior lets you configure your CI pipeline to treat specific findings as blocking or non-blocking depending on your team's tolerance.
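One way to surface findings without blocking a merge is GitHub Actions' built-in `continue-on-error` on the eval step (a sketch; the step still shows its output in the check run, but a failure no longer fails the job):

```yaml
      - name: Evaluate Claude Code Skills
        uses: pulserin/pulser@v1
        continue-on-error: true  # report findings without failing the check
```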
**Does the GitHub Action work with private repositories?**
Yes, and no data leaves your CI environment. The Action runs entirely within your GitHub Actions runner — no API calls, no telemetry, no external service dependencies.
**How does this compare to custom shell scripts for validation?**
I started with shell scripts. They handled "does frontmatter exist" and "is the name field present" well enough. They fell apart when I needed YAML-aware parsing, cross-file reference checking, and description quality scoring — shell and YAML is a painful combination. pulser replaces 200+ lines of fragile bash with a single command that handles edge cases the scripts never could.
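To make that concrete, here's the flavor of check shell handles well (illustrative, not our actual script). Note what it misses: it happily passes a file whose frontmatter contains a `name:` line but is unparseable YAML.

```shell
#!/bin/sh
# Naive frontmatter check: does a "name:" line appear near the top of the file?
# This is about the level shell gets you without a YAML parser — it says
# nothing about whether the surrounding YAML actually parses.
has_name_field() {
  head -n 20 "$1" | grep -q '^name:'
}
```

A file with `name: x` followed by an invalid `description: a: b` line still passes this check, which is exactly the class of breakage that went undetected for a week.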
## Try It Yourself

- Install: `npm install -g pulser`
- Navigate to any repo with Claude Code skills
- Run: `pulser eval`
- Fix whatever it finds
- Add the GitHub Action to `.github/workflows/skills.yml`
- Open a PR that touches a skill file and watch the check run
Total setup time: under 5 minutes. First eval run: under a second.
What's the worst silent skill failure you've run into? I'm building checks based on real failure modes, and every new horror story makes the linter better. Drop your experience in the comments.
If this saves you debugging time, bookmark it for when your team starts building a shared skill library. Follow for more on AI tooling meets actual engineering discipline — I write about what breaks in production, not what works in demos.
I build developer tools for AI-assisted engineering workflows. pulser started as a weekend script to stop my own skills from breaking and turned into an npm package and GitHub Action after the third team asked to use it.