I wrote an email-processing Skill with 8 detailed rules. Claude followed every one of them like an obedient but soulless intern — the output was correct but completely useless.
Then I deleted all 8 rules and replaced them with two sentences: "Which emails need my action, and which do I just need to know about?"
The result was 3x better. Claude started organizing information by urgency, merging redundant emails, and even flagging ones I could safely ignore.
That experience taught me something: writing instructions ≠ Skill engineering. There are three cognitive layers between the two.
Lesson 1: Your Skill's Entry Point Is Probably Broken
Here's an embarrassing fact: all of my Skills are triggered via slash commands. /read-think-write, /invest-analysis, /idc-inspection — every single time, I type the command manually.
This means the description field — the one that's supposed to determine "when the user says X, auto-trigger this Skill" — is completely dead weight in my setup.
Thariq from Anthropic wrote about this explicitly: description isn't documentation. It's a classifier — written for the AI to decide when to activate, not for humans to read.
Community benchmarks tell the story: unoptimized descriptions → 20% natural language trigger rate. Optimized → 50%. With examples → 90%.
There's also a counterintuitive design principle: descriptions should over-trigger. Recall matters more than precision. A false trigger wastes a few tokens — Claude enters the Skill, realizes it's not needed, and exits. But a missed trigger means the user thinks the Skill is useless and never tries natural language again.
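As a sketch of what an engineered entry point might look like (the skill name and wording here are hypothetical, not taken from a real Skill), a description written as a classifier names the task, lists trigger phrases, and leans toward recall:

```yaml
---
name: email-triage
description: >
  Triage the user's inbox. Use when the user asks to process, summarize,
  or prioritize email, or says things like "what's in my inbox",
  "which emails need a reply", or "catch me up on mail". Prefer
  triggering on any email-related request; exit early if not applicable.
---
```

The examples in the description are what move the trigger rate; the "exit early" line makes over-triggering cheap.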
We all use slash commands because we never engineered the entry point.
Lesson 2: Stop Opening Blind Boxes
The second thing I ignored for too long was eval — the evaluation system.
When I used skill-creator, I'd iterate 2-3 rounds. Each round it scores the output and keeps the higher-scoring version. Final output: ~90 points. Ship it.
But if you asked me "what does 90 actually measure?" — I couldn't answer.
Layer 1: Trigger evaluation. It answers one question: when the user says X, does the Skill activate? This is the only layer I ever used, and it's where that 90 came from.
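A minimal sketch of such a trigger eval (everything here is an assumption for illustration: in practice you'd query Claude itself for the routing decision, and the keyword matcher below is only a runnable stand-in):

```python
# Hypothetical trigger-eval harness: labeled utterances vs. a stand-in classifier.
# Replace skill_triggers with a real call that asks Claude whether the Skill fires.
CASES = [
    ("summarize my inbox", True),
    ("which emails need a reply?", True),
    ("refactor this function", False),
]

def skill_triggers(utterance: str) -> bool:
    # Stand-in for Claude's activation decision on the description field.
    return any(kw in utterance.lower() for kw in ("email", "inbox", "mail"))

def trigger_rate(cases) -> float:
    hits = sum(skill_triggers(u) == expected for u, expected in cases)
    return hits / len(cases)

print(trigger_rate(CASES))  # 1.0 for these three toy cases
```

Run the same labeled set after every description edit; the 20% → 50% → 90% progression above is exactly this number.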
Layer 2: Quality evaluation. Run the same task with the Skill and without (bare Claude), then compare. That delta is your Skill's true value.
- Bare Claude: 80 pts, your Skill: 82 pts → hundreds of lines for 2 points. Not worth it.
- Bare Claude: 60 pts, your Skill: 95 pts → that 35-point delta is why your Skill exists.
No baseline comparison = slot machine development.
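That baseline comparison fits in a few lines; a sketch, assuming you already have a way to score a transcript (the scoring rubric is the hard part and is not shown, and all names here are made up):

```python
# Hypothetical baseline-vs-Skill comparison. Scores come from whatever
# rubric or LLM-judge you use to grade the two transcripts.
def skill_delta(baseline_score: float, skill_score: float) -> float:
    """The Skill's true value is its improvement over bare Claude."""
    return skill_score - baseline_score

def worth_shipping(baseline_score: float, skill_score: float,
                   min_delta: float = 10.0) -> bool:
    return skill_delta(baseline_score, skill_score) >= min_delta

print(worth_shipping(80, 82))  # False: 2 points isn't worth hundreds of lines
print(worth_shipping(60, 95))  # True: a 35-point delta justifies the Skill
```

The threshold is a judgment call; the point is that the decision uses the delta, never the Skill's absolute score.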
Layer 3: Process evaluation. Examine Claude's execution transcript. If Claude skips the same step in three test cases, that step isn't pulling its weight. Delete it — the Skill gets better.
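The transcript check can also be automated; a sketch, assuming you can extract the set of executed step names from each run (that extraction is not shown, and the step names are hypothetical):

```python
# Hypothetical process eval: find Skill steps Claude skipped in every test case.
def dead_steps(required: list[str], transcripts: list[set[str]]) -> set[str]:
    """Steps that appear in no transcript are candidates for deletion."""
    return {step for step in required if all(step not in t for t in transcripts)}

required = ["read_inbox", "group_by_urgency", "write_summary", "log_metrics"]
runs = [
    {"read_inbox", "group_by_urgency", "write_summary"},
    {"read_inbox", "write_summary"},
    {"read_inbox", "group_by_urgency", "write_summary"},
]
print(dead_steps(required, runs))  # {'log_metrics'}: skipped in all three runs
```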
Lesson 3: Don't Put Guardrails in the Prompt
A Hook in Claude Code is a shell command that runs automatically before or after Claude uses a tool:
```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs eslint --fix"
          }
        ]
      }
    ]
  }
}
```
Every time Claude writes a file, the system automatically runs linting. Claude doesn't need to "remember" — it doesn't even know it's happening.
We write tons of MUST, NEVER, ALWAYS in our SKILL.md files — all enforced by Claude's attention. Long context = forgotten rules.
But if you turn "never modify .env" into a PreToolUse hook — Claude tries to write .env, gets blocked by the system — the rule goes from "please remember this" to "you can't violate this even if you try."
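A sketch of that guard, assuming the documented hook contract (Claude Code passes the pending tool call as JSON on stdin, and a PreToolUse hook that exits with code 2 blocks the call, with stderr fed back to Claude); treat it as a starting point, not production code:

```python
#!/usr/bin/env python3
# Hypothetical PreToolUse guard: refuse any write that targets a .env file.
import json
import sys

def should_block(payload: dict) -> bool:
    path = payload.get("tool_input", {}).get("file_path", "")
    return path.endswith(".env")

if __name__ == "__main__":
    payload = json.load(sys.stdin)
    if should_block(payload):
        # stderr goes back to Claude; exit code 2 blocks the tool call
        print("Blocked: .env files are protected", file=sys.stderr)
        sys.exit(2)
```

Register it in settings.json under PreToolUse with a matcher like "Write|Edit" and the rule is enforced by the system, not by Claude's memory.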
Good engineering doesn't rely on AI discipline. It relies on system guarantees.
The Conclusion
Skills aren't written — they're tested, measured, and system-guaranteed.
The core loop: write → test → observe → revise → test.
Most people stop at "write." I did too.
If you're triggering everything via slash commands, iterating by gut feel, and putting all your rules in the prompt — maybe it's time to pause and see what you've been skipping.