Stephan Miller

Posted on Jun 24 • Originally published at stephanmiller.com on Jun 17

The Agent Skills Guide I Wish I'd Had

#agentskills #claudecodeskills #llmcontextmanagement #aiagentdevelopmentgu

I’ve got over a dozen projects in many different states of “done,” and for a long time every single one of them started with me typing the same context tax into a fresh session. CLAUDE.md helped. But CLAUDE.md loads everything, every session, whether the task needs it or not. What I actually wanted was a way to hand the agent the right knowledge at the right moment and pay nothing for it the rest of the time.

That’s a skill. And after a few months of building them for real (one that researches model trends and accidentally turned into its own weekly blog post, plus a few that do boring research so I don’t have to), I’ve got opinions. This is the guide I wish someone had handed me. It’s Claude Code first, because that’s my preferred driver, but every other coding agent I’ve considered using gets its own section near the end, quirks and all.

What a Skill Actually Is (and the Three Things It Isn’t)
The Mistake Everyone Makes First: Treating It Like a Shell Script
The Part That Saves You Tokens: How Claude Code Loads Skills
Building Your First Skill in Claude Code
The Folder Is the Feature
- A Claude Code-specific trick: skill-scoped hooks
Skills Actually Worth Building
The Other Guys: Skills Everywhere Else
How Good Skills Actually Get Built
Organize Before It Becomes a Swamp
The Skills I Actually Reach For
Lessons I Had to Learn the Hard Way
- Don’t trust a skill’s first trial. Build a way to catch the ones that rot
- Revisit your descriptions. Treat global ones completely differently
The Honest Version

What a Skill Actually Is (and the Three Things It Isn’t)

A skill is a folder. At minimum it’s one file, SKILL.md, with a little YAML frontmatter on top and plain markdown instructions underneath. The agent reads the frontmatter at startup, decides on its own whether the skill is relevant to what you’re doing, and pulls in the full thing only when it is.

That last part is the whole point: a skill is conditionally loaded context. It sleeps in an index until the model decides it matters, then wakes up. That makes it different from the three things people constantly confuse it with.

A prompt is something you type. It lives for one turn and dies when the session ends.

A CLAUDE.md (or AGENTS.md, the more portable name a lot of tools now read) is always-on. It’s the employee handbook: team conventions, non-negotiable standards, the stuff that should apply to everything. The problem is that you pay for every line of it on every single turn, whether you’re touching the billing code or fixing a typo. Cram domain knowledge in there and you’re burning context to tell the model about your payment state machine while it edits your README.

A slash command is a shortcut you fire manually. You decide when it runs.

A skill is the specialist manual that comes off the shelf only when the job calls for it. The model decides when (or you can call it like a slash command). CLAUDE.md is “here’s how we do things here.” A skill is “here’s the thing you’d otherwise get wrong, and only when you’re about to get it wrong.”

The Mistake Everyone Makes First: Treating It Like a Shell Script

A skill is not a batch file. An LLM is not a command executor. It’s a probabilistic model that reads your instructions and decides what to do. There is no guarantee your steps run in order. There is no guarantee every line gets followed. If you write a skill as a numbered list of shell commands, you haven’t written a skill. You’ve written documentation that will fail in surprising ways the first time reality doesn’t match your happy path.

Think of it like directing instead of programming. The model is the talent. It can act, it has instincts, it’s done this before. Your skill is the shot list and the blocking notes for this specific scene. You don’t tell a good actor which muscles to move. You give them motivation, constraints, and the things they can’t know on their own, then you let them perform.

So don’t write this:

git checkout main
git checkout -b fix-branch
git cherry-pick <sha>
git push origin fix-branch

Write this:

Cherry-pick the commit onto a clean branch off main. Resolve conflicts by preserving the original intent of the change. If it can’t land cleanly, stop and explain why instead of forcing it.

The second version works better precisely because it gives the model room to handle the mess. The newer and more capable the model, the truer this gets. A smart model will interpret your rigid steps and quietly do something better. Or worse, get confused trying to follow a script that no longer fits the situation. Give it judgment criteria. Let it execute.

The Part That Saves You Tokens: How Claude Code Loads Skills

Skills load in three stages, and understanding this is the difference between a skill that stays cheap and one that taxes every turn of your session.

Stage	What loads	Roughly what it costs	When you pay
Index	Just the `name` and `description` from the frontmatter	A handful of tokens per skill	Every session, always
Body	The full `SKILL.md`	A few hundred lines, ideally	When the agent decides the skill applies
Runtime	Files in `references/`, `scripts/`, `assets/`	Effectively unlimited	Only when the agent actually opens them

The index is paid by everyone, every session, forever. Every skill you have installed contributes its name and description to a list the agent scans at startup. This is why the description has to be tight: every character burns tokens on every session, including the ones where the skill never fires.

The body is paid once the skill triggers, and then it sits in context until the session ends or hits a compaction boundary. Load five fat skills in one session and you’re carrying all five bodies the whole way. A skill stuffed with fluff doesn’t just hurt itself. It degrades every other skill loaded next to it.

The runtime files are basically free until needed. This is where the heavy stuff goes: full API references, error-code tables, the long boring rules nobody needs most of the time. The agent reads them on demand, and only the parts it needs.

Get this right and a skill stays dormant and cheap until it earns its place. Get it wrong (everything jammed into one giant SKILL.md) and you pay full price even when the task needed ten percent of it. I’ve seen the real-world version of this: a bloated monolithic skill, restructured into a thin spine pointing at a few reference files, cut its context cost to roughly a third with zero change to the actual instructions. Same words, different shape, about a third of the cost.
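To make the index-versus-body split concrete, here is a minimal sketch that separates a SKILL.md into its frontmatter and body and gives a rough size for each. The four-characters-per-token ratio is an assumption for illustration, not a real tokenizer, and the frontmatter parsing only handles flat `key: value` lines.

```python
CHARS_PER_TOKEN = 4  # rough assumption for illustration, not a real tokenizer


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the file as plain markdown


def estimate_tokens(text: str) -> int:
    """Very rough size estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def skill_costs(skill_md: str) -> dict[str, int]:
    """Estimate what the always-loaded index entry and the on-demand body cost."""
    meta, body = split_frontmatter(skill_md)
    index_entry = f"{meta.get('name', '')}: {meta.get('description', '')}"
    return {"index": estimate_tokens(index_entry), "body": estimate_tokens(body)}
```

Run it over every installed skill and the two numbers show which descriptions are taxing every session and which bodies are candidates for moving material into `references/`.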

Building Your First Skill in Claude Code

Don’t start with a document. Start with the thinnest thing that helps. In Claude Code, skills live in two obvious places:

# Personal: follows you across every project
~/.claude/skills/your-skill-name/SKILL.md

# Project: checked into the repo, everyone on it gets the skill
.claude/skills/your-skill-name/SKILL.md

Start personal. Break things where nobody’s watching. Promote to the project repo once it actually works. And the minimum viable skill is genuinely this small:

---
name: react-component-conventions
description: Load when building or modifying React components, referencing MUI components, or implementing our design system patterns.
---

## What this provides

Our components use MUI as the base. Files go in `src/components/` organized
by domain, not by type. Props interfaces live in the same file as the component.

## Gotchas

- Don't use the `sx` prop for styles reused across components. Extract a styled component instead
- Always pull theme values through `useAppTheme()`, never direct MUI theme imports
- Forms use react-hook-form with our `FormField` wrapper. Don't hand-roll form state

Two sections. That’s it. You’ll grow it later, and you’ll grow it from failures, not from imagination.
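If typing that skeleton by hand gets old, a small scaffold helps. This is a sketch under assumptions: `scaffold_skill` is a hypothetical helper, not part of Claude Code, and it only writes the two-section layout shown above.

```python
from pathlib import Path

# The two-section skeleton from the example above, with placeholders.
TEMPLATE = """---
name: {name}
description: {description}
---

## What this provides

## Gotchas
"""


def scaffold_skill(skills_root: Path, name: str, description: str) -> Path:
    """Create <skills_root>/<name>/SKILL.md and return its path."""
    skill_file = skills_root / name / "SKILL.md"
    if skill_file.exists():
        # Never clobber a skill that has already collected gotchas.
        raise FileExistsError(f"{skill_file} already exists; edit it instead")
    skill_file.parent.mkdir(parents=True, exist_ok=True)
    skill_file.write_text(
        TEMPLATE.format(name=name, description=description), encoding="utf-8"
    )
    return skill_file
```

For a personal skill, pass `Path.home() / ".claude" / "skills"` as `skills_root`; for a project skill, pass the repo’s `.claude/skills` directory.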

The description is the hardest line you’ll write

The description isn’t a summary. It’s a routing trigger. It’s the one thing the agent sees in that always-loaded index, and it alone decides whether your skill loads. And whether it wrongly loads during unrelated tasks and contaminates them.

A bad description describes the skill’s contents. A good description describes the user’s state of mind when they need it:

Bad	Good
“This skill helps with our billing library”	“Load when working with billing-lib, subscription states, or invoice generation. Covers the edge cases and footguns.”
“Deployment workflow docs”	“Load when the user says ‘babysit the PR’, ‘watch CI’, ‘make sure this lands’, or ‘deploy the service’.”

Write it from the user’s perspective, keep it short, and don’t summarize the workflow. One sloppy description doesn’t just make your skill miss. It makes the whole shelf noisier, because now your skill barges into tasks it has no business in. Every skill you add risks making every other skill slightly less accurate. The description is where you control that.
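The bad/good split above can be turned into a crude lint. The heuristics here (a trigger phrase, a content-style opener, a 200-character budget) are assumptions chosen for illustration. They are not rules any agent enforces, and passing them does not prove a description routes well.

```python
import re

MAX_CHARS = 200  # arbitrary budget for illustration
TRIGGER_PATTERN = re.compile(r"\b(load|use) when\b", re.IGNORECASE)
# Openers that usually mean the description summarizes contents.
CONTENT_OPENERS = ("this skill", "a skill", "documentation", "docs")


def lint_description(description: str) -> list[str]:
    """Return human-readable problems found in a skill description."""
    text = description.strip()
    problems = []
    if not TRIGGER_PATTERN.search(text):
        problems.append("no trigger phrasing such as 'Load when ...'")
    if text.lower().startswith(CONTENT_OPENERS):
        problems.append("opens by describing the skill's contents")
    if len(text) > MAX_CHARS:
        problems.append(f"{len(text)} chars, over the {MAX_CHARS}-char budget")
    return problems
```

A lint like this only catches the obvious misses. The real test is still whether the skill fires on the right prompts and stays quiet on the wrong ones.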

Let Claude write it, then cut hard

Do a real task in Claude Code, manually feeding it all the context you’d normally re-explain. Notice what you repeated. At the end, tell it: “Write a skill that captures the pattern we just used. Focus on the knowledge I gave you, not the stuff you already knew. Keep it under 200 lines.” Then cut the result aggressively (first drafts always over-explain) and test it in a fresh session with zero carryover.

If you want the structured version of this, Claude Code ships skill-creator. Invoke it with /skill-creator and it interviews you, writes a draft, runs your test cases with and without the skill side by side so you can actually see what the skill buys you, and even tunes the description against should-trigger and should-not-trigger queries. It’s overkill for a quick library-reference skill. It’s exactly right for anything going into wide use. And it will use a lot of tokens. I’ve created only a handful of skills with it so far, and the most complex one took a complete Pro session to finish.

The gotchas section is the whole game

After it ships, a skill barely changes in the body. It evolves through gotchas. Agent does something dumb because its sane default doesn’t match your weird environment? Add a gotcha. One line, “Always run the build from the repo root, never from inside a module”, kills a class of error forever.

The Folder Is the Feature

The one-file skill is fine to start. But the reason skills beat a giant CLAUDE.md is the folder:

your-skill/
├── SKILL.md ← the hub: frontmatter + core instructions
├── references/ ← heavy docs, read only when needed
│ ├── api.md
│ └── error-codes.md
├── scripts/ ← code the agent runs, not rewrites
│ └── validate.py
└── assets/ ← templates and output shapes
    └── pr-template.md

The rule that keeps this sane: one hop from SKILL.md to anything. The hub points directly at references/api.md. One hop. The hub pointing at a file that points at another file that finally has the content? That’s two hops, and the agent will half-read the chain, lose the thread, and miss things. Keep it flat. Progressive disclosure, not a hierarchy for its own sake.
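The one-hop rule is easy to check mechanically. The sketch below is a hypothetical helper that looks for `.md` paths mentioned in SKILL.md, then reports any of those files that mention further `.md` files. It assumes references are written as plain relative paths, which will not hold for every skill.

```python
import re
from pathlib import Path

# Matches relative paths such as references/api.md (an assumed convention).
MD_PATH = re.compile(r"[\w./-]+\.md\b")


def second_hops(skill_dir: Path) -> dict[str, list[str]]:
    """Map each file SKILL.md mentions to the .md files it mentions in turn."""
    hub = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    chains: dict[str, list[str]] = {}
    for rel in sorted(set(MD_PATH.findall(hub))):
        target = skill_dir / rel
        if not target.is_file():
            continue
        mentioned = set(MD_PATH.findall(target.read_text(encoding="utf-8")))
        # Ignore self-references and pointers back to the hub.
        onward = sorted(mentioned - {rel, "SKILL.md"})
        if onward:
            chains[rel] = onward
    return chains
```

An empty result means every file the hub points at is a leaf. Anything else is a chain worth flattening.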

Each folder has a job:

references/: documentation too long for the body. API tables, error codes, domain rules that run pages. Put a table of contents at the top of anything over ~100 lines so the agent can jump instead of reading the whole thing.
scripts/: deterministic code you want run, not reinvented. Here’s the quiet efficiency win: when a script runs, only its output enters the context window, not its source. You can parse a file, hit an API, or run a whole validation suite and pay only for the result. Make the failure messages specific. “Field customer_name not found. Available: account_id, order_total” lets the agent self-correct. “Validation failed” makes it guess.
assets/: templates and locked-down output shapes. PR descriptions, report formats.
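Here is a sketch of what a `scripts/validate.py` with specific failure messages could look like. The required field names are a hypothetical schema borrowed from the example message above. The point is the shape of the error, which names both the missing field and the fields that are actually there.

```python
import json
import sys

# Hypothetical schema for illustration; a real script would load its own rules.
REQUIRED_FIELDS = ("customer_name", "order_total")


def validate_record(record: dict) -> list[str]:
    """Return one specific, self-correctable message per missing field."""
    available = ", ".join(sorted(record)) or "(none)"
    return [
        f"Field {field} not found. Available: {available}"
        for field in REQUIRED_FIELDS
        if field not in record
    ]


def main(path: str) -> int:
    """Validate one JSON file; only these printed lines reach the context."""
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    errors = validate_record(record)
    for error in errors:
        print(error)
    if not errors:
        print("OK")
    return 1 if errors else 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

The agent would run something like `python scripts/validate.py order.json` and see only the printed lines, which is why those lines are worth making precise.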

And the antipatterns that bite everyone:

Frontmatter on a reference file. Frontmatter is what gets promoted to the always-loaded index. Put name: and description: on a reference file and you’ve just made it a top-level skill the agent can trigger without the parent that gives it context. Strip frontmatter from everything except the root SKILL.md.
Hardcoded paths. cd modules/web works on your machine. Your teammate’s repo has packages/frontend/web. Tell the agent to discover the path: “find the directory with the frontend package.json.”
One monolithic file. Already covered the token math. Don’t.
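The first antipattern is the easiest to catch automatically. This sketch flags any markdown file other than the root SKILL.md that opens with a `---` line. It is a heuristic: a file that happens to start with a horizontal rule would be a false positive.

```python
from pathlib import Path


def has_frontmatter(path: Path) -> bool:
    """True when the file's first line is a frontmatter delimiter."""
    with path.open(encoding="utf-8") as handle:
        return handle.readline().strip() == "---"


def stray_frontmatter(skill_dir: Path) -> list[Path]:
    """List markdown files, other than the root SKILL.md, with frontmatter."""
    root_file = skill_dir / "SKILL.md"
    return sorted(
        path
        for path in skill_dir.rglob("*.md")
        if path != root_file and has_frontmatter(path)
    )
```

Anything it returns is a candidate for having its frontmatter stripped before the skill ships.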

A Claude Code-specific trick: skill-scoped hooks

This is where Claude Code pulls ahead of most of the field. You can define hooks in a skill’s frontmatter, and they’re only active while that skill is active. A security skill can register a PreToolUse hook that inspects Bash commands and blocks rm -rf. A deployment skill can attach a PostToolUse hook that reminds the model to run the verification script after touching release files. The rule lives and dies with the skill instead of polluting every session. Most other agents can’t do this yet. They lean on bundled scripts and always-on instructions instead, which I’ll get to.
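A minimal sketch of the script such a PreToolUse hook might call. It assumes the hook receives the tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and that exiting with code 2 blocks the call and hands stderr back to the model. Confirm both against the current Claude Code hooks documentation before relying on them. The frontmatter wiring is left out, and the `rm` detection is deliberately naive.

```python
import json
import shlex
import sys


def is_recursive_force_rm(command: str) -> bool:
    """Detect `rm` invoked with both recursive and force flags (naive check)."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        tokens = command.split()  # unbalanced quotes: fall back to a plain split
    for i, token in enumerate(tokens):
        if token != "rm" and not token.endswith("/rm"):
            continue
        recursive = force = False
        for flag in tokens[i + 1:]:
            if not flag.startswith("-"):
                break  # first non-flag argument ends the flag list
            if flag.startswith("--"):
                recursive = recursive or flag == "--recursive"
                force = force or flag == "--force"
            else:
                recursive = recursive or "r" in flag.lower()
                force = force or "f" in flag
        if recursive and force:
            return True
    return False


def decide(hook_input: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one tool-call event."""
    if hook_input.get("tool_name") != "Bash":
        return 0, ""
    command = hook_input.get("tool_input", {}).get("command", "")
    if is_recursive_force_rm(command):
        return 2, "Blocked: recursive force delete. Remove specific paths instead."
    return 0, ""


def main() -> int:
    """Read the event from stdin; the caller passes the result to sys.exit."""
    code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)
    return code
```

A real script would end with `sys.exit(main())`. Keeping the decision in `decide` means the blocking logic can be tested without a live session.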

Skills Actually Worth Building

I won’t list every category. After watching how teams and solo builders use these, a few earn their keep more than the rest:

Verification skills are the best return on the list. These teach the agent to check its own work: a Playwright script that walks your signup flow with assertions at each step, or a checker that validates responses against your OpenAPI spec. The power is the loop: the agent runs the check, sees the failure, fixes it, and re-runs, all in one task. Without one, it writes code and ships it optimistically and you find the bug. The Claude Code team has noted that the investment in verification skills pays out disproportionately, and that matches my experience.

Library and API reference skills handle the internal stuff the model can’t know: your billing library’s edge cases, your migration patterns, the navigation component you wrote that shares a name with nothing public. The core docs are table stakes here. The gotchas are the value.

Scaffolding skills generate boilerplate that’s already shaped right: a new endpoint pre-wired to your architecture, a component shell with your styling conventions baked in.

Runbook skills map a symptom or an error signature to a structured investigation. Gold for debugging and on-call.

Onboarding skills turn the docs you already have into something the agent can actually use. Cheap to build, more useful than you’d guess, because the raw material already exists. It just needs packaging.

If you want a feel for how far this scales, I run a weekly model-trends blog post off a single skill now. It started as a thing to save me money on research and quietly became infrastructure. And now it runs automatically every Tuesday morning at 7 AM and emails me when it’s done, so I can edit and publish it.

The Other Guys: Skills Everywhere Else

The SKILL.md format is an open standard now. The same skill folder, untouched, works across a growing list of agents. What changes between tools is two things: where the skill files live, and how the tool behaves once it loads one. So this section is organized exactly that way: for each tool, where you put it, and the quirk that’ll trip you up. One piece of good news up front: a shared install location, .agents/skills, has started to emerge as the neutral ground several of these tools quietly agree on. I’ll come back to why that matters when we get to organizing the mess.

The open-source agents

If you’ve only used Claude Code, you’ve missed that the open-source side of this has gotten genuinely good. I run open models through OpenRouter (DeepSeek, Qwen Coder, GLM and friends) partly to lean less on Claude for everything, and the agents below are how I drive them. Worth being clear about one thing the marketing blurs: DeepSeek and Qwen are models, not agents. You don’t write skills “for DeepSeek.” You point one of these open agents at a DeepSeek or Qwen endpoint and it loads the skills. Keep that straight and the whole map makes more sense.

OpenCode is the one that’s eaten the open-source world. By mid-2026 it’s the most-starred open coding agent by a wide margin. It’s a terminal agent, provider-agnostic (cloud APIs, OpenRouter, local models via Ollama), and it supports the skill standard natively through a skill tool the agent calls on demand. The quirk worth knowing: it’s promiscuous about where it reads skills from. Project skills go in .opencode/skills/, personal ones in ~/.config/opencode/skills/, but it also reads .claude/skills/ and .agents/skills/, both project and global. Which means if you already have Claude Code skills, OpenCode will often just find and use them with zero porting. I’ve been running it on smaller projects and that cross-reading is a quietly great feature.

Pi (the pi-mono toolkit) is Mario Zechner’s harness and it’s the engine underneath my OpenClaw setup, which is where most of my open-model tinkering actually happens. The whole pitch is subtraction: a four-tool core (read, write, edit, bash), a system prompt under a thousand tokens, MIT-licensed TypeScript, and a refusal to bolt on features just because everyone else has. He built it as a reaction to Claude Code getting heavier, which is its own kind of funny given this whole post. Skills fit that minimalist philosophy perfectly. Pi lazy-loads them on demand, and its skills are deliberately cross-compatible with Claude Code and Codex, so the same SKILL.md I wrote for my daily driver runs unmodified inside the thing answering my Telegram messages. That’s the open standard doing exactly what it promised.

Aider is the terminal OG, around since 2023, and still the gold standard if you live on the command line. Git is a first-class citizen. It stages changes and writes commit messages for you. Its context convention predates the skill standard: it leans on a CONVENTIONS.md you pass in as a read-only file, which is closer to an always-on instructions file than to on-demand skills. Different philosophy, same goal.

Cline is the top pick for VS Code people who want the editor-integrated experience instead of a terminal. Strong multi-file reasoning, and it reads the skill standard (still an experimental, opt-in feature you flip on in settings), picking up skills from .cline/skills/, .clinerules/skills/, and .claude/skills/.

Goose, from Block, is the more autonomous of the bunch. It plans, executes, and iterates with less hand-holding, and it’s built around extensions and custom tools.

Gemini CLI is Google’s open-source (Apache-licensed) terminal agent. It speaks the SKILL.md standard natively now: drop skills in .gemini/skills/ (project) or ~/.gemini/skills/ (personal) and it injects their name and description at session start, then calls an activate_skill tool when a task matches. The same SKILL.md you wrote for Claude Code runs here unmodified. Its always-on instruction file is GEMINI.md: same idea as CLAUDE.md, different filename.

Codex CLI (OpenAI)

OpenAI’s terminal agent. It implements the skill standard and behaves a lot like Claude Code: progressive disclosure, description-matched loading on demand. The wrinkle is that Codex adds its own metadata layer: alongside SKILL.md you can drop an agents/openai.yaml file for UI metadata, invocation policy, and tool dependencies. Skills live in .agents/skills/, which, not coincidentally, is one of the same portable locations OpenCode reads, so a skill dropped there is visible to both. Its always-on instruction file is the portable AGENTS.md, which OpenAI pushed hard as a cross-tool convention. So a skill written for Claude Code mostly drops in; you’re adding Codex’s metadata file, not rewriting anything.

Cursor

Cursor took the longest to come around, but it’s here now: Cursor natively supports SKILL.md as an open standard, and the same skill folder you wrote for Claude Code drops in untouched. It reads skills from .cursor/skills/ and .agents/skills/ at the project level, ~/.cursor/skills/ and ~/.agents/skills/ globally, and for backward compatibility it also picks up .claude/skills/ and .codex/skills/. It walks the skills root recursively too, so a SKILL.md nested deeper in a repo gets scoped to its containing folder automatically. Frontmatter is the familiar name and description, plus optional paths globs for file-scoping and disable-model-invocation for a skill that only fires when you call it by name. It also reads AGENTS.md for always-on instructions.

The history is worth knowing, because it’s what you’ll still find in older Cursor projects. Cursor’s context system grew up around rules, not skills: the .cursor/rules/ directory full of .mdc files (a plain .md in there gets ignored, because rules need frontmatter). Rules can be always-on, auto-attached by file glob, or pulled in on demand via their description, which always gave you some of the conditional-loading behavior skills have. Rules still work, but skills are the forward path now, and Cursor ships a /migrate-to-skills command that converts existing rules and slash commands over. Starting fresh, author skills. Sitting on a pile of rules, migrate them.

GitHub Copilot

The one with the home-field advantage if your work already lives on GitHub. The distinguishing thing about Copilot’s skills isn’t the format. It’s the reach. The same SKILL.md works across the whole Copilot surface: the cloud agent, code review, the Copilot CLI, the desktop app, and VS Code’s agent mode. Write a skill once and it shows up everywhere Copilot does.

On storage, Copilot is the most catholic of the bunch. The GitHub-native default is .github/skills/, but it also reads .claude/skills/ and .agents/skills/ at the repo level, with personal skills in ~/.copilot/skills/ or ~/.agents/skills/. So between Copilot, Codex, and OpenCode all reading .agents/skills/, that folder really is becoming the lingua franca. Distribution has a native path too: gh skill discovers and installs skills straight from GitHub repositories, which is the least-surprising workflow for a team that already does everything through GitHub. Org- and enterprise-wide skill scopes are still labeled “coming soon,” so for now you’re working with personal and project.

The quirk to know: Copilot doesn’t do the skill-scoped hooks trick Claude Code does. There’s no per-skill lifecycle hook you can register from frontmatter. For deterministic behavior you lean on bundled scripts, allowed-tools pre-approval for trusted commands, and always-on rules in copilot-instructions.md (or AGENTS.md). It’s not a dealbreaker. It just means the “block this command before it runs” pattern lives somewhere other than the skill itself.

The commercial top three

If you’re picking a paid agent and money’s the deciding factor, the field really narrows to three:

Claude Code (Anthropic): the most mature skill system, full stop. Skill-scoped hooks, /skill-creator, the cleanest progressive-disclosure model. It’s my daily driver for a reason, even as I question that habit out loud sometimes.
Cursor: the best editor-native experience if you want your agent inside the IDE rather than a terminal. It now natively supports SKILL.md (it used to be rules-first), so your skills travel here too, with a /migrate-to-skills command for older rule setups.
GitHub Copilot: if your team already lives in GitHub, it implements agent skills through VS Code and slots into existing repo workflows with the least friction.

The skill cheat sheet

Same SKILL.md, different mailboxes:

Tool	Where it reads skills	Worth knowing
Claude Code	`.claude/skills/`	The reference implementation
Codex	`.agents/skills/`	Optional `agents/openai.yaml` for metadata
OpenCode	`.opencode/skills/`, `.claude/skills/`, `.agents/skills/`	Cross-reads your Claude Code skills
Copilot	`.github/skills/`, `.claude/skills/`, `.agents/skills/`	`gh skill` to install from repos
Cursor	`.cursor/skills/`, `.agents/skills/`, `.claude/skills/`	Older `.cursor/rules/` still works; `/migrate-to-skills` moves you off it
Gemini CLI	`.gemini/skills/`	Same `SKILL.md`, runs unmodified

For always-on instructions, the names are CLAUDE.md, GEMINI.md, or the increasingly universal AGENTS.md. Write to the standard, keep your paths discoverable, and most of your skills travel for free.
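The cheat sheet translates directly into a lookup. This sketch hard-codes the project-level directories from the table above and reports which skills each tool would see in a given repo. It only looks one level deep, ignores the personal (home-directory) locations, and does not model behavior such as Cursor’s recursive walk, so treat it as an approximation.

```python
from pathlib import Path

# Project-level locations copied from the cheat sheet above.
SKILL_DIRS = {
    "Claude Code": [".claude/skills"],
    "Codex": [".agents/skills"],
    "OpenCode": [".opencode/skills", ".claude/skills", ".agents/skills"],
    "Copilot": [".github/skills", ".claude/skills", ".agents/skills"],
    "Cursor": [".cursor/skills", ".agents/skills", ".claude/skills"],
    "Gemini CLI": [".gemini/skills"],
}


def visible_skills(repo: Path) -> dict[str, list[str]]:
    """Map each tool to the skill folder names it would find in `repo`."""
    result: dict[str, list[str]] = {}
    for tool, dirs in SKILL_DIRS.items():
        names: set[str] = set()
        for rel in dirs:
            base = repo / rel
            if not base.is_dir():
                continue
            for skill_file in base.glob("*/SKILL.md"):
                names.add(skill_file.parent.name)
        result[tool] = sorted(names)
    return result
```

Running it on a repo makes the `.agents/skills` argument visible: a skill placed there shows up for every tool in the table except Claude Code and Gemini CLI.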

How Good Skills Actually Get Built

The instinct you have to fight is opening an editor and documenting a skill before you’ve watched the agent fail without it. The right order is backwards from that:

Run the agent without the skill on three to five realistic tasks.
Write down exactly where it fails or assumes wrong.
Turn those failures into evaluations: what the agent should do, and crucially what it should not do. Negative examples are often worth more than positive ones.
Write the minimal skill that makes those evals pass.
Ship.

Starting from observed failures is the only thing that stops you from over-building. Most first drafts explain things the model already nails and skip the one gotcha that actually mattered.
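The failure-to-eval step can be as small as a list of prompts with strings the output must and must not contain. In this sketch `run_agent` is a hypothetical callable standing in for however you invoke your agent, and substring checks are a crude stand-in for real grading.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_include: list[str] = field(default_factory=list)
    must_not_include: list[str] = field(default_factory=list)  # negative examples


def run_evals(cases: list[EvalCase], run_agent: Callable[[str], str]) -> list[str]:
    """Run every case through `run_agent` and return a list of failures."""
    failures = []
    for case in cases:
        output = run_agent(case.prompt)
        for needle in case.must_include:
            if needle not in output:
                failures.append(f"{case.prompt!r}: expected {needle!r}")
        for needle in case.must_not_include:
            if needle in output:
                failures.append(f"{case.prompt!r}: found forbidden {needle!r}")
    return failures
```

Run the same cases once without the skill and once with it. The skill earns its place only if the failure list shrinks.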

Two more things I learned the slow way. Test across at least two model families before you trust a skill. A skill tuned on one model is calibrated to that model’s behavior, not just its raw capability. And a smarter model will often interpret your instructions more literally, not less. A writing skill told to “write short sentences” produced clean rhythmic prose on one model and choppy, mechanical garbage on the upgrade, because the better model applied the rule to every sentence regardless of feel. Keep a golden set of three or four prompts and re-run them on every model bump.

Organize Before It Becomes a Swamp

Here’s the lesson nobody puts in the getting-started guide, and it’s the one that would have saved me the most damage if I’d known it earlier: decide where skills live before you have a pile of them. If you don’t, you end up with the same skill in three places, slightly different in each, and a context window quietly contaminated by near-duplicates that fight each other. Then you spend an afternoon ripping skills out of scattered folders trying to remember which copy was the good one. Ask me how I know.

After enough of that, I landed on two tools and one rule.

Skillshare for the global stuff. Skillshare keeps a single source of skills in ~/.config/skillshare/skills and syncs them out to every AI CLI I use (Claude Code, Codex, Cursor, and the rest) so one skill follows me into every repo without me copying anything. Everything I put up there is genuinely global: tools I want available no matter what I’m working on, regardless of the project’s language or stack. There’s no per-project flavor to them, which is exactly why they can be global without causing trouble.

APM for the project-level stuff. Microsoft’s APM, the Agent Package Manager, treats agent context like dependencies in a manifest: skills, instructions, hooks, MCP servers, the whole pile, declared once and installed per repo. It’s package.json for your agent setup. I dug into how I use it in my piece on AI-native engineering, which is also where I first ran into the .agents/skills convention. APM installs standard skills there for the tools that speak it (Copilot, Cursor, OpenCode, Codex, Gemini, Claude). It’s not a universal standard with everything behind it, but enough tools follow it now that it’s become the practical neutral ground.

The rule that ties them together: global is for skills with no project opinion; project-level is for skills that do. This sounds obvious until you hit the case that forces it. One project enforces a strict, functional TypeScript style with a particular set of lint rules; another is an older codebase with completely different conventions. If I made a “typescript-conventions” skill global, it would fire in both and be wrong in one of them every time. And worse, it’d sit in the index polluting every session, including the Python ones. Project-level via APM means each repo gets exactly the conventions it wants and nothing it doesn’t. The context window only ever sees the skills that repo actually needs.

That’s the whole game with organization: keep the global shelf small and opinion-free, push everything project-specific down into the repo, and you never end up with two slightly-different skills quietly arguing inside the same context window.
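Finding the slightly-different copies is a mechanical job. This sketch hashes every `SKILL.md` under a list of skill roots and reports names that exist with more than one distinct content. Which roots to scan is up to you; nothing here is tied to Skillshare or APM.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def divergent_copies(roots: list[Path]) -> dict[str, list[Path]]:
    """Map skill names to their copies when those copies differ in content."""
    by_name: dict[str, dict[str, list[Path]]] = defaultdict(lambda: defaultdict(list))
    for root in roots:
        if not root.is_dir():
            continue
        for skill_file in sorted(root.glob("*/SKILL.md")):
            digest = hashlib.sha256(skill_file.read_bytes()).hexdigest()
            by_name[skill_file.parent.name][digest].append(skill_file)
    return {
        name: sorted(path for paths in versions.values() for path in paths)
        for name, versions in by_name.items()
        if len(versions) > 1  # identical copies are harmless; divergent ones fight
    }
```

Point it at your global skills directory plus the skill folders of the repos you work in. Anything it reports is a skill that needs one owner.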

The Skills I Actually Reach For

Since I’ve spent this whole post telling you to build skills, here’s what’s actually on my shelf. I’ve packaged some of the ones I wrote myself (or had an agent write for me, then cut down hard) into a public repo so you can use them. Each one is standalone, so grab just the folder you want:

vault-writer: adds and updates notes in my Obsidian vault following my templates and conventions, so I’m not hand-formatting frontmatter at midnight.
blog-idea-scorer: scans the vault for half-formed post ideas and scores them by how ripe they are (material on hand, draft progress, recency).
feature-story-research: mines a project’s finished work and my session logs to assemble the narrative material and outline for a “how I built this” post.
model-buzz-roundup: the one that accidentally became a newsletter. Researches what’s hot in open models and drafts the roundup.
fetch-anything: a wrapper that refuses to give up on a web page. It doesn’t do the fetching itself; it escalates through a stack of underlying fetch tools until one of them actually returns the content, and it hands back clean markdown instead of a soup of HTML. It exists because “the page blocked me” is not an acceptable answer when I know the content is right there.
modular-skill-creator: builds a skill as a lazy-loading router instead of one fat file: a thin SKILL.md that delegates to focused sub-workflows. Basically the folder-is-the-feature idea from earlier, turned into a tool.
verbalized-sampling: my skill version of the Verbalized Sampling technique (paper here). Instead of taking the model’s single safest answer, it asks for several candidates with explicit probabilities, which sidesteps the mode collapse that makes AI brainstorms so depressingly samey. My go-to when I want real options, not the most-likely one.
skill-hardener: the one I’m giving its own section below.

I’m leaving a couple of my best ones off this list on purpose. Some things stay in the nest. And a few favorites I didn’t write but reach for constantly:

skill-creator: Anthropic’s own. The structured way to build and benchmark a skill, baseline runs and all.
frontend-design: when I want a UI that doesn’t look like every other AI-generated dashboard.
playwright-cli: drives a real browser for verification skills and end-to-end checks. I’ve had more luck with it than with a Playwright MCP.
obsidian-cli and obsidian-markdown: the difference between an agent that thinks it knows Obsidian-flavored markdown and one that actually does.
ce-gemini-imagegen: generates and edits images through the Gemini image API. I cherry-picked this one out of the compound-engineering plugin; it does the text-to-image and image-editing work I’d otherwise leave a tab open for.
context-retrospective: analyzes an agent session after the fact to spot where my context and guidance need work. I honestly forget where I picked it up, which tells you how casually these things pile up.
adhd: spins up parallel idea branches under different cognitive frames (the biologist, the speedrunner, the ten-year-old, the zero-budget version), scores them, and prunes the dead ends. Not mine, but it’s the creativity hack I reach for when I’m stuck on something.

Lessons I Had to Learn the Hard Way

The how-it-works stuff above you can find in any decent guide. These next two I had to earn, and they’re the ones worth a sticky note on your monitor.

Don’t trust a skill’s first trial. Build a way to catch the ones that rot

A skill that passed its evals on Tuesday is not a skill that’s still good in a month. Models change under you, your other skills shift the context around it, and a description that routed perfectly starts mis-firing once you’ve added ten neighbors. The first trial is the beginning of trust, not the end of it.

I’d been chewing on this problem for a while (“how do I notice when a skill quietly stops pulling its weight?”), and then a /insights run suggested almost the exact same thing back to me, so I went ahead and built skill-hardener. It mines my recent session transcripts for recurring failure patterns, traces each one back to the skill responsible, and hardens that skill with a targeted fix plus a regression test so the same failure can’t sneak back in. Evals are how you prove a skill works before you ship it; skill-hardener is the regression suite that proves it still works after the world moved. The two aren’t the same job, and you want both.

It’s the kind of tool you don’t know you’ve been missing until your skills start silently degrading and you have no system for catching it.

I’ll be honest, though: I don’t think a manual-trigger regression skill is the final shape of this for me. I’ve got a bigger thing brewing: a local, always-on context layer that watches where my skills produce output I end up fixing by hand, and improves them straight from that telemetry instead of waiting for me to run a check. If that works the way I think it will, it makes skill-hardener mostly redundant. Which is fine. The best tools earn their own replacements. (More on that one another day, once it exists outside my notes.)

Revisit your descriptions. Treat global ones completely differently

The description is the routing trigger, and it’s also the thing that drifts most as your shelf grows. Re-read them periodically. But here’s the split I didn’t expect to land on: I tune project-level and global descriptions in opposite directions.

Project-level descriptions I push to the maximum (rich trigger language, lots of “load when” phrasing) because in a single repo I want the relevant skills firing automatically the moment they’re relevant. The blast radius is one project, so aggressive auto-loading is a feature.

Global descriptions I shrink to the bone. I’ve got somewhere around thirty global skills, and very few of them are ones I want auto-firing on a stray keyword across every project I touch. A global skill with a greedy description is a skill that barges into unrelated work everywhere. So most of mine are deliberately quiet. I know they exist, I keep the list short enough to remember, and when I want one I just ask for it by name. A small, well-known global shelf you invoke on purpose beats a big one that keeps interrupting. The minimal description is me choosing manual invocation for the things that shouldn’t have an opinion until I say so.
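To make the split concrete, here is a rough sketch of the two styles side by side. Both skills are hypothetical: the names, paths, and trigger phrases are invented for illustration and aren't files from my setup. The only things assumed about the format are the name and description fields in the SKILL.md frontmatter, with project skills under .claude/skills/ and global ones under ~/.claude/skills/.

A project-level skill, say .claude/skills/release-notes/SKILL.md, with a deliberately greedy description:

```yaml
---
# hypothetical project skill, invented for illustration
name: release-notes
description: Drafts release notes for this repo. Load when the user mentions a release, a changelog, a version bump, or a tag, or asks what changed since the last deploy.
---
```

A global skill, say ~/.claude/skills/weekly-review/SKILL.md, with a description shrunk to the bone:

```yaml
---
# hypothetical global skill, invented for illustration; terse on purpose so it rarely auto-fires
name: weekly-review
description: Weekly review checklist. Use only when asked for by name.
---
```

The first one is written to match as many phrasings of the task as possible. The second gives the router almost nothing to latch onto, which is the point: it stays out of the way until you call it.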

The Honest Version

Good skills start bad. The first draft over-explains, the description reads like documentation instead of a trigger, and the gotcha that would’ve saved the first three failures isn’t in there yet. That’s not a sign you did it wrong. That’s the normal starting state.

What separates the skills that become permanent fixtures from the ones that get abandoned is whether you commit to the loop: ship thin, watch it fail, add a gotcha, repeat. The useful skills in my setup weren’t built in one exhaustive Saturday. They were shipped on a Tuesday, used Wednesday, patched Thursday, and they’re still earning their keep months later.

So pick the knowledge gap that’s annoying you most today: the thing you re-explain to Claude Code every single morning. Write the minimal skill. Drop it in .claude/skills/. Fix it the next time it fails.

Nothing fancy. Still the work that actually moves things forward. And the nice part is that once you’ve written it for Claude Code, it mostly just works everywhere else too. Which means the ten minutes you stop wasting every morning, you stop wasting in every tool at once.
