The Problem with Writing Skills by Hand
You've written a skill for your AI coding agent. It's got clear instructions, proper formatting, a good description. You test it in a session — it works. Ship it, right?
Not so fast.
Skills trigger based on their description field — a 1-2 sentence summary in the SKILL.md frontmatter. And here's the thing: descriptions that seem crystal clear to a human often trigger incorrectly. Too specific, and the skill never activates when it should. Too broad, and it fires on unrelated prompts.
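For context, a skill's frontmatter might look something like this (a hypothetical example — the name and description values are invented for illustration):

```
---
name: sql-migration-review
description: Reviews SQL migration files for destructive operations
  and missing rollbacks. Use when the user asks to check, audit,
  or write a database migration.
---
```

Everything below the frontmatter is the skill body; the description above is the only part the agent sees when deciding whether to load it.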
The result: skills that feel right in theory but fail unpredictably in practice. And there's no systematic way to measure whether a skill is getting better or worse across iterations.
This is the same problem software engineering solved decades ago with automated testing. Skills are software. They need testing too.
What Is Eval-Driven Development?
Eval-driven development is the practice of:
- Writing test cases that define expected behavior
- Running those tests automatically to measure actual vs. expected outcomes
- Using the results to improve iteratively, with quantifiable evidence
For AI agent skills, this means:
- Generating test prompts (should-trigger and should-not-trigger queries)
- Running each prompt with and without the skill
- Comparing outputs to see if the skill actually improves results
- Optimizing the description so the skill triggers on the right prompts
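In miniature, checking triggering accuracy might look like this sketch. The `agent_triggers` function is a naive keyword stand-in for a real agent run, not part of any actual API:

```python
# Triggering-accuracy sketch: each test case is a prompt plus an
# expectation of whether the skill should fire on it.
CASES = [
    {"prompt": "Review this SQL migration for safety", "should_trigger": True},
    {"prompt": "What's the weather in Paris?",         "should_trigger": False},
]

def agent_triggers(prompt: str) -> bool:
    # Placeholder: a naive keyword check standing in for a real agent run
    # that would load the skill and report whether it activated.
    return "migration" in prompt.lower()

def triggering_accuracy(cases) -> float:
    correct = sum(agent_triggers(c["prompt"]) == c["should_trigger"] for c in cases)
    return correct / len(cases)

print(triggering_accuracy(CASES))  # 1.0 for this toy pair
```

The real tool generates 20+ such cases automatically; the shape of the measurement is the same.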
The Skill Creation Lifecycle
opencode-skill-creator implements eval-driven development as a structured lifecycle:
Create → Evaluate → Optimize → Benchmark → Install
  ↑                                │
  └──────────── Iterate ───────────┘
1. Create
Start with an intake interview. The skill-creator asks 3-5 targeted questions:
- What should this skill enable the agent to do?
- When should it trigger?
- What output format is expected?
- What workflow steps must be preserved exactly?
This captures intent before writing any code.
2. Evaluate
Auto-generate eval test sets — realistic prompts categorized as should-trigger or should-not-trigger. Run each test case twice:
- With skill: The agent has the skill loaded
- Without skill: The agent runs without it (baseline)
This measures whether the skill actually improves the output for relevant prompts.
3. Optimize
The description optimization loop treats triggering accuracy as a search problem:
For each iteration (up to 5):
1. Evaluate current description on train set (60%)
2. Analyze failure patterns
3. LLM proposes improved description
4. Evaluate on both train AND test (40%) sets
5. Select best description by test score
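The loop above can be sketched as follows. The `evaluate` and `propose_description` helpers are stand-ins (the real tool backs them with agent runs and an LLM call); only the loop structure — split, propose, score on held-out cases, keep the best — reflects the workflow:

```python
import random

def evaluate(description: str, cases: list) -> float:
    # Stand-in scorer: in reality this runs the agent on every case;
    # here longer descriptions score higher, capped at 1.0.
    return min(len(description) / 100, 1.0)

def propose_description(description: str, train: list) -> str:
    # Stand-in for the LLM proposal step.
    return description + " Use when relevant."

def optimize_description(initial: str, cases: list, iterations: int = 5) -> str:
    random.seed(0)
    random.shuffle(cases)
    split = int(len(cases) * 0.6)                # 60/40 train/test split
    train, test = cases[:split], cases[split:]

    best_desc, best_score = initial, evaluate(initial, test)
    desc = initial
    for _ in range(iterations):
        desc = propose_description(desc, train)  # proposal from train failures
        score = evaluate(desc, test)             # held-out check
        if score > best_score:                   # select best by TEST score
            best_desc, best_score = desc, score
    return best_desc
```

Because selection uses the test score, a candidate that only memorized the training prompts never wins.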
The 60/40 train/test split prevents overfitting. A description that works perfectly on train queries but fails on held-out test queries is overfit — it has memorized specific prompts rather than learning the general pattern.
4. Benchmark
Run the full eval suite across multiple iterations with variance analysis. This answers:
- Is the skill getting consistently better?
- Are there eval cases where the skill never triggers correctly?
- How much variance is there across runs?
The benchmark includes:
- Pass rates (with-skill vs. baseline)
- Timing data (tokens, duration)
- Mean ± standard deviation for each metric
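Reporting mean ± standard deviation across runs is plain statistics; a sketch with Python's stdlib (the pass rates are invented numbers for illustration):

```python
from statistics import mean, stdev

# Pass rates from five benchmark runs of the same eval suite
# (invented numbers for illustration).
with_skill = [0.90, 0.85, 0.95, 0.90, 0.85]
baseline   = [0.60, 0.55, 0.65, 0.60, 0.55]

for label, runs in [("with skill", with_skill), ("baseline", baseline)]:
    print(f"{label}: {mean(runs):.2f} ± {stdev(runs):.2f}")
```

If the two mean ± std intervals barely overlap, as here, the improvement is real rather than run-to-run noise.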
5. Install
Install the final validated skill to project-level (.opencode/skills/) or global (~/.config/opencode/skills/). Only the final version gets installed — eval artifacts stay in the staging directory.
Why This Works
Skills are software
They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). Just like any software, they need testing.
Manual testing doesn't scale
You can test a skill manually in a session, but that's one prompt, one run, no measurement. Eval-driven development gives you 20+ test cases, multiple runs per case, and quantitative metrics.
Description optimization is more impactful than skill content
The description field is the primary triggering mechanism. A perfectly-written skill with a poor description won't trigger. An average skill with an optimized description will trigger reliably. The optimization loop focuses effort where it matters most.
Train/test splits prevent overfitting
If you only test on the same queries you optimize for, descriptions become overfit — they work on those specific prompts but fail on real-world usage. The 60/40 split keeps you honest.
Human review catches what automation misses
The visual eval viewer puts outputs side by side so you can see with your own eyes whether the skill is producing good results. Quantitative metrics tell you if it's triggering correctly; human review tells you if the output is actually useful.
Getting Started
npx opencode-skill-creator install --global
Then ask OpenCode to create or improve a skill. The eval-driven workflow starts automatically.
Apache 2.0, free, open source. Works with any of OpenCode's supported models.
GitHub: https://github.com/antongulin/opencode-skill-creator