The Problem with Writing Skills by Hand
You've written a skill for your AI coding agent. It's got clear instructions, proper formatting, a good description. You test it in a session — it works. Ship it, right?
Not so fast.
Skills trigger based on their description field — a 1-2 sentence summary in the SKILL.md frontmatter. And here's the thing: descriptions that seem crystal clear to a human often trigger incorrectly. Too specific, and the skill never activates when it should. Too broad, and it fires on unrelated prompts.
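For context, a skill's frontmatter might look something like this (a hypothetical example — the name and description values are invented for illustration):

```
---
name: sql-migration-review
description: Reviews SQL migration files for destructive operations
  and missing rollbacks. Use when the user asks to check, audit,
  or write a database migration.
---
```

Everything below the frontmatter is the skill body; the description above is the only part the agent sees when deciding whether to load it.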
The result: skills that feel right in theory but fail unpredictably in practice. And there's no systematic way to measure whether a skill is getting better or worse across iterations.
This is the same problem software engineering solved decades ago with automated testing. Skills are software. They need testing too.
What Is Eval-Driven Development?
Eval-driven development is the practice of:
- Writing test cases that define expected behavior
- Running those tests automatically to measure actual vs. expected outcomes
- Using the results to improve iteratively, with quantifiable evidence
For AI agent skills, this means:
- Generating test prompts (should-trigger and should-not-trigger queries)
- Running each prompt with and without the skill
- Comparing outputs to see if the skill actually improves results
- Optimizing the description so the skill triggers on the right prompts
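In miniature, checking triggering accuracy might look like this sketch. The `agent_triggers` function is a naive keyword stand-in for a real agent run, not part of any actual API:

```python
# Triggering-accuracy sketch: each test case is a prompt plus an
# expectation of whether the skill should fire on it.
CASES = [
    {"prompt": "Review this SQL migration for safety", "should_trigger": True},
    {"prompt": "What's the weather in Paris?",         "should_trigger": False},
]

def agent_triggers(prompt: str) -> bool:
    # Placeholder: a naive keyword check standing in for a real agent run
    # that would load the skill and report whether it activated.
    return "migration" in prompt.lower()

def triggering_accuracy(cases) -> float:
    correct = sum(agent_triggers(c["prompt"]) == c["should_trigger"] for c in cases)
    return correct / len(cases)

print(triggering_accuracy(CASES))  # 1.0 for this toy pair
```

The real tool generates 20+ such cases automatically; the shape of the measurement is the same.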
The Skill Creation Lifecycle
opencode-skill-creator implements eval-driven development as a structured lifecycle:
Create → Evaluate → Optimize → Benchmark → Install
  ↑                                │
  └──────────── Iterate ───────────┘
1. Create
Start with an intake interview. The skill-creator asks 3-5 targeted questions:
- What should this skill enable the agent to do?
- When should it trigger?
- What output format is expected?
- What workflow steps must be preserved exactly?
This captures intent before writing any code.
2. Evaluate
Auto-generate eval test sets — realistic prompts categorized as should-trigger or should-not-trigger. Run each test case twice:
- With skill: The agent has the skill loaded
- Without skill: The agent runs without it (baseline)
This measures whether the skill actually improves the output for relevant prompts.
3. Optimize
The description optimization loop treats triggering accuracy as a search problem:
For each iteration (up to 5):
1. Evaluate current description on train set (60%)
2. Analyze failure patterns
3. LLM proposes improved description
4. Evaluate on both train AND test (40%) sets
5. Select best description by test score
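The loop above can be sketched as follows. The `evaluate` and `propose_description` helpers are stand-ins (the real tool backs them with agent runs and an LLM call); only the loop structure — split, propose, score on held-out cases, keep the best — reflects the workflow:

```python
import random

def evaluate(description: str, cases: list) -> float:
    # Stand-in scorer: in reality this runs the agent on every case;
    # here longer descriptions score higher, capped at 1.0.
    return min(len(description) / 100, 1.0)

def propose_description(description: str, train: list) -> str:
    # Stand-in for the LLM proposal step.
    return description + " Use when relevant."

def optimize_description(initial: str, cases: list, iterations: int = 5) -> str:
    random.seed(0)
    random.shuffle(cases)
    split = int(len(cases) * 0.6)                # 60/40 train/test split
    train, test = cases[:split], cases[split:]

    best_desc, best_score = initial, evaluate(initial, test)
    desc = initial
    for _ in range(iterations):
        desc = propose_description(desc, train)  # proposal from train failures
        score = evaluate(desc, test)             # held-out check
        if score > best_score:                   # select best by TEST score
            best_desc, best_score = desc, score
    return best_desc
```

Because selection uses the test score, a candidate that only memorized the training prompts never wins.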
The 60/40 train/test split prevents overfitting. A description that works perfectly on train queries but fails on held-out test queries is overfit — it has memorized specific prompts rather than learning the general pattern.
4. Benchmark
Run the full eval suite across multiple iterations with variance analysis. This answers:
- Is the skill getting consistently better?
- Are there eval cases where the skill never triggers correctly?
- How much variance is there across runs?
The benchmark includes:
- Pass rates (with-skill vs. baseline)
- Timing data (tokens, duration)
- Mean ± standard deviation for each metric
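Reporting mean ± standard deviation across runs is plain statistics; a sketch with Python's stdlib (the pass rates are invented numbers for illustration):

```python
from statistics import mean, stdev

# Pass rates from five benchmark runs of the same eval suite
# (invented numbers for illustration).
with_skill = [0.90, 0.85, 0.95, 0.90, 0.85]
baseline   = [0.60, 0.55, 0.65, 0.60, 0.55]

for label, runs in [("with skill", with_skill), ("baseline", baseline)]:
    print(f"{label}: {mean(runs):.2f} ± {stdev(runs):.2f}")
```

If the two mean ± std intervals barely overlap, as here, the improvement is real rather than run-to-run noise.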
5. Install
Install the final validated skill to project-level (.opencode/skills/) or global (~/.config/opencode/skills/). Only the final version gets installed — eval artifacts stay in the staging directory.
Why This Works
Skills are software
They have inputs (prompts), outputs (agent behavior), and a triggering mechanism (the description). Just like any software, they need testing.
Manual testing doesn't scale
You can test a skill manually in a session, but that's one prompt, one run, no measurement. Eval-driven development gives you 20+ test cases, multiple runs per case, and quantitative metrics.
Description optimization is more impactful than skill content
The description field is the primary triggering mechanism. A perfectly-written skill with a poor description won't trigger. An average skill with an optimized description will trigger reliably. The optimization loop focuses effort where it matters most.
Train/test splits prevent overfitting
If you only test on the same queries you optimize for, descriptions become overfit — they work on those specific prompts but fail on real-world usage. The 60/40 split keeps you honest.
Human review catches what automation misses
The visual eval viewer puts outputs side by side so you can see with your own eyes whether the skill is producing good results. Quantitative metrics tell you if it's triggering correctly; human review tells you if the output is actually useful.
Getting Started
npx opencode-skill-creator install --global
Then ask OpenCode to create or improve a skill. The eval-driven workflow starts automatically.
Apache 2.0, free, open source. Works with any of OpenCode's supported models.
GitHub: https://github.com/antongulin/opencode-skill-creator