Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.
But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.
OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.
High-Level Architecture
The original has this structure:
Anthropic skill-creator/
├── SKILL.md # The skill instructions
├── scripts/
│ ├── run_loop.py # Eval→improve optimization loop
│ ├── improve_description.py # LLM-powered description improvement
│ ├── aggregate_benchmark.py # Benchmark aggregation
│ └── generate_review.py # HTML report generation
└── evals/
└── evals.json # Test query definitions
My version:
opencode-skill-creator/
├── skill-creator/ # The SKILL
│ ├── SKILL.md # Main skill instructions
│ ├── agents/
│ │ ├── grader.md # Assertion evaluation
│ │ ├── analyzer.md # Benchmark analysis
│ │ └── comparator.md # Blind A/B comparison
│ ├── references/
│ │ └── schemas.md # JSON schema definitions
│ └── templates/
│ └── eval-review.html # Eval set review/edit UI
└── plugin/ # The PLUGIN (npm package)
├── package.json # npm package metadata
├── skill-creator.ts # Entry point
└── lib/
├── utils.ts # SKILL.md frontmatter parsing
├── validate.ts # Skill structure validation
├── run-eval.ts # Trigger evaluation
├── improve-description.ts # Description optimization
├── run-loop.ts # Eval→improve loop
├── aggregate.ts # Benchmark aggregation
├── report.ts # HTML report generation
└── review-server.ts # HTTP eval review server
Key difference: the skill provides workflow knowledge; the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.
Decision 1: Scripts → Plugin Tool Calls
Original: Python scripts invoked via CLI
# Run the optimization loop
python -m scripts.run_loop --skill-path /path/to/skill --eval-set evals.json
New: Plugin tool calls in OpenCode sessions
skill_optimize_loop with:
evalSetPath: /path/to/evals.json
skillPath: /path/to/skill
maxIterations: 5
Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.
This is not only cleaner integration; it's also more composable. The agent can interleave tool calls with other work — reading files, asking the user questions, making decisions — between optimization iterations.
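The loop that skill_optimize_loop drives can be sketched in a few lines. This is an illustrative simplification, not the actual run-loop.ts: the runner and improver here are stand-in function types, and the real tool calls out to eval runs and an LLM for description rewrites.

```typescript
// Simplified sketch of the eval -> improve optimization loop.
// EvalRunner and Improver are illustrative stubs, not the real plugin API.

interface EvalResult {
  score: number; // fraction of eval cases where the skill triggered correctly
  description: string;
}

type EvalRunner = (description: string) => EvalResult;
type Improver = (previous: EvalResult) => string;

function optimizeLoop(
  initialDescription: string,
  runEvals: EvalRunner,
  improve: Improver,
  maxIterations = 5,
): EvalResult {
  let best = runEvals(initialDescription);
  for (let i = 0; i < maxIterations; i++) {
    const candidate = runEvals(improve(best));
    if (candidate.score <= best.score) break; // no improvement: stop early
    best = candidate;
  }
  return best;
}
```

The early-exit on a non-improving iteration is what keeps the loop from burning all maxIterations when the description has plateaued.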
Decision 2: Python → TypeScript
The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).
All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.
Dependency tree is minimal: the plugin only depends on @opencode-ai/plugin (peer dependency).
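Dropping pyyaml works because SKILL.md frontmatter is mostly flat key: value pairs. A minimal parser in that spirit might look like this; it's a sketch under that assumption, and the real utils.ts may handle more of YAML (quoting, lists, nesting):

```typescript
// Minimal SKILL.md frontmatter parser: extracts key: value pairs from the
// leading "---" block without a YAML dependency. Illustrative sketch only.

function parseFrontmatter(skillMd: string): Record<string, string> {
  const match = skillMd.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}
```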
Decision 3: Static HTML → HTTP Review Server
Original: Python script generates a static HTML file and opens it in the browser.
generate_review.py --workspace /path/to/workspace
# Opens /path/to/workspace/review.html in browser
New: Plugin starts a local HTTP server that serves an interactive eval viewer.
skill_serve_review with:
workspace: /path/to/workspace
skillName: "my-skill"
The HTTP server approach has advantages:
- Real-time updates when new eval results come in
- Interactive review with save buttons that write feedback back to files
- Previous/next navigation between eval cases
- Benchmark tab with quantitative metrics
- No file management — just open localhost:PORT
The server can also generate static HTML for headless environments:
skill_export_static_review with:
workspace: /path/to/workspace
outputPath: /path/to/report.html
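The pattern behind both tools is the same: render eval cases to HTML, and either serve that HTML live or write it to a file. Here is a sketch of that split using Node's built-in http module; it is not the actual review-server.ts (the real viewer reads artifacts from the workspace on disk and adds navigation, save buttons, and the benchmark tab):

```typescript
import * as http from "node:http";

// Sketch of the review-server pattern: re-render eval cases on every request
// so the page reflects new results without regenerating static files.
// Illustrative only; EvalCase is a stand-in for the real artifact format.

interface EvalCase {
  id: string;
  withSkill: string;
  baseline: string;
}

function renderReview(skillName: string, cases: EvalCase[]): string {
  const rows = cases
    .map((c) => `<section><h2>${c.id}</h2><pre>${c.withSkill}</pre><pre>${c.baseline}</pre></section>`)
    .join("\n");
  return `<html><body><h1>${skillName}</h1>${rows}</body></html>`;
}

function startReviewServer(skillName: string, loadCases: () => EvalCase[], port = 0): http.Server {
  // loadCases runs per request, so freshly saved results show up on reload.
  const server = http.createServer((_req, res) => {
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(renderReview(skillName, loadCases()));
  });
  server.listen(port);
  return server;
}
```

Static export is then just renderReview piped to a file instead of a socket, which is why one codebase can back both skill_serve_review and skill_export_static_review.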
Decision 4: Subagents → Task Tool
Original: Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.
New: OpenCode's Task tool with general and explore subagent types. The SKILL.md instructs the agent to spawn tasks for:
- Running eval cases (with-skill and baseline)
- Grading assertions against outputs
- Analyzing benchmark results
- Blind A/B comparison between skill versions
The agent orchestrates these tasks and synthesizes their results.
Decision 5: Staging Outside the Repo
Original: Evals and benchmarks run alongside the skill in the same directory.
New: Draft skills and eval artifacts go to the system temp directory:
/tmp/opencode-skills/<skill-name>/ # Staged skill
/tmp/opencode-skills/<skill-name>-workspace/ # Eval artifacts
Only the final validated skill gets installed to:
- Project: .opencode/skills/<skill-name>/
- Global: ~/.config/opencode/skills/<skill-name>/
This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.
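Computing those staging locations is a couple of path joins. A sketch, assuming Node's os.tmpdir() rather than a hardcoded /tmp so it also works on macOS and Windows:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Sketch of staging-path computation: drafts and eval artifacts live under
// the system temp dir, never inside the user's repository. Illustrative only.

function stagingPaths(skillName: string): { skill: string; workspace: string } {
  const root = path.join(os.tmpdir(), "opencode-skills");
  return {
    skill: path.join(root, skillName), // staged draft skill
    workspace: path.join(root, `${skillName}-workspace`), // eval artifacts
  };
}
```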
Decision 6: Strict Review Workflow
Added a "review workflow guard" that enforces paired comparison data by default:
- skill_serve_review and skill_export_static_review require each eval directory to include both with_skill AND a baseline (without_skill or old_skill)
- If pairs are missing, the tools fail fast with a clear list of what's missing
- Override with allowPartial: true only when intentionally reviewing incomplete data
This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.
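The guard itself is simple to express. An illustrative sketch, assuming each eval case directory can be summarized as a name plus its output variants (the real check in the plugin walks the workspace on disk):

```typescript
// Sketch of the paired-comparison guard: every eval case needs a with_skill
// output plus at least one baseline (without_skill or old_skill).

type EvalDir = { name: string; variants: string[] };

function findUnpaired(dirs: EvalDir[]): string[] {
  return dirs
    .filter(
      (d) =>
        !d.variants.includes("with_skill") ||
        !d.variants.some((v) => v === "without_skill" || v === "old_skill"),
    )
    .map((d) => d.name);
}

function assertPaired(dirs: EvalDir[], allowPartial = false): void {
  const missing = findUnpaired(dirs);
  if (missing.length > 0 && !allowPartial) {
    // Fail fast with the full list, so the user sees every broken case at once.
    throw new Error(`Eval dirs missing paired outputs: ${missing.join(", ")}`);
  }
}
```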
What I Learned
1. Skills are software
They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.
2. Description optimization matters more than skill content
The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.
3. Train/test splits prevent overfitting
Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.
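A 60/40 split can be as simple as a seeded Fisher-Yates shuffle followed by a cut. This is a sketch of the idea, not necessarily how the plugin does it; mulberry32 is a common small seeded PRNG, used here so the split is stable across runs:

```typescript
// Sketch of a deterministic 60/40 train/test split over eval queries.
// Seeded PRNG keeps the split reproducible; illustrative only.

function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function splitEvalSet<T>(cases: T[], trainFraction = 0.6, seed = 42): { train: T[]; test: T[] } {
  const rand = mulberry32(seed);
  const shuffled = [...cases];
  // Fisher-Yates shuffle driven by the seeded PRNG
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}
```

Optimize the description against train, report numbers from test; the test queries must never feed back into the improvement loop.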
4. Human-in-the-loop review is essential
Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.
5. Plugin architecture enables composition
Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.
Try It
npx opencode-skill-creator install --global
Apache 2.0, free, open source. Works with any of OpenCode's supported models.
GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator