Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.
But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.
OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.
High-Level Architecture
The original has this structure:
Anthropic skill-creator/
├── SKILL.md # The skill instructions
├── scripts/
│ ├── run_loop.py # Eval→improve optimization loop
│ ├── improve_description.py # LLM-powered description improvement
│ ├── aggregate_benchmark.py # Benchmark aggregation
│ └── generate_review.py # HTML report generation
└── evals/
└── evals.json # Test query definitions
My version:
opencode-skill-creator/
├── skill-creator/ # The SKILL
│ ├── SKILL.md # Main skill instructions
│ ├── agents/
│ │ ├── grader.md # Assertion evaluation
│ │ ├── analyzer.md # Benchmark analysis
│ │ └── comparator.md # Blind A/B comparison
│ ├── references/
│ │ └── schemas.md # JSON schema definitions
│ └── templates/
│ └── eval-review.html # Eval set review/edit UI
└── plugin/ # The PLUGIN (npm package)
├── package.json # npm package metadata
├── skill-creator.ts # Entry point
└── lib/
├── utils.ts # SKILL.md frontmatter parsing
├── validate.ts # Skill structure validation
├── run-eval.ts # Trigger evaluation
├── improve-description.ts # Description optimization
├── run-loop.ts # Eval→improve loop
├── aggregate.ts # Benchmark aggregation
├── report.ts # HTML report generation
└── review-server.ts # HTTP eval review server
Key difference: the skill provides workflow knowledge; the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.
Decision 1: Scripts → Plugin Tool Calls
Original: Python scripts invoked via CLI
# Run the optimization loop
python -m scripts.run_loop --skill-path /path/to/skill --eval-set evals.json
New: Plugin tool calls in OpenCode sessions
skill_optimize_loop with:
evalSetPath: /path/to/evals.json
skillPath: /path/to/skill
maxIterations: 5
Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.
This is not only cleaner integration; it's also more composable. The agent can interleave tool calls with other work — reading files, asking the user questions, making decisions — between optimization iterations.
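The loop that skill_optimize_loop drives can be sketched in a few lines. This is an illustrative simplification, not the actual run-loop.ts: the runner and improver here are stand-in function types, and the real tool calls out to eval runs and an LLM for description rewrites.

```typescript
// Simplified sketch of the eval -> improve optimization loop.
// EvalRunner and Improver are illustrative stubs, not the real plugin API.

interface EvalResult {
  score: number; // fraction of eval cases where the skill triggered correctly
  description: string;
}

type EvalRunner = (description: string) => EvalResult;
type Improver = (previous: EvalResult) => string;

function optimizeLoop(
  initialDescription: string,
  runEvals: EvalRunner,
  improve: Improver,
  maxIterations = 5,
): EvalResult {
  let best = runEvals(initialDescription);
  for (let i = 0; i < maxIterations; i++) {
    const candidate = runEvals(improve(best));
    if (candidate.score <= best.score) break; // no improvement: stop early
    best = candidate;
  }
  return best;
}
```

The early-exit on a non-improving iteration is what keeps the loop from burning all maxIterations when the description has plateaued.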
Decision 2: Python → TypeScript
The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).
All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.
Dependency tree is minimal: the plugin only depends on @opencode-ai/plugin (peer dependency).
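Dropping pyyaml works because SKILL.md frontmatter is mostly flat key: value pairs. A minimal parser in that spirit might look like this; it's a sketch under that assumption, and the real utils.ts may handle more of YAML (quoting, lists, nesting):

```typescript
// Minimal SKILL.md frontmatter parser: extracts key: value pairs from the
// leading "---" block without a YAML dependency. Illustrative sketch only.

function parseFrontmatter(skillMd: string): Record<string, string> {
  const match = skillMd.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return {};
  const fields: Record<string, string> = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}
```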
Decision 3: Static HTML → HTTP Review Server
Original: Python script generates a static HTML file and opens it in the browser.
generate_review.py --workspace /path/to/workspace
# Opens /path/to/workspace/review.html in browser
New: Plugin starts a local HTTP server that serves an interactive eval viewer.
skill_serve_review with:
workspace: /path/to/workspace
skillName: "my-skill"
The HTTP server approach has advantages:
- Real-time updates when new eval results come in
- Interactive review with save buttons that write feedback back to files
- Previous/next navigation between eval cases
- Benchmark tab with quantitative metrics
- No file management — just open localhost:PORT
The server can also generate static HTML for headless environments:
skill_export_static_review with:
workspace: /path/to/workspace
outputPath: /path/to/report.html
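The pattern behind both tools is the same: render eval cases to HTML, and either serve that HTML live or write it to a file. Here is a sketch of that split using Node's built-in http module; it is not the actual review-server.ts (the real viewer reads artifacts from the workspace on disk and adds navigation, save buttons, and the benchmark tab):

```typescript
import * as http from "node:http";

// Sketch of the review-server pattern: re-render eval cases on every request
// so the page reflects new results without regenerating static files.
// Illustrative only; EvalCase is a stand-in for the real artifact format.

interface EvalCase {
  id: string;
  withSkill: string;
  baseline: string;
}

function renderReview(skillName: string, cases: EvalCase[]): string {
  const rows = cases
    .map((c) => `<section><h2>${c.id}</h2><pre>${c.withSkill}</pre><pre>${c.baseline}</pre></section>`)
    .join("\n");
  return `<html><body><h1>${skillName}</h1>${rows}</body></html>`;
}

function startReviewServer(skillName: string, loadCases: () => EvalCase[], port = 0): http.Server {
  // loadCases runs per request, so freshly saved results show up on reload.
  const server = http.createServer((_req, res) => {
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end(renderReview(skillName, loadCases()));
  });
  server.listen(port);
  return server;
}
```

Static export is then just renderReview piped to a file instead of a socket, which is why one codebase can back both skill_serve_review and skill_export_static_review.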
Decision 4: Subagents → Task Tool
Original: Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.
New: OpenCode's Task tool with general and explore subagent types. The SKILL.md instructs the agent to spawn tasks for:
- Running eval cases (with-skill and baseline)
- Grading assertions against outputs
- Analyzing benchmark results
- Blind A/B comparison between skill versions
The agent orchestrates these tasks and synthesizes their results.
Decision 5: Staging Outside the Repo
Original: Evals and benchmarks run alongside the skill in the same directory.
New: Draft skills and eval artifacts go to the system temp directory:
/tmp/opencode-skills/<skill-name>/ # Staged skill
/tmp/opencode-skills/<skill-name>-workspace/ # Eval artifacts
Only the final validated skill gets installed to:
- Project: .opencode/skills/<skill-name>/
- Global: ~/.config/opencode/skills/<skill-name>/
This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.
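Computing those staging locations is a couple of path joins. A sketch, assuming Node's os.tmpdir() rather than a hardcoded /tmp so it also works on macOS and Windows:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Sketch of staging-path computation: drafts and eval artifacts live under
// the system temp dir, never inside the user's repository. Illustrative only.

function stagingPaths(skillName: string): { skill: string; workspace: string } {
  const root = path.join(os.tmpdir(), "opencode-skills");
  return {
    skill: path.join(root, skillName), // staged draft skill
    workspace: path.join(root, `${skillName}-workspace`), // eval artifacts
  };
}
```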
Decision 6: Strict Review Workflow
Added a "review workflow guard" that enforces paired comparison data by default:
- skill_serve_review and skill_export_static_review require each eval directory to include both with_skill AND a baseline (without_skill or old_skill)
- If pairs are missing, the tools fail fast with a clear list of what's missing
- Override with allowPartial: true only when intentionally reviewing incomplete data
This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.
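The guard itself is simple to express. An illustrative sketch, assuming each eval case directory can be summarized as a name plus its output variants (the real check in the plugin walks the workspace on disk):

```typescript
// Sketch of the paired-comparison guard: every eval case needs a with_skill
// output plus at least one baseline (without_skill or old_skill).

type EvalDir = { name: string; variants: string[] };

function findUnpaired(dirs: EvalDir[]): string[] {
  return dirs
    .filter(
      (d) =>
        !d.variants.includes("with_skill") ||
        !d.variants.some((v) => v === "without_skill" || v === "old_skill"),
    )
    .map((d) => d.name);
}

function assertPaired(dirs: EvalDir[], allowPartial = false): void {
  const missing = findUnpaired(dirs);
  if (missing.length > 0 && !allowPartial) {
    // Fail fast with the full list, so the user sees every broken case at once.
    throw new Error(`Eval dirs missing paired outputs: ${missing.join(", ")}`);
  }
}
```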
What I Learned
1. Skills are software
They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.
2. Description optimization matters more than skill content
The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.
3. Train/test splits prevent overfitting
Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.
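A 60/40 split can be as simple as a seeded Fisher-Yates shuffle followed by a cut. This is a sketch of the idea, not necessarily how the plugin does it; mulberry32 is a common small seeded PRNG, used here so the split is stable across runs:

```typescript
// Sketch of a deterministic 60/40 train/test split over eval queries.
// Seeded PRNG keeps the split reproducible; illustrative only.

function mulberry32(seed: number): () => number {
  return () => {
    seed |= 0;
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function splitEvalSet<T>(cases: T[], trainFraction = 0.6, seed = 42): { train: T[]; test: T[] } {
  const rand = mulberry32(seed);
  const shuffled = [...cases];
  // Fisher-Yates shuffle driven by the seeded PRNG
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.round(shuffled.length * trainFraction);
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}
```

Optimize the description against train, report numbers from test; the test queries must never feed back into the improvement loop.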
4. Human-in-the-loop review is essential
Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.
5. Plugin architecture enables composition
Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.
Try It
npx opencode-skill-creator install --global
Apache 2.0, free, open source. Works with any of OpenCode's supported models.
GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator