Anton Gulin
Porting Anthropic's Skill Creator from Python to TypeScript

Anthropic's skill-creator for Claude Code is excellent. It introduced eval-driven development for AI agent skills — write a skill, test it with evals, optimize the description, benchmark the results. The methodology is proven.

But it has a limitation: it only works with Claude Code, and skill access requires a paid subscription ($20/month minimum). Free tier users can't use it at all.

OpenCode is free and supports 300+ models. I wanted to bring the same methodology to OpenCode users — for free, with no paywall.

High-Level Architecture

The original has this structure:

Anthropic skill-creator/
├── SKILL.md                    # The skill instructions
├── scripts/
│   ├── run_loop.py             # Eval→improve optimization loop
│   ├── improve_description.py  # LLM-powered description improvement
│   ├── aggregate_benchmark.py  # Benchmark aggregation
│   └── generate_review.py      # HTML report generation
└── evals/
    └── evals.json              # Test query definitions

My version:

opencode-skill-creator/
├── skill-creator/              # The SKILL
│   ├── SKILL.md                # Main skill instructions
│   ├── agents/
│   │   ├── grader.md           # Assertion evaluation
│   │   ├── analyzer.md         # Benchmark analysis
│   │   └── comparator.md       # Blind A/B comparison
│   ├── references/
│   │   └── schemas.md          # JSON schema definitions
│   └── templates/
│       └── eval-review.html    # Eval set review/edit UI
└── plugin/                     # The PLUGIN (npm package)
    ├── package.json            # npm package metadata
    ├── skill-creator.ts        # Entry point
    └── lib/
        ├── utils.ts                # SKILL.md frontmatter parsing
        ├── validate.ts             # Skill structure validation
        ├── run-eval.ts             # Trigger evaluation
        ├── improve-description.ts  # Description optimization
        ├── run-loop.ts             # Eval→improve loop
        ├── aggregate.ts            # Benchmark aggregation
        ├── report.ts               # HTML report generation
        └── review-server.ts        # HTTP eval review server

Key difference: the skill provides workflow knowledge; the plugin provides executable tools. The agent orchestrates everything by calling tools during its session.

Decision 1: Scripts → Plugin Tool Calls

Original: Python scripts invoked via CLI

# Run the optimization loop
python -m scripts.run_loop --skill-path /path/to/skill --eval-set evals.json

New: Plugin tool calls in OpenCode sessions

skill_optimize_loop with:
  evalSetPath: /path/to/evals.json
  skillPath: /path/to/skill
  maxIterations: 5

Why: OpenCode's plugin architecture lets agents call custom tools directly. No subprocess management, no script execution, no Python environment. The agent calls the tool inline and gets results back in the session.

This isn't just cleaner integration; it's also more composable. The agent can interleave tool calls with other work — read files, ask the user questions, make decisions — between optimization iterations.
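To make the tool-call model concrete, here is a minimal sketch of what a plugin-exposed tool could look like. This is illustrative only: the actual `@opencode-ai/plugin` registration API, argument schema, and the real `run-loop.ts` logic are not shown, and the names `skillOptimizeLoop`, `ToolArgs`, and `ToolResult` are hypothetical.

```typescript
// Hypothetical shape of a plugin tool; the real @opencode-ai/plugin
// API and the real run-loop.ts implementation may differ.
interface ToolArgs {
  evalSetPath: string;
  skillPath: string;
  maxIterations?: number;
}

interface ToolResult {
  iterations: number;
  finalScore: number;
}

// The tool body runs in-process: no subprocess management, no Python
// environment. Results come straight back into the agent's session.
async function skillOptimizeLoop(args: ToolArgs): Promise<ToolResult> {
  const maxIterations = args.maxIterations ?? 5;
  let iterations = 0;
  let score = 0;
  while (iterations < maxIterations && score < 1) {
    iterations++;
    // eval → improve step would go here; stubbed for illustration
    score = Math.min(1, score + 0.3);
  }
  return { iterations, finalScore: score };
}
```

Because it is just an async function, the agent can call it, inspect the result, and decide what to do next without leaving the session.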

Decision 2: Python → TypeScript

The original requires Python 3.11+ and pyyaml. My version requires nothing beyond Node.js (which OpenCode users already have).

All pipeline components — validation, eval, description improvement, loop runner, aggregation, report generation, review server — are TypeScript modules in the plugin. ~256kB unpacked on npm.

Dependency tree is minimal: the plugin only depends on @opencode-ai/plugin (peer dependency).
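As an example of why pyyaml isn't needed: SKILL.md frontmatter is flat `key: value` pairs, which a few lines of TypeScript can parse. This is a sketch of the idea, not the actual `utils.ts` code.

```typescript
// Illustrative sketch: parse the flat `key: value` frontmatter a SKILL.md
// typically carries, with no YAML dependency. The real utils.ts may differ.
function parseFrontmatter(source: string): Record<string, string> {
  const match = source.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return {};
  const fields: Record<string, string> = {};
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines without a key
    fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}
```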

Decision 3: Static HTML → HTTP Review Server

Original: Python script generates a static HTML file and opens it in the browser.

generate_review.py --workspace /path/to/workspace
# Opens /path/to/workspace/review.html in browser

New: Plugin starts a local HTTP server that serves an interactive eval viewer.

skill_serve_review with:
  workspace: /path/to/workspace
  skillName: "my-skill"

The HTTP server approach has advantages:

  • Real-time updates when new eval results come in
  • Interactive review with save buttons that write feedback back to files
  • Previous/next navigation between eval cases
  • Benchmark tab with quantitative metrics
  • No file management — just open localhost:PORT

The server can also generate static HTML for headless environments:

skill_export_static_review with:
  workspace: /path/to/workspace
  outputPath: /path/to/report.html
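The serving mechanism itself is just Node's built-in `http` module. Here is a stripped-down sketch of the idea — the real `review-server.ts` serves a full interactive UI with save buttons and navigation; this only shows why live serving beats a static file.

```typescript
import { createServer, type Server } from "node:http";

// Illustrative sketch of the review-server idea: serve eval results over
// localhost instead of writing a static file. Type names are made up here.
type EvalResults = Record<string, { withSkill: string; baseline: string }>;

function createReviewServer(results: EvalResults): Server {
  return createServer((req, res) => {
    if (req.url === "/api/results") {
      // Re-serialized on every request, so new eval results show up
      // immediately without regenerating a report file.
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify(results));
      return;
    }
    res.writeHead(200, { "Content-Type": "text/html" });
    res.end("<h1>Eval review</h1><p>Results at /api/results</p>");
  });
}
```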

Decision 4: Subagents → Task Tool

Original: Claude Code's built-in subagent concept, where the skill directly spawns sub-agents.

New: OpenCode's Task tool with general and explore subagent types. The SKILL.md instructs the agent to spawn tasks for:

  • Running eval cases (with-skill and baseline)
  • Grading assertions against outputs
  • Analyzing benchmark results
  • Blind A/B comparison between skill versions

The agent orchestrates these tasks and synthesizes their results.
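The blind A/B comparison deserves a note: the comparator must not know which output came from which skill version, or it can't judge fairly. A sketch of the blinding logic (function and type names are illustrative, not the plugin's actual code; the coin flip is injected so it can be tested deterministically):

```typescript
// Illustrative sketch of blinding for A/B comparison: the comparator sees
// anonymous labels, and the mapping is revealed only after judging.
interface BlindPair {
  presented: { A: string; B: string };
  reveal: (winner: "A" | "B") => "old" | "new";
}

function blindPair(oldOutput: string, newOutput: string, coin: boolean): BlindPair {
  const aIsNew = coin; // in production this would be a random flip
  return {
    presented: aIsNew
      ? { A: newOutput, B: oldOutput }
      : { A: oldOutput, B: newOutput },
    // Map the winning label back to the version it hid.
    reveal: (winner) => ((winner === "A") === aIsNew ? "new" : "old"),
  };
}
```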

Decision 5: Staging Outside the Repo

Original: Evals and benchmarks run alongside the skill in the same directory.

New: Draft skills and eval artifacts go to the system temp directory:

/tmp/opencode-skills/<skill-name>/            # Staged skill
/tmp/opencode-skills/<skill-name>-workspace/  # Eval artifacts

Only the final validated skill gets installed to:

  • Project: .opencode/skills/<skill-name>/
  • Global: ~/.config/opencode/skills/<skill-name>/

This keeps the user's repository clean during skill development. Evals create a lot of artifacts (outputs, timing data, grading results, benchmark files) that you don't want mixed into your project.
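Deriving those staging paths is a one-liner with Node's `os` and `path` modules. A sketch under the layout described above (the helper name `stagingPaths` is mine, not the plugin's):

```typescript
import { tmpdir } from "node:os";
import { join } from "node:path";

// Sketch of the staging-path convention: drafts and eval artifacts live
// under the system temp directory, never inside the user's repository.
function stagingPaths(skillName: string): { skill: string; workspace: string } {
  const root = join(tmpdir(), "opencode-skills");
  return {
    skill: join(root, skillName),                    // staged draft skill
    workspace: join(root, `${skillName}-workspace`), // eval artifacts
  };
}
```

Using `os.tmpdir()` instead of a hardcoded `/tmp` also keeps the convention portable across platforms.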

Decision 6: Strict Review Workflow

Added a "review workflow guard" that enforces paired comparison data by default:

  • skill_serve_review and skill_export_static_review require each eval directory to include both with_skill AND baseline (without_skill or old_skill)
  • If pairs are missing, the tools fail fast with a clear list of what's missing
  • Override with allowPartial: true only when intentionally reviewing incomplete data

This prevents a common mistake: reviewing eval results without a baseline comparison, which makes it impossible to judge whether the skill actually improved anything.
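The guard logic reduces to a simple check per eval directory. A sketch over an in-memory listing (the real guard inspects the workspace on disk; the function name `checkPairs` is illustrative):

```typescript
// Sketch of the review-workflow guard: every eval directory must contain
// with_skill plus a baseline (without_skill or old_skill), unless the
// caller explicitly opts into partial data.
type EvalDirs = Record<string, string[]>; // dir name -> entries present

function checkPairs(dirs: EvalDirs, allowPartial = false): void {
  const missing: string[] = [];
  for (const [name, entries] of Object.entries(dirs)) {
    const hasBaseline =
      entries.includes("without_skill") || entries.includes("old_skill");
    if (!entries.includes("with_skill") || !hasBaseline) missing.push(name);
  }
  if (missing.length > 0 && !allowPartial) {
    // Fail fast with a clear list of unpaired eval directories.
    throw new Error(`Unpaired eval dirs: ${missing.join(", ")}`);
  }
}
```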

What I Learned

1. Skills are software

They need testing, not just writing. The eval-driven approach catches issues you'd never find manually — like a description that triggers on 80% of relevant queries but also fires on 30% of irrelevant ones.

2. Description optimization matters more than skill content

The description field is the primary triggering mechanism. A well-optimized description on an average skill outperforms a poor description on a perfect skill. This is counterintuitive but matches the data.

3. Train/test splits prevent overfitting

Same lesson as ML hyperparameter tuning. If you only evaluate on the queries you optimize for, descriptions become overfit. The 60/40 split keeps you honest about generalization.
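Mechanically, a 60/40 holdout is trivial; the discipline is in never letting the optimizer see the test queries. A sketch (shuffling is omitted for clarity — in practice you'd shuffle before splitting):

```typescript
// Sketch of an ML-style 60/40 train/test split over eval queries:
// optimize the description against `train`, report results on `test`.
function splitEvals<T>(cases: T[], trainFraction = 0.6): { train: T[]; test: T[] } {
  const cut = Math.round(cases.length * trainFraction);
  return { train: cases.slice(0, cut), test: cases.slice(cut) };
}
```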

4. Human-in-the-loop review is essential

Automation measures triggering accuracy, but humans judge output quality. The visual eval viewer puts outputs side by side so you can see whether the skill produces genuinely useful results, not just correctly-triggered results.

5. Plugin architecture enables composition

Having eval, benchmarking, and review as separate tool calls (instead of a monolithic script) means the agent can interleave them with other work. It can ask the user a question between iterations, read relevant files during eval, or skip steps the user doesn't need.

Try It

npx opencode-skill-creator install --global

Apache 2.0, free, open source. Works with any of OpenCode's supported models.

GitHub: https://github.com/antongulin/opencode-skill-creator
npm: https://www.npmjs.com/package/opencode-skill-creator



Top comments (1)

Yuki Ideura

Hey — I came across your TypeScript port of the skill-creator for OpenCode, and I really like the direction you took with eval-driven development and the plugin/tool-call architecture.

The separation between SKILL (workflow knowledge) and PLUGIN (execution layer) is especially clean — and the HTTP review server is a big usability upgrade over static reports.

I’ve been working on AI agents and workflow automation (tool orchestration, eval pipelines, voice/LLM systems), so this is very much in my lane.

A couple ideas I thought could push this further:

  • adding distributed / parallel eval execution for larger benchmark sets
  • improving eval dataset management (versioning + train/test splits tooling)
  • integrating external model providers or routing strategies for comparison runs
  • extending the review server with collaborative feedback or annotation layers

If you’re open to it, I’d love to contribute — either by building one of these pieces or helping refine the eval pipeline.

Let me know what direction you’re heading next.🤠