TL;DR
Claude Code Skills are custom extensions that automate and optimize specific developer workflows in Claude. Use the Skill Creator system to define your skill's purpose, draft the SKILL.md, create test cases, run benchmarks, and iterate until the skill triggers reliably and performs well.
Introduction
If you use Claude Code daily, you likely repeat certain sequences: initializing projects, running tests, formatting outputs, and so on. Instead of explaining your workflow every time, Claude Code Skills let you encode these steps once and reuse them indefinitely. The Skill Creator system provides an automated, structured pathway for building, evaluating, and refining these custom skills for your workflow.
This guide covers the end-to-end process: skill anatomy, creation workflow, evaluation, optimization, and practical examples from Anthropic's official skills repository.
💡 Tip: Building API-related skills? Apidog integrates seamlessly, letting you test endpoints, validate responses, and generate docs in a unified skill workflow.
What Are Claude Code Skills?
Claude Code Skills are markdown-based instruction sets that extend Claude's built-in capabilities. Treat them like custom plugins for repeatable developer tasks.
The Skill System Architecture
Skills use a three-level loading system:
- Metadata (~100 words): Name and description, always in context
- SKILL.md body (<500 lines): Core instructions, loaded when skill triggers
- Bundled resources (unlimited): Scripts, references, assets loaded on demand
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/
    ├── references/
    └── assets/
When Skills Trigger
Skills appear in Claude's available_skills list. Claude consults a skill when a request matches its description; simple tasks Claude can handle directly won't invoke one, so only complex, multi-step workflows trigger skills reliably.
Real-World Examples
| Skill | Purpose | Key Features |
|---|---|---|
| skill-creator | Create new skills | Test generation, benchmark evaluation, description tuning |
| mcp-builder | Build MCP servers | Python/Node templates, evaluation framework |
| docx | Generate Word documents | python-docx scripts, templates, styling guide |
| pdf | Extract/manipulate PDFs | Form handling, extraction, reference docs |
| frontend-design | Build web interfaces | Component library, Tailwind, accessibility checks |
The Skill Creation Workflow
Follow this systematic loop:
- Capture intent: Define the skill's purpose
- Write a draft: Create the SKILL.md file
- Create test cases: Define realistic prompts
- Run evaluations: Execute with and without the skill
- Review results: Analyze feedback and metrics
- Iterate: Refine based on findings
- Optimize description: Improve trigger accuracy
- Package: Distribute as a .skill file
Step 1: Capture Intent
Clarify what the skill should do. Extract patterns from your workflow history.
Key questions:
- What outcome should the skill achieve?
- When should it trigger (user phrases/contexts)?
- What output formats are expected?
- Are test cases needed? (Yes for verifiable outputs.)
Example: API Testing Skill
Intent: Help developers test REST APIs systematically
Trigger: User mentions API testing, endpoints, REST, GraphQL, validation
Output: Test reports with pass/fail, curl commands, response comparisons
Test cases: Yes
Step 2: Write the SKILL.md File
Every skill requires a SKILL.md with YAML frontmatter and markdown instructions.
Example Anatomy:
---
name: api-tester
description: How to test REST APIs systematically. Use when users mention API testing, endpoints, REST, GraphQL, or want to validate API responses. Make sure to suggest this skill whenever testing is involved.
compatibility: Requires curl or HTTP client tools
---
# API Tester Skill
## Core Workflow
1. **Understand the endpoint**
2. **Design test cases**
3. **Execute tests** (curl or Apidog)
4. **Validate responses**
5. **Report results**
Best Practices:
- Keep SKILL.md under 500 lines; move details to references/
- Explain reasoning, not just steps
- Use imperative statements ("Always validate status code first")
- Include input/output examples
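These limits are easy to check mechanically before packaging. A minimal sketch, assuming SKILL.md has been read into a string (the function name and checks are ours, not part of the Skill Creator):

```python
import re

def check_skill_md(text: str, max_lines: int = 500) -> list[str]:
    """Return a list of problems found in a SKILL.md string."""
    problems = []
    # Frontmatter must open and close with '---' lines and carry name/description
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        problems.append("missing YAML frontmatter")
    else:
        for field in ("name:", "description:"):
            if field not in match.group(1):
                problems.append(f"frontmatter lacks {field!r}")
    if len(text.splitlines()) > max_lines:
        problems.append(f"file exceeds {max_lines} lines; move detail to references/")
    return problems
```

Running it over every SKILL.md in a repository gives a cheap pre-flight lint.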
Step 3: Create Test Cases
Draft 2-3 realistic test prompts and store them in evals/evals.json.
Example Format:
{
"skill_name": "api-tester",
"evals": [
{
"id": 1,
"prompt": "Test the /users endpoint on api.example.com - it needs a Bearer token and returns a list of users with id, name, email fields",
"expected_output": "Test report with at least 5 test cases including auth failure, success, and pagination tests",
"files": []
},
...
]
}
Good test prompts are specific, contextual, and describe expected behavior.
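Because a malformed evals.json wastes an entire evaluation run, it is worth validating the structure up front. A hedged sketch using the field names from the example above:

```python
import json

REQUIRED_KEYS = {"id", "prompt", "expected_output", "files"}

def parse_evals(text: str) -> list[dict]:
    """Parse an evals.json document and verify every eval has the expected keys."""
    data = json.loads(text)
    for case in data["evals"]:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"eval {case.get('id')} is missing: {sorted(missing)}")
    return data["evals"]
```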
Step 4: Run Evaluations
For each test case, run two parallel subagents:
- With skill: Uses your custom skill
- Baseline: No skill (or previous version)
Workspace structure:
api-tester-workspace/
├── iteration-1/
│   ├── eval-0-auth-failure/
│   │   ├── with_skill/
│   │   ├── without_skill/
│   │   └── eval_metadata.json
│   ├── benchmark.json
│   └── benchmark.md
...
Capture timing:
Store total_tokens and duration_ms in timing.json for each run.
Step 5: Draft Assertions
While runs complete, define quantitative assertions in eval_metadata.json.
Example:
{
"assertions": [
{
"name": "includes_auth_failure_test",
"description": "Test report includes at least one authentication failure test case",
"type": "contains",
"value": "401"
},
...
]
}
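Grading a "contains" assertion is mechanical; here is a minimal sketch of that piece (the Skill Creator's real grader is an agent, this covers only the deterministic part):

```python
def grade_output(output: str, assertions: list[dict]) -> dict:
    """Evaluate assertions against one run's output. Supports the 'contains'
    type; anything else is left as None for an agent to judge."""
    results = {}
    for a in assertions:
        if a["type"] == "contains":
            results[a["name"]] = a["value"] in output
        else:
            results[a["name"]] = None
    return results
```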
Step 6: Grade and Aggregate
After runs finish:
- Grade runs: Use a grader agent to check assertions; save results to grading.json.
- Aggregate: Run the aggregation script for benchmarks.
Example aggregation command:
python -m scripts.aggregate_benchmark api-tester-workspace/iteration-1 --skill-name api-tester
Analyze: Look for non-discriminating assertions, flaky evals, or efficiency issues.
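A non-discriminating assertion is one with the same outcome in every run, with and without the skill, so it adds no signal. A sketch of that check over per-run grading results (the data shape is assumed):

```python
def find_non_discriminating(gradings: list[dict]) -> list[str]:
    """Given one {assertion_name: passed} dict per run, return the names
    whose outcome never varies across runs."""
    names = gradings[0].keys()
    return [n for n in names if len({g[n] for g in gradings}) == 1]
```

Assertions this flags are candidates for tightening or removal in the next iteration.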
Step 7: Launch the Eval Viewer
Visualize outputs and metrics in a browser.
Generate viewer:
nohup python /path/to/skill-creator/eval-viewer/generate_review.py \
api-tester-workspace/iteration-1 \
--skill-name "api-tester" \
--benchmark api-tester-workspace/iteration-1/benchmark.json \
> /dev/null 2>&1 &
VIEWER_PID=$!
For later iterations, add --previous-workspace.
Headless environments: Use --static to generate a standalone HTML file.
Step 8: Read Feedback and Iterate
After user review, read feedback.json and focus improvements on areas with actionable comments.
Iteration loop:
- Apply improvements
- Rerun test cases
- Relaunch viewer with previous iteration
- Repeat until satisfied
Kill the viewer when finished:
kill $VIEWER_PID 2>/dev/null
Step 9: Optimize the Skill Description
The description in SKILL.md is vital for triggering accuracy.
Generate trigger eval queries: Create at least 20, mixing should-trigger and should-not-trigger cases.
Run optimization:
python -m scripts.run_loop \
--eval-set /path/to/trigger-eval.json \
--skill-path /path/to/api-tester \
--model claude-sonnet-4-6 \
--max-iterations 5 \
--verbose
Use the best_description from the output to update SKILL.md.
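What the loop optimizes is trigger accuracy: the fraction of queries where the skill fired exactly when it should have. A sketch of that metric (the per-query result format is our assumption):

```python
def trigger_accuracy(results: list[dict]) -> float:
    """Each result is {'should_trigger': bool, 'did_trigger': bool};
    accuracy is the fraction of queries where the two agree."""
    correct = sum(r["should_trigger"] == r["did_trigger"] for r in results)
    return correct / len(results)
```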
Step 10: Package and Distribute
Package your skill with:
python -m scripts.package_skill /path/to/api-tester
Distribute the resulting .skill file. Users install by placing it in their skills directory or using Claude's install command.
Common Skill Creation Mistakes
Mistake 1: Vague Description
# Bad
description: A skill for working with APIs
# Good
description: How to test REST APIs systematically. Use when users mention API testing, endpoints, REST, GraphQL, or want to validate API responses...
Mistake 2: Overly Restrictive Instructions
# Bad
ALWAYS use this exact format. NEVER deviate.
# Good
Use this format because it ensures stakeholders can quickly find the information they need. Adapt if your audience has different needs.
Mistake 3: Skipping Test Cases
Even for subjective skills, run a few qualitative checks.
Mistake 4: Ignoring Timing Data
Optimize for efficiency, not just correctness.
Mistake 5: Not Bundling Repeated Scripts
Bundle helper scripts in scripts/ to avoid duplication.
Real-World Skill Examples
MCP Builder Skill
Purpose: Build MCP servers
Features: Python/Node templates, evaluation framework, best practices
mcp-builder/
├── SKILL.md
├── reference/
│   ├── mcp_best_practices.md
│   ├── python_mcp_server.md
│   └── node_mcp_server.md
└── evaluation/
    └── evaluation.md
Docx Skill
Purpose: Generate Word docs
Features: python-docx scripts, templates, styling guide
Workflow: Gather requirements → select template → generate → validate
Frontend Design Skill
Purpose: Build web interfaces
Features: Component library, Tailwind, accessibility checks
Core workflow in SKILL.md, details in references/
Testing Your Skill with Apidog
If you're building API-related skills, Apidog integrates directly into the workflow.
Example: API Testing Skill Integration
## Running API Tests
Use Apidog for systematic testing:
1. Import the OpenAPI spec into Apidog
2. Generate test cases from the spec
3. Run tests and export results as JSON
4. Validate responses against expected schemas
For custom assertions, use Apidog's scripting feature.
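Step 4 (validating responses against expected schemas) can be sketched as a plain helper you might bundle in scripts/; the id/name/email fields echo the earlier test prompt, and the function is hypothetical, not an Apidog API:

```python
def validate_user(record: dict) -> list[str]:
    """Check one /users item against the expected schema:
    id (int), name (str), email (str)."""
    expected = {"id": int, "name": str, "email": str}
    errors = []
    for field, ftype in expected.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    return errors
```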
Bundle Apidog Scripts:
api-tester/
├── SKILL.md
└── scripts/
    ├── run-apidog-tests.py
    └── generate-report.py
This standardizes future runs and ensures repeatability.
Conclusion
Claude Code Skills let you encode and automate custom workflows in Claude. The Skill Creator system provides a repeatable, test-driven process:
- Define intent
- Draft SKILL.md with clear, example-driven instructions
- Create realistic test cases
- Run evaluations (with/without skill)
- Analyze feedback and metrics
- Iterate improvements
- Optimize description for reliable triggering
- Package and distribute as a .skill file
FAQ
How long does it take to create a skill?
- Simple skills: 15β30 minutes
- Complex skills with references/scripts: 2β3 hours (including evaluation)
Do I need to write test cases for every skill?
- Only for skills with objectively verifiable outputs (code, file transforms, data extraction). Subjective skills (writing, design) can be checked qualitatively.
What if my skill doesn't trigger reliably?
- Optimize the description field. Include explicit trigger phrases and run the optimization loop with at least 20 eval queries.
How do I share skills with my team?
- Package with python -m scripts.package_skill <path> and distribute the .skill file. Team members install it in their skills directory.
Can skills call external APIs?
- Yes. Bundle scripts for API calls, and use environment variables for keys.
What's the file size limit for skills?
- No hard limit, but keep SKILL.md under 500 lines. Offload details to references/ and scripts/.
How do I update an existing skill?
- Copy to a writable location, edit, repackage, and preserve the original name unless creating a variant.
Build smarter workflows and automate repetitive tasks in Claude. Start building your own Code Skills today!
