WonderLab

Posted on Jun 15

How to Make AI Skill Quality Measurable? Benchmark-Driven Evaluation Framework Design

#ai #agentskills #benchmark #agents

Preface

How do you know if an AI Skill's quality is good or bad?

The most common answer is: "Feels okay to use," or "Ran a few test cases and it looked fine."

The common problem with both answers: without comparable numbers and fixed benchmarks, there's no regression detection capability. You don't know whether this version is better or worse than last month's, and you don't know whether a "small Prompt improvement" has quietly degraded performance on certain edge scenarios.

This article is a complete conceptual design for an AI Skill and Workflow evaluation framework — from dataset construction to metric systems, scoring mechanisms to release gates. The goal is to establish a methodology that makes Skill quality measurable, traceable, and continuously improvable.

Five Design Principles

Principle 1: Benchmark-Driven

Every evaluation must produce comparable numbers. Evaluation isn't "run it and see what happens" — it's "run on a fixed dataset and get scores comparable to historical versions." Without a fixed benchmark, there's no regression detection.

Principle 2: Strict Separation Between Evaluation Subject and Evaluation Execution

The Agent instance that executes the Skill/Workflow and the Agent instance that scores the execution results must be completely independent sessions. This follows from a core finding of the SkillLens research: LLM self-evaluation accuracy is only 46.4%, about the level of random guessing. When the same instance both executes and scores, the scores are invalid.

Principle 3: Ratchet Principle

A new version of a Skill/Workflow may go live only if its score on the test set is strictly better than the current version's (ties are not accepted). The historical best score is the protection floor: any change that causes a regression cannot be published.

Principle 4: Layered Evaluation

Skill evaluation and Workflow evaluation are conducted independently in layers, without mixing. A single Skill's quality problems cannot be masked by "workflow-level performance is still okay"; a workflow's end-to-end quality also doesn't equal the sum of individual Skill scores.

Principle 5: Automation Replacing Humans Is the Design Goal, Not a Future Plan

Manual scoring is the starting point, not the end state. The framework design must leave interfaces for automation from the beginning — even if a particular evaluation dimension currently requires human judgment, those scoring results must become annotation data for training future automated scorers.

Evaluation Object Layering: L1 and L2

L1  Single Skill     Evaluate individual Skill quality on standard inputs (including routing Skills)
L2  Workflow         Evaluate execution chains of multiple Skills in series

Routing Skills (Skills that decide which downstream Skill to invoke) participate in evaluation as a type of L1 — routing is fundamentally also a Skill, just with an output of "which downstream Skill to call" rather than text output.

Workflows are divided into two sub-levels that share the same metrics framework:

L2a Sub-workflow: Skills in series completing a single business objective (e.g., Bug analysis workflow: routing → expert Skill analysis → report generation)
L2b End-to-end workflow: Complete chain from task input to final deliverable (e.g., Bug fixing: Bug analysis → code modification → PR submission)

Each layer is evaluated independently, but errors propagate between layers: when the routing Skill picks the wrong downstream Skill at L1, the L2 analysis accuracy for that case is counted as 0, and L2a output quality in turn affects the L2b final score.
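The propagation rule for routing errors can be sketched as follows. This is a minimal illustration, assuming a hypothetical per-case record with `expected_route`, `actual_route`, and `analysis_score` fields; none of these names come from the framework itself.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One L2 test case; all field names are hypothetical."""
    expected_route: str    # downstream Skill the router should have picked
    actual_route: str      # downstream Skill the router actually picked
    analysis_score: float  # 0-5 score given to the analysis output


def l2_analysis_score(case: CaseResult) -> float:
    """Count the analysis as 0 when the L1 routing decision was wrong."""
    if case.actual_route != case.expected_route:
        return 0.0
    return case.analysis_score


cases = [
    CaseResult("android_expert", "android_expert", 4.0),
    CaseResult("qnx_expert", "android_expert", 3.0),  # misrouted
]
scores = [l2_analysis_score(c) for c in cases]
print(scores)  # [4.0, 0.0]
```

Keeping the raw `analysis_score` alongside the propagated value makes it possible to tell later whether a low L2 number came from routing or from the analysis itself.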

Test Dataset Design

Initial Test Set: Three Types Up Front, Others Filled Over Time

The initial test set before Skill launch includes only typical positive cases, a few pre-anticipated edge cases, and anti-pattern data. Failure cases and regression cases start empty and are gradually filled through evaluation iteration and production runs.

Type	Initial State	Description
Typical positive cases	Main body, sampled by dimension coverage	Represents mainstream inputs the Skill is expected to handle
Pre-anticipated edge cases	Few, manually designed	Per Skill Owner judgment, proactively including potentially difficult scenarios
Anti-pattern data	Few, manually designed	Inputs this Skill shouldn't handle — verifying it's not mistakenly triggered
Failure cases	Initially empty	Failure patterns discovered during evaluation, refined and added
Regression cases	Initially empty	Inputs corresponding to previously fixed bugs, preventing reintroduction

How to Select Test Data

The core principle is covering diversity of the input space, not pursuing quantity. Steps:

Identify the main variation dimensions in the Skill's inputs (domain, complexity, completeness, edge conditions, etc.)
Sample from historical real data by dimension matrix, ensuring coverage of major values in each dimension
Proactively design pre-anticipated edge cases: have the Skill Owner identify "potentially difficult" scenarios upfront rather than waiting until failures occur
Add anti-patterns: typical inputs this Skill shouldn't process

The risk with purely random sampling is that it concentrates on the typical cases that are easy to sample, while edge scenarios and minority domains are consistently missed. The resulting evaluation detects only superficial problems.

Using Bug Analysis Skill as an example, input variation dimensions might be:

Variation Dimension	Main Values
Technology domain	Android / QNX / MCU
Bug severity	Crash / Functional anomaly / Performance issue
Log completeness	Complete logs / Missing critical sections / No logs
Root cause depth	Surface symptom / Cross-module / Hardware boundary

Sample by this matrix with at least 1 entry per cell, with priority domains (like Android Crash) having more. Anti-patterns include: feature requirement tickets, operation inquiries, duplicate submissions — these should not trigger Bug analysis.
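Sampling by dimension matrix might look like the sketch below. It assumes historical tickets are available as dictionaries already labeled with the four dimensions; the key names (`domain`, `severity`, `logs`, `depth`) and the function `sample_by_matrix` are hypothetical.

```python
import itertools
import random
from collections import defaultdict

# Dimension values taken from the table above; the key names are assumptions.
DIMENSIONS = {
    "domain": ["Android", "QNX", "MCU"],
    "severity": ["Crash", "Functional anomaly", "Performance issue"],
    "logs": ["Complete logs", "Missing critical sections", "No logs"],
    "depth": ["Surface symptom", "Cross-module", "Hardware boundary"],
}


def sample_by_matrix(tickets, per_cell=1, seed=0):
    """Pick up to `per_cell` tickets for every cell of the dimension matrix.

    Returns (selected tickets, cells that have no historical data).
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for t in tickets:
        by_cell[tuple(t[d] for d in DIMENSIONS)].append(t)

    selected, uncovered = [], []
    for cell in itertools.product(*DIMENSIONS.values()):
        pool = by_cell.get(cell, [])
        if not pool:
            uncovered.append(cell)  # candidate for a manually designed case
            continue
        selected.extend(rng.sample(pool, min(per_cell, len(pool))))
    return selected, uncovered


tickets = [
    {"id": 1, "domain": "Android", "severity": "Crash",
     "logs": "Complete logs", "depth": "Surface symptom"},
    {"id": 2, "domain": "Android", "severity": "Crash",
     "logs": "Complete logs", "depth": "Surface symptom"},
    {"id": 3, "domain": "QNX", "severity": "Performance issue",
     "logs": "No logs", "depth": "Cross-module"},
]
selected, uncovered = sample_by_matrix(tickets)
print(len(selected), len(uncovered))  # 2 79
```

The list of uncovered cells is the useful by-product: the full matrix has 3 × 3 × 3 × 3 = 81 cells, and any cell without historical data shows exactly where a hand-written case is needed. Giving priority domains more entries could be done by calling the function with a larger `per_cell` on a filtered ticket list.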

Statistical Basis for Test Set Size

Test sets that are too small produce wide margins of error and large variation between batches. The standard sample-size formula for estimating a proportion such as a pass rate is:

n = Z² × p × (1−p) / E²

n: Required minimum sample size
Z: Z-value for chosen confidence level (95% confidence → 1.96)
p: Expected pass rate (use 0.5 when unknown — maximizes sample requirement, most conservative)
E: Acceptable error range (±10% means E = 0.10)

Practical reference table (95% confidence level):

Allowed Error	Required Sample Size	Use Case Reference
±15%	43 cases	Quick validation, initial exploration
±10%	97 cases	P0/P1 Skill routine evaluation
±7%	196 cases	A/B tests requiring precise comparison
±5%	385 cases	High-confidence release decisions
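The reference table can be reproduced directly from the formula. The sketch below is one way to do it; the function name `min_sample_size` is made up for illustration, and the inputs are passed as decimal strings so the arithmetic is exact.

```python
import math
from fractions import Fraction


def min_sample_size(error: str, z: str = "1.96", p: str = "0.5") -> int:
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole test case.

    Arguments are decimal strings so the arithmetic is exact. With floats,
    a result that lands exactly on an integer (as E = 0.07 does) could be
    pushed to the wrong side by rounding error.
    """
    z_, p_, e_ = Fraction(z), Fraction(p), Fraction(error)
    return math.ceil(z_**2 * p_ * (1 - p_) / e_**2)


for e in ["0.15", "0.10", "0.07", "0.05"]:
    print(e, min_sample_size(e))
# 0.15 43
# 0.10 97
# 0.07 196
# 0.05 385
```

Because the error term is squared, halving the allowed error roughly quadruples the required test set, which is why ±5% is reserved for high-confidence release decisions.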

Layered size recommendations (initial phase):

Priority	Minimum Recommended Size	Error Range
P0 (critical Skills and core workflows)	≥100 cases	±10%
P1 (important Skills)	≥45 cases	±15%
P2 (others)	Covered through L2, no independent test set	—

L1 Isolation Principle

Core requirement for L1 evaluation: test inputs must be provided directly to the Skill being tested — they must not be triggered through actual execution of upstream Skills.

Example: When evaluating a Skill that "extracts key time points from logs," the L1 test set should directly provide log content as input, rather than having the system first retrieve logs from a ticket system to trigger this Skill. Otherwise, any change to an upstream Skill would pollute the current Skill's evaluation results. L1 responsibility boundaries must be clear.
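A minimal sketch of an isolated L1 run is shown below. The Skill here is a stand-in regex function (a real Skill would invoke an Agent), and the names `extract_time_points`, `L1_CASES`, and `run_l1` are hypothetical. The point is only that each fixture stores the log text itself, not a ticket ID to be resolved by an upstream Skill.

```python
import re
from typing import Callable


def extract_time_points(log_text: str) -> list[str]:
    """Stand-in for the Skill under test: find HH:MM:SS timestamps."""
    return re.findall(r"\b\d{2}:\d{2}:\d{2}\b", log_text)


# L1 fixtures carry the raw input directly, so nothing upstream is executed.
L1_CASES = [
    {"input": "12:00:01 boot ok\n12:00:07 watchdog reset",
     "expected": ["12:00:01", "12:00:07"]},
    {"input": "no timestamps here", "expected": []},
]


def run_l1(skill: Callable[[str], list[str]], cases: list[dict]) -> float:
    """Feed each fixture straight into the Skill and return the pass rate."""
    passed = sum(skill(c["input"]) == c["expected"] for c in cases)
    return passed / len(cases)


pass_rate = run_l1(extract_time_points, L1_CASES)
print(pass_rate)  # 1.0
```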

Metric System

L1 Single Skill Metrics

Skills are classified by the nature of their output, not by business domain: the evaluation method is determined by "what the output is":

Skill Type	Typical Example	Main Quality Metrics	Scoring Method
Routing decision	Bug classification routing	Routing accuracy, misrouting distribution	Fully automated
Data retrieval	Get ticket info, get code files	Field completeness rate, retrieval success rate	Fully automated
Information extraction/transformation	Extract log key time points	Extraction recall rate, field accuracy	Mostly automated
Analysis/reasoning	Bug root cause analysis, code review	Conclusion accuracy (0-5 scale), reasoning chain quality	Semi-automated
Content generation	Code generation, documentation	Compilation pass rate / logical correctness	Tiered automated
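For the routing row, both metrics can be computed from pairs of expected and actual routes. The following is a sketch under the assumption that each test case records those two labels; the label strings and the function name `routing_report` are invented for the example.

```python
from collections import Counter


def routing_report(pairs):
    """pairs: iterable of (expected_skill, actual_skill), one per test case.

    Returns (routing accuracy, Counter of misroutes keyed by the pair).
    """
    pairs = list(pairs)
    correct = sum(exp == act for exp, act in pairs)
    misroutes = Counter((exp, act) for exp, act in pairs if exp != act)
    return correct / len(pairs), misroutes


pairs = [
    ("android", "android"),
    ("android", "qnx"),
    ("qnx", "qnx"),
    ("mcu", "qnx"),
    ("none", "android"),  # anti-pattern input that should trigger nothing
]
accuracy, misroutes = routing_report(pairs)
print(accuracy)  # 0.4
for (expected, actual), count in misroutes.items():
    print(f"{expected} -> {actual}: {count}")
```

The misrouting distribution is often more actionable than the accuracy number alone, since it shows which pairs of Skills the router confuses.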

L2 Workflow Metrics

Metric	Description	Scoring Method
Step completion rate	Proportion of Phases successfully completed; locates chain bottlenecks	Automated
End-to-end accuracy	Comprehensive quality of final output	Analysis: semi-automated; Code: tiered automated
Autonomous completion rate	Proportion fully completed autonomously without human intervention	Automated
Assisted completion rate (HITL)	Proportion completed with human intervention	Manual recording
Token consumption	Full-chain cumulative tokens (input + output)	Automated
Execution time	End-to-end time from task trigger to output delivery	Automated

Relationship between autonomous completion rate and assisted completion rate: The former measures "how far AI can go independently," the latter measures "what can be achieved with human assistance." The gap between them reflects the incremental value of HITL intervention, and is an important indicator for judging whether a workflow is suitable for further automation.
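The two completion rates can be derived from the same run log. The sketch below assumes each run records whether it finished and how many human interventions it needed, and it interprets "assisted" as "completed with at least one intervention"; both the `Run` fields and that interpretation are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool   # did the workflow reach its final deliverable?
    human_steps: int  # number of human interventions during the run


def completion_rates(runs):
    """Return (autonomous completion rate, assisted completion rate)."""
    total = len(runs)
    autonomous = sum(r.completed and r.human_steps == 0 for r in runs)
    assisted = sum(r.completed and r.human_steps > 0 for r in runs)
    return autonomous / total, assisted / total


runs = [Run(True, 0), Run(True, 2), Run(True, 1), Run(False, 1), Run(True, 0)]
auto_rate, assisted_rate = completion_rates(runs)
print(auto_rate, assisted_rate)  # 0.4 0.4
```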

0-5 Scoring Scale

0-5 points with 6 levels, using Bug Analysis as an example:

Score	Generic Definition	Bug Analysis Scenario
5	Completely correct, flawless process and evidence chain	Root cause fully correct, clear analysis process, all evidence correct
4	Correct conclusion, minor errors in details	Root cause correct, some evidence chain details incorrect
3	Basically correct direction but incomplete conclusion	Root cause not fully correct, but analysis approach basically correct
2	Conclusion has deviations but has some indicative value	Analysis approach has deviations but provides some indication for further analysis
1	Has output, but extremely limited value	Has analysis output but value is extremely low, near useless
0	Completely unhelpful	Analysis without any direction, no reference value at all

0 represents completely unhelpful — not distinguishing "harmful" from "useless" because the boundary is hard to accurately define in practice, and both have identical impact on evaluation conclusions.

Scoring Mechanism: From Manual to Automated

Structured Manual Scoring (Current Phase)

Manual scoring is unavoidable, but can be constrained and standardized:

Pre-announce standard answers before scoring: Scorers grade against pre-determined standard answers rather than judging from scratch — reduces subjectivity
Independent dual scoring: The same result is scored by two independent scorers; when they disagree, the lower score is taken and the case is marked as a "disputed case" for separate analysis
Scorers don't participate in execution: Skill Owners are responsible for improving Skills; scorers don't know which version is being tested
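The dual-scoring rule is simple enough to encode directly. This is a small sketch; the function name and the shape of the returned dictionary are illustrative choices, not part of the framework.

```python
def merge_dual_scores(score_a: int, score_b: int) -> dict:
    """Combine two independent 0-5 scores for the same result."""
    for s in (score_a, score_b):
        if not 0 <= s <= 5:
            raise ValueError(f"score out of range: {s}")
    return {
        "final": min(score_a, score_b),  # disagreements take the lower score
        "disputed": score_a != score_b,  # flagged for separate analysis
    }


print(merge_dual_scores(4, 4))  # {'final': 4, 'disputed': False}
print(merge_dual_scores(5, 3))  # {'final': 3, 'disputed': True}
```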

Automation Path

Stage 1: Routing Layer Fully Automated

Routing decisions are objectively verifiable (which Skill was called is an observable fact that can be compared against the expected route), so routing accuracy is the first metric to reach 100% automated scoring.

Stage 2: Independent LLM Scorer

For scoring dimensions requiring semantic understanding, use an independent LLM instance as scorer:

Scorer uses a different model from the executor (different bias patterns)
Scorer receives: standard answers + Agent output + scoring rubric — no version information
Scorer results need periodic calibration against human annotations — at least 20% human review per batch initially
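Two of these constraints lend themselves to a short sketch: building a scorer input that carries no version information, and drawing the 20% human-review sample. The record keys (`standard_answer`, `agent_output`, `skill_version`) and both function names are hypothetical; the actual call to a scoring model is deliberately left out.

```python
import math
import random


def build_scorer_input(record: dict, rubric: str) -> dict:
    """Copy only the fields the scorer is meant to see (an allowlist)."""
    return {
        "standard_answer": record["standard_answer"],
        "agent_output": record["agent_output"],
        "rubric": rubric,
    }


def pick_for_human_review(case_ids, fraction=0.2, seed=0):
    """Randomly choose a share of the batch for human calibration."""
    rng = random.Random(seed)
    ids = list(case_ids)
    k = math.ceil(len(ids) * fraction)
    return sorted(rng.sample(ids, k))


record = {
    "standard_answer": "Null pointer in camera HAL",
    "agent_output": "Likely null dereference in camera HAL init",
    "skill_version": "v1.4.2",  # must not reach the scorer
}
payload = build_scorer_input(record, rubric="0-5 scale, see rubric table")
review_ids = pick_for_human_review(range(10))
print(sorted(payload))  # ['agent_output', 'rubric', 'standard_answer']
print(len(review_ids))  # 2
```

An allowlist is used rather than deleting known version fields, so that metadata added later does not leak to the scorer by default.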

Stage 3: Specialized Scorer Training

After accumulating sufficient human annotation data, train scorers specialized for specific business scenarios. At this stage, the scorer can identify business-specific "correctness" standards rather than generalized "semantic similarity."

Version Management and Release Gates

Skill Version Iteration Flow

Developer modifies Skill → Submit to staging branch → Trigger evaluation (run on test set)
                                        ↓
                             Score > current main version (ratchet principle, tie not accepted)
                             ↓ Yes              ↓ No
                        Merge to main      Reject publication, attach evaluation report
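The gate in this flow reduces to a strict comparison against the stored best score. Below is a minimal sketch, assuming a single aggregate score per version; the class name `Ratchet` and its return strings are illustrative.

```python
class Ratchet:
    """Holds the historical best score and only ever moves it upward."""

    def __init__(self, best_score: float):
        self.best_score = best_score

    def submit(self, candidate_score: float) -> str:
        # Strictly greater: a tie is rejected, as the ratchet principle requires.
        if candidate_score > self.best_score:
            self.best_score = candidate_score
            return "merge"
        return "reject"


ratchet = Ratchet(best_score=3.8)
decisions = [ratchet.submit(s) for s in (3.8, 3.7, 3.9, 3.85)]
print(decisions)           # ['reject', 'reject', 'merge', 'reject']
print(ratchet.best_score)  # 3.9
```

Note that the last candidate (3.85) beats the original 3.8 but is still rejected, because the floor moved to 3.9 after the earlier merge.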

Canary Release Mechanism

For Skill changes with broad impact, after passing test set evaluation, also conduct live Canary release:

New version runs at 20% traffic initially (staging branch)
Collect user experience scores from real traffic
After collecting N valid samples on both sides, compare the mean scores
New version mean ≥ old version → full switch; otherwise rollback

The key to Canary: user experience scores must come from real interactions, not offline test sets. Test sets measure accuracy; live Canary measures "does this actually help real users" — these two measure different things.
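The Canary decision rule can be written down as a small function. This sketch assumes user experience scores arrive as plain lists of numbers and that N is passed in as `min_samples`; it performs the plain mean comparison described above and nothing more.

```python
from statistics import mean


def canary_decision(old_scores, new_scores, min_samples):
    """Decide what to do with the Canary once both sides have enough data."""
    if len(old_scores) < min_samples or len(new_scores) < min_samples:
        return "keep collecting"
    if mean(new_scores) >= mean(old_scores):
        return "full switch"
    return "rollback"


print(canary_decision([4, 3, 5, 4], [4, 4, 5, 4], min_samples=4))  # full switch
print(canary_decision([4, 3, 5, 4], [4, 4, 5, 4], min_samples=5))  # keep collecting
print(canary_decision([4, 4], [3, 4], min_samples=2))              # rollback
```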

Quantitative Release Gate Standards (Initial Recommendations)

Release Type	Minimum Requirement
Fix known failure cases	All targeted test cases pass + regression cases don't degrade
Extend new capabilities	New scenario coverage ≥ 80% + existing test set doesn't degrade
Routing logic changes	Routing accuracy ≥ current version + Canary passes
Complete Skill rewrite	All category scores ≥ current version mean

Architecture: Three-Layer Separation

Based on Benchmark-Evaluator separation design principles:

Benchmark (Data Layer)         Test datasets + standard answers + evaluation function definitions
        ↓
Evaluator (Execution Layer)    Triggers workflow execution, collects trajectories and artifacts
        ↓
Scorer (Scoring Layer)         Independent scorer evaluates artifacts (isolated from Evaluator)
        ↓
ResultStore (Storage Layer)    Full retention of raw data, scores, and trajectories
        ↓
Dashboard (Display Layer)      Score trends, regression alerts, comparison views

Responsibilities are clearly divided across the three core layers: Benchmark doesn't know how to execute, Evaluator doesn't know how to score, and Scorer doesn't know execution details. ResultStore and Dashboard sit downstream and only store and display results. This isolation lets each layer iterate independently.
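One way to make this separation concrete is to give each layer a narrow interface. The sketch below uses `typing.Protocol`; every class, method, and field name is an assumption made for illustration, and the toy evaluator and scorer exist only to show the wiring.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Case:
    """One Benchmark entry: pure data, no execution logic."""
    case_id: str
    input: str
    standard_answer: str


class Evaluator(Protocol):
    def run(self, case_input: str) -> str: ...


class Scorer(Protocol):
    def score(self, output: str, standard_answer: str) -> int: ...


def evaluate(benchmark: list[Case], evaluator: Evaluator, scorer: Scorer) -> dict[str, int]:
    """Wire the layers together; the result is what a ResultStore would persist."""
    results = {}
    for case in benchmark:
        output = evaluator.run(case.input)  # never sees the standard answer
        # The scorer receives only the artifact and the answer, not the input.
        results[case.case_id] = scorer.score(output, case.standard_answer)
    return results


class UppercaseEvaluator:
    """Toy executor standing in for a real Skill run."""
    def run(self, case_input: str) -> str:
        return case_input.upper()


class ExactMatchScorer:
    """Toy scorer: 5 for an exact match, 0 otherwise."""
    def score(self, output: str, standard_answer: str) -> int:
        return 5 if output == standard_answer else 0


benchmark = [Case("c1", "abc", "ABC"), Case("c2", "xyz", "XYY")]
results = evaluate(benchmark, UppercaseEvaluator(), ExactMatchScorer())
print(results)  # {'c1': 5, 'c2': 0}
```

Because the evaluator and scorer only meet inside `evaluate`, either one can be swapped (for example, manual scoring replaced by an LLM scorer) without touching the other.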

Implementation Roadmap

Phase 1 (Months 0-3): Structured

Goal: Manual scoring becomes reliable, consistent, and traceable

Finalize test dataset v1.0, complete dataset version management
Routing accuracy achieves 100% automated calculation
Establish standardized scoring Rubric, dual-scoring mechanism goes live, calculate inter-rater consistency
Build ResultStore — all evaluation results stored structurally (no more spreadsheets)
Evaluation results bound to Skill versions, supporting historical comparison
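The article does not say how inter-rater consistency should be calculated. One common choice is Cohen's kappa, which corrects raw agreement for the agreement expected by chance; the sketch below assumes that choice and two equal-length lists of 0-5 scores.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


scores_a = [5, 4, 4, 3, 2, 5, 4, 3]
scores_b = [5, 4, 3, 3, 2, 4, 4, 3]
kappa = cohen_kappa(scores_a, scores_b)
print(round(kappa, 3))  # 0.652
```

Plain kappa treats a 5-versus-4 disagreement the same as 5-versus-0. For an ordinal scale like this one, a weighted variant may be a better fit, but that is beyond this sketch.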

Phase 2 (Months 3-6): Semi-automated

Goal: Reduce manual scoring by 60%, automate high-certainty dimensions

Independent LLM scorer goes live, initially covering automated scoring of analysis accuracy
Scorer calibration mechanism: 20% human review per batch
Canary release mechanism for frequently-updated Skills
Dashboard live: score trends, cross-Skill comparison, regression alerts

Phase 3 (Months 6-12): Automation-first

Goal: Manual scoring only retained for calibration and anomaly handling; daily evaluation fully automated

Scorer calibration consistency rate reaches 80%+
Test sets continuously auto-expanded from historical failure cases (non-static)
Cross-Skill cascading failure analysis: auto-detect whether Skill A's degradation causes Skill B's failure rate to rise

Summary

Making AI Skill quality measurable requires:

Fixed benchmarks: Every evaluation runs on the same dataset, producing comparable numbers
Execution-scoring separation: Avoids the unreliability of LLM self-evaluation (46.4% accuracy in the SkillLens research, about the level of random guessing)
Ratchet principle: Scores only go up; ties are not accepted; the historical best score is the protection floor
Layered evaluation: L1 Skills and L2 Workflows are independent, not masking each other
Progressive automation: Start from routing layer 100% automation, progressively advance toward semantic layer

This system can't be built overnight, but every step brings real improvement. The distance from "feels okay" to "measurably good" is shorter than most people imagine.


DEV Community