Preface
How do you know if an AI Skill's quality is good or bad?
The most common answer is: "Feels okay to use," or "Ran a few test cases and it looked fine."
Both answers share the same problem: without comparable numbers and a fixed benchmark, there is no way to detect regressions. You don't know whether this version is better or worse than last month's, and you don't know whether a "small Prompt improvement" has quietly degraded performance on certain edge scenarios.
This article is a complete conceptual design for an AI Skill and Workflow evaluation framework — from dataset construction to metric systems, scoring mechanisms to release gates. The goal is to establish a methodology that makes Skill quality measurable, traceable, and continuously improvable.
Five Design Principles
Principle 1: Benchmark-Driven
Every evaluation must produce comparable numbers. Evaluation isn't "run it and see what happens" — it's "run on a fixed dataset and get scores comparable to historical versions." Without a fixed benchmark, there's no regression detection.
Principle 2: Strict Separation Between Evaluation Subject and Evaluation Execution
The Agent instance that executes the Skill/Workflow and the Agent instance that scores the execution results must be completely independent sessions. This follows from a core finding of the SkillLens research: LLM self-evaluation accuracy is only 46.4%, roughly equivalent to random guessing. If the same instance both executes and scores, the scores are invalid.
Principle 3: Ratchet Principle
The prerequisite for a new version of a Skill/Workflow going live is that its score on the test set is strictly better than the current version (ties not accepted). The historical best score is the protection floor — any change that causes regression is not allowed to publish.
Principle 4: Layered Evaluation
Skill evaluation and Workflow evaluation are conducted independently in layers, without mixing. A single Skill's quality problems cannot be masked by "workflow-level performance is still okay"; a workflow's end-to-end quality also doesn't equal the sum of individual Skill scores.
Principle 5: Automation Replacing Humans Is the Design Goal, Not a Future Plan
Manual scoring is the starting point, not the end state. The framework design must leave interfaces for automation from the beginning — even if a particular evaluation dimension currently requires human judgment, those scoring results must become annotation data for training future automated scorers.
Evaluation Object Layering: L1 and L2
| Layer | Object | Description |
|---|---|---|
| L1 | Single Skill | Evaluate individual Skill quality on standard inputs (including routing Skills) |
| L2 | Workflow | Evaluate execution chains of multiple Skills in series |
Routing Skills (Skills that decide which downstream Skill to invoke) participate in evaluation as a type of L1 — routing is fundamentally also a Skill, just with an output of "which downstream Skill to call" rather than text output.
Workflows are divided into two sub-levels that share the same metrics framework:
- L2a Sub-workflow: Skills in series completing a single business objective (e.g., Bug analysis workflow: routing → expert Skill analysis → report generation)
- L2b End-to-end workflow: Complete chain from task input to final deliverable (e.g., Bug fixing: Bug analysis → code modification → PR submission)
Each layer is evaluated independently, but results propagate between layers: when the routing Skill picks the wrong downstream Skill at L1, the L2 analysis accuracy for that case is counted as 0, and L2a output quality in turn affects the final L2b score.
Test Dataset Design
Initial Test Set: Three Seed Types, Others Filled Over Time
The initial test set before Skill launch contains typical positive cases, a few pre-anticipated edge cases, and anti-pattern data. Failure cases and regression cases start empty and are gradually filled through evaluation iteration and production runs.
| Type | Initial State | Description |
|---|---|---|
| Typical positive cases | Main body, sampled by dimension coverage | Represents mainstream inputs the Skill is expected to handle |
| Pre-anticipated edge cases | Few, manually designed | Per Skill Owner judgment, proactively including potentially difficult scenarios |
| Anti-pattern data | Few, manually designed | Inputs this Skill shouldn't handle — verifying it's not mistakenly triggered |
| Failure cases | Initially empty | Failure patterns discovered during evaluation, refined and added |
| Regression cases | Initially empty | Inputs corresponding to previously fixed bugs, preventing reintroduction |
How to Select Test Data
The core principle is covering diversity of the input space, not pursuing quantity. Steps:
- Identify the main variation dimensions in the Skill's inputs (domain, complexity, completeness, edge conditions, etc.)
- Sample from historical real data by dimension matrix, ensuring coverage of major values in each dimension
- Proactively design pre-anticipated edge cases: have the Skill Owner identify "potentially difficult" scenarios upfront rather than waiting until failures occur
- Add anti-patterns: typical inputs this Skill shouldn't process
The risk with random sampling is that it concentrates on "easy to sample" typical cases, while edge scenarios and minority domains are consistently missing. The resulting evaluation only detects superficial problems.
Using Bug Analysis Skill as an example, input variation dimensions might be:
| Variation Dimension | Main Values |
|---|---|
| Technology domain | Android / QNX / MCU |
| Bug severity | Crash / Functional anomaly / Performance issue |
| Log completeness | Complete logs / Missing critical sections / No logs |
| Root cause depth | Surface symptom / Cross-module / Hardware boundary |
Sample by this matrix with at least 1 entry per cell, with priority domains (like Android Crash) having more. Anti-patterns include: feature requirement tickets, operation inquiries, duplicate submissions — these should not trigger Bug analysis.
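The "at least 1 entry per cell" rule can be checked mechanically. The following is a minimal sketch, assuming each test case is a dict tagged with the four dimensions from the table above; the field names and the `uncovered_cells` helper are hypothetical, not part of any existing tool.

```python
from collections import Counter
from itertools import product

# Variation dimensions from the Bug Analysis example (field names are assumed).
DIMENSIONS = {
    "domain": ["Android", "QNX", "MCU"],
    "severity": ["Crash", "Functional anomaly", "Performance issue"],
    "log_completeness": ["Complete logs", "Missing critical sections", "No logs"],
    "root_cause_depth": ["Surface symptom", "Cross-module", "Hardware boundary"],
}


def uncovered_cells(cases, dimensions=DIMENSIONS):
    """Return every dimension combination that has no test case yet."""
    keys = list(dimensions)
    counts = Counter(tuple(case[k] for k in keys) for case in cases)
    all_cells = product(*(dimensions[k] for k in keys))
    return [cell for cell in all_cells if counts[cell] == 0]


sample_cases = [
    {
        "domain": "Android",
        "severity": "Crash",
        "log_completeness": "Complete logs",
        "root_cause_depth": "Surface symptom",
    },
]
missing = uncovered_cells(sample_cases)
print(f"{len(missing)} cells still need at least one case")
```

With three values in each of four dimensions the full matrix has 81 cells, so a per-cell floor already implies a test set of meaningful size before priority domains are oversampled.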
Statistical Basis for Test Set Size
Samples that are too small produce wide error margins and high variation between batches. The standard sample-size formula for estimating a pass rate is:
n = Z² × p × (1−p) / E²
- n: Required minimum sample size
- Z: Z-value for chosen confidence level (95% confidence → 1.96)
- p: Expected pass rate (use 0.5 when unknown — maximizes sample requirement, most conservative)
- E: Acceptable error range (±10% means E = 0.10)
Practical reference table (95% confidence level):
| Allowed Error | Required Sample Size | Use Case Reference |
|---|---|---|
| ±15% | 43 cases | Quick validation, initial exploration |
| ±10% | 97 cases | P0/P1 Skill routine evaluation |
| ±7% | 196 cases | A/B tests requiring precise comparison |
| ±5% | 385 cases | High-confidence release decisions |
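The reference table can be reproduced directly from the formula. This is a small sketch, assuming the result is always rounded up to a whole case; the function name `min_sample_size` is hypothetical, and `Decimal` is used only to avoid floating-point noise at exact boundaries such as ±7%.

```python
from decimal import Decimal, ROUND_CEILING


def min_sample_size(error, z="1.96", p="0.5"):
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole test case."""
    z, p, e = Decimal(z), Decimal(p), Decimal(error)
    n = (z * z * p * (1 - p)) / (e * e)
    return int(n.to_integral_value(rounding=ROUND_CEILING))


for error in ("0.15", "0.10", "0.07", "0.05"):
    print(f"±{Decimal(error):.0%}: {min_sample_size(error)} cases")
```

Passing a different `p` shows why 0.5 is the conservative default: any other expected pass rate yields a smaller required sample.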
Layered size recommendations (initial phase):
| Priority | Minimum Recommended Size | Error Range |
|---|---|---|
| P0 (critical Skills and core workflows) | ≥100 cases | ±10% |
| P1 (important Skills) | ≥45 cases | ±15% |
| P2 (others) | Covered through L2, no independent test set | — |
L1 Isolation Principle
Core requirement for L1 evaluation: test inputs must be provided directly to the Skill being tested — they must not be triggered through actual execution of upstream Skills.
Example: When evaluating a Skill that "extracts key time points from logs," the L1 test set should directly provide log content as input — not have the system first retrieve logs from a ticket system to trigger this Skill. Once an upstream Skill changes, it pollutes the current Skill's evaluation results — L1 responsibility boundaries must be clear.
Metric System
L1 Single Skill Metrics
Classification axis is output nature, not business domain — the evaluation method is determined by "what the output is":
| Skill Type | Typical Example | Main Quality Metrics | Scoring Method |
|---|---|---|---|
| Routing decision | Bug classification routing | Routing accuracy, misrouting distribution | Fully automated |
| Data retrieval | Get ticket info, get code files | Field completeness rate, retrieval success rate | Fully automated |
| Information extraction/transformation | Extract log key time points | Extraction recall rate, field accuracy | Mostly automated |
| Analysis/reasoning | Bug root cause analysis, code review | Conclusion accuracy (0-5 scale), reasoning chain quality | Semi-automated |
| Content generation | Code generation, documentation | Compilation pass rate / logical correctness | Tiered automated |
L2 Workflow Metrics
| Metric | Description | Scoring Method |
|---|---|---|
| Step completion rate | Proportion of Phases successfully completed; locates chain bottlenecks | Automated |
| End-to-end accuracy | Comprehensive quality of final output | Analysis: semi-automated; Code: tiered automated |
| Autonomous completion rate | Proportion fully completed autonomously without human intervention | Automated |
| Assisted completion rate (HITL) | Proportion completed with human intervention | Manual recording |
| Token consumption | Full-chain cumulative tokens (input + output) | Automated |
| Execution time | End-to-end time from task trigger to output delivery | Automated |
Relationship between autonomous completion rate and assisted completion rate: The former measures "how far AI can go independently," the latter measures "what can be achieved with human assistance." The gap between them reflects the incremental value of HITL intervention, and is an important indicator for judging whether a workflow is suitable for further automation.
0-5 Scoring Scale
0-5 points with 6 levels, using Bug Analysis as an example:
| Score | Generic Definition | Bug Analysis Scenario |
|---|---|---|
| 5 | Completely correct, flawless process and evidence chain | Root cause fully correct, clear analysis process, all evidence correct |
| 4 | Correct conclusion, minor errors in details | Root cause correct, some evidence chain details incorrect |
| 3 | Basically correct direction but incomplete conclusion | Root cause not fully correct, but analysis approach basically correct |
| 2 | Conclusion has deviations but has some indicative value | Analysis approach has deviations but provides some indication for further analysis |
| 1 | Has output, but extremely limited value | Has analysis output but value is extremely low, near useless |
| 0 | Completely unhelpful | Analysis without any direction, no reference value at all |
0 represents completely unhelpful — not distinguishing "harmful" from "useless" because the boundary is hard to accurately define in practice, and both have identical impact on evaluation conclusions.
Scoring Mechanism: From Manual to Automated
Structured Manual Scoring (Current Phase)
Manual scoring is unavoidable, but can be constrained and standardized:
- Pre-announce standard answers before scoring: Scorers grade against pre-determined standard answers rather than judging from scratch — reduces subjectivity
- Independent dual scoring: Each result is scored by two independent scorers; when they disagree, the lower score is taken and the case is marked as a "disputed case" for separate analysis
- Scorers don't participate in execution: Skill Owners are responsible for improving Skills; scorers don't know which version is being tested
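The dual-scoring rule above can be written down as a small reconciliation step. This is a minimal sketch, assuming both scorers use the 0-5 scale; the `DualScore` record and `reconcile` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DualScore:
    case_id: str
    final_score: int
    disputed: bool


def reconcile(case_id, score_a, score_b):
    """Combine two independent 0-5 scores; a disagreement takes the lower one."""
    for s in (score_a, score_b):
        if not 0 <= s <= 5:
            raise ValueError(f"score {s} is outside the 0-5 scale")
    return DualScore(case_id, min(score_a, score_b), disputed=score_a != score_b)


print(reconcile("case-001", 4, 4))
print(reconcile("case-002", 5, 3))
```

Keeping the `disputed` flag next to the final score makes it straightforward to pull disputed cases out later for separate analysis.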
Automation Path
Stage 1: Routing Layer Fully Automated
Routing outcomes can be checked deterministically (which Skill was called is an observable fact), so routing accuracy is the first metric to reach 100% automated scoring.
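Because the check is a plain comparison between the expected and the actual downstream Skill, it fits in a few lines. A minimal sketch, assuming each record is an `(expected_skill, actual_skill)` pair; the Skill names below are made up for illustration.

```python
from collections import Counter


def routing_report(records):
    """Return (accuracy, misrouting distribution) for (expected, actual) pairs."""
    records = list(records)
    correct = sum(expected == actual for expected, actual in records)
    misroutes = Counter((e, a) for e, a in records if e != a)
    return correct / len(records), misroutes


accuracy, misroutes = routing_report([
    ("android_crash", "android_crash"),
    ("qnx_perf", "android_crash"),
    ("mcu_functional", "mcu_functional"),
    ("qnx_perf", "android_crash"),
])
print(f"routing accuracy: {accuracy:.0%}")
for (expected, actual), count in misroutes.most_common():
    print(f"{expected} -> {actual}: {count}")
```

The misrouting distribution matters as much as the headline accuracy, since a single dominant confusion pair usually points at one fixable routing rule.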
Stage 2: Independent LLM Scorer
For scoring dimensions requiring semantic understanding, use an independent LLM instance as scorer:
- Scorer uses a different model from the executor (different bias patterns)
- Scorer receives: standard answers + Agent output + scoring rubric — no version information
- Scorer results need periodic calibration against human annotations — at least 20% human review per batch initially
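The calibration step can be sketched as two helpers: one that picks the cases sent to human review, and one that measures how often the LLM scorer agrees with the human. This is only an illustration under stated assumptions: scores are kept in dicts keyed by case ID, agreement means "within `tolerance` points", and both function names are hypothetical.

```python
import random


def pick_review_sample(case_ids, review_percent=20, seed=0):
    """Randomly choose at least `review_percent` percent of cases for human review."""
    case_ids = list(case_ids)
    k = (len(case_ids) * review_percent + 99) // 100  # integer ceiling
    return random.Random(seed).sample(case_ids, k)


def agreement_rate(llm_scores, human_scores, tolerance=0):
    """Share of reviewed cases where the LLM score is within `tolerance` of the human score."""
    reviewed = [cid for cid in human_scores if cid in llm_scores]
    if not reviewed:
        raise ValueError("no overlapping cases to compare")
    hits = sum(abs(llm_scores[cid] - human_scores[cid]) <= tolerance for cid in reviewed)
    return hits / len(reviewed)


batch = [f"case-{i}" for i in range(50)]
print(pick_review_sample(batch))
print(agreement_rate({"a": 4, "b": 2, "c": 5}, {"a": 4, "b": 3}))
```

Rounding the sample size up rather than to the nearest integer keeps small batches from slipping below the 20% floor.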
Stage 3: Specialized Scorer Training
After accumulating sufficient human annotation data, train scorers specialized for specific business scenarios. At this stage, the scorer can identify business-specific "correctness" standards rather than generalized "semantic similarity."
Version Management and Release Gates
Skill Version Iteration Flow
Developer modifies Skill → Submit to staging branch → Trigger evaluation (run on test set)
↓
Score > current main version? (ratchet principle, ties not accepted)
- Yes → Merge to main
- No → Reject publication, attach evaluation report
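The gate itself reduces to one strict comparison. A minimal sketch, assuming scores are plain numbers and that the protection floor is the higher of the current main score and the historical best; the function name and parameters are hypothetical.

```python
def ratchet_gate(candidate_score, current_main_score, historical_best):
    """Allow a merge only if the candidate strictly beats the protected floor."""
    floor = max(current_main_score, historical_best)
    if candidate_score > floor:
        return True, f"merge: {candidate_score} > {floor}"
    return False, f"reject: {candidate_score} <= {floor}, attach evaluation report"


print(ratchet_gate(4.2, 4.1, 4.1))
print(ratchet_gate(4.1, 4.1, 4.1))  # a tie is rejected
print(ratchet_gate(4.2, 4.0, 4.3))  # beats main but not the historical best
```

Using `>` rather than `>=` is what encodes "ties not accepted"; a change that merely matches the floor has not demonstrated an improvement.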
Canary Release Mechanism
For Skill changes with broad impact, after passing test set evaluation, also conduct live Canary release:
- New version runs at 20% traffic initially (staging branch)
- Collect user experience scores from real traffic
- After collecting N valid samples on both sides, compare the mean scores
- New version mean ≥ old version → full switch; otherwise rollback
The key to Canary: user experience scores must come from real interactions, not offline test sets. Test sets measure accuracy; live Canary measures "does this actually help real users" — these two measure different things.
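The Canary decision rule described above can be sketched as follows. This assumes user experience scores arrive as simple numeric lists and that N is passed in as `min_samples`; the function name and the three return labels are hypothetical.

```python
from statistics import mean


def canary_decision(new_scores, old_scores, min_samples):
    """Compare mean user-experience scores once both sides have enough samples."""
    if len(new_scores) < min_samples or len(old_scores) < min_samples:
        return "keep collecting"
    return "full switch" if mean(new_scores) >= mean(old_scores) else "rollback"


print(canary_decision([4, 5, 4], [4, 4, 4], min_samples=3))
print(canary_decision([3, 3, 3], [4, 4, 4], min_samples=3))
print(canary_decision([5], [4, 4, 4], min_samples=3))
```

A plain mean comparison is sensitive to noise when N is small, so the choice of N deserves the same care as the test set sizes discussed earlier.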
Quantitative Release Gate Standards (Initial Recommendations)
| Release Type | Minimum Requirement |
|---|---|
| Fix known failure cases | All targeted test cases pass + regression cases don't degrade |
| Extend new capabilities | New scenario coverage ≥ 80% + existing test set doesn't degrade |
| Routing logic changes | Routing accuracy ≥ current version + Canary passes |
| Complete Skill rewrite | All category scores ≥ current version mean |
Architecture: Three-Layer Separation
Based on Benchmark-Evaluator separation design principles:
Benchmark (Data Layer): Test datasets + standard answers + evaluation function definitions
↓
Evaluator (Execution Layer): Triggers workflow execution, collects trajectories and artifacts
↓
Scorer (Scoring Layer): Independent scorer evaluates artifacts (isolated from Evaluator)
↓
ResultStore (Storage Layer): Full retention of raw data, scores, and trajectories
↓
Dashboard (Display Layer): Score trends, regression alerts, comparison views
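Responsibilities are cleanly divided across the three core layers: Benchmark doesn't know how to execute, Evaluator doesn't know how to score, and Scorer doesn't know execution details. ResultStore and Dashboard sit downstream of these three and only store and display what they produce. This isolation ensures each layer can iterate independently.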
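One way to picture the separation is as functions that only see what their layer is allowed to see. The following is a toy sketch under stated assumptions, not a reference implementation: the `Case` and `Artifact` records, the `evaluate` and `score` functions, and the lambda stand-ins for the Skill and the scorer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:  # Benchmark layer: data only, no execution logic
    case_id: str
    input: str
    standard_answer: str


@dataclass
class Artifact:  # what the Evaluator hands over to the Scorer
    case_id: str
    output: str


def evaluate(cases, run_skill: Callable[[str], str]):
    """Evaluator: executes the Skill and collects artifacts; it never scores."""
    return [Artifact(c.case_id, run_skill(c.input)) for c in cases]


def score(cases, artifacts, scorer: Callable[[str, str], int]):
    """Scorer: sees only standard answers and outputs, not how they were produced."""
    answers = {c.case_id: c.standard_answer for c in cases}
    return {a.case_id: scorer(answers[a.case_id], a.output) for a in artifacts}


# Toy stand-ins so the sketch runs end to end.
benchmark = [Case("c1", "log A", "null pointer"), Case("c2", "log B", "race condition")]
artifacts = evaluate(benchmark, run_skill=lambda text: "null pointer")
result_store = score(benchmark, artifacts, scorer=lambda gold, out: 5 if gold == out else 0)
print(result_store)
```

Because the scorer is passed in as a plain callable, a manual scoring step can later be swapped for an LLM scorer without touching the Evaluator.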
Implementation Roadmap
Phase 1 (Months 0-3): Structured
Goal: Manual scoring becomes reliable, consistent, and traceable
- Finalize test dataset v1.0, complete dataset version management
- Routing accuracy achieves 100% automated calculation
- Establish standardized scoring Rubric, dual-scoring mechanism goes live, calculate inter-rater consistency
- Build ResultStore — all evaluation results stored structurally (no more spreadsheets)
- Evaluation results bound to Skill versions, supporting historical comparison
Phase 2 (Months 3-6): Semi-automated
Goal: Reduce manual scoring by 60%, automate high-certainty dimensions
- Independent LLM scorer goes live, covering initial automated analysis accuracy scoring
- Scorer calibration mechanism: 20% human review per batch
- Canary release mechanism for frequently-updated Skills
- Dashboard live: score trends, cross-Skill comparison, regression alerts
Phase 3 (Months 6-12): Automation-first
Goal: Manual scoring only retained for calibration and anomaly handling; daily evaluation fully automated
- Scorer calibration consistency rate reaches 80%+
- Test sets continuously auto-expanded from historical failure cases (non-static)
- Cross-Skill cascading failure analysis: auto-detect whether Skill A's degradation causes Skill B's failure rate to rise
Summary
Making AI Skill quality measurable requires:
- Fixed benchmarks: Every evaluation runs on the same dataset, producing comparable numbers
- Execution-scoring separation: Avoids the LLM self-evaluation problem, where SkillLens research found accuracy of only 46.4%, roughly equivalent to random guessing
- Ratchet principle: Scores only go up; ties are not accepted; the historical best score is the protection floor
- Layered evaluation: L1 Skills and L2 Workflows are independent, not masking each other
- Progressive automation: Start from routing layer 100% automation, progressively advance toward semantic layer
This system can't be built overnight, but every step brings real improvement. The distance from "feels okay" to "measurably good" is shorter than most people imagine.