WonderLab

Posted on Jun 21

Skill Series (01): Skill Evaluation — How to Quantify AI Skill Quality

#llm #agentskills #agents #ai

The Two-Layer Problem

Standard software testing has one layer: did the code produce the right output? Skill evaluation has two:

Layer 1 — Trigger: Did the LLM decide this input should invoke this Skill?
Layer 2 — Execution: Did the Skill actually complete the task correctly?

Skip either layer and the evaluation is incomplete. A Skill with a 90% task success rate but 60% trigger recall handles only about 54% of the requests it should (0.60 × 0.90), a far worse user experience than the 90% figure suggests.

The test subject is rnd-technical-writer, a Skill that writes technical blog articles. 20 trigger cases, two task completion runs, one A/B prompt comparison — all numbers from actual output.

Evaluation Framework

Trigger Evaluation

Core metrics:

Recall    = TP / (TP + FN)  ← of all inputs that should trigger, how many did?
Precision = TP / (TP + FP)  ← of all inputs that triggered, how many were correct?
F1        = 2 × Recall × Precision / (Recall + Precision)
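The three formulas can be sketched as a small helper. This is a minimal illustration, not the article's demo code: the `Confusion` class and `trigger_metrics` function are hypothetical names, and accuracy is added alongside because the results section reports it.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # should trigger, did trigger
    fp: int  # should not trigger, did trigger
    fn: int  # should trigger, did not trigger
    tn: int  # should not trigger, did not trigger


def trigger_metrics(c: Confusion) -> dict[str, float]:
    """Return recall, precision, F1, and accuracy; 0.0 when a denominator is empty."""
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}
```

With the counts reported in the run results, `trigger_metrics(Confusion(tp=11, fp=1, fn=0, tn=8))` gives recall 1.0, precision of about 0.917, F1 of about 0.957, and accuracy 0.95.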

Test set composition (20 cases):

TP  (True Positives — should trigger)    ×8   ← explicit write/tutorial/deep-dive requests
TN  (True Negatives — should not)        ×8   ← questions, series planning, code help
EDGE (boundary cases)                    ×4   ← semantically ambiguous, mixed signals

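Clear TP/TN cases are easy to get right. Edge cases expose where the Skill description is actually ambiguous. Note that TP and TN here label what a case is expected to do; in the confusion matrix they count outcomes, and each edge case is scored against its own expected label.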

Automation approach: Feed the Skill description + user input to the LLM, ask it to predict whether the Skill would trigger, and return structured JSON:

TRIGGER_EVAL_PROMPT = """You are evaluating whether a user message would trigger a specific AI Skill.

Skill specification:
{skill_description}

User message: "{user_input}"

Answer in valid JSON only:
{{
  "prediction": "trigger" or "no_trigger",
  "reasoning": "one sentence explanation"
}}"""
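One way to wire the prompt into an evaluator is sketched below. It assumes a hypothetical `call_llm` callable that takes a prompt string and returns the model's reply text; the article does not show its client code, so the function names and the fence-tolerant JSON extraction are illustrative choices.

```python
import json
import re
from typing import Callable


def parse_json_reply(raw: str) -> dict:
    """Extract the first JSON object from a reply, tolerating code fences or stray text."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in reply: {raw!r}")
    return json.loads(match.group(0))


def predict_trigger(
    prompt_template: str,
    skill_description: str,
    user_input: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> tuple[bool, str]:
    """Fill the template, query the model, and return (triggered, reasoning)."""
    prompt = prompt_template.format(
        skill_description=skill_description, user_input=user_input
    )
    reply = parse_json_reply(call_llm(prompt))
    prediction = reply.get("prediction")
    if prediction not in ("trigger", "no_trigger"):
        raise ValueError(f"unexpected prediction: {prediction!r}")
    return prediction == "trigger", str(reply.get("reasoning", ""))
```

Rejecting any value other than the two expected labels keeps a malformed reply from being silently counted as a negative.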

Task Completion Evaluation

Two levels of checking:

Level 2 (structural) — rule-based, no LLM required:
  → Word / character count above minimum
  → Code block present
  → At least one H2 section heading

Level 3 (quality, LLM-as-Judge) — 4 dimensions, 1–5 scale:
  → Technical accuracy  (weight 35%)
  → Depth              (weight 25%)
  → Clarity            (weight 20%)
  → Practical value    (weight 20%)
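The Level 2 checks can be sketched as plain rules. The function name and the wording of the second and third messages are hypothetical; the 300-word minimum matches the threshold that appears in the run output. The whitespace-based count is only meaningful for space-delimited languages.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker


def structural_issues(article: str, min_words: int = 300) -> list[str]:
    """Run the three rule-based checks and return a list of human-readable issues."""
    issues: list[str] = []
    # Whitespace split: a reasonable word count only for space-delimited text.
    word_count = len(article.split())
    if word_count < min_words:
        issues.append(f"Too short: {word_count} words (min {min_words})")
    if FENCE not in article:
        issues.append("No code block")
    # An H2 is a line that starts with exactly "## " followed by text.
    if not re.search(r"^## \S", article, flags=re.MULTILINE):
        issues.append("No H2 heading")
    return issues
```

An empty list means all structural checks passed, so the caller can gate the more expensive Judge call on it.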

Judge prompt template:

JUDGE_PROMPT = """You are an expert technical content reviewer.
Evaluate the following AI-generated technical article.

Article:
{article}

Scoring dimensions (1–5 each):
1. Technical accuracy
2. Depth
3. Clarity
4. Practical value

Respond in valid JSON only:
{{
  "technical_accuracy": <1-5>,
  "depth": <1-5>,
  "clarity": <1-5>,
  "practical_value": <1-5>,
  "summary": "<one sentence assessment>"
}}"""
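Turning the Judge's reply into a single number is a weighted sum over the four dimensions. The sketch below uses the weights listed for Level 3; the function name and the range check are assumptions, not the article's code.

```python
import json

# Weights for the four Level 3 dimensions; they sum to 1.0.
WEIGHTS = {
    "technical_accuracy": 0.35,
    "depth": 0.25,
    "clarity": 0.20,
    "practical_value": 0.20,
}


def weighted_score(judge_reply: str) -> float:
    """Parse the Judge's JSON reply and return the weighted 1-5 score, rounded to 2 places."""
    scores = json.loads(judge_reply)
    total = 0.0
    for dimension, weight in WEIGHTS.items():
        value = scores[dimension]
        if not 1 <= value <= 5:
            raise ValueError(f"{dimension} out of range: {value}")
        total += weight * value
    return round(total, 2)
```

Scores of 4/3/5/4 give 0.35×4 + 0.25×3 + 0.20×5 + 0.20×4 = 3.95, and 4/4/5/4 gives 4.20.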

Run Results

Part 1: Trigger Evaluation

Confusion matrix: TP=11  TN=8  FP=1  FN=0
Accuracy:  95%  (19/20)
Recall:    100%
Precision: 92%
F1:        0.96

19 of 20 cases correct. The single failure:

Input:    "帮我写一个解析 JSON 的 Python 函数"
          (Write me a Python function to parse JSON)
Expected: no_trigger
Got:      trigger
Reason:   "The user is asking for a technical function to parse JSON in Python,
           which falls under the skill's purpose of writing technical articles."

Part 2: Task Completion

[T001] Redis TTL configuration best practices (English)
  Level 2: ✓  All checks passed
  Technical accuracy: 4/5  |  Depth: 3/5  |  Clarity: 5/5  |  Practical value: 4/5
  Weighted score: 3.95/5

[T002] Python 类型注解入门文章 (Chinese; "Introductory article on Python type annotations")
  Level 2: ✗  Issues: ['Too short: 165 words (min 300)']
  Technical accuracy: 4/5  |  Depth: 3/5  |  Clarity: 5/5  |  Practical value: 4/5
  Weighted score: 3.95/5

Part 3: A/B Prompt Comparison

[Version A] Original system prompt
  Weighted: 4.20/5

[Version B] Improved prompt (pain-point hook + required checklist)
  Weighted: 4.20/5

Result: No significant difference (<0.1 delta)

Three Engineering Findings

Finding 1: One FP Exposes a Skill Description Gap

The failing case was: "Write me a Python function to parse JSON"

The model's reasoning: the user wants something written — a technical piece of Python code. The Skill description says writing technical articles or tutorials, so "write" triggered it.

The description specified what should trigger the Skill, but never excluded "write code." The model made a reasonable inference from an incomplete spec.

Adding one negative example fixes it:

Do NOT trigger when:
  - User asks for code snippets, functions, or scripts (write the code directly)

That single FP, with its reasoning, is more useful than F1=0.96. It points to the exact line in the Skill description that needs tightening.
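To keep failures and their reasoning front and center, the evaluator can print each misclassified case next to the aggregate metrics. A minimal sketch, with a hypothetical `TriggerResult` record and `failure_report` helper:

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    user_input: str
    expected: str   # "trigger" or "no_trigger"
    predicted: str  # same vocabulary, as returned by the evaluator
    reasoning: str  # the model's one-sentence explanation


def failure_report(results: list[TriggerResult]) -> list[str]:
    """Return one line per misclassified case, labelled FP or FN, with the model's reasoning."""
    lines = []
    for r in results:
        if r.predicted == r.expected:
            continue
        kind = "FP" if r.predicted == "trigger" else "FN"
        lines.append(f"[{kind}] {r.user_input!r}: {r.reasoning}")
    return lines
```

Each line in the report is a candidate edit to the Skill description: an FP suggests a missing exclusion, an FN a missing or too-narrow trigger phrase.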

Finding 2: Level 2 Word-Count Check Silently Fails on Chinese

T002 (Chinese article about Python type hints) reported "Too short: 165 words (min 300)" — but the article content was visually complete.

The issue is in the implementation:

word_count = len(article.split())  # splits on whitespace

Chinese text has no spaces between words. split() on a Chinese sentence returns one or a handful of tokens, not a word count in any meaningful sense. On top of that, the model wrapped its output in a ```markdown code block, further disrupting the count.
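The effect is easy to reproduce. The two sentences below are illustrative examples, not taken from the evaluated article:

```python
english = "Type hints make Python code easier to read and maintain."
chinese = "类型注解让Python代码更容易阅读和维护。"  # roughly the same sentence, no spaces

english_words = len(english.split())   # whitespace-delimited words
chinese_tokens = len(chinese.split())  # the whole sentence comes back as one token
chinese_chars = len(chinese)           # character count is the usable length signal here

print(english_words, chinese_tokens, chinese_chars)
```

The English sentence counts as 10 words, while the Chinese one counts as a single token no matter how long it is.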

"165 words" was actually a full Chinese article — likely 800+ characters of content.

Fix:

```python
import re


def count_content_length(text: str) -> int:
    """Characters for CJK text, words for Latin."""
    clean = re.sub(r"```.*?```", "", text, flags=re.DOTALL)  # strip code fences
    cjk = len(re.findall(r"[一-鿿]", clean))  # CJK Unified Ideographs, U+4E00 to U+9FFF
    if cjk > 50:
        return cjk
    return len(clean.split())
```

Level 2 checks look trivial but carry language assumptions. In Chinese/English mixed environments, branch the length check by detected language at minimum.
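The wrapping problem deserves its own step. If the whole article arrives inside a single ```markdown fence, a pattern that deletes fenced blocks would delete the article along with it. One option is to unwrap that outer fence before any length or structure check runs. The sketch below is an assumption about how that could look: `unwrap_outer_fence` is a hypothetical helper, and it only recognizes wrappers tagged `markdown` or `md`.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker


def unwrap_outer_fence(text: str) -> str:
    """If the whole reply sits inside one markdown-tagged fence, return just its body."""
    stripped = text.strip()
    # Opening fence tagged markdown/md on the first line, closing fence on the last line.
    pattern = rf"^{FENCE}(?:markdown|md)[ \t]*\n(.*)\n{FENCE}$"
    match = re.match(pattern, stripped, flags=re.DOTALL)
    return match.group(1) if match else text
```

Inner code blocks are left untouched, and text that is not wrapped is returned unchanged, so the helper is safe to call on every output.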

Finding 3: LLM Judges Have Real Limits on Fine-Grained Comparisons

Both prompts in the A/B test received identical scores — 4/4/5/4 across all four dimensions — resulting in a 4.20/4.20 tie.

Version B had three additions: open with a developer pain point, include at least 2 annotated code examples, end with a checklist. Observable, structural changes. The Judge scored them identically anyway.

Two explanations:

For this input and glm-4-flash, both prompts genuinely produced comparable output
The Judge's scoring resolution is too coarse — the 4-to-5 gap absorbs differences a human reviewer would notice

Mitigations:

Replace single-shot scoring with win-rate comparison (run 5+ comparisons, consider > 60% win-rate significant)
Ask the Judge to list specific advantages of each version, not just assign a score
Don't use the same model as both the Skill and the Judge — self-evaluation is unreliable; use a stronger model for judging
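The win-rate mitigation can be sketched as follows. `b_wins` stands in for a pairwise Judge call and is hypothetical, as are the function names; the 0.6 threshold is the one suggested above.

```python
from typing import Callable, Sequence


def win_rate(
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    b_wins: Callable[[str, str], bool],  # hypothetical pairwise Judge: True if B is better
) -> float:
    """Fraction of paired comparisons in which version B is preferred over version A."""
    if len(outputs_a) != len(outputs_b) or not outputs_a:
        raise ValueError("need equal, non-empty lists of outputs")
    wins = sum(1 for a, b in zip(outputs_a, outputs_b) if b_wins(a, b))
    return wins / len(outputs_a)


def is_significant(rate: float, threshold: float = 0.6) -> bool:
    """Treat a win rate strictly above the threshold as a meaningful difference."""
    return rate > threshold
```

With five pairs, four wins for B gives a rate of 0.8, which clears the threshold; three wins gives exactly 0.6, which does not.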

Implementation Roadmap

```plaintext
Week 1:
□ Build a complete test set for your highest-priority Skill (TP:TN:edge = 8:8:4)
□ Run manually, establish baseline Recall / Precision numbers

Week 2:
□ Add Task completion evaluation (Level 2 structural + Level 3 LLM-as-Judge)
□ Fix Chinese word-count bug
□ Record scores in the Skill's documentation

Week 3:
□ Modify the Skill, use the evaluation framework to verify improvement
□ Document one complete "modify → evaluate → conclude" cycle
```

Design Checklist

Trigger evaluation

[ ] Test set covers TP / TN / edge cases (8:8:4 is a reasonable starting ratio)
[ ] Skill description includes negative examples (Do NOT trigger when)
[ ] If F1 < 0.9, inspect the Skill description for ambiguous trigger language first

Task completion evaluation

[ ] Level 2 word count uses character count for Chinese, not split()
[ ] Level 2 strips markdown code fences before checking structure
[ ] Judge model is stronger than the model the Skill uses

A/B comparison

[ ] When single-shot scores tie, switch to multi-run win-rate (5+ comparisons)
[ ] Judge and subject should be different models

Summary

F1 scores don't tell you where to look — failures do: the single FP pointed directly at the ambiguous line in the Skill description. That's more actionable than 95% accuracy
Level 2 checks have hidden language assumptions: split() for Chinese is a bug; code-fence wrapping breaks length checks you thought were trivial
LLM judges need multiple samples for fine differences: a 4.20 vs 4.20 tie means the scoring resolution ran out, not that the prompts are equal

References

Promptfoo A/B testing
Full demo code: skill-01-trigger-eval
