The Two-Layer Problem
Standard software testing has one layer: did the code produce the right output? Skill evaluation has two:
Layer 1 — Trigger: Did the LLM decide this input should invoke this Skill?
Layer 2 — Execution: Did the Skill actually complete the task correctly?
Skip either layer and the evaluation is incomplete. A Skill with a 90% task success rate but 60% trigger recall completes only about 54% (0.6 × 0.9) of the requests it should handle, a far worse user experience than either number suggests on its own.
The test subject is rnd-technical-writer, a Skill that writes technical blog articles. The evaluation covers 20 trigger cases, two task-completion runs, and one A/B prompt comparison; all numbers come from actual output.
Evaluation Framework
Trigger Evaluation
Core metrics:
Recall = TP / (TP + FN) ← of all inputs that should trigger, how many did?
Precision = TP / (TP + FP) ← of all inputs that triggered, how many were correct?
F1 = 2 × Recall × Precision / (Recall + Precision)
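As a minimal sketch, the three formulas above can be computed directly from confusion-matrix counts. The names `TriggerMetrics` and `trigger_metrics` are illustrative and not taken from the demo code, and the counts in the usage line are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class TriggerMetrics:
    recall: float
    precision: float
    f1: float


def trigger_metrics(tp: int, fp: int, fn: int) -> TriggerMetrics:
    """Compute recall, precision, and F1 from confusion-matrix counts.

    Any metric whose denominator is zero is reported as 0.0.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return TriggerMetrics(recall, precision, f1)


# Illustrative counts only, not results from this evaluation.
m = trigger_metrics(tp=8, fp=2, fn=2)
print(f"recall={m.recall:.2f} precision={m.precision:.2f} f1={m.f1:.2f}")
```

True negatives do not appear in any of the three formulas, which is why the function does not take a `tn` argument.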
Test set composition (20 cases):
TP (True Positives — should trigger) ×8 ← explicit write/tutorial/deep-dive requests
TN (True Negatives — should not) ×8 ← questions, series planning, code help
EDGE (boundary cases) ×4 ← semantically ambiguous, mixed signals
Clear TP/TN cases are easy to get right. Edge cases expose where the Skill description is actually ambiguous.
Automation approach: Feed the Skill description + user input to the LLM, ask it to predict whether the Skill would trigger, and return structured JSON:
TRIGGER_EVAL_PROMPT = """You are evaluating whether a user message would trigger a specific AI Skill.
Skill specification:
{skill_description}
User message: "{user_input}"
Answer in valid JSON only:
{{
"prediction": "trigger" or "no_trigger",
"reasoning": "one sentence explanation"
}}"""
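One way to wire a prompt like this into a test loop is sketched below. It is an assumption-laden outline rather than the demo code: `call_llm` is a hypothetical callable (prompt text in, raw reply text out), the `input` and `expected` case fields are invented for illustration, and `fake_llm` is a stub so the example runs offline.

```python
import json
from typing import Callable


def predict_trigger(
    template: str,
    skill_description: str,
    user_input: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw reply out
) -> dict:
    """Format the prompt, call the model, and parse its JSON verdict."""
    prompt = template.format(
        skill_description=skill_description, user_input=user_input
    )
    raw = call_llm(prompt).strip()
    # Models sometimes wrap JSON in extra text; keep the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {"prediction": "invalid", "reasoning": raw}
    try:
        result = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return {"prediction": "invalid", "reasoning": raw}
    if result.get("prediction") not in ("trigger", "no_trigger"):
        result["prediction"] = "invalid"
    return result


def score_cases(cases: list[dict], predict: Callable[[str], dict]) -> dict:
    """Tally TP/TN/FP/FN; each case has 'input' and 'expected' keys."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "invalid": 0}
    for case in cases:
        got = predict(case["input"])["prediction"]
        expected = case["expected"]
        if got == "invalid":
            counts["invalid"] += 1
        elif got == "trigger":
            counts["TP" if expected == "trigger" else "FP"] += 1
        else:
            counts["FN" if expected == "trigger" else "TN"] += 1
    return counts


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call: "triggers" on the word "article".
    user_part = prompt.split("User message:")[-1]
    verdict = "trigger" if "article" in user_part else "no_trigger"
    return json.dumps({"prediction": verdict, "reasoning": "stub"})


# Shortened stand-in for the full prompt, using the same two placeholders.
template = (
    "Skill: {skill_description}\n"
    'User message: "{user_input}"\n'
    'Reply as JSON: {{"prediction": "...", "reasoning": "..."}}'
)
cases = [
    {"input": "Write an article about Redis TTL", "expected": "trigger"},
    {"input": "What is a TTL?", "expected": "no_trigger"},
]
counts = score_cases(
    cases,
    lambda text: predict_trigger(
        template, "Writes technical blog articles", text, fake_llm
    ),
)
print(counts)
```

Counting unparseable replies separately, instead of folding them into FN or FP, keeps formatting failures from distorting the trigger metrics.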
Task Completion Evaluation
Two levels of checking:
Level 2 (structural) — rule-based, no LLM required:
→ Word / character count above minimum
→ Code block present
→ At least one H2 section heading
Level 3 (quality, LLM-as-Judge) — 4 dimensions, 1–5 scale:
→ Technical accuracy (weight 35%)
→ Depth (weight 25%)
→ Clarity (weight 20%)
→ Practical value (weight 20%)
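The Level 2 checks need no model call, so they can be sketched in a few lines. The function name `structural_issues` is hypothetical, and the 300-word default mirrors the minimum reported in the run output. The whitespace-based count assumes space-delimited text.

```python
import re

FENCE = "`" * 3  # a Markdown code-fence marker


def structural_issues(article: str, min_words: int = 300) -> list[str]:
    """Rule-based Level 2 checks; an empty list means all checks passed."""
    issues = []
    word_count = len(article.split())  # whitespace split: English-style text only
    if word_count < min_words:
        issues.append(f"Too short: {word_count} words (min {min_words})")
    if article.count(FENCE) < 2:  # need an opening and a closing fence
        issues.append("No code block")
    if not re.search(r"^## \S", article, flags=re.MULTILINE):
        issues.append("No H2 heading")
    return issues


sample = "## Setup\n" + "word " * 300 + "\n" + FENCE + "python\nprint(1)\n" + FENCE
print(structural_issues(sample))
print(structural_issues("too short"))
```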
Judge prompt template:
JUDGE_PROMPT = """You are an expert technical content reviewer.
Evaluate the following AI-generated technical article.
Scoring dimensions (1–5 each):
1. Technical accuracy
2. Depth
3. Clarity
4. Practical value
Respond in valid JSON only:
{
"technical_accuracy": <1-5>,
"depth": <1-5>,
"clarity": <1-5>,
"practical_value": <1-5>,
"summary": "<one sentence assessment>"
}"""
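Given a reply in that JSON shape, the weighted score is a dot product of the four dimension scores with the rubric weights. The sketch below assumes the judge returns integers from 1 to 5; the `weighted_score` helper and the sample reply are illustrative.

```python
import json

# Weights from the Level 3 rubric above.
WEIGHTS = {
    "technical_accuracy": 0.35,
    "depth": 0.25,
    "clarity": 0.20,
    "practical_value": 0.20,
}


def weighted_score(judge_reply: str) -> float:
    """Parse the judge's JSON reply and return the weighted 1-5 score."""
    scores = json.loads(judge_reply)
    for name in WEIGHTS:
        value = scores[name]  # KeyError if the judge omitted a dimension
        if not 1 <= value <= 5:
            raise ValueError(f"{name} out of range: {value}")
    return round(sum(scores[name] * w for name, w in WEIGHTS.items()), 2)


# Example reply with scores of 4, 3, 5, and 4.
reply = (
    '{"technical_accuracy": 4, "depth": 3, "clarity": 5, '
    '"practical_value": 4, "summary": "ok"}'
)
print(weighted_score(reply))  # 3.95
```

Validating the range before weighting catches a judge that drifts outside the 1 to 5 scale, which would otherwise skew the score without any visible error.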
Run Results
Part 1: Trigger Evaluation
Confusion matrix: TP=11 TN=8 FP=1 FN=0
Accuracy: 95% (19/20)
Recall: 100%
Precision: 92%
F1: 0.96
19 of 20 cases correct. The single failure:
Input: "帮我写一个解析 JSON 的 Python 函数"
(Write me a Python function to parse JSON)
Expected: no_trigger
Got: trigger
Reason: "The user is asking for a technical function to parse JSON in Python,
which falls under the skill's purpose of writing technical articles."
Part 2: Task Completion
[T001] Redis TTL configuration best practices (English)
Level 2: ✓ All checks passed
Technical accuracy: 4/5 | Depth: 3/5 | Clarity: 5/5 | Practical value: 4/5
Weighted score: 3.95/5
[T002] Python 类型注解入门文章 (Introductory article on Python type annotations, in Chinese)
Level 2: ✗ Issues: ['Too short: 165 words (min 300)']
Technical accuracy: 4/5 | Depth: 3/5 | Clarity: 5/5 | Practical value: 4/5
Weighted score: 3.95/5
Part 3: A/B Prompt Comparison
[Version A] Original system prompt
Weighted: 4.20/5
[Version B] Improved prompt (pain-point hook + required checklist)
Weighted: 4.20/5
Result: No significant difference (<0.1 delta)
Three Engineering Findings
Finding 1: One FP Exposes a Skill Description Gap
The failing case was: "Write me a Python function to parse JSON"
The model's reasoning: the user wants something written — a technical piece of Python code. The Skill description says writing technical articles or tutorials, so "write" triggered it.
The description specified what should trigger the Skill, but never excluded "write code." The model made a reasonable inference from an incomplete spec.
Adding one negative example fixes it:
Do NOT trigger when:
- User asks for code snippets, functions, or scripts (write the code directly)
That single FP, with its reasoning, is more useful than F1=0.96. It points to the exact line in the Skill description that needs tightening.
Finding 2: Level 2 Word-Count Check Silently Fails on Chinese
T002 (Chinese article about Python type hints) reported "Too short: 165 words (min 300)" — but the article content was visually complete.
The issue is in the implementation:
word_count = len(article.split()) # splits on whitespace
Chinese text has no spaces between words, so split() on a Chinese sentence returns one or a handful of tokens, not a word count in any meaningful sense. On top of that, the model wrapped its output in a code fence tagged markdown, further disrupting the count.
"165 words" was actually a full Chinese article — likely 800+ characters of content.
Fix:
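```python
import re

def count_content_length(text: str) -> int:
    """Characters for CJK text, words for Latin."""
    # Strip only the fence marker lines (three backticks plus an optional
    # language tag); the content inside the fence is kept.
    clean = re.sub(r"`{3}[^\n]*", "", text)
    cjk = len(re.findall(r"[\u4e00-\u9fff]", clean))  # CJK Unified Ideographs
    if cjk > 50:
        return cjk
    return len(clean.split())
```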
Level 2 checks look trivial but carry language assumptions. In Chinese/English mixed environments, branch the length check by detected language at minimum.
Finding 3: LLM Judges Have Real Limits on Fine-Grained Comparisons
Both prompts in the A/B test received identical scores — 4/4/5/4 across all four dimensions — resulting in a 4.20/4.20 tie.
Version B had three additions: open with a developer pain point, include at least 2 annotated code examples, end with a checklist. Observable, structural changes. The Judge scored them identically anyway.
Two explanations:
- For this input and glm-4-flash, both prompts genuinely produced comparable output
- The Judge's scoring resolution is too coarse — the 4-to-5 gap absorbs differences a human reviewer would notice
Mitigations:
- Replace single-shot scoring with win-rate comparison (run 5+ comparisons, consider > 60% win-rate significant)
- Ask the Judge to list specific advantages of each version, not just assign a score
- Don't use the same model as both the Skill and the Judge — self-evaluation is unreliable; use a stronger model for judging
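The win-rate mitigation can be sketched as follows. Everything here is an assumption for illustration: `judge_pair` is a hypothetical pairwise judge that returns "first", "second", or "tie", and `longer_wins` is a toy stub standing in for a real judge model. Each pair is judged in both orders, since LLM judges can be sensitive to presentation order.

```python
from typing import Callable

# Hypothetical judge: (first, second) -> "first", "second", or "tie".
PairJudge = Callable[[str, str], str]


def win_rate(
    outputs_a: list[str], outputs_b: list[str], judge_pair: PairJudge
) -> float:
    """Fraction of decisive comparisons won by version B.

    Each pair is judged twice with the order swapped. Ties are excluded
    from the denominator; 0.5 is returned if every comparison ties.
    """
    b_wins = decisive = 0
    for a, b in zip(outputs_a, outputs_b):
        # B is "second" in the first ordering and "first" in the swapped one.
        verdicts = ((judge_pair(a, b), "second"), (judge_pair(b, a), "first"))
        for verdict, b_label in verdicts:
            if verdict == "tie":
                continue
            decisive += 1
            b_wins += verdict == b_label
    return b_wins / decisive if decisive else 0.5


def longer_wins(first: str, second: str) -> str:
    # Toy stub for demonstration only: prefers the longer text.
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


rate = win_rate(["aa", "bbb", "c", "d"], ["aaa", "bb", "cc", "d"], longer_wins)
print(f"B win rate: {rate:.0%}")
```

With only a handful of comparisons the estimate is noisy, so a threshold such as 60% is better read as a rough signal than as a statistical test.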
Implementation Roadmap
```plaintext
Week 1:
□ Build a complete test set for your highest-priority Skill (TP:TN:edge = 8:8:4)
□ Run manually, establish baseline Recall / Precision numbers
Week 2:
□ Add Task completion evaluation (Level 2 structural + Level 3 LLM-as-Judge)
□ Fix Chinese word-count bug
□ Record scores in the Skill's documentation
Week 3:
□ Modify the Skill, use the evaluation framework to verify improvement
□ Document one complete "modify → evaluate → conclude" cycle
```
Design Checklist
Trigger evaluation
- [ ] Test set covers TP / TN / edge cases (8:8:4 is a reasonable starting ratio)
- [ ] Skill description includes negative examples (Do NOT trigger when)
- [ ] If F1 < 0.9, inspect the Skill description for ambiguous trigger language first
Task completion evaluation
- [ ] Level 2 word count uses character count for Chinese, not split()
- [ ] Level 2 strips markdown code fences before checking structure
- [ ] Judge model is stronger than the model the Skill uses
A/B comparison
- [ ] When single-shot scores tie, switch to multi-run win-rate (5+ comparisons)
- [ ] Judge and subject should be different models
Summary
- F1 scores don't tell you where to look; failures do: the single FP pointed directly at the ambiguous line in the Skill description. That's more actionable than 95% accuracy.
- Level 2 checks have hidden language assumptions: split() for Chinese is a bug; code-fence wrapping breaks length checks you thought were trivial.
- LLM judges need multiple samples for fine differences: a 4.20 vs 4.20 tie may mean the scoring resolution ran out, not necessarily that the prompts are equal.
References
- Promptfoo A/B testing
- Full demo code: skill-01-trigger-eval