DEV Community

WonderLab
Is Your Skill Actually Good? Systematically Validating Agent Skills with Evals

How Did You Last Validate Your Skill?

You finished writing a skill, triggered it manually a couple of times, the output looked reasonable — and then you shipped it.

That's probably the full validation workflow for most people. Slightly embarrassing to admit, but true. We write unit tests and run CI for regular code. But when it comes to Skills, we somehow regress to the era of "going by feel."

The problem isn't laziness. The problem is we don't have a clear picture of how Skills quietly fail — and we don't have a shared vocabulary for what "good" even means.

This article addresses both. First, we'll map out the failure paths. Then we'll use that failure map to reverse-engineer a validation system.


How Do Skills Fail?

Before talking about how to test, let's think through what can go wrong. Skill failures typically follow four paths — and they tend to be quiet. No loud errors, just results that are "a bit off."

Path 1: The Skill Never Triggered

This is the most invisible failure. The user said "format my code," but the Agent never invoked your code-formatting Skill — it just used its own knowledge to make some changes that seemed reasonable.

The root cause is usually a vague description field in SKILL.md. Too broad and it conflicts with other Skills; too narrow and it misses a wide range of legitimate triggers. The tricky part: this failure is nearly impossible to catch in manual testing, because you test using the most canonical trigger phrases, while real users express the same intent in a thousand different ways.

Path 2: Triggered, But the Task Wasn't Completed

The Skill was invoked, tools were called, but the job didn't get done. Maybe three files were supposed to be created and only two were. Maybe a migration script started but exited early.

This is Outcome failure — the most direct and impactful type. Users see the result. If the result is wrong, the Skill might as well not exist.

Path 3: The Right Result, the Wrong Path

This one is subtler. The final output looks fine, but the execution path was wrong: the wrong tools were called, the steps happened out of order, or the Agent took a long detour to get there.

Example: a database migration Skill where the correct sequence is "backup → migrate → verify." If the Agent migrates first and backs up second, the output files might look identical — but the next time a migration fails, you have no usable backup. This is Process failure, completely invisible to result-only validation.

Path 4: Completed, But Below Quality Bar

The task finished. The process was correct. But: the generated code doesn't match the project's style conventions. The commit message format doesn't follow your team's standards. The task used 500 tokens when 100 would have sufficed.

This is Style and Efficiency failure. It won't throw an error. It accumulates silently as technical debt, team friction, and rising costs.


Defining "Success": Four Validation Dimensions

These four failure paths map directly to four success criteria. Until you've defined all four, you haven't really specified what the Skill is supposed to do.

| Dimension | Corresponding Failure | Core Question |
| --- | --- | --- |
| Outcome | Task not completed | Did it do what it was supposed to do? |
| Process | Wrong execution path | Were the right tools used in the right order? |
| Style | Quality below bar | Does the output conform to conventions? |
| Efficiency | Wasted resources | Any unnecessary detours? Reasonable token usage? |

Here's how to validate each one.


Validating Outcome: Deterministic Checks

Outcome is the most quantifiable dimension — best validated with deterministic graders: parse the run log or inspect filesystem state to confirm whether the task completed.

Build a Small Test Set First

You don't need hundreds of test cases. Ten to twenty is enough — but they need to cover three types:

```
Explicit trigger:  "/use code-formatter please format this file"
Implicit trigger:  "this code looks messy, can you clean it up?"
Negative control:  "write me a sorting algorithm" (should NOT trigger the formatter)
```

Negative controls are especially important. They check whether the Skill is being triggered when it shouldn't be — over-triggering is just as much a problem as never triggering.
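To make this concrete, here's a minimal sketch of a trigger-test harness. It assumes each test case already has a JSONL run log on disk, and that a skill invocation shows up as a `skill_invoked` event with a `skill` field — those event and field names are illustrative assumptions, not part of any official log format, so adapt them to what your agent actually emits.

```python
import json
from pathlib import Path

# Hypothetical test cases: (prompt, path to run log, should the skill trigger?)
CASES = [
    ("/use code-formatter please format this file", "explicit.jsonl", True),
    ("this code looks messy, can you clean it up?", "implicit.jsonl", True),
    ("write me a sorting algorithm", "negative.jsonl", False),
]

def skill_triggered(log_path, skill_name="code-formatter"):
    """Return True if the run log contains an invocation of the skill.

    The "skill_invoked" event type and "skill" field are assumptions
    about the log format — match them to your agent's actual output.
    """
    lines = Path(log_path).read_text().splitlines()
    events = [json.loads(line) for line in lines if line.strip()]
    return any(
        e.get("type") == "skill_invoked" and e.get("skill") == skill_name
        for e in events
    )

def run_trigger_suite(cases):
    """Return (prompt, failure-kind) pairs for every case that misbehaved."""
    failures = []
    for prompt, log, expected in cases:
        if skill_triggered(log) != expected:
            kind = "under-trigger" if expected else "over-trigger"
            failures.append((prompt, kind))
    return failures
```

Running the negative-control case through the same harness is what surfaces over-triggering: an empty failure list means both trigger behavior and restraint check out.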

Deterministic Assertions on JSON Output

codex exec --json produces a structured JSONL run log containing details of every tool call. For Outcome validation:

```python
import json, sys

# Load run log
with open("run_output.jsonl") as f:
    events = [json.loads(line) for line in f]

# Check whether target files were created
created_files = [
    e["path"] for e in events
    if e.get("type") == "file_write"
]

expected = ["src/index.ts", "src/types.ts", "README.md"]
missing = [f for f in expected if f not in created_files]

if missing:
    print(f"❌ FAIL: The following files were not created: {missing}")
    sys.exit(1)
else:
    print("✅ PASS: All expected files were created")
```

The advantage here: these checks are deterministic — no model judgment involved, so results are stable and reproducible. Keep Outcome-level Evals this way, and save the subjective assessment for Rubric scoring later.


Validating Process: Tool Call Sequence Verification

Outcome checks tell you "was the job done?" Process checks tell you "how was it done?"

For Skills with defined execution ordering requirements, you need to verify the sequence of tool calls:

```python
# Extract all tool calls in order
tool_calls = [
    e["tool"] for e in events
    if e.get("type") == "tool_use"
]

# Define the expected call sequence
expected_sequence = ["db_backup", "db_migrate", "db_verify"]

# Check whether it appears as a subsequence (other tools allowed between)
def is_subsequence(expected, actual):
    it = iter(actual)
    return all(step in it for step in expected)

if not is_subsequence(expected_sequence, tool_calls):
    print("❌ FAIL: Tool call sequence doesn't match expected")
    print(f"   Expected to contain: {expected_sequence}")
    print(f"   Actual calls: {tool_calls}")
    sys.exit(1)
else:
    print("✅ PASS: Tool call sequence is correct")
```

Process validation has a more advanced use: detecting command thrashing — the Agent repeatedly retrying the same operation, bouncing back and forth. This usually signals that the Skill's instructions are ambiguous enough that the Agent is flailing. Detect it by counting consecutive repeated calls:

```python
from itertools import groupby

for tool, group in groupby(tool_calls):
    count = sum(1 for _ in group)
    if count > 3:
        print(f"⚠️  WARNING: '{tool}' called {count} times consecutively — possible thrashing")
```
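The same run log can also feed a deterministic budget guard before any model-based scoring. Here's a sketch that sums token usage across the run — the `token_count` event type and `usage.total_tokens` field are assumptions about the log format, so match them to what your log actually contains:

```python
def check_token_budget(events, budget=100_000):
    """Sum token usage across run-log events and warn when over budget.

    Assumes usage events look like
    {"type": "token_count", "usage": {"total_tokens": N}} —
    these field names are illustrative, not a guaranteed format.
    """
    total = sum(
        e.get("usage", {}).get("total_tokens", 0)
        for e in events
        if e.get("type") == "token_count"
    )
    if total > budget:
        print(f"⚠️  WARNING: run used {total} tokens (budget: {budget})")
    return total
```

A hard budget like this catches gross waste cheaply; the finer "was this detour necessary?" judgment is what the Rubric scoring below is for.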

Validating Style and Efficiency: Rubric-Based Model Scoring

Outcome and Process are factual checks — pass or fail, black or white. Style and Efficiency are qualitative judgments: is the code style right? Is the commit message format correct? Did the Agent take unnecessary detours? These don't have a single right answer.

This is where you switch tools: let another model score the output — but give it a clear rubric and use --output-schema to force structured JSON responses, making scores comparable across runs.

Define Your Rubric

```yaml
# rubric.yaml
criteria:
  - name: code_style
    description: "Does the generated code conform to the project's ESLint rules?"
    scale: [1, 5]
    anchor_1: "Totally non-conformant, many violations"
    anchor_5: "Fully conformant, zero violations"

  - name: commit_format
    description: "Does the commit message follow the Conventional Commits specification?"
    scale: [1, 5]
    anchor_1: "Format completely wrong"
    anchor_5: "Fully correct: type, scope, and description all proper"

  - name: efficiency
    description: "Did the Agent take obvious redundant steps or make unnecessary tool calls?"
    scale: [1, 5]
    anchor_1: "Lots of redundancy, chaotic execution"
    anchor_5: "Clean and efficient execution path"
```

Force Structured Output with --output-schema

```python
import subprocess, json, tempfile, os

output_schema = {
    "type": "object",
    "properties": {
        "code_style":    {"type": "integer", "minimum": 1, "maximum": 5},
        "commit_format": {"type": "integer", "minimum": 1, "maximum": 5},
        "efficiency":    {"type": "integer", "minimum": 1, "maximum": 5},
        "reasoning":     {"type": "string"}
    },
    "required": ["code_style", "commit_format", "efficiency", "reasoning"]
}

# --output-schema takes a path to a JSON Schema file, so write it out first
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(output_schema, f)
    schema_path = f.name

result = subprocess.run(
    ["codex", "exec", "--output-schema", schema_path,
     "--", "Evaluate the following run output for code style compliance..."],
    capture_output=True, text=True
)
os.unlink(schema_path)

scores = json.loads(result.stdout)
print(f"Code style:    {scores['code_style']}/5")
print(f"Commit format: {scores['commit_format']}/5")
print(f"Efficiency:    {scores['efficiency']}/5")
print(f"Reasoning:     {scores['reasoning']}")
```

The core value of structured output: cross-version comparability. You adjust a line in the Skill's instructions, re-run the Eval, and the Style score goes from 3.2 to 4.1. That's a trustworthy improvement signal — not "it feels better now."
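One way to operationalize that comparison: average the rubric scores across your test set and diff against a saved baseline. A minimal sketch — the 0.3-point tolerance is an arbitrary choice, and the score dicts are assumed to have the shape the schema above enforces:

```python
from statistics import mean

def average_scores(score_dicts):
    """Average each numeric rubric criterion across test cases."""
    keys = [k for k in score_dicts[0] if isinstance(score_dicts[0][k], (int, float))]
    return {k: round(mean(s[k] for s in score_dicts), 2) for k in keys}

def compare_to_baseline(current, baseline, tolerance=0.3):
    """Return {criterion: (baseline, current)} for scores that regressed
    by more than `tolerance` points."""
    return {
        k: (baseline[k], current[k])
        for k in current
        if k in baseline and current[k] < baseline[k] - tolerance
    }
```

Commit the baseline dict alongside the Skill; a non-empty regression dict on re-run is your signal to look before merging.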


Progressive Stacking: Let Your Evals Grow with Your Skill

You don't need to build all four dimensions at once. Skill validation should iterate alongside the Skill itself.

Phase 1 (Skill just written): Run manual tests, confirm basic Outcome looks right.

Phase 2 (Ready to share): Add deterministic Outcome checks, build 10 test cases.

Phase 3 (Team is using it): Add Process sequence validation, add Style Rubric scoring.

Phase 4 (Production-critical path): Add command thrashing detection, token usage monitoring, build validation, runtime smoke tests.

This progressive approach has one key benefit: you build the Eval habit when the Skill is simplest, without getting blocked waiting to build the full system. A Skill with two Outcome checks is meaningfully safer than one with none.

As your Eval suite matures, wire it into your CI pipeline so every Skill change triggers an automatic run. At that point, you're iterating on Skills with real confidence — not gambling on "this should be fine."
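The CI entry point can be as simple as a script that runs every check and exits non-zero on any failure. A sketch with placeholder checks standing in for your real Outcome/Process graders:

```python
import sys

def run_eval_suite(checks):
    """Run named check callables; return the names of those that failed.

    `checks` is a list of (name, callable) pairs — the names used in
    main() below are placeholders, not real graders.
    """
    failed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:
            print(f"❌ {name}: raised {exc!r}")
            failed.append(name)
            continue
        if ok:
            print(f"✅ {name}")
        else:
            print(f"❌ {name}")
            failed.append(name)
    return failed

def main():
    failures = run_eval_suite([
        ("outcome: expected files created", lambda: True),  # stand-in check
        ("process: tool order correct", lambda: True),      # stand-in check
    ])
    sys.exit(1 if failures else 0)
```

Call `main()` from a CI step; the non-zero exit code is what fails the build when any dimension regresses.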


Summary

Back to the opening question: is your Skill actually good?

Now we have a framework to answer it:

| You want to know… | Use this |
| --- | --- |
| Was the task completed? | Deterministic checks (parse JSONL, verify files/state) |
| Were the steps correct? | Tool call sequence validation + thrashing detection |
| Was the quality good? | Rubric model scoring (structured JSON output) |
| Was anything wasted? | Token usage tracking + redundant step detection |

Good Evals do two things: make regressions clear — you know exactly what change caused the score to drop; and make failures explainable — not "something feels off" but "the tool call sequence was wrong at step 3."

That's what gives you the confidence to keep improving your Skills without second-guessing every change.


Source: Core methodology from the OpenAI developer blog — Testing Agent Skills Systematically with Evals.


🎉 Thanks for reading — let's enjoy what technology has to offer!

Visit my personal homepage for all resources I share: Homepage
