A peer showed me their verification pipeline last week. We were both running coding agents over real tickets, both wanted "did the agent actually finish the work" to be a hard gate, and each of us had built something for it. Theirs was a Claude Code Skill — a Markdown file with a few prompt fragments and tool permissions, invoked by the agent itself when it thought verification was warranted. Mine was a deterministic shell command pinned into a workflow step that runs every time, no agent judgment involved. Same intent. Very different operational shape.
The conversation that followed — "why did you pick the one you picked, and where does it bite you" — is what this post is. I'm building an AI dev harness called Codens; what's relevant here is that its verification layer is on the command side of this divide, and my read on Skill-based verification is partly observational from peers running them and partly from the Skills I do have installed for developer-side work in the same repo. I'll be honest about which is which.
What "Skill-based verification" looks like
A Skill in Claude Code is a Markdown-defined behavior contract. It lives in .claude/skills/<name>/SKILL.md (project-scoped) or in the user's home directory, has a YAML frontmatter with a name and a description, and a body that's free-form prompt text plus optional references to scripts, sub-agents, or tool permissions. The agent reads the available Skills' descriptions on every turn and decides — in the same way it decides whether to call a tool — whether the current situation warrants invoking one.
The shape of a verification Skill is something like:
```markdown
---
name: verify-task
description: |
  Run after implementing a Notion ticket. Checks that lint, types,
  and unit tests pass for the touched code paths, and reports back
  in plain English with a recommended next action.
---

# Verify Task

When invoked, do the following:

1. Read the diff and infer which packages were touched.
2. For each touched package, run the appropriate lint/type/test commands.
3. If everything passes, return a one-line summary.
4. If something fails, summarize the smallest reproducer, link to the
   failing line, and suggest one concrete fix.
```
The strengths are what you'd expect from giving an LLM a verification rubric instead of a script:
- Contextual judgment. The Skill decides what to verify based on the diff. A pure-frontend change skips the backend test suite.
- Natural-language failure explanations. When something breaks, the agent doesn't dump 1500 bytes of stderr — it summarizes: "The test `test_normalize_uv_venv` fails because the function now strips a trailing slash the test expects to remain."
- Output that varies in shape with the run. A clean pass returns a one-liner; a messy failure returns a structured triage.
- The agent decides if and when to call it. No engine logic to wire.
That last property is also where Skill-based verification gets a little scary. It depends on the agent reading the room correctly — exactly the property that makes me reluctant to put production gating on it.
## What `verify_commands` looks like
The mechanism on the other side is what Codens uses internally. The unit of verification is a shell command (or a chain of them) declared on the ticket itself, and the workflow engine runs it as a step. No model in the loop, no judgment about whether to run, no judgment about how to interpret the result. Pass means exit 0. Fail means non-zero. Output is captured verbatim and the tail surfaces in the failure message.
The runner is small. Here's the actual thing, from `purple-codens/backend/src/infrastructure/workflow/steps/run_tests.py`:
```python
test_command = config.get("test_command", "npm run test")

# Interpolate context variables (e.g., {verify_commands} from Notion)
try:
    test_command = test_command.format(**context.variables)
except (KeyError, ValueError):
    pass

# Build final command: prepend project-level defaults before task-specific commands.
# Each chunk runs inside its own subshell `(...)` so a `cd dir && ...` step
# in the project default does not leak its pwd into the task-level verify and
# cause the next `cd dir` to fail. The agent service runs the final string
# under /bin/sh (dash on Debian/Ubuntu), where a failed `cd` returns exit 2 —
# which RunTestsStep would otherwise surface as the misleading
# "Tests failed (exit code: 2)".
default_verify = context.get_var("default_verify_commands")
task_verify = context.get_var("verify_commands")
if default_verify or task_verify:
    parts = [f"({p})" for p in [default_verify, task_verify] if p]
    test_command = " && ".join(parts)
```
Three things worth pointing out:
- Two layers chained. `default_verify_commands` is the project-level "always run this first" — typically `format && lint && typecheck`. `verify_commands` is the task-level — usually a targeted test that exercises the specific change. They concatenate with `&&`, so defaults must pass before task-specific runs.
- Subshell wrapping. Each chunk is wrapped in `(...)` so a `cd dir && ...` in one part doesn't leak its working directory into the next. The kind of thing you only fix after staring at "Tests failed (exit code: 2)" in Slack and realizing dash exits 2 when `cd` fails.
- The preset wires it. From `notion_compatible.json`:
```json
{
  "id": "verify",
  "type": "run_tests",
  "config": {
    "test_command": "{verify_commands}",
    "timeout_seconds": 300
  },
  "on_success": "notify_success",
  "on_failure": "fix_verify"
}
```
When verify fails, the surfaced error includes the tail of the captured output — not just the exit code — so the next step has something concrete to feed back to the agent:
```python
output = result.get("output", "") or ""
_ERROR_TAIL_BYTES = 1500
tail = output[-_ERROR_TAIL_BYTES:] if len(output) > _ERROR_TAIL_BYTES else output
exit_code = result.get("exit_code", "unknown")

if tail.strip():
    msg = f"Tests failed (exit code: {exit_code}). Last output:\n{tail}"
else:
    msg = f"Tests failed (exit code: {exit_code}); agent reported no output"

return StepResult.error(
    msg,
    {
        "test_output": output,
        "previous_step_output": output,
    },
)
```
`previous_step_output` is the variable the next `claude_code` step (`fix_verify`) interpolates into its prompt: "Verification failed. Fix the issues. Output: <last 1500 bytes>." The agent gets stderr to react to, but the decision about whether verification passed or failed is made by the engine reading the exit code — not by the agent reading the output.
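A sketch of the receiving end. The prompt template below is illustrative, not the shipped `fix_verify` preset; the mechanic just mirrors the `.format(**context.variables)` interpolation that `run_tests` uses for `{verify_commands}` above.

```python
# Illustrative only: how previous_step_output lands in the next step's prompt.
prompt_template = "Verification failed. Fix the issues.\nOutput:\n{previous_step_output}"
context_variables = {
    "previous_step_output": "FAILED tests/test_normalize_uv_venv.py::test_trailing_slash",
}
prompt = prompt_template.format(**context_variables)
```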
Strengths, mirroring the Skill side:
- It always runs. No "did the agent reach for the Skill" question. Step exists, step executes, exit code resolves.
- Bool pass/fail. Engine routes on `on_success` or `on_failure`. No interpretation layer.
- Structured failure output. A 1500-byte tail of real stderr, not an LLM's summary of it. Sometimes the LLM is wrong about what failed; the raw tail is not.
- Drop into CI. The same command line runs in the workflow, in GitHub Actions, and on a contributor's laptop. Portable in a way a Skill is not.
## Where each one is the better fit
Both mechanisms can do "verification." The interesting question is when each one is ergonomically the right tool, not just when it's technically possible.
Skill is the better fit when:
- The check requires contextual judgment about what to look at. "Did the PR include sensible test coverage for the changed lines" is a Skill question — what counts as sensible varies by file, by convention, by whether the change is a refactor or new behavior. A command can't make that judgment without becoming an LLM under the hood.
- The output is a report, not a gate. "Summarize the risk of this change" produces prose that varies with the run. Forcing it into a deterministic command's stdout feels backwards.
- The check is opportunistic — useful when relevant, skippable when not. The agent deciding "this is a docs-only change, verify-task doesn't apply" is the right behavior. A command would have to bake that branch in.
- The check is for the human reviewer. PR descriptions, change-impact estimates, "what else might this affect" — content for the conversation log, not a gate that blocks the PR from existing.
Command is the better fit when:
- The check is a gate. The pipeline must not advance if it fails. There must be no path where the agent decided not to call it.
- The check has a natural deterministic implementation. Linters, type checkers, test runners, schema diffs, contract tests against an OpenAPI spec — they exist already and have spent years getting their exit codes right. Wrapping them in a Skill adds an interpretation layer that only makes them less reliable.
- You want one source of truth across local, CI, and the agent loop. The same `npm run test:e2e` runs in all three. If the Skill version wins, you now have two truth sources, and they will eventually disagree.
- The failure output is load-bearing. When the agent gets routed back into a fix loop with "Verification failed. Output: <tail>," it needs the actual stderr, not a paraphrase. Paraphrases drop information the next turn needs.
A meta-property: commands compose with other commands; Skills compose less well with non-LLM tooling. If verify also gates Slack, the merge step, the deploy — that's engine-level routing, and the engine wants bool.
## The traps each side hits
Skill side, failure modes I keep seeing in peers' setups:
- The agent forgets to invoke. The Skill description matched ten previous turns; this turn the agent is mid tool-use chain and never reaches for it. No "did you remember to verify" hook — just the agent's running judgment, which is a function of the prompt window at that exact turn.
- Skill bloat eats context. Each installed Skill's description sits in the context window every turn. A repo with twenty Skills pays twenty descriptions of attention budget for the chance one matches. Verification needs to be always-available, which means its description has to fit in the "always relevant" slot.
- Hard to verify which version is running. Skills get edited; the agent reads the file at invocation time. Did the last fix actually take effect? With a command, the answer is `git log -- path/to/script`. With a Skill, it's "let me read the file and hope no one edited it in the last hour."
- Self-reports are not gates. A Skill returning "verification passed" is an LLM telling you the LLM thinks it passed. Fine for a PR comment or triage note. Not fine for "should this merge to main." The pass-signal is structurally weaker than an exit code, and the only way to upgrade it is to put a deterministic check inside the Skill — at which point the Skill is just a shell wrapper around a command.
Command side, scars I have from each:
- Feedback latency. Commands take minutes. A full verify chain (format → lint → typecheck → test) is slower. A Skill that says "looks fine" in three seconds feels great by comparison — until it's wrong.
- Output-size overflow. Captured stdout can be megabytes. We tail to 1500 bytes for the error message; truncation can land mid-error. If the traceback is at the top of the output and the bottom is just pytest's summary, the tail you surface is the wrong tail (see the sketch after this list).
- Sloppy exit codes hide "why." Plenty of CLI tools exit non-zero for both "your code failed" and "I couldn't even start." `cd nonexistent && pytest` exits 2 from the failed `cd`, which the runner reports as "Tests failed (exit code: 2)" — same shape as a real test exit 2. The subshell-wrapping fix landed precisely because this confusion was costing afternoons.
- No graceful "doesn't apply." A command that runs `pytest backend/tests/` on a frontend-only PR runs the whole suite anyway, or fails because backend fixtures aren't set up on this branch. There's no built-in "skip if irrelevant" — either the command short-circuits itself, or the engine adds routing logic. Both are work the Skill gets for free from the agent's judgment.
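On the wrong-tail problem, one possible mitigation is to prefer the last failure marker over a blind byte tail. This is a sketch of that idea, not what the runner above ships:

```python
# Possible mitigation (sketch): anchor the error excerpt at the last
# traceback/failure marker so truncation doesn't cut the useful part away.
_ERROR_TAIL_BYTES = 1500
_MARKERS = ("Traceback (most recent call last):", "\nFAILED ", "\nERROR ")

def error_tail(output: str, limit: int = _ERROR_TAIL_BYTES) -> str:
    last = max(output.rfind(m) for m in _MARKERS)
    if last != -1:
        # Keep the start of the failure block; the error line is usually near it.
        return output[last:last + limit]
    # No marker found: fall back to the raw byte tail, as today.
    return output[-limit:]
```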
## My current division of labor (provisional)
In Codens, the production workflow is commands all the way down. Verify is a `run_tests` step. Conflict checks are a `check_conflicts` step. CI gating is a `wait_ci_checks` step. The agent (`claude_code` step) is the thing that makes the changes; everything that judges whether a change is acceptable is deterministic. The reasoning is the gating property — these checks must run, must pass before the next step, and must produce structured output that the next agent step can react to.
What I do have on the Skill side is developer-facing: the Skills in `purple-codens/.claude/skills/` are `test-generator`, `pr-creator`, and `bug-analyzer`. These run when I'm working in Claude Code locally on the Codens repo itself — when I ask "add tests for this," the test-generator Skill picks up. They are not invoked by the production workflow that runs on customer code. So the honest answer to "does Codens use Skills" is "yes, for the human-in-Claude-Code experience inside the repo; no, for the autonomous workflow that ships to customers." Different mechanisms for different operating modes, even though the underlying tasks (generate tests, write PR descriptions, analyze bugs) are similar.
A first-pass mapping of what I've settled on for which mechanism does what:
| Task | Mechanism | Why |
|---|---|---|
| Lint, type check, unit test | command | Gate. Already a tool with a good exit code. Composes with CI. |
| Run task-specific verify | command | Gate. Output is load-bearing for `fix_verify`. |
| Generate PR description | Skill | Report, not a gate. Wants prose. Varies with run. |
| Analyze bug from stack trace | Skill | Contextual. Output is for the human reading triage. |
| Estimate change impact across files | Skill | Judgment-heavy. Doesn't gate anything. |
| Conflict detection | command | Gate. Engine routes on result. |
| Test generation | Skill | One-shot, in-conversation. Not a workflow step. |
| Wait for CI checks | command | External signal, deterministic poll. |
The pattern that fell out: gating belongs to commands, judgment-as-a-report belongs to Skills. When I find myself wanting "the gate decision should depend on the agent's judgment," that's usually a sign that the gate is wrong, not that I should put a Skill behind it. Either tighten the command so it fails on the right things, or stop trying to gate that property.
## The unresolved question
Both mechanisms have escape hatches that, if I leaned on them, would blur the distinction.
On the Skill side, you could imagine a hook — "force the agent to invoke this Skill before exiting" — that turned the Skill into a mandatory step. That removes the "agent forgets to invoke" trap, but in exchange you've recreated the command. The Skill's contextual judgment now happens in a fixed slot, the engine routes on its output, and the only difference from a command is that the Skill body is Markdown instead of a shell script. It's not clear that buys you anything except a worse failure-output story.
On the command side, you could add "skip this command if X" routing — if `diff_files` match `frontend/**`, skip `backend/run_tests` — and recover some of the Skill's irrelevance handling. That works for a few cases, but the routing logic accumulates: skip if frontend-only, skip if docs-only, skip if dependency-only, skip if revert. Each new condition is a string of YAML or a few lines of Python. At some point you've hand-coded the agent's judgment into a rule engine, and you'd have been better off just letting the model decide.
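A sketch of how that accumulation looks in practice. The names (`diff_files`, `should_skip_backend_verify`) are illustrative, not part of Codens:

```python
def should_skip_backend_verify(diff_files: list[str]) -> bool:
    """Skip the backend verify step when nothing backend-facing changed."""
    def irrelevant(path: str) -> bool:
        return path.startswith("frontend/") or path.endswith(".md")
    # Each new carve-out ("dependency-only", "revert", ...) becomes another
    # predicate here: the hand-coded rule engine the paragraph warns about.
    return bool(diff_files) and all(irrelevant(p) for p in diff_files)
```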
So maybe the real shape is that "Skill vs command" is the two ends of a slider labeled how much judgment do I delegate to the model. All the way to the command end: zero delegation, the engine decides everything. All the way to the Skill end: full delegation, the model decides whether and how to verify. The interesting question isn't "which one" but "how much delegation does this specific check tolerate." Gates tolerate very little. Reports tolerate a lot. Most things sit somewhere on the slider, and the choice between Skill and command is really a choice about where on the slider to draw the line for that one check.
The version of this I haven't resolved: what's a check that genuinely benefits from being on the slider's middle, not at either end? "Verify-with-judgment" — the engine gates on a deterministic signal, but the interpretation of the signal for the next agent's prompt is generated by a Skill. That'd give you the gate's certainty and the Skill's prose. I haven't built it yet. I suspect when I do, the seam between the two halves — "the deterministic part decided pass/fail, the prose part explains why" — is where the next class of bugs lives.
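One shape that middle could take, as a sketch rather than anything Codens ships; `summarize_failure` stands in for whatever Skill or model call would produce the prose:

```python
import subprocess

def summarize_failure(output: str) -> str:
    """Stand-in for the model call that turns raw output into prose.
    Hypothetical; its answer is never consulted for the pass/fail decision."""
    return f"(model-generated summary of {len(output)} bytes of verify output)"

def verify_with_judgment(test_command: str) -> dict:
    proc = subprocess.run(test_command, shell=True, capture_output=True, text=True)
    passed = proc.returncode == 0          # deterministic gate; the engine routes on this
    output = proc.stdout + proc.stderr
    explanation = summarize_failure(output) if not passed else None
    return {"passed": passed, "raw_output": output, "explanation": explanation}
```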
## Closing
If you've shipped both Skill-based and command-based verification in the same agent system and have a heuristic for when to use which, I'd want to hear your rule. My current rule — if it gates, it's a command; if it reports, it's a Skill — feels right but I've only been running it for a few months, and I haven't seen it under the load that breaks it. The peer who showed me their Skill-based pipeline was running it for a smaller, less gated workflow than mine, and we both came away unconvinced that either of us had the universally right answer.
The other thing I'm curious about is the middle of the slider: have you found a way to combine the two — deterministic gate plus agent-generated explanation — that doesn't decay into either "the explanation is unreliable so we ignore it" or "the gate is too rigid so we override it"? That's where I think the next interesting design lives, and I don't yet have the shape of it.