DEV Community

How to Write Terminal Skills That AI Agents Can Actually Use

Alex Shev on June 08, 2026

Most AI agent advice still sounds like prompt advice. Add more context. Write clearer instructions. Give the model examples. Use a better system p...
Collapse
 
jugeni profile image
Mike Czerwinski

The Stop Conditions section is the one most other writing on agent skills underwrites, and it is the one that does the work. Most documentation of agentic workflows lives almost entirely in the happy path. Stop conditions are what turn a skill from a confident demo into something an operator can deploy and walk away from.

The framing of skill-as-contract is what makes the post generalize. A contract is what an operator-side decision record is supposed to be: a written down promise of trigger, workflow, output, verification, and refusal. Most agent stacks have prompts and tools but skip the contract layer, which means every run is improvised. Your point that the contract does not make the agent less intelligent, it makes the work less dependent on fresh reasoning every time, is the part that should sit on a wall somewhere.

One small bridge that may be useful for people coming from team work: your "produce this output, run these checks, stop under these conditions" is structurally what a Definition of Done is in a lean or agile context. The vocabulary is already developed for team-level work, and it transfers directly to agent-level work. Lean teams have been arguing for years that verification should be externally authored and that incomplete work should be visibly incomplete. Agent skills are the same shape one floor sideways.

The harder question I keep landing on is who reads the contract. Drift in team Definitions of Done usually comes not from missing contract text but from no one being on the hook for whether the contract is actually honored. The same shape will catch agent skills the moment they become widespread. The contract has to live somewhere a counterparty can flinch when it gets violated.

Collapse
 
alexshev profile image
Alex Shev

That contract framing is the right bridge. A skill should not just tell the agent what to do; it should define the shape of a responsible run. Trigger, workflow, output, verification, refusal, and stop conditions are what make the behavior repeatable instead of improvised.

Collapse
 
jugeni profile image
Mike Czerwinski

Yes, and the contract makes the skill auditable as well as repeatable, which is the underrated second-order effect. With trigger / workflow / output / verification / refusal / stop named explicitly, you can reconstruct after the fact what the agent should have done, separate from what it did. The improvised version gives you a transcript and an outcome and nothing in between, so disagreement collapses into vibes. The contracted version gives you six places to point at when something went sideways, and the agent has six places to point back. That asymmetry between what is fixed by the contract and what is left to the run is also what lets you change one slot without rewriting the skill, swap the verifier without touching the workflow, tighten the refusal without re-deriving the trigger. Composable in the boring sense, not the marketing one.

Thread Thread
 
alexshev profile image
Alex Shev

That audit trail point is important. A skill contract should make disagreement concrete: was the trigger wrong, the workflow under-specified, the verifier weak, or the refusal missing? Without those slots, every failure turns into a debate about model behavior. With them, you can improve the system without rewriting the whole skill.

Thread Thread
 
jugeni profile image
Mike Czerwinski

Trigger / workflow / verifier / refusal as typed slots is the cut that makes contract-failure debuggable instead of relitigated. The model-behavior debate is the failure mode you get when none of those slots are first-class, because every failure resolves into the same unfalsifiable bucket.

The follow-up I'd hold is that each slot needs its own typed evidence for the wrong-call, not just a label. "Trigger wrong" with no record of what fired and why is a slot in name only, and it pushes the debate one floor down instead of dissolving it. The trigger record needs to carry which keyword matched, which context predicate was true, and what the skill expected. Same shape for the other three. Slot without evidence is theater.

Without that, the four slots become four new places to argue.

Thread Thread
 
alexshev profile image
Alex Shev

Yes. Slot without evidence is theater is the right warning.

The evidence record is what makes the contract debuggable: what triggered, which predicate matched, what verifier ran, what refusal condition applied, and what output was accepted. Without that, the slots become labels for the same old argument.

Collapse
 
alexshev profile image
Alex Shev

Small update after looking at current Terminal Skills search demand: the strongest pattern is not just people asking for more agent prompts. They are searching for skills around codebase architecture: how to inspect structure, find coupling, and make a change plan without turning the repo into vibes.

That is exactly where a skill should beat a prompt. The useful artifact is a repeatable architecture workflow: read the map, identify risk, propose a small refactor path, run checks, and leave evidence for the next agent or human.

Collapse
 
jugeni profile image
Mike Czerwinski

This is the case that tests the contract framing hardest, because the output is a plan, not a diff, and a plan is the artifact people verify least. A refactor that runs checks has something to point at when it fails. A change plan that says read the map, find coupling, propose a path has no failing test if the map was read wrong. The verification slot is the one everyone quietly drops here.

So the evidence-for-the-next-agent line is the load-bearing one, and it has to carry more than the conclusion. The risk-identification step needs typed evidence the same way: which couplings it found, which it looked for and did not find, what it could not see. A plan that records only the risks it surfaced inherits the blind spot of the summary, it looks complete because the parts it missed never entered the artifact. Same failure shape as a reconstructed audit trail, one floor up.

The discipline that makes it work is the one this thread already landed on: the map read has to stamp what it inspected at inspection time, not what it concluded after. The next agent can re-derive a plan from a record of what was looked at. It cannot re-derive one from a verdict that already threw away the search. That is the difference between a skill that hands off architecture and a confident note that hands off a guess.