DEV Community

How to Write Terminal Skills That AI Agents Can Actually Use

Alex Shev on June 08, 2026

Most AI agent advice still sounds like prompt advice. Add more context. Write clearer instructions. Give the model examples. Use a better system p...


Alex Shev • Jun 27

Small update after looking at current Terminal Skills search demand: the strongest pattern is not just people asking for more agent prompts. They are searching for skills around codebase architecture: how to inspect structure, find coupling, and make a change plan without turning the repo into vibes.

That is exactly where a skill should beat a prompt. The useful artifact is a repeatable architecture workflow: read the map, identify risk, propose a small refactor path, run checks, and leave evidence for the next agent or human.

Mike Czerwinski • Jun 27

This is the case that tests the contract framing hardest, because the output is a plan, not a diff, and a plan is the artifact people verify least. A refactor that runs checks has something to point at when it fails. A change plan that says "read the map, find coupling, propose a path" has no failing test if the map was read wrong. The verification slot is the one everyone quietly drops here.

So the evidence-for-the-next-agent line is the load-bearing one, and it has to carry more than the conclusion. The risk-identification step needs typed evidence the same way: which couplings it found, which it looked for and did not find, what it could not see. A plan that records only the risks it surfaced inherits the blind spot of the summary: it looks complete because the parts it missed never entered the artifact. Same failure shape as a reconstructed audit trail, one floor up.

The discipline that makes it work is the one this thread already landed on: the map read has to stamp what it inspected at inspection time, not what it concluded after. The next agent can re-derive a plan from a record of what was looked at. It cannot re-derive one from a verdict that already threw away the search. That is the difference between a skill that hands off architecture and a confident note that hands off a guess.
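To make "stamp what it inspected at inspection time" concrete, here is a minimal sketch of an append-only inspection log. The class names, field names, and the example paths are all hypothetical, not something from the post; the only point is that each entry is written when the read happens and stores an observation rather than a verdict.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Inspection:
    """One thing the agent looked at, recorded when it looked."""
    path: str        # file or module that was read (hypothetical example paths below)
    looked_for: str  # the question asked of it
    observed: str    # raw observation, not a conclusion
    at: str          # UTC timestamp taken at inspection time


@dataclass
class InspectionLog:
    entries: list[Inspection] = field(default_factory=list)

    def record(self, path: str, looked_for: str, observed: str) -> Inspection:
        # Stamp the entry now, at the moment of inspection.
        entry = Inspection(path, looked_for, observed,
                           datetime.now(timezone.utc).isoformat())
        self.entries.append(entry)  # append-only: entries are frozen once written
        return entry

    def inspected_paths(self) -> set[str]:
        return {e.path for e in self.entries}


log = InspectionLog()
log.record("billing/invoice.py", "imports from orders", "imports orders.models.Order")
log.record("orders/models.py", "imports from billing", "none found")
```

A later agent can rebuild or challenge the plan from `log.entries`, including the "none found" row, which a verdict-only note would have dropped.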

Alex Shev • Jul 14

Yes, planning is the hardest version of this because the artifact can sound complete while hiding what was never inspected.

For an architecture skill, I would want the plan to carry negative evidence too: what was checked and not found, which files or modules were out of scope, what coupling signals were weak, and where confidence is low.

Otherwise the plan becomes a polished list of surfaced risks, not a map of the search. The next agent needs the missing edges as much as the conclusions.

Mike Czerwinski • Jul 14

Negative evidence in the plan is the part that actually distinguishes a search from a conclusion wearing search's clothes. A plan that only lists what it found reads the same whether the agent checked everywhere and found little, or checked three files and stopped. Both produce a short, confident-looking document.

The harder version of what you're asking: negative evidence has to be falsifiable the same way the positive kind does. "Checked and not found" needs to name what would have counted as found, or it's just a second list of assertions with a different label. "Weak coupling signal in module X" only means something if the plan also states what a strong signal would have looked like there. Otherwise low-confidence becomes its own hiding place: a plan can mark everything uncertain and get credit for honesty while still not having looked very hard.
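One way to enforce that, sketched under my own assumptions: make the "what would have counted as found" field mandatory on every negative finding, so an unfalsifiable entry cannot be constructed at all. `NegativeFinding` and its fields are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NegativeFinding:
    area: str                  # where the agent looked
    signal: str                # what it was looking for
    would_count_as_found: str  # the concrete observation that would have been a hit
    search_performed: str      # what was actually run or read

    def __post_init__(self):
        # Reject "checked, not found" claims that cannot be re-checked by a reader.
        for name in ("would_count_as_found", "search_performed"):
            if not getattr(self, name).strip():
                raise ValueError(f"negative finding needs a non-empty {name}")


ok = NegativeFinding(
    area="orders/",
    signal="direct import of billing",
    would_count_as_found="any 'import billing' or 'from billing' statement",
    search_performed="read every .py file under orders/",
)
```

This does not prove the search was honest, but it does turn "not found" into a claim someone else can rerun.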

Alex Shev • Jul 14

That is the right pressure on negative evidence. Checked-and-not-found only helps if the reader knows what would have counted as found.

For planning, I think the useful artifact is closer to a search ledger than a summary: inspected areas, expected signals, missing signals, confidence, and what could not be inspected. Otherwise uncertainty becomes a polished hiding place.
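A rough sketch of what that ledger could look like as data, assuming one row per inspected area. The class names, the free-form confidence string, and the example areas are my own placeholders, not a proposed format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LedgerRow:
    area: str                    # inspected area
    expected_signals: list[str]  # what the search was supposed to look for
    found_signals: list[str]     # what it actually saw
    confidence: str              # free-form here, e.g. "high", "medium", "low"

    @property
    def missing_signals(self) -> list[str]:
        # Expected but not observed: the negative evidence for this area.
        return [s for s in self.expected_signals if s not in self.found_signals]


@dataclass
class SearchLedger:
    rows: list[LedgerRow] = field(default_factory=list)
    not_inspected: dict[str, str] = field(default_factory=dict)  # area -> reason

    def summary(self) -> str:
        lines = [
            f"{r.area}: found={r.found_signals} "
            f"missing={r.missing_signals} confidence={r.confidence}"
            for r in self.rows
        ]
        lines += [f"{area}: NOT INSPECTED ({reason})"
                  for area, reason in self.not_inspected.items()]
        return "\n".join(lines)


ledger = SearchLedger(
    rows=[LedgerRow("billing/", ["direct_import", "shared_global_state"],
                    ["direct_import"], "medium")],
    not_inspected={"vendor/": "out of scope for this task"},
)
```

The uninspected areas sit in the same artifact as the findings, so the summary cannot be shortened by simply leaving them out.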

Mike Czerwinski • Jul 14

A search ledger only resists the hiding-place problem if the "expected signals" column isn't also self-authored. If the agent gets to write both what it expected to find and what it found, under-declaring the expected list is the same move as under-declaring dependencies in the severity-labeling thread going around this week: just don't list the signal you didn't check, and the ledger reads as thorough by omission. The fix there was deriving dependencies from what the computation actually touched instead of what the author declared. Same shape here: "expected signals" should come from a fixed taxonomy for the artifact class (for architecture, a known checklist of coupling types), not from the agent's own scoping of the task. Then a missing row is a visible gap against the taxonomy rather than an invisible one against the agent's private plan.

Otherwise the ledger format is the right instinct: inspected areas, expected signals, missing signals, confidence. It just needs the expected-signals column pinned outside the hand writing the rest of the row.
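As a sketch of pinning that column: keep the taxonomy as a constant the agent cannot edit, and diff the ledger's rows against it. The coupling types listed here are illustrative placeholders I made up for the example, not an established checklist, and `taxonomy_gaps` is a hypothetical helper.

```python
from __future__ import annotations

# Hypothetical taxonomy: these coupling types are illustrative, not a standard list.
COUPLING_TAXONOMY = frozenset({
    "direct_import",
    "shared_database_table",
    "shared_global_state",
    "circular_dependency",
    "implicit_call_order",
})


def taxonomy_gaps(ledger_signals: set[str],
                  taxonomy: frozenset[str] = COUPLING_TAXONOMY) -> dict[str, list[str]]:
    """Compare the signals a ledger has rows for against the fixed taxonomy."""
    return {
        # In the taxonomy but absent from the ledger: the visible gap.
        "unchecked": sorted(taxonomy - ledger_signals),
        # In the ledger but not in the taxonomy: agent-authored rows worth a second look.
        "outside_taxonomy": sorted(ledger_signals - taxonomy),
    }


report = taxonomy_gaps({"direct_import", "circular_dependency", "naming_similarity"})
```

Here `report["unchecked"]` lists the three taxonomy entries the ledger never touched, which is exactly the omission a self-authored expected list would have hidden.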

Mike Czerwinski • Jun 23

The Stop Conditions section is the one most other writing on agent skills under-writes, and it is the one that does the work. Most documentation of agentic workflows lives almost entirely in the happy path. Stop conditions are what turn a skill from a confident demo into something an operator can deploy and walk away from.

The framing of skill-as-contract is what makes the post generalize. A contract is what an operator-side decision record is supposed to be: a written-down promise of trigger, workflow, output, verification, and refusal. Most agent stacks have prompts and tools but skip the contract layer, which means every run is improvised. Your point that the contract does not make the agent less intelligent, but makes the work less dependent on fresh reasoning every time, is the part that should sit on a wall somewhere.

One small bridge that may be useful for people coming from team work: your "produce this output, run these checks, stop under these conditions" is structurally what a Definition of Done is in a lean or agile context. The vocabulary is already developed for team-level work, and it transfers directly to agent-level work. Lean teams have been arguing for years that verification should be externally authored and that incomplete work should be visibly incomplete. Agent skills are the same shape one floor sideways.

The harder question I keep landing on is who reads the contract. Drift in team Definitions of Done usually comes not from missing contract text but from no one being on the hook for whether the contract is actually honored. The same shape will catch agent skills the moment they become widespread. The contract has to live somewhere a counterparty can flinch when it gets violated.

Alex Shev • Jun 23

That contract framing is the right bridge. A skill should not just tell the agent what to do; it should define the shape of a responsible run. Trigger, workflow, output, verification, refusal, and stop conditions are what make the behavior repeatable instead of improvised.

Mike Czerwinski • Jun 24

Yes, and the contract makes the skill auditable as well as repeatable, which is the underrated second-order effect. With trigger / workflow / output / verification / refusal / stop named explicitly, you can reconstruct after the fact what the agent should have done, separate from what it did. The improvised version gives you a transcript and an outcome and nothing in between, so disagreement collapses into vibes. The contracted version gives you six places to point at when something went sideways, and the agent has six places to point back. That asymmetry between what is fixed by the contract and what is left to the run is also what lets you change one slot without rewriting the skill: swap the verifier without touching the workflow, tighten the refusal without re-deriving the trigger. Composable in the boring sense, not the marketing one.
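A minimal sketch of the six slots as a typed record, to show what "swap the verifier without touching the workflow" could look like in practice. `SkillContract`, the two verifier functions, and the trigger, refusal, and stop values are assumptions made up for the example; the workflow steps are the ones from the architecture update earlier in the thread.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class SkillContract:
    trigger: str
    workflow: tuple[str, ...]
    output: str
    verification: Callable[[str], bool]
    refusal: tuple[str, ...]
    stop: tuple[str, ...]


def plan_is_non_empty(plan: str) -> bool:
    return bool(plan.strip())


def plan_names_unchecked_areas(plan: str) -> bool:
    # Stricter hypothetical verifier: the plan must also say what was not inspected.
    return plan_is_non_empty(plan) and "not inspected:" in plan.lower()


base = SkillContract(
    trigger="user asks for an architecture change plan",
    workflow=("read the map", "identify risk", "propose a small refactor path", "run checks"),
    output="change plan with evidence",
    verification=plan_is_non_empty,
    refusal=("repository is not readable",),
    stop=("checks fail twice",),
)

# Swap one slot; every other slot is carried over unchanged.
stricter = replace(base, verification=plan_names_unchecked_areas)
```

Because the record is frozen, `replace` returns a new contract rather than mutating the old one, so both versions stay available for comparing runs.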

Alex Shev • Jun 25

That audit trail point is important. A skill contract should make disagreement concrete: was the trigger wrong, the workflow under-specified, the verifier weak, or the refusal missing? Without those slots, every failure turns into a debate about model behavior. With them, you can improve the system without rewriting the whole skill.

Mike Czerwinski • Jun 25

Trigger / workflow / verifier / refusal as typed slots is the cut that makes contract-failure debuggable instead of relitigated. The model-behavior debate is the failure mode you get when none of those slots are first-class, because every failure resolves into the same unfalsifiable bucket.

The follow-up I'd hold is that each slot needs its own typed evidence for the wrong-call, not just a label. "Trigger wrong" with no record of what fired and why is a slot in name only, and it pushes the debate one floor down instead of dissolving it. The trigger record needs to carry which keyword matched, which context predicate was true, and what the skill expected. Same shape for the other three. Slot without evidence is theater.

Without that, the four slots become four new places to argue.

Alex Shev • Jun 26

Yes. "Slot without evidence is theater" is the right warning.

The evidence record is what makes the contract debuggable: what triggered, which predicate matched, what verifier ran, what refusal condition applied, and what output was accepted. Without that, the slots become labels for the same old argument.
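Sketching that record as data, under my own assumptions about field names: each slot either carries evidence or is explicitly `None`, and a helper reports which slots are labels with nothing behind them. `TriggerEvidence`, `RunEvidence`, and the example values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriggerEvidence:
    matched_keyword: str   # which keyword matched
    predicate: str         # which context predicate was evaluated
    predicate_value: bool  # what it evaluated to
    expected: str          # what the skill expected to see


@dataclass(frozen=True)
class RunEvidence:
    # None means "nothing was recorded", which is different from
    # an explicit string such as "no refusal condition applied".
    trigger: Optional[TriggerEvidence]
    verifier_ran: Optional[str]
    refusal_applied: Optional[str]
    output_accepted: Optional[str]

    def slots_without_evidence(self) -> list[str]:
        return [name for name, value in vars(self).items() if value is None]


run = RunEvidence(
    trigger=TriggerEvidence("architecture", "repo_has_source_tree", True,
                            "a request for a change plan"),
    verifier_ran="plan_names_unchecked_areas",
    refusal_applied=None,
    output_accepted="plan-v1",
)
```

For this run, `run.slots_without_evidence()` flags the refusal slot, so the open question is "was refusal evaluated at all?" rather than a general argument about model behavior.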