Alex Shev

Posted on Jun 8

How to Write Terminal Skills That AI Agents Can Actually Use

#productivity #cli #devtools #ai

Most AI agent advice still sounds like prompt advice.

Add more context.
Write clearer instructions.
Give the model examples.
Use a better system prompt.

That helps, but it misses the part that breaks in real work.

The problem is not always that the agent does not know a command. The problem is that the agent does not know your workflow.

It does not know when to inspect first.
It does not know which defaults are safe.
It does not know what "done" means.
It does not know when to stop instead of guessing.
It does not know which checks matter before something leaves the local machine.

That is where Terminal Skills become useful.

A Terminal Skill is not just a command shortcut. It is a small reusable operating procedure for an agent.

It teaches:

when to use a workflow
what inputs are acceptable
which commands or scripts are preferred
what output should exist
how to verify the result
when to stop and ask for help

That last part is the difference between a useful agent workflow and a confident mess.

Here is the pattern I use when writing one.

Start with the task, not the tool

The easiest mistake is to begin with a tool name.

Bad starting point:

Make an FFmpeg skill.

Better starting point:

Make a skill that turns a raw video file into an X-ready MP4, then verifies the upload is likely to work.

Those are different scopes.

The first one is a tool wrapper.

The second one is a workflow.

Agents do not need every possible FFmpeg flag. They need a stable path through a common problem.

The same applies to any terminal workflow:

not "make a git skill", but "review a dirty worktree without touching unrelated user changes"
not "make a deploy skill", but "deploy a Next.js app to Vercel and verify the live URL"
not "make a search skill", but "inspect a repo and find the smallest safe file set for this task"

The more specific the workflow, the more useful the skill.

A useful skill has a contract

I like thinking about a Terminal Skill as a contract between the human, the agent, and the machine.

The contract says:

If this kind of task appears,
and these inputs exist,
follow this workflow,
produce this output,
run these checks,
and stop under these conditions.

That sounds simple, but it removes a lot of randomness.

Without a contract, the agent improvises.

With a contract, the agent has a default operating path.

That does not make the agent less intelligent. It makes the work less dependent on fresh reasoning every time.

The basic structure

For most Terminal Skills, I would start with a folder like this:

my-skill/
  SKILL.md
  scripts/
    run.sh
  examples/
    input-example.txt
  README.md
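
If you want to create that layout from the terminal, something like the following would do it. This is only a sketch: `scaffold_skill` is a hypothetical helper name, and the file contents it writes are empty placeholders, not a recommended starting point.

```bash
# Sketch: scaffold the skill folder layout shown above.
# "my-skill" is only the default name; pass another name as $1.
scaffold_skill() {
  local name="${1:-my-skill}"

  # Refuse to touch an existing folder rather than overwrite someone's work.
  if [ -e "$name" ]; then
    echo "stop: $name already exists" >&2
    return 1
  fi

  mkdir -p "$name/scripts" "$name/examples"
  printf '# Skill Name\n' > "$name/SKILL.md"
  printf '#!/usr/bin/env bash\n' > "$name/scripts/run.sh"
  chmod +x "$name/scripts/run.sh"
  : > "$name/examples/input-example.txt"
  : > "$name/README.md"
  echo "created $name"
}
```

Calling `scaffold_skill x-safe-video-export` would create the folder in the current directory and stop if it already exists.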

Not every skill needs a script.

Some skills are mostly procedural. Some are wrappers around existing CLIs. Some are just strong instructions plus verification commands.

But the SKILL.md is the important part.

It should tell the agent how to work, not just what the tool does.

Here is a practical template.

# Skill Name

## Use When

Use this skill when the user asks to:
- ...
- ...

## Do Not Use When

Do not use this skill when:
- ...
- ...

## Inputs

Expected inputs:
- source file or directory
- target format
- optional config

## Workflow

1. Inspect the input.
2. Choose the smallest safe action.
3. Run the script or command.
4. Verify the output.
5. Report the result with file paths and any warnings.

## Commands

```bash
./scripts/run.sh input-file
```

## Verification

Check:
- output file exists
- output format is correct
- command exited successfully
- logs contain no obvious errors

## Stop Conditions

Stop and ask the user if:
- required input is missing
- output validation fails
- the task would publish, delete, charge, email, or deploy something
- the command could overwrite user data

That is already more useful than a loose prompt.

It gives the agent a map.
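
To make the five workflow steps concrete, here is a rough sketch of what `scripts/run.sh` could contain. The function name `run_skill`, the exit codes, and the "transformation" (uppercasing text with `tr`) are all stand-ins I made up for illustration; a real skill would replace step 3 with its own command.

```bash
# Sketch of scripts/run.sh following the five workflow steps.
# The tr call is a placeholder for the real work the skill performs.
run_skill() {
  local input="$1" output="$2"

  # 1. Inspect the input; stop if it is missing or empty.
  if [ ! -s "$input" ]; then
    echo "stop: input missing or empty: $input" >&2
    return 2
  fi

  # 2. Smallest safe action: never overwrite existing user data.
  if [ -e "$output" ]; then
    echo "stop: output already exists: $output" >&2
    return 3
  fi

  # 3. Run the command.
  if ! tr '[:lower:]' '[:upper:]' < "$input" > "$output"; then
    echo "stop: command failed" >&2
    return 4
  fi

  # 4. Verify the output exists and is non-empty.
  if [ ! -s "$output" ]; then
    echo "stop: output validation failed: $output" >&2
    return 5
  fi

  # 5. Report the result with the file path.
  echo "ok: wrote $output"
}
```

Each stop condition gets its own exit code here so the agent (or a human reading the log) can tell which boundary was hit, not just that something failed.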

The most important section is "Stop Conditions"

Most people under-specify this part.

They document the happy path and skip the failure boundaries.

But agents need stop conditions badly.

A good skill should say things like:

stop if the repo has unrelated user changes
stop if the API token is missing
stop if the public page cannot be verified
stop if the output file has no video stream
stop if the command would delete or overwrite source files
stop if the user approved a draft but did not approve publishing

This is where agent workflows become safer.
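
Some of these stop conditions can be turned into small guard functions that run before anything else. The sketch below assumes a few things that are not in the list above: the token lives in an environment variable called `API_TOKEN`, and the user's approval is recorded as the string `draft` or `publish`. Both are placeholder conventions.

```bash
# Each guard prints a reason and returns non-zero when the agent should stop.
# API_TOKEN and the "draft"/"publish" values are placeholder conventions.

stop_if_token_missing() {
  if [ -z "${API_TOKEN:-}" ]; then
    echo "stop: API token is missing" >&2
    return 1
  fi
}

# $1 = source file, $2 = intended output path
# Plain string comparison: it will not catch two different paths
# that point at the same file.
stop_if_overwrites_source() {
  if [ "$1" = "$2" ]; then
    echo "stop: output path would overwrite the source file" >&2
    return 1
  fi
}

# $1 = what the user approved: "draft" or "publish"
stop_if_publish_not_approved() {
  if [ "$1" != "publish" ]; then
    echo "stop: user approved '$1', not publishing" >&2
    return 1
  fi
}

# Run every guard; the first failure ends the preflight.
preflight() {
  stop_if_token_missing &&
    stop_if_overwrites_source "$1" "$2" &&
    stop_if_publish_not_approved "$3"
}
```

A call like `preflight raw.mov out.mp4 draft` would stop with the approval message, which is the point: the agent gets a reason to ask instead of a reason to guess.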

For example, a social publishing skill should not only say:

Open the composer and publish the post.

It should say:

Before posting, verify the composer contains the exact approved text.
After posting, verify the final permalink shows the full text and attached media.
If media is missing, do not report success.

That is a real operating rule.

It captures the part of the workflow that normally lives in someone's head.
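
The "exact approved text" rule is easy to check mechanically once both texts are in files. How the composer text gets into a file depends on the platform and is outside this sketch; the function name and the two-file convention are my assumptions.

```bash
# $1 = file holding the approved text
# $2 = file holding the text currently in the composer
# cmp -s compares byte for byte, so even a trailing space counts as a difference.
composer_matches_approved() {
  if cmp -s "$1" "$2"; then
    echo "ok: composer matches approved text"
  else
    echo "stop: composer text differs from approved text" >&2
    return 1
  fi
}
```

A byte-for-byte comparison is deliberately strict. If the platform normalizes whitespace, the skill should say so explicitly rather than loosen the check silently.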

Verification should be concrete

"Check that it worked" is not enough.

A good Terminal Skill names the actual check.

For a video skill:

ffprobe -v error -show_streams -show_format output.mp4
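
That command prints everything. To turn it into a pass/fail check for the "no video stream" stop condition, one option is to ask ffprobe for only the stream types and look for `video`. This is a sketch: the function names are hypothetical, and it assumes `ffprobe` is on the PATH.

```bash
# Reads one codec_type per line on stdin and succeeds only if
# at least one line is exactly "video".
has_video_stream() {
  grep -qx 'video'
}

# $1 = media file to check
verify_video() {
  local file="$1"
  if ! command -v ffprobe >/dev/null 2>&1; then
    echo "stop: ffprobe is not installed" >&2
    return 2
  fi
  # Print only the codec_type value of each stream, one per line.
  if ffprobe -v error -show_entries stream=codec_type \
       -of default=noprint_wrappers=1:nokey=1 "$file" | has_video_stream; then
    echo "ok: $file has a video stream"
  else
    echo "stop: $file has no video stream" >&2
    return 1
  fi
}
```

Splitting the parsing (`has_video_stream`) from the probe keeps the check testable on a machine without FFmpeg installed.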

For a code skill:

npm test
git diff --check

For a content publishing skill:

Open the final live URL.
Confirm the title, body, tags, canonical URL, and media are visible.
Do not trust the editor preview as final verification.

For a data export skill:

Check row count, headers, encoding, and sample records before sending the file.
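
Two of those checks, header and row count, fit in a few lines of shell. The sketch below covers only those two; encoding and sample records are left out. It assumes a simple CSV with one record per line (no quoted newlines) and a trailing newline at the end of the file. `check_export` is a hypothetical name.

```bash
# $1 = CSV file, $2 = expected header line, $3 = minimum number of data rows
check_export() {
  local file="$1" expected_header="$2" min_rows="$3"
  local header rows

  if [ ! -s "$file" ]; then
    echo "stop: $file is missing or empty" >&2
    return 1
  fi

  header=$(head -n 1 "$file")
  if [ "$header" != "$expected_header" ]; then
    echo "stop: unexpected header: $header" >&2
    return 1
  fi

  # Data rows = total lines minus the header line.
  rows=$(( $(wc -l < "$file") - 1 ))
  if [ "$rows" -lt "$min_rows" ]; then
    echo "stop: only $rows data rows, expected at least $min_rows" >&2
    return 1
  fi

  echo "ok: $rows data rows"
}
```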

Agents are very good at saying "done" too early.

Verification commands make "done" harder to fake.

Keep the skill narrow

The best skills are boringly specific.

Bad:

content-automation

Better:

devto-draft-from-markdown
x-safe-video-export
reddit-comment-visibility-check
vercel-preview-deploy-and-verify

Narrow skills have clearer triggers and fewer hidden assumptions.

They are also easier to improve.

If a video export fails, fix the video skill.
If a DEV.to draft misses a canonical URL, fix the DEV.to skill.
If a Reddit comment is visible to the owner but not public, fix the visibility check.

One giant "content automation" skill would hide all of those failures inside one vague blob.

Small skills make the workflow inspectable.

Use scripts for mechanics, instructions for judgment

I do not think every skill should become a huge script.

Scripts are good for mechanical repeatability:

convert this file
validate this JSON
resize this image
call this API
generate this report

Instructions are better for judgment:

when to use the script
which candidates to reject
how to handle approval
what counts as verification
when not to continue

The skill should combine both.

For example:

SKILL.md explains the workflow.
scripts/export.sh performs the conversion.
Verification commands prove the output.
Stop conditions prevent the agent from bluffing.

That is the useful shape.

Not "the agent has a tool."

"The agent has a way of working."

Example: a tiny repo-inspection skill

Here is a small example that does not need a script.

# Repo Inspection

## Use When

Use this before editing an unfamiliar codebase.

## Workflow

1. Print the current directory.
2. Check git status.
3. List top-level files.
4. Identify package/framework files.
5. Search for relevant code with ripgrep.
6. Read the smallest useful files before editing.

## Preferred Commands

```bash
pwd
git status --short
rg --files | head -80
rg -n "keyword|component|route" .
```

## Stop Conditions

Stop before editing if:
- the user has unrelated changes in the target file
- the task requires a destructive git command
- the repo structure is unclear after inspection

## Verification

Before reporting done:
- show changed files
- run the smallest relevant test or check
- explain any test that could not be run

This is not glamorous.

But it prevents a lot of common agent mistakes.

It teaches the agent the shape of careful work.
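
The first stop condition in that skill can be checked mechanically. A minimal sketch, assuming git is installed and using hypothetical function names:

```bash
# Succeeds when the path has uncommitted or untracked changes.
# --porcelain prints one line per changed path and nothing when clean.
target_has_changes() {
  [ -n "$(git status --porcelain -- "$1")" ]
}

# $1 = file the agent intends to edit
stop_if_target_dirty() {
  # Outside a repo, git status prints nothing useful, so treat that as a stop.
  if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "stop: not inside a git work tree" >&2
    return 2
  fi
  if target_has_changes "$1"; then
    echo "stop: $1 has existing changes; ask before editing" >&2
    return 1
  fi
}
```

This only says that changes exist. Whether they are "unrelated user changes" is still a judgment call, which is why the skill tells the agent to stop and ask.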

What makes a skill good?

I usually judge a skill by five questions.

1. Does it have a clear trigger?

The agent should know when to use it.

If the trigger is vague, the skill will either be ignored or overused.

2. Does it reduce repeat reasoning?

A good skill saves the agent from rediscovering the same workflow again.

If the workflow is only used once, it may not need a skill yet.

3. Does it define done?

The skill should say what output must exist and how to verify it.

If "done" is subjective, the agent will guess.

4. Does it include stop conditions?

This is the safety layer.

The skill should prevent confident continuation when the workflow is missing a required input, external approval, or verification.

5. Is it small enough to maintain?

If the skill becomes a giant manual for everything, it stops being useful.

Small, composable skills are easier to trust.

The bigger point

AI agents are getting better at tool use.

That does not mean every workflow should be improvised in chat.

The more capable the agent becomes, the more important operating procedures become.

Prompts are good for intent.
Tools are good for capability.
Skills are good for repeatable work.

That is the layer I think more developers should build.

Not because it is flashy.

Because boring, reusable workflows are what turn agents from demos into something you can actually depend on.

I am collecting more examples of this pattern at Terminal Skills.

If you are building your own agent workflows, start with one annoying task you repeat every week.

Write down the trigger, workflow, verification, and stop conditions.

That is your first skill.

Top comments (29)

Alex Shev • Jun 27

Small update after looking at current Terminal Skills search demand: the strongest pattern is not just people asking for more agent prompts. They are searching for skills around codebase architecture: how to inspect structure, find coupling, and make a change plan without turning the repo into vibes.

That is exactly where a skill should beat a prompt. The useful artifact is a repeatable architecture workflow: read the map, identify risk, propose a small refactor path, run checks, and leave evidence for the next agent or human.

Mike Czerwinski • Jun 27

This is the case that tests the contract framing hardest, because the output is a plan, not a diff, and a plan is the artifact people verify least. A refactor that runs checks has something to point at when it fails. A change plan that says "read the map, find coupling, propose a path" has no failing test if the map was read wrong. The verification slot is the one everyone quietly drops here.

So the evidence-for-the-next-agent line is the load-bearing one, and it has to carry more than the conclusion. The risk-identification step needs typed evidence the same way: which couplings it found, which it looked for and did not find, what it could not see. A plan that records only the risks it surfaced inherits the blind spot of the summary: it looks complete because the parts it missed never entered the artifact. Same failure shape as a reconstructed audit trail, one floor up.

The discipline that makes it work is the one this thread already landed on: the map read has to stamp what it inspected at inspection time, not what it concluded after. The next agent can re-derive a plan from a record of what was looked at. It cannot re-derive one from a verdict that already threw away the search. That is the difference between a skill that hands off architecture and a confident note that hands off a guess.

Alex Shev • Jul 14

Yes, planning is the hardest version of this because the artifact can sound complete while hiding what was never inspected.

For an architecture skill, I would want the plan to carry negative evidence too: what was checked and not found, which files or modules were out of scope, what coupling signals were weak, and where confidence is low.

Otherwise the plan becomes a polished list of surfaced risks, not a map of the search. The next agent needs the missing edges as much as the conclusions.

Mike Czerwinski • Jul 14

Negative evidence in the plan is the part that actually distinguishes a search from a conclusion wearing search's clothes. A plan that only lists what it found reads the same whether the agent checked everywhere and found little, or checked three files and stopped. Both produce a short, confident-looking document.

The harder version of what you're asking: negative evidence has to be falsifiable the same way the positive kind does. "Checked and not found" needs to name what would have counted as found, or it's just a second list of assertions with a different label. "Weak coupling signal in module X" only means something if the plan also states what a strong signal would have looked like there. Otherwise low-confidence becomes its own hiding place: a plan can mark everything uncertain and get credit for honesty while still not having looked very hard.

Alex Shev • Jul 14

That is the right pressure on negative evidence. Checked-and-not-found only helps if the reader knows what would have counted as found.

For planning, I think the useful artifact is closer to a search ledger than a summary: inspected areas, expected signals, missing signals, confidence, and what could not be inspected. Otherwise uncertainty becomes a polished hiding place.

Mike Czerwinski • Jul 14

A search ledger only resists the hiding-place problem if the "expected signals" column isn't also self-authored. If the agent gets to write both what it expected to find and what it found, under-declaring the expected list is the same move as under-declaring dependencies in the severity-labeling thread going around this week: just don't list the signal you didn't check, and the ledger reads as thorough by omission. The fix there was deriving dependencies from what the computation actually touched instead of what the author declared. Same shape here: "expected signals" should come from a fixed taxonomy for the artifact class (for architecture, a known checklist of coupling types), not from the agent's own scoping of the task, so a missing row is a visible gap against the taxonomy rather than an invisible one against the agent's private plan.

Otherwise the ledger format is the right instinct: inspected areas, expected signals, missing signals, confidence. It just needs the expected-signals column pinned outside the hand writing the rest of the row.

Alex Shev • Jul 14

That outside taxonomy point is strong. If the agent authors the expected-signals list, the ledger can look complete while quietly shrinking the search space. For architecture work, the expected rows probably need to come from known coupling categories: API boundary, data shape, persistence, async flow, auth, caching, deployment, and tests. Missing rows then become visible.

Mike Czerwinski • Jul 15

Good list, and it's specific enough to actually pin a missing row against. One thing worth naming before it ships: the taxonomy itself needs the same protection the expected-signals column just got, because if the agent can add or drop a category from that list when a case is inconvenient (this pattern doesn't really have an async flow, skip that row), you've reopened the self-authored gap one level up, at the taxonomy instead of at the signal list.

Practical version: the eight categories are a fixed enum somewhere outside the agent's reach, additions require an explicit, logged change to that enum (not a per-run decision), and every ledger entry has to account for all eight as present, absent, or not-applicable-with-a-stated-reason. Not-applicable is fine; silently missing is the thing the taxonomy was supposed to make visible. The failure mode to watch for is a category that's technically present in the enum but gets marked not-applicable so often it stops meaning anything, which is the same rot the fixed-rules thread going around this week already ran into.

Alex Shev • Jul 15

Yes, the enum itself becomes part of the trust boundary. If categories can be changed per run, the taxonomy is just another narrative surface. I like the rule that every entry must account for every category as present, absent, or not applicable with a reason. The long-term smell is exactly what you named: not-applicable quietly becoming the new hiding place.
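
A rough sketch of the rule this branch of the thread arrived at: every ledger must account for all eight coupling categories as present, absent, or not-applicable with a reason. The tab-separated ledger format, the category slugs, and the name `check_ledger` are assumptions made for illustration. In practice the category list would live somewhere the agent cannot edit.

```bash
# Fixed category list taken from the thread above.
CATEGORIES="api-boundary data-shape persistence async-flow auth caching deployment tests"

# Assumed ledger format: one tab-separated row per category:
#   <category> TAB <present|absent|not-applicable> TAB <reason>
check_ledger() {
  local ledger="$1" category row status reason gaps=0
  for category in $CATEGORIES; do
    row=$(awk -F '\t' -v c="$category" '$1 == c { print; exit }' "$ledger")
    if [ -z "$row" ]; then
      echo "gap: no row for $category" >&2
      gaps=1
      continue
    fi
    status=$(printf '%s\n' "$row" | cut -f 2)
    reason=$(printf '%s\n' "$row" | cut -f 3)
    case "$status" in
      present|absent) ;;
      not-applicable)
        # Not-applicable is allowed only with a stated reason.
        if [ -z "$reason" ]; then
          echo "gap: $category is not-applicable without a reason" >&2
          gaps=1
        fi ;;
      *)
        echo "gap: $category has unknown status '$status'" >&2
        gaps=1 ;;
    esac
  done
  return "$gaps"
}
```

This makes a missing row visible, but it does nothing about the rot described above, where not-applicable is used so often it stops meaning anything. That would need a check across many ledgers, not one.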

Mike Czerwinski • Jun 23

The Stop Conditions section is the one most other writing on agent skills under-specifies, and it is the one that does the work. Most documentation of agentic workflows lives almost entirely in the happy path. Stop conditions are what turn a skill from a confident demo into something an operator can deploy and walk away from.

The framing of skill-as-contract is what makes the post generalize. A contract is what an operator-side decision record is supposed to be: a written down promise of trigger, workflow, output, verification, and refusal. Most agent stacks have prompts and tools but skip the contract layer, which means every run is improvised. Your point that the contract does not make the agent less intelligent, it makes the work less dependent on fresh reasoning every time, is the part that should sit on a wall somewhere.

One small bridge that may be useful for people coming from team work: your "produce this output, run these checks, stop under these conditions" is structurally what a Definition of Done is in a lean or agile context. The vocabulary is already developed for team-level work, and it transfers directly to agent-level work. Lean teams have been arguing for years that verification should be externally authored and that incomplete work should be visibly incomplete. Agent skills are the same shape one floor sideways.

The harder question I keep landing on is who reads the contract. Drift in team Definitions of Done usually comes not from missing contract text but from no one being on the hook for whether the contract is actually honored. The same shape will catch agent skills the moment they become widespread. The contract has to live somewhere a counterparty can flinch when it gets violated.

Alex Shev • Jun 23

That contract framing is the right bridge. A skill should not just tell the agent what to do; it should define the shape of a responsible run. Trigger, workflow, output, verification, refusal, and stop conditions are what make the behavior repeatable instead of improvised.

Mike Czerwinski • Jun 24

Yes, and the contract makes the skill auditable as well as repeatable, which is the underrated second-order effect. With trigger / workflow / output / verification / refusal / stop named explicitly, you can reconstruct after the fact what the agent should have done, separate from what it did. The improvised version gives you a transcript and an outcome and nothing in between, so disagreement collapses into vibes. The contracted version gives you six places to point at when something went sideways, and the agent has six places to point back. That asymmetry between what is fixed by the contract and what is left to the run is also what lets you change one slot without rewriting the skill, swap the verifier without touching the workflow, tighten the refusal without re-deriving the trigger. Composable in the boring sense, not the marketing one.

Alex Shev • Jun 25

That audit trail point is important. A skill contract should make disagreement concrete: was the trigger wrong, the workflow under-specified, the verifier weak, or the refusal missing? Without those slots, every failure turns into a debate about model behavior. With them, you can improve the system without rewriting the whole skill.

Mike Czerwinski • Jun 25

Trigger / workflow / verifier / refusal as typed slots is the cut that makes contract-failure debuggable instead of relitigated. The model-behavior debate is the failure mode you get when none of those slots are first-class, because every failure resolves into the same unfalsifiable bucket.

The follow-up I'd hold is that each slot needs its own typed evidence for the wrong-call, not just a label. "Trigger wrong" with no record of what fired and why is a slot in name only, and it pushes the debate one floor down instead of dissolving it. The trigger record needs to carry which keyword matched, which context predicate was true, and what the skill expected. Same shape for the other three. Slot without evidence is theater.

Without that, the four slots become four new places to argue.

Alex Shev • Jun 26

Yes. "Slot without evidence is theater" is the right warning.

The evidence record is what makes the contract debuggable: what triggered, which predicate matched, what verifier ran, what refusal condition applied, and what output was accepted. Without that, the slots become labels for the same old argument.

Mike Czerwinski • Jun 26

That list is the schema, and the discipline it enforces is that the record gets written by the run, not narrated about it afterward. The moment the evidence is reconstructed from the transcript instead of stamped at each slot as it fires, you're back to a label, because the reconstruction only sees what the happy path chose to log. Triggered-predicate, verifier-result, refusal-condition, accepted-output each have to be recorded at the point they happen, or the slot is decorative.

That's also what makes the contract improvable without a rewrite. When a run fails, the evidence record tells you which slot lied, so you patch the predicate or the verifier instead of reopening the model-behavior debate. Slot plus evidence is the difference between a skill you can debug and a demo you can only admire. Good thread.

Alex Shev • Jun 27

Yes, that is what makes it operational instead of narrative. If the evidence is reconstructed after the run, it can only describe the path the agent chose to remember. Stamping each slot at execution time gives you something closer to an audit trail: what fired, what failed, what was refused, and what output actually passed.

Mike Czerwinski • Jun 27

That's the line I'd put on it: closer to an audit trail than a label, and the difference is entirely when the record gets written. Stamped at the slot, it can disagree with the story the happy path would have told. Reconstructed after, it can only ever agree. Good thread.

Alex Shev • Jun 27

Yes, the timing of the record is the whole difference. If the slot writes while the work happens, it can contradict the later story. If the record is reconstructed after the fact, it mostly inherits the same blind spots as the summary.

Mike Czerwinski • Jun 28

Exactly, and that is the whole test for whether a record is evidence or decoration: could it contradict the story told later? A write that happens during the work can. A reconstruction cannot; by construction, it is downstream of the same narrator. So the slot earns its keep only if it is append-only and timestamped at the moment of action, never backfilled. The day you let it be reconstructed for completeness, you have turned the audit trail back into the summary it was supposed to check.

Alex Shev • Jun 28

That distinction is exactly right. An audit slot only earns trust if it can embarrass the summary later. If it is produced after the agent has already decided what happened, it is just another polished narrative. For skills, I like the same rule: the skill should create evidence while the work is happening, not ask the agent to remember the evidence afterward.
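
One way a script-backed skill could "create evidence while the work is happening" is to append a timestamped line each time a slot fires. This is a sketch, not a full audit log: `TRACE_FILE`, the slot names, and the JSON field names are assumptions, and nothing here stops a later process from editing the file.

```bash
# Where the trace goes; override by setting TRACE_FILE before calling stamp.
TRACE_FILE="${TRACE_FILE:-trace.jsonl}"

# Append one JSON line at the moment a slot fires.
# $1 = slot (for example trigger, verifier, refusal, output)
# $2 = result, $3 = detail
# Values are written as-is, so they must not contain quotes or backslashes.
stamp() {
  printf '{"ts":"%s","slot":"%s","result":"%s","detail":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "$TRACE_FILE"
}
```

A run script would call something like `stamp verifier fail "no video stream"` right where the check happens, so the line exists whether or not the final summary mentions it.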

Mike Czerwinski • Jun 28

That is the version I want to steal. "Create evidence while the work is happening" is sharper than anything I had phrased it as, because it pushes the cost onto the actor at the moment the action is cheap to observe, not onto the auditor at the moment it has already been narrated. Skills that emit structured traces during execution are doing exactly that. Skills that ask the model to summarize what it did are the narrative version dressed up as evidence.

The mechanical test is whether the trace would survive the agent flipping its conclusion. If the same trace can support both "I did the right thing" and "I did the wrong thing" depending on which one the agent wrote down, it was decor.

Alex Shev • Jul 14

That mechanical test is strong: could the trace survive the agent changing its conclusion?

If the trace only supports whatever story the agent writes afterward, it is not evidence. It is formatting. A useful skill trace should be boringly specific enough to outlive the summary: this predicate fired, this tool ran, this verifier passed or failed, this output was accepted.

Then the trace can make the agent uncomfortable later, which is exactly why it is worth keeping.

Mike Czerwinski • Jul 14

"Boringly specific enough to outlive the summary" is the actual spec, and it's a good one because it's checkable without reading the summary at all: does the trace still mean something if you delete every sentence the agent wrote about it. This predicate fired, this exit code, this diff, none of that needs the narrative to be true.

The uncomfortable part, worth naming since you're already halfway there: a trace that survives the agent changing its story can still be cherry-picked before it's written. Nothing stops an agent from running five checks, keeping the trace for the one that supports the conclusion, and quietly dropping the other four before anything gets written down. Boringly specific doesn't defend against selective evidence; it defends against fabricated evidence. Those are different attacks, and a trace built to survive the second one doesn't automatically survive the first: you'd need the trace to log what was run, not just what passed.

Alex Shev • Jul 14

Yes. Boringly specific defends against fabrication, but not against selective evidence.

That is the next layer: the trace should log the attempt set, not only the winning proof. If the agent ran five checks, the record needs to show five checks, including the ones that did not support the conclusion. Otherwise the trace can be true and still misleading.

Mike Czerwinski • Jul 14

Logging the attempt set closes the cherry-picking hole, but only for checks that got run. The harder case is a check that should have existed and never got written, so it never enters the attempt set either: nothing to log because nothing ran. That's not a different problem from the one you're describing over in the negative-evidence thread on this same post: checked-and-not-found needs to name what would have counted as found, or it's silent by omission either way. An attempt set with five checks looks thorough right up until someone asks why there wasn't a sixth.

Practical version: pair the attempt log with a declared coverage claim made before the checks run, not after ("here is what this class of check is supposed to catch"), then log against that declaration. A short attempt set against a narrow declared scope is honest. The same five checks against an undeclared scope are indistinguishable from the agent stopping wherever it got tired.

Alex Shev • Jul 14

Exactly. An attempt log without a declared coverage claim can become a polished list of whatever happened to run. The declaration matters because it gives reviewers something to compare against. I like the narrow-scope version: this check is meant to cover these failure modes, not all possible failure modes. Then a short attempt set can still be honest.
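
The two ideas in this exchange, declare the scope first and then log every attempt, can be sketched in a few lines. `ATTEMPT_LOG`, `declare_scope`, and `run_check` are hypothetical names, and the tab-separated log format is an assumption.

```bash
# Where attempts are recorded; override by setting ATTEMPT_LOG.
ATTEMPT_LOG="${ATTEMPT_LOG:-attempts.log}"

# Record what this set of checks is meant to cover, before any check runs.
# $1 = a one-line description of the failure modes in scope
declare_scope() {
  printf 'scope\t%s\n' "$1" >> "$ATTEMPT_LOG"
}

# Run a check and record it whether it passes or fails.
# $1 = check name, remaining arguments = the command to run
run_check() {
  local name="$1" status
  shift
  if "$@" >/dev/null 2>&1; then
    status=0
  else
    status=$?
  fi
  printf 'check\t%s\texit=%s\n' "$name" "$status" >> "$ATTEMPT_LOG"
  return "$status"
}
```

Because `run_check` writes the line before returning, a failed check lands in the log even if the caller ignores it. What this does not solve is the later point in the thread: the scope line is still written by the same agent that picks the checks.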

Mike Czerwinski • Jul 15

Narrow-scope-declared solves the completeness-theater problem but opens a narrower one: nothing stops the scope declaration itself from being narrowed to make the check look thorough. "This check covers X and Y" is honest if X and Y are actually the failure modes that matter, and it's just as gameable as an undeclared attempt set if the agent gets to pick X and Y after seeing what's easy to verify.

So the declaration probably needs its own outside anchor the same way the attempt set does, tied to something like a known failure taxonomy for the artifact class (the coupling categories from the other branch of this thread are a good instance of it), so a scope that quietly excludes the hard failure mode is visible as a gap against that list rather than just a narrower, still-plausible-looking sentence. Otherwise "declared before the checks run" fixes the ordering problem but not the selection problem: an agent can still choose what to declare with the same foresight it used to choose what to attempt.

Alex Shev • Jul 15

Agreed. A declared scope is only useful if the scope has a reference point outside the run. Otherwise the agent can make the declaration honest-looking while still choosing the easiest rectangle to defend. A fixed failure taxonomy is probably the missing anchor: the declaration can be narrow, but it has to be narrow against a visible map.
