DEV Community

Alex Shev

How to Write Terminal Skills That AI Agents Can Actually Use

Most AI agent advice still sounds like prompt advice.

Add more context.
Write clearer instructions.
Give the model examples.
Use a better system prompt.

That helps, but it misses the part that breaks in real work.

The problem is not always that the agent does not know a command. The problem is that the agent does not know your workflow.

It does not know when to inspect first.
It does not know which defaults are safe.
It does not know what "done" means.
It does not know when to stop instead of guessing.
It does not know which checks matter before something leaves the local machine.

That is where Terminal Skills become useful.

A Terminal Skill is not just a command shortcut. It is a small reusable operating procedure for an agent.

It teaches:

  • when to use a workflow
  • what inputs are acceptable
  • which commands or scripts are preferred
  • what output should exist
  • how to verify the result
  • when to stop and ask for help

That last part is the difference between a useful agent workflow and a confident mess.

Here is the pattern I use when writing one.

Start with the task, not the tool

The easiest mistake is to begin with a tool name.

Bad starting point:

Make an FFmpeg skill.

Better starting point:

Make a skill that turns a raw video file into an X-ready MP4, then verifies the upload is likely to work.

Those are different scopes.

The first one is a tool wrapper.

The second one is a workflow.

Agents do not need every possible FFmpeg flag. They need a stable path through a common problem.

The same applies to any terminal workflow:

  • not "make a git skill"
  • but "review a dirty worktree without touching unrelated user changes"
  • not "make a deploy skill"
  • but "deploy a Next.js app to Vercel and verify the live URL"
  • not "make a search skill"
  • but "inspect a repo and find the smallest safe file set for this task"

The more specific the workflow, the more useful the skill.

A useful skill has a contract

I like thinking about a Terminal Skill as a contract between the human, the agent, and the machine.

The contract says:

If this kind of task appears,
and these inputs exist,
follow this workflow,
produce this output,
run these checks,
and stop under these conditions.

That sounds simple, but it removes a lot of randomness.

Without a contract, the agent improvises.

With a contract, the agent has a default operating path.

That does not make the agent less intelligent. It makes the work less dependent on fresh reasoning every time.
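As a minimal sketch, the contract above could be expressed as a small shell wrapper. The function name `run_skill_contract`, its argument order, and its exit codes are hypothetical and not part of any Terminal Skills tooling; the workflow and verify commands stand in for whatever a given skill defines.

```bash
# Hypothetical contract wrapper: inputs exist -> workflow -> output -> checks -> stop.
# Usage: run_skill_contract INPUT OUTPUT WORKFLOW_CMD VERIFY_CMD
run_skill_contract() {
  local input="$1" output="$2" workflow="$3" verify="$4"

  # "and these inputs exist": stop instead of guessing.
  if [ ! -e "$input" ]; then
    echo "STOP: required input missing: $input" >&2
    return 2
  fi

  # "follow this workflow": run the skill's preferred command.
  if ! "$workflow" "$input" "$output"; then
    echo "STOP: workflow command failed" >&2
    return 3
  fi

  # "produce this output": the file must exist and be non-empty.
  if [ ! -s "$output" ]; then
    echo "STOP: expected output missing or empty: $output" >&2
    return 4
  fi

  # "run these checks": success is reported only after verification passes.
  if ! "$verify" "$output"; then
    echo "STOP: verification failed for $output" >&2
    return 5
  fi

  echo "DONE: $output"
}
```

With two helper functions supplied by the skill (say `copy_step` and `has_text`, both illustrative names), `run_skill_contract in.txt out.txt copy_step has_text` prints `DONE: out.txt` only when every stage passes. Each failure gets its own exit code, so a caller can tell which part of the contract stopped the run.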

The basic structure

For most Terminal Skills, I would start with a folder like this:

my-skill/
  SKILL.md
  scripts/
    run.sh
  examples/
    input-example.txt
  README.md

Not every skill needs a script.

Some skills are mostly procedural. Some are wrappers around existing CLIs. Some are just strong instructions plus verification commands.
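For skills that do use the full layout, the folder shown above can be scaffolded with a few lines of shell. This is a sketch only: `scaffold_skill` is a hypothetical helper name, and it creates empty placeholder files rather than real content.

```bash
# Hypothetical scaffold helper: creates the skill layout and never overwrites existing files.
scaffold_skill() {
  local dir="$1" f
  if [ -z "$dir" ]; then
    echo "usage: scaffold_skill <skill-dir>" >&2
    return 64
  fi
  mkdir -p "$dir/scripts" "$dir/examples" || return 1

  # Create each file only if it is missing, so rerunning is safe.
  for f in SKILL.md README.md scripts/run.sh examples/input-example.txt; do
    [ -e "$dir/$f" ] || : > "$dir/$f"
  done

  # The run script should be executable from the start.
  chmod +x "$dir/scripts/run.sh"
}
```

Calling `scaffold_skill my-skill` a second time leaves any edited files untouched, which matters once the SKILL.md has real content in it.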

But the SKILL.md is the important part.

It should tell the agent how to work, not just what the tool does.

Here is a practical template.

# Skill Name

## Use When

Use this skill when the user asks to:
- ...
- ...

## Do Not Use When

Do not use this skill when:
- ...
- ...

## Inputs

Expected inputs:
- source file or directory
- target format
- optional config

## Workflow

1. Inspect the input.
2. Choose the smallest safe action.
3. Run the script or command.
4. Verify the output.
5. Report the result with file paths and any warnings.

## Commands

```bash
./scripts/run.sh input-file
```

## Verification

Check:
- output file exists
- output format is correct
- command exited successfully
- logs contain no obvious errors

## Stop Conditions

Stop and ask the user if:
- required input is missing
- output validation fails
- the task would publish, delete, charge, email, or deploy something
- the command could overwrite user data

That is already more useful than a loose prompt.

It gives the agent a map.
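One way to keep that map complete is to lint each SKILL.md for the template's sections. The sketch below assumes the headings are written exactly as in the template above; `check_skill_sections` is a hypothetical helper, not an existing tool.

```bash
# Hypothetical lint: report any template section missing from a SKILL.md file.
# Heading names are taken from the template above.
check_skill_sections() {
  local file="$1" missing=0 section
  for section in "Use When" "Do Not Use When" "Inputs" "Workflow" \
                 "Commands" "Verification" "Stop Conditions"; do
    # -x matches the whole line and -F treats the heading as a literal string,
    # so "## Do Not Use When" cannot satisfy the check for "## Use When".
    if ! grep -qxF "## $section" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_skill_sections my-skill/SKILL.md` prints one line per missing section and returns non-zero if any are absent, so it can gate a commit or a review.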

The most important section is "Stop Conditions"

Most people write too little in this part.

They document the happy path and skip the failure boundaries.

But agents need stop conditions badly.

A good skill should say things like:

  • stop if the repo has unrelated user changes
  • stop if the API token is missing
  • stop if the public page cannot be verified
  • stop if the output file has no video stream
  • stop if the command would delete or overwrite source files
  • stop if the user approved a draft but did not approve publishing

This is where agent workflows become safer.
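A few of these stop conditions can be checked mechanically before the main command runs. The guards below are a sketch under stated assumptions: `API_TOKEN` and `PUBLISH_APPROVED` are placeholder variable names chosen for illustration, and the function names are hypothetical.

```bash
# Hypothetical pre-flight guards. Each prints a STOP message and returns non-zero.
# API_TOKEN and PUBLISH_APPROVED are placeholder variable names for illustration.

stop_if_token_missing() {
  if [ -z "${API_TOKEN:-}" ]; then
    echo "STOP: API token is missing" >&2
    return 1
  fi
}

stop_if_would_overwrite() {
  local src="$1" dest="$2"
  # Refuse to write onto the source file or any file that already exists.
  if [ "$src" = "$dest" ] || [ -e "$dest" ]; then
    echo "STOP: writing $dest would overwrite existing data" >&2
    return 1
  fi
}

stop_unless_publish_approved() {
  # Approving a draft is not approving publication; require an explicit flag.
  if [ "${PUBLISH_APPROVED:-no}" != "yes" ]; then
    echo "STOP: publishing has not been approved" >&2
    return 1
  fi
}
```

Chaining the guards with `&&` ahead of the main command means the first failed guard ends the run, and the STOP message tells the user which condition to resolve.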

For example, a social publishing skill should not only say:

Open the composer and publish the post.

It should say:

Before posting, verify the composer contains the exact approved text.
After posting, verify the final permalink shows the full text and attached media.
If media is missing, do not report success.

That is a real operating rule.

It captures the part of the workflow that normally lives in someone's head.

Verification should be concrete

"Check that it worked" is not enough.

A good Terminal Skill names the actual check.

For a video skill:

ffprobe -v error -show_streams -show_format output.mp4

For a code skill:

npm test
git diff --check

For a content publishing skill:

Open the final live URL.
Confirm the title, body, tags, canonical URL, and media are visible.
Do not trust the editor preview as final verification.

For a data export skill:

Check row count, headers, encoding, and sample records before sending the file.

Agents are very good at saying "done" too early.

Verification commands make "done" harder to fake.
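The data export check is the only one above without a command, so here is a rough sketch of what it might look like for a CSV file. `check_csv_export` and its arguments are hypothetical. It assumes one record per line, a file that ends with a newline, and UTF-8 as the expected encoding.

```bash
# Hypothetical check for a CSV export: header line, data row count, encoding, samples.
# Usage: check_csv_export FILE EXPECTED_HEADER MIN_ROWS
check_csv_export() {
  local file="$1" expected_header="$2" min_rows="$3" header rows

  header=$(head -n 1 "$file")
  if [ "$header" != "$expected_header" ]; then
    echo "FAIL: header is '$header', expected '$expected_header'"
    return 1
  fi

  # Count data rows, excluding the header line.
  rows=$(($(wc -l < "$file") - 1))
  if [ "$rows" -lt "$min_rows" ]; then
    echo "FAIL: $rows data rows, expected at least $min_rows"
    return 1
  fi

  # Encoding check runs only when iconv is installed; it fails on invalid UTF-8.
  if command -v iconv > /dev/null 2>&1 &&
     ! iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
    echo "FAIL: file is not valid UTF-8"
    return 1
  fi

  echo "OK: $rows data rows"
  # Print up to three sample records for review before the file is sent.
  sed -n '2,4p' "$file"
}
```

The point of the sample records is that a passing row count still says nothing about whether the rows are the right rows; the skill can require the agent to show them in its report.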

Keep the skill narrow

The best skills are boringly specific.

Bad:

content-automation

Better:

devto-draft-from-markdown
x-safe-video-export
reddit-comment-visibility-check
vercel-preview-deploy-and-verify

Narrow skills have clearer triggers and fewer hidden assumptions.

They are also easier to improve.

If a video export fails, fix the video skill.
If a DEV.to draft misses a canonical URL, fix the DEV.to skill.
If a Reddit comment is visible to the owner but not public, fix the visibility check.

One giant "content automation" skill would hide all of those failures inside one vague blob.

Small skills make the workflow inspectable.

Use scripts for mechanics, instructions for judgment

I do not think every skill should become a huge script.

Scripts are good for mechanical repeatability:

  • convert this file
  • validate this JSON
  • resize this image
  • call this API
  • generate this report

Instructions are better for judgment:

  • when to use the script
  • which candidates to reject
  • how to handle approval
  • what counts as verification
  • when not to continue

The skill should combine both.

For example:

SKILL.md explains the workflow.
scripts/export.sh performs the conversion.
Verification commands prove the output.
Stop conditions prevent the agent from bluffing.

That is the useful shape.

Not "the agent has a tool."

"The agent has a way of working."

Example: a tiny repo-inspection skill

Here is a small example that does not need a script.

# Repo Inspection

## Use When

Use this before editing an unfamiliar codebase.

## Workflow

1. Print the current directory.
2. Check git status.
3. List top-level files.
4. Identify package/framework files.
5. Search for relevant code with ripgrep.
6. Read the smallest useful files before editing.

## Preferred Commands

```bash
pwd
git status --short
rg --files | head -80
rg -n "keyword|component|route" .
```

## Stop Conditions

Stop before editing if:
- the user has unrelated changes in the target file
- the task requires a destructive git command
- the repo structure is unclear after inspection

## Verification

Before reporting done:
- show changed files
- run the smallest relevant test or check
- explain any test that could not be run

This is not glamorous.

But it prevents a lot of common agent mistakes.

It teaches the agent the shape of careful work.
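The first stop condition in that skill can be approximated with `git status --porcelain`. This is a sketch, and `stop_if_file_dirty` is a hypothetical name. Git cannot tell whether a change is "unrelated", so the guard treats any uncommitted change in the target file as a reason to stop and ask.

```bash
# Hypothetical guard: stop before editing a file that already has uncommitted changes.
# Returns 1 if the file is dirty, 2 if git itself fails (for example, outside a repo).
stop_if_file_dirty() {
  local file="$1" changes
  # --porcelain prints one line per changed path and nothing for clean paths.
  changes=$(git status --porcelain -- "$file") || return 2
  if [ -n "$changes" ]; then
    echo "STOP: $file has uncommitted changes: $changes" >&2
    return 1
  fi
}
```

An agent following the skill would run `stop_if_file_dirty path/to/file` before its first edit and report the STOP message to the user instead of continuing.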

What makes a skill good?

I usually judge a skill by five questions.

1. Does it have a clear trigger?

The agent should know when to use it.

If the trigger is vague, the skill will either be ignored or overused.

2. Does it reduce repeat reasoning?

A good skill saves the agent from rediscovering the same workflow again.

If the workflow is only used once, it may not need a skill yet.

3. Does it define done?

The skill should say what output must exist and how to verify it.

If "done" is subjective, the agent will guess.

4. Does it include stop conditions?

This is the safety layer.

The skill should prevent confident continuation when the workflow is missing a required input, external approval, or verification.

5. Is it small enough to maintain?

If the skill becomes a giant manual for everything, it stops being useful.

Small, composable skills are easier to trust.

The bigger point

AI agents are getting better at tool use.

That does not mean every workflow should be improvised in chat.

The more capable the agent becomes, the more important operating procedures become.

Prompts are good for intent.
Tools are good for capability.
Skills are good for repeatable work.

That is the layer I think more developers should build.

Not because it is flashy.

Because boring, reusable workflows are what turn agents from demos into something you can actually depend on.

I am collecting more examples of this pattern at Terminal Skills.

If you are building your own agent workflows, start with one annoying task you repeat every week.

Write down the trigger, workflow, verification, and stop conditions.

That is your first skill.

Top comments (15)

Mike Czerwinski

The Stop Conditions section is the one most other writing on agent skills underwrites, and it is the one that does the work. Most documentation of agentic workflows lives almost entirely in the happy path. Stop conditions are what turn a skill from a confident demo into something an operator can deploy and walk away from.

The framing of skill-as-contract is what makes the post generalize. A contract is what an operator-side decision record is supposed to be: a written down promise of trigger, workflow, output, verification, and refusal. Most agent stacks have prompts and tools but skip the contract layer, which means every run is improvised. Your point that the contract does not make the agent less intelligent, it makes the work less dependent on fresh reasoning every time, is the part that should sit on a wall somewhere.

One small bridge that may be useful for people coming from team work: your "produce this output, run these checks, stop under these conditions" is structurally what a Definition of Done is in a lean or agile context. The vocabulary is already developed for team-level work, and it transfers directly to agent-level work. Lean teams have been arguing for years that verification should be externally authored and that incomplete work should be visibly incomplete. Agent skills are the same shape one floor sideways.

The harder question I keep landing on is who reads the contract. Drift in team Definitions of Done usually comes not from missing contract text but from no one being on the hook for whether the contract is actually honored. The same shape will catch agent skills the moment they become widespread. The contract has to live somewhere a counterparty can flinch when it gets violated.

Alex Shev

That contract framing is the right bridge. A skill should not just tell the agent what to do; it should define the shape of a responsible run. Trigger, workflow, output, verification, refusal, and stop conditions are what make the behavior repeatable instead of improvised.

Mike Czerwinski

Yes, and the contract makes the skill auditable as well as repeatable, which is the underrated second-order effect. With trigger / workflow / output / verification / refusal / stop named explicitly, you can reconstruct after the fact what the agent should have done, separate from what it did. The improvised version gives you a transcript and an outcome and nothing in between, so disagreement collapses into vibes. The contracted version gives you six places to point at when something went sideways, and the agent has six places to point back. That asymmetry between what is fixed by the contract and what is left to the run is also what lets you change one slot without rewriting the skill, swap the verifier without touching the workflow, tighten the refusal without re-deriving the trigger. Composable in the boring sense, not the marketing one.

Alex Shev

That audit trail point is important. A skill contract should make disagreement concrete: was the trigger wrong, the workflow under-specified, the verifier weak, or the refusal missing? Without those slots, every failure turns into a debate about model behavior. With them, you can improve the system without rewriting the whole skill.

Mike Czerwinski

Trigger / workflow / verifier / refusal as typed slots is the cut that makes contract-failure debuggable instead of relitigated. The model-behavior debate is the failure mode you get when none of those slots are first-class, because every failure resolves into the same unfalsifiable bucket.

The follow-up I'd hold is that each slot needs its own typed evidence for the wrong-call, not just a label. "Trigger wrong" with no record of what fired and why is a slot in name only, and it pushes the debate one floor down instead of dissolving it. The trigger record needs to carry which keyword matched, which context predicate was true, and what the skill expected. Same shape for the other three. Slot without evidence is theater.

Without that, the four slots become four new places to argue.

Alex Shev

Yes. "Slot without evidence is theater" is the right warning.

The evidence record is what makes the contract debuggable: what triggered, which predicate matched, what verifier ran, what refusal condition applied, and what output was accepted. Without that, the slots become labels for the same old argument.

Mike Czerwinski

That list is the schema, and the discipline it enforces is that the record gets written by the run, not narrated about it afterward. The moment the evidence is reconstructed from the transcript instead of stamped at each slot as it fires, you're back to a label, because the reconstruction only sees what the happy path chose to log. Triggered-predicate, verifier-result, refusal-condition, accepted-output each have to be recorded at the point they happen, or the slot is decorative.

That's also what makes the contract improvable without a rewrite. When a run fails, the evidence record tells you which slot lied, so you patch the predicate or the verifier instead of reopening the model-behavior debate. Slot plus evidence is the difference between a skill you can debug and a demo you can only admire. Good thread.

Alex Shev

Yes, that is what makes it operational instead of narrative. If the evidence is reconstructed after the run, it can only describe the path the agent chose to remember. Stamping each slot at execution time gives you something closer to an audit trail: what fired, what failed, what was refused, and what output actually passed.

Mike Czerwinski

That's the line I'd put on it: closer to an audit trail than a label, and the difference is entirely when the record gets written. Stamped at the slot, it can disagree with the story the happy path would have told. Reconstructed after, it can only ever agree. Good thread.

Alex Shev

Yes, the timing of the record is the whole difference. If the slot writes while the work happens, it can contradict the later story. If the record is reconstructed after the fact, it mostly inherits the same blind spots as the summary.

Mike Czerwinski

Exactly, and that is the whole test for whether a record is evidence or decoration: could it contradict the story told later? A write that happens during the work can. A reconstruction cannot; by construction, it is downstream of the same narrator. So the slot earns its keep only if it is append-only and timestamped at the moment of action, never backfilled. The day you let it be reconstructed for completeness, you have turned the audit trail back into the summary it was supposed to check.

Alex Shev

That distinction is exactly right. An audit slot only earns trust if it can embarrass the summary later. If it is produced after the agent has already decided what happened, it is just another polished narrative. For skills, I like the same rule: the skill should create evidence while the work is happening, not ask the agent to remember the evidence afterward.

Mike Czerwinski

That is the version I want to steal. "Create evidence while the work is happening" is sharper than anything I had phrased it as, because it pushes the cost onto the actor at the moment the action is cheap to observe, not onto the auditor at the moment it has already been narrated. Skills that emit structured traces during execution are doing exactly that. Skills that ask the model to summarize what it did are the narrative version dressed up as evidence.

The mechanical test is whether the trace would survive the agent flipping its conclusion. If the same trace can support both "I did the right thing" and "I did the wrong thing" depending on which one the agent wrote down, it was decor.

Alex Shev

Small update after looking at current Terminal Skills search demand: the strongest pattern is not just people asking for more agent prompts. They are searching for skills around codebase architecture: how to inspect structure, find coupling, and make a change plan without turning the repo into vibes.

That is exactly where a skill should beat a prompt. The useful artifact is a repeatable architecture workflow: read the map, identify risk, propose a small refactor path, run checks, and leave evidence for the next agent or human.

Mike Czerwinski

This is the case that tests the contract framing hardest, because the output is a plan, not a diff, and a plan is the artifact people verify least. A refactor that runs checks has something to point at when it fails. A change plan that says read the map, find coupling, propose a path has no failing test if the map was read wrong. The verification slot is the one everyone quietly drops here.

So the evidence-for-the-next-agent line is the load-bearing one, and it has to carry more than the conclusion. The risk-identification step needs typed evidence the same way: which couplings it found, which it looked for and did not find, what it could not see. A plan that records only the risks it surfaced inherits the blind spot of the summary: it looks complete because the parts it missed never entered the artifact. Same failure shape as a reconstructed audit trail, one floor up.

The discipline that makes it work is the one this thread already landed on: the map read has to stamp what it inspected at inspection time, not what it concluded after. The next agent can re-derive a plan from a record of what was looked at. It cannot re-derive one from a verdict that already threw away the search. That is the difference between a skill that hands off architecture and a confident note that hands off a guess.