DEV Community

최해일

Posted on • Originally published at github.com

The Loadout Pattern: Handing the Wheel to an Autonomous LLM

The core idea

Conventional automation executes a procedure — code runs a fixed sequence of steps and decides
nothing; same input, same path, every time. The loadout pattern keeps the steps but moves the
deciding to the model. At each step the brain — an autonomous LLM — judges: what matters,
which tool to reach for, whether to act at all. It's handed a purpose and the latitude to pursue
it, and it drives — choosing its own tools as it goes. Code executes; the brain decides. Those
tools come as a loadout — a curated, self-describing set drawn from a shared toolbox — and
the brain is observed at the interface it calls, not by the side effects it leaves behind. The
model is the driver; your system is the suit it wears. Everything below is how to build that.

Most LLM integrations bolt a model into your code. This is about the opposite: letting the
model drive your system — equipping itself, on its own initiative, with a loadout: the
curated, self-describing set of tools it picks for each mission. The system stops being the
program that calls an LLM, and becomes the suit the LLM wears.

Audience: engineers building agentic/automation systems. There's code, and there's a bit of
philosophy — because the philosophy is what makes the code shaped the way it is.

Two words, kept distinct (the whole post hinges on this):

  • A toolbox (or catalog) is every tool you own: the whole armory.
  • A loadout is the curated subset a routine equips for one mission: what it actually suits up with.

The entire MCP server is a toolbox; a loadout is the handful of tools one routine is
handed at wake.


The usual way, and the constraint hiding in it

In a typical LLM integration the model lives inside your process. Your code calls it:

answer = agent.invoke({"input": "What changed in the market overnight?"})

This is great for human-triggered work: a person asks, the system fetches and answers. The
human is the caller; nothing happens until they show up. The LLM is a component — a function
your program calls and pays per token to use.

This post is about the other mode: the LLM doing the work on its own initiative. A routine wakes
on a schedule and gets on with it — digesting overnight news every hour, posting a morning briefing,
watching a queue, reconciling a ledger. No one asked; the routine is its own caller. Wake the model
on a cron — say, a headless Claude Code session every hour — and it is no longer a component inside
your program. It's outside, periodically taking the wheel and deciding what to do.

The line that matters isn't human-vs-cron, and it isn't even steps-or-no-steps. It's executing
versus deciding: a script runs its steps and decides nothing, while the brain, even when it
follows steps, decides at each one, choosing its own tools toward the goal.

That inversion changes what your system should be.

The metaphor: you, the brain, and the suit

Three layers, and it matters which is which:

  • You are Tony Stark — the owner. You delegate intent and occasionally override.
  • The LLM is JARVIS — the brain. An AI that operates the suit autonomously on your behalf: it judges, acts, and reports, and you don't micromanage it. (Throughout this post, "the brain" means exactly this — the LLM that drives.)
  • Your system is the suit — sensors, memory, tools, power.

Here's the leverage that falls out of this: you don't hand-author JARVIS's intelligence. It
comes from the model — and it improves when you swap in a better model, not when you write more
code. What you build is the suit — what the brain can sense, remember, and do. So the central
question of the whole system becomes: how do we equip the brain well — give it the right loadout —
and let it reach for the right tool at the right moment?

The problem: skills that tangle mission with mechanics

When you first wire a cron-woken routine, you write a prompt ("skill") that mixes two very
different things: the mission (what to judge, the actual work) and the mechanics (raw
curl, database queries, hardcoded IDs). A real before-state:

# news-digest skill (before)
1. Query Mongo for new headlines since the watermark:
   docker exec db mongosh app --eval 'db.news.find({publishedAt:{$gt: ...}})...'
2. Decide which are new stories vs updates vs noise.  ← the actual mission
3. Post the briefing:
   curl -X POST http://localhost:9000/notify -d '{"type":"SIGNAL", ...}'
   Then create a Notion page: data_source_id "<your-notion-data-source>", icon "📰", ...

Two problems compound. First, the mission (step 2 — judgment) is drowned in plumbing. Second,
every other routine that needs to "post a notification" re-describes that same curl in its own
prompt. Change the notification URL and you edit five skills. The mechanics are copy-pasted prose.

The pattern: a per-mission loadout + mission-only skills

Split the system along the seam between interface and implementation.

1. Tools are named capabilities — a stable name over a swappable implementation. Most are small,
dumb, independent scripts, but the name is the only thing the brain depends on; what sits behind
it is free to vary. Usually it wraps mechanics (a curl, a DB query, a stubbed no-op, a different
backend tomorrow). But a tool can just as well hand off to another agent — a sub-brain with its
own loadout — or trigger the next task in a pipeline. To the brain it's all the same: a name it
can reach for. So a tool is sometimes an interface over mechanics, and sometimes the next move:
another agent, or the start of the next step. notify sends a notification; read_news reads. They
don't know about each other. Together, all of them are your toolbox (the catalog).

#!/usr/bin/env bash
# notify.sh — send a notification (hides the URL/payload mechanics)
set -euo pipefail
[ "${1:-}" = "--describe" ] && { echo "notify|action|send a notification"; exit 0; }
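# Fail with a usage message if any of the three positional arguments is missing.
TYPE="${1:?usage: notify.sh <type> <title> <message>}"; TITLE="${2:?missing title}"; MSG="${3:?missing message}"
# Build the JSON body with jq so quotes and newlines in the arguments are escaped correctly.
payload="$(jq -n --arg t "$TYPE" --arg ti "$TITLE" --arg m "$MSG" '{type:$t,title:$ti,message:$m}')"
# -f turns an HTTP error status into a non-zero exit; -sS stays quiet but still prints errors.
curl -fsS -X POST "${NOTIFY_URL:-http://localhost:9000/notify}" \
     -H 'Content-Type: application/json' -d "$payload"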

2. Tools describe themselves. One line, --describe, is the single source of truth for what
the tool is. Not the skill, not a wiki — the tool.
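
If the tool is the single source of truth, it helps to check mechanically that every tool actually answers --describe in the expected shape. The following is a minimal sketch, not part of the original post. It assumes the name|kind|description format shown in notify.sh, and it further assumes that the name field matches the script's filename, which the post does not state. check_describe_line and check_toolbox are hypothetical helper names.

```shell
#!/usr/bin/env bash
# check_describe.sh (hypothetical name): verify that each tool's --describe
# line follows the "name|kind|description" shape used by notify.sh.
set -euo pipefail

# check_describe_line <expected-name> <line>
# Succeeds if the line has three non-empty fields and the name matches.
check_describe_line() {
  local expected="$1" line="$2" name kind desc
  IFS='|' read -r name kind desc <<<"$line"
  [ -n "$name" ] && [ -n "$kind" ] && [ -n "$desc" ] && [ "$name" = "$expected" ]
}

# check_toolbox <dir>: run every *.sh tool with --describe and report offenders.
check_toolbox() {
  local dir="$1" status=0 tool name line
  for tool in "$dir"/*.sh; do
    [ -e "$tool" ] || continue                    # the glob matched nothing
    name="$(basename "$tool" .sh)"
    case "$name" in loadout|_*) continue ;; esac  # skip the assembler and sourced helpers
    line="$(bash "$tool" --describe 2>/dev/null || true)"
    if check_describe_line "$name" "$line"; then
      echo "ok   $name"
    else
      echo "FAIL $name: got '$line'"
      status=1
    fi
  done
  return "$status"
}

# Only run when a toolbox directory is passed, e.g.: bash check_describe.sh tools
if [ "$#" -gt 0 ]; then check_toolbox "$1"; fi
```

One caveat: a tool that does not implement --describe will run its normal path with --describe as its first argument, so a check like this is safest against tools that validate their arguments before acting.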

3. A loadout assembler hands the brain its kit. Given a list of tool names (the loadout),
it prints their self-descriptions. This is what the routine runs at wake: it downloads its
loadout from the toolbox.

#!/usr/bin/env bash
# loadout.sh <tool> [tool...] — print the self-descriptions of the named tools (the loadout)
set -euo pipefail
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
echo "🧰 loadout for this mission"
for t in "$@"; do
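  # Capture first: under set -e, a missing or failing tool aborts the run instead of printing a blank entry.
  line="$(bash "$DIR/$t.sh" --describe)"
  IFS='|' read -r name kind desc <<<"$line"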
  echo "  - $name ($kind): $desc"
done

4. The skill becomes mission only. It states what to do and names its loadout. The
descriptions arrive with the loadout, not re-written in the skill:

# news-digest skill (after)
## Loadout — download at start
bash tools/loadout.sh read_news write_story notify publish_notion

## Mission
Turn new headlines into a running ledger of stories: skip repeats, extend ongoing
stories, open new ones, ignore noise. Each morning, post a briefing from the ledger.

5. The brain thinks for itself — tools don't auto-chain. Keep notify and publish_notion
separate; do not make "writing to Notion" secretly also send a notification. The moment you fuse
two tools in the plumbing you've frozen a policy — you can no longer publish quietly, or notify
without publishing. Leave the tools independent and let the brain reason about whether to call one,
the other, or both. The thinking is the brain's job; the wiring must not pre-decide it.

From the model's point of view, this is the whole win. When the routine wakes, it receives two
cleanly separated things: a mission — what to accomplish and how to judge it — and a
loadout — the named capabilities it is allowed to use. It never has to excavate the how (a
URL, a query, an ID) out of the what; the mechanics are simply not in its field of view, leaving
only the decision and the set of moves available to make it. The skill carries judgment (which
changes often); the toolbox carries capability (stable, shared); a loadout is just the names a
routine picks from it. A new routine lists tool names and gets their descriptions for free — change
a URL and you edit one tool, not five prompts.

Observability: log the interface, not just the result

Side effects are not proof. A notification arriving does not establish that the model invoked the
tool, and a tool whose implementation is a no-op stub produces no side effect at all even when the
model used it correctly. Verifying behavior therefore means observing the interface — the
moment a tool is called — separately from what its implementation did.

Each tool logs at that boundary:

# _log.sh (sourced by every tool)
tlog() {  # tlog <event> [detail]   event: INVOKED | OK | DRY | ERR
  printf '%s | %-12s | %-7s | %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(basename "$0" .sh)" "$1" "${*:2}" \
    >> "$LOG_FILE"
}

The log separates two questions that side effects conflate:

... | notify | INVOKED | SIGNAL | Morning briefing   # the interface was called
... | notify | OK      | SIGNAL HTTP 200             # the implementation sent it
... | notify | DRY     | SIGNAL                      # called, but did not send (DRY_RUN)

INVOKED records that the model used the tool, independent of any outcome; OK/DRY/ERR
records what the implementation did. Because the model depends on the interface rather than the
implementation, the same routine can run in a shadow mode — where notify only logs and never
sends — with no change in the model's behavior. The boundary log is also the reliable way to audit
a past run: it records what executed, not merely what the skill instructed.
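
To make the shadow mode concrete, here is a rough sketch of how a side-effecting tool might combine the boundary log with a DRY_RUN switch. It is an illustration under assumptions, not the post's actual notify.sh: the DRY_RUN=1 convention, the ./tools.log default, and the send_payload helper are all hypothetical, and this tlog variant takes the tool name as an explicit first argument so it can be called from a function.

```shell
#!/usr/bin/env bash
# Sketch: a side-effecting tool that logs at the interface and honors DRY_RUN.
set -euo pipefail

LOG_FILE="${LOG_FILE:-./tools.log}"   # assumed default; the post only names $LOG_FILE

# Variant of tlog with an explicit tool name.
tlog() {  # tlog <tool> <event> [detail]   event: INVOKED | OK | DRY | ERR
  printf '%s | %-12s | %-7s | %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "${*:3}" >> "$LOG_FILE"
}

# send_payload <json>: hypothetical stand-in for the real mechanics.
send_payload() {
  curl -fsS -X POST "${NOTIFY_URL:-http://localhost:9000/notify}" \
    -H 'Content-Type: application/json' -d "$1" >/dev/null
}

notify() {  # notify <type> <title> <message>
  local type="$1" title="$2" msg="$3" payload
  tlog notify INVOKED "$type | $title"   # the interface was called
  if [ "${DRY_RUN:-0}" = "1" ]; then
    tlog notify DRY "$type"              # shadow mode: log, never send
    return 0
  fi
  payload="$(jq -n --arg t "$type" --arg ti "$title" --arg m "$msg" \
    '{type:$t,title:$ti,message:$m}')"
  if send_payload "$payload"; then
    tlog notify OK "$type"               # the implementation sent it
  else
    tlog notify ERR "$type"
    return 1
  fi
}

# Only act when called with arguments, e.g.:
#   DRY_RUN=1 bash notify_sketch.sh SIGNAL "Morning briefing" "..."
if [ "$#" -ge 3 ]; then notify "$@"; fi
```

With a log in this shape, a line such as grep 'INVOKED' "$LOG_FILE" answers "did the model reach for the tool?" without any reference to whether a notification arrived.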

A pattern, and its limit

This is a pattern, not a framework, and that's its honest limit: nothing enforces it at runtime.
There's no base class, no inversion of control, nothing that prevents a routine from
ignoring its loadout or inlining raw mechanics again. Adherence is a matter of discipline — or, if
you want a guardrail, a CI check that every skill carries a mission and a declared loadout. What
you get in exchange is lightness: nothing to install, and incremental adoption — one tool, one
routine at a time, in any stack that runs a script and a prompt.
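
As one possible shape for that CI guardrail, the sketch below checks a skill file for a "## Mission" heading, a loadout.sh line naming at least one tool, and the absence of obvious inlined mechanics. Everything here is an assumption layered on the skill layout shown earlier: the script name, the check_skill helper, and the list of commands treated as mechanics (curl, mongosh, docker exec) are hypothetical and would need adjusting to a real repository.

```shell
#!/usr/bin/env bash
# check_skills.sh (hypothetical): CI guardrail for the skill layout shown above.
set -euo pipefail

# check_skill <skill-file> <tools-dir>: print problems, return 1 if any.
check_skill() {
  local skill="$1" tools="$2" bad=0 count=0 line names t
  grep -q '^## Mission' "$skill" || { echo "$skill: no '## Mission' section"; bad=1; }

  # Tool names are whatever follows "loadout.sh" on the first loadout line.
  line="$(grep -m 1 'loadout\.sh[[:space:]]' "$skill" || true)"
  names="${line#*loadout.sh}"
  for t in $names; do   # unquoted on purpose: split the names on whitespace
    count=$((count + 1))
    [ -f "$tools/$t.sh" ] || { echo "$skill: unknown tool '$t'"; bad=1; }
  done
  [ "$count" -gt 0 ] || { echo "$skill: no declared loadout"; bad=1; }

  # Raw mechanics in a skill suggest mission and plumbing are tangled again.
  if grep -nE '(^|[[:space:]])(curl|mongosh|docker exec)[[:space:]]' "$skill"; then
    echo "$skill: inlined mechanics (see lines above)"
    bad=1
  fi
  return "$bad"
}

# Usage: bash check_skills.sh <tools-dir> <skill-file>...
if [ "$#" -ge 2 ]; then
  tools_dir="$1"; shift; status=0
  for s in "$@"; do check_skill "$s" "$tools_dir" || status=1; done
  exit "$status"
fi
```

A check like this only guards the declaration. It cannot tell whether the brain actually stayed within its loadout at runtime, which is what the boundary log is for.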

Why it matters

Go back to the suit. You upgrade the brain by adopting a better model — that's not code you write,
it's a model you swap in. Your day-to-day engineering goes into the equipment: what the brain can
discover, reach for, and be observed using. And because the brain depends on interfaces — a loadout
of named tools — the suit is model-agnostic: change the model and the same loadout still fits. A
self-describing, observable loadout is precisely how the brain takes the wheel: it wakes,
downloads the tools it's allowed, sees what it can do, and acts — and you can watch it do so at the
interface, not by guessing from side effects. The system stops being a program that occasionally
calls a model, and becomes a suit a capable model wears.

Recipe (TL;DR)

  1. Tools = small scripts, one capability each, mechanics hidden — together, your toolbox. Add a --describe line and a boundary log (INVOKED + OK/DRY/ERR). Side-effecting tools support a DRY_RUN.
  2. Loadout = an assembler that prints a routine's named tools' self-descriptions. The routine runs it at wake to download its loadout.
  3. Skills = mission (judgment) + a loadout list. No mechanics. Let the brain compose tools; never auto-chain them in the wiring.

Runnable examples are in examples/.
