Elliott

Posted on May 29

Skills are Prompts. Here's how Hermes Apprentice turns them into weights

#hermesagentchallenge #devchallenge #agents #ai

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge

What I Built

It's 2 AM and Telegram lights up:

[gc-7f3a] Graduation candidate: "SKU extraction" — 14 examples, agreement 91%.
Reply train gc-7f3a to start training, skip gc-7f3a to dismiss.

You're half-asleep. You reply train gc-7f3a and put the phone down. Forty
minutes later you check Grafana, and the orange line (upstream tokens per
hour for this Hermes agent) has bent downward. A green line marked
specialist routed requests has stepped into its place. The next ten
thousand SKU-extraction requests cost nothing in upstream tokens.

Skills are prompts. Apprentice turns some of them into weights.

Hermes Agent already ships an answer for the first ten patterns you want
it to handle: a Markdown skill file with YAML frontmatter, dropped into
~/.hermes/skills/<name>/SKILL.md. The agent's LLM-judged selector picks
the right skill per request, and you're done. This works for the first
ten patterns. It strains at twenty. By thirty you spend more time editing
SKILL.md files than writing features, and the model is still paying full
upstream cost on tasks it has now seen a thousand times.

The obvious answer is to fine-tune. The unobvious cost is the
infrastructure: pair extraction from session history, PII redaction, a
baseline runner, a promotion gate, a versioned registry, a router that
decides per request whether to use the specialist or fall back to the
big model, a canary that rolls the new specialist out safely, and some
way to find out when any of this breaks. Most teams won't build it.

Apprentice is that infrastructure, packaged as a tool you install
alongside Hermes. It observes Hermes' SQLite session database, clusters
recurring patterns with a small embedding model, fires a Telegram
graduation message when a pattern matures, kicks off an Unsloth QLoRA
training run on your local GPU (or on RunPod), and runs the trained
model through a held-out validation gate. If the specialist beats a
baseline, the proxy starts serving it. Future matching requests get
routed to that local specialist. Misses fall through to OpenRouter.

The v0.2 surface organizes into two groups. On the rollout side, new
specialists begin at 5% traffic and auto-advance through 15, 25, 50, and
100 percent as their shadow-comparison agreement with the upstream model
stays above threshold; a drop triggers auto-demote and quarantine. The
trainer accepts three base models out of the box (Qwen2.5-1.5B as
default, Qwen2.5-3B, Llama-3.2-3B), chosen per pattern from a
user-editable trainer/supported_models.yaml. Two related specialists
can be collapsed into one via an MCP-proposed merge that requires
Telegram approval and survives a regression gate against both parent
baselines.
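
The rollout ramp described above can be sketched as a small state machine. This is a minimal illustration, not the proxy's actual Go code: the `Canary` class and its method names are hypothetical, and the 0.80 agreement threshold is an assumption taken from the canary step described later in this post.

```python
from dataclasses import dataclass

# Traffic stages from the post. The threshold is an assumption borrowed
# from the canary description later in the post.
STAGES = (5, 15, 25, 50, 100)
AGREEMENT_THRESHOLD = 0.80


@dataclass
class Canary:
    """Hypothetical ramp state for one specialist."""

    stage_index: int = 0
    quarantined: bool = False

    @property
    def traffic_percent(self) -> int:
        # A quarantined specialist receives no traffic.
        return 0 if self.quarantined else STAGES[self.stage_index]

    def observe(self, agreement: float) -> int:
        """Apply one evaluation window's agreement; return the new traffic share."""
        if self.quarantined:
            return 0
        if agreement < AGREEMENT_THRESHOLD:
            # Agreement dropped below threshold: demote and quarantine.
            self.quarantined = True
        elif self.stage_index < len(STAGES) - 1:
            # Agreement held: advance to the next stage (100% is terminal).
            self.stage_index += 1
        return self.traffic_percent
```

Starting from `Canary()`, four healthy windows walk the share from 5 to 100 percent, and a single window below the threshold drops it to 0 until an operator intervenes. How long a window is, and how a quarantined specialist is reinstated, are left out of this sketch.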

On the operations side, the proxy authenticates per request via
X-Apprentice-Tenant plus an API key header, applies a per-tenant
token-bucket rate limit, and tracks quotas; global patterns remain
visible to every tenant. A monthly budget posts Telegram alerts at 80%,
95%, and 100%, with the budget increase 10 command as the recovery path. When the
local GPU is busy, the orchestrator spills training to RunPod A100,
A6000, or L40S spot instances, gated by the same budget. Grafana shows
eight panels covering request rate, latency p50/p95/p99, error rate,
cost saved, top patterns, specialist-vs-upstream latency, status, and
24-hour counters. OpenRouter handles upstream traffic, with Fireworks,
MiniMax, and Together as fallback tiers.
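
For readers unfamiliar with per-tenant token buckets, here is a minimal sketch of the idea in Python. The real proxy is written in Go; the class names, the rate and capacity parameters, and the injectable clock are all illustrative assumptions, not the project's API.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant id (for example, the X-Apprentice-Tenant value)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str) -> bool:
        if tenant not in self._buckets:
            self._buckets[tenant] = TokenBucket(
                self._rate, self._capacity, self._clock)
        return self._buckets[tenant].allow()
```

Because each tenant owns its bucket, one noisy tenant exhausting its tokens does not affect another. The clock is injectable only so the behaviour can be tested without sleeping.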

These all live in real, named modules of the repo. They aren't roadmap
promises in a slide deck.

Demo

One command brings the whole loop up against a seeded fixture:

bash scripts/demo-run.sh

That script seeds a synthetic Hermes session log, runs the detector,
graduates a pattern, executes the full pipeline (dataset-builder,
trainer, merge, validate, promote), starts the serving layer and proxy, sends
a test request that matches the new specialist, and prints the Grafana
dashboard URL with a summary table at the end. The whole thing finishes
in well under an hour on a 2080 Ti.

The Grafana view that matters most during the demo is the cost-saved
panel. The orange "upstream tokens" line goes down while the green
"specialist routed requests" line goes up. The latency panel shows
specialist inference settling around 38 ms p50 and 85 ms p95, well

Code

Repo: github.com/eschmechel/hermes-apprentice
License: Apache 2.0

The project splits into small modules. Go handles the hot path (observer,
detector, dataset-builder, proxy, registry, burst). Python handles the ML
and orchestration (trainer, validator, serving, orchestrator, telegram,
installer):

hermes-apprentice/
├── observer/             — Go    Tails ~/.hermes/state.db, normalises pairs
├── detector/             — Go    BGE-small ONNX → HDBSCAN → candidate patterns
├── dataset-builder/      — Go    Fetches pairs, redacts PII, splits 80/10/10
├── trainer/              — Py    Unsloth QLoRA + manifest signer + multi-base-model
├── validator/            — Py    Baseline runner + promotion gate + registry
├── serving/              — Py    vLLM HTTP server + residency control plane
├── proxy/                — Go    OpenAI-compat router with canary/tenants/aliases
├── registry-service/     — Go    Read-only HTTP over ~/.apprentice/registry/
├── orchestrator/         — Py    Autonomous pipeline driver + MCP tools + budget
├── telegram/             — Py    Templates + outbox + getUpdates reply poller
├── installer/            — Py    Interactive setup: detect host, build venvs + Go
├── burst/                — Go    RunPod A100 spot dispatcher (signed jobs)
└── deploy/               — YAML  Docker compose, Grafana dashboards, Prometheus

The interactive installer is the intended entry point. It detects your
host's GPU, KVM, Docker, and uv state, recommends an isolation profile,
walks you through Telegram and OpenRouter credentials, picks a base
model, sets a monthly cloud budget, and emits cron lines for your
scheduler:

apprentice-setup --apply
apprentice-setup --apply --profile docker     # if you'd rather not run Firecracker
bash scripts/demo-run.sh                       # end-to-end smoke test

All settings persist in ~/.apprentice/.env. Re-running the installer
only updates what you provide.

My Tech Stack

Languages: Go 1.26 (proxy, observer, detector, dataset-builder, registry, burst), Python 3.10+ (trainer, validator, serving, orchestrator, telegram, installer).
Base model: Qwen2.5-1.5B-Instruct (Apache 2.0). Fits 11 GB of VRAM for QLoRA training and fp16 serving on the same card. Qwen2.5-3B and Llama-3.2-3B are configured alternates.
Training: Unsloth QLoRA, 4-bit base plus LoRA rank 16, sized per GPU via trainer/profiles/profile_*.yaml.
Serving: vLLM 0.21 with --enable-lora --max-loras 4. Multiple specialists share one warm base model; adapters hot-swap in for about 18 MB of extra VRAM each.
Routing: BGE-small (Apache 2.0) via ONNX runtime, 384-dimensional L2-normalized embeddings, cosine match against per-pattern centroids.
Privacy: Microsoft Presidio sidecar for PII redaction in dataset-builder. Secrets scanner runs pre-train. Per-pattern data cards capture provenance.
Observability: Prometheus scrape against the proxy's /metrics, Grafana dashboards in deploy/docker/compose.monitoring.yml.
Cloud burst: RunPod A100 spot instances, dispatched by signed jobs from burst/. Budget-gated.
Upstream: OpenRouter primary, multi-provider fallback chain (Fireworks, MiniMax, Together).
Isolation: Firecracker microVM is the default for the Hermes process; Docker Compose is the portable alternative.
Control plane: MCP server in the orchestrator exposes dispatch_training, propose_merge, cost_summary, roi, demote, and the budget tools.
Operator UX: Telegram for graduation approval, merge approval, and budget increases. Apprentice rides Hermes' own Telegram adapter rather than running a separate bot process on the host.

How I Used Hermes Agent

Apprentice would not work the same way against any other agent runtime.
That's the point of building it for this challenge. Hermes' substrate is
what we built on: a SQLite session database, a Markdown skill registry,
no_agent cron jobs, and an existing Telegram adapter.

The session DB is the input

Hermes writes every chat to ~/.hermes/state.db, a SQLite database in
WAL mode. The schema is straightforward: a sessions table with id,
source, model, system prompt, and token counts; a messages table with
role, content, tool_calls, and timestamps; an FTS5 virtual table that
makes full-text search a single query away. Apprentice's observer
(Go) tails this database and normalises each session into clean
(user-input, big-model-output) pairs. There are no Hermes patches
required, no schema migrations, and no fork of the agent. The observer
reads what Hermes already writes.
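
To make the pair-extraction idea concrete, here is a minimal Python sketch (the real observer is Go). The schema is an assumption: the post names the `messages` columns role, content, tool_calls, and timestamps, but the `session_id` and `timestamp` column names and the helper functions here are hypothetical.

```python
import sqlite3
from typing import Dict, Iterator, Tuple


def open_readonly(path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only, so the reader never writes to it."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def iter_pairs(conn: sqlite3.Connection) -> Iterator[Tuple[str, str]]:
    """Yield (user_input, assistant_output) pairs, session by session."""
    rows = conn.execute(
        "SELECT session_id, role, content FROM messages "
        "ORDER BY session_id, timestamp, id"
    )
    pending: Dict[str, str] = {}  # session_id -> latest unanswered user message
    for session_id, role, content in rows:
        if role == "user":
            pending[session_id] = content
        elif role == "assistant" and session_id in pending and content:
            # Skip assistant turns with empty content (e.g. pure tool calls).
            yield pending.pop(session_id), content


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in using the ASSUMED schema, not Hermes' real DDL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, session_id TEXT NOT NULL, role TEXT NOT NULL, "
        "content TEXT, tool_calls TEXT, timestamp REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO messages (session_id, role, content, tool_calls, timestamp) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("s1", "user", "extract SKU and quantity from this email", None, 1.0),
            ("s1", "assistant", "", '[{"name": "lookup"}]', 2.0),
            ("s1", "assistant", '{"sku": "A-100", "qty": 3}', None, 3.0),
            ("s2", "user", "hello", None, 1.5),
        ],
    )
    return conn
```

Calling `list(iter_pairs(make_demo_db()))` yields a single pair for session `s1`; the unanswered message in `s2` is dropped. Against a real database, `open_readonly` would replace `make_demo_db`, with the query adjusted to the actual column names.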

The detector (Go, BGE-small via ONNX) ingests the pair stream from the
observer, computes 384-dimensional embeddings on the user side of each
pair, and clusters them with HDBSCAN. When a cluster crosses a sample
threshold and shows consistent upstream-response shape, it becomes a
graduation candidate, a row in the orchestrator's job state.
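
The sample-threshold half of that check can be sketched in a few lines of Python. This assumes cluster labels in the usual HDBSCAN convention, where -1 marks noise points. The threshold of 14 is a hypothetical value borrowed from the example notification, and the "consistent upstream-response shape" test is not modelled here.

```python
from collections import Counter
from typing import Dict, Iterable

NOISE_LABEL = -1        # HDBSCAN's label for points assigned to no cluster
MIN_CLUSTER_PAIRS = 14  # hypothetical threshold, taken from the example message


def graduation_candidates(labels: Iterable[int],
                          min_pairs: int = MIN_CLUSTER_PAIRS) -> Dict[int, int]:
    """Return {cluster_label: size} for clusters at or above the threshold."""
    sizes = Counter(label for label in labels if label != NOISE_LABEL)
    return {label: n for label, n in sizes.items() if n >= min_pairs}
```

With labels for 14 pairs in cluster 0, 5 pairs in cluster 1, and 20 noise points, only cluster 0 would be returned as a candidate.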

Hermes skills are the output

When a specialist passes the promotion gate, the validator writes a
Markdown skill file to ~/.hermes/skills/<pattern-id>/SKILL.md. Under
the Firecracker profile, that file is scp'd into the microVM, and Hermes
picks it up in its skill registry on the next /reload-skills or
session start. This does two things at once. It tells Hermes'
LLM-judged selector that the pattern exists (so it shows up in
hermes skills list), and it points the proxy at the right adapter
via the pattern id stored in the skill's frontmatter. Routing itself
happens deterministically in the proxy, via cosine match on the
embedding, not in the LLM selector. The SKILL.md exists for ecosystem
visibility; the centroid exists for correctness.
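
A minimal Python sketch of that deterministic routing step follows. The proxy itself is Go, and everything named here is illustrative: the `route` function, the `min_similarity` cutoff of 0.85, and the toy 3-dimensional vectors are assumptions (the real embeddings are 384-dimensional). Because both sides are L2-normalized, cosine similarity reduces to a dot product.

```python
import math
from typing import Dict, List, Optional, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def route(embedding: Sequence[float],
          centroids: Dict[str, Sequence[float]],
          min_similarity: float = 0.85) -> Optional[str]:
    """Return the best-matching pattern id, or None to fall through upstream.

    Centroids are assumed to be unit-length already.
    """
    query = l2_normalize(embedding)
    best_id: Optional[str] = None
    best_sim = min_similarity
    for pattern_id, centroid in centroids.items():
        # Dot product of unit vectors equals cosine similarity.
        sim = sum(a * b for a, b in zip(query, centroid))
        if sim >= best_sim:
            best_id, best_sim = pattern_id, sim
    return best_id
```

A request whose embedding sits close to a centroid returns that pattern id; anything below the cutoff returns `None`, which corresponds to the fall-through to the upstream model.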

`hermes cron no_agent` jobs are the heartbeat

The autonomous side of Apprentice runs as hermes cron --no-agent jobs
registered inside the Hermes microVM:

ssh root@GUEST 'hermes cron create --name apprentice-telegram --no-agent \
    --script apprentice-telegram-dispatch.sh --deliver telegram "every 5m"'
ssh root@GUEST 'hermes cron create --name apprentice-poll-replies --no-agent \
    --script apprentice-telegram-poll.sh "every 1m"'

no_agent mode matters: we do not want Hermes' LLM to interpret these
crons. They are shell scripts that run on a schedule and exit. The
dispatch script flushes the outbox of graduation notifications, merge
proposals, and budget alerts. The poll script reads Telegram replies
through Hermes' getUpdates adapter and turns train gc-7f3a into a
structured job request for the orchestrator. The orchestrator's
watcher.tick is a third cron job that reads pending requests and runs
the pipeline.
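
The reply-to-job step can be pictured as a small parser. This is a sketch under assumptions: the `parse_reply` function and the shape of the returned dictionary are hypothetical, and the id pattern (`gc-` followed by hex digits) is inferred from the single example id in this post.

```python
import re
from typing import Dict, Optional

# Accepts "train <id>" or "skip <id>"; the id format is an assumption.
_REPLY = re.compile(r"^\s*(train|skip)\s+(gc-[0-9a-f]+)\s*$", re.IGNORECASE)


def parse_reply(text: str) -> Optional[Dict[str, str]]:
    """Turn an operator reply into a structured request, or None if unrecognized."""
    match = _REPLY.match(text)
    if match is None:
        return None
    return {
        "action": match.group(1).lower(),
        "candidate_id": match.group(2).lower(),
    }
```

Anything that does not match exactly is ignored rather than guessed at, which keeps a stray chat message from starting a training run.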

This kept the Apprentice process model small. We did not have to run
python-telegram-bot on the host or stand up a separate webhook server;
every operator-facing piece of the loop rides infrastructure Hermes
already exposes.

A graduation, end to end

Concrete paths for one full loop:

~/.hermes/state.db. Hermes writes a new session for a prompt that asked, in essence, "extract SKU and quantity from this email."
observer. Tails the DB, normalises the chat into a pair, ships it to the detector.
detector. Embeds the user side (BGE-small ONNX, about 2 ms), clusters via HDBSCAN. After the 14th pair, the cluster crosses threshold.
orchestrator. Creates a graduation candidate and enqueues a Telegram notification with id gc-7f3a.
Telegram, via Hermes cron. Your phone buzzes at 2 AM. You reply train gc-7f3a. The poll cron picks the reply up within the minute.
dataset-builder. Fetches all 14 pairs, runs them through Presidio for PII redaction, applies quality filters and fuzzy dedup, and splits 80/10/10. Roughly 30 seconds for about a thousand records on the demo profile.
apprentice-trainer. Unsloth QLoRA on Qwen2.5-1.5B-Instruct, rank 16. About 25 minutes on a 2080 Ti, or about 45 minutes on a RunPod A100 spot instance. Output is an 18 MB adapter.
Merge to fp16. save_pretrained_merged is the right path here. It avoids the Unsloth and vLLM tokenizer drift that bites adapter hot-swap.
apprentice-validate. Runs the merged model against the held-out 10% test set and a baseline model on the same prompts. The promotion gate requires the specialist beat baseline by at least 10 percentage points on F1; anything less is a failure report rather than a promotion.
Registry promote. The manifest is signed with the Ed25519 trainer key, and the SKILL.md is pushed to the microVM.
Canary ramp. The proxy starts routing 5% of matching requests to the new specialist while shadow-comparing every routed turn against the upstream model. Above 80% agreement the ramp auto-advances; below, it auto-demotes to "broken" and alerts you.
Live. Once at 100%, all matching requests stay local. The specialist serves at roughly 38 ms p50 and 85 ms p95.
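
As one concrete piece of the loop above, the 80/10/10 split in the dataset-builder step might look like the following Python sketch. The real dataset-builder is Go; the function name, the seeded shuffle, and the rounding rule (remainder goes to the test set) are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(records: Sequence[T],
                   seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then slice into train/validation/test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed, same split
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to test
    return train, val, test
```

With only 14 pairs, integer rounding gives 11 training, 1 validation, and 2 test records under this rule, so a held-out gate on so few examples is coarse. A fixed seed keeps the held-out set stable between reruns.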

The numbers

All measured on a single 2080 Ti, per docs/benchmarks.md:

| Stage | Measurement |
| --- | --- |
| Embedding (BGE-small ONNX) | ~2 ms |
| Cosine match against 100 centroids | <0.1 ms |
| Specialist inference (Qwen2.5-1.5B fp16) | p50 ~38 ms, p95 ~85 ms, p99 ~150 ms |
| LoRA adapter on disk | ~18 MB |
| Adapter VRAM cost on a warm base | +18 MB |
| Training (60 steps, QLoRA r=16) | ~25 minutes |
| Throughput | ~120 tokens/sec |

An end-to-end routed turn lands at roughly 40 ms p50 (2 ms embed plus
38 ms inference), and around 166 ms at p99 when the long tail of
inference hits. The upstream OpenRouter round-trip for the same prompt
is multiple seconds, and it costs real money per token.

The promotion gate's design floor is a 10-point F1 delta versus
baseline. Specialists that fail to clear the un-tuned base model never
leave the validator. The ROI ledger tracks training cost in
GPU-seconds (plus any teacher tokens) against the cumulative dollars
saved by routing matched requests locally instead of upstream.
Break-even arrives when the saved side passes the spent side,
typically within hours for any high-volume pattern.
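
The gate's decision rule is simple enough to sketch in Python. This is an illustration only: how the validator actually scores F1 for a given pattern is not described in this post, so the count-based `f1_score` helper and the `should_promote` name are assumptions.

```python
MIN_F1_DELTA_PP = 10.0  # percentage points, per the gate described above


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def should_promote(specialist_f1: float, baseline_f1: float,
                   min_delta_pp: float = MIN_F1_DELTA_PP) -> bool:
    """Promote only if the specialist beats baseline by the required margin.

    The delta is rounded before comparing so that float noise
    (0.82 - 0.72 is slightly below 0.10 in binary) does not fail a pass.
    """
    delta_pp = round((specialist_f1 - baseline_f1) * 100, 6)
    return delta_pp >= min_delta_pp
```

A specialist at 0.82 F1 against a 0.72 baseline clears the gate exactly; one at 0.80 against 0.75 produces a failure report instead of a promotion.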

Why this only works because of Hermes

In principle we could have pointed Apprentice at any
OpenAI-compatible upstream, but each piece of the loop borrows
something specific from Hermes. The session DB is what makes pair
extraction free. The skill registry is the deployment surface the rest
of the Hermes ecosystem already understands. no_agent cron is the
heartbeat we did not have to invent. The Telegram adapter is the
operator UX we did not have to stand up. From the outside, Apprentice
ends up looking like a feature of Hermes, because that's how it was
built.

The v0.3 work continues in the same direction: multimodal pattern
detection against vision skills, federated training across tenants on
a shared registry, and a deeper canary with full %-ramp and A/B
multi-LoRA comparison. All of it sits on top of Hermes rather than
next to it.

Top comments (9)

Mike Czerwinski • Jun 22

"Skills are prompts. Apprentice turns some of them into weights" is the economic case for what I'd been arguing as a correctness case: deterministic gates aren't only safer, they're cheaper. The "pulling routing decisions out of the LLM and onto deterministic math" line in your reply to Andy is the one that should be on a sticker. That's the same shape as locked-decision-by-id Redis lookup on the policy side — different artifact (centroid vs decision schema), same principle: the gate doesn't depend on a model's interpretation at all.

The promotion gate at +10pp F1 vs baseline is the deterministic threshold version of what I've been writing as status transitions on a decision lifecycle (proposed → accepted → locked). The shadow-comparison post-rollout catching agreement drift maps directly to what I'd been calling supersession-harvest on a decision graph. Different artifact types, same operator-discipline shape.

One question that sits in the gap: shadow-comparison catches specialist-vs-upstream divergence. What catches the case where the specialist and upstream drift together — both trained before a new SKU format, both confidently wrong in the same direction? Agreement stays high while ground truth has moved under both. The skill-metadata problem (last_verified_at, model_created_at) shows up here too, just on the specialist side instead of the prompt-skill side. Curious whether the v0.3 work has a story for that, or whether it's structurally outside what shadow-comparison can see.

Elliott • Jun 22

Fair hit. Agreement is a consistency signal, not a correctness one, so correlated drift is invisible to it. The lever orthogonal to agreement is one Apprentice already computes: the routing embedding. A new SKU format moves the inputs and distance-to-centroid trips before any output diverges. What it misses is the meaner case: same inputs, changed correct answer. There, the embedding stays put, and agreement holds, so both stay wrong together. No output-side signal sees that, so it would need a periodic ground-truth sample to stay consistent.

Albeit, I've currently halted further work on the Apprentice side until I finish some better data ingestion/stripping on the Hermes side. Still working on getting a full ecosystem set up securely. But it's on my to-do list!

Mike Czerwinski • Jun 22

Routing embedding as the input-side drift catch is the structural answer — embedding distance is the only signal in the loop that doesn't go through either model's verdict, so it catches the case where the world moves before either model notices. Same-inputs-changed-correct-answer is the harder one and you named it precisely: no signal inside the routing/specialist/upstream triangle sees it, because they all share the lineage. Periodic ground-truth sampling is the only honest answer, and the interesting design constraint there is that the sample has to come from a path the specialist and the upstream both don't touch — a labeled holdout that's refreshed by something outside the system, otherwise the holdout drifts with everything else. Same shape as needing an external timestamp or an operator-authored anchor on the policy side.

Halt-and-finish-data-work is the right order regardless. Looking forward to whatever you ship from the Hermes side; the schema work upstream tends to dictate what's measurable downstream, and the periodic-ground-truth question is genuinely easier once the ingestion shape settles.

Elliott • Jun 22

Exactly, the holdout's only an oracle if something outside the loop authors it; otherwise, it drifts along with everything else. Operator-authored, off the routing path. Agreed on the order, and I'll know a lot more once the ingestion shape settles. Great thread.

Mike Czerwinski • Jun 23

Operator-authored, off the routing path — that's the shape. Same with the order: schema first, ground truth second, otherwise you're measuring against a target you helped move. Ping when ingestion settles and the index sits on something stable — curious what the off-path holdout looks like as a real artifact. Great thread back.

Harjot Singh • May 31

This is one of the most economically interesting agent ideas I've seen in a while, because it closes the loop most people leave open: a prompt-based skill is expensive forever (you pay frontier tokens every single call), but once it's run enough times with high agreement, that's a labeled dataset sitting right there, and graduating it into a small specialist turns a recurring cost into a near-zero one. The "agreement 91%, 14 examples" gate is the smart part, you only distill skills that have proven stable, so you're not fine-tuning on noise. And the Grafana detail (upstream tokens bending down as specialist-routed steps up) is the whole thesis made visible: routing the matured tasks to the cheap path is exactly where the savings live. This is the same conviction behind how I build Moonshift, the cheapest model that clears the bar per task, and tasks that prove themselves should get cheaper over time, not stay frontier-priced out of habit. The human-in-the-loop train/skip via Telegram is a nice gate too. How do you guard against quality regression after graduation, shadow the specialist against the frontier model for a while, or trust the pre-graduation agreement score?

Elliott • Jun 3

Thank you so much. Sorry for the late response; I’ve been out of town. To answer your question:

Before a specialist sees real traffic it clears a promotion gate that checks how closely it matches upstream on a held-out test set. If it passes, the canary ramp routes 5% of matched traffic to it, scaling to 100% as agreement scores hold. 5% of traffic keeps shadowing upstream after full rollout to catch drift. Merged specialists also regression-gate against their parents.

Andy Stewart • Jun 3

This project hits the absolute sweet spot! Turning fine-tuning and routing into a local, lightweight closed loop, and using deterministic 'status gating' to squeeze maximum value out of compute while crushing upstream black-box costs. This is the kind of system architecture built for hardcore geeks who demand absolute control. Love it!

Elliott • Jun 3

Thanks, Andy. Glad the local-control angle came through. Pulling routing decisions out of the LLM and onto deterministic math is what made the system stop feeling like a demo. Glad you like it!