DEV Community

EmaadS

How Hermes Agent's self-improving 'skills' actually work — notes from building a real agent on it

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent.

Most "AI agents" are goldfish. They do a task, the context window closes, and
everything they figured out evaporates. The next run starts from zero.

Hermes Agent (Nous Research, MIT-licensed) is built around the opposite idea: when it
does something non-trivial, it can write itself a skill, and then improve that skill
the next time it's useful. I spent a day building a small real project on it, and the
self-improving loop is the part worth writing about, because it's easy to
underappreciate until you watch it happen in your own ~/.hermes folder.

This post is a hands-on look at how that loop actually works — the file format,
where skills live, how they get created and reused, and an honest take on the rough
edges.

The 60-second mental model

Hermes is a self-hosted agent: it runs on your machine, talks to any model
(Nous Portal, OpenRouter, OpenAI, local — whatever), and has real tools
(a terminal, web, files), plus persistent memory, a cron scheduler, and subagents.
You drive it interactively (hermes), as a one-shot (hermes -z "..."), or as a
library.

The differentiator is the closed learning loop:

do a task → distill what worked into a skill → reuse the skill next time →
refine the skill as you learn more.

Skills are just Markdown files Hermes reads back into context when relevant. That's
it. No fine-tuning, no vector DB ceremony — a written playbook the agent maintains
for itself.

What a skill actually is

After Hermes completes a complex task, it can author a skill into
~/.hermes/skills/<category>/<name>/SKILL.md. The format is plain Markdown with a
little front matter:

---
name: bounty-triage
description: Evaluate open-source bounties for AI-assisted development.
author: Hermes Agent
version: 0.1
category: bounty-scout
---

# Bounty Triage Evaluation Method
## Steps:
1. Retrieve candidates: `gh search issues --label bounty --state open ...`
2. Score each 0–2 on: funded? AI-allowed (VETO if it bans AI)? tractable? ...
3. Rank, pick top 5, verdict pursue/maybe/avoid.
## Pitfalls:
...

I didn't write that. Hermes did — after I asked it (once) to scout and triage
funded GitHub bounties. It turned the procedure it had just executed into a reusable
SKILL.md, gave it a name and a description, and registered it:

$ hermes skills list
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
┃ Name          ┃ Category     ┃ Source ┃ Trust ┃ Status  ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
│ bounty-triage │ bounty-scout │ local  │ local │ enabled │
└───────────────┴──────────────┴────────┴───────┴─────────┘

The description matters: it's how Hermes decides when a skill is relevant on a
future run. Skills are progressive disclosure for agents — the index is cheap, the
body loads when it applies.
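
To make "cheap index, lazy body" concrete, here is a minimal Python sketch of the idea. It is my own illustration, not Hermes's actual loader: the names `SkillEntry`, `parse_front_matter`, `build_index`, and `load_body` are hypothetical, and the parser assumes flat `key: value` front matter like the example above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class SkillEntry:
    """Index record: just enough to decide whether a skill is relevant."""
    name: str
    description: str
    path: Path


def parse_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split a SKILL.md into (front-matter fields, body).

    Handles only flat `key: value` lines between two `---` delimiters.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: Dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[i + 1:]).lstrip("\n")
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat the whole file as body


def build_index(skills_root: Path) -> List[SkillEntry]:
    """Collect name + description for every <category>/<name>/SKILL.md."""
    entries = []
    for path in sorted(skills_root.glob("*/*/SKILL.md")):
        fields, _ = parse_front_matter(path.read_text(encoding="utf-8"))
        if "name" in fields:
            entries.append(
                SkillEntry(fields["name"], fields.get("description", ""), path)
            )
    return entries


def load_body(entry: SkillEntry) -> str:
    """Read the full playbook only once a skill has been judged relevant."""
    _, body = parse_front_matter(entry.path.read_text(encoding="utf-8"))
    return body
```

Calling `build_index(Path.home() / ".hermes" / "skills")` would give you one small record per skill; only `load_body` touches the full procedure text, which is why a vague description hurts more than a long body.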

The part that surprised me: it improved its own skill

On a second run I told it to scout again and improve its skill if it found a
weakness. It used the skill it had written, then edited the SKILL.md itself. The
diff it made to its own playbook:

  • Funded? → "Clear cash payout explicitly stated (now robustly parsed from title, including decimals)."
  • Dollars-vs-effort? → "scoring now includes a type check for the numerical estimated dollar amount."

It had noticed its dollar-amount parsing was brittle on the first run and patched
the procedure so the next run starts sharper. Nobody told it which line to change.
That's the whole pitch made concrete: an agent that keeps a written, improving record
of how to do a job.
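
The skill itself is prose, not code, so Hermes never wrote a parser. But if you wanted to pin down what "robustly parsed from title, including decimals" plus "a type check for the numerical estimated dollar amount" could mean, a sketch might look like this. The function names, the regex, and the score thresholds are all my assumptions for illustration.

```python
import re
from typing import Optional

# Matches "$500", "$1,250.50", "$ 75.5"; the amount must follow a dollar sign.
_AMOUNT = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")


def parse_bounty_amount(title: str) -> Optional[float]:
    """Return the first dollar amount in an issue title, or None if absent."""
    match = _AMOUNT.search(title)
    if match is None:
        return None
    whole, fraction = match.group(1), match.group(2) or ""
    return float(whole.replace(",", "") + fraction)


def funded_score(amount: object) -> int:
    """Score the 'funded?' criterion 0-2, rejecting non-numeric amounts.

    The 100-dollar threshold is invented for this example.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        return 0
    if amount <= 0:
        return 0
    return 2 if amount >= 100 else 1
```

The type check is the part that matters: a title with no amount yields `None`, and `funded_score(None)` scores 0 instead of blowing up or silently comparing a string.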

Setup notes that actually mattered

A few practical things from getting it running, since "self-hosted, any model" hides
some sharp edges:

  • Install is clean. pip install hermes-agent && hermes postinstall (the postinstall bootstraps Node, ripgrep, ffmpeg, a browser). I isolated it in a uv venv on Python 3.11 to keep it tidy.
  • Point it at OpenRouter and you get ~200 models behind one key:
  hermes config set OPENROUTER_API_KEY sk-or-...
  hermes -z "your task" -m google/gemini-2.5-flash --provider openrouter --yolo
  • -z for one-shots, --yolo to auto-run tools. This is what makes it scriptable — you can put a Hermes call in a shell script or cron and it runs the whole fetch → reason → write-file → author-skill chain unattended.
  • Model choice is load-bearing for skill quality. A free model I tried was rate-limited (HTTP 429); gemini-2.5-flash was a reliable, cheap tool-caller (my whole two-run demo cost about $0.25). The agentic plumbing works on a cheap model; the judgment in the skills it writes gets better with a stronger one.
  • "Do a normal chat first." The docs say it, and they're right: confirm a plain task works before piling on tools — it saves you debugging the wrong layer.
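
Since the one-shot form is what makes Hermes scriptable, here is a rough Python wrapper for calling it unattended with a simple backoff, which is handy if your model gets rate-limited. It only uses the flags shown above; `build_command` and `run_with_retry` are hypothetical helper names, and the sketch assumes the CLI exits with a non-zero status when a run fails, which I have not verified.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def build_command(task: str, model: str, provider: str = "openrouter") -> List[str]:
    """Assemble the one-shot invocation from the flags used in this post."""
    return ["hermes", "-z", task, "-m", model, "--provider", provider, "--yolo"]


def _default_runner(cmd: Sequence[str]) -> int:
    """Run the command and return its exit code without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    runner: Callable[[Sequence[str]], int] = _default_runner,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry on any non-zero exit, doubling the wait between attempts.

    Assumption: a failed run (for example a rate limit) surfaces as a
    non-zero exit code. Returns the last exit code seen.
    """
    code = 1
    for attempt in range(attempts):
        code = runner(cmd)
        if code == 0:
            return 0
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return code


if __name__ == "__main__":
    command = build_command("your task", "google/gemini-2.5-flash")
    raise SystemExit(run_with_retry(command))
```

Passing the task as a separate list element avoids shell quoting problems, and the injectable `runner` and `sleep` make the retry logic testable without calling the real CLI.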

Honest take

What's genuinely good:

  • The skill loop is real and useful, not a gimmick. For a recurring, messy job (triage, monitoring, repetitive ops) an agent that writes down and refines its own procedure is exactly what you want.
  • Model-agnostic + self-hosted + real terminal tool = it does actual work, not just chat.
  • Skills are inspectable Markdown you can read, edit, and version — no black box.

What's rough:

  • Skill quality tracks model quality. On a cheap model the prose it writes is solid-but-templated; the structure is great, the wording is generic.
  • It's a big surface (cron, gateways, subagents, MCP, memory providers) and the docs are still catching up in places — expect some hermes <command> --help spelunking.

Why the loop is the point

Anyone can wrap a model in a while loop. The interesting thing Hermes does is let
the agent accumulate competence in writable artifacts across runs. Point that at
a problem that changes over time and never fully "finishes" (and most real problems
are like that) and you've got something that gets better while you sleep, with a
plain-text audit trail of why.

I liked it enough that the skill above is part of a small project I also entered in
the Build prompt — an agent that scouts funded open-source bounties and, fittingly,
taught itself how to judge them: github.com/emaadshamsi/bounty-scout.
