Nilam Bora

The Anatomy of a Self-Improving AI Agent — How Hermes Agent's Closed Learning Loop Actually Works

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Every AI agent framework solves the same problem: how to make an LLM use tools. Hermes Agent asks a different question entirely — how do you make an LLM that used a tool today use it better tomorrow?

That question sounds simple. The engineering required to answer it is not.

I've spent the past few weeks dissecting Hermes Agent's architecture, reading through its source code, studying the GEPA evolution pipeline, and mapping how its three-layer memory system actually works under the hood. What I found isn't just another agentic wrapper around API calls — it's a genuinely novel approach to building AI systems that compound in capability over time.

This article is a technical deep dive. No "getting started" tutorial, no surface-level overview. We're going straight into the internals — the Closed Learning Loop, the SKILL.md format, the progressive disclosure system that keeps token costs sane, and the DSPy + GEPA pipeline that lets Hermes evolve its own skills without a single GPU.

Let's crack it open.


The Goldfish Problem: Why Most Agent Frameworks Start From Zero Every Time

Here's a thought experiment. You ask an AI agent to scrape a website, parse the data, and save it to a CSV. It takes 12 tool calls, hits two errors, retries with different selectors, and eventually succeeds.

Tomorrow, you ask it to scrape a different website.

It starts from scratch. Same errors. Same retries. Same 12 tool calls. It learned nothing from yesterday.

This is what I call the Goldfish Problem, and it affects most of the major agentic frameworks:

| Framework             | Mental Model                        | Learns From Past Tasks?           |
|-----------------------|-------------------------------------|-----------------------------------|
| LangChain / LangGraph | Graph-based state machine           | ❌ No (state resets per workflow) |
| CrewAI                | Role-based multi-agent teams        | ❌ No (crews are stateless)       |
| AutoGen               | Conversational multi-agent dialogue | ❌ No (conversation history only) |
| Hermes Agent          | Persistent self-improving runtime   | ✅ Yes (Closed Learning Loop)     |

LangGraph gives you surgical control over state machines. CrewAI lets you prototype multi-agent teams fast. AutoGen enables agent-to-agent dialogue. These are good tools for their intended use cases.

But none of them treat past execution as training data for future execution. They orchestrate. They don't learn.

Hermes Agent does both.


The Closed Learning Loop — Execute, Evaluate, Extract, Retrieve

The core innovation in Hermes Agent is a four-phase cycle that Nous Research calls the Closed Learning Loop. Here's how it works:

┌──────────────────────────────────────────────────┐
│                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│   │          │    │          │    │          │   │
│   │ EXECUTE  │───▶│ EVALUATE │───▶│ EXTRACT  │   │
│   │          │    │          │    │          │   │
│   └──────────┘    └──────────┘    └──────────┘   │
│        ▲                               │         │
│        │          ┌──────────┐         │         │
│        │          │          │         │         │
│        └──────────│ RETRIEVE │◀────────┘         │
│                   │          │                   │
│                   └──────────┘                   │
│                                                  │
└──────────────────────────────────────────────────┘

Phase 1: Execute

The agent performs a task using its available tools — web search, browser automation, terminal commands, file operations, code execution. Standard agentic behavior. Nothing unusual here.

Phase 2: Evaluate

After the task completes, the agent doesn't just move on. It reviews its own execution trace — every tool call, every reasoning step, every error and recovery. It asks: What worked? What failed? What took more steps than necessary?

This is the critical divergence from other frameworks. Most agents treat completion as the end. Hermes treats it as the beginning of learning.

Phase 3: Extract

If the agent identifies a reusable pattern — a sequence of tool calls that solved a class of problems, an error-handling strategy that recovered gracefully, a workflow that could be templated — it codifies that pattern into a Skill.

Skills are written as Markdown files (SKILL.md) with structured YAML frontmatter. They're human-readable, version-controllable, and composable. More on this in the next section.

Phase 4: Retrieve

The next time the agent encounters a similar task, it searches its skill library before acting. If a relevant skill exists, it loads the skill's instructions and follows the proven workflow — skipping the trial-and-error phase entirely.

Then the cycle repeats. The skill gets refined. Edge cases get covered. The agent gets measurably faster and more reliable at recurring tasks.
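To make the control flow concrete, here is a minimal sketch of the four phases in plain Python. Every name in it (`Skill`, `SkillLibrary`, `run_task`, the keyword-overlap retrieval, the simulated tool calls) is hypothetical and chosen for illustration. It is not Hermes Agent's actual API, just the shape of the loop as described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    keywords: set[str]
    steps: list[str]


@dataclass
class Trace:
    task: str
    tool_calls: list[str]
    errors: int
    succeeded: bool


@dataclass
class SkillLibrary:
    skills: list[Skill] = field(default_factory=list)

    def retrieve(self, task: str) -> Skill | None:
        """RETRIEVE: pick the skill whose keywords overlap the task most."""
        words = set(task.lower().split())
        best = max(self.skills, key=lambda s: len(s.keywords & words), default=None)
        if best is not None and best.keywords & words:
            return best
        return None


def execute(task: str, skill: Skill | None) -> Trace:
    """EXECUTE: stand-in for real tool use. A known skill skips the retries."""
    if skill is not None:
        return Trace(task, list(skill.steps), errors=0, succeeded=True)
    calls = ["fetch", "parse", "retry_fetch", "retry_parse", "parse", "save"]
    return Trace(task, calls, errors=2, succeeded=True)


def evaluate(trace: Trace) -> bool:
    """EVALUATE: decide whether the trace is worth turning into a skill."""
    return trace.succeeded and (len(trace.tool_calls) >= 5 or trace.errors > 0)


def extract(trace: Trace) -> Skill:
    """EXTRACT: keep the non-retry steps, de-duplicated in order."""
    steps = [c for c in trace.tool_calls if not c.startswith("retry_")]
    deduped = list(dict.fromkeys(steps))
    keywords = set(trace.task.lower().split())
    return Skill("auto-" + trace.task.split()[0], keywords, deduped)


def run_task(task: str, library: SkillLibrary) -> Trace:
    skill = library.retrieve(task)
    trace = execute(task, skill)
    if skill is None and evaluate(trace):
        library.skills.append(extract(trace))
    return trace


library = SkillLibrary()
first = run_task("scrape product prices", library)   # no skill yet: trial and error
second = run_task("scrape blog titles", library)     # reuses the extracted skill
print(len(first.tool_calls), len(second.tool_calls))
```

The second run is shorter only because the first run left a skill behind. Nothing about the model changed between the two calls.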

There is a loose parallel with human procedural memory. You don't consciously think through every step of riding a bike; you draw on a motor routine built up over many past attempts. Hermes does something similar with tool-calling sequences.

And critically, this is architecturally different from the three common approaches to "giving agents memory":

  • RAG retrieves information. Hermes retrieves procedures.
  • Fine-tuning bakes knowledge into weights. Hermes stores it in editable text files.
  • Prompt caching saves tokens. Hermes saves entire workflows.

SKILL.md — The Anatomy of an Agent's Procedural Memory

This is where Hermes Agent gets genuinely interesting from an engineering perspective. Let's look at what a Skill actually is.

Every skill lives in ~/.hermes/skills/ as a directory containing a SKILL.md file. Here's a realistic example:

---
name: web-scraping-with-retry
description: >
  Scrapes web pages using browser automation with intelligent retry
  logic, selector fallbacks, and anti-detection measures.
version: 1.2.0
author: hermes-agent (auto-generated)
license: MIT
metadata:
  tags: [web-scraping, browser, automation, retry]
  related_skills: [data-parsing-csv, browser-stealth-config]
  platform: [linux, macos]
  min_tools: [browser, file]
  trigger_conditions:
    - "user requests web scraping"
    - "task involves extracting data from websites"
  success_rate: 0.94
  total_executions: 47
---

# Web Scraping with Retry

## Overview
This skill handles web scraping tasks with built-in resilience.
It was originally created after a session where naive scraping
failed due to dynamic content loading and rate limiting.

## Procedure

### Step 1: Assess the Target
Before scraping, check for:
- robots.txt restrictions
- Dynamic rendering (SPA vs server-rendered)
- Rate limiting headers (X-RateLimit-*)

### Step 2: Choose Strategy
- **Static HTML**: Use `fetch_url` tool directly
- **Dynamic/SPA**: Use `browser_navigate` + wait for selectors
- **Rate-limited**: Add 2-5 second delays between requests

### Step 3: Implement Fallback Selectors
Always prepare 3 levels of selectors:
1. Semantic: `[data-testid="product-price"]`
2. Structural: `.product-card > .price-container > span`
3. XPath fallback: `//div[contains(@class,"price")]`

### Step 4: Error Recovery
If a selector fails:
- Wait 3 seconds and retry (content may still be loading)
- Try next fallback selector
- If all selectors fail, capture a screenshot and report

## Anti-Patterns to Avoid
- Never scrape without checking robots.txt first
- Never use fixed sleep() — use element-wait conditions
- Never store raw HTML when structured data is available

Notice what's happening here. This isn't a prompt template. It isn't a function signature. It's a structured decision-making guide that encodes judgment — when to use which strategy, what to watch out for, how to recover from specific failure modes.

In this example, the agent wrote the skill itself and refined the procedure over 47 executions of scraping tasks. The success_rate: 0.94 in the metadata is not a hand-written estimate; the field is meant to be tracked from actual execution outcomes.
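To show how little machinery the format needs, here is a small sketch that splits a SKILL.md document into frontmatter fields and a Markdown body using only the standard library. The function name `split_frontmatter` is my own, and the parser is deliberately naive: it captures only unindented `key: value` scalars. A real loader would presumably pass the frontmatter to a proper YAML parser.

```python
from __future__ import annotations


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md document into (top-level scalar fields, Markdown body).

    Only unindented `key: value` lines are captured. Nested blocks, lists,
    and folded scalars (`>`) are skipped on purpose.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines.index("---", 1)
    except ValueError:
        return {}, text  # no closing delimiter: treat everything as body
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if line[:1].isspace() or ":" not in line:
            continue  # indented continuation, nested key, or blank line
        key, _, value = line.partition(":")
        value = value.strip()
        if value and value != ">":
            fields[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


SAMPLE = """---
name: web-scraping-with-retry
description: >
  Scrapes web pages using browser automation.
version: 1.2.0
metadata:
  success_rate: 0.94
---

# Web Scraping with Retry
"""

fields, body = split_frontmatter(SAMPLE)
print(fields)
print(body)
```

Because the skill is just text with a small header, it can be diffed, reviewed, and versioned like any other file in a repository.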

Progressive Disclosure: How Hermes Keeps Token Costs Sane

Here's the engineering problem: if you have 200 skills, you can't load all of them into the context window. That would burn tens of thousands of tokens before the agent even starts working.

Hermes solves this with a three-level progressive disclosure system:

Level 0 — The Index (Always Loaded)
The agent sees only skill names and one-line descriptions. This is a compact lookup table that costs maybe 500 tokens for dozens of skills:

Available Skills:
- web-scraping-with-retry: Scrapes web pages with retry logic and fallbacks
- data-parsing-csv: Parses messy CSV/TSV data into clean structured formats
- git-pr-review: Reviews pull requests for code quality and security issues
- email-digest-summary: Summarizes email threads into actionable bullet points

Level 1 — Full Skill (Loaded On Demand)
When the agent decides a skill is relevant to the current task, it loads the full SKILL.md content. Now it has the complete procedure, anti-patterns, and decision logic.

Level 2 — Deep References (Rare)
Some skills include a references/ subdirectory with example scripts, configuration templates, or API documentation. These are loaded only when the agent needs specific implementation details — like a regex pattern for parsing dates, or a template for a specific API's authentication flow.

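This three-tier system means the token cost scales mostly with relevance, not with library size. The index still grows by one short line per skill, so an agent with 500 skills pays a somewhat larger base cost than one with 50, but the full skill bodies, which dominate the cost, are loaded only when they are actually needed.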
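A rough sketch of the cost difference, under stated assumptions: the `SkillEntry` type, the helper names, and the four-characters-per-token estimate are all mine, and the skill bodies are filler text. It only illustrates why loading an index plus one body is far cheaper than loading every body.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SkillEntry:
    name: str
    summary: str   # one-line description shown in the Level 0 index
    body: str      # full SKILL.md text, loaded only at Level 1


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def build_index(skills: list[SkillEntry]) -> str:
    """Level 0: names and one-line summaries only."""
    lines = ["Available Skills:"]
    lines += [f"- {s.name}: {s.summary}" for s in skills]
    return "\n".join(lines)


def build_context(skills: list[SkillEntry], selected: set[str]) -> str:
    """Level 0 index plus the full body (Level 1) of each selected skill."""
    parts = [build_index(skills)]
    parts += [s.body for s in skills if s.name in selected]
    return "\n\n".join(parts)


skills = [
    SkillEntry(f"skill-{i}", "Short one-line summary", "Detailed procedure. " * 100)
    for i in range(200)
]
index_only = rough_tokens(build_context(skills, set()))
one_loaded = rough_tokens(build_context(skills, {"skill-7"}))
everything = rough_tokens(build_context(skills, {s.name for s in skills}))
print(index_only, one_loaded, everything)
```

The index line per skill is small but not free, so the base cost does creep up with library size. The saving comes from never paying for bodies that the current task does not need.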

When Does the Agent Create a New Skill?

Not every task becomes a skill. Hermes uses specific triggers:

  1. Complexity threshold: The task required 5+ tool calls to complete
  2. Error recovery: The agent hit errors and successfully recovered — that recovery pattern is worth saving
  3. Novelty detection: The task doesn't match any existing skill in the library
  4. User confirmation: In some configurations, the agent asks the user before creating a skill

The threshold is intentionally high. You don't want a skill for "read a file" — that's trivial. You want skills for multi-step procedures where the agent's learned judgment actually saves time.
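The triggers above could be combined in a predicate like the following sketch. How Hermes actually weighs them is not something I can confirm from the list alone, so the combination here is my assumption: the task must have succeeded and be novel, it must be either complex or involve a recovered error, and user approval acts as a final gate. `TaskOutcome` and `should_create_skill` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    tool_calls: int
    errors_recovered: int
    succeeded: bool
    matches_existing_skill: bool


def should_create_skill(outcome: TaskOutcome,
                        min_tool_calls: int = 5,
                        user_approves: bool = True) -> bool:
    """Return True when a finished task looks worth saving as a skill."""
    # Novelty: failed runs and already-covered tasks never become new skills.
    if not outcome.succeeded or outcome.matches_existing_skill:
        return False
    # Complexity threshold OR a successful error recovery.
    worth_saving = (outcome.tool_calls >= min_tool_calls
                    or outcome.errors_recovered > 0)
    # User confirmation is modeled as a final gate (assumed to default to yes).
    return worth_saving and user_approves


read_file = TaskOutcome(tool_calls=1, errors_recovered=0,
                        succeeded=True, matches_existing_skill=False)
scrape = TaskOutcome(tool_calls=12, errors_recovered=2,
                     succeeded=True, matches_existing_skill=False)
print(should_create_skill(read_file), should_create_skill(scrape))
```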


Self-Evolution: How DSPy + GEPA Optimize Skills Without a GPU

Skill creation through the Closed Learning Loop is powerful on its own. But Nous Research pushed further with an automated evolution pipeline called GEPA (Genetic-Pareto Prompt Evolution).

This is where Hermes Agent stops being "just" an agent framework and starts looking like an AI research platform.

The GEPA Pipeline (5 Stages)

Stage 1: Execution Trace Collection
GEPA reads the agent's SQLite database of past sessions. It doesn't just look at pass/fail outcomes — it analyzes the full reasoning trace: every tool call, every decision branch, every error message and recovery attempt.
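As an illustration of what trace collection might look like, here is a sketch against an in-memory SQLite database. The schema (`sessions`, `tool_calls`) and the `failed_traces` helper are invented for this example. The only thing taken from the description above is that session history lives in SQLite and that full traces, not just outcomes, get read back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per session, one row per tool call.
conn.executescript("""
    CREATE TABLE sessions (id INTEGER PRIMARY KEY, skill TEXT, succeeded INTEGER);
    CREATE TABLE tool_calls (
        session_id INTEGER REFERENCES sessions(id),
        step INTEGER, tool TEXT, error TEXT
    );
""")
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?)", [
    (1, "web-scraping-with-retry", 1),
    (2, "web-scraping-with-retry", 0),
])
conn.executemany("INSERT INTO tool_calls VALUES (?, ?, ?, ?)", [
    (1, 1, "browser_navigate", None),
    (2, 1, "browser_navigate", None),
    (2, 2, "fetch_url", "HTTP 429 Too Many Requests"),
])


def failed_traces(conn: sqlite3.Connection, skill: str) -> dict:
    """Group every tool call of each failed session for one skill, in order."""
    rows = conn.execute("""
        SELECT s.id, t.step, t.tool, t.error
        FROM sessions AS s JOIN tool_calls AS t ON t.session_id = s.id
        WHERE s.skill = ? AND s.succeeded = 0
        ORDER BY s.id, t.step
    """, (skill,)).fetchall()
    traces: dict = {}
    for session_id, step, tool, error in rows:
        traces.setdefault(session_id, []).append((step, tool, error))
    return traces


traces = failed_traces(conn, "web-scraping-with-retry")
print(traces)
```

Keeping the successful steps alongside the failing one matters: the later analysis stage needs to see what led up to the error, not just the error message.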

Stage 2: Reflective Failure Analysis
An LLM examines failed traces and generates what Nous calls "Actionable Side Information" — a diagnosis of why the skill failed. Not "it didn't work," but "the CSS selector assumed a class name that changed between page loads" or "the retry logic didn't account for 429 rate-limit responses."

Stage 3: Targeted Mutation
Based on the failure analysis, GEPA proposes specific text edits to the SKILL.md file. These aren't random perturbations — they're targeted fixes and improvements:

  ### Step 4: Error Recovery
  If a selector fails:
  - Wait 3 seconds and retry
+ - Check for 429 status code — if present, back off exponentially
+   (5s, 15s, 45s) before retrying
  - Try next fallback selector
  - If all selectors fail, capture a screenshot and report

Stage 4: Multi-Candidate Evaluation
GEPA generates multiple mutated variants of the skill and evaluates each against a test suite — either synthetic test cases or replayed real-world scenarios. It uses Pareto optimization across multiple dimensions (success rate, token efficiency, execution speed) rather than optimizing a single metric.

This is critical. A skill that succeeds 100% of the time but uses 50,000 tokens per run isn't necessarily better than one that succeeds 95% of the time using 5,000 tokens. Pareto selection lets you keep multiple specialized variants.
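The selection step can be sketched as a standard Pareto-front filter. The `Candidate` fields and the numbers are illustrative (the first two candidates reuse the figures from the paragraph above), and this is a generic non-dominated filter, not GEPA's actual selection code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    success_rate: float   # higher is better
    tokens: int           # lower is better
    seconds: float        # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every metric and better on one."""
    no_worse = (a.success_rate >= b.success_rate
                and a.tokens <= b.tokens
                and a.seconds <= b.seconds)
    better = (a.success_rate > b.success_rate
              or a.tokens < b.tokens
              or a.seconds < b.seconds)
    return no_worse and better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    Candidate("thorough", 1.00, 50_000, 90.0),
    Candidate("lean", 0.95, 5_000, 20.0),
    Candidate("sloppy", 0.80, 6_000, 25.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Both "thorough" and "lean" survive because neither beats the other on every axis, while "sloppy" is dropped because "lean" is better on all three.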

Stage 5: Human-in-the-Loop Review
The winning variant isn't automatically deployed. GEPA generates a Pull Request with the proposed changes, a diff showing exactly what changed, and the evaluation metrics that justify the change. A human reviews and approves before the skill goes live.

This safety rail is non-negotiable. Without it, you risk skill hallucination — the agent convincing itself that a broken procedure works because it evaluated against flawed test cases.
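The review artifact is easy to picture with Python's standard `difflib`. The skill text below is the Step 4 example from earlier. The file labels and the way the mutation is spliced in are assumptions for illustration, not the pipeline's real output format.

```python
import difflib

current = [
    "### Step 4: Error Recovery",
    "If a selector fails:",
    "- Wait 3 seconds and retry",
    "- Try next fallback selector",
    "- If all selectors fail, capture a screenshot and report",
]
# Hypothetical mutation: insert rate-limit handling after the first retry step.
proposed = current[:3] + [
    "- Check for 429 status code; if present, back off exponentially",
    "  (5s, 15s, 45s) before retrying",
] + current[3:]

diff = list(difflib.unified_diff(
    current, proposed,
    fromfile="SKILL.md (current)", tofile="SKILL.md (proposed)",
    lineterm="",
))
# Separate body changes from the "+++" / "---" file header lines.
added = [line[1:] for line in diff
         if line.startswith("+") and not line.startswith("+++")]
removed = [line[1:] for line in diff
           if line.startswith("-") and not line.startswith("---")]
print("\n".join(diff))
```

A reviewer sees two added lines and nothing removed, which is exactly the kind of small, legible change that is practical to approve or reject by hand.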

Why This Isn't Reinforcement Learning

GEPA looks superficially like RL, but it's fundamentally different:

| Property            | Traditional RL         | GEPA                    |
|---------------------|------------------------|-------------------------|
| Optimization target | Neural network weights | Plain text (Markdown)   |
| Requires GPU        | Yes                    | No (LLM API calls only) |
| Transparency        | Black box              | Fully readable diffs    |
| Rollouts needed     | Thousands              | ~35x fewer than GRPO    |
| Iteration unit      | Training step          | Git commit              |

Nous Research calls this "evolutionary search over text" — and it's a surprisingly effective paradigm. You get the benefits of automated optimization without the opacity of weight updates.


The Broader Architecture: What Powers the Loop

The Closed Learning Loop doesn't exist in isolation. It's supported by a comprehensive runtime architecture:

Three-Layer Memory System

  • Layer 1 — Core Memory: SOUL.md (persona), MEMORY.md (facts), USER.md (preferences). Loaded every session. Think of it as the agent's "identity."
  • Layer 2 — Procedural Memory: The skill library. Loaded on demand via progressive disclosure. This is where the learning loop writes to.
  • Layer 3 — Episodic Memory: SQLite database with full session history, FTS5 full-text search, and LLM-summarized recall. The raw material that GEPA mines for improvements.
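For a feel of the episodic layer, here is a sketch of full-text search over session summaries with SQLite's FTS5 extension. The table name, columns, and helper functions are hypothetical. FTS5 is an optional SQLite module, so the sketch falls back to a plain `LIKE` query when the local build lacks it.

```python
from __future__ import annotations

import sqlite3


def open_episode_store() -> tuple[sqlite3.Connection, bool]:
    """Create an in-memory store and report whether FTS5 is available."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE episodes USING fts5(session_id UNINDEXED, summary)"
        )
        return conn, True
    except sqlite3.OperationalError:
        # This SQLite build lacks FTS5: fall back to an ordinary table.
        conn.execute("CREATE TABLE episodes (session_id TEXT, summary TEXT)")
        return conn, False


def search(conn: sqlite3.Connection, has_fts: bool, term: str) -> list[str]:
    """Return the ids of sessions whose summary mentions a single word."""
    if has_fts:
        sql = "SELECT session_id FROM episodes WHERE episodes MATCH ? ORDER BY rank"
        args = (term,)
    else:
        sql = "SELECT session_id FROM episodes WHERE summary LIKE ? ORDER BY session_id"
        args = (f"%{term}%",)
    return [row[0] for row in conn.execute(sql, args)]


conn, has_fts = open_episode_store()
conn.executemany("INSERT INTO episodes (session_id, summary) VALUES (?, ?)", [
    ("s1", "Scrape failed: selector changed between page loads"),
    ("s2", "Weekly report emailed successfully"),
    ("s3", "Retry after rate limit, then selector fallback worked"),
])
hits = search(conn, has_fts, "selector")
print(sorted(hits))
```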

Model Agnostic by Design

Hermes doesn't lock you into a provider. It works with OpenAI, Anthropic, OpenRouter, Nous Portal, and local models via Ollama or vLLM. Swap models without changing skills — the procedural knowledge is in Markdown, not in weights.

Multi-Platform Gateway

A single Hermes process can serve you across CLI, Telegram, Discord, Slack, WhatsApp, Signal, and Email simultaneously. Your skills, memory, and learning history follow you across every surface.

~70 Built-In Tools + MCP Extensibility

Web search, browser automation (with anti-detection via Camofox), terminal, file operations, code execution, vision, image generation, and a cron scheduler for autonomous recurring tasks. Plus Model Context Protocol (MCP) support for dynamically loading external tool servers.

Sub-Agent Delegation

For complex workflows, Hermes spawns isolated child agents with their own conversation history and toolsets. Only the final result returns to the parent — preventing context window flooding while enabling parallelizable work.
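The isolation property can be sketched with a thread pool: each child builds its own history, and only a single result string crosses back to the parent. `run_child` and `delegate` are hypothetical names, and the "tool output" is placeholder text rather than real tool use.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def run_child(task: str, tools: list[str]) -> str:
    """Hypothetical child agent: keeps its own history, returns one string."""
    history = [f"system: you may use {', '.join(tools)}", f"user: {task}"]
    for tool in tools:
        # Intermediate output stays in the child's local history.
        history.append(f"tool:{tool} -> (intermediate output)")
    return f"{task}: done in {len(history)} messages"


def delegate(parent_history: list[str],
             jobs: list[tuple[str, list[str]]]) -> list[str]:
    """Run the child tasks in parallel and append only their final results."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda job: run_child(*job), jobs))
    parent_history.extend(f"child result: {r}" for r in results)
    return results


parent_history = ["user: compare two sites"]
results = delegate(parent_history, [
    ("scrape site A", ["browser", "file"]),
    ("scrape site B", ["browser"]),
])
print(parent_history)
```

The parent's history grows by one line per child regardless of how many tool calls each child made, which is the point of the design.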


When Should You Actually Use Hermes Agent?

Not every problem needs a self-improving agent. Here's an honest decision framework:

Use Hermes Agent when:

  • ✅ You need a persistent assistant that handles recurring tasks (daily summaries, weekly reports, ongoing monitoring)
  • ✅ You want an agent that genuinely gets better at your specific workflows over time
  • ✅ You care about data privacy and want full self-hosted control
  • ✅ You need multi-platform access (message your agent from Telegram, get results in Slack)
  • ✅ You're doing AI research and want to experiment with skill evolution

Use something else when:

  • ➡️ One-off pipeline orchestration → LangGraph gives you more control over deterministic state machines
  • ➡️ Quick multi-agent prototyping → CrewAI's role-task-crew abstraction is faster to set up
  • ➡️ Agent-to-agent collaboration experiments → AutoGen's conversational paradigm is more natural
  • ➡️ You need a library, not a runtime → Hermes is an always-on process, not a function you call

The honest truth: if you're building a one-shot data pipeline that runs once and dies, Hermes is overkill. Its value compounds over time. The longer it runs, the more skills it accumulates, the more efficient it becomes. Day one, it's just another agent. Day ninety, it's an agent that knows your infrastructure, your coding style, your failure modes.


The Compounding Agent Thesis

Here's what I think most people miss about Hermes Agent, and why I believe the Closed Learning Loop matters beyond this one framework:

The real moat in AI agents isn't who has the best LLM. Models are converging: GPT-4o, Claude, Gemini, and Llama are all remarkably capable. The real moat is accumulated institutional knowledge.

Every company, every developer, every team has specific procedures, preferences, failure modes, and tribal knowledge that no foundation model can know out of the box. The agent that captures and operationalizes that knowledge — automatically, incrementally, without manual prompt engineering — wins.

Hermes Agent's Closed Learning Loop is the first serious implementation of this idea that I've seen at the framework level. It treats every task completion not as an endpoint, but as a data point for improvement. Every error recovered from becomes a guardrail. Every successful workflow becomes a reusable template. Every edge case encountered becomes a documented anti-pattern.

The result isn't just an agent that runs tools. It's an agent that builds its own playbook — and gets better every time you use it.

That's not incremental improvement. That's a fundamentally different category of AI system.


Hermes Agent is open-source under the MIT license. You can find the code at github.com/NousResearch/hermes-agent, the self-evolution pipeline at github.com/NousResearch/hermes-agent-self-evolution, and the official documentation at hermes-agent.org.
