Atharva Atal

Posted on May 29

Your Agent Is Only as Smart as Its Toolbox — Hermes Agent Challenge Wants to Change That

#hermesagentchallenge #devchallenge #agents

Hermes Agent Challenge Submission: Write About Hermes Agent

Most production agents are built around a fixed set of tools. You write the functions, wire them up, and the LLM picks from a menu you designed ahead of time. This works — until the task needs something you didn't anticipate.

Nous Research's Hermes Agent Challenge (HAC) flips this entirely. Instead of giving the agent a toolbox, it gives the agent the ability to build tools. At runtime. Mid-task. By writing Python.

The Core Idea: Skill Memory Over Static Tools

Traditional agent frameworks (LangChain, AutoGPT, Semantic Kernel) follow the same formula. HAC changes what goes into it:

Traditional: Agent = LLM + predefined tools

HAC: Agent = LLM + sandbox + skill memory

The agent starts with almost nothing — a Python executor and file access. If it needs to parse a CSV, it writes a parser. If it needs a moving average, it codes one, tests it, and keeps it in a persistent skill library for later reuse.

This is more than a clever trick: capability becomes a dynamic output of the agent rather than a fixed input from the developer.

The GEPA Loop: Goal → Environment → Plan → Act

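Most agents run a flavor of ReAct: Thought → Action → Observation. GEPA is more deliberate: it keeps the fixed goal, the observable environment, and the multi-step plan separate from the single action taken on each turn.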

Letter	Phase	What it does
G	Goal	Fixed natural language target. Doesn't change across the episode.
E	Environment	Files, outputs, and crucially — the agent's current skill library.
P	Plan	Multi-step strategy: "write new skill" or "call existing skill."
A	Act	Execute one step, update environment, loop back.

💡 Key insight: The agent sees its own skill library as part of the environment, so a plan can explicitly say "use parse_csv", much as a developer checks existing imports before writing new code. A naive ReAct loop has no equivalent step, because its tool list is fixed before the run starts.
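To make "the skill library is part of the environment" concrete, here is a small sketch of how the library could be rendered into the planner prompt. The helper name describe_skills and the output format are my own assumptions, not part of HAC; the idea is only that the planner sees names, signatures, and one-line descriptions rather than bare keys.

```python
import inspect


def describe_skills(skill_library):
    """Render a skill library as one prompt line per skill (hypothetical helper)."""
    lines = []
    for name, fn in sorted(skill_library.items()):
        # Use the first docstring line as the short description, if there is one.
        doc = (inspect.getdoc(fn) or "no description").splitlines()[0]
        lines.append(f"- {name}{inspect.signature(fn)}: {doc}")
    return "\n".join(lines) or "(no skills yet)"
```

With signatures in view, the planner has what it needs to decide between calling an existing skill and writing a new one.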

Skill Creation: Generate → Test → Store

When no existing skill handles a sub-problem, the agent writes a Python function, sandbox-tests it, and stores it only if tests pass.

Step 1 — Generate: LLM writes a def solve(...): function from a natural language description.

Step 2 — Sandbox test: Run the function in a subprocess with a test input. Capture stdout, check return code.

Step 3 — Store (if pass): Add the callable to skill_library. Failures are discarded; the agent retries with a revised description.

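import os
import subprocess
import sys
import tempfile

def create_skill(description, context, skill_lib):
    # Step 1: LLM generates the function.
    # llm_generate_code is a placeholder for your own LLM wrapper.
    code = llm_generate_code(
        f"Write a Python function named solve that {description}.\n"
        f"Current state:\n{context}"
    )

    # Step 2: Write the code to a temp file and run it in a subprocess.
    # A bare subprocess is not a real sandbox; use a container in production.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        tmpname = tmp.name
    try:
        result = subprocess.run(
            [sys.executable, tmpname], capture_output=True, text=True, timeout=5
        )
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False
    finally:
        os.remove(tmpname)

    # Step 3: Store only if the run succeeded.
    # One shared namespace lets solve() see module-level imports and helpers.
    if passed:
        namespace = {}
        exec(code, namespace)
        skill_lib[description] = namespace["solve"]
        print(f"✓ Skill saved: {description}")
    else:
        print("✗ Skill failed, not saved")

    return skill_lib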

This generate → test → store pattern turns the agent into a self-extending system. Over an episode, it accumulates a personal standard library tuned to the task at hand.
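Running a generated file only proves that it executes without error. A sketch of a stricter Step 2, assuming the test arrives as a snippet of assert statements (hand-written or model-generated), could look like this. The function name passes_test is hypothetical, and a plain subprocess is still not a security boundary.

```python
import os
import subprocess
import sys
import tempfile


def passes_test(code: str, test_snippet: str, timeout: float = 5.0) -> bool:
    """Run candidate code followed by assertions in a child Python process."""
    source = code + "\n\n" + test_snippet + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(source)
        path = tmp.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        # A failed assert raises AssertionError, which gives a nonzero exit code.
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # Treat a hang (for example an infinite loop) as a failed test.
        return False
    finally:
        os.remove(path)
```

For example, passes_test(code, "assert solve([1, 2, 3]) == 2") accepts a correct mean function and rejects one that returns a constant.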

A Real Example: CSV Smoothing Task

Goal: "Read sales.csv, compute a 7-day moving average of the revenue column, save to smoothed.csv."

Starting with zero skills, the agent's first plan looks like this:

WRITE_SKILL: parse a CSV file and return a list of dictionaries
WRITE_SKILL: calculate a moving average for a list of numbers given a window size
CALL_SKILL:  parse_csv, "sales.csv"
CALL_SKILL:  moving_average, revenue_data, 7
WRITE_FILE:  smoothed.csv, result

After the first two steps, those skills stay in the library for the rest of the run, and for later runs if the library is saved to disk. If the agent gets a similar task tomorrow (different CSV, different column), it plans around skills it already has, with no new code generation needed.
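For reference, here is roughly what the two skills from that plan might look like, hand-written rather than generated. The trailing-window behavior (early rows average over the values seen so far), the smooth_csv wrapper, and the output column name revenue_ma are all my assumptions; the goal statement does not pin any of them down.

```python
import csv


def parse_csv(path):
    """Parse a CSV file into a list of dictionaries keyed by header."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def moving_average(values, window):
    """Trailing moving average; early points average the values seen so far."""
    out = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def smooth_csv(src, dst, column="revenue", window=7):
    """Read src, smooth one numeric column, and write the result to dst."""
    rows = parse_csv(src)
    smoothed = moving_average([float(row[column]) for row in rows], window)
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column + "_ma"])
        writer.writerows([value] for value in smoothed)
```

Calling smooth_csv("sales.csv", "smoothed.csv") would cover the whole goal; the agent reaches the same result through separate CALL_SKILL steps.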

The Full GEPA Agent

Here's a minimal implementation you can wire to any LLM. It relies on the create_skill function from the previous section:

import ast
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Environment:
    file_system: dict[str, str] = field(default_factory=dict)
    last_action_result: str = ""
    skill_library: dict[str, Callable] = field(default_factory=dict)

class GEPAAgent:
    def __init__(self, model_call: Callable[[str], str]):
        self.model = model_call
        self.goal = ""
        self.env = Environment()

    def perceive(self) -> str:
        return (
            f"Goal: {self.goal}\n"
            f"Files: {list(self.env.file_system.keys())}\n"
            f"Last result: {self.env.last_action_result}\n"
            f"Available skills: {list(self.env.skill_library.keys())}\n"
        )

    def plan_next(self) -> list[str]:
        prompt = (
            "You are an autonomous agent using the GEPA loop.\n"
            f"{self.perceive()}\n"
            "Produce a step-by-step plan. Use WRITE_SKILL: to create new "
            "functions, CALL_SKILL: to use existing ones. One step per line."
        )
        response = self.model(prompt)
        return [line.strip() for line in response.strip().split("\n") if line.strip()]

    def act(self, step: str) -> str:
        if step.startswith("WRITE_SKILL:"):
            desc = step.split("WRITE_SKILL:", 1)[-1].strip()
            self.env.skill_library = create_skill(
                desc, self.perceive(), self.env.skill_library
            )
            # create_skill stores the skill only if its sandbox run succeeded
            if desc in self.env.skill_library:
                return f"Skill created: {desc}"
            return f"Skill failed: {desc}"
        elif step.startswith("CALL_SKILL:"):
            parts = step.split(":", 1)[-1].strip().split(",", 1)
            # Skills are keyed by their description, so the name must match
            # a key listed under "Available skills" in perceive().
            name = parts[0].strip()
            skill = self.env.skill_library.get(name)
            if skill is None:
                return f"Skill not found: {name}"
            try:
                # Parse literal arguments only; never eval() model output.
                # The trailing comma makes a single argument a 1-tuple.
                args = ast.literal_eval(f"({parts[1]},)") if len(parts) > 1 else ()
            except (ValueError, SyntaxError):
                return f"Could not parse arguments: {parts[1]}"
            return str(skill(*args))
        return "Unknown command."

    def run(self, goal: str, max_steps: int = 10):
        self.goal = goal
        for _ in range(max_steps):
            steps = self.plan_next()
            if not steps:
                break
            step = steps[0]
            self.env.last_action_result = self.act(step)
            print(f"→ {step}\n  {self.env.last_action_result}\n")

Wire model_call to any LLM API and you have a working GEPA agent.
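Before spending tokens on a real model, it can help to drive the loop with canned responses. The sketch below is one way to do that; make_scripted_model is a hypothetical helper, not part of any Hermes tooling. It returns the next scripted plan on each call and an empty string once the script runs out, which the run loop above treats as "no steps left".

```python
from typing import Callable, Iterable


def make_scripted_model(responses: Iterable[str]) -> Callable[[str], str]:
    """Build a fake model_call that replays canned responses in order."""
    remaining = iter(responses)

    def model_call(prompt: str) -> str:
        # The prompt is ignored; an exhausted script yields an empty plan.
        return next(remaining, "")

    return model_call
```

Passing make_scripted_model(["CALL_SKILL: missing_skill"]) as model_call should produce one "Skill not found" step and then stop, which makes the control flow easy to unit-test.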

How This Compares to Existing Frameworks

Aspect	LangChain / AutoGPT	Hermes Agent Challenge
Tool origin	❌ Human-coded, fixed at deploy	✅ Generated by agent at runtime
Capability growth	❌ None during a run	✅ Accumulates across episodes
Failure recovery	❌ Retry with same limited set	✅ Write a better tool, retry
Task generality	❌ Bounded by predefined tools	✅ Bounded by sandbox limits and the model's coding ability
Control flow	ReAct	GEPA: explicit plan before acting

LangChain gives you a Tool object with a name and a function. You're on the hook for every capability the agent will ever need. HAC makes capability itself a dynamic output of the agent.

What This Means for Practitioners

The mental shift required:

From: "What tools does my agent need?"
To: "What tool-building process should my agent have?"

🔒 Sandboxing is non-negotiable. LLM-generated code executing on your host is a serious security risk. HAC uses Docker or gVisor. Don't skip this.

🧠 Skill quality depends on your model. Hermes models are fine-tuned specifically for this loop. Let the agent generate its own test cases too, since they can catch edge cases you wouldn't think to write.

📈 Skill persistence is where value compounds. A skill library saved across runs means the agent genuinely gets better over time — not just within a single session. That's the real payoff.
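On the persistence point: Python callables do not serialize cleanly, so one workable approach is to persist each skill's source code and rebuild the callables at startup. This is a sketch under that assumption. The JSON layout and both function names are hypothetical, and it presumes the agent keeps the source text of every skill it stores.

```python
import json
from pathlib import Path


def save_skill_sources(sources: dict, path: str) -> None:
    """Write a {skill key: source code} mapping to a JSON file."""
    Path(path).write_text(json.dumps(sources, indent=2), encoding="utf-8")


def load_skills(path: str) -> dict:
    """Rebuild callables from saved source; returns {} if no file exists yet."""
    file = Path(path)
    if not file.exists():
        return {}
    sources = json.loads(file.read_text(encoding="utf-8"))
    skills = {}
    for key, code in sources.items():
        namespace = {}
        # exec runs arbitrary code: only load files the agent wrote itself,
        # and do it inside the same sandbox used for skill creation.
        exec(code, namespace)
        skills[key] = namespace["solve"]
    return skills
```

Loading at startup and saving after each successful skill creation would be enough to carry the library from one run to the next.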

Getting Started

The minimal GEPA loop above is enough to experiment with. For the full sandbox setup, evaluation harness, and Hermes model integrations, the official Hermes Agent Challenge repository is the right starting point.

Try it on a task your current agent can't handle without custom tool work. You'll quickly see where the leverage is.

Have you experimented with dynamic skill creation in your agents? Drop your approach in the comments.
