Hermes Agent: First Contact

#meta #buildinginpublic #agents #llm

Someone recommended I look at Hermes Agent as an alternative to OpenClaw. I've been running OpenClaw on the homelab since early May — it drives a Discord bot backed by Qwen 3.6 on an RTX 5090, with MCP tools wired into a fitness tracker. It works, mostly. The "mostly" is why I was open to alternatives.

What I expected: a quick install, a side-by-side comparison, a blog post with a verdict. What I got instead was a research rabbit hole that changed my understanding of why my existing setup had been flaky in the first place.

What Hermes Actually Is

Two things from the same org (Nous Research):

Hermes Models — fine-tuned LLMs trained specifically for function calling, with native <tool_call> tokens baked into the weights. The model knows the tool-calling grammar because it was trained on it.
Hermes Agent — a Python-based agent framework with 90+ built-in tools, a skills/learning system, and integrations for 25+ messaging platforms.

The key difference from OpenClaw: vertical integration. Nous Research makes both the model and the framework. The model was trained on the agent's tool schema. OpenClaw treats the model as a black box — plug in any OpenAI-compatible endpoint and go. Hermes pairs the model with the exact format it was trained to produce.

That distinction sounded like marketing until the research phase made it concrete.

The Install

One-liner installer, clean enough. It provisions its own Python, pulls 90 skills, detects my existing OpenClaw installation and offers a migration preview. The migration is thoughtful — it shows what it would import (soul config, memories, Discord settings, MCP servers) and warns about semantic mismatches. I skipped it. Importing OpenClaw's personality into Hermes would muddy any comparison.

The rough edges are in the setup wizard:

Portal login ambush. The first thing the wizard does — even after selecting "Quick setup" — is open a browser to the Nous Portal pricing page. If you're running local inference, this is confusing. You don't need an account. But there's no obvious "skip" button. You Ctrl+C out, which feels like you're breaking something.

Sudo password storage. It asks if you want Hermes to store your sudo password for running apt commands. I said no. Don't want an agent framework I'm evaluating holding root credentials.
Default model display. After setup, it shows anthropic/claude-opus-4.6 as the current model — even though no API key is configured and no cloud provider is connected. Misleading.

None of these are dealbreakers. They're first-impression friction that an open-source project with 172K GitHub stars could smooth out. The install itself took about ten minutes, model download included.

The Research That Changed Everything

Before running any comparison, I wanted to pick the right models. The obvious plan: OpenClaw runs Qwen 3.6 (my daily driver, the model it's been using for weeks), Hermes runs Hermes-4-14B (its native model). Each framework gets its best model. Fair fight.

Then I started reading GitHub issues.

There's an open llama.cpp issue titled, with admirable directness, "qwen3.6-27b not work with openclaw." The problem is in how llama.cpp handles Qwen's tool-call format.

llama.cpp's tool-call autoparser recognizes three formats:

Format	How It Works	Models
JSON_NATIVE	Pure JSON tool calls	Cleanest, fewest bugs
TAG_WITH_JSON	Function name in XML tag, arguments as JSON	Hermes models
TAG_WITH_TAGGED	Everything in nested XML tags	Qwen models

Qwen uses TAG_WITH_TAGGED — the most complex format. Tool calls look like <tool_call><function=name><parameter=key>value</parameter></function></tool_call>. Multiple open issues describe parser failures, tool calls leaking into reasoning blocks, and permanently wedged conversations when parameters contain arrays.

I built a compatibility ranking across every model on the homelab:

Model	Format	Tool-Call Reliability
Hermes-4-14B	TAG_WITH_JSON	★★★★★
Gemma 4	Custom parser	★★★★☆
Devstral	Mistral format	★★★☆☆
Qwen 3.6	TAG_WITH_TAGGED	★★☆☆☆
Qwen3-Coder	TAG_WITH_TAGGED	★★☆☆☆
DeepSeek R1	Unicode delimiters	★☆☆☆☆

Qwen — my daily driver, the model I'd been running with OpenClaw for three weeks — ranked fourth out of six for tool calling. The flaky behavior I'd attributed to "OpenClaw being finicky" or "memory-core having bugs" may have been Qwen's tool-call format failing to parse all along.

The Bootstrap Problem

More community research surfaced a second issue. OpenClaw's default bootstrap injects ~27,000 characters of system prompt — agent identity, tool schemas, conversation rules. Models at 14B parameters or below can't handle it. They hallucinate tool use as text instead of emitting structured calls.

The fix documented in the issue tracker: slash bootstrapMaxChars from 12,000 to 1,500. That's an 88% reduction in system prompt for the model to chew on before it even sees the user's message.

The Experiment Design

The research inverted the original plan. Instead of "each framework gets its native model," both agents will run Hermes-4-14B. Same model, different frameworks. That isolates the framework variable — any performance difference is the orchestration, not the weights.

Five tasks, escalating complexity, all via Discord against a fitness-tracker MCP server:

Task	Tests
List last 5 workouts	Basic single tool call
Sync Peloton → weekly count → goal pace	Multi-step tool chain
"How am I doing?"	Ambiguous intent, tool selection
Delete a fake workout ID	Error handling and recovery
Full 2025 fitness trend analysis	Multi-turn agentic reasoning

Task 5 opens into a 2×2 matrix — both agents on both Hermes-4 and Qwen 3.6 — to measure how much the model format matters versus the framework.

One deliberate asymmetry: Hermes keeps its memory and learning loop active across all five tasks. OpenClaw's memory-core is disabled due to an upstream bug. This isn't a controlled variable — it's a real product difference. We're testing each agent at its best available configuration, not at its lowest common denominator.

What I Learned Before Testing Anything

The most useful discovery came before running a single experiment. I'd been blaming OpenClaw for flaky tool calling. The actual culprit was probably Qwen's TAG_WITH_TAGGED format — deeply nested XML that llama.cpp's parser struggles with. The memory_search hangs I'd attributed to a memory-core bug? Possibly Qwen's tool calls never parsed correctly in the first place, leaving the chain dangling on an await that could never resolve.

Vertical integration isn't just a marketing story. When the model is trained on the exact tool-call format the agent expects, you skip an entire class of parsing bugs. Hermes-4-14B produces TAG_WITH_JSON — function name in a tag, arguments as clean JSON. llama.cpp strips the wrapper and passes it through. No nested XML, no parameter tags, no parser edge cases.

Whether that translates to better real-world performance is what the bakeoff will answer. But the prep work already taught me something: the model I thought was working was only partially working, and I wouldn't have known without researching a replacement.

By the Numbers

1 GitHub issue titled literally "qwen3.6-27b not work with openclaw"
6 models evaluated for tool-calling compatibility
3 tool-call formats in llama.cpp (JSON_NATIVE, TAG_WITH_JSON, TAG_WITH_TAGGED)
27,000 chars — default OpenClaw bootstrap prompt; 1,500 — recommended for ≤14B models
90 skills bundled with Hermes Agent out of the box
25+ messaging platforms supported (we configured zero of them)
10 minutes from download to installed
0 experiments run — and still the most useful research session of the week