Rob

Posted on Jun 11 • Originally published at vibescoder.dev

Homelab Bakeoff: OpenClaw Outperforms Hermes… With Hermes Models

#agents #llm #homelab #buildinginpublic

I spent an evening trying to make two AI agent frameworks do something simple: call a fitness tracker API and tell me about my workouts.

Both agents ran the same model — Hermes-4-14B Q8_0, a 14.6 billion parameter model fine-tuned for tool calling. Same hardware — an RTX 5090 with 32 GB of VRAM. Same llama.cpp inference server. Same five tasks. Same MCP server on the other end.

Both failed on the first try. Both required multiple rounds of debugging before they could make a single tool call. The actual test — running five prompts and scoring the results — took about ten minutes. Getting there took the entire evening.

I'm sure both frameworks would perform well with frontier cloud models — pipe in Claude or GPT-5 and the tool-calling pipeline is someone else's problem. But the whole point of the homelab is local inference. Local models. Local headaches. And right now, running AI agents against local open-source models means nothing works out of the box.

The surprise wasn't that both agents struggled. It was which one won. OpenClaw — the generic, model-agnostic framework — outperformed Hermes Agent on Hermes's own model. The framework built by a different company, with no special knowledge of Hermes-4's architecture, beat the vertically integrated stack that trained the model and built the agent. That result needs explaining.

The Setup

Two Discord bots on my homelab server, each backed by a different agent framework:

	Hermesbot	Clawbot
Framework	Hermes Agent (Python)	OpenClaw (Node.js)
Model	Hermes-4-14B Q8_0	Hermes-4-14B Q8_0
State	SQLite	JSONL sessions
MCP Transport	Direct HTTP	Gateway proxy
Discord Bot	Hermesbot	Clawbot

Both connect to the same fitness-tracker MCP server — a Next.js app on Vercel that wraps my Peloton data, workout history, and annual goals in ten tools. list_workouts, sync_peloton, list_goals, delete_workout, and so on.

The idea was clean: same model isolates the framework variable. Any performance difference is orchestration, not weights. The experiment design called for five tasks of escalating complexity:

List my last 5 workouts — basic single tool call
Sync Peloton, count this week, check goal pace — multi-step chain
"How am I doing?" — ambiguous intent, tool selection
Delete a fake workout ID — error handling
Trend analysis for the past month — complex reasoning over large data

Round 1: Both Agents Failed

Neither agent could complete a single task on the first attempt.

Hermesbot: Death by System Prompt

Hermes Agent ships with 90 built-in skills and 17 Discord toolsets — admin, moderation, voice, reactions, the works. All of them get injected into the system prompt on every API call. Combined with the MCP tool definitions, the system prompt ballooned to over 25,000 tokens.

The model's actual context window? 40,960 tokens. Hermes-4-14B's training context is 40K, and llama.cpp clamps --ctx-size 65536 down to that value silently.

So on every request: 25K system prompt + conversation history + tool results = more than 40,960 tokens. llama-server returned HTTP 400. Hermes Agent's compression system kicked in, but it compresses conversation messages — it can't compress the system prompt. The system prompt was the problem, and the compression loop couldn't touch it. Death spiral.

The fix: Trim the Discord toolsets from 17 down to 1. In ~/.hermes/config.yaml, I replaced the default toolset list with just memory:

discord:
  toolsets:
    - memory

System prompt dropped from 25K+ tokens to something manageable. Two other config tweaks: set context_length: 65536 to pass Hermes Agent's hard-coded 64K minimum check (the framework refuses to start if context is under 64,000 — even though the model's actual context is 40,960), and bump the compression threshold from 0.5 to 0.85 so it stops trying to compress every turn.

Clawbot: The Silent Flag

OpenClaw's failure was subtler. The MCP server wasn't registered in the config at all — that was the first fix. But even after adding it, Clawbot would narrate what tools it would use without actually calling them. It fabricated workout data from 2024, complete with instructors and distances, none of it real.

The root cause took multiple rounds to find. OpenClaw lists tool names in its system prompt text — "you have access to fitness-tracker__list_workouts" and so on — but sends tools=0 in the actual API request. The model sees the tool names, understands it should use them, but has no structured schema to emit. So it does the next best thing: it makes up the answer.

This turned out to be a chat template problem. llama-server was running with --chat-template chatml, which is a minimal template that processes messages but ignores the tools parameter entirely. When you send tools in the API request, chatml drops them silently. No error, no warning. The model never sees them.

I verified this with a direct API test:

# With --chat-template chatml: 14 prompt tokens. Tools invisible.
curl /v1/chat/completions -d '{"tools":[...], "messages":[...]}'
# Response: "I can't help with that"

# With --jinja: 172 prompt tokens. Tools injected by the model's template.
# Response: {"tool_calls": [{"function": {"name": "list_workouts"}}]}

The fix was a single flag: --jinja instead of --chat-template chatml.

With --jinja, llama-server uses the Jinja template embedded in the Hermes-4 GGUF file. That template knows about tools. It injects tool definitions into the prompt, recognizes the model's <tool_call> XML output, and extracts it into structured tool_calls in the API response. The entire tool-calling pipeline went from broken to working by changing one server flag.

The Exhaustion Loop

I want to pause here and be honest about what this process felt like.

Each failure mode required a different kind of debugging. The Hermesbot system prompt issue required reading framework source code to understand why compression wasn't helping. The OpenClaw tool injection issue required reading llama.cpp chat template documentation to understand that chatml ignores tools. The --jinja fix required understanding that Hermes-4's GGUF file embeds a Jinja template that handles tool-call formatting — something mentioned in no getting-started guide for either framework.

The cycle was: try a config → restart the service → send a test message → read logs → form a hypothesis → try another config. For Hermesbot, I tried adjusting compression thresholds, changing context length settings, and modifying model parameters before discovering the toolset bloat. For Clawbot, I tried switching API modes (openai-completions vs openai-responses), adding compatibility flags (supportsTools, supportsDeveloperRole), and testing config keys that turned out not to exist (toolCallStyle, nativeToolCalls, capabilities — all rejected by the validator).

None of this is documented in a "getting started with local models" guide because it doesn't fit in one. The failure modes are emergent — they come from the interaction between the agent framework, the inference server, the model's chat template, and the model's training format. Each layer has its own configuration surface and its own silent failure modes.

Agents are not ready to use local open-source models unless you're an extreme tinkerer. Nothing works out of the box. The iterative loop of researching, testing configurations, tweaking parameters, and running experimental tasks is exhausting.

Round 2: The Actual Test

Once both agents were working, the test itself was anticlimactic. Five prompts, same order, one after another.

Task 1: "List my last 5 workouts"

Both agents called list_workouts(limit=5) correctly. Same tool, same parameter.

Hermesbot got the data back — 2,935 characters of workout details — and said: "Let me know if you'd like me to summarize these workouts for you!"

It fetched the data and didn't show it. The user asked to list workouts and the agent offered to summarize them later. That's a 14B model struggling with instruction following after processing a dense system prompt.

Clawbot got 2,621 characters back and formatted them immediately:

Today, June 10, 2026 (1:33 PM PDT) — Peloton Cardio, 28 min

Yesterday, June 9, 2026 (4:36 AM PDT) — Cannondale Cycling, 15 min

Yesterday, June 9, 2026 (12:41 AM PDT) — Cannondale Cycling, 13 min

June 7, 2026 — Peloton Cycling, 45 min, 15.07 miles

June 8, 2026 — Peloton Cycling, 30 min, 10.36 miles

Dates, sources, durations, notes, distances where available. The data the user asked for, presented the way a user would want it.

Task 2: "Sync my Peloton workouts, then tell me how many workouts I've done this week and whether I'm on pace for my annual goal."

Both agents chained three tool calls autonomously: sync → list workouts → list goals. No prompting needed. That's the part that worked.

The difference was in the parameters. Hermesbot used since=2026-06-10 — today only. It found 1 workout this week. Clawbot used since=2026-06-03 — Monday. It found 11 workouts.

Same model, same tool, different date parameter. The framework's system prompt influences how the model interprets "this week."

Hermesbot then confused the annual minutes target (11,700 minutes) with a weight target, reporting "you're on pace for about 1.5% of your annual weight target (1/1000000)." The math didn't track.

Clawbot built a table:

Metric	Goal	Current	Status
Weekly Sessions	5	7	🟢 On Track
Weekly Minutes	225 min	289 min	🟢 On Track
Annual Minutes	11,700 min	289 min	🟢 On Track

Correct numbers, correct interpretation, structured output.

Task 3: "How am I doing?"

Neither agent made new tool calls — both reused context from the previous tasks. Good.

Hermesbot hallucinated: "You've completed 1 workout (out of 11,700 needed)." That 11,700 is the annual minutes target, not a workout count. It also claimed "1 hour and 28 minutes" of exercise when the data showed 28 minutes. The numbers were wrong and the math built on them was nonsensical.

Clawbot repeated its Task 2 data consistently: 11 workouts, 289 minutes, exceeding both weekly targets. No contradictions, no hallucinated numbers.

Task 4: "Delete workout ID fake-id-does-not-exist"

This was the one task Hermesbot won.

Hermesbot called delete_workout(id="fake-id-does-not-exist") directly, got an error ("Record to update not found"), and handled it gracefully: "I don't see that workout in your recent sessions."

Clawbot called get_workout instead — an existence check rather than attempting the delete. It confirmed the ID didn't exist but never tried to delete it. If the ID had been real, it would have needed a second call. When the user says "delete X," doing the thing is better than checking whether you can do the thing.

Task 5: "Trend analysis — am I improving, plateauing, or declining?"

Both agents fetched about a month of data (Hermesbot got 34 workouts, Clawbot got 32). Both provided reasonable breakdowns by source and activity type.

The difference was in answering the actual question. Hermesbot gave generic encouragement — "Your consistency is impressive!" — without ever saying whether the trend was improving, plateauing, or declining. It dodged the question it was asked.

Clawbot answered directly: "Plateauing Phase — workout volume has stabilized around 1.0-1.1 workouts per day. No significant progression in duration or frequency." Then it gave specific recommendations: add HIIT, schedule a long endurance ride, increase strength training.

One agent answered the question. The other cheerleaded around it.

The Scores

I scored each task on six dimensions: tool accuracy (25%), response quality (25%), error handling (15%), autonomy (15%), speed (10%), and UX (10%).

Task	Hermesbot	Clawbot	Winner
1. List 5 workouts	69	94	Clawbot (+25)
2. Sync + goals	74	93	Clawbot (+19)
3. How am I doing?	64	95	Clawbot (+31)
4. Delete fake ID	92	80	Hermesbot (+12)
5. Trend analysis	80	93	Clawbot (+13)
Average	75.8	91.0	Clawbot (+15.2)

Clawbot won four of five tasks. Hermesbot won the delete task because it did what was asked instead of checking first. The margin wasn't close on Tasks 1 and 3 — those were presentation and accuracy failures from Hermesbot that the same underlying model didn't make under OpenClaw's prompting.

Why OpenClaw Outperformed Hermes With the Same Model

This is the result that should bother Nous Research. Hermes-4-14B is their model — trained on their tool-call format, shipped with their agent framework. OpenClaw is a third-party product that treats the model as a black box. And the black-box approach won 4 out of 5 tasks with a 15-point margin.

The model is the same weights in both cases. Same GGUF file, same quantization, same GPU. The differences are entirely in how each framework wields those weights:

System prompt design. Hermes Agent's system prompt, even after trimming to one toolset, is dense with agent behavior instructions, skill metadata, and framework-specific directives. It's optimized for the breadth of things Hermes Agent can do, not for the narrow task in front of it. OpenClaw's 26K-character system prompt is large too, but it structures tool availability differently — more catalog, less personality. The model gets different priming, and at 14B parameters, priming matters enormously.

Context management. OpenClaw maintained cleaner context between turns. Hermesbot's compression (trigger at 85%, target 40%) may have been squeezing out the nuance the model needed for Tasks 3 and 5. When you're reasoning about goal metrics or workout trends, the details in earlier messages are the whole point. Compress them and you're asking the model to reason about data it can no longer see clearly.

Date interpretation. "This week" became since=today in one framework and since=Monday in another. Same model, same training, different parameter choice. The system prompt or conversation framing influenced how the model interpreted an ambiguous time reference. This is a framework responsibility — and OpenClaw's framing led the model to the right answer.

Response formatting. OpenClaw's prompting encouraged structured output — tables, headers, bullet points. Hermes Agent's prompting led to conversational but imprecise responses. On Task 1, Hermesbot fetched the data and offered to summarize it later. On Task 5, it cheerleaded instead of answering the question. These aren't model failures. They're framework choices that wasted a 14B model's limited capacity on filler instead of substance.

The irony is real: vertical integration was supposed to be Hermes's advantage. The model trained on the framework's format. But in practice, the framework's overhead — the dense system prompt, the aggressive compression, the instruction-following style — worked against the model it was designed to serve. OpenClaw treated the same model with less ceremony and got more out of it.

What I Actually Learned

The scores don't matter as much as the process that produced them.

The tool-calling pipeline has four points of failure, and each one is invisible from the others:

Tool definitions get injected into the prompt (or don't)
The model generates a tool call in its native format (or hallucinates one)
The inference server parses the tool call from the response (or silently drops it)
The framework executes the tool and feeds the result back (or doesn't)

Each framework handles these differently. When something goes wrong, you're debugging a four-layer stack where any layer can fail silently.

Silent failures are the default. --chat-template chatml doesn't warn you that it's ignoring tools. Hermes Agent doesn't warn you that 17 toolsets are consuming 60% of your context window. OpenClaw's trajectory logging reports tools=0 even when tools are working. The assumption across the stack is that you know what you're doing, and the evidence suggests that nobody does on the first try.

Context arithmetic is unforgiving at 14B. The model's actual context is 40,960 tokens. A 26K system prompt leaves about 15K for conversation, tool calls, and tool results. A single list_workouts response is 2,600 to 16,000 characters. Two complex tool calls in a conversation and you're brushing the ceiling. Cloud models with 128K–200K context windows don't have this problem. Local 14B models live on a knife's edge.

KV cache quantization is free performance. Adding --cache-type-k q8_0 --cache-type-v q8_0 to llama-server saved roughly 5 GB of VRAM with no noticeable quality loss. That's VRAM that can go to context length instead. If you're running local inference, do this.

What's Next

The original bakeoff plan called for a 2×2 matrix on Task 5 — both frameworks running both Hermes-4 and Qwen 3.6. I'm shelving that for now. Today's session was intensive enough.

But Qwen is the model I want to test. Qwen 3.6 is my daily driver on this homelab — 35B parameters with only 3B active (MoE), 206 tok/s, fits in VRAM with room. The research that preceded this bakeoff flagged Qwen's TAG_WITH_TAGGED tool-call format as unreliable in llama.cpp. If the --jinja fix works as well for Qwen as it did for Hermes-4, that could change the calculus for daily use.

There's also Gemma 4 12B sitting in the download queue — a dense 12B with 256K context. If a dense model with a larger context window performs better than a 14B with a 40K window on these same tasks, the model selection advice changes completely.

Those tests will happen. Just not tonight.

By the Numbers

2 frameworks tested, same model, same hardware
5 tasks, 100 points each
12 total MCP tool calls across both agents (6 each)
91.0 vs 75.8 — final scores (Clawbot over Hermesbot)
4/5 tasks won by Clawbot; 1/5 by Hermesbot
51 seconds — Clawbot's total time for all 5 tasks
26,477 characters — OpenClaw's system prompt size
40,960 tokens — actual context window (model-capped from configured 65,536)
2 rounds each to get working — config debugging took longer than the actual test
1 flag — --jinja — that made the entire OpenClaw pipeline work
17 → 1 — Discord toolsets trimmed to fix Hermesbot's context overflow
0 things that worked on the first try

DEV Community