Bharath Nelapatla

Posted on May 25

What Happens When You Ask an Open-Source Agent to Do Real SRE Work on Windows? A Trajectory Worth Reading

#hermesagentchallenge #devchallenge #agents #ai

Hermes Agent Challenge Submission: Write About Hermes Agent

*This is a submission for the Hermes Agent Challenge*

I spent yesterday hand-building a Docker-sandboxed Linux training game with a custom AI judge. The whole stack: ~250 lines of Python wrapping Ollama, a YAML scenario format, a shell command runner, an explicit grading rubric. So when I saw that Nous Research's Hermes Agent ships with a Docker terminal backend, persistent memory, skill creation, and 70+ built-in tools, I had to know: how much of what I built yesterday does Hermes already cover?

I installed Hermes Agent v0.14.0 on Windows tonight and gave it the same scenario my training game gives players. This post is what I learned, including a real bug that the agent diagnosed in real time.

The setup

Hermes Agent v0.14.0 (Windows native, early beta — installer is one PowerShell one-liner)
Local Ollama running gemma4:26b (Gemma 4 26B MoE, already on the host from a previous build)
No cloud API keys. Provider: custom, base URL: http://localhost:11434/v1, mode: chat_completions
Toolsets enabled: terminal,file
Scenario: a recreated "/var disk filling up" investigation. A real 80MB log file at .\var\log\app.log plus two tiny decoy files. The exact same scenario my training game uses.
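
For anyone who wants to rebuild the scenario without the repo's setup script, here is a minimal Python sketch. The 80 MB size and the var/log/app.log path come from the description above; the two decoy file names, their sizes, the filler log line, and the function name are placeholders and not necessarily what the repo uses.

```python
from pathlib import Path

MB = 1024 * 1024

def build_scenario(root: Path, big_mb: int = 80) -> Path:
    """Create var/log/app.log (big_mb MB) plus two tiny decoys under root."""
    log_dir = root / "var" / "log"
    cache_dir = root / "var" / "cache"      # hypothetical decoy location
    log_dir.mkdir(parents=True, exist_ok=True)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Filler content only; the real log contents are not specified in the post.
    line = b"ERROR disk usage climbing\n"
    chunk = (line * (MB // len(line) + 1))[:MB]   # exactly 1 MB of filler

    big = log_dir / "app.log"
    with big.open("wb") as fh:
        for _ in range(big_mb):
            fh.write(chunk)

    # Decoys: names and sizes are assumptions for illustration.
    (log_dir / "old.log").write_bytes(b"rotated\n")
    (cache_dir / "tmp.dat").write_bytes(b"x" * 1024)
    return big
```

Calling build_scenario(Path(".")) from the scenario folder produces the layout the task prompt refers to.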

The task

An alert just paged: the var directory in the current folder is filling up fast (96% full). Find which file is consuming the most space under .\var. Use terminal commands to investigate. When you find the largest file, report its full path and size in MB.

This is a one-shot SRE scenario. A junior engineer should solve it with 2-3 commands. I wanted to see how Hermes' agentic loop — planning, tool use, observation, replanning — handled it.
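
For reference, the answer the task asks for fits in a few lines. This is a sketch of the expected result in Python, not something Hermes ran, and the function name is my own.

```python
from pathlib import Path
from typing import Tuple

def largest_file(root: Path) -> Tuple[Path, float]:
    """Return (absolute path, size in MB) of the largest regular file under root."""
    files = [p for p in root.rglob("*") if p.is_file()]
    if not files:
        raise FileNotFoundError(f"no files under {root}")
    biggest = max(files, key=lambda p: p.stat().st_size)
    return biggest.resolve(), biggest.stat().st_size / (1024 * 1024)
```

A shell user would reach the same answer with a recursive listing and a sort by size, which is the "2-3 commands" baseline.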

The trajectory

Hermes ran for 7m 11s and produced 8 tool calls before I interrupted. The full JSONL session export is in the repo. Here's what happened, with the agent's own reasoning traces.

Call 1 — Plan, then act.
Reasoning (Gemma 4 26B):

"The user is reporting an issue with a var directory filling up. I need to: navigate to var, identify the largest file recursively, report path and size in MB. I'll start by listing the contents."

Tool call: terminal { "command": "ls -R ./var" }
Result: exit 126

{"output": "/bin/bash: line 2: cd: C:\\Users\\nbhar\\Downloads\\Gamma\\hermes-build\\scenario: No such file or directory"}

The plan was correct. The failure happened before the command even ran: the terminal tool tried to cd into the working directory using a Windows path inside bash, and that cd failed.

Call 2 — Diagnose.
Reasoning:

"The error says cd: C:\Users\nbhar\Downloads\Gamma\hermes-build\scenario: No such file or directory. But the system prompt says my current working directory is that path."

Tool call: terminal { "command": "pwd" }
Result: exit 126 — same error.

The agent correctly noticed the contradiction between what the system prompt claimed about the working directory and what the shell reported.

Calls 3-7 — Try every reasonable variation.
The agent cycled through ls -la, ls -la ., pwd && ls -la, plain pwd again. Every command returned the same exit-126 error before executing. The trajectory shows Hermes injecting its own [Tool loop warning] messages mid-run, telling the agent to change strategy.
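
The loop warning is easy to picture as a check over recent tool results. The sketch below is my guess at the general idea, not Hermes' actual implementation: the class name, the threshold of three, and the warning wording after the [Tool loop warning] prefix are all assumptions.

```python
from collections import deque
from typing import Optional

class LoopDetector:
    """Flag a run of tool calls that all fail with the same exit code and output."""

    def __init__(self, threshold: int = 3):   # threshold is an assumed value
        self.threshold = threshold
        self.recent = deque(maxlen=threshold)

    def observe(self, exit_code: int, output: str) -> Optional[str]:
        """Record one tool result; return a warning once the last N failures match."""
        if exit_code == 0:
            self.recent.clear()               # any success breaks the streak
            return None
        self.recent.append((exit_code, output.strip()))
        if len(self.recent) == self.threshold and len(set(self.recent)) == 1:
            return "[Tool loop warning] same error repeated; change strategy"
        return None
```

Note that the check keys on the result, not the command. In this trajectory the commands differed (ls -la, pwd, and so on) while the error was identical, which is exactly the pattern a result-based check catches.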

Call 8 — The agent self-diagnoses the bug.
Reasoning:

"The error message indicates that the current working directory provided in the prompt does not exist or is not accessible via the terminal tool's shell environment (which seems to be a bash-like environment on a Windows host, but bash doesn't understand C:... natively without a mount point like /c/...). On Windows-based Git Bash or similar, C:\ is usually /c/. The error command not found or No such file or directory for a path with backslashes often happens because backslashes are escape characters in bash. I should try using forward slashes or the /c/ convention if it's Git Bash."

The agent identified the root cause: a Windows path handed to bash. Hermes' terminal tool on Windows runs commands through bash (git-bash/MSYS), but the working-directory injection passes a Windows backslash-style path to a cd step inside that bash shell. Unquoted, bash reads C:\Users\... as C:Users... because the backslash is its escape character. The error text shows the backslashes intact, so the path may have been quoted, but either way the shell could not resolve it. The cd prologue therefore fails before any user command runs, and every tool call hits the same wall regardless of which command the agent tries.
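
The escaping rule can be checked without a bash install. Python's shlex module follows POSIX shell quoting rules closely enough to illustrate it. The path below is a shortened stand-in, and this is an illustration of the quoting rule rather than a reproduction of what Hermes executes.

```python
import shlex

# Unquoted: each backslash escapes the next character and is dropped.
unquoted = shlex.split(r"cd C:\Users\demo\scenario")
print(unquoted)   # ['cd', 'C:Usersdemoscenario']

# Single-quoted: backslashes survive, so cd receives the literal Windows path.
quoted = shlex.split(r"cd 'C:\Users\demo\scenario'")
print(quoted)     # ['cd', 'C:\\Users\\demo\\scenario']
```

The sketch only shows why an unquoted backslash path cannot survive shell parsing; which form Hermes emits is not something this snippet can tell you.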

The model knew exactly what was wrong. It couldn't escape the trap because every command — including diagnostic ones like pwd — went through the same broken cd prologue.

What this tells me about open-source agents on Windows

Three takeaways:

1. Gemma 4 26B did real agentic work. The reasoning traces aren't fluff — they show plan → execute → observe → diagnose → replan. The model identified the exact root cause of its own environment problem in plain English. That's the agentic capability the challenge asked about, and it's real even when the framework around the model fails.

2. Hermes Agent v0.14.0's Windows native build is genuinely beta. Nous flags it as such, and they're right. The missing path translation between the Windows working directory and the bash shell backend is a real bug. The framework's loop detection (Hermes injected [Tool loop warning] messages mid-trajectory) is excellent UX, but it can't save the agent when every tool call hits the same environmental wall.

3. Hand-rolling isn't dead. Yesterday I built a Linux training game with a custom Docker sandbox that uses subprocess.run(["docker", "run", ..., "sh", "-c", command]). It worked first time on Windows because the argument list goes straight to docker with no host shell in between; the only shell involved is sh inside the Linux container. Hermes is more capable in every other dimension: persistent memory, 16+ messaging platforms, skill creation, MCP integration. But for one specific job (run a command in a controlled environment and capture the output), a 30-line subprocess wrapper outperformed a 70+ tool agentic framework on this host.

That's not an indictment of Hermes. It's an honest observation about where each pattern fits. If you're on Linux, macOS, or WSL2, Hermes' terminal tool should not hit this problem, since the working directory is already a POSIX path. If you're a field engineer on a Windows laptop trying to evaluate an autonomous agent, you'll hit this within minutes, and the friction point is in the framework, not the model.
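
The wrapper pattern from the third takeaway looks roughly like this. It is a sketch, not the actual code from the training game: the image name, the --rm and --network none flags, and the timeout are illustrative assumptions standing in for the arguments elided above.

```python
import subprocess
from typing import List

def docker_argv(command: str, image: str = "alpine:3.19") -> List[str]:
    """Build the argv for one sandboxed command (image and flags are assumptions)."""
    return ["docker", "run", "--rm", "--network", "none",
            image, "sh", "-c", command]

def run_in_sandbox(command: str, timeout: int = 30) -> subprocess.CompletedProcess:
    """Run command in a throwaway container and capture stdout/stderr as text."""
    # Passing a list (no shell=True) means no host shell parses the command;
    # the only shell involved is sh inside the Linux container.
    return subprocess.run(docker_argv(command), capture_output=True,
                          text=True, timeout=timeout)
```

With Docker running, run_in_sandbox("du -a /var | sort -n | tail -1").stdout would hold the command's output. The design choice is that the host only ever sees an argument list, so Windows path and quoting rules never touch the command string.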

What I'd do differently next time

Run Hermes inside WSL2 rather than the Windows native beta. The Linux/macOS installer path is more battle-tested. The exact same one-liner installer auto-detects WSL.
Test on a non-Windows host before forming opinions about agent capability. Tonight, the agent never got to show what it can do on this task because the environment trap caught it first.
Submit the bug upstream. The reasoning traces in the captured session point clearly at the fix — Hermes should translate the configured working directory to MSYS-style (/c/Users/...) when invoking bash on Windows. Filing a GitHub issue against the Nous Research repo is my next move after this post.
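
The translation itself is small. Here is a minimal sketch of the idea, assuming a Git Bash/MSYS-style shell where drive C: is mounted at /c. The function name is hypothetical and this is not Hermes code.

```python
from pathlib import PureWindowsPath

def to_msys_path(win_path: str) -> str:
    """Convert an absolute drive-letter Windows path to MSYS form (/c/Users/...)."""
    p = PureWindowsPath(win_path)
    # Only rewrite absolute paths with a single-letter drive such as "C:".
    # Relative and UNC paths are returned with forward slashes, otherwise unchanged.
    if len(p.drive) != 2 or not p.root:
        return p.as_posix()
    drive_letter = p.drive[0].lower()
    rest = p.as_posix()[2:]          # strip the leading "C:"
    return f"/{drive_letter}{rest}"
```

For example, to_msys_path(r"C:\Users\nbhar\Downloads") gives /c/Users/nbhar/Downloads. Under WSL the default mount point is /mnt/c rather than /c, so a real fix would need to know which bash it is talking to.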

The full session

Captured trajectory (18 messages, 8 tool calls, 7m 11s): docs/hermes-session.jsonl in the repo.
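
A quick way to check those counts against the export is to tally messages and tool calls. The export schema isn't described in this post, so the tool_calls field name below is an assumption based on the OpenAI-style chat format and may need adjusting for the actual file.

```python
import json
from typing import Iterable, Tuple

def tally(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (message_count, tool_call_count) for JSONL chat records."""
    messages = tool_calls = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue                          # skip blank lines
        record = json.loads(raw)
        messages += 1
        # Assumed field: assistant messages carry a "tool_calls" list.
        tool_calls += len(record.get("tool_calls") or [])
    return messages, tool_calls
```

Opening docs/hermes-session.jsonl with encoding="utf-8" and passing the file object to tally() would print the pair to compare against the figures above.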

The setup script and the .\var\log\app.log scenario recreation are also in the repo: anyone with Hermes Agent installed and an Ollama model pulled can reproduce this trajectory in under five minutes.

Repo: https://github.com/Bharathtrainer/hermes-build

What I'm taking forward

Local-first AI agents are real and powerful, and Hermes Agent on Linux is almost certainly excellent. Tonight I learned that "runs on your own infrastructure" means specifically Linux-flavored infrastructure right now, and that the friction point lives in the gap between an agent's reasoning capability (which Gemma 4 26B has) and its environment integration (which, in my experience, is where frameworks collect most of their real-world bugs).

I'll be running Hermes inside WSL2 from here on. And tomorrow I'll be opening an issue against the Nous Research repo with this trajectory attached — the agent's own diagnosis points right at the fix.

All inference local. No cloud APIs. Model: Gemma 4 26B MoE via Ollama at http://localhost:11434/v1. Hermes Agent v0.14.0. The post you just read was written by a human after watching an agent get stuck on a Windows path-encoding problem and explain exactly why.

Top comments (1)

Harjot Singh • May 31

The most valuable instinct in this whole piece is the one in your title, "a trajectory worth reading." Evaluating an agent on whether it got the right final answer misses the point; the trajectory (what it tried, where it backtracked, which tool it reached for and why) is where you actually learn if it's reasoning or just got lucky. An agent can land the correct fix through a sequence of nonsense steps, and that's the one that'll fail catastrophically next time on a slightly different problem. Your hand-built version (Docker sandbox, shell runner, explicit grading rubric) is exactly the right scaffolding, and the explicit rubric is the part I'd never give up even when a framework offers a built-in judge, because a rubric you wrote is auditable and a black-box judge is just another model you're trusting. SRE on Windows is a great stress test too, more stateful, more ways to silently leave the box in a bad state. That trajectory-and-rubric-over-final-answer discipline is core to how I evaluate agents in Moonshift. Of what you built by hand, what did Hermes genuinely replace versus what you still trusted your own rubric for?