What Nobody Tells You About Running Hermes Agent Locally (M-Series Mac Edition)
I spent a day building a real project with Hermes Agent on my M5 MacBook Air with 16GB RAM and zero API budget. This is the honest account of what broke, what worked, and what I wish someone had told me before I started.
If you're on Apple Silicon and want to run Hermes Agent locally without paying for API credits, this post is for you.
What Hermes Agent Actually Is
Before I get into the setup pain, a quick framing for people who haven't used it yet.
Hermes Agent is an open-source autonomous agent built by NousResearch, the team behind the Hermes family of fine-tuned models. It's not a chatbot wrapper. It's a full agentic loop: it receives a goal, breaks it into steps, selects from 40+ built-in tools (browser, terminal, file system, code execution, cron jobs, messaging platforms), executes those steps, and iterates until the task is done.
The part that makes it genuinely different from most agent frameworks is episodic memory. After each task, Hermes writes a structured record of what worked and what didn't. On future tasks, it retrieves those records and adjusts its approach. It actually learns from its own history.
It's MIT licensed, runs on your own machine, and supports OpenAI, Anthropic, Google, and local models via Ollama.
Step 1: Installation (The Easy Part)
Installation is genuinely one command:
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
It installs nearly everything automatically (Node.js, core dependencies, the works) and takes about 3 minutes. After that:
source ~/.zshrc
hermes
You'll see the Hermes ASCII banner and a list of available tools and skills. This part actually works exactly as advertised.
First lesson: Run hermes postinstall after the main install. The base installer skips Playwright (the browser automation library). If you skip this step, every browser-related task will fail silently and you'll waste an hour debugging.
Step 2: The API Provider Trap
Here's where I hit the first wall.
Hermes supports a huge list of providers — OpenAI, Anthropic, Google Gemini, Ollama, and about 30 others. The interactive setup is clean and fast. But the provider you choose matters enormously for agentic tasks.
What I tried first: Gemini free tier
Google's Gemini API has a free tier. Sounds perfect. The problem is the rate limits:
- gemini-2.5-flash: 5 requests per minute on free tier
- gemini-flash-latest: slightly better, still very low
For a simple chatbot, 5 requests/minute is fine. For an agentic task where Hermes might make 15-20 API calls to complete a single multi-step workflow (browse a page → take a screenshot → analyze it → decide next step → browse again), you'll exhaust the quota within the first few tool calls.
The error looks like this:
HTTP 429: Quota exceeded for metric:
generativelanguage.googleapis.com/generate_content_free_tier_requests
limit: 5, model: gemini-2.5-flash
And then Hermes retries, hits the limit again, and eventually gives up. You end up with a half-completed task and no useful output.
The fix: don't use cloud APIs for agentic tasks on a free tier. The request volume is just too high.
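To make the mismatch concrete, here is a back-of-the-envelope sketch. It assumes the quota resets in fixed one-minute windows and ignores retries, both of which are simplifications, and `min_windows` is a hypothetical helper name used only for illustration.

```shell
# Minimum number of one-minute quota windows a task needs.
# Usage: min_windows <api_calls> <requests_per_minute>
min_windows() {
  calls=$1
  limit=$2
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (calls + limit - 1) / limit ))
}

echo "15 calls at 5 req/min: $(min_windows 15 5) windows"
echo "20 calls at 5 req/min: $(min_windows 20 5) windows"
```

A 15-20 call workflow needs three to four quota windows even in the best case, so the agent spends most of its time throttled before any retry is counted.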
Step 3: Going Local with Ollama
This is where Apple Silicon earns its reputation.
Ollama runs LLMs locally using Apple's Metal framework — your GPU and CPU share the same unified memory pool, which means models load fast and run at genuinely usable speeds.
Install Ollama:
brew install ollama
ollama serve # keep this running in a separate terminal tab
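Before pointing Hermes at Ollama, it can help to confirm the server is actually answering. The following is a minimal sketch that queries Ollama's version endpoint on the default port; `ollama_up` is a hypothetical helper name, not something Hermes or Ollama ships.

```shell
# Succeeds (exit 0) if an Ollama server answers at the given base URL.
# Usage: ollama_up [base_url]
ollama_up() {
  base_url=${1:-http://localhost:11434}
  # -f: fail on HTTP errors, -sS: quiet but report errors, --max-time: don't hang
  curl -fsS --max-time 3 "$base_url/api/version" >/dev/null 2>&1
}

if ollama_up; then
  echo "Ollama is reachable on port 11434"
else
  echo "Ollama is not reachable; is 'ollama serve' still running?"
fi
```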
Now the model choice matters. On a 16GB M-series machine:
| Model | Size | Speed | Context | Verdict |
|---|---|---|---|---|
| qwen3:8b | 5.2GB | ~50 tok/s | 40K | Good for most tasks |
| gemma3:12b | ~8GB | ~30 tok/s | 128K | Smarter, but slower |
| llama3.2:3b | 2GB | ~90 tok/s | 128K | Fast but less capable |
| anything 30B+ | >16GB | Unusable | — | Skip entirely |
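The sizing logic behind the table can be sketched as a simple headroom check. The sizes are the approximate figures from the table converted to MB, and the 6GB reserve for macOS, other apps, and the context cache is purely an illustrative assumption; `fits_in_ram` and `report` are hypothetical helper names.

```shell
# Rough check: does a model leave enough headroom in unified memory?
# Usage: fits_in_ram <model_mb> <total_ram_mb> <reserve_mb>
fits_in_ram() {
  model_mb=$1
  total_mb=$2
  reserve_mb=$3
  [ $(( model_mb + reserve_mb )) -le "$total_mb" ]
}

TOTAL_MB=16384    # 16GB machine
RESERVE_MB=6144   # assumption: ~6GB kept free; tune for your workload

# Usage: report <model_name> <model_mb>
report() {
  if fits_in_ram "$2" "$TOTAL_MB" "$RESERVE_MB"; then
    echo "$1: fits"
  else
    echo "$1: too large"
  fi
}

report qwen3:8b 5200
report gemma3:12b 8000
report llama3.2:3b 2000
```

With these numbers all three listed models fit, while anything larger than the machine's 16GB fails the check regardless of the reserve.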
I went with qwen3:8b:
ollama pull qwen3:8b
Then switch Hermes to use it:
hermes config set provider ollama
hermes config set base_url http://localhost:11434/v1
hermes config set model qwen3:8b
hermes config set model.context_length 65536
hermes config set model.ollama_num_ctx 65536
Critical: Those last two lines are not optional. Hermes requires a context window of at least 64,000 tokens, and qwen3:8b defaults to 40,960 (the 40K in the table above). Without the override, Hermes will refuse to initialize every single time with this error:
Model qwen3:8b has a context window of 40,960 tokens, which is below
the minimum 64,000 required by Hermes Agent.
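The numbers in that error are worth checking against the override. This small sketch uses the 64,000-token minimum quoted in the message; `ctx_ok` is a hypothetical helper and does not reflect how Hermes performs the check internally.

```shell
# Minimum context size, taken from the error message above.
REQUIRED_CTX=64000

# Usage: ctx_ok <context_tokens>
ctx_ok() {
  [ "$1" -ge "$REQUIRED_CTX" ]
}

for ctx in 40960 65536; do
  if ctx_ok "$ctx"; then
    echo "$ctx tokens: meets the $REQUIRED_CTX minimum"
  else
    echo "$ctx tokens: short by $(( REQUIRED_CTX - ctx ))"
  fi
done
```

The default 40,960 falls 23,040 tokens short, while 65,536 clears the bar with a little room to spare.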
Step 4: The Honest Performance Reality
I'm not going to pretend qwen3:8b on an M5 base model is fast for agentic tasks.
A simple factual question: ~15-20 seconds.
A multi-step agentic task with 5-6 tool calls: 8-12 minutes.
For a demo or a prototype, that's acceptable. For something you'd run continuously in production, you'd want either a paid API or a machine with more RAM to run a larger model.
The tradeoff is clear: speed vs. cost vs. privacy. Local Ollama gives you unlimited requests, zero cost, and complete data privacy. You pay for it in latency.
For my use case — an agent that runs once daily to process regulatory documents — the latency is completely fine. The agent runs overnight and the output is ready in the morning.
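To put the multi-step timing in per-call terms, here is a quick worked calculation. It simply averages total wall-clock time over the number of tool calls, which is a simplification since it lumps model thinking time and tool execution together; `secs_per_call` is a hypothetical helper name.

```shell
# Average seconds per tool call for a task.
# Usage: secs_per_call <total_minutes> <tool_calls>
secs_per_call() {
  echo $(( $1 * 60 / $2 ))
}

echo "best case:  $(secs_per_call 8 6)s per call"    # 8 minutes over 6 calls
echo "worst case: $(secs_per_call 12 5)s per call"   # 12 minutes over 5 calls
```

That works out to roughly 80 to 144 seconds per tool call, which is fine for a batch job and painful for anything interactive.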
Step 5: What Hermes Is Actually Good At
Once everything is running, here's what genuinely impressed me:
Terminal tool chaining. Hermes can execute a sequence of shell commands, read the output of each one, and use that output to decide what to do next. This is the core of what makes it an agent rather than a script runner.
Staying on task. With a well-written prompt, Hermes doesn't get distracted. It completes the steps you gave it without asking for clarification on every detail.
The skills system. Hermes ships with 90+ pre-built skills — integrations with GitHub, Obsidian, Spotify, Google Workspace, and dozens more. These aren't just API wrappers; they're prompting strategies that tell Hermes how to use each tool effectively.
What it struggles with on smaller models:
- Complex multi-step reasoning where each step builds on the last
- Tasks that require reading a long document and making nuanced judgments
- Anything where the prompt is ambiguous
The last point is on the user, not Hermes. Clear, specific prompts produce dramatically better results than vague ones.
The Setup That Actually Works
Here's the complete working configuration for an M-series Mac with a local model and zero API cost:
# 1. Install
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
source ~/.zshrc
hermes postinstall # don't skip this
# 2. Install Ollama and pull a model
brew install ollama
ollama serve & # or run in a separate tab
ollama pull qwen3:8b
# 3. Configure Hermes
hermes config set provider ollama
hermes config set base_url http://localhost:11434/v1
hermes config set model qwen3:8b
hermes config set model.context_length 65536
hermes config set model.ollama_num_ctx 65536
# 4. Start
hermes
Test it works:
Search the web for the latest news about open source AI agents
If Hermes uses the browser tool and returns actual results, you're set.
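If the test prompt fails instead, a quick preflight check can rule out the most basic cause: a tool missing from your PATH. This is a minimal sketch; `missing_tools` is a hypothetical helper name, and it only checks that the commands exist, not that they are configured correctly.

```shell
# Prints the name of every listed command that is not on PATH.
# Usage: missing_tools <cmd>...
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools curl ollama hermes)
if [ -z "$missing" ]; then
  echo "All tools found on PATH"
else
  echo "Missing from PATH:"
  echo "$missing"
fi
```

If `hermes` shows up as missing right after installing, re-running `source ~/.zshrc` or opening a new terminal tab is the first thing to try.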
My Honest Take
Hermes Agent is the most capable open-source agent I've used. The tool ecosystem is genuinely broad, the install experience is smooth, and the episodic memory system is an idea that most commercial agent frameworks haven't caught up to yet.
The documentation gap is real — the official docs cover the happy path well, but edge cases like the Ollama context window requirement or the Playwright install step are nowhere to be found. You find them by hitting errors.
For developers who want to build real agentic workflows without API costs or data privacy concerns, Hermes on Apple Silicon is a genuinely viable stack. The latency is the price you pay. On most tasks, it's worth it.
Built and tested on M5 MacBook Air 16GB, macOS Sequoia, Hermes Agent v0.14.0, Ollama 0.6.x, qwen3:8b.