DEV Community

zac

Posted on • Originally published at remoteopenclaw.com

Best Open-Source Models for Hermes Agent — Self-Hosted Setup

The best open-source model for Hermes Agent is Llama 4 Maverick for overall quality, Qwen 3 8B for budget VPS deployments, and Mistral Small for the best balance of size and capability. Hermes Agent auto-detects models installed through Ollama and includes per-model tool call parsers that optimize function calling for each local model. Running open-source models eliminates API costs entirely — your only expense is the hardware or VPS hosting the model.

Key Takeaways

  • Hermes Agent auto-detects Ollama models and ships per-model tool call parsers for reliable local function calling.
  • Llama 4 Maverick (1M context, strong tool calling) is the best local model but needs 16+ GB RAM.
  • Qwen 3 8B runs on a VPS with 8 GB RAM and handles straightforward agent tasks at zero API cost.
  • Mistral Small fits in 16 GB RAM with 128K context and solid function calling, making it the best lightweight option.
  • Hardware requirements: 8 GB RAM minimum for 7-8B models, 16 GB for 14B models, 48+ GB for 70B models.

In this guide

  1. Open-Source Model Rankings for Hermes Agent
  2. Hardware Requirements
  3. Ollama Setup for Hermes Agent
  4. Model-by-Model Configuration
  5. How Hermes Agent Handles Local Tool Calling
  6. Limitations and Tradeoffs
  7. FAQ

Open-Source Model Rankings for Hermes Agent

Open-source models for Hermes Agent must handle two things well: general instruction following and structured tool calling. As of April 2026, these are the top open-weight models ranked by agent task performance when run through Ollama.

| Model | Parameters | RAM Required | Context Window | Tool Calling | Best For |
| --- | --- | --- | --- | --- | --- |
| Llama 4 Maverick | 400B MoE (17B active) | 16+ GB | 1M tokens | Good | Best overall local performance |
| Qwen 3 32B | 32B | 20–24 GB | 32K tokens | Good | Strong reasoning, multilingual |
| Mistral Small | 22B | 16 GB | 128K tokens | Good | Best quality-to-size ratio |
| Qwen 3 8B | 8B | 8 GB | 32K tokens | Moderate | Budget VPS, zero-cost agent |
| Llama 4 Scout | 109B MoE (17B active) | 12 GB | 512K tokens | Moderate | Large context on modest hardware |
| DeepSeek R1 Distill (14B) | 14B | 12 GB | 128K tokens | Moderate | Reasoning-heavy tasks |
| Gemma 3 12B | 12B | 10 GB | 128K tokens | Moderate | Google ecosystem compatibility |
RAM requirements above assume Q4_K_M quantization, which reduces memory usage by 50–75% compared to full-precision weights with minimal quality loss. All listed models are available through ollama pull and work with Hermes Agent immediately after download.
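The RAM figures above can be sanity-checked with a back-of-envelope formula: quantized weight size is roughly parameter count times bits per weight, plus a fixed allowance for the KV cache and runtime buffers. The ~4.85 bits-per-weight figure (an approximation of Q4_K_M's mixed 4/6-bit scheme) and the overhead constant below are ballpark assumptions, not exact numbers:

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4.85,
                    overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a quantized model.

    bits_per_weight ~4.85 approximates Q4_K_M; overhead_gb is a guess
    covering the KV cache and runtime buffers at modest context lengths.
    For MoE models, all experts must be resident, so use total params.
    """
    weight_gb = params_billion * bits_per_weight / 8  # bits -> bytes, per 1B params ~= 1 GB
    return weight_gb + overhead_gb

# An 8B dense model lands near the 6 GB minimum quoted in the table:
print(estimate_ram_gb(8))
```

The same formula explains why the 22–32B tier needs 16–24 GB: the weights alone occupy 13–19 GB at Q4_K_M before any context is loaded.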


Hardware Requirements

Running open-source models locally means your hardware becomes the bottleneck, not your API budget. The primary constraint is RAM — models must fit entirely in memory (system RAM or VRAM) to run at usable speeds.

Minimum Specs by Model Size

| Model Size | Minimum RAM | Recommended RAM | CPU Requirement | Example Models |
| --- | --- | --- | --- | --- |
| 7–8B | 6 GB | 8 GB | 4+ cores | Qwen 3 8B, Llama 3.3 8B |
| 12–14B | 10 GB | 16 GB | 4+ cores | Gemma 3 12B, DeepSeek R1 14B distill |
| 22–32B | 16 GB | 24 GB | 6+ cores | Mistral Small, Qwen 3 32B |
| 70B+ | 40 GB | 48+ GB | 8+ cores | Llama 3.3 70B, Qwen 2.5 72B |

A GPU is not required — Ollama runs on CPU. However, a GPU with sufficient VRAM significantly improves response speed. On CPU-only hardware, expect 2–10 tokens per second for 7-8B models and 0.5–3 tokens per second for 32B models. For a detailed guide on VPS hardware for Hermes Agent, see our self-hosted Hermes Agent guide.
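Those throughput ranges follow from a common rule of thumb: token generation on CPU is memory-bandwidth bound, because each generated token streams the (dense) weights through memory once. The bandwidth and model-size figures below are illustrative assumptions, not measurements:

```python
def estimate_tps(mem_bandwidth_gbps: float, model_size_gb: float) -> float:
    """Back-of-envelope decode speed for a dense model on CPU.

    Each token requires reading roughly the full quantized weights,
    so tokens/sec ~= memory bandwidth / model size. Real throughput
    is lower once compute and cache effects are counted.
    """
    return mem_bandwidth_gbps / model_size_gb

# Dual-channel DDR4 (~50 GB/s) against a ~5 GB quantized 8B model:
print(estimate_tps(50, 5))   # upper bound near the top of the 2-10 tok/s range

# The same machine against an ~18 GB quantized 32B model:
print(estimate_tps(50, 18))  # falls into the quoted 0.5-3 tok/s range
```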

VPS Cost Comparison

Self-hosting on a VPS replaces API costs with hosting costs. As of April 2026, here is what it costs to run different model sizes on popular VPS providers:

| VPS Provider | Plan for 8B Models | Monthly Cost | Plan for 32B Models | Monthly Cost |
| --- | --- | --- | --- | --- |
| Hetzner | CX32 (8 GB RAM) | ~$8 | CX52 (32 GB RAM) | ~$30 |
| Hostinger | KVM 2 (8 GB RAM) | ~$10 | KVM 8 (32 GB RAM) | ~$35 |
| DigitalOcean | 4 GB Droplet | ~$24 | 32 GB Droplet | ~$96 |

At $8–$10 per month for a VPS running Qwen 3 8B, you get unlimited agent interactions with zero per-token cost. This breaks even against DeepSeek V4 API usage at roughly $8–$10 per month of moderate use, and saves substantially compared to Claude Sonnet at $20–$80 per month.
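The break-even point is simple arithmetic: a flat VPS bill wins once your metered usage crosses it. The token volume and per-million price below are placeholder assumptions for illustration, not quoted rates:

```python
def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Cost of running the same workload through a metered API."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

VPS_COST = 8.0  # Hetzner CX32 from the table above, ~$8/month flat

# Hypothetical moderate use: 20M tokens/month at $0.50 per million tokens
api = monthly_api_cost(20_000_000, 0.50)
print(api, api > VPS_COST)
```

Below that volume the API is cheaper; above it, the self-hosted box pays for itself regardless of how much further usage grows.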


Ollama Setup for Hermes Agent

Ollama is the recommended way to run open-source models with Hermes Agent. Install Ollama, pull a model, and Hermes Agent detects it automatically.

Install Ollama

# Linux / WSL2
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew)
brew install ollama

Pull a Model

# Best overall local model
ollama pull llama4-maverick

# Best for budget VPS (8 GB RAM)
ollama pull qwen3:8b

# Best lightweight option
ollama pull mistral-small

Configure Hermes Agent

# Run the model selector
hermes model
# Select "ollama" as provider, then choose your downloaded model

Or set it directly in ~/.hermes/config.yaml:

provider: ollama
model: qwen3:8b

No API key is needed — Hermes Agent connects to the local Ollama server on the default port (11434). For full installation steps including Docker deployment, see our Hermes Agent setup guide.
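You can verify the Ollama server is up and see which models Hermes Agent will detect by hitting Ollama's `/api/tags` endpoint (this is Ollama's real listing endpoint; the helper below is our own sketch, not part of Hermes Agent):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

def parse_tags(payload: dict) -> list:
    """Extract model names from Ollama's /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list:
    """Query the local Ollama server; returns [] if it is not running."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=3) as resp:
            return parse_tags(json.load(resp))
    except (URLError, OSError):
        return []

# Shape of a /api/tags response, trimmed to the one field used here:
sample = {"models": [{"name": "qwen3:8b"}, {"name": "mistral-small"}]}
print(parse_tags(sample))
```

If `list_local_models()` returns an empty list, either the server is down (`ollama serve`) or no models have been pulled yet.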



Model-by-Model Configuration

Each open-source model has different strengths for Hermes Agent workflows. Below are specific recommendations and configuration notes for the top options.

Llama 4 Maverick — Best Overall

Meta's Llama 4 Maverick uses a Mixture-of-Experts architecture with 400B total parameters but only activates 17B per token, keeping resource usage manageable. The 1M token context window matches cloud models like Claude Sonnet and DeepSeek V4, which is critical for Hermes Agent's context-heavy requests. Tool calling quality approaches cloud-level performance.

provider: ollama
model: llama4-maverick

Requires 16+ GB RAM. Best suited for dedicated hardware or a VPS with at least 16 GB RAM.

Qwen 3 8B — Best for Budget VPS

Qwen 3 8B from Alibaba runs on a VPS with just 8 GB RAM and delivers functional tool calling for straightforward agent tasks. It supports 29 languages, making it the best budget option for multilingual Hermes Agent deployments. The 32K context window is a limitation — long agent sessions with many tool calls may truncate earlier context.

provider: ollama
model: qwen3:8b

Mistral Small — Best Lightweight

Mistral Small offers 128K context in a 22B model that fits in 16 GB RAM. Its function calling capabilities are strong relative to its size, and the larger context window means less truncation during extended agent sessions compared to Qwen 3 8B.

provider: ollama
model: mistral-small

DeepSeek R1 Distill 14B — Best for Reasoning

The distilled version of DeepSeek R1 brings chain-of-thought reasoning to local hardware. At 14B parameters, it fits in 12 GB RAM and handles multi-step reasoning better than other models in its size class. The tradeoff is slower response times due to the reasoning process.

provider: ollama
model: deepseek-r1:14b

How Hermes Agent Handles Local Tool Calling

Hermes Agent includes per-model tool call parsers that are specifically designed for local models. This is a key advantage over other agent frameworks when running with Ollama. Different models format tool calls differently — Llama uses one XML-like format, Qwen uses another, and Mistral has its own convention. Hermes Agent's parsers handle these differences automatically.

According to the official Hermes Agent documentation, the agent detects which model is loaded through Ollama and applies the correct parser. This reduces malformed tool call errors that are common when running local models through generic agent frameworks.

For models not yet in the parser registry, Hermes Agent falls back to a generic OpenAI-compatible parser. You can also define custom parsers in the configuration for any model that uses a non-standard tool calling format.
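The dispatch pattern described above can be sketched in a few lines. The per-model formats and registry keys below are invented for illustration; they are not the actual conventions of these models or of Hermes Agent's parser registry:

```python
import json
import re

def parse_xml_style(text: str):
    """Hypothetical XML-wrapped format: <tool_call>{...}</tool_call>."""
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S)
    return json.loads(m.group(1)) if m else None

def parse_json_style(text: str):
    """Bare JSON object -- stands in for a generic OpenAI-compatible parser."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# Registry keyed by model name; unknown models fall through to the generic parser.
PARSERS = {
    "llama4-maverick": parse_xml_style,
    "qwen3:8b": parse_json_style,
}

def parse_tool_call(model: str, text: str):
    """Pick the parser registered for this model, or the generic fallback."""
    return PARSERS.get(model, parse_json_style)(text)

out = parse_tool_call(
    "llama4-maverick",
    '<tool_call>{"name": "read_file", "arguments": {"path": "README.md"}}</tool_call>',
)
print(out["name"])
```

The design point is that format knowledge lives in the registry, so adding support for a new model means registering one parser function rather than touching the agent loop.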


Limitations and Tradeoffs

Self-hosted open-source models trade API costs for hardware costs and operational complexity. These are honest tradeoffs to consider before switching from a cloud API model.

  • Response speed is slower. Local models on CPU-only hardware generate 2–10 tokens per second for 7-8B models. Cloud APIs return 50–100+ tokens per second. Interactive agent sessions feel noticeably slower without a GPU.
  • Tool calling quality is lower. Even the best open-source models generate malformed tool calls more often than Claude Sonnet 4.6 or GPT-4.1. Retries consume compute time on local hardware, adding latency.
  • Context windows are smaller. Qwen 3 8B (32K) and Mistral Small (128K) have smaller context windows than cloud models (1M). Hermes Agent loads tool definitions, memory, and history into every request — smaller windows mean earlier context gets truncated in long sessions.
  • You manage the infrastructure. Updates, monitoring, disk space, and model downloads are your responsibility. A cloud API abstracts all of this away.
  • Quantization reduces quality. Running at Q4 quantization (necessary to fit larger models in less RAM) reduces output quality compared to full-precision inference. The effect is measurable on benchmarks but often acceptable for practical agent tasks.
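The context-window point can be made concrete: with a fixed token budget, the oldest history is what gets dropped. A minimal truncation sketch (token counts are illustrative, and Hermes Agent's actual accounting may differ):

```python
def fit_history(messages: list, budget: int) -> list:
    """Keep the most recent messages whose token counts fit in `budget`.

    Each message is a (text, token_count) pair; older entries drop first,
    which is why long sessions lose their earliest context.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        if used + msg[1] > budget:
            break
        kept.append(msg)
        used += msg[1]
    return list(reversed(kept))

history = [("turn 1", 2000), ("turn 2", 20000), ("turn 3", 8000)]
# A 32K window minus a reserve for the response leaves roughly 28K of prompt:
print(fit_history(history, 28_000))
```

On a 128K or 1M window the same history fits whole, which is the practical argument for Mistral Small or Llama 4 Maverick in long agent sessions.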

When NOT to self-host: if you need fast interactive responses, if you run complex multi-step agent workflows that require high tool calling reliability, or if you do not want to manage server infrastructure. For cloud API alternatives at low cost, see our DeepSeek models for Hermes Agent guide.


FAQ

Can I run Hermes Agent completely free with no API costs?

Yes. Install Ollama on your local machine or VPS, pull an open-source model like Qwen 3 8B, and configure Hermes Agent to use the Ollama provider. No API key is needed and there are no per-token charges. Your only cost is the hardware or VPS hosting — which starts at roughly $8 per month for a VPS with 8 GB RAM capable of running 7-8B models.

What is the minimum hardware to run Hermes Agent with a local model?

The minimum viable setup is 8 GB RAM with a 4-core CPU running a 7-8B parameter model like Qwen 3 8B through Ollama. This handles basic agent tasks at 2–10 tokens per second on CPU. For comfortable performance with a stronger model, 16 GB RAM with Mistral Small or Llama 4 Maverick is recommended.

Does Hermes Agent auto-detect Ollama models?

Yes. Hermes Agent queries the local Ollama server on startup and lists all downloaded models as available options. Run hermes model to see the list and select one. The agent also applies per-model tool call parsers automatically based on the detected model, optimizing function calling for each model's format.

Which open-source model has the best tool calling for Hermes Agent?

Llama 4 Maverick has the best tool calling among open-source models for Hermes Agent as of April 2026. It approaches cloud-model quality for structured function calls while supporting a 1M token context window. Mistral Small is the second-best option with reliable function calling at a smaller model size. Qwen 3 8B handles basic tool calling but generates malformed calls more frequently on complex tasks.

Can I use a GPU to speed up local models with Hermes Agent?

Yes. Ollama automatically uses GPU acceleration when a compatible NVIDIA, AMD, or Apple Silicon GPU is available. GPU inference is 5–20x faster than CPU for most models. On Apple Silicon Macs, the unified memory architecture means models use the same memory pool as the system, and Ollama leverages the Metal framework for acceleration without separate VRAM requirements.
