zac

Posted on • Originally published at remoteopenclaw.com

Best Qwen Models for Hermes Agent — Alibaba's Models Ranked


Qwen3 Max is the best Qwen model for Hermes Agent when using Alibaba's DashScope API, delivering strong reasoning and tool calling at $0.78 per million input tokens and $3.90 per million output tokens. For local deployments, Qwen3 8B via Ollama runs Hermes Agent at zero marginal cost on a machine with 8GB RAM. The entire Qwen3 lineup is open-source under Apache 2.0, giving teams the flexibility to run models locally, self-host on their own infrastructure, or access them through Alibaba's cloud — a breadth of deployment options that proprietary models cannot match.

Key Takeaways

  • Qwen3 Max ($0.78/$3.90 per million tokens) is the best cloud Qwen model for Hermes Agent — strong reasoning and tool calling.
  • Qwen3 8B runs locally via Ollama on 8GB RAM with zero API cost — ideal for privacy-focused or offline agents.
  • Qwen3 235B-A22B (MoE) is the flagship open model: 235B total params, 22B active, available through DashScope or self-hosted.
  • All Qwen3 models are Apache 2.0 licensed — full freedom to self-host, fine-tune, and modify.
  • Two connection paths: DashScope API (cloud, OpenAI-compatible) or Ollama (local, no API key needed).

In this guide

  1. Qwen Models Ranked for Hermes Agent
  2. DashScope API Setup for Hermes
  3. Local Qwen via Ollama Setup
  4. Open-Source Advantage for Self-Hosted Agents
  5. Qwen vs Other Hermes Providers
  6. Limitations and Tradeoffs
  7. FAQ

Qwen Models Ranked for Hermes Agent

Alibaba's Qwen3 series offers models from 0.6B to 235B parameters, all released under Apache 2.0. For Hermes Agent, the relevant models span three tiers: flagship cloud API models for maximum quality, mid-size local models for balanced performance, and lightweight models for resource-constrained deployments. Every model below exceeds Hermes Agent's 64K minimum context requirement.

| Model | Parameters | Context | Cost (DashScope) | Ollama | Best For |
|---|---|---|---|---|---|
| Qwen3 Max | Undisclosed | 128K | $0.78/$3.90 | No | Flagship cloud reasoning, complex tasks |
| Qwen3 235B-A22B | 235B (22B active) | 128K | Via DashScope | Needs 48GB+ VRAM | Self-hosted flagship, MoE efficiency |
| Qwen3 32B | 32B (dense) | 128K | Via DashScope | Needs 20GB+ RAM | Strong local reasoning, coding |
| Qwen3 8B | 8B (dense) | 128K | Via DashScope | Needs 8GB RAM | Best local model for most hardware |
| Qwen3 30B-A3B | 30B (3B active) | 128K | Via DashScope | Needs 4GB RAM | Ultra-efficient MoE local model |
| Qwen3 4B | 4B (dense) | 128K | Via DashScope | Needs 4GB RAM | Minimal hardware, basic tasks |

Qwen3 Max is the recommended choice for cloud deployments — it consistently performs well on agentic benchmarks and its pricing undercuts Claude Sonnet by roughly 4x on input tokens. For local deployments, Qwen3 8B is the sweet spot: it fits comfortably in 8GB RAM, supports tool calling in both thinking and non-thinking modes, and produces results that meaningfully exceed what smaller 4B models can achieve. The 30B-A3B MoE variant is a strong alternative that activates only 3B parameters per token, running nearly as fast as the 4B dense model while accessing 30B total parameters.
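The pricing gap is easiest to see as per-task dollars rather than per-million rates. The sketch below uses the prices quoted above; the token counts are illustrative assumptions, not measurements from Hermes Agent:

```python
# Per-task cost comparison using the $/million-token prices quoted above.
# The 40K-in / 5K-out task size is an illustrative assumption.

PRICES = {
    "qwen3-max": (0.78, 3.90),          # (input, output) in $/M tokens
    "claude-sonnet-4.6": (3.00, 15.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one agent task for the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A hypothetical agent task: 40K tokens in (prompt + tool results), 5K out.
print(f"Qwen3 Max: ${task_cost('qwen3-max', 40_000, 5_000):.4f}")          # $0.0507
print(f"Claude:    ${task_cost('claude-sonnet-4.6', 40_000, 5_000):.4f}")  # $0.1950
```

At this task size the input-token gap dominates, which is why the headline "roughly 4x cheaper on input" translates to a similar gap in total spend.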


DashScope API Setup for Hermes

DashScope is Alibaba Cloud's model serving platform and provides an OpenAI-compatible API endpoint, which means Hermes Agent can connect to it using the custom provider configuration. No special SDK or plugin is required.

Step 1: Get Your DashScope API Key

Sign up at Alibaba Cloud Model Studio. Navigate to the API Keys section and generate a key. As of April 2026, new accounts receive free credits for Qwen model usage.

Step 2: Configure config.yaml

Since DashScope is OpenAI-compatible, configure it as a custom provider in Hermes Agent with the DashScope base URL:

# ~/.hermes/config.yaml
model:
  default: qwen3-max
  provider: custom
  base_url: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
  api_key_env: DASHSCOPE_API_KEY

Step 3: Set the API Key

hermes config set DASHSCOPE_API_KEY your-api-key-here

The base_url must include /v1 because the OpenAI Python SDK appends /chat/completions directly. For international users, use the dashscope-intl endpoint. For users in mainland China, replace with https://dashscope.aliyuncs.com/compatible-mode/v1.
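To see why the /v1 suffix matters, here is a minimal sketch of how an OpenAI-compatible client joins the base URL with the chat-completions route. This is a simplification for illustration, not the SDK's actual code:

```python
def chat_completions_url(base_url: str) -> str:
    """Mimic how an OpenAI-compatible client appends the route path
    to base_url. Simplified illustration, not the SDK's real logic."""
    return base_url.rstrip("/") + "/chat/completions"

# With /v1, the request reaches DashScope's compatible-mode route:
print(chat_completions_url(
    "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"))
# https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions

# Without /v1, the path is wrong and requests fail:
print(chat_completions_url(
    "https://dashscope-intl.aliyuncs.com/compatible-mode"))
# https://dashscope-intl.aliyuncs.com/compatible-mode/chat/completions
```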

For complete Hermes Agent installation and general setup, see our Hermes Agent setup guide.


Local Qwen via Ollama Setup

Hermes Agent auto-detects models installed through Ollama and includes per-model tool call parsers optimized for local models. Running Qwen locally means zero API cost, complete data privacy, and no rate limits — at the tradeoff of requiring adequate hardware.

Step 1: Install Ollama and Pull a Qwen Model

# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen3 8B (recommended for most machines)
ollama pull qwen3:8b

# Or pull the efficient MoE variant (runs fast on limited hardware)
ollama pull qwen3:30b-a3b

Step 2: Configure Hermes Agent

Hermes Agent detects Ollama automatically. Run hermes model and select Ollama from the provider list — it will show all locally installed models. Or configure manually:

# ~/.hermes/config.yaml
model:
  default: qwen3:8b
  provider: ollama

No API key is needed for Ollama. The default Ollama endpoint is http://localhost:11434. If you run Ollama on a different host or port, specify it:

model:
  default: qwen3:8b
  provider: custom
  base_url: http://your-server:11434/v1

Hardware Requirements

| Model | Min RAM | Recommended RAM | Speed (tokens/sec) |
|---|---|---|---|
| Qwen3 4B | 4GB | 8GB | ~30-50 on CPU |
| Qwen3 8B | 8GB | 16GB | ~20-40 on CPU |
| Qwen3 30B-A3B | 4GB | 8GB | ~25-45 on CPU (MoE) |
| Qwen3 32B | 20GB | 32GB | ~10-20 on CPU |

Apple Silicon Macs with unified memory are particularly well-suited for local Qwen models — an M2 or M3 MacBook with 16GB runs Qwen3 8B comfortably with GPU acceleration through Ollama's Metal support.
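The hardware table above can be folded into a small helper that picks an Ollama tag for a given amount of free RAM. The function name and thresholds are ours (taken from the minimum-RAM column), not part of Hermes Agent or Ollama:

```python
def pick_qwen3_tag(free_ram_gb: float) -> "str | None":
    """Map available RAM to a Qwen3 Ollama tag, using the minimum-RAM
    figures from the table above. Headroom is left to the caller."""
    if free_ram_gb >= 20:
        return "qwen3:32b"      # strongest local reasoning
    if free_ram_gb >= 8:
        return "qwen3:8b"       # recommended default for most machines
    if free_ram_gb >= 4:
        return "qwen3:30b-a3b"  # MoE: 30B total weights, ~3B active per token
    return None                 # below 4GB, none of these models fit

print(pick_qwen3_tag(16))  # qwen3:8b
print(pick_qwen3_tag(6))   # qwen3:30b-a3b
```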



Open-Source Advantage for Self-Hosted Agents

Every Qwen3 model is released under Apache 2.0, the most permissive standard open-source license. This is not a "source-available" or "research-only" license — Apache 2.0 grants full commercial use, modification, distribution, and private deployment rights with no restrictions beyond attribution.

For Hermes Agent deployments, the open-source advantage plays out in several ways:

  • Complete data sovereignty. Run Qwen locally via Ollama or on your own server via vLLM or SGLang. No data leaves your network. This is a hard requirement for legal, healthcare, financial, and government agent deployments.
  • No API dependency. Your Hermes Agent keeps working if Alibaba's API has downtime, changes pricing, or deprecates a model. The weights are yours to keep and serve indefinitely.
  • Fine-tuning for domain tasks. Unlike proprietary models, you can fine-tune Qwen3 on your domain data (legal documents, codebase patterns, internal knowledge) and serve the tuned model through Ollama. Hermes Agent connects to it identically.
  • Cost at scale. For teams running multiple Hermes agents or processing high volumes, self-hosting Qwen3 on a GPU server eliminates per-token costs entirely. The breakeven point depends on hardware costs, but for sustained workloads it typically arrives within weeks.
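The breakeven claim in the last bullet can be sketched with rough arithmetic. Every figure below is an illustrative assumption (GPU rental rate, daily token volume), not a measurement; only the API prices come from the table earlier in this guide:

```python
# Rough self-hosting breakeven sketch. The GPU rental rate and token
# volumes are assumptions; real costs vary widely by provider and hardware.

API_INPUT = 0.78     # $/M input tokens (Qwen3 Max via DashScope)
API_OUTPUT = 3.90    # $/M output tokens
GPU_PER_HOUR = 2.00  # assumed rental rate for a 48GB-class GPU server

def api_cost_per_day(in_m: float, out_m: float) -> float:
    """Daily API spend for in_m / out_m million tokens per day."""
    return in_m * API_INPUT + out_m * API_OUTPUT

gpu_per_day = GPU_PER_HOUR * 24                    # fixed: $48.00/day
api_per_day = api_cost_per_day(in_m=50, out_m=10)  # usage-based: $78.00/day

print(f"GPU server: ${gpu_per_day:.2f}/day")
print(f"DashScope:  ${api_per_day:.2f}/day")
print("self-hosting wins" if gpu_per_day < api_per_day else "API wins")
```

Under these assumed numbers, a team pushing 50M input and 10M output tokens per day breaks even immediately; lighter workloads push the crossover out or eliminate it.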

Qwen is not the only open-source option for Hermes — Llama 4 and Mistral are also available via Ollama. But Qwen3's Apache 2.0 license is less restrictive than Llama's custom license, and the model size range (0.6B to 235B) offers more granularity for matching model size to available hardware. For a broader look at Qwen models beyond Hermes, see our Qwen models overview for 2026. For Qwen configuration in OpenClaw specifically, see the Qwen models for OpenClaw guide.


Qwen vs Other Hermes Providers

Qwen3 Max competes with cloud API models from Anthropic, OpenAI, and DeepSeek, while Qwen3's local models compete with Llama 4 Maverick and Mistral. The table below compares across the dimensions that matter most for Hermes Agent users.

| Model | Input/Output Cost | Context | License | Local Option | Agent Strength |
|---|---|---|---|---|---|
| Qwen3 Max | $0.78/$3.90 | 128K | Apache 2.0 | No | Strong reasoning, bilingual |
| Qwen3 8B (Ollama) | Free (local) | 128K | Apache 2.0 | Yes (8GB) | Privacy, zero cost |
| Claude Sonnet 4.6 | $3.00/$15.00 | 200K | Proprietary | No | Best reasoning and tool calling |
| DeepSeek V4 | $0.30/$0.50 | 1M | MIT | No (too large) | Budget coding, huge context |
| Llama 4 Maverick | Free (local) | 1M | Llama License | Yes (16GB+) | Local privacy, large context |

Qwen3 Max's pricing ($0.78 input) places it between DeepSeek V4 ($0.30) and GLM-5.1 ($0.95), making it a mid-budget cloud option with strong bilingual capability. For local deployments, Qwen3 8B requires less RAM than Llama 4 Maverick (8GB vs 16GB+) while the Apache 2.0 license offers more permissive terms than Llama's custom license. For overall rankings across all providers, see our best models for Hermes Agent guide.


Limitations and Tradeoffs

Qwen models have specific constraints to consider before selecting them for Hermes Agent.

  • DashScope requires custom endpoint config. Unlike Anthropic, OpenAI, or MiniMax, Alibaba's DashScope is not a first-class Hermes provider. You must configure it as a custom OpenAI-compatible endpoint with the correct base URL. This adds a setup step that other providers skip.
  • Local model quality ceiling. Qwen3 8B is capable but cannot match the reasoning depth of cloud models like Claude Sonnet 4.6 or GPT-4.1. Complex multi-step agent tasks — multi-file code generation, nuanced research synthesis — will produce noticeably weaker results on small local models.
  • 128K context maximum. All Qwen3 models cap at 128K tokens. This exceeds Hermes Agent's 64K minimum but falls short of DeepSeek V4 (1M), GPT-4.1 (1M), and MiniMax-Text-01 (4M). For memory-heavy or document-heavy workflows, the 128K ceiling is a real constraint.
  • DashScope latency outside Asia. Alibaba Cloud's primary data centers are in Asia. Users in North America or Europe may experience higher API latency compared to US-based providers. OpenRouter availability for Qwen models can mitigate this.
  • Ollama tool call parsing. While Hermes Agent includes per-model tool call parsers for Ollama models, local Qwen tool calling can be less reliable than cloud API tool calling. Test your specific workflow before deploying to production with a local Qwen model.


FAQ

How do I connect Qwen3 to Hermes Agent via DashScope?

Configure Hermes Agent with a custom provider pointing to DashScope's OpenAI-compatible endpoint. In ~/.hermes/config.yaml, set provider: custom, base_url: https://dashscope-intl.aliyuncs.com/compatible-mode/v1, and default: qwen3-max. Set your API key with hermes config set DASHSCOPE_API_KEY your-key. The /v1 suffix in the base URL is required because the OpenAI SDK appends /chat/completions directly.

Can I run Qwen3 locally with Hermes Agent using Ollama?

Yes. Install Ollama, pull a Qwen3 model with ollama pull qwen3:8b, then set provider: ollama and default: qwen3:8b in your Hermes config.yaml. No API key is needed. Hermes Agent auto-detects Ollama models and includes optimized tool call parsers for local models. Qwen3 8B requires approximately 8GB RAM and runs on macOS, Linux, and WSL2.

Which Qwen3 model should I choose for local Hermes Agent?

Qwen3 8B is the recommended local model for most hardware — it balances reasoning quality with resource requirements (8GB RAM minimum). For machines with only 4GB RAM, use the Qwen3 30B-A3B MoE variant, which activates only 3B parameters per token and runs nearly as fast as the 4B model while accessing more total knowledge. For machines with 20GB+ RAM, Qwen3 32B offers substantially better reasoning.

Is Qwen3 Max better than Claude Sonnet for Hermes Agent?

Claude Sonnet 4.6 generally outperforms Qwen3 Max on reasoning quality and tool calling reliability in Hermes Agent. Hermes's tool call parsers are most thoroughly tested with Anthropic models. However, Qwen3 Max costs roughly 4x less on input tokens ($0.78 vs $3.00) and offers native Chinese-English bilingual quality that Claude does not match. Choose Qwen3 Max when cost or bilingual capability is the priority; choose Claude when reasoning quality matters most.

Can I fine-tune Qwen3 and use the tuned model with Hermes Agent?

Yes. Since Qwen3 is Apache 2.0 licensed, you can fine-tune any Qwen3 model on your own data, then serve it through Ollama, vLLM, or SGLang. Point Hermes Agent at the local endpoint using the custom provider configuration. This is useful for domain-specific agents — for example, fine-tuning on legal documents, medical records, or internal codebases to improve the agent's performance on specialized tasks.
