I run OpenClaw as my daily AI agent (Telegram, email, CRM) on a self-hosted RTX 3090. I tested 24 models (18 dense + 6 MoE) on what actually matters for agents: tool calling, multi-step workflows, bilingual FR/EN, and JSON reliability.
Setup: llama.cpp, 65K context, KV cache q4_0, flash attention.
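A minimal launch sketch of that setup, assuming llama.cpp's `llama-server` binary. The model filename, GPU layer count, and port are placeholders, and flag spellings vary between llama.cpp versions, so check `llama-server --help` on your build:

```shell
# 65K context, q4_0 KV cache (both K and V), flash attention,
# all layers offloaded to the 3090. Adjust -m to your GGUF file.
llama-server \
  -m ./qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  -c 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -fa \
  -ngl 99 \
  --port 8080
```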
## TL;DR
- Qwen 2.5 Coder 32B (Q4_K_M) wins at 9.3/10 — a model from October 2024 beats every 2025-2026 model
- It also beats Claude Sonnet 4.5 API (8.6/10) on pure agent execution
- Reasoning models (R1 Distill, QwQ, OLMo Think) make terrible agents — thinking ≠ doing
- MoE with small active params can't handle multi-step — fast but unreliable
- Magistral Small 2509 is the dark horse — best multi-step (9/10), perfect French
## Protocol — 7 categories, 25 tests
| Category | Weight | What we measure |
|---|---|---|
| Tool Calling | 25% | Single tool: exec, read, edit, web_search, browser |
| Multi-step | 25% | Chain 3+ tools: email→HARO→CRM, KB→syndication |
| Instructions | 20% | Confirmation, FR response, CRM verify |
| Bilingual FR/EN | 10% | Pure EN/FR, switch, long context stability |
| JSON | 10% | Parseable, types, nested, consistency (3x) |
| Speed | 5% | tok/s on 400-word generation |
| Prefix Cache | 5% | Speedup on repeated prompts |
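The final /10 score is just the weighted sum of the category scores. A sketch in Node.js (the benchmark's language); the per-model speed and prefix-cache sub-scores below are my assumptions, since only the aggregate appears in the results tables:

```javascript
// Published category weights from the protocol table.
const WEIGHTS = {
  tools: 0.25, multi: 0.25, instr: 0.20,
  bilingual: 0.10, json: 0.10, speed: 0.05, cache: 0.05,
};

// Weighted sum of per-category scores (each out of 10).
function finalScore(scores) {
  return Object.entries(WEIGHTS)
    .reduce((sum, [cat, w]) => sum + w * scores[cat], 0);
}

// Qwen 2.5 Coder 32B's published sub-scores; 8/10 on speed and
// prefix cache are assumed values that reproduce the 9.3 headline.
const qwenCoder = {
  tools: 10, multi: 10, instr: 7.5,
  bilingual: 10, json: 10, speed: 8, cache: 8,
};
```

Calling `finalScore(qwenCoder)` yields 9.3 under those assumed sub-scores.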
## Dense Model Results
| # | Model | Q | Score | Tools | Multi | Instr | BiLi | JSON | tok/s |
|---|---|---|---|---|---|---|---|---|---|
| ref | Claude Sonnet 4.5 (API) | — | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* |
| 1 | Qwen 2.5 Coder 32B | Q4 | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 |
| 2 | Qwen 2.5 Instruct 32B | Q4 | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 |
| 3 | Magistral Small 2509 | Q6 | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 |
| 4 | Falcon-H1 34B | Q4 | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 5 | Hermes 4.3 36B | Q3 | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 |
| 6 | Mistral Small 3.2 | Q6 | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 7 | Qwen3 32B | Q4 | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 |
| 8 | Devstral Small 2 | Q6 | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 |
| 9 | QwQ 32B | Q4 | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 |
| 10 | Granite 4.0-H (MoE) | Q4 | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 |
| 11 | Qwen3.5 27B | Q4 | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 |
| 12 | Devstral Small v1 | Q6 | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 |
| 13 | Aya Expanse 32B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 |
| 14 | Gemma 3 27B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 |
*Claude tok/s estimated from API wall time; not directly comparable with local throughput
## MoE Models (small active params)
| Model | Q | Score | Tools | Multi | tok/s | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5 35B-A3B | Q4 | 7.9 | 8.2 | 10.0 | 84.9 | FAIL: BiLi 3.5, JSON 4.6 |
| Qwen3 30B-A3B | Q4 | 7.6 | 8.2 | 4.7 | 125.6 | VIABLE |
| Qwen3-Coder 30B-A3B | Q4 | 7.5 | 6.2 | 4.7 | 128.2 | VIABLE |
| GLM-4.7-Flash | Q4 | 6.6 | 8.2 | 2.3 | 87.8 | VIABLE |
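The JSON category that sinks Qwen3.5 35B-A3B checks three things across 3 repeated runs: parseable output, stable keys, and stable value types. A sketch of such a scorer; the 4/3/3 point split is my assumption, not the repo's exact weighting:

```javascript
// Score JSON reliability over repeated runs of the same prompt.
// Assumed split: parseable (4 pts), consistent keys (3), consistent types (3).
function scoreJson(outputs) {
  const parsed = [];
  for (const out of outputs) {
    try { parsed.push(JSON.parse(out)); } catch { /* unparseable run */ }
  }
  const parseable = (parsed.length / outputs.length) * 4;
  if (parsed.length < 2) return parseable; // nothing left to compare

  const keySig = (o) => Object.keys(o).sort().join(",");
  const typeSig = (o) => Object.keys(o).sort().map((k) => typeof o[k]).join(",");
  const sameKeys = parsed.every((o) => keySig(o) === keySig(parsed[0]));
  const sameTypes = parsed.every((o) => typeSig(o) === typeSig(parsed[0]));
  return parseable + (sameKeys ? 3 : 0) + (sameTypes ? 3 : 0);
}
```

Three runs that return `{"a": 1}`-shaped objects score 10; a model that drifts between shapes, or emits prose around the JSON, loses points fast.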
## Key Findings
### 1. A 2024 model still wins
Qwen 2.5 Coder 32B was optimized for structured output and function calling. No 2025-2026 model has topped it for agent work.
### 2. Local beats cloud for agents
Qwen 2.5 Coder (9.3) > Claude Sonnet 4.5 (8.6) on this benchmark. For pure tool execution, the local model wins at €15/mo electricity vs $20-50/mo API.
### 3. Newer Qwen = worse tool calling
| Gen | Tool Calling | Bilingual FR |
|---|---|---|
| Qwen 2.5 (2024) | 10.0 | 10.0 |
| Qwen 3 (2025) | 8.2 | 8.8 |
| Qwen 3.5 (2026) | 8.2 | 3.5 |
Qwen 3.5 mixes Chinese into French responses. Each generation scores higher on public benchmarks but executes less reliably as an agent.
### 4. Reasoning models can't agent
R1 Distill (4.0), OLMo Think (3.4), QwQ (7.3) — they waste tokens thinking when the agent needs to act.
### 5. MoE with small active params isn't enough
Fast (85-128 tok/s) but can't maintain context for multi-step chains. Dense 32B at 15-17 tok/s is slower but reliable.
### 6. Surprises
- Falcon-H1 34B (8.2) — relatively unknown model, perfect tool calling
- Magistral Small (8.2) — best French + multi-step combo
## Q5_K_M Tests
I tried upgrading the top models to Q5_K_M — all of them OOM'd at 65K context on 24GB. Q4_K_M is the ceiling for a 32B on a single 3090. Only Magistral Small (24B) benefits from a higher quant: it runs at Q6_K in 19GB.
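The rough arithmetic behind that ceiling, with every architecture constant an assumption (64 layers, 8 KV heads via GQA, head dim 128 for a Qwen-class 32B; q4_0 stores ≈4.5 bits per element):

```javascript
// KV cache size = 2 (K and V) * layers * kvHeads * headDim * ctx * bits/8.
// All architecture numbers below are assumed for a Qwen-class 32B.
function kvCacheGiB({ layers, kvHeads, headDim, ctx, bitsPerElem }) {
  const bytes = 2 * layers * kvHeads * headDim * ctx * (bitsPerElem / 8);
  return bytes / 2 ** 30;
}

const kv = kvCacheGiB({
  layers: 64, kvHeads: 8, headDim: 128,
  ctx: 65536, bitsPerElem: 4.5, // q4_0 KV cache
});
```

That comes to 4.5 GiB of KV cache. On top of roughly 19-20 GiB of Q4_K_M weights it just fits in 24 GB; Q5_K_M weights at roughly 23 GiB no longer do.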
## My Setup
- Daily driver: Qwen 2.5 Coder 32B Q4_K_M (llama.cpp)
- French tasks: Magistral Small 2509 Q6_K
- Complex reasoning: Claude API fallback
## Open-Source Benchmark
GitHub repo: github.com/Shad107/openclaw-benchmark
Node.js, zero dependencies, works with any llama.cpp setup. PRs welcome if you test other models.
Hardware: RTX 3090 24GB, 64GB RAM, Ubuntu 25.10. Temperature 0.1 for tool calls, 0.3 for generation.