Delafosse Olivier

I benchmarked 24 local LLM models for OpenClaw agent tool calling on RTX 3090

I run OpenClaw as my daily AI agent (Telegram, email, CRM) on a self-hosted RTX 3090. I tested 24 models (18 dense + 6 MoE) on what actually matters for agents: tool calling, multi-step workflows, bilingual FR/EN, and JSON reliability.

Setup: llama.cpp, 65K context, KV cache q4_0, flash attention.
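For reference, that setup maps onto `llama-server` flags roughly as follows (model filename and port are illustrative, not from the post):

```shell
# Sketch: one way to reproduce the stated setup with llama.cpp's llama-server.
# -c 65536            -> 65K context
# --cache-type-k/v    -> q4_0-quantized KV cache
# --flash-attn        -> flash attention
# -ngl 99             -> offload all layers to the GPU
llama-server \
  -m ./qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  -c 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --flash-attn \
  -ngl 99 \
  --port 8080
```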

TL;DR

  • Qwen 2.5 Coder 32B (Q4_K_M) wins at 9.3/10 — a model from October 2024 beats every 2025-2026 model
  • It also beats Claude Sonnet 4.5 API (8.6/10) on pure agent execution
  • Reasoning models (R1 Distill, QwQ, OLMo Think) make terrible agents — thinking ≠ doing
  • MoE with small active params can't handle multi-step — fast but unreliable
  • Magistral Small 2509 is the dark horse — best multi-step (9/10), perfect French

Protocol — 7 categories, 25 tests

| Category | Weight | What we measure |
|---|---|---|
| Tool Calling | 25% | Single tool: exec, read, edit, web_search, browser |
| Multi-step | 25% | Chain 3+ tools: email → HARO → CRM, KB → syndication |
| Instructions | 20% | Confirmation, FR response, CRM verify |
| Bilingual FR/EN | 10% | Pure EN/FR, language switch, long-context stability |
| JSON | 10% | Parseable, correct types, nested, consistency (3×) |
| Speed | 5% | tok/s on a 400-word generation |
| Prefix Cache | 5% | Speedup on repeated prompts |
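To make the weighting concrete, here's a minimal sketch of how category scores combine into an overall score (hypothetical names, not the benchmark's actual code):

```javascript
// Weights from the protocol table; each category is scored 0-10.
const WEIGHTS = {
  tools: 0.25, multi: 0.25, instructions: 0.20,
  bilingual: 0.10, json: 0.10, speed: 0.05, prefixCache: 0.05,
};

// Weighted sum of category scores; missing categories count as 0.
function weightedScore(scores) {
  return Object.entries(WEIGHTS)
    .reduce((sum, [cat, w]) => sum + w * (scores[cat] ?? 0), 0);
}

// Example: perfect tools/multi/bilingual/JSON, mid instructions and speed.
const overall = weightedScore({
  tools: 10, multi: 10, instructions: 7.5,
  bilingual: 10, json: 10, speed: 5, prefixCache: 5,
});
console.log(overall.toFixed(1)); // "9.0"
```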

Dense Models Results

| # | Model | Quant | Score | Tools | Multi | Instr | BiLi | JSON | tok/s |
|---|---|---|---|---|---|---|---|---|---|
| ref | Claude Sonnet 4.5 (API) | n/a | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* |
| 1 | Qwen 2.5 Coder 32B | Q4 | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 |
| 2 | Qwen 2.5 Instruct 32B | Q4 | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 |
| 3 | Magistral Small 2509 | Q6 | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 |
| 4 | Falcon-H1 34B | Q4 | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 5 | Hermes 4.3 36B | Q3 | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 |
| 6 | Mistral Small 3.2 | Q6 | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 7 | Qwen3 32B | Q4 | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 |
| 8 | Devstral Small 2 | Q6 | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 |
| 9 | QwQ 32B | Q4 | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 |
| 10 | Granite 4.0-H (MoE) | Q4 | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 |
| 11 | Qwen3.5 27B | Q4 | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 |
| 12 | Devstral Small v1 | Q6 | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 |
| 13 | Aya Expanse 32B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 |
| 14 | Gemma 3 27B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 |

*Claude tok/s estimated from API wall time, not comparable with local

MoE Models (small active params)

| Model | Quant | Score | Tools | Multi | tok/s | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5 35B-A3B | Q4 | 7.9 | 8.2 | 10.0 | 84.9 | FAIL: BiLi 3.5, JSON 4.6 |
| Qwen3 30B-A3B | Q4 | 7.6 | 8.2 | 4.7 | 125.6 | VIABLE |
| Qwen3-Coder 30B-A3B | Q4 | 7.5 | 6.2 | 4.7 | 128.2 | VIABLE |
| GLM-4.7-Flash | Q4 | 6.6 | 8.2 | 2.3 | 87.8 | VIABLE |
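The JSON category's consistency check (same structured-output request three times, identical parsed result) can be sketched like this (illustrative code, not the benchmark's implementation):

```javascript
// Given three model replies to the same structured-output prompt, check
// that each parses as JSON and that all three parse to the same value.
// Note: comparing via JSON.stringify is a simplification and is
// sensitive to key order.
function jsonConsistency(replies) {
  const parsed = [];
  for (const reply of replies) {
    try {
      parsed.push(JSON.parse(reply));
    } catch {
      return { parseable: false, consistent: false };
    }
  }
  const canonical = JSON.stringify(parsed[0]);
  const consistent = parsed.every(p => JSON.stringify(p) === canonical);
  return { parseable: true, consistent };
}
```

Replies that drift between runs fail the `consistent` flag even when every one of them parses.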

Key Findings

1. A 2024 model still wins

Qwen 2.5 Coder 32B was optimized for structured output and function calling. No 2025-2026 model has topped it for agent work.

2. Local beats cloud for agents

Qwen 2.5 Coder (9.3) > Claude Sonnet 4.5 (8.6) on this benchmark. For pure tool execution, the local model wins at €15/mo electricity vs $20-50/mo API.

3. Newer Qwen = worse tool calling

| Generation | Tool Calling | Bilingual FR |
|---|---|---|
| Qwen 2.5 (2024) | 10.0 | 10.0 |
| Qwen 3 (2025) | 8.2 | 8.8 |
| Qwen 3.5 (2026) | 8.2 | 3.5 |

Qwen 3.5 mixes Chinese into French responses. Each generation got smarter on benchmarks but worse at reliable execution.

4. Reasoning models can't agent

R1 Distill (4.0), OLMo Think (3.4), QwQ (7.3) — they waste tokens thinking when the agent needs to act.

5. MoE with small active params isn't enough

Fast (85-128 tok/s) but can't maintain context for multi-step chains. Dense 32B at 15-17 tok/s is slower but reliable.

6. Surprises

  • Falcon-H1 34B (8.2) — relatively unknown model, perfect tool calling
  • Magistral Small (8.2) — best French + multi-step combo

Q5_K_M Tests

I tried upgrading the top models to Q5_K_M; every one OOM'd at 65K context on 24GB. Q4_K_M is the ceiling for a 32B on a single 3090. Only Magistral Small benefits from a higher quant: at 24B parameters it runs at Q6_K in 19GB.
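A back-of-envelope calculation explains the ceiling. This sketch uses Qwen2.5-32B's published architecture (64 layers, 8 KV heads, head dim 128); treat those numbers and the bits-per-weight averages as approximations, not measurements from the benchmark:

```javascript
// Rough VRAM for a dense 32B at Q4_K_M with a 65K q4_0 KV cache.
const GiB = 1024 ** 3;

// Weights: ~32.5B params at ~4.85 bits/weight (Q4_K_M average).
const weightsGiB = (32.5e9 * 4.85) / 8 / GiB;

// KV cache: 2 (K and V) * layers * context * kvHeads * headDim values,
// at ~4.5 bits/value for q4_0.
const layers = 64, kvHeads = 8, headDim = 128, ctx = 65536;
const kvGiB = (2 * layers * ctx * kvHeads * headDim * 4.5) / 8 / GiB;

console.log(`weights ~${weightsGiB.toFixed(1)} GiB, KV ~${kvGiB.toFixed(1)} GiB`);
```

That already crowds 24GB before compute buffers; at Q5_K_M (~5.7 bits/weight) the weights alone grow by roughly 3 GiB, which is consistent with the observed OOM.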

My Setup

  • Daily driver: Qwen 2.5 Coder 32B Q4_K_M (llama.cpp)
  • French tasks: Magistral Small 2509 Q6_K
  • Complex reasoning: Claude API fallback

Open Source Benchmark

GitHub repo: github.com/Shad107/openclaw-benchmark

Node.js, zero dependencies, works with any llama.cpp setup. PRs welcome if you test other models.

Hardware: RTX 3090 24GB, 64GB RAM, Ubuntu 25.10. Temp 0.1 for tool calls, 0.3 for generation.
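Since llama-server exposes an OpenAI-compatible endpoint, switching temperature per call type can be as simple as this sketch (endpoint, port, and function names are assumptions, not OpenClaw's actual code):

```javascript
// Per-mode sampling: near-deterministic for tool calls, slightly
// warmer for free-text generation.
const TEMPS = { tool_call: 0.1, generation: 0.3 };

// POST a chat request to a local llama-server and return the reply text.
async function complete(messages, mode) {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages, temperature: TEMPS[mode] }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```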
