I run OpenClaw as my daily AI agent (Telegram, email, CRM) on a self-hosted RTX 3090. I tested 24 models (18 dense + 6 MoE) on what actually matters for agents: tool calling, multi-step workflows, bilingual FR/EN, and JSON reliability.
Setup: llama.cpp, 65K context, KV cache q4_0, flash attention.
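A minimal launch sketch of that setup, assuming llama.cpp's `llama-server` binary. The model filename, GPU layer count, and port are placeholders, and flag spellings vary between llama.cpp versions, so check `llama-server --help` on your build:

```shell
# 65K context, q4_0 KV cache (both K and V), flash attention,
# all layers offloaded to the 3090. Adjust -m to your GGUF file.
llama-server \
  -m ./qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  -c 65536 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  -fa \
  -ngl 99 \
  --port 8080
```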
## TL;DR
- Qwen 2.5 Coder 32B (Q4_K_M) wins at 9.3/10 — a model from October 2024 beats every 2025-2026 model
- It also beats Claude Sonnet 4.5 API (8.6/10) on pure agent execution
- Reasoning models (R1 Distill, QwQ, OLMo Think) make terrible agents — thinking ≠ doing
- MoE with small active params can't handle multi-step — fast but unreliable
- Magistral Small 2509 is the dark horse — best multi-step (9/10), perfect French
## Protocol — 7 categories, 25 tests
| Category | Weight | What we measure |
|---|---|---|
| Tool Calling | 25% | Single tool: exec, read, edit, web_search, browser |
| Multi-step | 25% | Chain 3+ tools: email→HARO→CRM, KB→syndication |
| Instructions | 20% | Confirmation, FR response, CRM verify |
| Bilingual FR/EN | 10% | Pure EN/FR, switch, long context stability |
| JSON | 10% | Parseable, types, nested, consistency (3x) |
| Speed | 5% | tok/s on 400-word generation |
| Prefix Cache | 5% | Speedup on repeated prompts |
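The final /10 score is just the weighted sum of the category scores. A sketch in Node.js (the benchmark's language); the per-model speed and prefix-cache sub-scores below are my assumptions, since only the aggregate appears in the results tables:

```javascript
// Published category weights from the protocol table.
const WEIGHTS = {
  tools: 0.25, multi: 0.25, instr: 0.20,
  bilingual: 0.10, json: 0.10, speed: 0.05, cache: 0.05,
};

// Weighted sum of per-category scores (each out of 10).
function finalScore(scores) {
  return Object.entries(WEIGHTS)
    .reduce((sum, [cat, w]) => sum + w * scores[cat], 0);
}

// Qwen 2.5 Coder 32B's published sub-scores; 8/10 on speed and
// prefix cache are assumed values that reproduce the 9.3 headline.
const qwenCoder = {
  tools: 10, multi: 10, instr: 7.5,
  bilingual: 10, json: 10, speed: 8, cache: 8,
};
```

Calling `finalScore(qwenCoder)` yields 9.3 under those assumed sub-scores.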
## Dense Model Results
| # | Model | Q | Score | Tools | Multi | Instr | BiLi | JSON | tok/s |
|---|---|---|---|---|---|---|---|---|---|
| ref | Claude Sonnet 4.5 (API) | — | 8.6 | 8.2 | 9.0 | 7.5 | 10.0 | 10.0 | 34.6* |
| 1 | Qwen 2.5 Coder 32B | Q4 | 9.3 | 10.0 | 10.0 | 7.5 | 10.0 | 10.0 | 15.2 |
| 2 | Qwen 2.5 Instruct 32B | Q4 | 9.3 | 10.0 | 9.0 | 8.3 | 10.0 | 10.0 | 17.5 |
| 3 | Magistral Small 2509 | Q6 | 8.2 | 6.2 | 9.0 | 7.5 | 10.0 | 10.0 | 16.2 |
| 4 | Falcon-H1 34B | Q4 | 8.2 | 10.0 | 6.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 5 | Hermes 4.3 36B | Q3 | 8.0 | 8.2 | 8.0 | 5.8 | 10.0 | 10.0 | 14.0 |
| 6 | Mistral Small 3.2 | Q6 | 7.9 | 9.0 | 5.7 | 7.5 | 10.0 | 10.0 | 16.9 |
| 7 | Qwen3 32B | Q4 | 7.7 | 8.2 | 6.7 | 5.8 | 8.8 | 10.0 | 16.0 |
| 8 | Devstral Small 2 | Q6 | 7.5 | 8.2 | 4.7 | 7.5 | 10.0 | 10.0 | 15.9 |
| 9 | QwQ 32B | Q4 | 7.3 | 8.2 | 4.7 | 7.5 | 7.0 | 10.0 | 15.5 |
| 10 | Granite 4.0-H (MoE) | Q4 | 7.2 | 8.2 | 4.7 | 5.8 | 10.0 | 10.0 | 53.3 |
| 11 | Qwen3.5 27B | Q4 | 7.1 | 8.2 | 6.7 | 8.3 | 3.5 | 6.6 | 17.9 |
| 12 | Devstral Small v1 | Q6 | 5.6 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 16.8 |
| 13 | Aya Expanse 32B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 10.0 | 14.8 |
| 14 | Gemma 3 27B | Q4 | 5.5 | 6.4 | 0.0 | 5.8 | 10.0 | 8.0 | 18.2 |
*Claude tok/s estimated from API wall time; not directly comparable with local throughput
## MoE Models (small active params)
| Model | Q | Score | Tools | Multi | tok/s | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5 35B-A3B | Q4 | 7.9 | 8.2 | 10.0 | 84.9 | FAIL: BiLi 3.5, JSON 4.6 |
| Qwen3 30B-A3B | Q4 | 7.6 | 8.2 | 4.7 | 125.6 | VIABLE |
| Qwen3-Coder 30B-A3B | Q4 | 7.5 | 6.2 | 4.7 | 128.2 | VIABLE |
| GLM-4.7-Flash | Q4 | 6.6 | 8.2 | 2.3 | 87.8 | VIABLE |
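The JSON category that sinks Qwen3.5 35B-A3B checks three things across 3 repeated runs: parseable output, stable keys, and stable value types. A sketch of such a scorer; the 4/3/3 point split is my assumption, not the repo's exact weighting:

```javascript
// Score JSON reliability over repeated runs of the same prompt.
// Assumed split: parseable (4 pts), consistent keys (3), consistent types (3).
function scoreJson(outputs) {
  const parsed = [];
  for (const out of outputs) {
    try { parsed.push(JSON.parse(out)); } catch { /* unparseable run */ }
  }
  const parseable = (parsed.length / outputs.length) * 4;
  if (parsed.length < 2) return parseable; // nothing left to compare

  const keySig = (o) => Object.keys(o).sort().join(",");
  const typeSig = (o) => Object.keys(o).sort().map((k) => typeof o[k]).join(",");
  const sameKeys = parsed.every((o) => keySig(o) === keySig(parsed[0]));
  const sameTypes = parsed.every((o) => typeSig(o) === typeSig(parsed[0]));
  return parseable + (sameKeys ? 3 : 0) + (sameTypes ? 3 : 0);
}
```

Three runs that return `{"a": 1}`-shaped objects score 10; a model that drifts between shapes, or emits prose around the JSON, loses points fast.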
## Key Findings
### 1. A 2024 model still wins
Qwen 2.5 Coder 32B was optimized for structured output and function calling. No 2025-2026 model has topped it for agent work.
### 2. Local beats cloud for agents
Qwen 2.5 Coder (9.3) > Claude Sonnet 4.5 (8.6) on this benchmark. For pure tool execution, the local model wins at €15/mo electricity vs $20-50/mo API.
### 3. Newer Qwen = worse tool calling
| Gen | Tool Calling | Bilingual FR |
|---|---|---|
| Qwen 2.5 (2024) | 10.0 | 10.0 |
| Qwen 3 (2025) | 8.2 | 8.8 |
| Qwen 3.5 (2026) | 8.2 | 3.5 |
Qwen 3.5 mixes Chinese into French responses. Each generation scores higher on public benchmarks but executes less reliably as an agent.
### 4. Reasoning models can't agent
R1 Distill (4.0), OLMo Think (3.4), QwQ (7.3) — they waste tokens thinking when the agent needs to act.
### 5. MoE with small active params isn't enough
Fast (85-128 tok/s) but can't maintain context for multi-step chains. Dense 32B at 15-17 tok/s is slower but reliable.
### 6. Surprises
- Falcon-H1 34B (8.2) — relatively unknown model, perfect tool calling
- Magistral Small (8.2) — best French + multi-step combo
## Q5_K_M Tests
I tried upgrading the top models to Q5_K_M — all of them OOM'd at 65K context on 24GB. Q4_K_M is the ceiling for a 32B on a single 3090. Only Magistral Small (24B) benefits from a higher quant: it runs at Q6_K in 19GB.
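The rough arithmetic behind that ceiling, with every architecture constant an assumption (64 layers, 8 KV heads via GQA, head dim 128 for a Qwen-class 32B; q4_0 stores ≈4.5 bits per element):

```javascript
// KV cache size = 2 (K and V) * layers * kvHeads * headDim * ctx * bits/8.
// All architecture numbers below are assumed for a Qwen-class 32B.
function kvCacheGiB({ layers, kvHeads, headDim, ctx, bitsPerElem }) {
  const bytes = 2 * layers * kvHeads * headDim * ctx * (bitsPerElem / 8);
  return bytes / 2 ** 30;
}

const kv = kvCacheGiB({
  layers: 64, kvHeads: 8, headDim: 128,
  ctx: 65536, bitsPerElem: 4.5, // q4_0 KV cache
});
```

That comes to 4.5 GiB of KV cache. On top of roughly 19-20 GiB of Q4_K_M weights it just fits in 24 GB; Q5_K_M weights at roughly 23 GiB no longer do.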
## My Setup
- Daily driver: Qwen 2.5 Coder 32B Q4_K_M (llama.cpp)
- French tasks: Magistral Small 2509 Q6_K
- Complex reasoning: Claude API fallback
## Open-Source Benchmark
GitHub repo: github.com/Shad107/openclaw-benchmark
Node.js, zero dependencies, works with any llama.cpp setup. PRs welcome if you test other models.
Hardware: RTX 3090 24GB, 64GB RAM, Ubuntu 25.10. Temperature 0.1 for tool calls, 0.3 for generation.