"Is Q4 safe for tool-calling?" gets asked constantly in local-LLM circles, and the answers are almost always anecdotal — a few hundred agent-hours on one model, extrapolated to everything. I wanted a benchmark where every degradation claim comes from bootstrapping the paired per-seed delta itself, not from eyeballing whether two confidence intervals happen to overlap. So I built one: QuantCall.
No cloud GPUs involved — everything below ran on my own hardware, an RTX 3050 Laptop with 4096 MiB of VRAM, which is exactly why the model choices below (0.6B–1.7B) look modest. That's the point: these are the models people are actually running on this class of hardware.
Setup: BFCL v4 (T1 simple/multiple + T6 irrelevance, n=200/seed, 3 seeds, greedy decoding, temperature=0). Metrics: Schema-Validity Rate (SVR), Tool-Selection Accuracy (TSA), Argument Correctness (AC), Abstention Accuracy, and Function-Calling Reliability (FCR — their weighted aggregate).
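The paired-delta bootstrap mentioned in the opening can be sketched roughly as follows. This is a minimal illustration, not QuantCall's actual code: the function names are hypothetical, and it assumes resampling happens over paired per-instance scores (the same prompt scored at baseline and at the quantized level), which the write-up does not spell out. A positive delta means the quantized model scored lower.

```python
import random
from statistics import fmean


def paired_bootstrap_ci(base, quant, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(base - quant) over paired scores.

    base[i] and quant[i] must be scores for the same instance, so each
    resample keeps the pairing intact and bootstraps the delta itself.
    """
    if len(base) != len(quant):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    deltas = [b - q for b, q in zip(base, quant)]
    n = len(deltas)
    means = []
    for _ in range(n_boot):
        # Resample instance indices with replacement, then average the deltas.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return fmean(deltas), lo, hi


def is_significant(lo, hi):
    """A delta counts as significant when its CI excludes zero."""
    return lo > 0 or hi < 0
```

The design point is that the interval is built on the difference directly. Two overlapping per-model intervals can still hide a significant paired difference, because instance-level noise shared by both runs cancels out in the delta.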
Headline result: model family beats model size as a predictor
| Model | Quant | SVR | AC | FCR (95% CI) | Significant degradation? |
|---|---|---|---|---|---|
| Qwen3-0.6B | fp16 | 0.877 | 0.605 | 0.822 [0.797, 0.847] | — |
| Qwen3-0.6B | Q8_0 | 0.878 | 0.610 | 0.826 [0.804, 0.850] | No |
| Qwen3-0.6B | Q5_K_M | 0.878 | 0.609 | 0.820 [0.797, 0.852] | No |
| Qwen3-0.6B | Q4_K_M | 0.873 | 0.575 | 0.798 [0.779, 0.827] | AC & FCR yes (AC Δ 95% CI: [+2.6%, +7.3%] rel.) |
| Qwen3-1.7B | Q8_0 (baseline*) | 0.880 | 0.681 | 0.842 [0.805, 0.873] | — |
| Qwen3-1.7B | Q4_K_M | 0.883 | 0.686 | 0.844 [0.814, 0.875] | No |
| Llama-3.2-1B | fp16 | 0.327 | 0.188 | 0.301 [0.277, 0.327] | — |
| Llama-3.2-1B | Q8_0 | 0.305 | 0.176 | 0.284 [0.266, 0.302] | SVR, AC & FCR yes |
| Llama-3.2-1B | Q4_K_M | 0.280 | 0.174 | 0.283 [0.258, 0.305] | SVR, AC & FCR yes (SVR Δ 95% CI: [+0.040, +0.055] abs.) |
\* Qwen3-1.7B's fp16 weights don't fit alongside a usable context length on a 4GB card: genuine CUDA OOM at `n_ctx=4096` and `2048`, and the model only loads at `512`, which is too small for BFCL's tool-schema prompts. Q8_0 is its disclosed fallback baseline, not a hidden substitution.
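A back-of-envelope estimate shows why the fp16 baseline is out of reach. This is only a sketch: it treats "1.7B" as the exact parameter count, counts weights only, and assumes the standard Q8_0 layout of 32 int8 values plus one fp16 scale per block (34 bytes per 32 weights). Real GGUF files and runtime overheads will differ somewhat.

```python
MIB = 1024 ** 2


def weight_mib(n_params, bytes_per_param):
    """Approximate size of the weights alone, in MiB."""
    return n_params * bytes_per_param / MIB


VRAM_MIB = 4096            # RTX 3050 Laptop, as stated above
N_PARAMS = 1.7e9           # nominal parameter count (assumption)

fp16_mib = weight_mib(N_PARAMS, 2)           # 2 bytes per weight
q8_mib = weight_mib(N_PARAMS, 34 / 32)       # Q8_0: 34 bytes per 32 weights
fp16_headroom_mib = VRAM_MIB - fp16_mib      # left for KV cache, activations, CUDA context

print(f"fp16 weights: {fp16_mib:.0f} MiB, headroom: {fp16_headroom_mib:.0f} MiB")
print(f"Q8_0 weights: {q8_mib:.0f} MiB")
```

Under these assumptions the fp16 weights take roughly 3.2 GiB, leaving well under 1 GiB for everything else, while Q8_0 needs a little over half of that.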
Two things worth sitting with:
- Qwen3-0.6B holds up all the way to Q4_K_M: schema-validity never significantly degrades; only AC/FCR do, and only at the harshest quant tested.
- Llama-3.2-1B's schema-validity is fragile at every quant level, including Q8_0, the one people usually assume is basically free. Its absolute AC is also low across the board; it tends to emit stringified numbers (`"10"` instead of `10`), which correct JSON-schema validation rejects.
A 1B Llama and a 0.6B Qwen3 look like similar-effort deployments on paper. Under quantization they behave nothing alike.
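The stringified-number failure mentioned above is easy to reproduce with a strict type check. The sketch below is a stdlib-only stand-in for a real JSON Schema validator, not the benchmark's scorer; the `set_timer` tool and its `minutes` parameter are made up for illustration.

```python
import json


def type_ok(value, json_type):
    """Strict JSON-type check with no coercion (bool is not a number)."""
    if json_type == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if json_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unsupported type in this sketch: {json_type}")


def validate_args(args, properties):
    """Return a list of type errors for the arguments of one tool call."""
    errors = []
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"{name}: unknown parameter")
        elif not type_ok(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    return errors


# Hypothetical tool schema: set_timer(minutes: integer)
SET_TIMER_PROPS = {"minutes": {"type": "integer"}}

good = validate_args(json.loads('{"minutes": 10}'), SET_TIMER_PROPS)
bad = validate_args(json.loads('{"minutes": "10"}'), SET_TIMER_PROPS)
print(good)  # no errors
print(bad)   # one type error for "minutes"
```

A lenient harness that coerces `"10"` to `10` would score this call as correct, which is one way two benchmarks can disagree about the same model.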
Harder tasks make the gap bigger, not smaller
T1+T6 are BFCL's easiest tiers (one call, or none). As a breadth check, T2 (parallel tool calls) + T3 (ToolACE, realistic catalogs) were run at fp16 and Q4_K_M:
| Model | Quant | SVR | ΔSVR (95% CI) |
|---|---|---|---|
| Llama-3.2-1B | fp16 | 0.572 | — |
| Llama-3.2-1B | Q4_K_M | 0.338 | +0.233 abs, CI [+0.205, +0.265] — ~5x the T1+T6 drop |
| Qwen3-0.6B | fp16 | 0.687 | — |
| Qwen3-0.6B | Q4_K_M | 0.692 | not significant (matches T1+T6) |
Llama's schema-validity collapse at Q4_K_M is roughly 5x larger on parallel/ToolACE-style tasks than on simple single-call ones. If you only benchmark the easy tiers, you'll underestimate exactly the failure mode that matters most for agents.
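The "roughly 5x" figure can be checked from the rounded SVR point estimates in the two tables. This is plain arithmetic on published table values, so it matches the bootstrapped deltas only up to rounding.

```python
# Llama-3.2-1B SVR, fp16 vs Q4_K_M, taken from the tables above
easy_drop = 0.327 - 0.280   # T1 + T6 (simple/multiple + irrelevance)
hard_drop = 0.572 - 0.338   # T2 + T3 (parallel calls + ToolACE)

ratio = hard_drop / easy_drop
print(f"easy drop: {easy_drop:.3f}, hard drop: {hard_drop:.3f}, ratio: {ratio:.1f}x")
```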
Two negative results, reported as negative results
Constrained decoding (GBNF) didn't rescue anything. After fixing a real grammar bug that had been blocking correct abstention, forcing schema-valid output via grammar constraints did not measurably improve SVR or AC for Qwen3 here — and cost 6–86% more wall-clock time per instance. A real, disclosed cost with no measured benefit on this benchmark.
Serving backend doesn't move the needle independent of quantization. Qwen3-0.6B's SVR/AC/FCR are statistically indistinguishable between llama-cpp (GGUF) and transformers (bf16, no GGUF) at matching precision — so the degradation above is a quantization effect, not a serving-engine artifact.
Reproducing this
Every result file embeds a manifest: git commit SHA, config hash, dataset sample hash, and hardware fingerprint (GPU/driver/CUDA). Nothing here is cherry-picked — the constrained-decoding and backend checks are both negative results, reported as such.
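For readers who want the same kind of provenance in their own runs, a manifest like the one described could be assembled along these lines. This is a sketch under assumptions, not QuantCall's implementation: the field names, the use of SHA-256 over canonical JSON, and passing the hardware fingerprint in as a plain dict are all choices made for the example.

```python
import hashlib
import json
import subprocess


def stable_hash(obj):
    """SHA-256 of a canonical JSON encoding, independent of dict key order."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def git_commit_sha():
    """Current commit SHA, or None when not inside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_manifest(config, sample_ids, hardware):
    """Bundle everything needed to tell two result files apart."""
    return {
        "git_commit": git_commit_sha(),
        "config_hash": stable_hash(config),
        # Sort so the hash depends on which samples were drawn, not their order.
        "dataset_sample_hash": stable_hash(sorted(sample_ids)),
        "hardware": hardware,
    }


manifest = build_manifest(
    config={"quant": "Q4_K_M", "temperature": 0, "n_per_seed": 200},
    sample_ids=["simple_12", "irrelevance_3", "multiple_7"],  # made-up IDs
    hardware={"gpu": "RTX 3050 Laptop", "vram_mib": 4096},
)
```

Hashing a canonical encoding rather than the raw config file means cosmetic edits such as reordered keys do not change the hash, while any change to a value does.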
```bash
pip install uv
git clone https://github.com/Happynood/quant-toolcall-bench
cd quant-toolcall-bench
uv sync
make verify  # no GPU needed
quantcall run --config configs/smoke.yaml --output results/smoke.json
```
- Code: github.com/Happynood/quant-toolcall-bench
- Live leaderboard + Pareto chart: huggingface.co/spaces/happynood/quantcall-leaderboard
- Raw per-seed results: huggingface.co/datasets/happynood/quantcall-results
Currently covers Qwen3 (0.6B/1.7B) and Llama-3.2-1B across llama-cpp, transformers, and openai-compatible backends; vLLM is implemented against the real LLM.chat() API but not yet GPU-verified — that needs more than 4GB of VRAM to test properly. If you've got a bigger card and want to extend the model or hardware coverage, the PR flow is documented in CONTRIBUTING.md.
If you're deciding between Q4 and Q6 for an agent deployment, the honest answer from this data is: it depends which model family you're running, and check the harder-task numbers, not just the easy-tier ones. Less satisfying than a single rule of thumb, but it's what the numbers actually say.