Alexey

Posted on Jul 5 • Originally published at github.com

Does Quantization Break Tool-Calling? I Measured It on a 4GB Laptop GPU (BFCL, 3 Seeds, Bootstrap 95% CI)

#llm #opensource #machinelearning #python

"Is Q4 safe for tool-calling?" gets asked constantly in local-LLM circles, and the answers are almost always anecdotal — a few hundred agent-hours on one model, extrapolated to everything. I wanted a benchmark where every degradation claim comes from bootstrapping the paired per-seed delta itself, not from eyeballing whether two confidence intervals happen to overlap. So I built one: QuantCall.

No cloud GPUs involved: everything ran on my own hardware, an RTX 3050 Laptop GPU with 4096 MiB of VRAM, which is exactly why the model sizes (0.6B–1.7B) look modest. That's the point: these are the models people actually run on this class of hardware.

Setup: BFCL v4 (T1 simple/multiple + T6 irrelevance, n=200/seed, 3 seeds, greedy decoding, temperature=0). Metrics: Schema-Validity Rate (SVR), Tool-Selection Accuracy (TSA), Argument Correctness (AC), Abstention Accuracy, and Function-Calling Reliability (FCR — their weighted aggregate).
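To make the paired-delta idea concrete, here is a minimal sketch of a percentile bootstrap on the per-instance difference between a baseline run and a quantized run. It is not QuantCall's actual implementation: the function names, the resampling unit (instances rather than seeds), and the toy scores are all assumptions made for illustration.

```python
import random


def paired_bootstrap_ci(baseline, quantized, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(baseline - quantized).

    baseline[i] and quantized[i] must score the SAME benchmark instance
    (1.0 = pass, 0.0 = fail), so every resample keeps the pairing intact.
    """
    if len(baseline) != len(quantized):
        raise ValueError("paired samples must have equal length")
    deltas = [b - q for b, q in zip(baseline, quantized)]
    n = len(deltas)
    rng = random.Random(seed)
    # Resample the deltas (not the two runs separately) with replacement.
    boot_means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(deltas) / n, lo, hi


def is_significant(lo, hi):
    # A change counts as significant only if the CI on the delta excludes 0.
    return lo > 0 or hi < 0


# Toy data, NOT benchmark results: 200 paired instances, 20 regress after quantization.
toy_baseline = [1.0] * 120 + [0.0] * 80
toy_quantized = [1.0] * 100 + [0.0] * 100
delta, lo, hi = paired_bootstrap_ci(toy_baseline, toy_quantized, n_boot=2000)
print(f"delta={delta:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]  significant={is_significant(lo, hi)}")
```

Resampling the deltas rather than each run separately is what distinguishes this from comparing two independent intervals: variance shared by both runs on the same instance cancels out, so the interval on the difference can exclude zero even when the two per-run intervals overlap.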

Headline result: model family beats model size as a predictor

Model	Quant	SVR	AC	FCR (95% CI)	Significant degradation?
Qwen3-0.6B	fp16	0.877	0.605	0.822 [0.797, 0.847]	—
Qwen3-0.6B	Q8_0	0.878	0.610	0.826 [0.804, 0.850]	No
Qwen3-0.6B	Q5_K_M	0.878	0.609	0.820 [0.797, 0.852]	No
Qwen3-0.6B	Q4_K_M	0.873	0.575	0.798 [0.779, 0.827]	AC & FCR yes (AC Δ 95% CI: [+2.6%, +7.3%] rel.)
Qwen3-1.7B	Q8_0 (baseline*)	0.880	0.681	0.842 [0.805, 0.873]	—
Qwen3-1.7B	Q4_K_M	0.883	0.686	0.844 [0.814, 0.875]	No
Llama-3.2-1B	fp16	0.327	0.188	0.301 [0.277, 0.327]	—
Llama-3.2-1B	Q8_0	0.305	0.176	0.284 [0.266, 0.302]	SVR, AC & FCR yes
Llama-3.2-1B	Q4_K_M	0.280	0.174	0.283 [0.258, 0.305]	SVR, AC & FCR yes (SVR Δ 95% CI: [+0.040, +0.055] abs.)

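* Qwen3-1.7B's fp16 weights don't fit at a usable context length on a 4GB card: loading hits a genuine CUDA OOM at n_ctx=4096 and 2048, and only succeeds at 512, which is too small for BFCL's tool-schema prompts. Q8_0 is therefore its disclosed fallback baseline, not a hidden substitution.

In the tables, Δ is baseline minus quantized, so a positive interval means the quantized model scored lower.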

Two things worth sitting with:

Qwen3-0.6B holds up through Q5_K_M, and its schema-validity never significantly degrades. Only AC and FCR do, and only at Q4_K_M, the harshest quant tested.
Llama-3.2-1B's schema-validity is fragile at every quant level, including Q8_0, the one people usually assume is basically free. Its absolute AC is also low across the board: it tends to emit stringified numbers ("10" instead of 10), which strict JSON Schema validation rejects.
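The stringified-number failure is easy to reproduce in isolation. The sketch below is a toy, stdlib-only stand-in for strict type checking on flat argument objects. It is not the benchmark's validator (a real harness would use a full JSON Schema library), and the `get_forecast`-style schema is hypothetical.

```python
import json

# Toy strict type predicates. bool is excluded from the numeric types because
# Python treats True/False as ints, while JSON Schema does not.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def schema_errors(args, schema):
    """Return a list of human-readable violations (an empty list means valid)."""
    errors = []
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            continue  # unknown keys are allowed by default, as in JSON Schema
        check = _TYPE_CHECKS.get(spec.get("type"), lambda v: True)
        if not check(value):
            errors.append(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    return errors


# Hypothetical tool schema, for illustration only.
schema = {
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city", "days"],
}

good = json.loads('{"city": "Oslo", "days": 10}')
bad = json.loads('{"city": "Oslo", "days": "10"}')  # number emitted as a string

print(schema_errors(good, schema))  # []
print(schema_errors(bad, schema))   # ['days: expected integer, got str']
```

Both payloads are valid JSON and look identical to a human skimming logs. Only the type check separates them.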

A 1B Llama and a 0.6B Qwen3 look like similar-effort deployments on paper. Under quantization they behave nothing alike.

Harder tasks make the gap bigger, not smaller

T1+T6 are BFCL's easiest tiers (one call, or none). As a breadth check, T2 (parallel tool calls) + T3 (ToolACE, realistic catalogs) were run at fp16 and Q4_K_M:

Model	Quant	SVR	ΔSVR (95% CI)
Llama-3.2-1B	fp16	0.572	—
Llama-3.2-1B	Q4_K_M	0.338	+0.233 abs, CI [+0.205, +0.265] — ~5x the T1+T6 drop
Qwen3-0.6B	fp16	0.687	—
Qwen3-0.6B	Q4_K_M	0.692	not significant (matches T1+T6)

Llama's schema-validity collapse at Q4_K_M is roughly 5x larger on parallel/ToolACE-style tasks than on simple single-call ones. If you only benchmark the easy tiers, you'll underestimate exactly the failure mode that matters most for agents.

Two negative results, reported as negative results

Constrained decoding (GBNF) didn't rescue anything. After fixing a real grammar bug that had been blocking correct abstention, forcing schema-valid output via grammar constraints did not measurably improve SVR or AC for Qwen3 here — and cost 6–86% more wall-clock time per instance. A real, disclosed cost with no measured benefit on this benchmark.
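For readers who haven't used GBNF, here is a rough sketch of what a tool-call grammar can look like, built as a string in Python. It is deliberately simplified (no arrays, no string escapes) and is not the grammar used in the benchmark. The `{"name": ..., "arguments": ...}` output shape, the rule names, and the bare `null` abstention convention are all assumptions of this sketch.

```python
import json

# Simplified GBNF template. NAME_ALTS is a placeholder that gets replaced
# with the allowed tool names, each as a quoted JSON string literal.
_TEMPLATE = r'''root ::= call | abstain
call ::= "{" ws "\"name\"" ws ":" ws name ws "," ws "\"arguments\"" ws ":" ws object ws "}"
abstain ::= "null"
name ::= NAME_ALTS
object ::= "{" ws (pair (ws "," ws pair)*)? ws "}"
pair ::= string ws ":" ws value
value ::= string | number | object | "true" | "false" | "null"
string ::= "\"" [^"\\]* "\""
number ::= "-"? [0-9]+ ("." [0-9]+)?
ws ::= [ \t\n]*
'''


def tool_call_grammar(tool_names):
    """Return GBNF text admitting one call to a listed tool, or an abstention."""
    if not tool_names:
        raise ValueError("at least one tool name is required")
    # json.dumps twice: once to get the JSON string the model must emit,
    # once more to turn that into an escaped grammar literal.
    alts = " | ".join(json.dumps(json.dumps(name)) for name in tool_names)
    return _TEMPLATE.replace("NAME_ALTS", alts)


grammar_text = tool_call_grammar(["get_weather", "search"])
print(grammar_text)

# With llama-cpp-python, text like this would typically be loaded through
# LlamaGrammar.from_string(grammar_text) and passed as grammar=... to a
# completion call. That step is not executed here.
```

Note what the grammar can and cannot do: it restricts the tool name to the catalog and guarantees parseable JSON, but it does not know each tool's argument types or required fields, and it says nothing about whether the values are right. The `abstain` branch matters for irrelevance cases, since a grammar that only admits calls makes correct abstention impossible.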

Serving backend doesn't move the needle independent of quantization. Qwen3-0.6B's SVR/AC/FCR are statistically indistinguishable between llama-cpp (GGUF) and transformers (bf16, no GGUF) at matching precision — so the degradation above is a quantization effect, not a serving-engine artifact.

Reproducing this

Every result file embeds a manifest: git commit SHA, config hash, dataset sample hash, and hardware fingerprint (GPU/driver/CUDA). Nothing here is cherry-picked — the constrained-decoding and backend checks are both negative results, reported as such.
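A manifest like that can be sketched in a few lines. The version below is an illustration under assumptions, not the project's code: the field names, the choice of SHA-256 over canonical JSON, and passing the hardware fingerprint in as a plain dict are all hypothetical.

```python
import hashlib
import json
import subprocess


def stable_hash(obj):
    """SHA-256 of a JSON-serialisable object, independent of dict key order."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def git_commit_sha():
    """Current commit SHA, or None when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_manifest(config, sample_ids, hardware):
    return {
        "git_commit": git_commit_sha(),
        "config_hash": stable_hash(config),
        # Sorting makes the hash identify the sample SET, not its order.
        "dataset_sample_hash": stable_hash(sorted(sample_ids)),
        "hardware": hardware,  # e.g. GPU name, driver and CUDA versions
    }


manifest = build_manifest(
    config={"model": "Qwen3-0.6B", "quant": "Q4_K_M", "n": 200, "seed": 0},
    sample_ids=["simple_12", "irrelevance_3"],  # hypothetical instance ids
    hardware={"gpu": "RTX 3050 Laptop", "vram_mib": 4096},
)
print(json.dumps(manifest, indent=2))
```

Hashing canonical JSON means two runs with the same settings produce the same `config_hash` even if the config file's keys were reordered, which is what makes the hash useful for checking that two result files are comparable.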

pip install uv
git clone https://github.com/Happynood/quant-toolcall-bench
cd quant-toolcall-bench
uv sync
make verify                                          # no GPU needed
quantcall run --config configs/smoke.yaml --output results/smoke.json

Code: github.com/Happynood/quant-toolcall-bench
Live leaderboard + Pareto chart: huggingface.co/spaces/happynood/quantcall-leaderboard
Raw per-seed results: huggingface.co/datasets/happynood/quantcall-results

Currently covers Qwen3 (0.6B/1.7B) and Llama-3.2-1B across llama-cpp, transformers, and openai-compatible backends; vLLM is implemented against the real LLM.chat() API but not yet GPU-verified — that needs more than 4GB of VRAM to test properly. If you've got a bigger card and want to extend the model or hardware coverage, the PR flow is documented in CONTRIBUTING.md.

If you're deciding between Q4 and Q6 for an agent deployment, the honest answer from this data is: it depends which model family you're running, and check the harder-task numbers, not just the easy-tier ones. Less satisfying than a single rule of thumb, but it's what the numbers actually say.

Top comments (10)

Nazar Boyko • Jul 6

Publishing the constrained decoding run as a negative result, with the wall clock cost attached, buys the rest of these numbers a lot of trust. One question on the Llama failure mode. Since a big share of its schema errors are stringified numbers like "10", how much of the SVR gap closes if the harness coerces types before validating? Plenty of real agent stacks quietly do that coercion, so the answer would separate "Llama is bad at tools" from "Llama is bad at strict JSON". And since the per seed results are public, someone could probably check that without rerunning anything.

Alexey • Jul 6

Good catch - and honestly this is the one caveat in the piece I didn't get to quantify. The README flags the pattern (Llama emitting "10" instead of 10) but that's from eyeballing failures during the parser audit, not a systematic count.
Here's the annoying part: I can't actually slice this out of the published per-seed CSVs as they stand. Those only carry aggregate SVR/TSA/AC per seed - the parsed arguments per instance never get written to disk. Just went and checked RunResult.to_dict() and yeah, it drops instance_results entirely. So there's nothing to grep through without running it again.
Good news is it's a cheap rerun, not a redesign - no new model needed, just wrap the validator with a type-coercion pass before jsonschema.validate (it's strict-typed by default, doesn't coerce) and diff SVR against the current version on the same Llama-3.2-1B checkpoints. I'll queue it, and this time I'll actually persist instance-level output so this doesn't need a rerun next time someone asks. My guess is it closes some of the gap but not most of it - SVR fails for reasons other than type mismatches too (missing required args, wrong enums), the stringified-number thing is just the one that was visible from eyeballing, not necessarily the majority cause.
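A minimal sketch of that coercion pass, for anyone who wants to try it on their own logs. Everything here is assumed for illustration: `strict_ok` is a toy stand-in for the real strict validator, the schema and the four calls are made up, and the resulting rates say nothing about Llama-3.2-1B's actual numbers.

```python
def coerce_args(args, schema):
    """Copy args, converting string values to the type the schema asks for."""
    out = dict(args)
    for name, spec in schema.get("properties", {}).items():
        value = out.get(name)
        if not isinstance(value, str):
            continue
        wanted = spec.get("type")
        try:
            if wanted == "integer":
                out[name] = int(value)
            elif wanted == "number":
                out[name] = float(value)
            elif wanted == "boolean" and value.lower() in ("true", "false"):
                out[name] = value.lower() == "true"
        except ValueError:
            pass  # leave uncoercible values alone so validation still fails
    return out


_PY_TYPES = {"integer": int, "number": (int, float), "string": str, "boolean": bool}


def strict_ok(args, schema):
    """Toy strict check: required keys present and declared types match."""
    if any(req not in args for req in schema.get("required", [])):
        return False
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            continue
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False  # Python bools are ints, JSON booleans are not
        if not isinstance(value, _PY_TYPES[spec["type"]]):
            return False
    return True


def svr(calls, schema, coerce=False):
    """Schema-validity rate over a list of parsed argument dicts."""
    checked = [coerce_args(c, schema) if coerce else c for c in calls]
    return sum(strict_ok(c, schema) for c in checked) / len(calls)


schema = {
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city", "days"],
}
toy_calls = [
    {"city": "Oslo", "days": 10},     # valid
    {"city": "Oslo", "days": "10"},   # stringified number: coercion rescues it
    {"city": "Oslo"},                 # missing required arg: coercion can't help
    {"city": "Oslo", "days": "ten"},  # wrong value: coercion can't help
]
print(svr(toy_calls, schema), svr(toy_calls, schema, coerce=True))  # 0.25 0.5
```

The difference between the two rates is the share of failures attributable to type mismatches alone. Whatever remains after coercion is the "bad at tools" part rather than the "bad at strict JSON" part.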

Nazar Boyko • Jul 6

Thanks for sharing!

Alex Shev • Jul 5

Measuring tool-calling under quantization is a useful test because accuracy alone hides the product risk. A model can answer well and still become unreliable once tools, arguments, and retries enter the loop.

Alexey • Jul 6

Yeah, that's basically the whole reason this exists instead of just reading FCR off BFCL's own harness. A model can nail every text answer and still hand a downstream tool a "10" where an int belongs, and that failure is invisible until something in the loop actually tries to int() it. Accuracy-only evals just don't have a slot for that kind of bug.

Alex Shev • Jul 6

That is the bug class I would want separated in the report: semantic correctness, schema correctness, and executable correctness. Tool-calling failures often look small in a benchmark table, but they are the ones that break the actual loop.

mote • Jul 9

The split you found between schema-validity and argument-correctness degrading on different schedules is the part most edge deployments will miss. Qwen3-0.6B keeps emitting valid schemas all the way to Q4, but starts returning wrong arguments, while Llama-3.2-1B breaks schemas even at Q8_0. A tool call that's schema-valid but semantically wrong is the nastier failure: it passes validation and then does the wrong thing quietly.

That 5x amplification on parallel and ToolACE-style calls is exactly where production agents live and almost never benchmark. Curious whether a small post-call schema-repair or rerank step recovers the Q4 argument-accuracy loss, or if the drop is intrinsic to the quantized weights and you just have to pick a higher quant for anything doing real tool orchestration.

Kartik N V J K • Jul 7

The stringified-number failure is the detail worth highlighting, because a model emitting "10" instead of 10 passes a human eyeball but fails JSON-schema validation every time, and that stays invisible until live tool calls start failing downstream. Family beating size as the predictor also fits the idea that schema adherence gets baked in during instruction tuning rather than scaling with parameter count. Did Qwen3-0.6B hold the numeric types specifically, or was its robustness mostly in required-field coverage?

Dipankar Sarkar • Jul 6

The family-beats-size result matches what I keep seeing: schema-validity and argument-correctness fail on different axes and people conflate them. If you force JSON with a grammar or constrained decoder, SVR climbs toward 1.0 and everyone declares Q4 safe, but AC keeps rotting underneath because the model still picks the wrong field or fills a plausible-wrong value. That is the dangerous regime: structurally valid, semantically wrong, silently retried.

The other thing your CI framing exposes: a 2-7% per-call AC drop is not scary alone, but agent loops multiply it across steps, so a 5-hop trajectory eats far more than 7%. Were your BFCL runs single-call or multi-turn? The compounding is where local-model tool use actually breaks in production, and it never shows up in a per-call benchmark.

Alexey • Jul 6

"Structurally valid, semantically wrong, silently retried" is a good way to put it, and it's exactly why FCR alone isn't the number to trust here. The GBNF pass is actually direct evidence for that split - forcing the grammar pushed SVR toward 1.0 for Qwen3 but didn't move AC at all. The constraint fixes the syntax and does nothing about whether the value inside is right. That run is probably the cleanest demonstration of the exact thing you're describing, and I didn't even frame it that way in the writeup.
On single-call vs multi-turn: everything here is single-turn. T1/T2/T6 are BFCL tiers where T2 is parallel calls within one turn, not a sequential conversation, and T3 (ToolACE) explicitly collapses multi-turn conversations down to the first exchange - I did that on purpose to keep ground-truth matching sane, but it does mean the compounding you're pointing at isn't in these numbers at all.
Rough math on why that matters: take the 2-7% per-call AC drop and assume steps are independent (generous - real trajectories will correlate through shared, increasingly-garbled context, so this is a floor). Over a 5-hop trajectory, 1-(1-p)^5 for p=0.02 is ~9.6%, for p=0.07 it's ~30%. So a number that reads as "fine, single digits" per call turns into "something breaks on roughly a third of trajectories" at the high end once you chain it. Multi-turn is the natural next tier here - already on my list right after real MCP tool schemas, and this is a good argument to move it up.
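The compounding arithmetic above, as a small script. It treats the per-call drop as an independent per-call failure probability p, which is the same simplification the rough math makes. The function name is just for this sketch.

```python
def trajectory_failure_rate(p_per_call, hops):
    """P(at least one failure across `hops` independent calls) = 1 - (1 - p)^hops."""
    return 1.0 - (1.0 - p_per_call) ** hops


for p in (0.02, 0.07):
    print(f"p={p:.2f} over 5 hops: {trajectory_failure_rate(p, 5):.1%}")
# p=0.02 over 5 hops: 9.6%
# p=0.07 over 5 hops: 30.4%
```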