<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexey</title>
    <description>The latest articles on DEV Community by Alexey (@happynood).</description>
    <link>https://dev.to/happynood</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4016504%2Ff4d14f69-a905-4b80-b9ad-62cfcc1f4ec1.jpeg</url>
      <title>DEV Community: Alexey</title>
      <link>https://dev.to/happynood</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/happynood"/>
    <language>en</language>
    <item>
      <title>Does Quantization Break Tool-Calling? I Measured It on a 4GB Laptop GPU (BFCL, 3 Seeds, Bootstrap 95% CI)</title>
      <dc:creator>Alexey</dc:creator>
      <pubDate>Sun, 05 Jul 2026 17:27:25 +0000</pubDate>
      <link>https://dev.to/happynood/does-quantization-break-tool-calling-i-measured-it-on-a-4gb-laptop-gpu-bfcl-3-seeds-bootstrap-185l</link>
      <guid>https://dev.to/happynood/does-quantization-break-tool-calling-i-measured-it-on-a-4gb-laptop-gpu-bfcl-3-seeds-bootstrap-185l</guid>
      <description>&lt;p&gt;"Is Q4 safe for tool-calling?" gets asked constantly in local-LLM circles, and the answers are almost always anecdotal — a few hundred agent-hours on one model, extrapolated to everything. I wanted a benchmark where every degradation claim comes from bootstrapping the &lt;em&gt;paired per-seed delta itself&lt;/em&gt;, not from eyeballing whether two confidence intervals happen to overlap. So I built one: &lt;strong&gt;QuantCall&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;No cloud GPUs involved: everything here ran on my own hardware, an RTX 3050 Laptop with 4096 MiB of VRAM, which is exactly why the model choices (0.6B–1.7B) look modest. That's the point: these are the models people actually run on this class of hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt; BFCL v4 (T1 simple/multiple + T6 irrelevance, n=200/seed, 3 seeds, greedy decoding, &lt;code&gt;temperature=0&lt;/code&gt;). Metrics: Schema-Validity Rate (SVR), Tool-Selection Accuracy (TSA), Argument Correctness (AC), Abstention Accuracy, and Function-Calling Reliability (FCR, their weighted aggregate). Every Δ in the tables is baseline minus quantized, so a positive Δ is a drop.&lt;/p&gt;
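
&lt;p&gt;To make "bootstrapping the paired delta" concrete, here is a minimal percentile-bootstrap sketch in Python. It is not QuantCall's implementation: the function name &lt;code&gt;paired_bootstrap_ci&lt;/code&gt;, its parameters, and the choice to resample paired per-instance scores are assumptions made for illustration, since the post does not spell out the exact resampling unit.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random


def paired_bootstrap_ci(baseline, quantized, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(baseline - quantized) over paired items.

    Hypothetical helper: both lists hold one score per benchmark instance,
    in the same order, so each index is a matched pair.
    """
    if len(baseline) != len(quantized):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    deltas = [b - q for b, q in zip(baseline, quantized)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample whole pairs with replacement, then average their deltas.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval: cut alpha/2 of the resampled means off each tail.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi


base = [1, 1, 1, 0, 1, 1, 0, 1] * 25   # toy per-instance scores, baseline
quant = [1, 0, 1, 0, 1, 1, 0, 0] * 25  # same instances, quantized
mean_delta, lo, hi = paired_bootstrap_ci(base, quant)
print(f"delta={mean_delta:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under this sketch, a degradation would count as significant when the whole interval for baseline minus quantized sits above zero. That is a more direct test than checking whether two separately computed intervals overlap.&lt;/p&gt;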

&lt;h2&gt;
  
  
  Headline result: model family beats model size as a predictor
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;SVR&lt;/th&gt;
&lt;th&gt;AC&lt;/th&gt;
&lt;th&gt;FCR (95% CI)&lt;/th&gt;
&lt;th&gt;Significant degradation?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;fp16&lt;/td&gt;
&lt;td&gt;0.877&lt;/td&gt;
&lt;td&gt;0.605&lt;/td&gt;
&lt;td&gt;0.822 [0.797, 0.847]&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;0.878&lt;/td&gt;
&lt;td&gt;0.610&lt;/td&gt;
&lt;td&gt;0.826 [0.804, 0.850]&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;0.878&lt;/td&gt;
&lt;td&gt;0.609&lt;/td&gt;
&lt;td&gt;0.820 [0.797, 0.852]&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Q4_K_M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.873&lt;/td&gt;
&lt;td&gt;0.575&lt;/td&gt;
&lt;td&gt;0.798 [0.779, 0.827]&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;AC &amp;amp; FCR yes&lt;/strong&gt; (AC Δ 95% CI: [+2.6%, +7.3%] rel.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-1.7B&lt;/td&gt;
&lt;td&gt;Q8_0 (baseline*)&lt;/td&gt;
&lt;td&gt;0.880&lt;/td&gt;
&lt;td&gt;0.681&lt;/td&gt;
&lt;td&gt;0.842 [0.805, 0.873]&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-1.7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;0.883&lt;/td&gt;
&lt;td&gt;0.686&lt;/td&gt;
&lt;td&gt;0.844 [0.814, 0.875]&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-1B&lt;/td&gt;
&lt;td&gt;fp16&lt;/td&gt;
&lt;td&gt;0.327&lt;/td&gt;
&lt;td&gt;0.188&lt;/td&gt;
&lt;td&gt;0.301 [0.277, 0.327]&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-1B&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;0.305&lt;/td&gt;
&lt;td&gt;0.176&lt;/td&gt;
&lt;td&gt;0.284 [0.266, 0.302]&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SVR, AC &amp;amp; FCR yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-1B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Q4_K_M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.280&lt;/td&gt;
&lt;td&gt;0.174&lt;/td&gt;
&lt;td&gt;0.283 [0.258, 0.305]&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;SVR, AC &amp;amp; FCR yes&lt;/strong&gt; (SVR Δ 95% CI: [+0.040, +0.055] abs.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* Qwen3-1.7B's fp16 weights don't leave room for a usable context length on a 4GB card: loading hits a genuine CUDA OOM at &lt;code&gt;n_ctx=4096&lt;/code&gt; and &lt;code&gt;n_ctx=2048&lt;/code&gt;, and succeeds only at &lt;code&gt;n_ctx=512&lt;/code&gt;, which is too small for BFCL's tool-schema prompts. Q8_0 is its disclosed fallback baseline, not a hidden substitution.&lt;/p&gt;

&lt;p&gt;Two things worth sitting with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Qwen3-0.6B's schema validity holds up all the way to Q4_K_M&lt;/strong&gt;: SVR never degrades significantly; only AC and FCR do, and only at the most aggressive quant tested.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama-3.2-1B's schema-validity is fragile at every quant level, including Q8_0&lt;/strong&gt;, the one people usually assume is basically free. Its absolute AC is also low across the board: it tends to emit stringified numbers (&lt;code&gt;"10"&lt;/code&gt; instead of &lt;code&gt;10&lt;/code&gt;), which a spec-compliant JSON Schema validator rejects.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A 1B Llama and a 0.6B Qwen3 look like similar-effort deployments on paper. Under quantization they behave nothing alike.&lt;/p&gt;
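
&lt;p&gt;The stringified-number failure mentioned above is easy to reproduce in isolation. The sketch below is a stdlib-only illustration, not the benchmark's validator: &lt;code&gt;JSON_TYPES&lt;/code&gt; and &lt;code&gt;type_errors&lt;/code&gt; are hypothetical names, and the check covers only a simplified subset of JSON Schema's &lt;code&gt;type&lt;/code&gt; keyword. A full validator such as the &lt;code&gt;jsonschema&lt;/code&gt; package handles far more.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Simplified subset of JSON Schema primitive types mapped to Python types.
JSON_TYPES = {
    "integer": int,
    "number": (int, float),
    "string": str,
    "boolean": bool,
}


def type_errors(arguments, schema):
    """Return the names of arguments whose JSON type does not match the schema."""
    errors = []
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if name not in arguments or expected is None:
            continue  # missing or unsupported here; a real validator does more
        value = arguments[name]
        # bool is a subclass of int in Python, so reject it for other types.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if is_stray_bool or not isinstance(value, expected):
            errors.append(name)
    return errors


schema = {"type": "object", "properties": {"count": {"type": "integer"}}}
print(type_errors(json.loads('{"count": 10}'), schema))    # prints []
print(type_errors(json.loads('{"count": "10"}'), schema))  # prints ['count']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both payloads are valid JSON and name the right argument; only the second fails the type check, which is why this kind of output can look fine at a glance and still count against schema validity.&lt;/p&gt;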

&lt;h2&gt;
  
  
  Harder tasks make the gap bigger, not smaller
&lt;/h2&gt;

&lt;p&gt;T1+T6 are BFCL's easiest tiers (one call, or none). As a breadth check, T2 (parallel tool calls) + T3 (ToolACE, realistic catalogs) were run at fp16 and Q4_K_M:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;SVR&lt;/th&gt;
&lt;th&gt;ΔSVR (95% CI)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-1B&lt;/td&gt;
&lt;td&gt;fp16&lt;/td&gt;
&lt;td&gt;0.572&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-1B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;0.338&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.233 abs, CI [+0.205, +0.265] — ~5x the T1+T6 drop&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;fp16&lt;/td&gt;
&lt;td&gt;0.687&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-0.6B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;0.692&lt;/td&gt;
&lt;td&gt;not significant (matches T1+T6)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Llama's schema-validity collapse at Q4_K_M is roughly &lt;strong&gt;5x larger&lt;/strong&gt; on parallel/ToolACE-style tasks than on simple single-call ones (0.233 vs. 0.047 absolute SVR drop). If you only benchmark the easy tiers, you'll underestimate exactly the failure mode that matters most for agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two negative results, reported as negative results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Constrained decoding (GBNF) didn't rescue anything.&lt;/strong&gt; After fixing a real grammar bug that had been blocking correct abstention, forcing schema-valid output via grammar constraints did &lt;em&gt;not&lt;/em&gt; measurably improve SVR or AC for Qwen3 here — and cost 6–86% more wall-clock time per instance. A real, disclosed cost with no measured benefit on this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The serving backend doesn't move the needle on its own.&lt;/strong&gt; Qwen3-0.6B's SVR/AC/FCR are statistically indistinguishable between &lt;code&gt;llama-cpp&lt;/code&gt; (GGUF) and &lt;code&gt;transformers&lt;/code&gt; (bf16, no GGUF) at matching precision, so the degradation above is a quantization effect, not a serving-engine artifact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing this
&lt;/h2&gt;

&lt;p&gt;Every result file embeds a manifest: git commit SHA, config hash, dataset sample hash, and hardware fingerprint (GPU/driver/CUDA). Nothing here is cherry-picked: the constrained-decoding and backend checks are both negative results, reported as such.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;uv
git clone https://github.com/Happynood/quant-toolcall-bench
&lt;span class="nb"&gt;cd &lt;/span&gt;quant-toolcall-bench
uv &lt;span class="nb"&gt;sync
&lt;/span&gt;make verify                                          &lt;span class="c"&gt;# no GPU needed&lt;/span&gt;
quantcall run &lt;span class="nt"&gt;--config&lt;/span&gt; configs/smoke.yaml &lt;span class="nt"&gt;--output&lt;/span&gt; results/smoke.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
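
&lt;p&gt;The post does not say how the manifest's config hash is computed. One common approach, shown here purely as an assumption, is SHA-256 over a canonical JSON encoding so that key order does not affect the result. The name &lt;code&gt;config_hash&lt;/code&gt; and the toy config keys are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json


def config_hash(config):
    """SHA-256 hex digest of a canonical JSON encoding of the config."""
    # sort_keys and fixed separators make the encoding independent of
    # dict ordering and whitespace.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Toy config; these keys are illustrative, not QuantCall's schema.
a = {"model": "Qwen3-0.6B", "quant": "Q4_K_M", "n_ctx": 4096}
b = {"n_ctx": 4096, "quant": "Q4_K_M", "model": "Qwen3-0.6B"}
print(config_hash(a) == config_hash(b))  # prints True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;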



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/Happynood/quant-toolcall-bench" rel="noopener noreferrer"&gt;github.com/Happynood/quant-toolcall-bench&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live leaderboard + Pareto chart:&lt;/strong&gt; &lt;a href="https://huggingface.co/spaces/happynood/quantcall-leaderboard" rel="noopener noreferrer"&gt;huggingface.co/spaces/happynood/quantcall-leaderboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw per-seed results:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/happynood/quantcall-results" rel="noopener noreferrer"&gt;huggingface.co/datasets/happynood/quantcall-results&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Currently covers Qwen3 (0.6B/1.7B) and Llama-3.2-1B across &lt;code&gt;llama-cpp&lt;/code&gt;, &lt;code&gt;transformers&lt;/code&gt;, and &lt;code&gt;openai&lt;/code&gt;-compatible backends; &lt;code&gt;vLLM&lt;/code&gt; is implemented against the real &lt;code&gt;LLM.chat()&lt;/code&gt; API but not yet GPU-verified — that needs more than 4GB of VRAM to test properly. If you've got a bigger card and want to extend the model or hardware coverage, the PR flow is documented in &lt;code&gt;CONTRIBUTING.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're deciding between Q4 and a higher-precision quant for an agent deployment, the honest answer from this data is: &lt;em&gt;it depends on which model family you're running, and check the harder-task numbers, not just the easy-tier ones.&lt;/em&gt; Less satisfying than a single rule of thumb, but it's what the numbers actually say.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
