<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Syed Azeez</title>
    <description>The latest articles on DEV Community by Syed Azeez (@syedazeez).</description>
    <link>https://dev.to/syedazeez</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1338706%2F5bd18cd2-c65a-4368-a77c-938639f003f7.jpeg</url>
      <title>DEV Community: Syed Azeez</title>
      <link>https://dev.to/syedazeez</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/syedazeez"/>
    <language>en</language>
    <item>
      <title>Quantizing three models to fit a 6 GB laptop GPU, and the one that wouldn't</title>
      <dc:creator>Syed Azeez</dc:creator>
      <pubDate>Thu, 18 Jun 2026 10:05:46 +0000</pubDate>
      <link>https://dev.to/syedazeez/quantizing-three-models-to-fit-a-6-gb-laptop-gpu-and-the-one-that-wouldnt-4pjl</link>
      <guid>https://dev.to/syedazeez/quantizing-three-models-to-fit-a-6-gb-laptop-gpu-and-the-one-that-wouldnt-4pjl</guid>
      <description>&lt;p&gt;My GPU is an RTX 3050 Laptop card. It has 6 GB of VRAM, about 5.67 GB of which I&lt;br&gt;
actually get to use after the desktop takes its cut. That is not a lot. A 3B&lt;br&gt;
model in bfloat16 is around 6 GB of weights on its own, so it does not fit, and a&lt;br&gt;
7B does not come close.&lt;/p&gt;

&lt;p&gt;So I did the obvious thing: quantize to 4-bit and see what fits. But instead of&lt;br&gt;
quantizing one model and calling it a day, I ran the same pipeline across three&lt;br&gt;
popular families (Qwen, Microsoft, Meta), benchmarked all three in vLLM, and kept&lt;br&gt;
honest notes on where it worked and where it fell over. One of the four models I&lt;br&gt;
started with never made it, and that failure turned out to be the most useful&lt;br&gt;
part of the exercise.&lt;/p&gt;

&lt;p&gt;All the code, the quantized models, and the raw numbers are linked at the bottom.&lt;/p&gt;
&lt;h2&gt;The plan&lt;/h2&gt;

&lt;p&gt;W4A16 with GPTQ, group size 128, using &lt;code&gt;llmcompressor&lt;/code&gt; to produce&lt;br&gt;
compressed-tensors checkpoints that vLLM loads natively. W4A16 means 4-bit weights&lt;br&gt;
and 16-bit activations. It is GPU-friendly, vLLM serves it with the Marlin kernel,&lt;br&gt;
and it is the format most local-inference folks are already running.&lt;/p&gt;
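
&lt;p&gt;As a rough sketch of what that format costs in storage: the snippet below assumes symmetric quantization with one 16-bit scale per group of 128 weights and no stored zero points. Those are assumptions about the checkpoint layout, not something the script above prints, and the function names are illustrative.&lt;/p&gt;

```python
# Assumption: symmetric W4A16, one 16-bit scale per group of 128 weights,
# no stored zero points. Check the checkpoint's own config before relying on it.
WEIGHT_BITS = 4
SCALE_BITS = 16
GROUP_SIZE = 128


def effective_bits_per_weight(weight_bits=WEIGHT_BITS, scale_bits=SCALE_BITS, group_size=GROUP_SIZE):
    """Average storage cost of one quantized weight, its share of the scale included."""
    return weight_bits + scale_bits / group_size


def ideal_compression_ratio(source_bits=16):
    """Best-case shrink factor if every parameter in the model were quantized."""
    return source_bits / effective_bits_per_weight()


if __name__ == "__main__":
    print(f"{effective_bits_per_weight():.3f} bits per weight")
    print(f"{ideal_compression_ratio():.2f}x ideal ratio against 16-bit weights")
```

&lt;p&gt;Under those assumptions the best case is a little under 4x. The ratios in the benchmark table (2.7x to 3.35x) are lower, plausibly because the recipe ignores &lt;code&gt;lm_head&lt;/code&gt; and the embeddings are not Linear layers, so both stay in 16-bit.&lt;/p&gt;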

&lt;p&gt;The whole pipeline is one script. You give it a model and an output path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llmcompressor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;oneshot&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llmcompressor.modifiers.quantization&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GPTQModifier&lt;/span&gt;

&lt;span class="c1"&gt;# "W4A16" already implies group_size 128. Passing group_size raises
# a pydantic extra_forbidden error, which cost me ten minutes the first time.
&lt;/span&gt;&lt;span class="n"&gt;recipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GPTQModifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Linear&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scheme&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;W4A16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lm_head&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;oneshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# loaded with device_map="cpu"
&lt;/span&gt;    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# 512 ultrachat samples @ 2048 tokens
&lt;/span&gt;    &lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;recipe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_calibration_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model loads onto CPU, and &lt;code&gt;llmcompressor&lt;/code&gt; streams one layer at a time to the&lt;br&gt;
GPU to collect calibration statistics. That keeps peak VRAM small even though the&lt;br&gt;
full model lives in RAM. On my box, Phi-3.5-mini peaked at about 1.6 GB of VRAM&lt;br&gt;
during quantization. The catch is the RAM side: the full bf16 model sits in&lt;br&gt;
memory, and on a 15 GB machine a 3B model already pushes free RAM down to a few&lt;br&gt;
hundred MB. Run these jobs detached with &lt;code&gt;nohup&lt;/code&gt;, because the last time I did not,&lt;br&gt;
calibration memory pressure killed my terminal.&lt;/p&gt;
&lt;h2&gt;First gotcha: the environment was already broken&lt;/h2&gt;

&lt;p&gt;Before any of this ran, the import failed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ModuleNotFoundError: No module named 'compressed_tensors.distributed'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;llmcompressor&lt;/code&gt; 0.12 wants &lt;code&gt;compressed-tensors&lt;/code&gt; 0.17.x, but vLLM 0.22 pins&lt;br&gt;
0.15.0.1, and something had downgraded it back. The fix is to install the newer&lt;br&gt;
one and ignore pip's complaint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"compressed-tensors==0.17.1"&lt;/span&gt;
&lt;span class="c"&gt;# pip warns: vllm 0.22.0 requires compressed-tensors==0.15.0.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pip prints the warning, and vLLM 0.22 still loads the model that 0.17.1 produced.&lt;br&gt;
Both packages import fine. This is worth knowing if you share one venv between&lt;br&gt;
quantizing and serving, which I do.&lt;/p&gt;
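
&lt;p&gt;Since something can quietly downgrade the package again, a small check before each run may help. This is a minimal sketch using only the standard library. The helper names are made up for illustration, and the package list simply mirrors the three distributions named in this section.&lt;/p&gt;

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(packages=("llmcompressor", "compressed-tensors", "vllm")):
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name}: {found or 'not installed'}")
```

&lt;p&gt;Running it at the top of a quantization job makes a silent downgrade visible before the import error does.&lt;/p&gt;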
&lt;h2&gt;The clean runs: Phi-3.5-mini and Llama-3.2-3B&lt;/h2&gt;

&lt;p&gt;Quantizing is one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;HF_HUB_OFFLINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 python quantize.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; microsoft/Phi-3.5-mini-instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out&lt;/span&gt; output/Phi-3.5-mini-instruct-W4A16-G128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;HF_HUB_OFFLINE=1&lt;/code&gt; matters. A transient DNS blip once cost me six minutes of&lt;br&gt;
Hugging Face revalidation retries on a model that was already cached.&lt;/p&gt;

&lt;p&gt;It walks the layers and writes the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(33/33): Propagating: 100%|██████████| 512/512
Compressing model: 100%|██████████| 128/128
Writing model shards: 100%
[done] saved to output/Phi-3.5-mini-instruct-W4A16-G128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phi went from 7.6 GB to 2.27 GB in about 34 minutes. For the Meta model I used the&lt;br&gt;
non-gated &lt;code&gt;unsloth/Llama-3.2-3B-Instruct&lt;/code&gt; mirror so there was no license gate to&lt;br&gt;
click through. It quantized cleanly too, peaking around 2 GB of VRAM, never close&lt;br&gt;
to the edge.&lt;/p&gt;

&lt;p&gt;Two down. Then I tried the 7B.&lt;/p&gt;
&lt;h2&gt;The wall: Qwen2.5-7B will not quantize on 6 GB&lt;/h2&gt;

&lt;p&gt;Same command, &lt;code&gt;Qwen/Qwen2.5-7B-Instruct&lt;/code&gt;. It died on the second layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.34 GiB.
GPU 0 has a total capacity of 5.67 GiB of which 484.69 MiB is free.
...
torch.OutOfMemoryError: Sequential pipeline ran out of memory.
Please consider choosing a smaller module for `sequential_targets`, ex. 'Linear'.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 1.34 GiB allocation is the clue. GPTQ builds a Hessian for each linear layer,&lt;br&gt;
sized by the layer's input dimension squared. Qwen2.5-7B's &lt;code&gt;down_proj&lt;/code&gt; takes an&lt;br&gt;
18944-dimensional input, so its float32 Hessian is 18944 squared times 4 bytes, about&lt;br&gt;
1.44 GB, which is the same 1.34 GiB the traceback reports, and GPTQ has to invert it&lt;br&gt;
on the GPU. The card was already sitting at roughly 3.9 GB before that allocation,&lt;br&gt;
so it tipped over.&lt;/p&gt;
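
&lt;p&gt;The Hessian arithmetic is easy to check for any layer width. A minimal sketch, assuming a square float32 matrix over the layer's input dimension as described above. The helper names are illustrative, and the 8192 width is the Llama-3.2-3B &lt;code&gt;down_proj&lt;/code&gt; input quoted later in this post.&lt;/p&gt;

```python
def hessian_bytes(in_features, bytes_per_element=4):
    """Size of a square Hessian over a layer's input dimension (float32 by default)."""
    return in_features * in_features * bytes_per_element


def as_gb(n_bytes):
    """Decimal gigabytes."""
    return n_bytes / 1e9


def as_gib(n_bytes):
    """Binary gibibytes, the unit PyTorch uses in its OOM messages."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for label, width in (("18944-wide down_proj", 18944), ("8192-wide down_proj", 8192)):
        size = hessian_bytes(width)
        print(f"{label}: {as_gb(size):.2f} GB ({as_gib(size):.2f} GiB)")
```

&lt;p&gt;The 18944-wide case comes out at 1.34 GiB, matching the failed allocation in the traceback. The 8192-wide case is about 268 MB.&lt;/p&gt;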

&lt;p&gt;I tried the obvious levers, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Smaller calibration.&lt;/strong&gt; Dropped from 512 samples at 2048 tokens to 256 at 512,
then to 64 at 512. No change. The GPU stayed pinned at the same 3.9 GB every
time, which told me the resident memory was fixed (embeddings, the live layer)
and not calibration-dependent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;offload_hessians=True&lt;/code&gt;.&lt;/strong&gt; This keeps Hessians on CPU. The GPU ran lower and
the job survived nine minutes on one layer instead of seconds, but it still
OOM'd: the Hessian has to come back to the GPU to be inverted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;expandable_segments:True&lt;/code&gt;.&lt;/strong&gt; Set through &lt;code&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/code&gt;, it reduces
fragmentation in the memory PyTorch has reserved. Helped a little, not enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWQ instead of GPTQ.&lt;/strong&gt; AWQ avoids the giant per-layer Hessian. I was hopeful.
It uses the same sequential pipeline under the hood and OOM'd in exactly the
same place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU only&lt;/strong&gt;, with &lt;code&gt;CUDA_VISIBLE_DEVICES=""&lt;/code&gt;. This works. It has no 6 GB ceiling
because it uses RAM and swap. But it is brutally slow:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   (2/29): Calibrating: 57%|█████▋ | 146/256 [08:42&amp;lt;06:46, 3.70s/it]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At 3.7 seconds per calibration sample and 256 samples per pass, that is roughly&lt;br&gt;
16 minutes per layer, about 10 hours for the full model. For a benchmark, no.&lt;/p&gt;

&lt;p&gt;The honest conclusion: on this card, with this toolchain, GPTQ-on-GPU tops out&lt;br&gt;
around 3 to 4 billion parameters. The 7B's &lt;code&gt;down_proj&lt;/code&gt; Hessian plus the fixed&lt;br&gt;
resident memory does not fit, and no calibration trick changes that. So I swapped&lt;br&gt;
the 7B for Llama-3.2-3B, whose &lt;code&gt;down_proj&lt;/code&gt; input is 8192-dimensional. Its Hessian&lt;br&gt;
is about 268 MB, and it quantized without complaint.&lt;/p&gt;

&lt;p&gt;The wall is the finding. If you have a 6 GB card and you are wondering whether you&lt;br&gt;
can quantize a 7B on it yourself, the answer with &lt;code&gt;llmcompressor&lt;/code&gt;'s sequential&lt;br&gt;
GPTQ is no, not on the GPU. You quantize it elsewhere and download the result.&lt;/p&gt;

&lt;p&gt;(For the record, I did add 24 GB of swap before the CPU attempt, since a 7B in&lt;br&gt;
bf16 is 15 GB and my machine has 15 GB of RAM. On Btrfs the swapfile has to be&lt;br&gt;
NOCOW and fully allocated or &lt;code&gt;swapon&lt;/code&gt; rejects it.)&lt;/p&gt;
&lt;h2&gt;The benchmark&lt;/h2&gt;

&lt;p&gt;Three models, one pipeline, served in vLLM 0.22 with a 4096 max length and&lt;br&gt;
&lt;code&gt;gpu_memory_utilization=0.88&lt;/code&gt;, same settings for all three so the comparison is&lt;br&gt;
fair.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj4906dt2821vztlmeqos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj4906dt2821vztlmeqos.png" alt="BF16 vs W4A16 size, all three cross the 6 GB line only after quantization" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;BF16&lt;/th&gt;
&lt;th&gt;W4A16&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;th&gt;KV budget (tokens)&lt;/th&gt;
&lt;th&gt;Single-stream&lt;/th&gt;
&lt;th&gt;Batched x32&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VibeThinker-3B&lt;/td&gt;
&lt;td&gt;Qwen2.5-3B reasoning&lt;/td&gt;
&lt;td&gt;5.8 GB&lt;/td&gt;
&lt;td&gt;2.07 GB&lt;/td&gt;
&lt;td&gt;2.8x&lt;/td&gt;
&lt;td&gt;53,104&lt;/td&gt;
&lt;td&gt;43.9 tok/s&lt;/td&gt;
&lt;td&gt;186 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3.5-mini&lt;/td&gt;
&lt;td&gt;Microsoft 3.8B&lt;/td&gt;
&lt;td&gt;7.6 GB&lt;/td&gt;
&lt;td&gt;2.27 GB&lt;/td&gt;
&lt;td&gt;3.35x&lt;/td&gt;
&lt;td&gt;4,608&lt;/td&gt;
&lt;td&gt;68.7 tok/s&lt;/td&gt;
&lt;td&gt;1,039 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.2-3B&lt;/td&gt;
&lt;td&gt;Meta 3B&lt;/td&gt;
&lt;td&gt;6.1 GB&lt;/td&gt;
&lt;td&gt;2.26 GB&lt;/td&gt;
&lt;td&gt;2.7x&lt;/td&gt;
&lt;td&gt;16,688&lt;/td&gt;
&lt;td&gt;66.0 tok/s&lt;/td&gt;
&lt;td&gt;1,404 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things jump out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The KV budgets are not close.&lt;/strong&gt; All three land between 2.07 and 2.27 GB on disk, but the&lt;br&gt;
context they can serve on the same card ranges from 4,608 tokens (Phi) to 53,104&lt;br&gt;
(VibeThinker), an 11x spread. That is architecture, not file size. VibeThinker is&lt;br&gt;
a Qwen2.5-3B model with aggressive grouped-query attention, so each token's&lt;br&gt;
key-value state is tiny and the budget is huge. Phi-3.5-mini is bigger (3.8B) with&lt;br&gt;
a heavier KV footprint, so once its weights are loaded there is far less room left&lt;br&gt;
for cache. If you are picking a model to run on a small card, the weight size&lt;br&gt;
tells you whether it loads. The KV geometry tells you how much context you&lt;br&gt;
actually get, and that is the number that bites you later.&lt;/p&gt;
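
&lt;p&gt;The per-token KV cost can be estimated from three numbers in each model's config. The geometry values below (layer count, KV heads, head dimension) are my reading of the public configs and should be treated as assumptions to check against each &lt;code&gt;config.json&lt;/code&gt;. The sketch also assumes a 16-bit KV cache, and the function name is illustrative.&lt;/p&gt;

```python
# Assumed attention geometry as (layers, KV heads, head dim). These come from
# the public model configs as I understand them; verify against config.json.
KV_GEOMETRY = {
    "Qwen2.5-3B (VibeThinker base)": (36, 2, 128),
    "Phi-3.5-mini": (32, 32, 96),
    "Llama-3.2-3B": (28, 8, 128),
}


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element=2):
    """Bytes of key plus value state one token occupies across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


if __name__ == "__main__":
    for name, geometry in KV_GEOMETRY.items():
        kib = kv_bytes_per_token(*geometry) / 1024
        print(f"{name}: {kib:.0f} KiB per token")
```

&lt;p&gt;With those values Phi-3.5-mini needs roughly 10.7 times as many bytes per token as the Qwen2.5-3B geometry, which is in line with the spread in measured KV budgets.&lt;/p&gt;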

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk92wx0ohercvf0ans3yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk92wx0ohercvf0ans3yf.png" alt="Same disk size, an 11x spread in servable context" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capability costs throughput.&lt;/strong&gt; VibeThinker is a reasoning model with 36 layers.&lt;br&gt;
It is the slowest of the three by a wide margin: 186 tok/s batched against Llama's&lt;br&gt;
1,404, about 7.5x lower, and 43.9 tok/s single-stream against 66.0. You pay for the&lt;br&gt;
extra depth on every token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5h8kepqjdef14mnxl3l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5h8kepqjdef14mnxl3l9.png" alt="Single-stream and batched throughput across the three models" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Did 4-bit break the reasoning?&lt;/h3&gt;

&lt;p&gt;This is the question that actually matters. I gave all three the same prompt:&lt;br&gt;
state Fermat's little theorem and use it to compute 3^100 mod 7. The correct&lt;br&gt;
answer is 4. All three got it, with correct working. VibeThinker even shows its&lt;br&gt;
thinking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[VibeThinker] ... 3^100 = (3^6)^16 * 3^4 ≡ 1^16 * 3^4 ≡ 81 ≡ 4 (mod 7).
              final answer: \boxed{4}.
[Phi-3.5]     ... 3^4 = 81, 81 mod 7 = 4. So 3^100 mod 7 = 4.
[Llama-3.2]   ... 3^100 ≡ 1 × 4 ≡ 4 (mod 7). So 3^100 mod 7 is 4.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three also listed the first ten primes correctly. For this class of model and&lt;br&gt;
this kind of task, W4A16 did not cost me anything I could measure. That matches my&lt;br&gt;
earlier experience quantizing VibeThinker, where the 4-bit version still did its&lt;br&gt;
math correctly.&lt;/p&gt;
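
&lt;p&gt;Both reference answers are easy to confirm independently of any model. A minimal sketch with hypothetical helper names: it reduces the exponent using Fermat's little theorem, which assumes the prime does not divide the base, and lists primes by trial division.&lt;/p&gt;

```python
def fermat_power_mod(base, exponent, prime):
    """Compute base**exponent mod prime by reducing the exponent mod (prime - 1).

    Valid only when prime is prime and does not divide base.
    """
    return pow(base, exponent % (prime - 1), prime)


def first_primes(count):
    """First `count` primes, by trial division against the primes found so far."""
    primes = []
    candidate = 2
    while len(primes) != count:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


if __name__ == "__main__":
    print(fermat_power_mod(3, 100, 7))
    print(first_primes(10))
```

&lt;p&gt;The reduction mirrors the models' working: 100 mod 6 is 4, and 3^4 = 81 leaves remainder 4 when divided by 7.&lt;/p&gt;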

&lt;p&gt;One caveat worth stating: this is a spot check, not MMLU. Quantization does lose&lt;br&gt;
accuracy in general, and I have seen reduced precision bite before (an fp8 KV cache&lt;br&gt;
once botched a "list the first 10 primes" prompt that bf16 got right). I am reporting&lt;br&gt;
that these specific models held up on these specific prompts, nothing stronger.&lt;/p&gt;

&lt;h2&gt;What I would tell someone starting this&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;One script generalizes across families. Qwen, Phi, and Llama all quantize with
the identical &lt;code&gt;GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])&lt;/code&gt;
recipe. The architecture differences show up at serving time, not quantizing time.&lt;/li&gt;
&lt;li&gt;On a 6 GB card, plan for roughly a 3x size reduction and a 3 to 4 billion
parameter ceiling for GPU quantization.&lt;/li&gt;
&lt;li&gt;Read the KV budget, not just the file size. It varies more than the weights do.&lt;/li&gt;
&lt;li&gt;Quantization is the memory-hungry step, not serving. Run it detached, and for
models under that ceiling watch your RAM, not your VRAM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is one &lt;code&gt;quantize.py&lt;/code&gt; and one &lt;code&gt;bench.py&lt;/code&gt;, both linked below. The three&lt;br&gt;
models are on the Hub. If you have a small card and a model that will not fit, the&lt;br&gt;
path is short, and now you know exactly where it ends.&lt;/p&gt;

&lt;h2&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Models on the Hub:

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/syedazeez/VibeThinker-3B-W4A16-G128" rel="noopener noreferrer"&gt;syedazeez/VibeThinker-3B-W4A16-G128&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/syedazeez/Phi-3.5-mini-instruct-W4A16-G128" rel="noopener noreferrer"&gt;syedazeez/Phi-3.5-mini-instruct-W4A16-G128&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/syedazeez/Llama-3.2-3B-Instruct-W4A16-G128" rel="noopener noreferrer"&gt;syedazeez/Llama-3.2-3B-Instruct-W4A16-G128&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Code: the &lt;code&gt;quantize.py&lt;/code&gt; and &lt;code&gt;bench.py&lt;/code&gt; pipeline, &lt;a href="https://github.com/syedazeez337/vllm-lab" rel="noopener noreferrer"&gt;github.com/syedazeez337/vllm-lab&lt;/a&gt; (&lt;code&gt;quant-lab/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Hardware: RTX 3050 Laptop, 6 GB, vLLM 0.22, torch 2.11 + CUDA 13, Python 3.12&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
