Quantizing three models to fit a 6 GB laptop GPU, and the one that wouldn't

#ai #machinelearning #llm #python

My GPU is an RTX 3050 Laptop card. It has 6 GB of VRAM, about 5.67 GB of which I
actually get to use after the desktop takes its cut. That is not a lot. A 3B
model in bfloat16 is around 6 GB of weights on its own, so it does not fit, and a
7B does not come close.

So I did the obvious thing: quantize to 4-bit and see what fits. But instead of
quantizing one model and calling it a day, I ran the same pipeline across three
popular families (Qwen, Microsoft, Meta), benchmarked all three in vLLM, and kept
honest notes on where it worked and where it fell over. One of the four models I
started with never made it, and that failure turned out to be the most useful
part of the exercise.

All the code, the quantized models, and the raw numbers are linked at the bottom.

The plan

W4A16 with GPTQ, group size 128, using llmcompressor to produce
compressed-tensors checkpoints that vLLM loads natively. W4A16 means 4-bit weights
and 16-bit activations. It is GPU-friendly, vLLM serves it with the Marlin kernel,
and it is the format most local-inference folks are already running.

The whole pipeline is one script. You give it a model and an output path:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# "W4A16" already implies group_size 128. Passing group_size raises
# a pydantic extra_forbidden error, which cost me ten minutes the first time.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,            # loaded with device_map="cpu"
    dataset=ds,             # 512 ultrachat samples @ 2048 tokens
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir=out,
)

The model loads onto CPU, and llmcompressor streams one layer at a time to the
GPU to collect calibration statistics. That keeps peak VRAM small even though the
full model lives in RAM. On my box, Phi-3.5-mini peaked at about 1.6 GB of VRAM
during quantization. The catch is the RAM side: the full bf16 model sits in
memory, and on a 15 GB machine a 3B model already pushes free RAM down to a few
hundred MB. Run these jobs detached with nohup, because the last time I did not,
calibration memory pressure killed my terminal.
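If you would rather launch the job from Python than from a shell, the same detach-and-log idea can be sketched as below. This is a minimal sketch, not part of quantize.py: the helper name `launch_detached` and the log file name are hypothetical, and `start_new_session=True` is a POSIX-only stand-in for what nohup does in the shell.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in its own session, sending stdout and stderr to log_path.

    Hypothetical helper: start_new_session=True calls setsid() in the child
    (POSIX only), so closing the launching terminal does not hang it up.
    """
    with open(log_path, "ab") as log:
        # The child gets its own copy of the log handle, so the parent can
        # close its copy as soon as the process has been spawned.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )


# Example call (arguments are placeholders for the real invocation):
# proc = launch_detached(["python", "quantize.py", "--model", "...", "--out", "..."], "quantize.log")
```

The returned `Popen` object carries the pid, so you can check on the job later or follow the log file from another terminal.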

First gotcha: the environment was already broken

Before any of this ran, the import failed:

ModuleNotFoundError: No module named 'compressed_tensors.distributed'

llmcompressor 0.12 wants compressed-tensors 0.17.x, but vLLM 0.22 pins
0.15.0.1, and something had downgraded it back. The fix is to install the newer
one and ignore pip's complaint:

pip install "compressed-tensors==0.17.1"
# pip warns: vllm 0.22.0 requires compressed-tensors==0.15.0.1

pip prints the warning, and vLLM then loads the 0.17.1-produced model anyway. Both
packages import fine. This is worth knowing if you share one venv between
quantizing and serving, which I do.
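A quick pre-flight check can catch a silent downgrade before a long job starts. The sketch below uses only the standard library. The helper names are hypothetical, the "0.17" minimum is taken from the version mismatch described above, and the comparison is deliberately naive (numeric parts only).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numeric_prefix(version):
    """Turn '0.15.0.1' into (0, 15, 0, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Naive comparison on the numeric parts only; ignores pre-release tags."""
    return numeric_prefix(version) >= numeric_prefix(minimum)


if __name__ == "__main__":
    found = installed_version("compressed-tensors")
    if found is None:
        print("compressed-tensors is not installed")
    elif not meets_minimum(found, "0.17"):
        print(f"compressed-tensors {found} is older than 0.17; llmcompressor may fail to import")
    else:
        print(f"compressed-tensors {found} looks new enough")
```

Running it at the top of a quantization job costs nothing and fails fast if another install has pulled the package back down.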

The clean runs: Phi-3.5-mini and Llama-3.2-3B

Quantizing is one command:

HF_HUB_OFFLINE=1 python quantize.py \
  --model microsoft/Phi-3.5-mini-instruct \
  --out output/Phi-3.5-mini-instruct-W4A16-G128

HF_HUB_OFFLINE=1 matters. A transient DNS blip once cost me six minutes of
Hugging Face revalidation retries on a model that was already cached.

It walks the layers and writes the result:

(33/33): Propagating: 100%|██████████| 512/512
Compressing model: 100%|██████████| 128/128
Writing model shards: 100%
[done] saved to output/Phi-3.5-mini-instruct-W4A16-G128

Phi went from 7.6 GB to 2.2 GB in about 34 minutes. For the Meta model I used the
non-gated unsloth/Llama-3.2-3B-Instruct mirror so there was no license gate to
click through. It quantized cleanly too, peaking around 2 GB of VRAM, never close
to the edge.

Two down. Then I tried the 7B.

The wall: Qwen2.5-7B will not quantize on 6 GB

Same command, Qwen/Qwen2.5-7B-Instruct. It died on the second layer:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.34 GiB.
GPU 0 has a total capacity of 5.67 GiB of which 484.69 MiB is free.
...
torch.OutOfMemoryError: Sequential pipeline ran out of memory.
Please consider choosing a smaller module for `sequential_targets`, ex. 'Linear'.

That 1.34 GiB allocation is the clue. GPTQ builds a Hessian for each linear layer,
sized by the layer's input dimension squared. Qwen2.5-7B's down_proj takes an
18944-dimensional input, so its Hessian is 18944 squared times 4 bytes, about
1.4 GB, and GPTQ has to invert it on the GPU. The card was already sitting at
roughly 3.9 GB before that allocation, so it tipped over.
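The arithmetic is easy to check. This is a minimal sketch that assumes a square float32 Hessian and uses the input width and free-memory figure quoted above; the function name is hypothetical.

```python
def hessian_bytes(in_features, bytes_per_element=4):
    """Memory for one square float32 Hessian of side in_features."""
    return in_features * in_features * bytes_per_element


GIB = 1024 ** 3
MIB = 1024 ** 2

qwen_down_proj_in = 18944   # down_proj input width quoted above
free_mib = 484.69           # free VRAM reported in the OOM message

need = hessian_bytes(qwen_down_proj_in)
print(f"Hessian: {need / 1e9:.2f} GB = {need / GIB:.2f} GiB")
print(f"Free at failure: {free_mib:.0f} MiB, short by {need / MIB - free_mib:.0f} MiB")
```

The GiB figure lines up with the 1.34 GiB allocation in the error message, which is why that number is the giveaway.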

I tried the obvious levers, in order:

1. Smaller calibration. Dropped from 512 samples at 2048 tokens to 256 at 512, then to 64 at 512. No change. The GPU stayed pinned at the same 3.9 GB every time, which told me the resident memory was fixed (embeddings, the live layer) and not calibration-dependent.
2. offload_hessians=True. This keeps Hessians on CPU. The GPU ran lower and the job survived nine minutes on one layer instead of seconds, but it still OOM'd: the Hessian has to come back to the GPU to be inverted.
3. expandable_segments:True. Reclaims fragmented reserve memory. Helped a little, not enough.
4. AWQ instead of GPTQ. AWQ avoids the giant per-layer Hessian. I was hopeful. It uses the same sequential pipeline under the hood and OOM'd in exactly the same place.
5. CPU only, with CUDA_VISIBLE_DEVICES="". This works. It has no 6 GB ceiling because it uses RAM and swap. But it is brutally slow:

   (2/29): Calibrating: 57%|█████▋ | 146/256 [08:42<06:46, 3.70s/it]

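At 3.7 seconds per calibration sample, that is roughly 16 minutes per layer just
to calibrate. Across 29 stages that is close to 8 hours before the quantization
work itself, so about 10 hours for the full model. For a benchmark, no.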

The honest conclusion: on this card, with this toolchain, GPTQ-on-GPU tops out
around 3 to 4 billion parameters. The 7B's down_proj Hessian plus the fixed
resident memory does not fit, and no calibration trick changes that. So I swapped
the 7B for Llama-3.2-3B, whose down_proj input is 8192-dimensional. Its Hessian
is about 268 MB, and it quantized without complaint.

The wall is the finding. If you have a 6 GB card and you are wondering whether you
can quantize a 7B on it yourself, the answer with llmcompressor's sequential
GPTQ is no, not on the GPU. You quantize it elsewhere and download the result.

(For the record, I did add 24 GB of swap before the CPU attempt, since a 7B in
bf16 is 15 GB and my machine has 15 GB of RAM. On Btrfs the swapfile has to be
NOCOW and fully allocated or swapon rejects it.)

The benchmark

Three models, one pipeline, served in vLLM 0.22 with a 4096 max length and
gpu_memory_utilization=0.88, same settings for all three so the comparison is
fair.

Model	Family	BF16 size	W4A16 size	Compression ratio	KV budget (tokens)	Single-stream throughput	Batched x32 throughput
VibeThinker-3B	Qwen2.5-3B reasoning	5.8 GB	2.07 GB	2.8x	53,104	43.9 tok/s	186 tok/s
Phi-3.5-mini	Microsoft 3.8B	7.6 GB	2.27 GB	3.35x	4,608	68.7 tok/s	1,039 tok/s
Llama-3.2-3B	Meta 3B	6.1 GB	2.26 GB	2.7x	16,688	66.0 tok/s	1,404 tok/s

Two things jump out.

The KV budgets are not close. All three are about 2.2 GB on disk, but the
context they can serve on the same card ranges from 4,608 tokens (Phi) to 53,104
(VibeThinker), roughly an 11.5x spread. That is architecture, not file size.
VibeThinker is a Qwen2.5-3B model with aggressive grouped-query attention, so each
token's key-value state is tiny and the budget is huge. Phi-3.5-mini is bigger
(3.8B) with a much heavier per-token KV footprint, so the VRAM left over after its
weights load buys far fewer tokens of cache. If you are picking a model to run on
a small card, the weight size tells you whether it loads. The KV geometry tells
you how much context you actually get, and that is the number that bites you
later.
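To make "KV geometry" concrete, here is a rough sketch of the per-token cache cost. The layer counts, KV head counts, and head dimensions are my assumptions from the models' public config.json files, not numbers from the benchmark, and the 2 bytes per value assumes a 16-bit KV cache. Check both against your own checkpoints and serving settings.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: keys and values, every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# (layers, kv_heads, head_dim): assumed from the public config.json files,
# not taken from the benchmark. Verify against your own checkpoints.
GEOMETRY = {
    "VibeThinker-3B (Qwen2.5-3B)": (36, 2, 128),
    "Llama-3.2-3B": (28, 8, 128),
    "Phi-3.5-mini": (32, 32, 96),
}

# KV budgets in tokens, copied from the table above.
BUDGET_TOKENS = {
    "VibeThinker-3B (Qwen2.5-3B)": 53_104,
    "Llama-3.2-3B": 16_688,
    "Phi-3.5-mini": 4_608,
}

for name, geometry in GEOMETRY.items():
    per_token = kv_bytes_per_token(*geometry)
    cache_gib = per_token * BUDGET_TOKENS[name] / 1024 ** 3
    print(f"{name}: {per_token / 1024:.0f} KiB per token, about {cache_gib:.2f} GiB of cache")
```

Under those assumptions the three budgets imply cache sizes within roughly 10% of each other, which would put most of the spread in tokens down to per-token cost rather than leftover memory.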

Capability costs throughput. VibeThinker is a reasoning model with 36 layers.
It is the slowest of the three by a wide margin, 186 tok/s batched against Llama's
1,404. You pay for the extra depth on every token.

Did 4-bit break the reasoning?

This is the question that actually matters. I gave all three the same prompt:
state Fermat's little theorem and use it to compute 3^100 mod 7. The correct
answer is 4. All three got it, with correct working. VibeThinker even shows its
thinking:

[VibeThinker] ... 3^100 = (3^6)^16 * 3^4 ≡ 1^16 * 3^4 ≡ 81 ≡ 4 (mod 7).
              final answer: \boxed{4}.
[Phi-3.5]     ... 3^4 = 81, 81 mod 7 = 4. So 3^100 mod 7 = 4.
[Llama-3.2]   ... 3^100 ≡ 1 × 4 ≡ 4 (mod 7). So 3^100 mod 7 is 4.

All three also listed the first ten primes correctly. For this class of model and
this kind of task, W4A16 did not cost me anything I could measure. That matches my
earlier experience quantizing VibeThinker, where the 4-bit version still did its
math correctly.
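The reference answers themselves are cheap to verify independently of any model. This is a minimal sketch using Python's built-in three-argument `pow` for modular exponentiation; `first_primes` is a hypothetical helper written for this check.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # candidate is prime if no smaller prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Fermat's little theorem: 3^(7-1) ≡ 1 (mod 7), since 7 is prime and does not divide 3.
assert pow(3, 6, 7) == 1
# 100 = 6 * 16 + 4, so 3^100 ≡ 3^4 (mod 7).
assert pow(3, 100, 7) == pow(3, 4, 7) == 4

print(first_primes(10))
```

Having a ground truth like this on hand makes it easier to spot the moment a quantized model starts drifting.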

One caveat worth stating: this is a spot check, not MMLU. 4-bit quantization does
lose accuracy in general, and I have seen reduced precision bite before (an fp8
KV cache once botched a "list the first 10 primes" prompt that bf16 got right). I
am reporting that these specific models held up on these specific prompts,
nothing stronger.

What I would tell someone starting this

- One script generalizes across families. Qwen, Phi, and Llama all quantize with the identical GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]) recipe. The architecture differences show up at serving time, not quantizing time.
- On a 6 GB card, plan for roughly a 3x size reduction and a 3 to 4 billion parameter ceiling for GPU quantization.
- Read the KV budget, not just the file size. It varies more than the weights do.
- Quantization is the memory-hungry step, not serving. Run it detached, and watch your RAM, not your VRAM.

The code is one quantize.py and one bench.py, both linked below. The three
models are on the Hub. If you have a small card and a model that will not fit, the
path is short, and now you know exactly where it ends.