QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)

#ai #machinelearning #python #llm

In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.

The problem

A 7B model is ~15GB in 16-bit precision (about 2 bytes per parameter). A free-tier T4 GPU has 16GB. The weights alone would barely load, leaving no room for the activations, gradients, and optimizer state that training needs.
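The arithmetic behind those numbers is just parameters times bytes per parameter. Here is a minimal sketch, assuming round parameter counts (1.5B for the Part 2 model, 7.6B for the 7B model) and counting weights only. `weight_gb` is a hypothetical helper for illustration, not a library function.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed round parameter counts; the exact counts differ slightly.
MODELS = {"1.5B (Part 2)": 1.5e9, "7B (this post)": 7.6e9}

for name, n_params in MODELS.items():
    print(f"{name}: {weight_gb(n_params, 16):.1f} GB in 16-bit")
```

Anything training needs beyond the weights comes on top of this, which is why roughly 15GB of weights on a 16GB card is a dead end.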

The QLoRA insight

QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?

So you quantize the frozen base to 4-bit (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-7B"  # assumed Hub id; use the checkpoint from the notebook

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4
    bnb_4bit_use_double_quant=True,        # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to fp16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto")

Each flag earns its place:

load_in_4bit — store frozen weights in 4 bits instead of 16.
nf4 — a 4-bit type matched to the bell-curve distribution of neural-net weights (better than plain int4).
double_quant — quantize the quantization constants too, for a bit more savings.
compute_dtype — dequantize to fp16 for the actual matmuls, so storage is 4-bit but compute stays precise.
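To make "store in 4 bits, dequantize for compute" concrete, here is a toy sketch of blockwise quantization in pure Python. It is an illustration under stated assumptions, not the bitsandbytes implementation. The 16-level codebook is built from evenly spaced normal quantiles as a stand-in for NF4 (the real NF4 codebook differs), the block size of 64 is assumed, and all function names are hypothetical.

```python
import random
from statistics import NormalDist

def normal_codebook(levels: int = 16) -> list[float]:
    """Levels at evenly spaced normal quantiles, rescaled to [-1, 1].

    Toy stand-in for NF4; NOT the exact bitsandbytes codebook.
    """
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, codebook):
    """Return (codes, absmax): one 4-bit code per weight plus one scale per block."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
        for w in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, codebook):
    """Map codes back to approximate weights, as done before each matmul."""
    return [codebook[c] * absmax for c in codes]

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # fake weights, assumed block size
codebook = normal_codebook()
codes, absmax = quantize_block(block, codebook)
restored = dequantize_block(codes, absmax, codebook)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"codes use {len(set(codes))} of {len(codebook)} levels, max error {max_err:.5f}")
```

The per-block `absmax` values are the "quantization constants" that double quantization compresses further. Placing levels at normal quantiles puts more of them near zero, where most weights sit.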

The moment it clicked

One line of output:

loaded in 4-bit. footprint: 5.44 GB

I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that would barely load in 16-bit now fit on a single free-tier T4 with roughly 10GB to spare for training. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)
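Why 5.44GB and not simply 15.2 / 4 = 3.8GB? A rough sketch of the accounting, assuming that only the linear layers are quantized while the embedding and output-head weights stay in 16-bit. The parameter counts below are assumed figures for Qwen2.5-7B and should be checked against the model card. The constant overheads are the per-parameter figures given in the QLoRA paper.

```python
def gb(n_params: float, bits_per_param: float) -> float:
    """Storage in GB (1 GB = 1e9 bytes) for n_params at the given bit width."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed parameter counts for Qwen2.5-7B; verify against the model card.
TOTAL_PARAMS = 7.61e9
NON_EMBEDDING_PARAMS = 6.53e9  # assumed to be the part that gets quantized
EMBEDDING_PARAMS = TOTAL_PARAMS - NON_EMBEDDING_PARAMS  # assumed to stay 16-bit

# Overhead of quantization constants in bits per parameter (QLoRA paper figures).
CONST_BITS_SINGLE = 0.5    # one 32-bit constant per block of 64 weights
CONST_BITS_DOUBLE = 0.127  # after double quantization

fp16_total = gb(TOTAL_PARAMS, 16)
naive_quarter = fp16_total / 4
estimate = (
    gb(EMBEDDING_PARAMS, 16)
    + gb(NON_EMBEDDING_PARAMS, 4)
    + gb(NON_EMBEDDING_PARAMS, CONST_BITS_DOUBLE)
)
double_quant_saving = gb(NON_EMBEDDING_PARAMS, CONST_BITS_SINGLE - CONST_BITS_DOUBLE)

print(f"16-bit weights:         {fp16_total:.2f} GB")
print(f"naive 16-bit / 4:       {naive_quarter:.2f} GB")
print(f"estimated 4-bit load:   {estimate:.2f} GB")
print(f"saved by double quant:  {double_quant_saving:.2f} GB")
```

Under these assumptions the estimate lands near 5.5GB, in the same ballpark as the reported footprint. Whether a reported number includes the quantization constants depends on how it is measured.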

The QLoRA-standard recipe

Three more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, target all linear layers (the QLoRA paper found this matters), and use a paged 8-bit optimizer:

from peft import prepare_model_for_kbit_training
from transformers import TrainingArguments

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# ... attach LoRA to every linear layer ...
training_args = TrainingArguments(
    output_dir="qlora-out",  # placeholder path; other arguments elided
    optim="paged_adamw_8bit", gradient_checkpointing=True)
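For the elided "attach LoRA" step, one way to target every linear layer with PEFT is the `"all-linear"` shorthand for `target_modules`. This is a sketch, not necessarily what the notebook does. The rank, alpha, and dropout values are hypothetical placeholders, `attach_lora` is a hypothetical helper, and the shorthand needs a reasonably recent PEFT release.

```python
from peft import LoraConfig, get_peft_model

def attach_lora(model):
    """Wrap a prepared 4-bit model with LoRA adapters on all linear layers."""
    lora_config = LoraConfig(
        r=16,                         # hypothetical rank, not from the notebook
        lora_alpha=32,                # hypothetical scaling factor
        lora_dropout=0.05,            # hypothetical dropout
        target_modules="all-linear",  # every linear layer except the output head
        task_type="CAUSAL_LM",
    )
    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()  # sanity check: should be a small fraction
    return peft_model
```

Call it as `model = attach_lora(model)` after `prepare_model_for_kbit_training`, so the adapters sit on top of the already-prepared base.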

It's slow — and that's fine

Training a 7B model through 4-bit weights is heavy: every matmul dequantizes its weights first, and gradient checkpointing recomputes activations during the backward pass. On a T4 that came to ~1 hour for one epoch, ~3 examples/second. But QLoRA isn't about speed; it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.

⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It does not run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).

Result

QLoRA accuracy: 92.848% (4-bit base was 16.000%)
macro-F1: 0.928

It roughly tied the smaller models from Parts 1 and 2.

And the card_arrival vs card_delivery_estimate confusion that haunted both smaller models? [Say what happened at 7B — did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in Part 4: if the 270M model already worked, why did I build any of this?

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b

Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.