In Part 2, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to Qwen2.5-7B — and hit a wall that LoRA alone doesn't solve.
The problem
A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. The weights would barely load, leaving no room for the activations, gradients, and optimizer state that training needs.
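The arithmetic behind that number is easy to sketch. The snippet below counts weights only and assumes roughly 7.6 billion parameters for a "7B" model; that count is an assumption chosen to match the ~15GB figure, and `weight_gb` is a hypothetical helper, not part of any library.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7.6e9  # assumption: approximate parameter count of a "7B" model

# Compare storage cost at a few common precisions
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_gb(N_PARAMS, bits):.1f} GB")
```

Under that assumption the weights come to about 15.2 GB at 16-bit and about 3.8 GB at a flat 4 bits, before any training overhead.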
The QLoRA insight
QLoRA asks the question that naturally follows from LoRA: the base is frozen and only ever read — so why store it in full precision?
So you quantize the frozen base to 4-bit (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit storage for the frozen base; the LoRA adapters stay in higher precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_use_double_quant=True,         # quantize the quant constants too
    bnb_4bit_compute_dtype=torch.float16,   # dequantize to fp16 for the matmuls
)
# MODEL_ID holds the Hub ID of the Qwen2.5-7B checkpoint
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
Each flag earns its place:
- load_in_4bit: store frozen weights in 4 bits instead of 16.
- nf4: a 4-bit type matched to the bell-curve distribution of neural-net weights (better than plain int4).
- double_quant: quantize the quantization constants too, for a bit more savings.
- compute_dtype: dequantize to fp16 for the actual matmuls, so storage is 4-bit but compute stays precise.
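To make the "4-bit storage, higher-precision compute" split concrete, here is a toy sketch of block-wise quantization in plain Python. It is a simplified stand-in, not what bitsandbytes actually does: the 16-level codebook is built from evenly spaced normal quantiles (the real NF4 table differs), and the function names and sample weights are made up for illustration.

```python
from statistics import NormalDist

def normal_codebook(levels: int = 16) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].

    Simplified stand-in for NF4; NOT the exact table bitsandbytes uses.
    """
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, codebook):
    """Store one block as small integer indices plus a single absmax scale."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / scale))
        for w in block
    ]
    return idx, scale

def dequantize_block(idx, scale, codebook):
    """Recover approximate float weights just before they are used in a matmul."""
    return [codebook[i] * scale for i in idx]

codebook = normal_codebook()
weights = [0.02, -0.11, 0.05, 0.30, -0.01, 0.00, -0.27, 0.08]  # made-up values
idx, scale = quantize_block(weights, codebook)
restored = dequantize_block(idx, scale, codebook)
print(idx)                              # each index fits in 4 bits (0..15)
print([round(w, 3) for w in restored])  # close to, but not equal to, the originals
```

Because the levels are denser near zero, small weights (the common case under a bell curve) are reconstructed more finely than they would be on an evenly spaced int4 grid.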
The moment it clicked
One line of output:
loaded in 4-bit. footprint: 5.44 GB
I downloaded 15.2GB of weights and they sat in memory as 5.44GB. A model that would barely load in 16-bit now fit on a single free-tier GPU, with room to spare for training. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)
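As a rough sanity check, the per-weight cost of block-wise 4-bit storage can be worked out by hand. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block) and an assumed 7.6e9 parameters; the helper names are hypothetical, and the actual bitsandbytes layout may differ in detail.

```python
def bits_per_param(block=64, const_bits=32, double_quant=False, block2=256):
    """Average stored bits per weight for block-wise 4-bit quantization."""
    if not double_quant:
        # 4 bits per weight plus one const_bits scale shared by each block
        return 4 + const_bits / block
    # Scales are themselves stored in 8 bits, with one const_bits
    # second-level constant shared by every block2 of them
    return 4 + 8 / block + const_bits / (block * block2)

def quantized_gb(n_params, bpp):
    """Storage in GB (1e9 bytes) for n_params weights at bpp bits each."""
    return n_params * bpp / 8 / 1e9

plain = bits_per_param()
dq = bits_per_param(double_quant=True)
print(f"without double quant: {plain:.3f} bits/weight")
print(f"with double quant:    {dq:.3f} bits/weight")
print(f"7.6e9 weights at {dq:.3f} bits: {quantized_gb(7.6e9, dq):.2f} GB")
```

If every weight were stored this way the total would land near 3.9 GB. The reported 5.44 GB is higher, which would be consistent with some tensors (for example embeddings or the output head) staying in 16-bit, though the exact split is not broken down here.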
The QLoRA-standard recipe
Three more pieces beyond Part 2's LoRA setup: prepare the quantized model for training, target all linear layers (the QLoRA paper found this matters), and use a paged 8-bit optimizer:
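from peft import prepare_model_for_kbit_training
from transformers import TrainingArguments

# Make the quantized model safe to train and turn on gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# ... attach LoRA to every linear layer ...
training_args = TrainingArguments(
    optim="paged_adamw_8bit",      # paged 8-bit AdamW
    gradient_checkpointing=True,   # recompute activations to save memory
    # ... other arguments elided ...
)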
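Targeting every linear layer costs more trainable parameters than adapting a couple of attention projections, but the total stays small. The sketch below counts LoRA parameters for one transformer block. The layer shapes are round, hypothetical numbers rather than values read from Qwen2.5-7B, the rank is an assumed value, and the module names follow the common Llama-style naming, which is also an assumption.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) beside a frozen d_out x d_in weight."""
    return r * (d_in + d_out)

# Hypothetical (d_in, d_out) shapes for one block; illustrative round numbers
HIDDEN, KV, FFN = 4096, 1024, 11008
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}
ATTENTION_ONLY = ("q_proj", "v_proj")

def adapter_params_per_block(names, r: int = 16) -> int:
    """Sum LoRA parameters over the named linear layers of one block."""
    return sum(lora_params(*SHAPES[n], r) for n in names)

attn_only = adapter_params_per_block(ATTENTION_ONLY)
all_linear = adapter_params_per_block(SHAPES)
frozen = sum(d_in * d_out for d_in, d_out in SHAPES.values())
print(f"attention-only adapters: {attn_only:,}")
print(f"all-linear adapters:     {all_linear:,}")
print(f"share of frozen block:   {all_linear / frozen:.2%}")
```

With these made-up shapes, adapting all seven projections adds several times more trainable parameters than the attention-only choice, yet still well under 1% of the frozen block.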
It's slow — and that's fine
A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But QLoRA isn't about speed — it's about fit. The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.
⚠️ Hardware note: bitsandbytes 4-bit is CUDA-first. It does not run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).
Result
QLoRA accuracy: 92.848% (4-bit base was 16.000%)
macro-F1: 0.928
It roughly tied the smaller models from Parts 1 and 2.
And the card_arrival vs card_delivery_estimate confusion that haunted both smaller models? [Say what happened at 7B — did it finally fix it, or hit the same wall?] Either way, it sets up the question I tackle in Part 4: if the 270M model already worked, why did I build any of this?
📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b
Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.