In Part 1 I fully fine-tuned a 270M model, updating every weight. That's fine for a tiny model. It gets painful as models grow, because full fine-tuning needs a gradient and optimizer state for every parameter: with an Adam-style optimizer, the weights, gradients, and two moment buffers add up to ~4× the model size in memory, before counting activations.
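To put rough numbers on that ~4× figure, here is a back-of-the-envelope sketch. It assumes fp32 values and an Adam-style optimizer with two moment buffers per parameter, and it ignores activations entirely. The function name is mine, not from any library.

```python
def full_finetune_gib(n_params, bytes_per_value=4):
    """Rough training memory for full fine-tuning, ignoring activations.

    Assumes fp32 values and an Adam-style optimizer (two moment buffers).
    """
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value        # one gradient per weight
    adam_state = 2 * n_params * bytes_per_value   # first and second moments
    return (weights + gradients + adam_state) / 1024**3


for name, n_params in [("270M", 270e6), ("1.5B", 1.5e9), ("7B", 7e9)]:
    print(f"{name}: ~{full_finetune_gib(n_params):.1f} GiB before activations")
```

Under these assumptions the 270M model needs roughly 4 GiB and a 1.5B model roughly 22 GiB, which is already more than a 16GB GPU can hold.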
So: what do you do when the model is too big to comfortably fine-tune all of?
The idea behind LoRA
LoRA (Low-Rank Adaptation) rests on one observation: the change fine-tuning makes to a weight matrix is "low rank" — it lives in a small subspace. You don't need to learn the full update ΔW; you can learn it as the product of two skinny matrices, B·A:
output = W·x + (B·A)·x
(W is frozen; B·A is the tiny trainable part)
For a single 1536×1536 layer at rank 16, that's about 49,000 trainable numbers instead of ~2.4 million. And you freeze the entire base model — only the adapters train. B starts at zero, so at step 0 the model behaves exactly like the original and training nudges it from there.
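The arithmetic and the "B starts at zero" behaviour can be checked with a small sketch. This is plain Python on a toy 3×3 layer, not how PEFT implements it; the helper names and the toy values are made up for illustration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable values in one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


def matvec(matrix, vector):
    """Plain-Python matrix-vector product (matrix is a list of rows)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """output = W·x + B·(A·x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


print(lora_param_count(1536, 1536, 16))  # 49152
print(1536 * 1536)                       # 2359296

# Toy 3x3 layer at rank 1: B starts at zero, A at small non-zero values.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]]
A = [[0.1, -0.2, 0.3]]       # rank x d_in
B = [[0.0], [0.0], [0.0]]    # d_out x rank, zero-initialised
x = [1.0, 1.0, 1.0]

# At step 0 the adapter contributes nothing, so the layer matches the base.
assert lora_forward(W, A, B, x) == matvec(W, x)
```

Computing B·(A·x) instead of forming B·A first is a common way to write it, since the full-size update matrix never has to be materialised.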
The config
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank: adapter capacity
    lora_alpha=32,       # scaling; effective scale = alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# -> trainable params are ~1% of the model. The other 99% is frozen.
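Where does that ~1% come from? A rough count, assuming the layer shapes from the published Qwen2.5-1.5B config (hidden size 1536, MLP width 8960, 28 blocks, grouped-query attention with a 256-wide key/value projection). Those shapes are my assumption and are not stated in this post, so treat the result as a sanity check, not a measurement.

```python
# Assumed shapes for Qwen2.5-1.5B (from its published config, not from this post).
HIDDEN = 1536          # hidden_size
INTERMEDIATE = 8960    # intermediate_size (MLP width)
KV_DIM = 256           # 2 key/value heads x head_dim 128
NUM_LAYERS = 28        # num_hidden_layers
RANK = 16
APPROX_BASE_PARAMS = 1.54e9  # approximate size of the frozen base model

# (in_features, out_features) of each targeted module in one transformer block.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    """A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


per_block = sum(lora_params(d_in, d_out) for d_in, d_out in TARGETS.values())
total = per_block * NUM_LAYERS
share = total / (APPROX_BASE_PARAMS + total)
print(f"{total:,} adapter parameters, about {share:.1%} of the model")
```

With these shapes the adapters come to about 18.5M parameters, a little over 1% of the model, which lines up with the ~1% noted above.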
I ran this on Qwen2.5-1.5B-Instruct — 5× bigger than the Gemma model from Part 1. Same Banking77 task. Then the GPU fought back.
Wall #1: ValueError: Attempting to unscale FP16 gradients
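I'd loaded the model in fp16 to save memory. Wrong move: the trainable weights then produce fp16 gradients, and the gradient scaler used for fp16 mixed precision refuses to unscale those. The optimizer needs fp32 master weights; mixed precision is applied at train time by the trainer, not baked into the loaded weights.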
import torch
from transformers import AutoModelForCausalLM

# load weights in fp32; let the Trainer's AMP do fp16 during training
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
# and set fp16=True in TrainingArguments (on CUDA) for the mixed-precision part
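Why do master weights need fp32 at all? A toy illustration using only the standard library: it rounds a single weight to half or single precision after every update. The update size is hypothetical, and this shows the precision problem in isolation; it is not what the Trainer does internally, nor the exact check that raises the error.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


update = 1e-4            # hypothetical lr * gradient for one weight
w_fp16 = to_fp16(1.0)    # weight stored in half precision
w_fp32 = to_fp32(1.0)    # fp32 master copy of the same weight

for _ in range(100):
    w_fp16 = to_fp16(w_fp16 + update)   # rounds straight back to 1.0
    w_fp32 = to_fp32(w_fp32 + update)   # keeps the small step

print(w_fp16)            # 1.0: all 100 updates were lost
print(round(w_fp32, 4))  # 1.01
```

Near 1.0, neighbouring fp16 values are about 0.001 apart, so a step of 0.0001 disappears in rounding. Keeping the weights in fp32 and using fp16 only for the forward and backward compute avoids that.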
Wall #2: CUDA out of memory at batch size 64
Freezing the base model shrinks the gradients and optimizer state, but the activations of the full 1.5B forward pass are still kept for the backward pass, and they grow with batch size. Fix: smaller batch + gradient accumulation (recovers effective batch size without the peak memory) + gradient checkpointing (recompute activations in the backward pass):
per_device_train_batch_size=16,
gradient_accumulation_steps=2, # effective batch 32 per device, lower peak memory
gradient_checkpointing=True, # ~30% more compute, big memory savings
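Gradient accumulation works because, for a mean loss, averaging the gradients of equal-sized micro-batches gives the same result as one pass over the whole batch. A minimal sketch on a toy one-weight regression; the data and function names are invented for illustration.

```python
def mse_grad(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy data: 32 examples of y = 3x, and a weight that is slightly off.
data = [(float(i), 3.0 * i) for i in range(1, 33)]
w = 2.5

full_batch_grad = mse_grad(w, data)          # one step at batch size 32

accum_steps = 2
micro_batch = len(data) // accum_steps       # 16 examples per forward/backward
accumulated = 0.0
for step in range(accum_steps):
    chunk = data[step * micro_batch:(step + 1) * micro_batch]
    # Scale each micro-batch gradient so the sum is an average, not a total.
    accumulated += mse_grad(w, chunk) / accum_steps

assert abs(full_batch_grad - accumulated) < 1e-9
```

Only one micro-batch of activations is alive at a time, which is where the lower peak memory comes from; the cost is more forward/backward passes per optimizer step.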
Wall #3: my laptop and a cloud GPU showed the same speed
This one was sneaky. My Mac (MPS) and a Kaggle T4 reported nearly identical it/s. How is a datacenter GPU no faster than a laptop?
It wasn't. The Kaggle session had 2 GPUs running data-parallel — each step processed 2× the data, so the total step count halved (626 vs 1250) while it/s stayed flat. The fix isn't code, it's how you read the number: compare examples/second, never iterations/second. Once I did, the GPU was clearly ~3× faster.
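A small helper makes the conversion explicit. The readings below are hypothetical, chosen only to show the effect, and the function assumes one "iteration" is one optimizer step; if your progress bar counts micro-batches instead, leave `grad_accum` at 1.

```python
def examples_per_second(its_per_sec, per_device_batch, num_devices=1, grad_accum=1):
    """Convert a progress-bar it/s reading into examples/second.

    Assumes one iteration is one optimizer step, so each one consumes
    per_device_batch * grad_accum * num_devices examples.
    """
    return its_per_sec * per_device_batch * grad_accum * num_devices


# Hypothetical readings: identical it/s on both machines.
laptop = examples_per_second(1.0, per_device_batch=16, num_devices=1)
cloud = examples_per_second(1.0, per_device_batch=16, num_devices=2)
print(laptop, cloud)  # 16.0 32.0
```

Same it/s, twice the throughput: the second device is invisible in the iteration rate and only shows up once the batch size and device count are multiplied in.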
Result
~96% accuracy again — a frozen 1.5B model + a few-MB adapter matched the fully-fine-tuned 270M model from Part 1, with a saved artifact roughly 1000× smaller.
And that card_arrival vs card_delivery_estimate confusion from Part 1? Still there. Bigger model, different technique, identical mistake. (We resolve that mystery in Part 4.)
What's next
Part 3: I fit a 7-billion-parameter model onto a 16GB GPU that can't even load it normally. That's QLoRA.
📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/02-lora-qwen2-5-1-5b
Built with PyTorch + Hugging Face Transformers + PEFT. Questions or corrections welcome in the comments.