DEV Community

Suman Nath

LoRA: I Trained <1% of a 1.5B Model and Matched a Full Fine-Tune

In Part 1 I fully fine-tuned a 270M model, updating every weight. That's fine for a tiny model. It gets painful as models grow, because full fine-tuning needs gradients and optimizer state for every parameter. With an Adam-style optimizer that is roughly 4× the model size in memory (weights, gradients, and two optimizer moments), before counting activations.
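
To put rough numbers on that memory cost, here is a back-of-the-envelope sketch. It assumes an Adam-style optimizer (two moment buffers per parameter) and everything held in fp32; the function name is made up for illustration, and activations are left out entirely.

```python
def full_finetune_bytes(n_params, bytes_per_value=4):
    """Weights + gradients + two Adam moment buffers, all at the same precision."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_moments = 2 * n_params * bytes_per_value
    return weights + gradients + adam_moments

# Parameter counts are the rounded sizes quoted in this series.
for name, n_params in [("270M", 270e6), ("1.5B", 1.5e9)]:
    gib = full_finetune_bytes(n_params) / 2**30
    print(f"{name}: ~{gib:.1f} GiB before activations")
```

Under these assumptions the 270M model needs about 4 GiB and the 1.5B model over 20 GiB, which is already more than a 16GB card can hold.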

So: what do you do when the model is too big to fine-tune in full comfortably?

The idea behind LoRA

LoRA (Low-Rank Adaptation) rests on one observation: the change fine-tuning makes to a weight matrix is "low rank" — it lives in a small subspace. You don't need to learn the full update ΔW; you can learn it as the product of two skinny matrices, B·A:

output = W·x  +  (B·A)·x
         ↑frozen    ↑trainable (tiny)

For a single 1536×1536 layer at rank 16, A is 16×1536 and B is 1536×16, so that's 2 × 16 × 1536 = 49,152 trainable numbers instead of ~2.4 million. And you freeze the entire base model; only the adapters train. B starts at zero, so at step 0 the model behaves exactly like the original and training nudges it from there.
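
The forward pass and the zero-init property can be checked with a few lines of plain Python. This is a toy sketch, not PEFT's implementation: the helper names and the tiny matrix sizes are assumptions chosen for illustration, and the `scale` argument stands in for alpha / r.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """output = W·x + scale · B·(A·x); W is frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))   # never builds the full d_out x d_in update
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """A is r x d_in and B is d_out x r."""
    return r * d_in + d_out * r

random.seed(0)
d_in, d_out, r = 6, 4, 2               # toy sizes, for illustration only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

# At step 0 the adapted layer matches the frozen layer exactly.
assert lora_forward(W, A, B, x) == matvec(W, x)

print(lora_param_count(1536, 1536, 16))   # 49152
print(1536 * 1536)                        # 2359296
```

Computing B·(A·x) as two skinny products is the point: the full-size update matrix is never stored, so the extra memory and compute stay proportional to the rank.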

The config

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank — adapter capacity
    lora_alpha=32,        # scaling; effective scale = alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# -> trainable params are ~1% of the model. The other 99% is frozen.
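
To see where the trainable count comes from, each targeted module contributes r × (d_in + d_out) parameters per layer. The sketch below adds them up. The layer shapes are my assumptions about Qwen2.5-1.5B (hidden size 1536, key/value width 256, MLP width 8960, 28 layers); check them against `model.config` before trusting the total, and treat `model.print_trainable_parameters()` as the real answer.

```python
# Assumed (d_in, d_out) shapes for one decoder block; not read from the real config.
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 1536, 256, 8960, 28

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def adapter_params(shapes, r, n_layers):
    """Sum r * (d_in + d_out) over every targeted module, then over all layers."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

r, lora_alpha = 16, 32
total = adapter_params(MODULE_SHAPES, r, N_LAYERS)
print(f"adapter parameters: {total:,}")
print(f"LoRA scale (alpha / r): {lora_alpha / r}")
```

The MLP projections dominate the total because of their width, so dropping them from `target_modules` is the quickest way to shrink the adapter if memory is tight.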

I ran this on Qwen2.5-1.5B-Instruct — 5× bigger than the Gemma model from Part 1. Same Banking77 task. Then the GPU fought back.

Wall #1: ValueError: Attempting to unscale FP16 gradients

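I'd loaded the model in fp16 to save memory. Wrong move: PyTorch's gradient scaler refuses to unscale gradients that are themselves fp16, because the optimizer needs fp32 master weights. Mixed precision is applied at train time by the trainer, which runs the forward pass in fp16 while the trainable weights stay in fp32; it is not baked into the loaded weights.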

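import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed Hub ID for the model named above
# load weights in fp32; let the Trainer's AMP do fp16 during training
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
# and set fp16=True in TrainingArguments (on CUDA) for the mixed-precision part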
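
Why do master weights need the extra precision? A small stdlib-only illustration, assuming a made-up per-step update of 1e-4: near 1.0, neighbouring fp16 values are about 0.001 apart, so an update that small rounds away completely. Python floats are float64 here and only stand in for the fp32 master copy.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

update = 1e-4            # hypothetical per-step weight update (lr x gradient)
w_half = to_fp16(1.0)    # weight stored in fp16
w_master = 1.0           # higher-precision master copy

for _ in range(1000):
    w_half = to_fp16(w_half + update)   # rounds straight back to 1.0 every step
    w_master += update

print(w_half)                # 1.0
print(round(w_master, 4))    # 1.1
```

A thousand updates later the fp16 weight has not moved at all, while the master copy has drifted by 10%. This shows the motivation behind fp32 master weights; it is not a reproduction of the `ValueError` itself.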

Wall #2: CUDA out of memory at batch size 64

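LoRA shrinks gradients and optimizer state down to the adapters, but activations still flow through the whole frozen model and scale with batch size. Fix: a smaller per-device batch + gradient accumulation (several small batches feed one optimizer step, so the effective batch stays larger than what fits in memory at once) + gradient checkpointing (recompute activations in the backward pass instead of storing them):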

per_device_train_batch_size=16,
gradient_accumulation_steps=2,     # 16 × 2 = 32 examples per optimizer step per GPU, lower peak memory
gradient_checkpointing=True,       # ~30% more compute, big memory savings
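
Gradient accumulation works because averaging the gradients of equal-sized micro-batches gives the same result as one pass over the combined batch. A toy check on a 1-D linear model, with made-up data and a hypothetical helper name:

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]   # toy (x, y) pairs
w = 0.5

full_batch_grad = mse_grad(w, data)

accum_steps = 2
micro_batches = [data[:2], data[2:]]
accumulated = 0.0
for micro in micro_batches:
    # Dividing by accum_steps mirrors scaling the loss during accumulation.
    accumulated += mse_grad(w, micro) / accum_steps

assert abs(full_batch_grad - accumulated) < 1e-12
print(full_batch_grad, accumulated)
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from. The equivalence is exact only when the micro-batches are the same size and the loss is a plain per-example mean.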

Wall #3: my laptop and a cloud GPU showed the same speed

This one was sneaky. My Mac (MPS) and a Kaggle T4 reported nearly identical it/s. How is a datacenter GPU no faster than a laptop?

It is faster; I was reading the wrong number. The Kaggle session had 2 GPUs running data-parallel, so each step processed 2× the data and the total step count halved (626 vs 1250) while it/s stayed flat. The fix isn't code, it's how you read the number: compare examples/second, never iterations/second. Once I did, the GPU was clearly ~3× faster.
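
A small helper makes the conversion explicit. It assumes one "iteration" on the progress bar is one optimizer step, so each iteration consumes per-device batch × accumulation steps × device count examples. The it/s values below are placeholders, not measurements from my runs.

```python
def examples_per_second(it_per_s, per_device_batch, grad_accum_steps, n_devices):
    """Convert a progress-bar rate into throughput, assuming 1 iteration = 1 optimizer step."""
    examples_per_step = per_device_batch * grad_accum_steps * n_devices
    return it_per_s * examples_per_step

# Placeholder readings: identical it/s on both machines.
laptop = examples_per_second(it_per_s=1.0, per_device_batch=16,
                             grad_accum_steps=2, n_devices=1)
cloud = examples_per_second(it_per_s=1.0, per_device_batch=16,
                            grad_accum_steps=2, n_devices=2)

print(laptop, cloud, cloud / laptop)   # 32.0 64.0 2.0
```

Identical it/s on twice the devices already means twice the throughput; any difference in the raw it/s multiplies on top of that.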

Result

~96% accuracy again — a frozen 1.5B model + a few-MB adapter matched the fully-fine-tuned 270M model from Part 1, with a saved artifact roughly 1000× smaller.

And that card_arrival vs card_delivery_estimate confusion from Part 1? Still there. Bigger model, different technique, identical mistake. (We resolve that mystery in Part 4.)

What's next

Part 3: I fit a 7-billion-parameter model onto a 16GB GPU that can't even load it normally. That's QLoRA.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/02-lora-qwen2-5-1-5b


Built with PyTorch + Hugging Face Transformers + PEFT. Questions or corrections welcome in the comments.
