I wanted to actually understand fine-tuning — not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method.
This first post is full fine-tuning — the most powerful and most expensive option: update every weight in the model.
The task
Banking77: ~13,000 real bank customer-support messages, 77 intents like card_arrival, lost_or_stolen_card, exchange_rate. The model reads a message and names the intent.
The model: deliberately tiny
I picked Gemma 3, 270M parameters: small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional. Full fine-tuning keeps a gradient and optimizer state for every parameter, so with an Adam-style optimizer the weights, gradients, and two moment buffers add up to roughly 4× the model's size in memory, before counting activations. I wanted to feel that, not read about it.
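To put a number on that 4× figure, here is a back-of-the-envelope sketch. It assumes fp32 values (4 bytes each), an Adam-style optimizer with two moment buffers per parameter, and a nominal 270M parameter count. It ignores activations and framework overhead, so treat the result as a lower bound rather than a measurement.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=4):
    """Rough training-state footprint in GB (assumes an Adam-style optimizer)."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value            # one gradient per weight
    optimizer_state = 2 * num_params * bytes_per_value  # first and second moments
    total = weights + gradients + optimizer_state
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer_state": optimizer_state / 1e9,
        "total": total / 1e9,
    }


# 270M is the nominal size; the exact parameter count may differ slightly.
estimate = full_finetune_memory_gb(270_000_000)
for name, gigabytes in estimate.items():
    print(f"{name:>16}: {gigabytes:.2f} GB")
```

Under these assumptions the weights alone are about 1 GB and the full training state is about four times that, which is why the same recipe gets out of reach quickly as the model grows.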
One design decision: generate the label, don't classify it
The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text — literally output card_arrival. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.
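One consequence of generating the label is that the output is free text, so it has to be mapped back to an intent before scoring. A minimal sketch of that step follows. `parse_intent` is a hypothetical helper, not something from the notebook, and it assumes the generated text has already been decoded with special tokens stripped.

```python
def parse_intent(generated_text, known_labels):
    """Map raw generated text to a known intent name, or None if it doesn't match.

    Hypothetical helper: takes the first whitespace-delimited chunk of the
    output and accepts it only if it is exactly one of the known labels.
    """
    stripped = generated_text.strip()
    if not stripped:
        return None
    candidate = stripped.split()[0]
    return candidate if candidate in known_labels else None


# Three of the 77 intents, just for illustration.
labels = {"card_arrival", "lost_or_stolen_card", "exchange_rate"}

print(parse_intent(" card_arrival", labels))          # leading space is stripped
print(parse_intent("exchange_rate\nextra", labels))   # trailing text is ignored
print(parse_intent("not_a_real_intent", labels))      # unknown label gives None
```

Counting an unparseable output as a wrong answer keeps the accuracy comparable to what a classification head would report.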
The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:
# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token,
                       add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
labels = [-100] * len(prompt_ids) + target_ids # only the label is graded
If you skip that masking, the loss is averaged over every token, and because the prompt is much longer than the label, most of the training signal goes to reproducing the prompt instead of the answer.
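To see what the -100 does, here is a toy sketch in plain Python. The token ids and per-token losses are made up, and the sketch skips the one-position shift that causal language models apply to labels internally. It only shows how positions labeled -100 (the default `ignore_index` of PyTorch's cross-entropy loss) drop out of the average.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def build_example(prompt_ids, target_ids):
    """Concatenate prompt and target, masking the prompt positions in the labels."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels


def masked_mean(per_token_losses, labels):
    """Average the loss over positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, label in zip(per_token_losses, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Made-up ids: a 4-token prompt followed by a 2-token label.
input_ids, labels = build_example([11, 12, 13, 14], [21, 22])

# Made-up per-token losses: high on the prompt, low on the label.
losses = [5.0, 5.0, 5.0, 5.0, 0.5, 1.5]

masked = masked_mean(losses, labels)
unmasked = sum(losses) / len(losses)
print(labels)
print(f"masked loss:   {masked:.3f}")
print(f"unmasked loss: {unmasked:.3f}")
```

With masking, the reported loss reflects only the two label tokens. Without it, the four prompt tokens dominate the average.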
The thing that surprised me: full FT is fragile
Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:
TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,            # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,        # fp32 on MPS for stability
)
(The later projects freeze the base model, which is exactly why they can tolerate a much higher learning rate: the pretrained weights never change, so there's no fragile knowledge for the update to wreck.)
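For a feel of what the cosine schedule does to that 5e-5, here is a small sketch of a half-cosine decay, which to my understanding is the shape `lr_scheduler_type="cosine"` follows. The total step count of 1800 is a hypothetical round number, not the notebook's actual value, and warmup is assumed to be zero.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Half-cosine decay from base_lr to 0, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


TOTAL_STEPS = 1800  # hypothetical; depends on dataset size, batch size, epochs
BASE_LR = 5e-5

for step in (0, 450, 900, 1350, 1800):
    print(f"step {step:>4}: lr = {cosine_lr(step, TOTAL_STEPS, BASE_LR):.2e}")
```

The peak rate applies only at the very start. By the midpoint the rate has halved, and it reaches zero on the last step, so the largest updates to the pretrained weights all happen early.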
Result
~96% on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.
The one persistent slip: it confused card_arrival with card_delivery_estimate. Keep that in mind — it shows up in every project in this series, and the reason why is the punchline of Part 4.
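A slip like that is easy to surface once you have predicted and true labels side by side. The sketch below counts the off-diagonal (true, predicted) pairs. The data is a made-up toy sample for illustration only, not the results from this run, and the function names are my own.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs among the mistakes."""
    mistakes = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return mistakes.most_common(k)


# Toy data, invented for this example.
y_true = (["card_arrival"] * 4
          + ["card_delivery_estimate"] * 3
          + ["exchange_rate"] * 3)
y_pred = (["card_arrival", "card_arrival",
           "card_delivery_estimate", "card_delivery_estimate"]
          + ["card_delivery_estimate", "card_delivery_estimate", "card_arrival"]
          + ["exchange_rate"] * 3)

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
for (true_label, predicted), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {predicted}: {count}")
```

Sorting the mistakes by count is a quick text-only stand-in for reading the off-diagonal cells of a confusion matrix.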
What's next
In Part 2, I take a model 5× bigger and train less than 1% of it — and get the same accuracy. That's LoRA.
📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m
Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.