I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)

#ai #machinelearning #python #llm

I wanted to actually understand fine-tuning — not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method.

This first post is full fine-tuning — the most powerful and most expensive option: update every weight in the model.

The task

Banking77: ~13,000 real bank customer-support messages, 77 intents like card_arrival, lost_or_stolen_card, exchange_rate. The model reads a message and names the intent.

The model: deliberately tiny

I picked Gemma 3 at 270M parameters: small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional. Full fine-tuning keeps a gradient and optimizer state for every parameter on top of the weights themselves; with a typical Adam-style optimizer in fp32 that adds up to roughly 4× the model's size in memory, before counting activations. I wanted to feel that, not read about it.
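To put a number on that "roughly 4×", here is a back-of-the-envelope sketch. It assumes an AdamW-style optimizer and fp32 everywhere (the optimizer is my assumption, not something stated above), and the function name is made up for illustration.

```python
# Rough memory estimate for full fine-tuning.
# Assumptions (not from the post): AdamW-style optimizer, all tensors in fp32.
PARAMS = 270_000_000   # nominal parameter count for a 270M model
BYTES_PER_FP32 = 4

def full_ft_memory_gb(n_params: int, bytes_per_value: int = BYTES_PER_FP32) -> dict:
    """Return approximate memory per component in GB (1 GB = 1e9 bytes)."""
    per_copy = n_params * bytes_per_value / 1e9
    parts = {
        "weights": per_copy,
        "gradients": per_copy,         # one gradient value per weight
        "adam_moments": 2 * per_copy,  # two running averages per weight
    }
    parts["total"] = sum(parts.values())
    return parts

memory = full_ft_memory_gb(PARAMS)
for name, gb in memory.items():
    print(f"{name:>13}: {gb:.2f} GB")
```

Under those assumptions the weights are about 1.08 GB and the total is about 4.32 GB. Activations and the batch itself come on top of that, so the real footprint is higher.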

One design decision: generate the label, don't classify it

The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text — literally output card_arrival. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.
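One consequence of generating the label as text: the output has to be mapped back to a known intent, and nothing guarantees it is a valid one. A minimal sketch of that step, assuming exact matching on the first generated word; `parse_intent` is a hypothetical helper and the list holds only three of the 77 intents.

```python
from typing import Optional, Sequence

# Three of the 77 intent names, enough for illustration.
INTENTS = ["card_arrival", "lost_or_stolen_card", "exchange_rate"]

def parse_intent(generated: str, valid_intents: Sequence[str]) -> Optional[str]:
    """Map raw generated text to a known intent, or None if it is not one.

    Hypothetical helper: takes the first whitespace-separated word and
    requires an exact match against the label set.
    """
    words = generated.strip().split()
    if not words:
        return None
    candidate = words[0]
    return candidate if candidate in valid_intents else None

print(parse_intent(" card_arrival", INTENTS))  # a valid label
print(parse_intent("card_arival", INTENTS))    # a misspelled label is rejected
```

A classification head can never produce an out-of-vocabulary answer; a generative setup can, so invalid outputs are worth counting as errors during evaluation.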

The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:

# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token,
                       add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
labels    = [-100] * len(prompt_ids) + target_ids   # only the label is graded

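If you skip that masking, the loss is averaged over every token. The prompt tokens far outnumber the label tokens, so most of the training signal goes into reproducing the prompt instead of producing the answer.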
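Here is a toy illustration of what the -100 mask does to the averaged loss. The token ids and per-token losses are made up, and the helper names are hypothetical; the sketch also ignores the one-position shift a causal language model applies between inputs and labels.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels

def masked_mean_loss(token_losses, labels):
    """Average per-token losses over positions whose label is not masked."""
    kept = [loss for loss, lab in zip(token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Made-up ids: a 6-token prompt followed by a 2-token label.
prompt_ids = [11, 12, 13, 14, 15, 16]
target_ids = [42, 1]
input_ids, labels = build_labels(prompt_ids, target_ids)

# Made-up per-token losses, one per position in input_ids.
token_losses = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0]

unmasked = sum(token_losses) / len(token_losses)
masked = masked_mean_loss(token_losses, labels)
label_share = sum(token_losses[len(prompt_ids):]) / sum(token_losses)

print(f"unmasked mean loss: {unmasked:.2f}")
print(f"masked mean loss:   {masked:.2f}")
print(f"share of unmasked loss from label tokens: {label_share:.0%}")
```

In this toy case only a third of the unmasked loss comes from the label tokens. Real prompts are much longer than two-token labels, so the imbalance would be larger.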

The thing that surprised me: full FT is fragile

Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:

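from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="banking77-full-ft",  # placeholder path, pick your own
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,            # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,        # fp32 on MPS for stability
)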

(The later projects freeze the base model and train only a small set of added weights, which is exactly why they can tolerate a much higher learning rate: the pretrained weights never change, so their knowledge can't be wrecked.)
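To see what a cosine schedule does to that 5e-5 over the run, here is a small sketch. It assumes no warmup, no gradient accumulation, and about 10,000 training examples (the post only gives ~13,000 messages in total); it mirrors the shape of a cosine decay, not necessarily the library's exact implementation.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps (no warmup)."""
    progress = min(step / total_steps, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed dataset size; batch size and epochs match the config above.
TRAIN_EXAMPLES = 10_000
BATCH_SIZE = 16
EPOCHS = 3

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Print the learning rate at the start of each epoch and at the very end.
for epoch in range(EPOCHS + 1):
    step = epoch * steps_per_epoch
    print(f"step {step:>4}: lr = {cosine_lr(step, total_steps, 5e-5):.2e}")
```

The learning rate starts at its peak and only falls from there, so the first steps are the ones where a too-high peak does the most damage.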

Result

~96% accuracy on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.

The one persistent slip: it confused card_arrival with card_delivery_estimate. Keep that in mind: it shows up in every project in this series, and the reason is the punchline of Part 4.
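If you want to hunt for a slip like this in your own run, the off-diagonal cells of the confusion matrix are where to look. A minimal sketch with invented toy predictions (not my actual results); both function names are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def per_class_accuracy(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy data, invented for illustration only.
y_true = ["card_arrival", "card_arrival", "card_delivery_estimate", "exchange_rate"]
y_pred = ["card_arrival", "card_delivery_estimate", "card_delivery_estimate", "exchange_rate"]

counts = confusion_counts(y_true, y_pred)

# Off-diagonal cells are the mistakes: true label differs from predicted label.
off_diagonal = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
accuracy = per_class_accuracy(y_true, y_pred)

print(off_diagonal)
print(accuracy)
```

Sorting the off-diagonal pairs by count surfaces the most common confusions first.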