
Suman Nath

Posted on • Originally published at dev.to

I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)

I wanted to actually understand fine-tuning, not run a tutorial and nod along. So I gave myself a constraint: same task, three techniques, smallest model to largest. Full fine-tuning, then LoRA, then QLoRA. Holding the task fixed means any difference comes from the method and the model size that method makes affordable.

This first post is full fine-tuning — the most powerful and most expensive option: update every weight in the model.

The task

Banking77: ~13,000 real bank customer-support messages, 77 intents like card_arrival, lost_or_stolen_card, exchange_rate. The model reads a message and names the intent.

The model: deliberately tiny

I picked Gemma 3, 270M parameters: small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional. Full fine-tuning keeps the weights, a gradient for every weight, and optimizer state for every weight in memory. With an Adam-style optimizer in fp32 that comes to roughly 4× the size of the weights alone, before counting activations. I wanted to feel that, not read about it.
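To put a number on that 4× figure, here is a back-of-the-envelope sketch. It assumes fp32 everywhere and an Adam-style optimizer with two moment buffers per parameter; it uses the nominal 270M parameter count and ignores activations, so treat the result as a floor rather than a measurement.

```python
# Rough memory estimate for full fine-tuning in fp32.
# Assumption: Adam-style optimizer (two moment buffers per parameter).
BYTES_PER_FP32 = 4


def full_ft_memory_gb(num_params: int) -> dict:
    weights = num_params * BYTES_PER_FP32
    gradients = num_params * BYTES_PER_FP32        # one gradient per weight
    optimizer = 2 * num_params * BYTES_PER_FP32    # first and second moments
    total = weights + gradients + optimizer

    def to_gb(n_bytes: int) -> float:
        return round(n_bytes / 1e9, 2)

    return {
        "weights_gb": to_gb(weights),
        "gradients_gb": to_gb(gradients),
        "optimizer_gb": to_gb(optimizer),
        "total_gb": to_gb(total),
    }


estimate = full_ft_memory_gb(270_000_000)  # nominal parameter count
print(estimate)
```

The total is four times the weights-only figure, which is where the 4× rule of thumb comes from. Activation memory comes on top of that and grows with batch size and sequence length.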

One design decision: generate the label, don't classify it

The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model generate the intent as text — literally output card_arrival. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.

The key detail is masking the loss so the model is graded only on the label tokens, not the prompt:

# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
target_ids = tokenizer(" " + label_name + tokenizer.eos_token,
                       add_special_tokens=False)["input_ids"]
input_ids = prompt_ids + target_ids
labels    = [-100] * len(prompt_ids) + target_ids   # only the label is graded

If you skip that masking, the model spends its capacity learning to reproduce the prompt instead of the answer.
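To make the effect of -100 concrete, here is a pure-Python sketch of a masked mean loss. PyTorch's cross-entropy skips targets equal to -100 by default (its `ignore_index`); the function below imitates that averaging rule on toy numbers. The token ids and probabilities are invented for illustration and are not real model output.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy


def masked_mean_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_log_probs[i] is the log-probability assigned to the correct token
    at position i (toy values here).
    """
    graded = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    if not graded:
        raise ValueError("every position is masked; nothing to grade")
    return sum(graded) / len(graded)


# Toy example: 4 prompt tokens followed by 2 label tokens (ids are made up).
prompt_len = 4
target_ids = [501, 502]
labels_masked = [IGNORE_INDEX] * prompt_len + target_ids
labels_unmasked = [11, 12, 13, 14] + target_ids

# Pretend the model predicts the prompt poorly and the label well.
log_probs = [math.log(0.1)] * prompt_len + [math.log(0.9)] * len(target_ids)

loss_masked = masked_mean_nll(log_probs, labels_masked)
loss_unmasked = masked_mean_nll(log_probs, labels_unmasked)
print(f"masked: {loss_masked:.3f}  unmasked: {loss_unmasked:.3f}")
```

In the unmasked case the prompt positions dominate the average, so most of the gradient would go toward predicting prompt text rather than the intent label.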

The thing that surprised me: full FT is fragile

Because you're updating all the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:

TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,            # small, on purpose
    lr_scheduler_type="cosine",
    bf16=False, fp16=False,        # fp32 on MPS for stability
)

(The later projects freeze the base weights, which is exactly why they can tolerate a much higher learning rate: the pretrained knowledge is still there, but the optimizer never touches it, so there is nothing to wreck.)
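For a feel of what the cosine schedule in that config does to the learning rate, here is a small sketch of a half-cosine decay from the peak value to zero. It assumes no warmup, and `steps_per_epoch` is a hypothetical number chosen for illustration; the real step count depends on the training split size and other Trainer settings not shown in the post.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-5) -> float:
    """Half-cosine decay from peak_lr to 0 (assumes no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, for illustration only.
steps_per_epoch = 600
total_steps = 3 * steps_per_epoch  # num_train_epochs=3

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at the peak, passes through half of it at the midpoint, and ends at zero, so the largest updates to the pretrained weights happen early in the run.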

Result

~96% on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.

The one persistent slip: it confused card_arrival with card_delivery_estimate. Keep that in mind — it shows up in every project in this series, and the reason why is the punchline of Part 4.
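Because the model generates the label as text, scoring comes down to string comparison. The sketch below shows one way that could look: exact-match accuracy plus a count of the most frequent (true, predicted) mix-ups. The `evaluate` helper and the toy data are hypothetical and are not taken from the notebook or its results.

```python
from collections import Counter


def evaluate(true_labels, generated_texts):
    """Exact-match accuracy plus the most frequent (true, predicted) mistakes."""
    if not true_labels:
        raise ValueError("no examples to evaluate")
    preds = [text.strip() for text in generated_texts]  # drop the leading space
    pairs = list(zip(true_labels, preds))
    correct = sum(t == p for t, p in pairs)
    confusions = Counter((t, p) for t, p in pairs if t != p)
    return correct / len(true_labels), confusions.most_common(3)


# Toy data, invented for illustration (not the post's results).
true_labels = [
    "card_arrival",
    "card_arrival",
    "exchange_rate",
    "lost_or_stolen_card",
    "card_delivery_estimate",
]
generated = [
    " card_arrival",
    " card_delivery_estimate",
    " exchange_rate",
    " lost_or_stolen_card",
    " card_delivery_estimate",
]

accuracy, top_confusions = evaluate(true_labels, generated)
print(accuracy, top_confusions)
```

Counting confused pairs this way is what surfaces a slip like card_arrival versus card_delivery_estimate without having to read a 77×77 confusion matrix cell by cell.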

What's next

In Part 2, I take a model 5× bigger and train less than 1% of it — and get the same accuracy. That's LoRA.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m


Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.
