If a 270M Model Already Worked, Why Did I Fine-Tune a 7B One?

#ai #machinelearning #llm #mlops

Over three posts I built three fine-tuned models for the same banking-intent task — full fine-tuning a 270M model, LoRA on 1.5B, QLoRA on 7B. They all landed around the same accuracy.

Which raises an honest, slightly uncomfortable question: if a 270M model on my laptop already worked, why reach for a 7B model at all?

The answer most "bigger is better" content skips

For this task — you wouldn't. A good engineer picks the smallest model that clears the bar, not the biggest one available. The small model is cheaper to serve, runs in milliseconds, and you fully own it. Choosing the 7B here would be over-engineering.

Reaching for a bigger model isn't a flex. It's a response to a requirement the small one can't meet. Here are the four cases where small stops being enough:

1. The task is genuinely hard

Banking77 is easy — 77 fixed labels, short clean queries. Small models saturate it. But ask for reasoning ("which of these three issues is the primary one?"), open-ended generation (write the reply, don't just classify), or real nuance, and there's a capability floor that more parameters buy. No amount of fine-tuning gives a 270M model abilities it doesn't have.

2. You have little data

I had ~10,000 labeled examples — plenty for a small model. With 50, a small model can't learn the task, but a 7B model already "knows" banking concepts from pretraining and only needs a nudge. Bigger models need less task data because they bring more prior knowledge.

3. You need one model for many tasks

This is the quiet superpower of LoRA/QLoRA. A single frozen 7B base can host dozens of swappable adapters — intent classifier, reply writer, summarizer, sentiment — all from one ~5GB footprint in memory. The 270M is single-purpose. This is why companies serve hundreds of fine-tunes from one base model.

4. Accuracy compounds at scale

93% means 7 in 100 queries misrouted. At 10M queries/month, that's 700,000 mistakes. If each costs a support escalation, the 2–3 points a bigger model buys can pay for itself many times over. At small scale, nobody notices. At large scale, it's the whole budget.

So why did I build all three?

Not to beat the small model — they tied. And that tie is the lesson: on an easy task, the technique barely matters.

I built them to learn the techniques, so that when I hit a task where small isn't enough, I can fine-tune a 7B model on a 16GB card without flinching. It's like learning to change a tire in your driveway on a sunny day. The driveway didn't need it — but now I can do it on the highway, in the rain.

The bug no model size could fix

One thread ran through all three projects. Every model — 270M, 1.5B, 7B — confused the same two intents: card_arrival and card_delivery_estimate. Three model scales, three techniques, the same mistake in every confusion matrix.

That's not a capacity problem you can buy your way out of. "Where's my card?" and "when will my card arrive?" genuinely overlap — the ambiguity is in the labels themselves, not the model. Three models agreeing on a "mistake" is a strong signal the data, not the model, is the limit.

Sometimes the answer isn't a bigger model. It's better data.

That might be the most useful thing I learned across the whole series.

The takeaway, as a decision

Is the small model good enough for the actual requirement?
   YES → ship it. Bigger is wasted cost, latency, and complexity.
   NO  → why not?
          capability ceiling? → bigger base model
          too little data?    → bigger base + LoRA (needs less data)
          many tasks at once? → one big base + swappable adapters

Match the model to the requirement. That instinct is worth more than any of the fine-tuning mechanics in Parts 1–3.

📓 All three notebooks on Kaggle: https://www.kaggle.com/work/collections/18659493

Thanks for reading the series. If it was useful, a reaction or a comment helps it reach the next person debugging their first OOM.