<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suman Nath</title>
    <description>The latest articles on DEV Community by Suman Nath (@sumanpro).</description>
    <link>https://dev.to/sumanpro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3995156%2Fe4839d06-1ce1-4793-b439-6bdeefb2f572.png</url>
      <title>DEV Community: Suman Nath</title>
      <link>https://dev.to/sumanpro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sumanpro"/>
    <language>en</language>
    <item>
      <title>If a 270M Model Already Worked, Why Did I Fine-Tune a 7B One?</title>
      <dc:creator>Suman Nath</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:23:55 +0000</pubDate>
      <link>https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3</link>
      <guid>https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3</guid>
      <description>&lt;p&gt;Over three posts I built three fine-tuned models for the same banking-intent task — &lt;a href="https://dev.to/sumanpro/i-fine-tuned-a-270m-model-on-my-laptop-full-fine-tuning-from-scratch-3p4l"&gt;full fine-tuning a 270M model&lt;/a&gt;, &lt;a href="https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if"&gt;LoRA on 1.5B&lt;/a&gt;, &lt;a href="https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4"&gt;QLoRA on 7B&lt;/a&gt;. They all landed around the same accuracy.&lt;/p&gt;

&lt;p&gt;Which raises an honest, slightly uncomfortable question: &lt;strong&gt;if a 270M model on my laptop already worked, why reach for a 7B model at all?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The answer most "bigger is better" content skips
&lt;/h2&gt;

&lt;p&gt;For &lt;em&gt;this&lt;/em&gt; task — you wouldn't. A good engineer picks the &lt;strong&gt;smallest model that clears the bar&lt;/strong&gt;, not the biggest one available. The small model is cheaper to serve, runs in milliseconds, and you fully own it. Choosing the 7B here would be over-engineering.&lt;/p&gt;

&lt;p&gt;Reaching for a bigger model isn't a flex. It's a response to a requirement the small one can't meet. Here are the four cases where small stops being enough:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The task is genuinely hard
&lt;/h3&gt;

&lt;p&gt;Banking77 is easy — 77 fixed labels, short clean queries. Small models saturate it. But ask for reasoning ("which of these three issues is the &lt;em&gt;primary&lt;/em&gt; one?"), open-ended generation (write the reply, don't just classify), or real nuance, and there's a capability floor that more parameters buy. No amount of fine-tuning gives a 270M model abilities it doesn't have.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. You have little data
&lt;/h3&gt;

&lt;p&gt;I had ~10,000 labeled examples — plenty for a small model. With 50, a small model can't learn the task, but a 7B model already "knows" banking concepts from pretraining and only needs a nudge. &lt;strong&gt;Bigger models need &lt;em&gt;less&lt;/em&gt; task data&lt;/strong&gt; because they bring more prior knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. You need one model for many tasks
&lt;/h3&gt;

&lt;p&gt;This is the quiet superpower of LoRA/QLoRA. A single frozen 7B base can host &lt;em&gt;dozens&lt;/em&gt; of swappable adapters — intent classifier, reply writer, summarizer, sentiment — all from one ~5GB footprint in memory. The 270M is single-purpose. This is why companies serve hundreds of fine-tunes from one base model.&lt;/p&gt;
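&lt;p&gt;The idea of one frozen base with swappable adapters can be sketched in plain Python. This is a toy under stated assumptions, not how PEFT implements it: &lt;code&gt;SharedBase&lt;/code&gt;, &lt;code&gt;add_adapter&lt;/code&gt;, the adapter names, and the tiny 2×2 matrices are all hypothetical and chosen for illustration.&lt;/p&gt;

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedBase:
    """One frozen weight matrix W plus named low-rank adapters (B, A)."""

    def __init__(self, weight):
        self.weight = weight   # frozen: never modified after construction
        self.adapters = {}     # task name to (B, A) pair
        self.active = None     # name of the adapter currently in use

    def add_adapter(self, name, b, a):
        self.adapters[name] = (b, a)

    def set_adapter(self, name):
        self.active = name

    def forward(self, x):
        out = matvec(self.weight, x)          # W.x, shared by every task
        if self.active is not None:
            b, a = self.adapters[self.active]
            delta = matvec(b, matvec(a, x))   # (B.A).x, task-specific
            out = [o + d for o, d in zip(out, delta)]
        return out


# Hypothetical 2x2 base with two rank-1 adapters.
base = SharedBase([[1.0, 0.0], [0.0, 1.0]])
base.add_adapter("intent", b=[[1.0], [0.0]], a=[[0.0, 2.0]])
base.add_adapter("summary", b=[[0.0], [1.0]], a=[[3.0, 0.0]])

x = [1.0, 1.0]
base.set_adapter("intent")
intent_out = base.forward(x)
base.set_adapter("summary")
summary_out = base.forward(x)
print(intent_out, summary_out)
```

&lt;p&gt;Switching tasks only changes which small pair of matrices is added on top; the base weights are stored once and never touched. In PEFT, the analogous operations are &lt;code&gt;load_adapter&lt;/code&gt; and &lt;code&gt;set_adapter&lt;/code&gt; on a &lt;code&gt;PeftModel&lt;/code&gt;.&lt;/p&gt;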

&lt;h3&gt;
  
  
  4. Accuracy compounds at scale
&lt;/h3&gt;

&lt;p&gt;93% means 7 in 100 queries misrouted. At 10M queries/month, that's 700,000 mistakes. If each costs a support escalation, the 2–3 points a bigger model buys can pay for itself many times over. At small scale, nobody notices. At large scale, it's the whole budget.&lt;/p&gt;
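&lt;p&gt;The arithmetic is easy to rerun for your own traffic. A minimal sketch, using the 10M queries/month from above; the 96% row is a hypothetical "three points better" figure, not a measured result.&lt;/p&gt;

```python
def monthly_misroutes(accuracy, queries_per_month):
    """Expected number of misrouted queries per month at a given accuracy."""
    return round((1 - accuracy) * queries_per_month)


QUERIES = 10_000_000  # monthly volume used in the text

baseline = monthly_misroutes(0.93, QUERIES)  # accuracy quoted in the text
improved = monthly_misroutes(0.96, QUERIES)  # hypothetical: 3 points better

print(f"93%: {baseline:,} misroutes/month")
print(f"96%: {improved:,} misroutes/month")
print(f"avoided: {baseline - improved:,}")
```

&lt;p&gt;Multiply the avoided count by your own cost per escalation to see whether the bigger model's serving bill is worth it.&lt;/p&gt;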

&lt;h2&gt;
  
  
  So why did I build all three?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not to beat the small model — they tied. And that tie &lt;em&gt;is&lt;/em&gt; the lesson: on an easy task, the technique barely matters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I built them to learn the techniques, so that when I hit a task where small &lt;em&gt;isn't&lt;/em&gt; enough, I can fine-tune a 7B model on a 16GB card without flinching. It's like learning to change a tire in your driveway on a sunny day. The driveway didn't need it — but now I can do it on the highway, in the rain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug no model size could fix
&lt;/h2&gt;

&lt;p&gt;One thread ran through all three projects. Every model — 270M, 1.5B, 7B — confused the same two intents: &lt;code&gt;card_arrival&lt;/code&gt; and &lt;code&gt;card_delivery_estimate&lt;/code&gt;. Three model scales, three techniques, the same mistake in every confusion matrix.&lt;/p&gt;

&lt;p&gt;That's not a capacity problem you can buy your way out of. "Where's my card?" and "when will my card arrive?" genuinely overlap — the ambiguity is in the &lt;strong&gt;labels themselves&lt;/strong&gt;, not the model. Three models agreeing on a "mistake" is a strong signal the data, not the model, is the limit.&lt;/p&gt;

&lt;p&gt;Sometimes the answer isn't a bigger model. It's better data.&lt;/p&gt;

&lt;p&gt;That might be the most useful thing I learned across the whole series.&lt;/p&gt;
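&lt;p&gt;The cross-model check described above can be sketched in a few lines: collect the label pairs each model mixes up, then keep only the pairs every model shares. The function names and the tiny prediction lists below are hypothetical stand-ins, not the notebooks' actual outputs.&lt;/p&gt;

```python
from collections import Counter


def confused_pairs(y_true, y_pred):
    """Count each unordered pair of labels the model mixed up."""
    return Counter(
        frozenset((t, p)) for t, p in zip(y_true, y_pred) if t != p
    )


def shared_confusions(predictions_by_model, y_true):
    """Return the label pairs that every model confuses at least once."""
    per_model = [
        set(confused_pairs(y_true, preds))
        for preds in predictions_by_model.values()
    ]
    return set.intersection(*per_model)


# Hypothetical gold labels and predictions, for illustration only.
y_true = ["card_arrival", "card_arrival", "exchange_rate", "card_delivery_estimate"]
preds = {
    "270M": ["card_delivery_estimate", "card_arrival", "exchange_rate", "card_delivery_estimate"],
    "1.5B": ["card_arrival", "card_delivery_estimate", "exchange_rate", "card_arrival"],
    "7B": ["card_delivery_estimate", "card_arrival", "lost_or_stolen_card", "card_delivery_estimate"],
}

shared = shared_confusions(preds, y_true)
for pair in shared:
    print(sorted(pair))
```

&lt;p&gt;A pair that survives the intersection across different model sizes and training methods is a good candidate for a label audit rather than more training.&lt;/p&gt;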

&lt;h2&gt;
  
  
  The takeaway, as a decision
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is the small model good enough for the actual requirement?
   YES → ship it. Bigger is wasted cost, latency, and complexity.
   NO  → why not?
          capability ceiling? → bigger base model
          too little data?    → bigger base + LoRA (needs less data)
          many tasks at once? → one big base + swappable adapters
          costly errors?      → bigger base, if the extra points pay for themselves
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Match the model to the requirement. That instinct is worth more than any of the fine-tuning mechanics in Parts 1–3.&lt;/p&gt;

&lt;p&gt;📓 &lt;strong&gt;All three notebooks on Kaggle:&lt;/strong&gt; &lt;a href="https://www.kaggle.com/work/collections/18659493" rel="noopener noreferrer"&gt;https://www.kaggle.com/work/collections/18659493&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thanks for reading the series. If it was useful, a reaction or a comment helps it reach the next person debugging their first OOM.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>mlops</category>
      <category>ai</category>
    </item>
    <item>
      <title>QLoRA: Fine-Tuning a 7B Model on a 16GB GPU (It Shrank to 5.4GB in Front of Me)</title>
      <dc:creator>Suman Nath</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:20:53 +0000</pubDate>
      <link>https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4</link>
      <guid>https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if"&gt;Part 2&lt;/a&gt;, LoRA let me fine-tune a 1.5B model by freezing it and training tiny adapters. But the frozen base still sat in memory in 16-bit (~3GB). Now I wanted to go to &lt;strong&gt;Qwen2.5-7B&lt;/strong&gt; — and hit a wall that LoRA alone doesn't solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;A 7B model is ~15GB in 16-bit precision. A free-tier T4 GPU has 16GB. It would &lt;em&gt;barely&lt;/em&gt; load, with no room left to actually train.&lt;/p&gt;

&lt;h2&gt;
  
  
  The QLoRA insight
&lt;/h2&gt;

&lt;p&gt;QLoRA asks the question that naturally follows from LoRA: &lt;strong&gt;the base is frozen and only ever read — so why store it in full precision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So you &lt;strong&gt;quantize the frozen base to 4-bit&lt;/strong&gt; (NF4, a format tuned for how neural-net weights are distributed) and run the LoRA adapters on top in normal precision. The base shrinks dramatically; the trainable part stays small and precise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
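&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BitsAndBytesConfig&lt;/span&gt;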

&lt;span class="n"&gt;bnb_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BitsAndBytesConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bnb_4bit_quant_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nf4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;# NormalFloat4
&lt;/span&gt;    &lt;span class="n"&gt;bnb_4bit_use_double_quant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# quantize the quant constants too
&lt;/span&gt;    &lt;span class="n"&gt;bnb_4bit_compute_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# dequantize to fp16 for the matmuls
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quantization_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bnb_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each flag earns its place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;load_in_4bit&lt;/code&gt;&lt;/strong&gt; — store frozen weights in 4 bits instead of 16.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;nf4&lt;/code&gt;&lt;/strong&gt; — a 4-bit type matched to the bell-curve distribution of neural-net weights (better than plain int4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;double_quant&lt;/code&gt;&lt;/strong&gt; — quantize the quantization constants too, for a bit more savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;compute_dtype&lt;/code&gt;&lt;/strong&gt; — dequantize to fp16 for the actual matmuls, so &lt;em&gt;storage&lt;/em&gt; is 4-bit but &lt;em&gt;compute&lt;/em&gt; stays precise.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The moment it clicked
&lt;/h2&gt;

&lt;p&gt;One line of output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loaded in 4-bit. footprint: 5.44 GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I downloaded 15.2GB of weights and they sat in memory as &lt;strong&gt;5.44GB.&lt;/strong&gt; A model that couldn't be &lt;em&gt;loaded&lt;/em&gt; for full fine-tuning was now training on a single 16GB T4, with room to spare. (The download is still 15GB; bitsandbytes quantizes on the fly during load.)&lt;/p&gt;
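&lt;p&gt;A back-of-envelope check on those sizes, as a sketch: weight memory is roughly parameter count times bits per parameter. The 7.6 billion parameter count is an assumption for a "7B" model, not a number from this post, and &lt;code&gt;weight_gb&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def weight_gb(n_params, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.6e9  # assumed parameter count for a "7B" model; not from the post

fp16_gb = weight_gb(N_PARAMS, 16)
nf4_gb = weight_gb(N_PARAMS, 4)
print(f"16-bit: {fp16_gb:.1f} GB")
print(f" 4-bit: {nf4_gb:.1f} GB (lower bound if every weight were 4-bit)")
```

&lt;p&gt;The measured 5.44GB sits above the pure 4-bit lower bound. A likely reason is that not every tensor is quantized (embeddings and norm layers typically stay in higher precision) and the quantization constants take space too, though I haven't broken the footprint down tensor by tensor.&lt;/p&gt;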

&lt;h2&gt;
  
  
  The QLoRA-standard recipe
&lt;/h2&gt;

&lt;p&gt;Three more pieces complete the recipe: prepare the quantized model for training, attach LoRA to &lt;em&gt;all&lt;/em&gt; linear layers as in Part 2 (the QLoRA paper found this matters), and use a paged 8-bit optimizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prepare_model_for_kbit_training&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prepare_model_for_kbit_training&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# ... attach LoRA to every linear layer ...
&lt;/span&gt;&lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paged_adamw_8bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  It's slow — and that's fine
&lt;/h2&gt;

&lt;p&gt;A 7B forward pass through 4-bit weights with gradient checkpointing is heavy: ~1 hour for one epoch on a T4, ~3 examples/second. But &lt;strong&gt;QLoRA isn't about speed — it's about fit.&lt;/strong&gt; The model runs at all, on hardware that couldn't otherwise hold it. That's the entire point.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Hardware note:&lt;/strong&gt; &lt;code&gt;bitsandbytes&lt;/code&gt; 4-bit is CUDA-first. It does &lt;em&gt;not&lt;/em&gt; run on Apple MPS, and AMD/ROCm support exists but is less mature. Run this one on an NVIDIA GPU (Kaggle/Colab T4 works).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;QLoRA accuracy: 92.85%, up from 16.0% for the 4-bit base before fine-tuning.&lt;br&gt;
Macro-F1: 0.928.&lt;/p&gt;

&lt;p&gt;It roughly tied the smaller models from Parts 1 and 2.&lt;/p&gt;

&lt;p&gt;And the &lt;code&gt;card_arrival&lt;/code&gt; vs &lt;code&gt;card_delivery_estimate&lt;/code&gt; confusion that haunted both smaller models? &lt;strong&gt;Still there at 7B.&lt;/strong&gt; Same wall, third time. That sets up the question I tackle in &lt;a href="https://dev.to/sumanpro/if-a-270m-model-already-worked-why-did-i-fine-tune-a-7b-one-2ae3"&gt;Part 4&lt;/a&gt;: &lt;strong&gt;if the 270M model already worked, why did I build any of this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;📓 &lt;strong&gt;Full runnable notebook on Kaggle:&lt;/strong&gt; &lt;a href="https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b" rel="noopener noreferrer"&gt;https://www.kaggle.com/code/sumannath88/03-qlora-qwen2-5-7b&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with PyTorch + Transformers + PEFT + bitsandbytes. Questions or corrections welcome in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>LoRA: I Trained &lt;1% of a 1.5B Model and Matched a Full Fine-Tune</title>
      <dc:creator>Suman Nath</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:17:16 +0000</pubDate>
      <link>https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if</link>
      <guid>https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/sumanpro/i-fine-tuned-a-270m-model-on-my-laptop-full-fine-tuning-from-scratch-3p4l"&gt;Part 1&lt;/a&gt; I fully fine-tuned a 270M model — updating every weight. That's fine for a tiny model. It gets painful as models grow, because full fine-tuning needs gradients and optimizer state for &lt;em&gt;every&lt;/em&gt; parameter (~4× the model size in memory).&lt;/p&gt;

&lt;p&gt;So: what do you do when the model is too big to comfortably fine-tune all of?&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea behind LoRA
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LoRA (Low-Rank Adaptation)&lt;/strong&gt; rests on one observation: the &lt;em&gt;change&lt;/em&gt; fine-tuning makes to a weight matrix is "low rank" — it lives in a small subspace. You don't need to learn the full update ΔW; you can learn it as the product of two skinny matrices, B·A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = W·x  +  (B·A)·x
         ↑frozen    ↑trainable (tiny)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a single 1536×1536 layer at rank 16, that's about &lt;strong&gt;49,000 trainable numbers instead of ~2.4 million&lt;/strong&gt;. And you &lt;strong&gt;freeze the entire base model&lt;/strong&gt; — only the adapters train. &lt;code&gt;B&lt;/code&gt; starts at zero, so at step 0 the model behaves exactly like the original and training nudges it from there.&lt;/p&gt;
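&lt;p&gt;The parameter count is quick to verify. A minimal sketch, assuming &lt;code&gt;B&lt;/code&gt; has shape d_out × r and &lt;code&gt;A&lt;/code&gt; has shape r × d_in; &lt;code&gt;lora_param_count&lt;/code&gt; is a hypothetical helper, not a PEFT function.&lt;/p&gt;

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable values in one adapter: B is d_out x r, A is r x d_in."""
    return d_out * rank + rank * d_in


full = 1536 * 1536                            # every entry of the full update
lora = lora_param_count(1536, 1536, rank=16)  # the two skinny matrices

print(f"full update: {full:,}")
print(f"LoRA update: {lora:,}")
print(f"ratio      : {lora / full:.2%}")
```

&lt;p&gt;Doubling the rank doubles the adapter size, so &lt;code&gt;r&lt;/code&gt; is a direct dial between capacity and footprint.&lt;/p&gt;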

&lt;h2&gt;
  
  
  The config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskType&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# rank — adapter capacity
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# scaling; effective scale = alpha / r
&lt;/span&gt;    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; trainable params are ~1% of the model. The other 99% is frozen.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran this on &lt;strong&gt;Qwen2.5-1.5B-Instruct&lt;/strong&gt; — 5× bigger than the Gemma model from Part 1. Same Banking77 task. Then the GPU fought back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wall #1: &lt;code&gt;ValueError: Attempting to unscale FP16 gradients&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I'd loaded the model in fp16 to save memory. Wrong move: the optimizer needs &lt;strong&gt;fp32 master weights&lt;/strong&gt;; mixed precision is applied &lt;em&gt;at train time&lt;/em&gt; by the trainer, not baked into the loaded weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# load weights in fp32; let the Trainer's AMP do fp16 during training
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# and set fp16=True in TrainingArguments (on CUDA) for the mixed-precision part
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wall #2: CUDA out of memory at batch size 64
&lt;/h2&gt;

&lt;p&gt;Adapter training still holds activations and optimizer state. Fix: smaller batch + gradient accumulation (keeps the &lt;em&gt;effective&lt;/em&gt; batch) + gradient checkpointing (recompute activations in the backward pass):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# effective batch 32, lower peak memory
&lt;/span&gt;&lt;span class="n"&gt;gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# ~30% more compute, big memory savings
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wall #3: my laptop and a cloud GPU showed the &lt;em&gt;same&lt;/em&gt; speed
&lt;/h2&gt;

&lt;p&gt;This one was sneaky. My Mac (MPS) and a Kaggle T4 reported nearly identical &lt;code&gt;it/s&lt;/code&gt;. How is a datacenter GPU no faster than a laptop?&lt;/p&gt;

&lt;p&gt;It wasn't. The Kaggle session had &lt;strong&gt;2 GPUs&lt;/strong&gt; running data-parallel — each step processed 2× the data, so the &lt;em&gt;total step count halved&lt;/em&gt; (626 vs 1250) while &lt;code&gt;it/s&lt;/code&gt; stayed flat. The fix isn't code, it's how you read the number: &lt;strong&gt;compare examples/second, never iterations/second.&lt;/strong&gt; Once I did, the GPU was clearly ~3× faster.&lt;/p&gt;
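&lt;p&gt;Converting between the two units is one multiplication. A sketch, assuming one progress-bar iteration corresponds to one optimizer step; the 0.5 it/s reading and the device counts below are hypothetical, not my measured numbers.&lt;/p&gt;

```python
def examples_per_second(it_per_s, per_device_batch, grad_accum_steps, n_devices):
    """Convert optimizer steps/second into training examples/second."""
    return it_per_s * per_device_batch * grad_accum_steps * n_devices


# Hypothetical: both machines report the same 0.5 it/s.
one_device = examples_per_second(0.5, per_device_batch=16, grad_accum_steps=2, n_devices=1)
two_devices = examples_per_second(0.5, per_device_batch=16, grad_accum_steps=2, n_devices=2)

print(f"1 device : {one_device:.0f} examples/s")
print(f"2 devices: {two_devices:.0f} examples/s")
```

&lt;p&gt;Identical &lt;code&gt;it/s&lt;/code&gt; hides a 2× throughput gap as soon as the device count differs, which is why the step counts in the two runs didn't match either.&lt;/p&gt;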

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;~96% accuracy again — a frozen 1.5B model + a few-MB adapter matched the fully-fine-tuned 270M model from Part 1, with a saved artifact roughly &lt;strong&gt;1000× smaller&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that &lt;code&gt;card_arrival&lt;/code&gt; vs &lt;code&gt;card_delivery_estimate&lt;/code&gt; confusion from Part 1? &lt;strong&gt;Still there.&lt;/strong&gt; Bigger model, different technique, identical mistake. (We resolve that mystery in Part 4.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/sumanpro/qlora-fine-tuning-a-7b-model-on-a-16gb-gpu-it-shrank-to-54gb-in-front-of-me-28n4"&gt;Part 3&lt;/a&gt;: I fit a &lt;strong&gt;7-billion&lt;/strong&gt;-parameter model onto a 16GB GPU that can't even load it normally. That's QLoRA.&lt;/p&gt;

&lt;p&gt;📓 &lt;strong&gt;Full runnable notebook on Kaggle:&lt;/strong&gt; &lt;a href="https://www.kaggle.com/code/sumannath88/02-lora-qwen2-5-1-5b" rel="noopener noreferrer"&gt;https://www.kaggle.com/code/sumannath88/02-lora-qwen2-5-1-5b&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with PyTorch + Hugging Face Transformers + PEFT. Questions or corrections welcome in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Fine-Tuned a 270M Model on My Laptop (Full Fine-Tuning, From Scratch)</title>
      <dc:creator>Suman Nath</dc:creator>
      <pubDate>Sun, 21 Jun 2026 12:08:45 +0000</pubDate>
      <link>https://dev.to/sumanpro/i-fine-tuned-a-270m-model-on-my-laptop-full-fine-tuning-from-scratch-3p4l</link>
      <guid>https://dev.to/sumanpro/i-fine-tuned-a-270m-model-on-my-laptop-full-fine-tuning-from-scratch-3p4l</guid>
      <description>&lt;p&gt;I wanted to actually &lt;em&gt;understand&lt;/em&gt; fine-tuning — not run a tutorial and nod along. So I gave myself a constraint: &lt;strong&gt;same task, three techniques, smallest model to largest.&lt;/strong&gt; Full fine-tuning, then LoRA, then QLoRA. Hold the task fixed and the only variable is the method.&lt;/p&gt;

&lt;p&gt;This first post is full fine-tuning — the most powerful and most expensive option: &lt;strong&gt;update every weight in the model.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The task
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/datasets/mteb/banking77" rel="noopener noreferrer"&gt;Banking77&lt;/a&gt;: ~13,000 real bank customer-support messages, 77 intents like &lt;code&gt;card_arrival&lt;/code&gt;, &lt;code&gt;lost_or_stolen_card&lt;/code&gt;, &lt;code&gt;exchange_rate&lt;/code&gt;. The model reads a message and names the intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model: deliberately tiny
&lt;/h2&gt;

&lt;p&gt;I picked &lt;strong&gt;Gemma 3, 270M parameters&lt;/strong&gt; — small enough to fully fine-tune on a laptop (Apple Silicon / MPS). That's intentional: full fine-tuning stores gradients and optimizer state for &lt;em&gt;every&lt;/em&gt; parameter, roughly 4× the model's size in memory. I wanted to &lt;em&gt;feel&lt;/em&gt; that, not read about it.&lt;/p&gt;
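&lt;p&gt;Where the 4× comes from, as a rough sketch: it assumes fp32 values and an Adam-style optimizer that keeps two moment buffers per parameter (an assumption; this post doesn't name the optimizer), and it ignores activations entirely. &lt;code&gt;full_finetune_gb&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def full_finetune_gb(n_params, bytes_per_value=4):
    """Rough fp32 training memory with an Adam-style optimizer, ignoring activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value        # one gradient per weight
    adam_state = 2 * n_params * bytes_per_value   # first and second moment
    return (weights + gradients + adam_state) / 1e9


N = 270e6  # parameter count of the model in this post

weights_only = N * 4 / 1e9
training = full_finetune_gb(N)
print(f"weights only: {weights_only:.2f} GB")
print(f"training    : {training:.2f} GB")
```

&lt;p&gt;Activations come on top of this and grow with batch size and sequence length, so treat the figure as a floor.&lt;/p&gt;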

&lt;h2&gt;
  
  
  One design decision: generate the label, don't classify it
&lt;/h2&gt;

&lt;p&gt;The obvious approach is to bolt a 77-way classification head onto the model. I didn't. Instead I had the model &lt;strong&gt;generate the intent as text&lt;/strong&gt; — literally output &lt;code&gt;card_arrival&lt;/code&gt;. Why? Because that's the same shape as instruction-tuning, so the later LoRA/QLoRA projects build naturally on this one.&lt;/p&gt;

&lt;p&gt;The key detail is masking the loss so the model is graded &lt;em&gt;only&lt;/em&gt; on the label tokens, not the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# build "prompt + label", but set prompt tokens to -100 so the loss ignores them
&lt;/span&gt;&lt;span class="n"&gt;prompt_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;target_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;label_name&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eos_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;add_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt_ids&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;target_ids&lt;/span&gt;
&lt;span class="n"&gt;labels&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;target_ids&lt;/span&gt;   &lt;span class="c1"&gt;# only the label is graded
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you skip that masking, the model spends its capacity learning to reproduce the prompt instead of the answer.&lt;/p&gt;
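&lt;p&gt;To see what the -100 entries do to the number being optimized, here's a toy version of the averaging in pure Python. It assumes the loss is a mean of per-token negative log-likelihoods that skips label -100 (PyTorch's default &lt;code&gt;ignore_index&lt;/code&gt;); the probabilities and token ids are made up, and &lt;code&gt;mean_nll&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import math

IGNORE_INDEX = -100  # label value the loss skips


def mean_nll(token_log_probs, labels, use_mask=True):
    """Average negative log-likelihood, optionally skipping masked positions."""
    kept = [
        -log_p
        for log_p, label in zip(token_log_probs, labels)
        if not (use_mask and label == IGNORE_INDEX)
    ]
    return sum(kept) / len(kept)


# Hypothetical sequence: 3 prompt tokens the model predicts badly,
# then 2 label tokens it predicts fairly well.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
labels = [IGNORE_INDEX] * 3 + [11, 12]  # made-up token ids for the label

masked = mean_nll(log_probs, labels)                    # label tokens only
unmasked = mean_nll(log_probs, labels, use_mask=False)  # prompt tokens dominate
print(f"masked loss  : {masked:.3f}")
print(f"unmasked loss: {unmasked:.3f}")
```

&lt;p&gt;In the unmasked case most of the loss, and so most of the gradient, comes from prompt tokens the model is never asked to produce at inference time.&lt;/p&gt;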

&lt;h2&gt;
  
  
  The thing that surprised me: full FT is fragile
&lt;/h2&gt;

&lt;p&gt;Because you're updating &lt;em&gt;all&lt;/em&gt; the pretrained weights, a too-high learning rate shreds the model's existing knowledge. I used 5e-5 and it trained cleanly. Bumping to 2e-4 destabilized it. The training config is otherwise unremarkable — and that's the point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# small, on purpose
&lt;/span&gt;    &lt;span class="n"&gt;lr_scheduler_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bf16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fp16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# fp32 on MPS for stability
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(The later projects &lt;em&gt;freeze&lt;/em&gt; the base, which is exactly why they can tolerate a much higher learning rate: the pretrained weights never change, so there's no fragile knowledge for the optimizer to wreck.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;~96% on the common intents. A near-perfect diagonal confusion matrix. A 270M model, fully fine-tuned on a laptop, nailing the task.&lt;/p&gt;

&lt;p&gt;The one persistent slip: it confused &lt;strong&gt;&lt;code&gt;card_arrival&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;card_delivery_estimate&lt;/code&gt;&lt;/strong&gt;. Keep that in mind — it shows up in &lt;em&gt;every&lt;/em&gt; project in this series, and the reason why is the punchline of Part 4.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/sumanpro/lora-i-trained-1-of-a-15b-model-and-matched-a-full-fine-tune-41if"&gt;Part 2&lt;/a&gt;, I take a model 5× bigger and train less than 1% of it — and get the same accuracy. That's LoRA.&lt;/p&gt;

&lt;p&gt;📓 &lt;strong&gt;Full runnable notebook on Kaggle:&lt;/strong&gt; &lt;a href="https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m" rel="noopener noreferrer"&gt;https://www.kaggle.com/code/sumannath88/01-full-finetune-gemma270m&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with PyTorch + Hugging Face Transformers. Questions or corrections welcome in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
