Fine-Tune SLMs in Colab for Free: A 4-Bit Approach with Meta Llama 3.2

Rishab Dugar

UnSloth Guides the Llama’s Fine-Tuning Ritual — a whimsical stylized llama illustration

Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.

This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.

In this article, you’ll learn how to:

  • Install and configure Unsloth in Colab
  • Load models in quantized (4-bit) mode to save memory
  • Understand core concepts (parameters, weights, biases, quantization, etc.)
  • Apply PEFT and LoRA adapters to fine-tune only a small part of the model
  • Prepare Q&A data for training with Hugging Face Datasets and chat templates
  • Use SFTTrainer for supervised fine-tuning
  • Switch to inference mode for faster generation
  • Save and reload your fine-tuned model

Getting Comfortable with Some Core Concepts

Llama’s Machine Learning Meditation — stylized llama in lotus pose

Disclaimer: I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental! 😉

Language Model — Word-Predictor to put it simply! — like a smart autocomplete that predicts the next word based on what came before, much like your phone suggests “you” after “how are”. It learns these patterns by “reading” massive amounts of text, so it knows common word sequences and can fill in blanks. Behind the scenes, it models the probabilities of word sequences, assigning higher scores to more natural continuations.

Attention — Imagine you’re reading a sentence and want to know which earlier words matter most to understand each new word — that’s attention. Instead of reading a sentence strictly left-to-right, attention lets every word weigh how much it should consider all the others, like skimming a page and highlighting only key phrases. This selective focus makes predictions more accurate and efficient, ignoring irrelevant details.
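
If you like seeing the math, here is a tiny NumPy sketch of the core idea (scaled dot-product attention). It is an illustration only: real Transformer layers add learned projections, multiple heads, and masking on top of this.

import numpy as np

def attention(Q, K, V):
    # How much should each token "look at" every other token?
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                                        # weighted mix of the other tokens

np.random.seed(0)
Q = K = V = np.random.randn(3, 4)   # three toy "tokens", each a 4-number vector
print(attention(Q, K, V).shape)     # (3, 4): one blended vector per token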

Parameter — A number inside a model that can change during learning (like a dial that the model tweaks)

Weight — Mostly synonymous with parameter; it controls how strongly one part of the input affects the output

Data vs Parameters vs Weights — “Data” refers to the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing the strength of connections between model variables. To put it more simply: data is the input, parameters are what the model adjusts to make predictions, and weights are a subset of those parameters.

Bias — A small extra number added so the model can shift outputs up or down, like a baseline adjustment

Transformer — A Transformer is a special model built around attention, letting it “look” at every word in a sentence in parallel rather than one by one. It’s like a study group where everyone reads the entire essay at once and then discusses which sentences are most important to the main idea. Introduced in 2017 in Google’s “Attention Is All You Need” paper, Transformers are the backbone of today’s LLMs and power everything from translation tools to chatbots.

Quantization — Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.
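
As a toy illustration of what quantization does, the sketch below rounds a handful of random “weights” onto just 16 levels (4 bits) and measures the rounding error. The scheme used in practice (4-bit NF4 with per-block scales, via bitsandbytes) is more sophisticated, but the trade-off is the same: fewer levels, far less memory, a small approximation error.

import numpy as np

weights = np.random.randn(8).astype(np.float32)   # pretend these are model weights

levels = 2 ** 4                                   # 4 bits: only 16 possible values
scale = (weights.max() - weights.min()) / (levels - 1)
q = np.round((weights - weights.min()) / scale)   # integers 0..15
restored = q * scale + weights.min()              # map back to approximate floats

print("max rounding error:", np.abs(weights - restored).max())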

PEFT — (Parameter-Efficient Fine-Tuning) — updating only tiny adapter layers instead of the whole model.

LoRA — (Low-Rank Adaptation) — A smart shortcut for teaching a huge AI model new tricks by only tweaking a tiny part of it instead of retraining the entire thing. You “freeze” most of the model’s parameters and insert two small, trainable matrices into each layer. During fine-tuning, only these add-ons learn, drastically cutting down on time and compute cost.

LoRA “r” — The adapter’s rank (size). Higher r gives more capacity but uses more memory.

LoRA α (alpha) — A scaling factor for adapter updates — like a “volume knob” for learning strength.
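
Putting LoRA, r, and α together, here is a minimal NumPy sketch of the idea (the sizes and initialization are illustrative, not Llama’s actual ones): the big matrix W stays frozen, and only the two small matrices A and B would ever be trained.

import numpy as np

d, r, alpha = 512, 16, 16            # hidden size, adapter rank, scaling factor
W = np.random.randn(d, d)            # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01     # small trainable matrix (r x d)
B = np.zeros((d, r))                 # starts at zero, so training begins from the original model

x = np.random.randn(d)                    # one input vector
y = W @ x + (alpha / r) * (B @ (A @ x))   # original output + scaled low-rank correction
print(y.shape)                            # (512,)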

Dropout — Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).

Gradient Checkpointing — Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.

4-bit Mode — Quantized mode storing weights in 4 bits, cutting memory roughly 4× compared to 16-bit (and about 8× compared to 32-bit).

Inference Mode — After training, use a special mode optimized for fast text generation (2× speed).

Overfitting — When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.

Checkpoint — A saved snapshot of model weights you can reload later.

Token — A token is a small chunk of text (~4–5 characters) — a word, part of a word, punctuation mark, or symbol — that serves as the basic unit a model processes.

Tokenizer — The tokenizer is the program that “cuts” raw text into those tokens and then converts each token into a unique number (ID) the model can work with (e.g., “unhappiness” → “un”, “happi”, “ness” → “un” = 137, “happi” = 428, etc).
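
If you want to see tokenization in action, the snippet below uses Hugging Face’s AutoTokenizer. GPT-2’s tokenizer is used here purely because it is small and ungated; Llama’s tokenizer will split the same word differently and assign different IDs, so treat the exact output as illustrative.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer works for illustration

tokens = tok.tokenize("unhappiness")          # text -> sub-word pieces
ids = tok.convert_tokens_to_ids(tokens)       # pieces -> integer IDs the model sees
print(tokens, ids)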

SLMs vs. LLMs

LMs Ultimate Showdown — stylized small vs. large model illustration

  • SLMs (Small Language Models) have fewer parameters and focus on specific tasks or domains — think of them as pocket calculators solving one type of math problem quickly and efficiently.
  • LLMs (Large Language Models) are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding — because they’ve “read” almost the entire internet.
  • SLMs require less compute power and are ideal for on-device or specialized applications, whereas LLMs need massive cloud resources but offer broader versatility.
  • In practice, you might use an SLM for customer-service chat on your phone, but call an LLM when you need deep research help or creative story generation.

1. Getting Started: Colab Setup

Why Google Colab & Tesla T4?

  • Cost: Free GPU access, that's all!
  • Performance: Tesla T4 handles mid-size LLMs effectively when paired with quantization and PEFT
  • Accessibility: No local GPU required — ideal for beginners

Installing Unsloth

# Stable release from PyPI:
pip install unsloth

# OR

# Install the Nightly (latest GitHub) for cutting-edge features:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir --no-deps \
  git+https://github.com/unslothai/unsloth.git@nightly \
  git+https://github.com/unslothai/unsloth-zoo.git
  • pip install unsloth: grabs the vetted, stable version
  • uninstall & install: fetches the newest commits from GitHub (may include experimental updates)
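
Once the install finishes, a quick optional sanity check confirms the package is present and that Colab actually gave you a GPU runtime:

import torch
from importlib.metadata import version

print("unsloth version:", version("unsloth"))    # confirms the package is installed
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU            :", torch.cuda.get_device_name(0))   # e.g. "Tesla T4"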

2. Loading a Model Efficiently

We load the Llama 3.2 1B model using Unsloth in a memory-efficient 4-bit quantization mode, which uses roughly one-quarter the memory of 16-bit precision (and about one-eighth of full 32-bit), so it runs faster and fits on small GPUs. We also set how long each input can be (up to 2048 tokens).

from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048      # How many tokens each input can have
dtype = None               # None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True        # Use 4-bit quantization to reduce memory usage
model_name = "unsloth/Llama-3.2-1B-Instruct"

# Load both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
  • FastLanguageModel.from_pretrained: downloads and prepares the model + tokenizer
  • max_seq_length: sets the maximum context length
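
As a quick optional check, you can see how little GPU memory the 4-bit model actually occupies after loading (exact numbers vary by model and runtime):

import torch

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")   # memory held by tensors
print(f"reserved : {torch.cuda.memory_reserved() / 1024**3:.2f} GB")    # memory held by the allocator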

Diagram showing memory usage and model loading output

3. Introducing PEFT & LoRA

Instead of updating all model weights (which can be billions), PEFT adds small adapter layers you train. LoRA is one such method.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,          # Adapter rank (size). Suggested: 8, 16, 32, 64, 128
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj"
    ],
    lora_alpha=16,                  # Scales adapter updates
    lora_dropout=0,                 # No dropout
    bias="none",                    # Skip bias updates
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,              # For reproducibility
    use_rslora=False,               # Optional advanced LoRA variant
    loftq_config=None,              # Optional LoftQ config
)
  • r controls the size of the LoRA layers; higher uses more memory.
  • lora_alpha is like a “volume knob” for learning strength.
  • use_gradient_checkpointing="unsloth" trades compute for lower VRAM.
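
To see just how “parameter-efficient” this is, you can count the trainable parameters yourself; with LoRA it is typically a small fraction of the total. A short sketch, assuming the model variable from the cell above:

# Assumes `model` is the PEFT-wrapped model returned by get_peft_model above
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")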

4. Preparing Your Dataset for Training

When preparing datasets for fine-tuning models like LLaMA 3.1 and Phi-4, format multi-turn conversations according to each model’s expected structure.

🦙 LLaMA 3.1: Chat Template Format

LLaMA 3.1 wraps each message with special tokens:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there! How can I assist you today?<|eot_id|>
  • <|begin_of_text|> marks the start
  • <|start_header_id|>/<|end_header_id|> mark roles
  • <|eot_id|> ends each message

Use Unsloth’s standardize_sharegpt to convert existing data into this format.

Phi-4: ChatML Format

Phi-4 uses ChatML-style JSON:

{
  "messages": [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "Can you explain machine learning?"},
    {"role": "assistant", "content": "Certainly! Machine learning is..."}
  ]
}

🔄 Converting Between Formats

  1. Identify current format (CSV, ShareGPT, ChatML).
  2. Convert to a ShareGPT-like structure (from/value).
  3. Standardize to role/content with standardize_sharegpt.
  4. Apply chat template via get_chat_template and apply_chat_template.

Below, set USE_CSV = True or False to choose your data source.

Custom Dataset (CSV)

We used a fictional 30-question dataset from “Eastern Caverns” to illustrate fine-tuning on domain-specific Q&A.
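
The actual “Eastern Caverns” file isn’t reproduced here, but all the CSV needs is a question column and an answer column. If you want a stand-in file to run the notebook against, you can write one like this (the row below is invented purely for illustration):

import pandas as pd

# Hypothetical row for illustration only; the real file holds 30 domain Q&A pairs
pd.DataFrame({
    "question": ["What backup options are available for the CavernDB cluster?"],
    "answer":   ["Nightly snapshots plus weekly full backups to cold storage."],
}).to_csv("your_data.csv", index=False)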

Configuration & Imports

# Configuration & Imports
USE_CSV     = True                 # False → load a ShareGPT dataset instead
CSV_PATH    = "your_data.csv"      # CSV must have 'question' & 'answer' columns
SHAREGPT_DS = "mlabonne/FineTome-100k"  # HF ShareGPT-style dataset

import pandas as pd
from datasets import Dataset, load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

Uploaded CSV file screenshot showing file path

Loading & Wrapping Data

if USE_CSV:
    df = pd.read_csv(CSV_PATH)
    ds = Dataset.from_pandas(df)

    def to_sharegpt_format(ex):
        return {
            "conversations": [
                {"from": "system", "value": "You are a Victorian-era assistant…"},
                {"from": "human",  "value": ex["question"]},
                {"from": "gpt",    "value": ex["answer"]},
            ]
        }
    ds = ds.map(to_sharegpt_format, remove_columns=df.columns.tolist())
else:
    ds = load_dataset(SHAREGPT_DS, split="train")

Standardizing & Applying the Chat Template

ds = standardize_sharegpt(ds)

CHAT_TEMPLATE = "llama-3.1"
tokenizer = get_chat_template(tokenizer, chat_template=CHAT_TEMPLATE)

def format_prompts(examples):
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        )
        for convo in examples["conversations"]
    ]
    return {"text": texts}

nds = ds.map(format_prompts, batched=True)

print("===== BEFORE =====")
print(nds[0]["conversations"])
print("===== AFTER =====")
print(nds[0]["text"])

Console output showing before and after dataset conversation formatting

5. Supervised Fine Tuning with SFTTrainer

Too many things here 😮‍💨 — let’s explore them one by one.

🛠️ SFTTrainer

  • Purpose: Trainer for Supervised Fine-Tuning (SFT) of LLMs.
  • Why: Streamlines fine-tuning with built-in utilities.

Key Components

  • model, tokenizer, train_dataset — The model, tokenizer, and dataset (nds).
  • dataset_text_field="text" — Uses the "text" field for input.
  • DataCollatorForSeq2Seq — Pads and batches seq2seq data.
  • TrainingArguments — Hyperparameters (batch size, learning rate, epochs, etc.).
  • per_device_train_batch_size=2 — Examples per device.
  • gradient_accumulation_steps=4 — Simulate larger batches.
  • warmup_steps=5 — Smooth learning rate start.
  • max_steps=100 — Total training steps (when set, this overrides num_train_epochs).
  • optim="adamw_8bit" — 8-bit AdamW optimizer for memory savings.
  • output_dir="outputs" — Where checkpoints go.

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=nds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        output_dir="outputs",
        report_to="none",
    ),
)
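
Before kicking off training, it can be handy to note how much GPU memory is already reserved by the quantized model plus adapters, so you can see the training overhead afterwards. An optional check:

import torch

gpu = torch.cuda.get_device_properties(0)
reserved = torch.cuda.max_memory_reserved() / 1024**3
print(f"{gpu.name}: {reserved:.2f} GB reserved of {gpu.total_memory / 1024**3:.2f} GB total")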

6. Kicking Off the Training

Use Unsloth’s train_on_responses_only to compute loss only on the assistant’s output:

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

stats = trainer.train()

You’ll see the loss drop steadily — proof that your 4-bit Llama 3.2 is learning efficiently.
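
If you’d rather see numbers than squint at the progress bar, the logged losses can be pulled straight out of the trainer (standard transformers Trainer bookkeeping), assuming the trainer object from the cell above:

# With logging_steps=1, every step logs a "loss" entry
losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(f"first loss: {losses[0]:.3f}  ->  last loss: {losses[-1]:.3f}  ({len(losses)} logged steps)")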

Training loss curve showing decrease over steps

7. Inference & Saving Your Model

Fast Inference Mode

model = FastLanguageModel.for_inference(model)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<Your Question Here>"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs)[0])
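
If you’d like to watch the answer appear token by token instead of waiting for the full reply, transformers’ TextStreamer plugs straight into model.generate. A small sketch, reusing model, tokenizer, and inputs from the cell above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)   # skip_prompt hides the echoed question
_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=256)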

Example generated response in notebook output

Save & Reload

# Save
model.save_pretrained("/content/drive/MyDrive/my_llama3_model_eastern_caverns")
tokenizer.save_pretrained("/content/drive/MyDrive/my_llama3_model_eastern_caverns")

# Reload in 4-bit mode (from_pretrained returns both the model and the tokenizer)
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/my_llama3_model_eastern_caverns",
    load_in_4bit=True,
    max_seq_length=2048,
)

# Quick test
inputs = tokenizer("What backup options are available for the CavernDB cluster?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output showing saved model reload test

Conclusion

The Pharaoh’s Workshop of Wisdom with UnSloth Guide — ornate workshop scene

Fine-tuning a large language model is all about balancing precision, speed, and the right data.

  1. Precision vs. Quantization
  • Full precision (FP32) → about 4.3 billion distinct values per weight
  • 4-bit quantization → 16 levels (tiny rounding error for big memory savings)
  2. Why 4-Bit Helps
  • 7B params need ~14 GB in FP16 (≈28 GB in FP32) but only ~3.5 GB in 4-bit (a quick back-of-the-envelope check follows below)
  • Avoids out-of-memory on free GPUs (Colab, Kaggle)
  3. Unsloth’s Speed Boost
  • Up to 2× faster fine-tuning, ~70% VRAM reduction
  • Memory-efficient kernels without accuracy loss
  4. Picking the Right Dataset
  • Too small → overfitting; too big/wild → underfitting
  • Aim for focused, high-quality examples in proper order
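
Here is the back-of-the-envelope math behind point 2, for the weights alone (activations, optimizer state, and the KV cache all add more on top):

params = 7e9   # a 7B-parameter model

for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("4-bit", 0.5)]:
    print(f"{name:>5}: {params * bytes_per_weight / 1024**3:.1f} GB just for the weights")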

That’s your bird’s-eye view: squeezing precision, saving resources, turbo-charging training with Unsloth, and choosing data wisely. Feedback and questions are always welcome!

One last thing — if you’ve made it this far, I’ve dropped the Colab notebook link in the comments below. Feel free to dive in and give it a spin 😉 !


Further Reading & Resources

#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization

Top comments (1)

Rishab Dugar:

Colab Notebook - Llama trained on Eastern Caverns data

colab.research.google.com/drive/19...

If this article was helpful, show some love ❤️ & do follow for more 🤗