Fine-Tune SLMs in Colab for Free: A 4-Bit Approach with Meta Llama 3.2

Rishab Dugar

UnSloth Guides the Llama’s Fine-Tuning Ritual — a whimsical stylized llama illustration

Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a complete beginner or an experienced ML tinkerer, this guide walks you through the simplest and most efficient way to fine-tune LLaMA models on free GPUs using Google Colab. Best of all? No fancy hardware or deep ML theory required.

This article breaks down every keyword, library, and function, defining each term precisely but in the simplest language possible.

In this article, you’ll learn how to:

  • Install and configure Unsloth in Colab
  • Load models in quantized (4-bit) mode to save memory
  • Understand core concepts (parameters, weights, biases, quantization, etc.)
  • Apply PEFT and LoRA adapters to fine-tune only a small part of the model
  • Prepare Q&A data for training with Hugging Face Datasets and chat templates
  • Use SFTTrainer for supervised fine-tuning
  • Switch to inference mode for faster generation
  • Save and reload your fine-tuned model

Getting Comfortable with Some Core Concepts

Llama’s Machine Learning Meditation — stylized llama in lotus pose

Disclaimer: I promise this will be the friendliest GenAI glossary — your cheat sheet, wittier than autocorrect and way less judgmental! 😉

Language Model — Word-Predictor to put it simply! — like a smart autocomplete that predicts the next word based on what came before, much like your phone suggests “you” after “how are”. It learns these patterns by “reading” massive amounts of text, so it knows common word sequences and can fill in blanks. Behind the scenes, it models the probabilities of word sequences, assigning higher scores to more natural continuations.

Attention — Imagine you’re reading a sentence and want to know which earlier words matter most to understand each new word — that’s attention. Instead of reading a sentence strictly left-to-right, attention lets every word weigh how much it should consider all the others, like skimming a page and highlighting only key phrases. This selective focus makes predictions more accurate and efficient, ignoring irrelevant details.
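
If you like seeing the math, here is a tiny NumPy sketch of the core idea (scaled dot-product attention). It is an illustration only: real Transformer layers add learned projections, multiple heads, and masking on top of this.

import numpy as np

def attention(Q, K, V):
    # How much should each token "look at" every other token?
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                                        # weighted mix of the other tokens

np.random.seed(0)
Q = K = V = np.random.randn(3, 4)   # three toy "tokens", each a 4-number vector
print(attention(Q, K, V).shape)     # (3, 4): one blended vector per token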

Parameter — A number inside a model that can change during learning (like a dial that the model tweaks)

Weight — Mostly synonymous with parameter; it controls how strongly one part of the input affects the output

Data vs Parameters vs Weights — “Data” refers to the information used to train a model, “parameters” are the values the model learns from that data, and “weights” are a specific type of parameter representing the strength of connections between model variables. To put it more simply: data is the input, parameters are what the model adjusts to make predictions, and weights are a subset of those parameters.

Bias — A small extra number added so the model can shift outputs up or down, like a baseline adjustment

Transformer — A Transformer is a special model built around attention, letting it “look” at every word in a sentence in parallel rather than one by one. It’s like a study group where everyone reads the entire essay at once and then discusses which sentences are most important to the main idea. Introduced in 2017 in Google’s “Attention Is All You Need” paper, Transformers are the backbone of today’s LLMs and power everything from translation tools to chatbots.

Quantization — Reducing precision of weights (e.g. 16-bit → 4-bit) to slash memory use, with minimal accuracy loss.
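
As a toy illustration of what quantization does, the sketch below rounds a handful of random “weights” onto just 16 levels (4 bits) and measures the rounding error. The scheme used in practice (4-bit NF4 with per-block scales, via bitsandbytes) is more sophisticated, but the trade-off is the same: fewer levels, far less memory, a small approximation error.

import numpy as np

weights = np.random.randn(8).astype(np.float32)   # pretend these are model weights

levels = 2 ** 4                                   # 4 bits: only 16 possible values
scale = (weights.max() - weights.min()) / (levels - 1)
q = np.round((weights - weights.min()) / scale)   # integers 0..15
restored = q * scale + weights.min()              # map back to approximate floats

print("max rounding error:", np.abs(weights - restored).max())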

PEFT — (Parameter-Efficient Fine-Tuning) — updating only tiny adapter layers instead of the whole model.

LoRA — (Low-Rank Adaptation) — A smart shortcut for teaching a huge AI model new tricks by only tweaking a tiny part of it instead of retraining the entire thing. You “freeze” most of the model’s parameters and insert two small, trainable matrices into each layer. During fine-tuning, only these add-ons learn, drastically cutting down on time and compute cost.

LoRA “r” — The adapter’s rank (size). Higher r gives more capacity but uses more memory.

LoRA α (alpha) — A scaling factor for adapter updates — like a “volume knob” for learning strength.
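
Putting LoRA, r, and α together, here is a minimal NumPy sketch of the idea (the sizes and initialization are illustrative, not Llama’s actual ones): the big matrix W stays frozen, and only the two small matrices A and B would ever be trained.

import numpy as np

d, r, alpha = 512, 16, 16            # hidden size, adapter rank, scaling factor
W = np.random.randn(d, d)            # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01     # small trainable matrix (r x d)
B = np.zeros((d, r))                 # starts at zero, so training begins from the original model

x = np.random.randn(d)                    # one input vector
y = W @ x + (alpha / r) * (B @ (A @ x))   # original output + scaled low-rank correction
print(y.shape)                            # (512,)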

Dropout — Randomly turning off some adapter connections during training to prevent overfitting (can be set to 0).

Gradient Checkpointing — Recomputes parts of the model during backpropagation to halve peak VRAM usage, at a slight speed cost.

4-bit Mode — Quantized mode storing weights in 4 bits, cutting memory roughly 4× compared to 16-bit (and about 8× compared to 32-bit).

Inference Mode — After training, use a special mode optimized for fast text generation (2× speed).

Overfitting — When a model “memorizes” a tiny dataset and fails on new inputs — always test on unseen data.

Checkpoint — A saved snapshot of model weights you can reload later.

Token — A token is a small chunk of text (~4–5 characters) — a word, part of a word, punctuation mark, or symbol — that serves as the basic unit a model processes.

Tokenizer — The tokenizer is the program that “cuts” raw text into those tokens and then converts each token into a unique number (ID) the model can work with (e.g., “unhappiness” → “un”, “happi”, “ness” → “un” = 137, “happi” = 428, etc).
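
If you want to see tokenization in action, the snippet below uses Hugging Face’s AutoTokenizer. GPT-2’s tokenizer is used here purely because it is small and ungated; Llama’s tokenizer will split the same word differently and assign different IDs, so treat the exact output as illustrative.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer works for illustration

tokens = tok.tokenize("unhappiness")          # text -> sub-word pieces
ids = tok.convert_tokens_to_ids(tokens)       # pieces -> integer IDs the model sees
print(tokens, ids)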

SLMs vs. LLMs

LMs Ultimate Showdown — stylized small vs. large model illustration

  • SLMs (Small Language Models) have fewer parameters and focus on specific tasks or domains — think of them as pocket calculators solving one type of math problem quickly and efficiently.
  • LLMs (Large Language Models) are like supercomputers trained on vast, diverse data; they can tackle many tasks — writing essays, summarizing articles, or coding — because they’ve “read” almost the entire internet.
  • SLMs require less compute power and are ideal for on-device or specialized applications, whereas LLMs need massive cloud resources but offer broader versatility.
  • In practice, you might use an SLM for customer-service chat on your phone, but call an LLM when you need deep research help or creative story generation.

1. Getting Started: Colab Setup

Why Google Colab & Tesla T4?

  • Cost: Free GPU access, that's all!
  • Performance: Tesla T4 handles mid-size LLMs effectively when paired with quantization and PEFT
  • Accessibility: No local GPU required — ideal for beginners

Installing Unsloth

# Stable release from PyPI:
pip install unsloth

# OR

# Install the Nightly (latest GitHub) for cutting-edge features:
pip uninstall unsloth -y
pip install --upgrade --no-cache-dir --no-deps \
  git+https://github.com/unslothai/unsloth.git@nightly \
  git+https://github.com/unslothai/unsloth-zoo.git
  • pip install unsloth: grabs the vetted, stable version
  • uninstall & install: fetches the newest commits from GitHub (may include experimental updates)
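
Once the install finishes, a quick optional sanity check confirms the package is present and that Colab actually gave you a GPU runtime:

import torch
from importlib.metadata import version

print("unsloth version:", version("unsloth"))    # confirms the package is installed
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU            :", torch.cuda.get_device_name(0))   # e.g. "Tesla T4"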

2. Loading a Model Efficiently

We load the Llama 3.2 1B model using Unsloth in a memory-efficient 4-bit quantization mode, which uses roughly one-quarter the memory of 16-bit precision (and about one-eighth of full 32-bit), so it runs faster and fits on small GPUs. We also set how long each input can be (up to 2048 tokens).

from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048      # How many tokens each input can have
dtype = None               # None for auto detection. Float16 for Tesla T4, V100; Bfloat16 for Ampere+
load_in_4bit = True        # Use 4-bit quantization to reduce memory usage
model_name = "unsloth/Llama-3.2-1B-Instruct"

# Load both model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
  • FastLanguageModel.from_pretrained: downloads and prepares the model + tokenizer
  • max_seq_length: sets the maximum context length
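
As a quick optional check, you can see how little GPU memory the 4-bit model actually occupies after loading (exact numbers vary by model and runtime):

import torch

print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")   # memory held by tensors
print(f"reserved : {torch.cuda.memory_reserved() / 1024**3:.2f} GB")    # memory held by the allocator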

Diagram showing memory usage and model loading output

3. Introducing PEFT & LoRA

Instead of updating all model weights (which can be billions), PEFT adds small adapter layers you train. LoRA is one such method.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,          # Adapter rank (size). Suggested: 8, 16, 32, 64, 128
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj"
    ],
    lora_alpha=16,                  # Scales adapter updates
    lora_dropout=0,                 # No dropout
    bias="none",                    # Skip bias updates
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
    random_state=3407,              # For reproducibility
    use_rslora=False,               # Optional advanced LoRA variant
    loftq_config=None,              # Optional LoftQ config
)
  • r controls the size of the LoRA layers; higher uses more memory.
  • lora_alpha is like a “volume knob” for learning strength.
  • use_gradient_checkpointing="unsloth" trades compute for lower VRAM.
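
To see just how “parameter-efficient” this is, you can count the trainable parameters yourself; with LoRA it is typically a small fraction of the total. A short sketch, assuming the model variable from the cell above:

# Assumes `model` is the PEFT-wrapped model returned by get_peft_model above
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")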

4. Preparing Your Dataset for Training

When preparing datasets for fine-tuning models like LLaMA 3.1 and Phi-4, format multi-turn conversations according to each model’s expected structure.

🦙 LLaMA 3.1: Chat Template Format

LLaMA 3.1 wraps each message with special tokens:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there! How can I assist you today?<|eot_id|>
  • <|begin_of_text|> marks the start
  • <|start_header_id|>/<|end_header_id|> mark roles
  • <|eot_id|> ends each message

Use Unsloth’s standardize_sharegpt to convert existing data into this format.

Phi-4: ChatML Format

Phi-4 uses ChatML-style JSON:

{
  "messages": [
    {"role": "system",    "content": "You are a helpful assistant."},
    {"role": "user",      "content": "Can you explain machine learning?"},
    {"role": "assistant", "content": "Certainly! Machine learning is..."}
  ]
}

🔄 Converting Between Formats

  1. Identify current format (CSV, ShareGPT, ChatML).
  2. Convert to a ShareGPT-like structure (from/value).
  3. Standardize to role/content with standardize_sharegpt.
  4. Apply chat template via get_chat_template and apply_chat_template.

Below, set USE_CSV = True or False to choose your data source.

Custom Dataset (CSV)

We used a fictional 30-question dataset from “Eastern Caverns” to illustrate fine-tuning on domain-specific Q&A.
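
The actual “Eastern Caverns” file isn’t reproduced here, but all the CSV needs is a question column and an answer column. If you want a stand-in file to run the notebook against, you can write one like this (the row below is invented purely for illustration):

import pandas as pd

# Hypothetical row for illustration only; the real file holds 30 domain Q&A pairs
pd.DataFrame({
    "question": ["What backup options are available for the CavernDB cluster?"],
    "answer":   ["Nightly snapshots plus weekly full backups to cold storage."],
}).to_csv("your_data.csv", index=False)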

Configuration & Imports

# Configuration & Imports
USE_CSV     = True                 # False → load a ShareGPT dataset instead
CSV_PATH    = "your_data.csv"      # CSV must have 'question' & 'answer' columns
SHAREGPT_DS = "mlabonne/FineTome-100k"  # HF ShareGPT-style dataset

import pandas as pd
from datasets import Dataset, load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

Uploaded CSV file screenshot showing file path

Loading & Wrapping Data

if USE_CSV:
    df = pd.read_csv(CSV_PATH)
    ds = Dataset.from_pandas(df)

    def to_sharegpt_format(ex):
        return {
            "conversations": [
                {"from": "system", "value": "You are a Victorian-era assistant…"},
                {"from": "human",  "value": ex["question"]},
                {"from": "gpt",    "value": ex["answer"]},
            ]
        }
    ds = ds.map(to_sharegpt_format, remove_columns=df.columns.tolist())
else:
    ds = load_dataset(SHAREGPT_DS, split="train")

Standardizing & Applying the Chat Template

ds = standardize_sharegpt(ds)

CHAT_TEMPLATE = "llama-3.1"
tokenizer = get_chat_template(tokenizer, chat_template=CHAT_TEMPLATE)

def format_prompts(examples):
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        )
        for convo in examples["conversations"]
    ]
    return {"text": texts}

nds = ds.map(format_prompts, batched=True)

print("===== BEFORE =====")
print(nds[0]["conversations"])
print("===== AFTER =====")
print(nds[0]["text"])

Console output showing before and after dataset conversation formatting

5. Supervised Fine Tuning with SFTTrainer

Too many things here 😮‍💨 — let’s explore them one by one.

🛠️ SFTTrainer

  • Purpose: Trainer for Supervised Fine-Tuning (SFT) of LLMs.
  • Why: Streamlines fine-tuning with built-in utilities.

Key Components

  • model, tokenizer, train_dataset — The model, tokenizer, and dataset (nds).
  • dataset_text_field="text" — Uses the "text" field for input.
  • DataCollatorForSeq2Seq — Pads and batches seq2seq data.
  • TrainingArguments — Hyperparameters (batch size, learning rate, epochs, etc.).
  • per_device_train_batch_size=2 — Examples per device.
  • gradient_accumulation_steps=4 — Simulate larger batches.
  • warmup_steps=5 — Smooth learning rate start.
  • max_steps=100 — Total training steps (when set, this overrides num_train_epochs).
  • optim="adamw_8bit" — 8-bit AdamW optimizer for memory savings.
  • output_dir="outputs" — Where checkpoints go.

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=nds,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        max_steps=100,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        output_dir="outputs",
        report_to="none",
    ),
)
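
Before kicking off training, it can be handy to note how much GPU memory is already reserved by the quantized model plus adapters, so you can see the training overhead afterwards. An optional check:

import torch

gpu = torch.cuda.get_device_properties(0)
reserved = torch.cuda.max_memory_reserved() / 1024**3
print(f"{gpu.name}: {reserved:.2f} GB reserved of {gpu.total_memory / 1024**3:.2f} GB total")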

6. Kicking Off the Training

Use Unsloth’s train_on_responses_only to compute loss only on the assistant’s output:

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

stats = trainer.train()

You’ll see the loss drop steadily — proof that your 4-bit Llama 3.2 is learning efficiently.
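
If you’d rather see numbers than squint at the progress bar, the logged losses can be pulled straight out of the trainer (standard transformers Trainer bookkeeping), assuming the trainer object from the cell above:

# With logging_steps=1, every step logs a "loss" entry
losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(f"first loss: {losses[0]:.3f}  ->  last loss: {losses[-1]:.3f}  ({len(losses)} logged steps)")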

Training loss curve showing decrease over steps

7. Inference & Saving Your Model

Fast Inference Mode

model = FastLanguageModel.for_inference(model)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "<Your Question Here>"}],
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(input_ids=inputs, max_new_tokens=256)
print(tokenizer.batch_decode(outputs)[0])
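
If you’d like to watch the answer appear token by token instead of waiting for the full reply, transformers’ TextStreamer plugs straight into model.generate. A small sketch, reusing model, tokenizer, and inputs from the cell above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)   # skip_prompt hides the echoed question
_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=256)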

Example generated response in notebook output

Save & Reload

# Save
model.save_pretrained("/content/drive/MyDrive/my_llama3_model_eastern_caverns")
tokenizer.save_pretrained("/content/drive/MyDrive/my_llama3_model_eastern_caverns")

# Reload in 4-bit mode (from_pretrained returns both the model and the tokenizer)
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/drive/MyDrive/my_llama3_model_eastern_caverns",
    load_in_4bit=True,
    max_seq_length=2048,
)

# Quick test
inputs = tokenizer("What backup options are available for the CavernDB cluster?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output showing saved model reload test

Conclusion

The Pharaoh’s Workshop of Wisdom with UnSloth Guide — ornate workshop scene

Fine-tuning a large language model is all about balancing precision, speed, and the right data.

  1. Precision vs. Quantization
  • Full precision (FP32) → about 4.3 billion distinct values per weight
  • 4-bit quantization → 16 levels (tiny rounding error for big memory savings)
  2. Why 4-Bit Helps
  • 7B params need ~14 GB in FP16 (≈28 GB in FP32) but only ~3.5 GB in 4-bit (a quick back-of-the-envelope check follows below)
  • Avoids out-of-memory on free GPUs (Colab, Kaggle)
  3. Unsloth’s Speed Boost
  • Up to 2× faster fine-tuning, ~70% VRAM reduction
  • Memory-efficient kernels without accuracy loss
  4. Picking the Right Dataset
  • Too small → overfitting; too big/wild → underfitting
  • Aim for focused, high-quality examples in proper order
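
Here is the back-of-the-envelope math behind point 2, for the weights alone (activations, optimizer state, and the KV cache all add more on top):

params = 7e9   # a 7B-parameter model

for name, bytes_per_weight in [("FP32", 4), ("FP16", 2), ("4-bit", 0.5)]:
    print(f"{name:>5}: {params * bytes_per_weight / 1024**3:.1f} GB just for the weights")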

That’s your bird’s-eye view: squeezing precision, saving resources, turbo-charging training with Unsloth, and choosing data wisely. Feedback and questions are always welcome!

One last thing — if you’ve made it this far, I’ve dropped the Colab notebook link in the comments below. Feel free to dive in and give it a spin 😉 !


Further Reading & Resources

#LLM #fine-tuning #Unsloth #tutorial #PEFT #LoRA #4-bit #quantization #Colab #T4GPU #SFTTrainer #inference #optimization

Top comments (1)

Rishab Dugar:

Colab Notebook - Llama trained on Eastern Caverns data

colab.research.google.com/drive/19...

If this article was helpful, show some love ❤️ & do follow for more 🤗