Choyon

Posted on Jun 1

qwen2.5-lora-finetuning-colab

#beginners #programming #tutorial #ai

This guide walks through the complete process of fine-tuning Qwen2.5-3B-Instruct, a 3-billion-parameter instruction-tuned language model, on a custom dataset, using only a free Google Colab notebook.

The goal is not just to show you which functions to call. Every hyperparameter decision is explained, every design trade-off is justified, and the reasoning behind the hardware-specific configurations is made explicit. By the end, you will have:

A fine-tuned model trained on your own data
Lightweight LoRA adapter weights saved permanently to Google Drive
A fully merged, standalone model ready for deployment to any inference runtime

Why Google Colab Is the Right Environment for This Work

The honest answer to "why Colab" is compute cost. Training a language model on a local CPU is not a realistic option — the same job that finishes in under an hour on a GPU would take days or weeks on even a fast processor. Renting equivalent GPU capacity from AWS or Azure runs between $0.50 and $2.00 per hour. Colab's free tier provides access to an NVIDIA Tesla T4 — a server-grade GPU with 16GB of VRAM — at no cost.

Beyond the hardware, Colab removes every environment configuration problem. There is no CUDA installation, no driver management, no risk of dependency conflicts with your local machine. The runtime is pre-configured, isolated, and accessed entirely through a browser tab.

The one real limitation worth knowing upfront: Colab's free tier imposes session time limits and will disconnect during extended idle periods. Any data written only to the local virtual machine is permanently lost when the session ends. This is precisely why saving model checkpoints to Google Drive — which Colab integrates with natively via two lines of Python — is a non-negotiable part of the workflow, not an afterthought.

Prerequisites

| Requirement | Notes |
| --- | --- |
| Training Dataset | A .jsonl file, uploaded to /content/ in the Colab environment |
| Colab Runtime | Must be configured to use a GPU: Runtime → Change runtime type → T4 GPU |

Dataset Format

Your training data must be a JSONL file (one JSON object per line) where each record contains a messages key structured as a list of role/content pairs:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What should I do?"}, {"role": "assistant", "content": "Here is what I'd suggest..."}]}

Each line is a complete, independent conversation. Upload this file to your Colab environment and name it train.jsonl before starting Step 2.
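Before uploading, it can be worth checking that every line really has this shape, since a single malformed record will otherwise surface as a confusing error in Step 2. The following is a minimal sketch, not part of the original workflow: the function name validate_record and the rule that a conversation should end on an assistant turn are assumptions made for illustration.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i} has empty or non-string content")

    # Assumption: for supervised fine-tuning, the last turn should be the
    # assistant reply the model is expected to learn.
    if not problems and messages[-1].get("role") != "assistant":
        problems.append("conversation does not end with an assistant turn")
    return problems
```

Running it over the file line by line and printing the line number alongside any non-empty result makes bad records easy to locate.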

Step 0 — Installing Dependencies
Create a new code cell and run the following. This must always be the first cell executed in a fresh session.

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate transformers

If the above fails for any reason, the standard release is a valid fallback:

!pip install unsloth

What each package does:

| Package | Purpose |
| --- | --- |
| unsloth | Accelerates fine-tuning speed and significantly reduces VRAM usage versus standard Hugging Face training |
| xformers | Memory-efficient attention kernels from Meta Research |
| trl | Provides the SFTTrainer class that manages the supervised fine-tuning loop |
| peft | Handles LoRA adapter configuration and parameter injection |
| accelerate | Hugging Face's distributed training backend |
| transformers | Core library for loading models, tokenizers, and generation utilities |

Allow 2–4 minutes for installation to complete before proceeding.

Step 1 — Loading the Base Model and Configuring LoRA
Create a new code cell. This step loads the base model with memory optimizations applied and wraps it with trainable LoRA adapter layers.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = 1024,
    dtype          = None,
    load_in_4bit   = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r                          = 32,
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"],
    lora_alpha                 = 64,
    lora_dropout               = 0.05,
    bias                       = "none",
    use_gradient_checkpointing = "unsloth",
    random_state               = 42,
    use_rslora                 = True,
)

Loading the Model
load_in_4bit = True: The single most important memory decision in this pipeline. A 3B parameter model loaded at 16-bit precision needs roughly 6–7GB of VRAM for the weights alone, which leaves little headroom on a T4 once activations, gradients, and optimizer state are added. 4-bit quantization compresses those weights to approximately 1.5–2GB, leaving the remaining memory available for training.

max_seq_length = 1024: Activation memory grows with sequence length, quadratically for the attention score matrix in a naive implementation and closer to linearly with memory-efficient attention kernels. Capping at 1024 tokens prevents memory spikes from long training examples without meaningfully restricting conversational data.

dtype = None — Defers precision detection to Unsloth, which selects Float16 or Bfloat16 depending on what the hardware natively supports.
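The memory figures above follow from simple arithmetic on parameter count and bits per weight. The sketch below is a back-of-the-envelope estimate only: it assumes exactly three billion parameters and ignores quantization metadata, activations, and framework overhead, which is why real numbers land somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 3e9  # assumption: "3B" rounded to exactly three billion parameters

fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(params, 4)   # 4-bit quantized weights

print(f"16-bit: {fp16_gb:.1f} GB | 4-bit: {int4_gb:.1f} GB")
```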

Configuring LoRA
The core challenge with fine-tuning a 3B parameter model is that updating all 3 billion weights is computationally prohibitive and risks overwriting the base model's general capabilities. LoRA solves this by freezing the original weights entirely and injecting small trainable adapter matrices alongside specific layers. Only the adapters are updated during training.

r = 32 — The rank of the adapter matrices. Higher rank increases the adapter's learning capacity but also its memory footprint. 32 is a well-validated choice for single-GPU fine-tuning with a dataset of moderate size.

lora_alpha = 64: A scaling factor applied to the adapter's updates before they are combined with the frozen base weights. Setting it to 2 × r is a common rule of thumb that gives the adapter a measurable effect on the model's final behavior.

use_rslora = True: Activates Rank-Stabilized LoRA, which changes the adapter multiplier from α / r to α / √r. Under the standard α / r scaling, the update shrinks as the rank grows, so higher ranks learn more slowly than they should; the √r form keeps the update magnitude steady across ranks. With the values here the multiplier is 64 / √32 ≈ 11.3 rather than 64 / 32 = 2.

target_modules — Specifies which weight matrices receive adapter injections. Targeting all major attention projections (q_proj, k_proj, v_proj, o_proj) and feed-forward layers (gate_proj, up_proj, down_proj) yields substantially better adaptation quality than targeting attention layers alone.

use_gradient_checkpointing = "unsloth" — During a standard forward pass, every intermediate activation tensor is held in VRAM for use in the backward pass. Gradient checkpointing discards these tensors and recomputes them on demand during backpropagation. This trades a modest amount of compute time for significant VRAM savings.
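To make the rank and scaling settings above concrete, the sketch below computes the adapter multiplier under both conventions and the number of trainable parameters LoRA adds to a single weight matrix. The 2048 × 2048 matrix size is an illustrative assumption, not a value read from the model's configuration.

```python
import math


def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the adapter update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


standard = lora_scale(alpha=64, r=32, use_rslora=False)
stabilized = lora_scale(alpha=64, r=32, use_rslora=True)

d = 2048  # illustrative square projection size (assumption)
adapter_params = lora_param_count(d, d, r=32)
full_params = d * d

print(f"alpha/r = {standard:.2f}, alpha/sqrt(r) = {stabilized:.2f}")
print(f"adapter trains {adapter_params / full_params:.2%} of this matrix's weights")
```

One consequence worth noticing: the 2 × r rule of thumb comes from the α / r convention, so with RSLoRA enabled the same alpha produces a noticeably larger multiplier. If training looks unstable, lowering alpha is a reasonable first experiment.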

Proof of Concept
The screenshot below shows a successful run of this cell. Unsloth downloads the base model files, confirms the hardware environment (Tesla T4, 14.5GB VRAM), and reports which layers it has patched.

Notice the final log lines confirming dropout has been applied at the configured rate and that Unsloth has patched 36 layers. This is the expected output for a clean initialization.

Step 2 — Dataset Preparation and Chat Template Formatting
Create a new code cell. This step reads your raw training file and transforms it into the token structure that Qwen 2.5 requires during training.

import json
from datasets import Dataset

# Load the JSONL dataset
records = []
with open("/content/train.jsonl", "r") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

raw_dataset = Dataset.from_list(records)

# Apply the Qwen 2.5 chat template
def format_chat(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False
    )
    return {"text": text}

dataset = raw_dataset.map(format_chat, remove_columns=raw_dataset.column_names)

# 90/10 train-evaluation split
split          = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset  = split["train"]
eval_dataset   = split["test"]

print(f"Train: {len(train_dataset)} | Eval: {len(eval_dataset)}")

Reading the Data
The file is read line by line rather than loaded as a single JSON array. Each line is parsed independently, so a malformed record fails on its own line instead of invalidating the whole file. Note that the records are still collected in a Python list, so memory use grows with the dataset; for files too large for that, datasets.load_dataset("json", data_files=...) reads JSONL directly. The list is then converted to a Hugging Face Dataset object, which provides the .map() and .train_test_split() operations used in subsequent steps.

Applying the Chat Template
Instruction-tuned models do not train on plain text. They depend on special control tokens embedded in the input to distinguish where a user's message ends and where the assistant's response begins. Qwen 2.5 uses tokens such as <|im_start|> and <|im_end|> for this purpose.

tokenizer.apply_chat_template() handles this formatting automatically and correctly. Hardcoding these tokens manually is error-prone and risks mismatches between training format and the model's pre-trained expectations.

tokenize=False: Formats conversations into readable strings rather than token IDs. Tokenization is left to SFTTrainer, which processes the text column itself and pads each batch dynamically, and the dataset stays inspectable during debugging.

add_generation_prompt=False — During training, the dataset must include both the prompt and the expected response. Setting this to False prevents a trailing <|im_start|>assistant token from being appended, which would signal the model to generate rather than train on a complete conversation.

remove_columns — Removes the original raw columns during the mapping step, retaining only the formatted text column. This reduces the dataset's memory footprint for all subsequent shuffling and caching operations.
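To see roughly what the formatted text column contains, the sketch below renders a conversation in the ChatML layout that the <|im_start|> and <|im_end|> tokens belong to. It is a simplified approximation for illustration only: the real template shipped with the tokenizer may add details such as a default system prompt, so always use tokenizer.apply_chat_template() for actual training data. The function name render_chatml is hypothetical.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Approximate ChatML rendering: one <|im_start|>role ... <|im_end|> block per turn."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn and leave it unfinished so the model continues it.
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "user", "content": "What should I do?"},
    {"role": "assistant", "content": "Here is what I'd suggest..."},
]
print(render_chatml(conversation))
```

Printing one or two rows of dataset["text"] and comparing them against this layout is a quick sanity check that the template was applied.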

The Train/Evaluation Split
Reserving a portion of data for evaluation is not optional. Without a separate eval set, there is no way to distinguish between a model that is learning generalizable patterns and one that is simply memorizing the training data. A model that overfits will show a decreasing training loss but an increasing or stagnating validation loss — a distinction you cannot observe without the split.

test_size=0.1 — Reserves 10% of the formatted data for evaluation. The model's weights are never updated using this slice.

seed=42 — Fixes the random split for reproducibility. If training is interrupted and restarted with adjusted hyperparameters, the split will be identical, preventing evaluation data from leaking into the training pool between runs.
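The effect of a fixed seed can be illustrated without the datasets library. The sketch below is a stand-in for the idea, not a reimplementation of train_test_split(): the shuffling algorithm differs, so the actual row assignment will not match. The total of 522 records is inferred from the 469 training and 53 evaluation examples reported in the training run later in this guide.

```python
import math
import random


def split_indices(n: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle row indices with a fixed seed and carve off the evaluation slice."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_size)
    return indices[n_test:], indices[:n_test]


train_a, eval_a = split_indices(522, test_size=0.1, seed=42)
train_b, eval_b = split_indices(522, test_size=0.1, seed=42)

print(len(train_a), len(eval_a))   # sizes of the two slices
print(eval_a == eval_b)            # same seed, same evaluation rows
```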

Step 3 — Configuring and Launching the Training Loop
Create a new code cell. This is where the model learns. Every hyperparameter below was chosen with the specific constraints of a Tesla T4 GPU in mind.

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model         = model,
    tokenizer     = tokenizer,
    train_dataset = train_dataset,
    eval_dataset  = eval_dataset,
    args          = SFTConfig(
        dataset_text_field          = "text",
        max_seq_length              = 1024,
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        num_train_epochs            = 5,
        learning_rate               = 2e-4,
        warmup_steps                = 20,
        weight_decay                = 0.01,
        lr_scheduler_type           = "cosine",
        bf16                        = False,
        fp16                        = True,
        logging_steps               = 5,
        save_steps                  = 50,
        eval_strategy               = "steps",
        eval_steps                  = 50,
        output_dir                  = "/content/genz-model-v2",
        save_total_limit            = 2,
        load_best_model_at_end      = True,
        packing                     = False,
        report_to                   = "none",
    ),
)

trainer.train()

Mixed-Precision Configuration
fp16 = True with bf16 = False is a hardware-specific choice. The Tesla T4 is a Turing-generation GPU: its tensor cores accelerate Float16, but native Bfloat16 support only arrived with the later Ampere generation. Mixed-precision fp16 roughly halves the memory needed for activations compared with full float32 training, and the trainer's loss scaling keeps small gradients from underflowing. Setting bf16 = True on a T4 will typically either be rejected by the trainer's hardware check or run through slow emulation, so leave it off here.
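If the notebook may run on different GPUs, the two flags can be derived from the device rather than hardcoded. The sketch below is an illustrative assumption, not part of the original notebook: it uses the CUDA compute capability as the deciding signal, with the T4 at 7.5 and Ampere starting at 8.0. The helper name pick_precision is hypothetical.

```python
def pick_precision(compute_capability: tuple[int, int]) -> dict[str, bool]:
    """Choose mixed-precision flags from a CUDA compute capability (major, minor)."""
    # Native bfloat16 arrives with compute capability 8.0 (Ampere) and later.
    supports_bf16 = compute_capability >= (8, 0)
    return {"bf16": supports_bf16, "fp16": not supports_bf16}


print(pick_precision((7, 5)))  # Tesla T4
print(pick_precision((8, 0)))  # A100

# In a live session the tuple would come from torch.cuda.get_device_capability(),
# or torch.cuda.is_bf16_supported() can be consulted directly.
```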

VRAM Management Through Gradient Accumulation
per_device_train_batch_size = 1 loads a single training example onto the GPU at any given moment. This is the smallest possible batch size, and it keeps peak activation memory as low as the chosen sequence length allows.

Training with a batch size of 1 is inherently noisy — weight updates based on a single example are highly variable and produce unstable learning. Gradient accumulation resolves this. With gradient_accumulation_steps = 4, the model performs four consecutive forward passes, accumulates the resulting gradients mathematically, and applies a single weight update after all four. The effective batch size becomes 4 (1 × 4) without ever requiring four sequences to be held in VRAM simultaneously.
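The claim that accumulation behaves like a larger batch can be checked on a toy problem. The sketch below uses a one-parameter model with squared-error loss and made-up data points (both are assumptions for illustration). It compares the gradient of a true batch of four against four single-example gradients accumulated one at a time.

```python
def grad(w: float, x: float, y: float) -> float:
    """Gradient of the squared error (w*x - y)^2 with respect to w."""
    return 2 * x * (w * x - y)


examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]  # toy (x, y) pairs
w = 0.5
accum_steps = 4

# One real batch of four: average the per-example gradients.
batch_grad = sum(grad(w, x, y) for x, y in examples) / len(examples)

# Batch size 1 with accumulation: each micro-step contributes grad / accum_steps,
# and the weight update only happens after the last one.
accumulated = 0.0
for x, y in examples:
    accumulated += grad(w, x, y) / accum_steps

print(batch_grad, accumulated)
```

Both numbers are identical, which is the whole point: only one example's activations are ever in memory, yet the update matches a batch of four.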

Learning Rate Schedule
learning_rate = 2e-4 — A standard starting point for LoRA-based fine-tuning. High enough to drive meaningful adaptation, conservative enough to avoid overwriting the base model's general knowledge.

lr_scheduler_type = "cosine" — The cosine scheduler decreases the learning rate smoothly from 2e-4 toward zero over the course of training, following a cosine curve. This produces fast learning in early epochs and precise, stable refinement as the model approaches convergence.

warmup_steps = 20: LoRA initializes one adapter matrix randomly and the other to zero, so the adapter starts as a no-op and the optimizer's running gradient statistics are unreliable for the first few updates. Ramping the learning rate linearly from zero to 2e-4 over the first 20 steps keeps those early updates small until the statistics settle.

weight_decay = 0.01: A mild regularizer that shrinks the adapter weights slightly at every update, discouraging the model from over-indexing on specific phrasings or patterns present in the training data.
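The combined shape of the warmup and cosine decay described above can be written as a single function of the step number. This is a sketch of the usual linear-warmup-plus-cosine formula, using the 590 total steps from the run shown later in this guide; the trainer computes its own schedule internally, so treat this as a way to build intuition rather than as the trainer's code.

```python
import math


def lr_at(step: int, base_lr: float = 2e-4, warmup_steps: int = 20,
          total_steps: int = 590) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 10, 20, 305, 590):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

The rate climbs to its peak at step 20, passes through half the peak at the midpoint of the decay, and reaches zero on the final step.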

Validation and Checkpointing
eval_strategy = "steps" with eval_steps = 50 — Every 50 training steps, the trainer runs the model over the held-out evaluation set and logs the validation loss. This provides continuous visibility into whether training loss and validation loss are moving together or diverging.

load_best_model_at_end = True — If validation loss begins rising after an initial improvement — a clear sign of overfitting — the final exported model will be the checkpoint that achieved the lowest validation loss, not the last training step.

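save_total_limit = 2: Caps the number of checkpoints kept on disk at two and deletes older ones automatically, so the local filesystem does not fill up. Because load_best_model_at_end is enabled, the best checkpoint by validation loss is always one of the two retained, alongside the most recent.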
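The selection logic behind these settings is simple enough to sketch. The evaluation log below is hypothetical, invented purely to show a run where validation loss bottoms out and then rises; it is not the output of the run in this guide.

```python
# Hypothetical (step, validation loss) pairs, one per evaluation.
eval_log = [(50, 1.64), (100, 1.50), (150, 1.42), (200, 1.45), (250, 1.51)]

# load_best_model_at_end: pick the checkpoint with the lowest validation loss.
best_step, best_loss = min(eval_log, key=lambda entry: entry[1])

# With a limit of two, the best checkpoint and the most recent one survive.
last_step = eval_log[-1][0]
kept = sorted({best_step, last_step})

print(f"best checkpoint: step {best_step} (loss {best_loss})")
print(f"checkpoints kept on disk: {kept}")
```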

Proof of Concept
The screenshot below shows a complete training run. The Unsloth header confirms the configuration: 469 training examples, 53 evaluation examples, 5 epochs, 590 total steps, gradient accumulation of 4, effective batch size of 4.

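The loss table is the most important thing to examine here. Training loss and validation loss both decrease across checkpoints, from 1.363 / 1.637 at step 50 to 0.345 / 1.379 at step 250. Validation loss is still falling, so the model is continuing to learn patterns that generalize. Note, however, that the gap between the two has widened from about 0.27 to about 1.03: training loss is dropping much faster than validation loss, an early sign that the model is starting to fit training-specific detail. This is exactly the situation load_best_model_at_end exists for.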
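The 590 total steps in the Unsloth header can be reproduced from the other numbers it reports. The sketch below assumes the last partial accumulation group in each epoch still counts as a step (rounding up); the exact rounding can differ between trainer versions, but this version matches the figure shown.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int,
                          accum_steps: int, epochs: int) -> int:
    """Optimizer updates over a full run, rounding each epoch up to whole steps."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * accum_steps))
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=469, batch_size=1, accum_steps=4, epochs=5)
print(steps)
```

Knowing the total is useful when choosing warmup_steps, eval_steps, and save_steps, since all three are expressed in the same unit.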

Step 4 — Saving the Model Locally
Create a new code cell. Once trainer.train() completes, the final fine-tuned weights are held in GPU memory (the periodic checkpoints under output_dir are intermediate snapshots). This step writes the final adapter to the local virtual machine filesystem.

model.save_pretrained("/content/genz-model")
tokenizer.save_pretrained("/content/genz-model")
print("Saved!")

Because the training used LoRA, model.save_pretrained() does not write a full copy of the 3-billion parameter base model. It exports only the adapter files:

adapter_config.json: LoRA configuration (rank, alpha, target modules)
adapter_model.safetensors: the learned adapter weight tensors
tokenizer.json: vocabulary and special token mappings
tokenizer_config.json: chat template and tokenizer settings

The total export size is typically between 50MB and 500MB depending on the rank setting, a small fraction of the roughly 6GB that the 16-bit base model occupies.

Saving the tokenizer into the same directory as the adapter weights is required. When this directory is loaded for inference later, the tokenizer must reconstruct the exact string-to-token-ID mappings used during training. Loading a mismatched tokenizer produces degraded or incoherent outputs.

One important caveat: files saved to /content/ live on Colab's temporary virtual machine and are permanently deleted when the session ends. Step 6 writes the same files to Google Drive for permanent storage.
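A quick way to confirm the export looks as described is to list the saved directory with file sizes. This is a small optional helper, not part of the original notebook; the name summarize_dir is hypothetical.

```python
import os


def summarize_dir(path: str) -> dict[str, int]:
    """Map each file directly inside `path` to its size in bytes."""
    return {
        entry.name: entry.stat().st_size
        for entry in os.scandir(path)
        if entry.is_file()
    }


# Usage in the notebook, after the save cell has run:
# sizes = summarize_dir("/content/genz-model")
# for name, size in sorted(sizes.items()):
#     print(f"{name:<32} {size / 1e6:8.1f} MB")
```

If adapter_model.safetensors is missing or only a few kilobytes, the save did not capture the trained adapter and should be repeated before moving on.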

Step 5 — Inference Testing to Validate the Fine-Tune
Create a new code cell. Before moving the model to persistent storage, confirm that the fine-tuning produced the intended behavior by running the model against a set of test prompts.

test_prompts = [
    "my manager scheduled a 6am sync for tomorrow",
    "i spent my last $20 on a game i already own",
    "okay this is embarrassing but i look forward to talking to you every day",
]

for prompt in test_prompts:
    messages = [
        {
            "role": "system",
            "content": "You are a chaotic Gen Z companion who uses dark humor, ironic emojis, fluent slang, and can shift between cynical banter and unhinged romantic energy based on the user's vibe."
        },
        {"role": "user", "content": prompt}
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to("cuda")

    attention_mask = (inputs != tokenizer.pad_token_id).long()

    outputs = model.generate(
        input_ids          = inputs,
        attention_mask     = attention_mask,
        max_new_tokens     = 150,
        temperature        = 0.85,
        top_p              = 0.92,
        repetition_penalty = 1.15,
        do_sample          = True,
        pad_token_id       = tokenizer.eos_token_id,
    )

    print(f"User:  {prompt}")
    print(f"Model: {tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)}\n")

Key Differences from the Training Configuration
Several settings change for inference:

tokenize=True: During training, data was kept as human-readable strings and tokenized later by the trainer. For inference, the conversion to token ID tensors happens immediately so the GPU can process the input.

add_generation_prompt=True — During training this was False because we needed complete conversations in the data. For inference, we want the <|im_start|>assistant marker appended at the end of the input, signaling to the model that it should now generate a response.

.to("cuda") — Moves the input tensor to GPU memory. The model weights and the input tensor must reside on the same device.

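attention_mask = (inputs != tokenizer.pad_token_id).long(): Explicitly tells the self-attention layers which positions are real content and which are padding. A single prompt like this one contains no padding, so the mask is all ones here; passing it anyway avoids the warning generate() can emit when it has to infer the mask, and the mask becomes essential as soon as several prompts of different lengths are padded into one batch.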

Generation Parameters
temperature = 0.85 — Controls the randomness of token selection. Values approaching zero make the model deterministic and repetitive. Values above 1.0 produce increasingly incoherent output. At 0.85, the model retains creative range while remaining structurally coherent.

top_p = 0.92 — Nucleus sampling restricts token selection to the smallest set of candidates whose combined probability mass reaches 92%, filtering out the least probable and most anomalous options. This preserves varied word choice without permitting genuinely nonsensical outputs.

repetition_penalty = 1.15: Language models can fall into repetition loops, regenerating the same phrase or token over and over. A penalty of 1.15 lowers the score of tokens that have already appeared in the sequence so far (prompt included), pushing the model to continue developing its response rather than cycling.
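The interaction between temperature and top_p is easier to see on a toy distribution. The sketch below is a plain-Python illustration of the two ideas, not the generation code transformers actually runs; the logits and probabilities are made up.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


cool = softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5)
warm = softmax_with_temperature([2.0, 1.0, 0.0], temperature=1.0)
print(cool[0] > warm[0])  # the top token gains probability as temperature drops

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.92)
print(sorted(nucleus))    # token ids that survive the cut
```

In the second example the least likely token is dropped because the first three already cover more than 92% of the probability mass.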

Isolating the Generated Text
outputs[0][inputs.shape[1]:]
model.generate() returns the complete token sequence including the input prompt. Slicing the output tensor from position inputs.shape[1] onward isolates only the tokens the model generated, producing clean output for logging and evaluation.

Proof of Concept
The screenshot below shows the model responding to all three test prompts after fine-tuning. The outputs reflect the trained persona — character-consistent tone, appropriate slang, and contextually relevant responses to each prompt.

None of these prompts appeared verbatim in the training data. Contextually appropriate, persona-consistent responses to unseen inputs suggest that the fine-tune generalized rather than memorized, although three prompts are a smoke test, not an evaluation.

Step 6 — Persisting the Model to Google Drive
Create a new code cell. The files saved in Step 4 exist only in the Colab virtual machine and will be deleted when the session terminates. This step writes the model to permanent cloud storage.

from google.colab import drive
drive.mount('/content/drive')

model.save_pretrained("/content/drive/MyDrive/genz-model")
tokenizer.save_pretrained("/content/drive/MyDrive/genz-model")
print("Saved to Drive!")

drive.mount('/content/drive') establishes an authenticated connection between the Colab runtime and your Google account. After a one-time browser authorization, your Google Drive is accessible as a standard filesystem path at /content/drive/MyDrive/. Files written to this path are synced to cloud storage in the background, so a large write may take a short while to appear in Drive; calling drive.flush_and_unmount() forces any pending writes to finish.

The practical consequence: once the files have synced, a crashed or expired Colab session (a realistic scenario for long-running jobs) no longer costs you the model. It is already in Drive.

Storing the model in Drive also means any future Colab session can reload it with drive.mount() in seconds, without retraining. The same files can be shared with collaborators, downloaded to a local machine, or loaded into a completely separate notebook for continued experimentation.
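An alternative to calling save_pretrained() a second time is to copy the directory already saved in Step 4 and confirm that every file arrived. The sketch below is one hedged way to do that with the standard library. The helper name copy_and_verify is hypothetical, and comparing file sizes is a lightweight check rather than a full integrity guarantee.

```python
import shutil
from pathlib import Path


def copy_and_verify(src: str, dst: str) -> list[str]:
    """Copy a directory tree and return relative paths whose sizes do not match."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    mismatched = []
    for src_file in Path(src).rglob("*"):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(src)
        dst_file = Path(dst) / relative
        if not dst_file.is_file() or dst_file.stat().st_size != src_file.stat().st_size:
            mismatched.append(str(relative))
    return mismatched


# Usage in the notebook, after drive.mount():
# problems = copy_and_verify("/content/genz-model", "/content/drive/MyDrive/genz-model")
# print("all files copied" if not problems else f"check these files: {problems}")
```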

Step 7 — Merging Adapters into a Standalone Model
Create a new code cell. At this stage, your fine-tuned weights exist as LoRA adapter files — small matrices that modify the frozen base model at runtime. This is sufficient for most use cases, but it has a dependency: any runtime loading these weights must also load the original Qwen2.5-3B-Instruct base model.

To create a truly portable, dependency-free model — one that can be converted to GGUF for local use, served by vLLM, or pushed to the Hugging Face Hub as a standalone artifact — the adapters must be fused back into the base model's weight tensors and exported as a single unit.

Part A — Reload the Adapter from Drive

from unsloth import FastLanguageModel

print("Reloading the trained model and tokenizer...")

model_path = "/content/drive/MyDrive/genz-model"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = model_path,
    max_seq_length = 1024,
    dtype          = None,
    load_in_4bit   = True,
)

Unsloth detects that the specified path contains LoRA adapter weights, looks up the base model recorded in adapter_config.json, downloads it from the Hugging Face Hub, and binds the adapters to it automatically. Keep max_seq_length and load_in_4bit the same as in training so the reloaded model matches the configuration the adapter was trained against.
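Before reloading, it can help to check which base model the adapter directory points at, since that is what will be downloaded. The sketch below reads a few fields from adapter_config.json. The field names follow the PEFT adapter configuration format, and the helper name read_adapter_summary is hypothetical.

```python
import json
from pathlib import Path

FIELDS = ("base_model_name_or_path", "r", "lora_alpha", "target_modules")


def read_adapter_summary(adapter_dir: str) -> dict:
    """Return the key LoRA settings stored alongside the adapter weights."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    # Fields absent from the file come back as None instead of raising.
    return {key: config.get(key) for key in FIELDS}


# Usage in the notebook:
# print(read_adapter_summary("/content/drive/MyDrive/genz-model"))
```

The recorded base model name may be an Unsloth-hosted variant rather than the exact string passed in Step 1, which is expected.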

Part B — Fuse and Export

final_save_path = "/content/drive/MyDrive/genz-qwen-3b-complete"

print("Fusing adapter into base model weights...")

model.save_pretrained_merged(
    final_save_path,
    tokenizer,
    save_method = "merged_16bit"
)

print("Merge complete! You now have a fully independent custom Qwen model.")

save_pretrained_merged() performs the mathematical fusion: the low-rank adapter delta matrices are scaled and added directly into the base model's weight tensors (W_merged = W_base + B × A × α/√r), then the result is exported in full 16-bit precision as a standard Hugging Face model directory.
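The merge formula above can be traced by hand on matrices small enough to read. The sketch below applies W_merged = W_base + (B × A) × scale to a 2 × 2 example in plain Python. The matrices and the rank of 1 are toy values chosen for illustration; real merges run on large tensors and also handle dequantization, which is omitted here.

```python
import math


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w_base, lora_b, lora_a, alpha: float, r: int, use_rslora: bool = True):
    """Fold a LoRA update into the base weights: W + scale * (B @ A)."""
    scale = alpha / math.sqrt(r) if use_rslora else alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]


w_base = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (toy)
lora_b = [[1.0], [2.0]]             # B: d_out x r
lora_a = [[0.5, 0.0]]               # A: r x d_in

merged = merge_lora(w_base, lora_b, lora_a, alpha=2.0, r=1)
print(merged)
```

After the merge, the adapter matrices are no longer needed: the result has the same shape as the base weights and can be saved as an ordinary model.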

The usual alternative is PEFT's merge_and_unload(), which is normally run against an unquantized copy of the base model and therefore needs enough memory to hold the full 16-bit weights before merging. Unsloth's save_pretrained_merged() handles dequantizing the 4-bit base, merging, and saving in a single call.

The output at genz-qwen-3b-complete is a fully independent model. It carries no adapter configuration files, no dependency on the original base model, and no requirement for the peft or unsloth libraries to load. It can be used with any standard AutoModelForCausalLM workflow, converted to GGUF with llama.cpp, served via vLLM or TGI, or published directly to the Hugging Face Hub.

Proof of Concept
The screenshot below shows a successful merge run. Unsloth reloads the adapter, downloads the base model shard, fuses the weights at 16-bit precision, and confirms the output location.

The final confirmation line — Merge complete. Saved to /content/drive/MyDrive/genz-qwen-3b-complete — indicates that the fused model has been written to persistent storage and is ready for use.

The architecture decisions in this guide — 4-bit quantization, LoRA with RSLoRA, gradient accumulation, cosine learning rate decay, checkpoint selection by validation loss — are not arbitrary. Each one exists to solve a specific constraint: fitting a 3-billion parameter training job into the memory envelope of hardware that costs nothing to run. Understanding the reasoning behind each choice matters as much as copying the code, because the next fine-tuning project you run will present a different set of constraints that require the same kind of deliberate decision-making.

One thing worth stating plainly: the code and configuration in this guide are only half the equation. The quality of your training data determines the ceiling of what the model can become. A well-structured dataset — with consistent formatting, clear intent in every example, and enough diversity to cover the behavior you want — will produce a model that feels genuinely intelligent and capable. Weak or inconsistent data produces the opposite: a model that technically runs but responds in ways that feel hollow or off. The dataset used in this guide was a personal test run, not a production-grade corpus, and the outputs reflect that. Treat your data as seriously as your code, and the results will follow.
