I turned a base Llama 3.2 model into a step-by-step reasoning machine using a free Colab GPU. Here's exactly how it works.
So you've heard the buzz around fine-tuning LLMs, but every tutorial either drowns you in math or skips the "why" entirely. This article cuts through both. We'll cover the core concepts — LoRA, QLoRA, quantization — and then walk through a real coding demo that produced a genuinely surprising result: a fine-tuned Llama 3 model that reasons through problems step-by-step, just like DeepSeek.
Let's get into it.
The Problem With Base LLMs
A pre-trained model like Llama or GPT is incredibly general. That's its superpower — and its limitation.
Your company has private data the model has never seen. Your brand has a specific tone, vocabulary, and format for responses. A base LLM doesn't know any of that.
You have two main ways to fix this:
Option 1: RAG (Retrieval-Augmented Generation)
RAG doesn't retrain the model at all. Instead, it retrieves relevant chunks from an external knowledge source (a database, PDFs, docs) and injects them into the prompt at inference time.
Pros: Cheap, fast to set up, no GPU needed.
Cons: The model still responds in its own generic voice. It can sound robotic, off-brand, or miss the nuanced tone your use case needs.
Option 2: Fine-Tuning
Fine-tuning actually retrains the model on your specific dataset — teaching it new knowledge, tone, format, or reasoning style.
Pros: Precise outputs, correct brand voice, specific response formats.
Cons: Computationally expensive, requires a curated dataset, takes time.
In practice, the industry often combines both — fine-tune for tone and format, use RAG for dynamic knowledge retrieval.
What Is Fine-Tuning, Really?
Fine-tuning is a form of transfer learning. You take a model that already understands language, then continue training it on a smaller, task-specific dataset. The model adapts its weights to fit your new data without forgetting everything it already learned.
The challenge? A 7-billion parameter model has billions of weights to update. Full fine-tuning requires enormous GPU memory — often inaccessible to most developers.
That's where LoRA comes in.
LoRA: Low-Rank Adaptation
LoRA is a Parameter Efficient Fine-Tuning (PEFT) method. The key insight is elegant:
Instead of updating all the original model weights, freeze them — and train only a small set of new parameters layered on top.
How It Works
During full fine-tuning, you'd update a weight matrix W by computing a change ΔW (delta W). The problem is that ΔW is the same enormous size as W.
LoRA's trick: decompose ΔW into two much smaller matrices, A and B, where:
ΔW = A × B
If W is a 1024×256 matrix (262,144 parameters), and you set rank R = 16, then:
- Matrix A is 1024×16
- Matrix B is 16×256
- Total trainable parameters: (1024×16) + (16×256) = 20,480 instead of 262,144
That's a ~93% reduction in parameters to train.
Key Hyperparameter: Rank (R)
The rank R controls the trade-off between efficiency and expressiveness.
- Lower R (e.g., 4, 8) → fewer parameters, faster training, less expressive
- Higher R (e.g., 16, 32) → more parameters, can capture more complex adaptations
Common starting values are R = 8 or R = 16.
QLoRA: When Your GPU Is Too Small
LoRA is efficient, but the base model still needs to fit in memory. A 7B parameter model at full float32 precision requires ~28 GB of GPU RAM. Most consumer GPUs and free Colab runtimes offer 12–16 GB.
QLoRA solves this by adding quantization before applying LoRA.
What Is Quantization?
Quantization converts high-precision numbers to lower-precision formats to save memory:
| Format | Bytes per value | 7B Model Memory |
|---|---|---|
| float32 | 4 bytes | ~28 GB |
| int8 | 1 byte | ~7 GB |
| NF4 (4-bit) | 0.5 bytes | ~3.5 GB |
You're essentially compressing the model before training on it. The accuracy loss is surprisingly minimal for most tasks.
NF4: Normal Float 4
NF4 isn't just regular 4-bit quantization. Neural network weights follow a roughly normal (bell curve) distribution, and NF4 is specifically designed for this:
- Instead of equal-width bins, NF4 uses percentile-based bins that pack more precision where the data is dense (near zero) and less where it's sparse (at the extremes).
- This makes NF4 far more accurate than naive 4-bit quantization.
The Three Pillars of QLoRA
- 4-bit NF4 Quantization — Compress the base model to fit in less memory.
- Double Quantization — Quantize the quantization constants themselves (the scale and zero-point values), squeezing out a few more MB.
- Paged Optimizers — Uses NVIDIA's unified memory to page optimizer states between GPU and CPU RAM when the GPU runs out of space, preventing out-of-memory crashes during training.
Together, these let you fine-tune a 7B+ model on a single consumer GPU.
Hands-On: Fine-Tuning Llama 3.2 on Google Colab
Let's make this concrete. Here's how the actual demo was built.
Setup
Environment: Google Colab with a T4 GPU (free tier)
Library: Unsloth — a highly optimized fine-tuning library
Model: Llama 3.2 3B Instruct
Dataset: ServiceNow R1 Distill SFT — 172,000 rows of math/logic puzzles with problems, step-by-step reasoning chains, and solutions
pip install unsloth
Step 1: Load the Model with 4-bit Quantization
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct",
max_seq_length=2048,
load_in_4bit=True, # QLoRA quantization
)
Step 2: Configure LoRA
model = FastLanguageModel.get_peft_model(
model,
r=16, # Rank
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
)
What are q_proj, k_proj, etc.?
These are the attention projection matrices inside the transformer. LoRA is applied specifically to these layers because they carry the most task-relevant learning signal.
Step 3: Format the Dataset
Each training example is structured as a prompt combining the problem, reasoning, and solution:
def format_prompt(example):
return f"""
Problem: {example['problem']}
<thinking>
{example['thought']}
</thinking>
Solution: {example['solution']}
"""
The <thinking> tags teach the model to externalize its reasoning — this is how we get the DeepSeek-like behavior.
Step 4: Train with SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
num_train_epochs=3,
per_device_train_batch_size=2,
learning_rate=2e-4,
output_dir="outputs",
),
)
trainer.train()
Training time: ~20 minutes on a T4 GPU
Final loss: ~0.48 ✅
The Result
After fine-tuning, the model was asked:
"How many R's are in the word 'strawberry'?"
Before fine-tuning (base model): Confidently gives the wrong answer.
After fine-tuning: The model works through it step-by-step:
<thinking>
Let me count each letter in "strawberry":
s-t-r-a-w-b-e-r-r-y
Position 3: r
Position 8: r
Position 9: r
That's 3 R's total.
</thinking>
The word "strawberry" contains 3 R's.
That reasoning pattern — breaking the problem down, checking each step — came entirely from the fine-tuning on the reasoning dataset. The base model didn't do this. The fine-tuned model does.
When Should You Use Each Approach?
| Scenario | Best Approach |
|---|---|
| Add recent/private knowledge | RAG |
| Change response tone or format | Fine-Tuning |
| Teach step-by-step reasoning | Fine-Tuning |
| Limited GPU resources | QLoRA |
| Moderate GPU, need flexibility | LoRA |
| Production system at scale | RAG + Fine-Tuning combined |
Key Takeaways
- Fine-tuning adapts a pre-trained model to your specific task, tone, or data — but it's expensive without the right tools.
- LoRA makes it efficient by training only a small set of low-rank adapter matrices, leaving original weights frozen.
- QLoRA takes this further by quantizing the base model to 4-bit (NF4) first, making large model fine-tuning possible on consumer hardware.
- With Unsloth + Google Colab, you can fine-tune a 3B parameter model in under 30 minutes for free.
- The results can be genuinely impressive — a reasoning style the base model didn't have, emergent from the training data alone.
What's Next?
If you want to go deeper:
- Try different ranks (R=4 vs R=32) and observe the loss difference
- Experiment with which target modules to apply LoRA to
- Combine your fine-tuned model with a RAG pipeline for maximum power
- Push your model to Hugging Face Hub and share it
Fine-tuning used to require a research lab. Now it takes a free GPU and 30 minutes. There's never been a better time to build something genuinely your own.
Found this useful? Drop a reaction or share it with someone building with LLMs. Questions or corrections? I'm all ears in the comments.
Top comments (0)