M TOQEER ZIA

Posted on Jun 11

How I Fine-Tuned Llama 3 to Think Like DeepSeek — A Practical Guide to LoRA & QLoRA

#deepseek #llm #machinelearning #tutorial

I turned a base Llama 3.2 model into a step-by-step reasoning machine using a free Colab GPU. Here's exactly how it works.

So you've heard the buzz around fine-tuning LLMs, but every tutorial either drowns you in math or skips the "why" entirely. This article cuts through both. We'll cover the core concepts — LoRA, QLoRA, quantization — and then walk through a real coding demo that produced a genuinely surprising result: a fine-tuned Llama 3 model that reasons through problems step-by-step, just like DeepSeek.

Let's get into it.

The Problem With Base LLMs

A pre-trained model like Llama or GPT is incredibly general. That's its superpower — and its limitation.

Your company has private data the model has never seen. Your brand has a specific tone, vocabulary, and format for responses. A base LLM doesn't know any of that.

You have two main ways to fix this:

Option 1: RAG (Retrieval-Augmented Generation)

RAG doesn't retrain the model at all. Instead, it retrieves relevant chunks from an external knowledge source (a database, PDFs, docs) and injects them into the prompt at inference time.

Pros: Cheap, fast to set up, no GPU needed.

Cons: The model still responds in its own generic voice. It can sound robotic, off-brand, or miss the nuanced tone your use case needs.

Option 2: Fine-Tuning

Fine-tuning actually retrains the model on your specific dataset — teaching it new knowledge, tone, format, or reasoning style.

Pros: Precise outputs, correct brand voice, specific response formats.

Cons: Computationally expensive, requires a curated dataset, takes time.

In practice, the industry often combines both — fine-tune for tone and format, use RAG for dynamic knowledge retrieval.

What Is Fine-Tuning, Really?

Fine-tuning is a form of transfer learning. You take a model that already understands language, then continue training it on a smaller, task-specific dataset. The model adapts its weights to fit your new data without forgetting everything it already learned.

The challenge? A 7-billion parameter model has billions of weights to update. Full fine-tuning requires enormous GPU memory — often inaccessible to most developers.

That's where LoRA comes in.

LoRA: Low-Rank Adaptation

LoRA is a Parameter Efficient Fine-Tuning (PEFT) method. The key insight is elegant:

Instead of updating all the original model weights, freeze them — and train only a small set of new parameters layered on top.

How It Works

During full fine-tuning, you'd update a weight matrix W by computing a change ΔW (delta W). The problem is that ΔW is the same enormous size as W.

LoRA's trick: decompose ΔW into two much smaller matrices, A and B, where:

ΔW = A × B

If W is a 1024×256 matrix (262,144 parameters), and you set rank R = 16, then:

Matrix A is 1024×16
Matrix B is 16×256
Total trainable parameters: (1024×16) + (16×256) = 20,480 instead of 262,144

That's a ~93% reduction in parameters to train.

Key Hyperparameter: Rank (R)

The rank R controls the trade-off between efficiency and expressiveness.

Lower R (e.g., 4, 8) → fewer parameters, faster training, less expressive
Higher R (e.g., 16, 32) → more parameters, can capture more complex adaptations

Common starting values are R = 8 or R = 16.

QLoRA: When Your GPU Is Too Small

LoRA is efficient, but the base model still needs to fit in memory. A 7B parameter model at full float32 precision requires ~28 GB of GPU RAM. Most consumer GPUs and free Colab runtimes offer 12–16 GB.

QLoRA solves this by adding quantization before applying LoRA.

What Is Quantization?

Quantization converts high-precision numbers to lower-precision formats to save memory:

Format	Bytes per value	7B Model Memory
float32	4 bytes	~28 GB
int8	1 byte	~7 GB
NF4 (4-bit)	0.5 bytes	~3.5 GB

You're essentially compressing the model before training on it. The accuracy loss is surprisingly minimal for most tasks.

NF4: Normal Float 4

NF4 isn't just regular 4-bit quantization. Neural network weights follow a roughly normal (bell curve) distribution, and NF4 is specifically designed for this:

Instead of equal-width bins, NF4 uses percentile-based bins that pack more precision where the data is dense (near zero) and less where it's sparse (at the extremes).
This makes NF4 far more accurate than naive 4-bit quantization.

The Three Pillars of QLoRA

4-bit NF4 Quantization — Compress the base model to fit in less memory.
Double Quantization — Quantize the quantization constants themselves (the scale and zero-point values), squeezing out a few more MB.
Paged Optimizers — Uses NVIDIA's unified memory to page optimizer states between GPU and CPU RAM when the GPU runs out of space, preventing out-of-memory crashes during training.

Together, these let you fine-tune a 7B+ model on a single consumer GPU.

Hands-On: Fine-Tuning Llama 3.2 on Google Colab

Let's make this concrete. Here's how the actual demo was built.

Setup

Environment: Google Colab with a T4 GPU (free tier)

Library: Unsloth — a highly optimized fine-tuning library

Model: Llama 3.2 3B Instruct

Dataset: ServiceNow R1 Distill SFT — 172,000 rows of math/logic puzzles with problems, step-by-step reasoning chains, and solutions

pip install unsloth

Step 1: Load the Model with 4-bit Quantization

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA quantization
)

Step 2: Configure LoRA

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # Rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
)

What are q_proj, k_proj, etc.?

These are the attention projection matrices inside the transformer. LoRA is applied specifically to these layers because they carry the most task-relevant learning signal.

Step 3: Format the Dataset

Each training example is structured as a prompt combining the problem, reasoning, and solution:

def format_prompt(example):
    return f"""
Problem: {example['problem']}

<thinking>
{example['thought']}
</thinking>

Solution: {example['solution']}
"""

The <thinking> tags teach the model to externalize its reasoning — this is how we get the DeepSeek-like behavior.

Step 4: Train with SFTTrainer

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        num_train_epochs=3,
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)

trainer.train()

Training time: ~20 minutes on a T4 GPU

Final loss: ~0.48 ✅

The Result

After fine-tuning, the model was asked:

"How many R's are in the word 'strawberry'?"

Before fine-tuning (base model): Confidently gives the wrong answer.

After fine-tuning: The model works through it step-by-step:

<thinking>
Let me count each letter in "strawberry":
s-t-r-a-w-b-e-r-r-y
Position 3: r
Position 8: r
Position 9: r
That's 3 R's total.
</thinking>

The word "strawberry" contains 3 R's.

That reasoning pattern — breaking the problem down, checking each step — came entirely from the fine-tuning on the reasoning dataset. The base model didn't do this. The fine-tuned model does.

When Should You Use Each Approach?

Scenario	Best Approach
Add recent/private knowledge	RAG
Change response tone or format	Fine-Tuning
Teach step-by-step reasoning	Fine-Tuning
Limited GPU resources	QLoRA
Moderate GPU, need flexibility	LoRA
Production system at scale	RAG + Fine-Tuning combined

Key Takeaways

Fine-tuning adapts a pre-trained model to your specific task, tone, or data — but it's expensive without the right tools.
LoRA makes it efficient by training only a small set of low-rank adapter matrices, leaving original weights frozen.
QLoRA takes this further by quantizing the base model to 4-bit (NF4) first, making large model fine-tuning possible on consumer hardware.
With Unsloth + Google Colab, you can fine-tune a 3B parameter model in under 30 minutes for free.
The results can be genuinely impressive — a reasoning style the base model didn't have, emergent from the training data alone.

What's Next?

If you want to go deeper:

Try different ranks (R=4 vs R=32) and observe the loss difference
Experiment with which target modules to apply LoRA to
Combine your fine-tuned model with a RAG pipeline for maximum power
Push your model to Hugging Face Hub and share it

Fine-tuning used to require a research lab. Now it takes a free GPU and 30 minutes. There's never been a better time to build something genuinely your own.

Found this useful? Drop a reaction or share it with someone building with LLMs. Questions or corrections? I'm all ears in the comments.

DEV Community