Richard Sakaguchi

Fine-Tuning LLMs on Consumer GPUs: A Practical Guide to QLoRA

No A100. No cloud credits. Just a 3090 and determination.

The Myth

"You need $10,000+ in cloud compute to fine-tune an LLM."

Reality: I fine-tuned Mistral-7B on a single RTX 3090 for $0.

What is QLoRA?

QLoRA = Quantized Low-Rank Adaptation

  • Quantization: compress the model weights from 16/32-bit floats down to 4-bit (NF4)
  • LoRA: train small low-rank adapter layers instead of the full model
  • Result: a 7B model fits in ~6GB of VRAM instead of 28GB+ (rough math below)
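
To see why those numbers work out, here's the rough arithmetic (a back-of-envelope sketch; parameter count approximate, activations and CUDA overhead ignored):

# Rough VRAM arithmetic for the weights of a ~7B-parameter model
params = 7.2e9

fp32_gb = params * 4 / 1e9    # 4 bytes per weight  -> ~29 GB
fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
nf4_gb  = params * 0.5 / 1e9  # 4 bits per weight   -> ~3.6 GB (plus quantization scales)

print(f"fp32: {fp32_gb:.0f} GB | fp16: {fp16_gb:.0f} GB | 4-bit: {nf4_gb:.1f} GB")
# LoRA adds only tens of millions of trainable parameters (tens of MB),
# which is why the whole setup fits on a single consumer card.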

The Setup

Hardware

  • GPU: RTX 3090 (24GB); a 3080 can also work with a smaller batch size and gradient checkpointing
  • RAM: 32GB (16GB minimum)
  • Storage: 50GB free space

Software Stack

pip install torch transformers peft bitsandbytes trl datasets

The Code

1. Load Model in 4-bit

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)

# The tokenizer is needed by the trainer and for inference later on
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.pad_token = tokenizer.eos_token  # Mistral defines no pad token by default
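
Optional sanity check before moving on (get_memory_footprint reports the size of the loaded weights; expect roughly 4 GB for a 7B model in NF4):

# Confirm quantization actually shrank the model
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")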

2. Configure LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Alpha scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 7,261,749,248
# trainable%: 0.29%

3. Prepare Dataset

from datasets import load_dataset

dataset = load_dataset(
    "RichardSakaguchiMS/brazilian-customer-service-conversations"
)

def format_example(example):
    return {
        "text": f"""<s>[INST] {example['input']} [/INST]
{example['output']}</s>"""
    }

dataset = dataset.map(format_example)
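
Print one formatted example before training; a broken template is the cheapest bug to catch now:

# Spot-check the instruct template before committing hours of GPU time
print(dataset["train"][0]["text"])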

4. Train

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)

# Note: on recent trl releases, max_seq_length and dataset_text_field
# are passed via SFTConfig instead of directly to SFTTrainer.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text"
)

trainer.train()
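
The trainer already checkpoints every epoch (save_strategy="epoch"), but it's worth being explicit about where the final adapter and tokenizer land, since the inference snippet below loads from ./output:

# Persist the LoRA adapter (tens of MB, not the full model) plus the tokenizer
trainer.save_model("./output")
tokenizer.save_pretrained("./output")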

Training Stats

Metric           Value
-------------    -----------------------------
Dataset          10,000 examples
Epochs           3
Batch size       4 x 4 (gradient accumulation)
Training time    ~4 hours
Peak VRAM        18 GB
Final loss       0.82

Tips and Tricks

1. Gradient Checkpointing

model.gradient_checkpointing_enable()

Saves a large chunk of activation VRAM at the cost of roughly 20% slower training.
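
With a 4-bit base, peft's k-bit prep helper bundles this and a couple of related stability tweaks; if you use it, call it before get_peft_model:

# Freezes the quantized base weights, upcasts norm/output layers for stability,
# and enables gradient checkpointing by default
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # the KV cache is useless during training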

2. Flash Attention 2

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,           # same 4-bit config as step 1
    attn_implementation="flash_attention_2",  # requires: pip install flash-attn
    device_map="auto"
)

Up to ~2x faster and lower VRAM use, especially at longer sequence lengths; requires the flash-attn package and an Ampere-or-newer GPU (the 3090 qualifies).

3. Data Quality > Quantity

  • 1,000 high-quality examples > 100,000 noisy examples
  • Clean your data!
  • Validate format consistency (a quick check is sketched below)
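
A minimal sketch of that last check, assuming the dataset was formatted with format_example from step 3:

# Every formatted example should follow the Mistral instruct template exactly
bad = [
    i for i, ex in enumerate(dataset["train"])
    if not ex["text"].startswith("<s>[INST]") or not ex["text"].endswith("</s>")
]
print(f"{len(bad)} malformed examples out of {len(dataset['train'])}")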

4. Monitor Loss Curve

  • If loss plateaus: increase learning rate
  • If loss spikes: decrease learning rate
  • If loss oscillates: decrease the batch size or learning rate (a TensorBoard setup for watching the curve live is sketched below)
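
One way to watch the curve live instead of scrolling console logs (assumes the tensorboard package is installed):

# Add report_to to the TrainingArguments from step 4; transformers then writes
# event files under ./output/runs by default
training_args = TrainingArguments(
    output_dir="./output",
    logging_steps=10,
    report_to="tensorboard",  # view with: tensorboard --logdir ./output/runs
    # ...same remaining arguments as in step 4
)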

Inference

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./output")

# "Customer: I want to know about my order" (pt-BR, matching the dataset)
prompt = "[INST] Cliente: Quero saber do meu pedido [/INST]"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
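
Optional: to serve the model later without a peft dependency, merge the adapter into a half-precision copy of the base (merging directly into the 4-bit base isn't supported). The output path here is just an example name:

# Reload the base in bf16, fold the LoRA weights in, and save a standalone model
base_fp16 = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
merged = PeftModel.from_pretrained(base_fp16, "./output").merge_and_unload()
merged.save_pretrained("./mistral-7b-merged")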

When NOT to Fine-Tune

  • Simple prompt engineering works
  • You have < 1,000 examples
  • Task is too generic (use base model)
  • Budget for API calls is acceptable

When TO Fine-Tune

  • Specific domain language (legal, medical, regional)
  • Consistent output format required
  • Privacy requirements (no cloud APIs)
  • Cost optimization at scale

Open Source

Model and dataset used in this guide:


Questions? Drop them in the comments!

sakaguchi.ia.br | GitHub
