Richard Sakaguchi

Fine-Tuning LLMs on Consumer GPUs: A Practical Guide to QLoRA

No A100. No cloud credits. Just a 3090 and determination.

The Myth

"You need $10,000+ in cloud compute to fine-tune an LLM."

Reality: I fine-tuned Mistral-7B on a single RTX 3090 for $0.

What is QLoRA?

QLoRA = Quantized Low-Rank Adaptation

  • Quantization: compress the model weights from 16/32-bit floats down to 4-bit (NF4)
  • LoRA: train small low-rank adapter layers instead of the full model
  • Result: a 7B model fits in ~6GB of VRAM instead of 28GB+ (rough math below)
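
To see why those numbers work out, here's the rough arithmetic (a back-of-envelope sketch; parameter count approximate, activations and CUDA overhead ignored):

# Rough VRAM arithmetic for the weights of a ~7B-parameter model
params = 7.2e9

fp32_gb = params * 4 / 1e9    # 4 bytes per weight  -> ~29 GB
fp16_gb = params * 2 / 1e9    # 2 bytes per weight  -> ~14 GB
nf4_gb  = params * 0.5 / 1e9  # 4 bits per weight   -> ~3.6 GB (plus quantization scales)

print(f"fp32: {fp32_gb:.0f} GB | fp16: {fp16_gb:.0f} GB | 4-bit: {nf4_gb:.1f} GB")
# LoRA adds only tens of millions of trainable parameters (tens of MB),
# which is why the whole setup fits on a single consumer card.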

The Setup

Hardware

  • GPU: RTX 3090 (24GB); a 3080 can also work with a smaller batch size and gradient checkpointing
  • RAM: 32GB (16GB minimum)
  • Storage: 50GB free space

Software Stack

pip install torch transformers peft bitsandbytes trl datasets

The Code

1. Load Model in 4-bit

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)

# The tokenizer is needed by the trainer and for inference later on
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.pad_token = tokenizer.eos_token  # Mistral defines no pad token by default
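
Optional sanity check before moving on (get_memory_footprint reports the size of the loaded weights; expect roughly 4 GB for a 7B model in NF4):

# Confirm quantization actually shrank the model
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")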

2. Configure LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Alpha scaling
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 7,261,749,248
# trainable%: 0.29%

3. Prepare Dataset

from datasets import load_dataset

dataset = load_dataset(
    "RichardSakaguchiMS/brazilian-customer-service-conversations"
)

def format_example(example):
    return {
        "text": f"""<s>[INST] {example['input']} [/INST]
{example['output']}</s>"""
    }

dataset = dataset.map(format_example)
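
Print one formatted example before training; a broken template is the cheapest bug to catch now:

# Spot-check the instruct template before committing hours of GPU time
print(dataset["train"][0]["text"])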

4. Train

from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)

# Note: on recent trl releases, max_seq_length and dataset_text_field
# are passed via SFTConfig instead of directly to SFTTrainer.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text"
)

trainer.train()
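
The trainer already checkpoints every epoch (save_strategy="epoch"), but it's worth being explicit about where the final adapter and tokenizer land, since the inference snippet below loads from ./output:

# Persist the LoRA adapter (tens of MB, not the full model) plus the tokenizer
trainer.save_model("./output")
tokenizer.save_pretrained("./output")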

Training Stats

Metric           Value
-------------    -----------------------------
Dataset          10,000 examples
Epochs           3
Batch size       4 x 4 (gradient accumulation)
Training time    ~4 hours
Peak VRAM        18 GB
Final loss       0.82

Tips and Tricks

1. Gradient Checkpointing

model.gradient_checkpointing_enable()

Saves a large chunk of activation VRAM at the cost of roughly 20% slower training.
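
With a 4-bit base, peft's k-bit prep helper bundles this and a couple of related stability tweaks; if you use it, call it before get_peft_model:

# Freezes the quantized base weights, upcasts norm/output layers for stability,
# and enables gradient checkpointing by default
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # the KV cache is useless during training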

2. Flash Attention 2

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,           # same 4-bit config as step 1
    attn_implementation="flash_attention_2",  # requires: pip install flash-attn
    device_map="auto"
)

Up to ~2x faster and lower VRAM use, especially at longer sequence lengths; requires the flash-attn package and an Ampere-or-newer GPU (the 3090 qualifies).

3. Data Quality > Quantity

  • 1,000 high-quality examples > 100,000 noisy examples
  • Clean your data!
  • Validate format consistency (a quick check is sketched below)
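
A minimal sketch of that last check, assuming the dataset was formatted with format_example from step 3:

# Every formatted example should follow the Mistral instruct template exactly
bad = [
    i for i, ex in enumerate(dataset["train"])
    if not ex["text"].startswith("<s>[INST]") or not ex["text"].endswith("</s>")
]
print(f"{len(bad)} malformed examples out of {len(dataset['train'])}")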

4. Monitor Loss Curve

  • If loss plateaus: increase learning rate
  • If loss spikes: decrease learning rate
  • If loss oscillates: decrease the batch size or learning rate (a TensorBoard setup for watching the curve live is sketched below)
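
One way to watch the curve live instead of scrolling console logs (assumes the tensorboard package is installed):

# Add report_to to the TrainingArguments from step 4; transformers then writes
# event files under ./output/runs by default
training_args = TrainingArguments(
    output_dir="./output",
    logging_steps=10,
    report_to="tensorboard",  # view with: tensorboard --logdir ./output/runs
    # ...same remaining arguments as in step 4
)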

Inference

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./output")

# "Customer: I want to know about my order" (pt-BR, matching the dataset)
prompt = "[INST] Cliente: Quero saber do meu pedido [/INST]"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
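
Optional: to serve the model later without a peft dependency, merge the adapter into a half-precision copy of the base (merging directly into the 4-bit base isn't supported). The output path here is just an example name:

# Reload the base in bf16, fold the LoRA weights in, and save a standalone model
base_fp16 = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
merged = PeftModel.from_pretrained(base_fp16, "./output").merge_and_unload()
merged.save_pretrained("./mistral-7b-merged")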

When NOT to Fine-Tune

  • Simple prompt engineering works
  • You have < 1,000 examples
  • Task is too generic (use base model)
  • Budget for API calls is acceptable

When TO Fine-Tune

  • Specific domain language (legal, medical, regional)
  • Consistent output format required
  • Privacy requirements (no cloud APIs)
  • Cost optimization at scale

Open Source

Model and dataset used in this guide:


Questions? Drop them in the comments!

sakaguchi.ia.br | GitHub
