Fine-Tuning LLMs on Consumer GPUs: A Practical Guide to QLoRA
No A100. No cloud credits. Just a 3090 and determination.
The Myth
"You need $10,000+ in cloud compute to fine-tune an LLM."
Reality: I fine-tuned Mistral-7B on a single RTX 3090 for $0.
What is QLoRA?
QLoRA = Quantized Low-Rank Adaptation
- Quantization: compress the model weights from 16- or 32-bit floats down to 4-bit (NF4)
- LoRA: train small low-rank adapter layers instead of updating the full model
- Result: the 7B weights load in roughly 5-6GB of VRAM instead of the ~28GB they would need at full 32-bit precision (rough math in the sketch below)
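Quick back-of-the-envelope math (my own sketch, not from the QLoRA paper; real usage is higher because of activations, the KV cache, and optimizer state):

# Rough VRAM needed just to hold 7B parameters at different precisions.
params = 7_000_000_000
for bits, label in [(32, "fp32"), (16, "bf16"), (4, "nf4 (4-bit)")]:
    gib = params * bits / 8 / 1024**3
    print(f"{label:12s} ~{gib:.1f} GiB")
# fp32         ~26.1 GiB
# bf16         ~13.0 GiB
# nf4 (4-bit)  ~3.3 GiB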
The Setup
Hardware
- GPU: RTX 3090 (24GB); a 3080 (10-12GB) can also work if you cut the batch size and max sequence length
- RAM: 32GB (16GB minimum)
- Storage: 50GB free space
Software Stack
pip install torch transformers peft bitsandbytes trl datasets accelerate
(accelerate is required for the device_map="auto" loading used below.)
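Before loading a 7B model, it's worth confirming that PyTorch actually sees the GPU and that bf16 is supported (a quick sanity check I'd add, not part of the original setup):

import torch

# Confirm the GPU is visible and bf16-capable before loading anything big
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0))
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GiB")
print("bf16 supported:", torch.cuda.is_bf16_supported())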
The Code
1. Load Model in 4-bit
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
import torch

# 4-bit NF4 quantization with nested (double) quantization and bf16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token
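To see what the quantized weights actually occupy, transformers exposes get_memory_footprint():

# Weights + buffers of the 4-bit model, in GiB
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")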
2. Configure LoRA
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # LoRA rank
    lora_alpha=32,   # alpha scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 7,261,749,248
# trainable%: 0.29%
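To make "small adapter layers" concrete: LoRA freezes each targeted weight matrix W and learns a low-rank update ΔW = B·A scaled by alpha/r, so only r·(in_features + out_features) parameters per adapted matrix are trained. A minimal standalone sketch in plain PyTorch (this is the idea, not the PEFT implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (alpha / r) * B(A(x)), with the base layer frozen."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # frozen pretrained weights
        self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)                # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# Trainable params per adapted matrix: r * (in_features + out_features)
layer = LoRALinear(nn.Linear(4096, 4096), r=16, alpha=32)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 131072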
3. Prepare Dataset
from datasets import load_dataset

dataset = load_dataset(
    "RichardSakaguchiMS/brazilian-customer-service-conversations"
)

def format_example(example):
    # Wrap each input/output pair in the Mistral [INST] chat template
    return {
        "text": f"<s>[INST] {example['input']} [/INST]\n{example['output']}</s>"
    }

dataset = dataset.map(format_example)
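Before training, eyeball one formatted record to confirm the template came out right (the field names input/output are whatever your dataset actually uses):

# Spot-check the chat template on the first training example
print(dataset["train"][0]["text"][:500])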
4. Train
from trl import SFTTrainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
)

trainer.train()
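Only the adapter needs to be saved afterwards; the base model stays untouched:

# Persist the LoRA adapter (tens of MB, not the full 7B weights) and tokenizer
trainer.save_model("./output")
tokenizer.save_pretrained("./output")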
Training Stats
| Metric | Value |
|---|---|
| Dataset | 10,000 examples |
| Epochs | 3 |
| Batch Size | 4 × 4 (effective 16 via gradient accumulation) |
| Training Time | ~4 hours |
| VRAM Peak | 18GB |
| Final Loss | 0.82 |
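The step count follows directly from those numbers (assuming all 10,000 examples land in the train split):

# 10,000 examples / (4 per device * 4 accumulation) = 625 steps per epoch
examples, epochs = 10_000, 3
effective_batch = 4 * 4
steps_per_epoch = examples // effective_batch
print(steps_per_epoch, steps_per_epoch * epochs)  # 625 per epoch, 1875 total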
Tips and Tricks
1. Gradient Checkpointing
model.gradient_checkpointing_enable()
Saves VRAM at the cost of roughly 20% slower training.
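With a 4-bit base model, the usual route is PEFT's prepare_model_for_kbit_training helper, which enables gradient checkpointing and makes the inputs require grads so checkpointing works with frozen weights. Call it before get_peft_model:

from peft import prepare_model_for_kbit_training

# Sets up the quantized model for training and enables gradient checkpointing;
# run this on the 4-bit model before wrapping it with get_peft_model()
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)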
2. Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
Roughly 2x faster attention and lower VRAM use. Requires the flash-attn package (pip install flash-attn) and an Ampere or newer GPU, which the 3090 is.
3. Data Quality > Quantity
- 1,000 high-quality examples > 100,000 noisy examples
- Clean your data!
- Validate format consistency
4. Monitor Loss Curve
- If loss plateaus early: try a higher learning rate
- If loss spikes: lower the learning rate
- If loss oscillates: lower the learning rate or increase gradient accumulation for a larger effective batch (plotting sketch below)
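A quick way to actually see the curve is to plot the Trainer's log history (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# trainer.state.log_history keeps one dict per logging step during training
logs = [e for e in trainer.state.log_history if "loss" in e]
plt.plot([e["step"] for e in logs], [e["loss"] for e in logs])
plt.xlabel("step")
plt.ylabel("training loss")
plt.show()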
Inference
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./output")

# "Customer: I want to know about my order"
prompt = "[INST] Cliente: Quero saber do meu pedido [/INST]"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
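If you want a standalone checkpoint (to serve without PEFT or convert to another format), merge the adapter into a full-precision copy of the base model; merging straight into 4-bit weights is not what you want here, so this sketch reloads the base in bf16 (needs roughly 14GB for the unquantized weights):

# Merge the LoRA adapter into a bf16 copy of the base model and save it
base_fp = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base_fp, "./output").merge_and_unload()
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")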
When NOT to Fine-Tune
- Simple prompt engineering works
- You have < 1,000 examples
- Task is too generic (use base model)
- Budget for API calls is acceptable
When TO Fine-Tune
- Specific domain language (legal, medical, regional)
- Consistent output format required
- Privacy requirements (no cloud APIs)
- Cost optimization at scale
Open Source
Model and dataset used in this guide:
Questions? Drop them in the comments!