I spent nearly three weeks and close to $300 trying to fine-tune a large language model the traditional way. Between VRAM errors, disappointing results, and mounting bills, it was painful.
If you've ever felt the same frustration, this tutorial is for you. In 2026, fine-tuning LLMs doesn't need to be expensive or complicated. I'll show you exactly how to fine-tune Llama 3.1 8B using QLoRA for under $5, while getting solid, usable results.
By the end of this guide, you'll have a complete, working workflow you can adapt to your own domain or task.
Why Bother Fine-Tuning at All?
Let’s be honest upfront: fine-tuning isn’t always the right answer. For many applications, good prompt engineering combined with RAG delivers faster and cheaper results.
However, when you need consistent behavior, specialized knowledge, or better performance on structured outputs, fine-tuning still wins. The good news? Thanks to QLoRA and tools like Unsloth, it’s now accessible without a research lab budget.
Prerequisites
Before we start, make sure you have:
- Intermediate Python skills and basic familiarity with Hugging Face
- A Hugging Face account (for gated models like Llama 3.1)
- Access to a GPU with at least 16GB VRAM (RTX 4090, A100, or Colab Pro works well)
- Basic understanding of what LoRA is (we’ll cover the practical side below)
Concepts Overview: Why QLoRA?
Full fine-tuning updates every parameter in the model — extremely expensive in both memory and compute.
LoRA (Low-Rank Adaptation) freezes the base model weights and only trains small adapter layers. This dramatically reduces the number of trainable parameters.
QLoRA takes it further by quantizing the base model to 4-bit precision while keeping the adapters in higher precision. The result is massive memory savings with surprisingly little drop in quality.
In practice, this means you can fine-tune an 8B model on relatively modest hardware without sacrificing too much performance. That’s why QLoRA became the go-to efficient fine-tuning method in 2026.
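To make the savings concrete, here is a back-of-the-envelope calculation. The 4096 x 4096 dimensions below are illustrative (roughly the size of a Llama attention projection); the adapter math follows the LoRA formulation:

```python
# A full d_out x d_in weight matrix has d_out * d_in trainable parameters.
# LoRA instead trains two low-rank factors of shapes (d_out, r) and (r, d_in),
# so only r * (d_out + d_in) parameters are updated.

def full_params(d_out: int, d_in: int) -> int:
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return r * (d_out + d_in)

# Example: a 4096 x 4096 projection with rank r = 16
full = full_params(4096, 4096)      # 16,777,216 parameters
lora = lora_params(4096, 4096, 16)  # 131,072 parameters
print(f"LoRA trains {lora / full:.2%} of the full matrix")  # -> 0.78%
```

Under one percent of the parameters per matrix, which is why the adapters fit comfortably next to a 4-bit base model.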
Step-by-Step Implementation
1. Environment Setup
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes
2. Load the Model in 4-bit
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None          # None auto-detects: bfloat16 on Ampere+, float16 otherwise
load_in_4bit = True   # quantize the base model to 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
3. Prepare Your Dataset
from datasets import load_dataset
dataset = load_dataset("json", data_files="training_data.json", split="train")
EOS_TOKEN = tokenizer.eos_token  # must be appended, or the model never learns to stop
def formatting_prompts_func(examples):
    texts = []
    for instruction, input_text, output in zip(examples["instruction"],
                                               examples["input"],
                                               examples["output"]):
        text = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}""" + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
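The formatting function above expects each record to carry instruction, input, and output fields. As a sanity check, here is what a minimal training_data.json might look like (the records are invented for illustration, not from any real dataset):

```python
import json

# Two hypothetical records matching the fields the formatting
# function expects: instruction, input, output.
sample = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "QLoRA quantizes the base model to 4-bit and trains LoRA adapters on top.",
        "output": "QLoRA combines 4-bit quantization with LoRA adapters for cheap fine-tuning.",
    },
    {
        "instruction": "Translate to French.",
        "input": "Hello, world!",
        "output": "Bonjour, le monde !",
    },
]

# load_dataset("json", ...) accepts a top-level JSON array of objects.
with open("training_data.json", "w") as f:
    json.dump(sample, f, indent=2)
```

A few hundred high-quality records like these go further than thousands of noisy ones.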
4. Apply LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                      # LoRA rank; 8-64 is typical
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,            # 0 is the optimized path in Unsloth
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # trades compute for memory
    random_state = 42,
)
5. Configure and Start Training
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,   # effective batch size = 2 * 4 = 8
        warmup_steps = 5,
        max_steps = 80,                    # short demo run; scale up for real datasets
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
)
trainer.train()
6. Save and Merge the Model
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")

# Merge the adapters into the base weights for standalone deployment
model = model.merge_and_unload()
model.save_pretrained("fine_tuned_llama_8b")
tokenizer.save_pretrained("fine_tuned_llama_8b")
Running and Testing
Since we trained on the Alpaca-style template, we should prompt with that same template at inference (the base Llama 3.1 model has no chat template, so apply_chat_template would not match our training format):
FastLanguageModel.for_inference(model)  # switch Unsloth into fast inference mode

prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain how QLoRA works in simple terms.

### Input:


### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Troubleshooting Common Issues
- Out of Memory (OOM): Reduce batch size or increase gradient accumulation steps
- Overfitting: Use fewer steps, add more diverse data, or lower learning rate
- Poor generation quality: Check your dataset formatting and instruction quality
- Slow training: Make sure you're using Unsloth and gradient checkpointing
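For the OOM case, the knob to watch is the effective batch size: per_device_train_batch_size times gradient_accumulation_steps. Halving the former and doubling the latter roughly halves peak activation memory while keeping the optimization identical. A quick sketch using the values from the config above:

```python
# Effective batch size = per-device batch size * gradient accumulation steps.
# Halving the per-device batch and doubling accumulation keeps it constant
# while cutting peak activation memory roughly in half.

def effective_batch(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    return per_device * grad_accum * num_gpus

original = effective_batch(per_device=2, grad_accum=4)  # config above -> 8
oom_safe = effective_batch(per_device=1, grad_accum=8)  # same effective batch
print(original, oom_safe)  # -> 8 8
```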
Next Steps
Once you have a working setup, you can explore:
- Preference tuning with DPO or ORPO
- Model merging techniques
- Production inference with vLLM
- Combining fine-tuning with RAG for better results
Have you tried fine-tuning with QLoRA yet? What challenges did you face? Share your experience in the comments!