This article was originally published on aifoss.dev
---
title: 'Fine-Tune Llama 3 with Unsloth 2026: Dataset to GGUF'
description: 'Step-by-step guide to fine-tuning Llama 3.1 8B with Unsloth on a consumer GPU: QLoRA setup, dataset prep, SFTTrainer config, GGUF export, and Ollama import.'
pubDate: 'May 28 2026'
tags: ["finetuning", "ai", "llm", "gpu", "python"]
---
TL;DR: Unsloth (v2026.5.8) cuts Llama 3.1 8B fine-tuning to 2–4 hours on a consumer GPU with 8GB+ VRAM, using 70% less memory than standard QLoRA. You get a GGUF you can drop straight into Ollama. The catch: output quality depends entirely on your dataset quality.
What you'll have running after this guide:
- A domain-adapted Llama 3.1 8B model trained on your own dataset
- A Q4_K_M GGUF file ready to run in Ollama, LM Studio, or Jan.ai
- A repeatable training pipeline you can re-run when your data changes
Honest take: Unsloth is the right tool for single-GPU fine-tuning in 2026. axolotl has more knobs for complex pipelines; Unsloth is faster to get working and easier on VRAM.
Why fine-tune instead of just prompting?
Prompting a general model works until it doesn't. If your use case involves a specific writing style the model keeps drifting from, domain vocabulary it consistently mangles (medical codes, legal terms, proprietary jargon), or a structured output format it forgets mid-conversation — fine-tuning fixes those problems permanently instead of requiring a 500-token system prompt every call.
The other option is RAG, which is the right answer when the knowledge lives in documents you want to retrieve. Fine-tuning is better when you want to change how the model behaves: its tone, its output structure, its fluency in a domain. These are different problems with different solutions.
Which fine-tuning framework to use
Before getting into the steps, here's where Unsloth sits relative to the alternatives:
| | Unsloth | axolotl | HF TRL (stock) |
|---|---|---|---|
| Single-GPU speed | 2–5× faster | 1× baseline | 1× baseline |
| VRAM usage (8B QLoRA) | ~8–10 GB | ~12–14 GB | ~14–18 GB |
| Setup complexity | Low (pip install) | Medium (config YAML) | Low |
| Multi-GPU support | Limited | Strong | Strong |
| Custom training loops | Limited | Full | Full |
| Best for | Fast iteration, single GPU | Production pipelines, multi-GPU | Research, custom objectives |
Unsloth wins on a single consumer GPU. If you're distributing across multiple cards or need custom training objectives (DPO, PPO, GRPO), axolotl or standard TRL give you more control. For this guide, single-GPU fine-tuning with Unsloth is the path.
Hardware requirements
QLoRA makes 8B-parameter fine-tuning possible on cards most developers already own:
| Model | Method | Minimum VRAM | Training time (1k examples, 3 epochs) |
|---|---|---|---|
| Llama 3.2 3B | QLoRA | 6 GB | ~30 min |
| Llama 3.1 8B | QLoRA (4-bit) | 8 GB | ~2 hours |
| Llama 3.1 8B | LoRA (16-bit) | 18 GB | ~2.5 hours |
| Llama 3.1 70B | QLoRA (4-bit) | 48 GB | ~8–12 hours |
An RTX 3090 (24GB) handles the 8B run with room to spare. An RTX 4090 cuts training time roughly in half. If you're on 8GB VRAM (RTX 4060 or similar), drop max_seq_length to 1024, and fall back to Llama 3.2 3B if the 8B model still runs out of memory.
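As a rough sanity check on the VRAM column, weight memory is parameter count times bits per parameter. The sketch below covers the weights only; it ignores activations, optimizer state, adapters, and CUDA overhead, so real usage is higher. The parameter counts are approximate round figures assumed for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Approximate parameter counts (assumed round figures, not exact).
MODELS = {"Llama 3.2 3B": 3.2e9, "Llama 3.1 8B": 8.0e9}

for name, n_params in MODELS.items():
    four_bit = weight_memory_gb(n_params, 4)
    sixteen_bit = weight_memory_gb(n_params, 16)
    print(f"{name}: 4-bit ~{four_bit:.1f} GB, 16-bit ~{sixteen_bit:.1f} GB")
```

For the 8B model the weights shrink from about 16 GB at 16-bit to about 4 GB at 4-bit, which is what leaves headroom for everything else on a small card.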
If you don't have a suitable local GPU, RunPod rents RTX 4090 and A100 instances by the hour. A full 8B fine-tune run typically costs under $3.
OS: Linux is the primary target. Windows via WSL2 works. macOS with Apple Silicon is supported through Unsloth Studio (MLX-based). Native Windows training works but is less tested.
Python: 3.9–3.14. PyTorch 2.5+ recommended.
Step 1: Install Unsloth
pip install unsloth
Current version: 2026.5.8 (released May 26, 2026). The version number is date-based: year, then month, then a release counter. The last component is not the calendar day, which is why 2026.5.8 shipped on May 26.
Verify:
python -c "import unsloth; print(unsloth.__version__)"
Also install the training stack:
pip install trl transformers datasets accelerate
If you hit CUDA version mismatches, Unsloth's docs at unsloth.ai/docs have conda environment files for the most common CUDA + PyTorch combinations. The conda path is more reliable when your system has multiple CUDA versions installed.
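Before starting a long run, it can help to confirm the whole stack is visible to your interpreter. This is a minimal preflight sketch using only the standard library; the function name `preflight` and the package list are assumptions for illustration, and it checks only that packages are installed, not that their CUDA builds match.

```python
import importlib.util
import sys

# Packages named in the install steps, plus torch (assumed list).
REQUIRED = ["unsloth", "trl", "transformers", "datasets", "accelerate", "torch"]


def preflight(packages=REQUIRED, min_py=(3, 9)):
    """Return {check name: bool} without importing the heavy packages."""
    report = {"python": sys.version_info[:2] >= min_py}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check:<14}{'ok' if ok else 'MISSING'}")
```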
Step 2: Get access to Llama 3.1
Llama 3.1 is gated on Hugging Face. You need to request access once:
- Create an account at huggingface.co
- Visit meta-llama/Llama-3.1-8B-Instruct and accept the license
- Generate an access token at huggingface.co/settings/tokens
- Authenticate:
huggingface-cli login
License note: Llama 3.1 uses the Meta Llama 3.1 Community License — not Apache or MIT. Commercial use is allowed for most cases, but the license kicks in specific obligations above 700 million monthly active users, and any fine-tuned model you distribute must include "Llama" in its name. Read the full terms at llama.com/llama3_1/license/ before shipping a product.
Alternatively, use Unsloth's pre-uploaded mirror, which bypasses the individual HF approval process:
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"
Step 3: Prepare your dataset
SFTTrainer (from TRL, which Unsloth accelerates) works with several common dataset formats. The two you are most likely to use:
Alpaca format (instruction/input/output):
{"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"}
{"instruction": "Summarize this clause in plain English:", "input": "The party of the first part...", "output": "This clause means..."}
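Alpaca records are usually flattened into a single training string before tokenization. The sketch below uses the widely circulated Alpaca prompt template; treat the template wording, the function name `format_alpaca`, and the `<EOS>` placeholder as assumptions. In a real run you would pass your tokenizer's own end-of-sequence token.

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_alpaca(record: dict, eos_token: str) -> str:
    """Render one Alpaca record as a single training string ending in EOS."""
    text = ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input=record.get("input", ""),  # input is optional in Alpaca data
        output=record["output"],
    )
    # Without an EOS token the model never learns where to stop generating.
    return text + eos_token


example = {
    "instruction": "Convert this date to ISO 8601:",
    "input": "March 15th, 2026",
    "output": "2026-03-15",
}
print(format_alpaca(example, eos_token="<EOS>"))  # "<EOS>" is a placeholder
```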
ShareGPT format (multi-turn conversations):
{"conversations": [
  {"from": "human", "value": "What does EBITDA stand for?"},
  {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."}
]}
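Chat templates generally expect role/content messages rather than ShareGPT's from/value keys, so a conversion step is common. This is a minimal sketch; the role mapping and the helper name `sharegpt_to_messages` are assumptions, and datasets that use other speaker labels would need extra entries.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> list:
    """Convert one ShareGPT record into a list of role/content messages."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of silently dropping a turn.
            raise ValueError(f"unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


record = {"conversations": [
    {"from": "human", "value": "What does EBITDA stand for?"},
    {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."},
]}
print(sharegpt_to_messages(record))
```

The resulting list has the shape that a Hugging Face tokenizer's `apply_chat_template` takes as input.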
How much data?
- Under 300 examples: fine-tune the Instruct model (style and behavior shaping)
- 300–1,000 examples: Instruct or base model both work
- Over 1,000 examples: base model preferred for deeper behavior change
More data doesn't reliably beat better data. If you have 10,000 mediocre examples and 500 carefully curated ones, the 500 will often produce a better model. Deduplicate, filter out short or malformed entries, and aim for consistent quality before you care about quantity.
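The cleanup pass described above can be sketched in plain Python. This assumes Alpaca-style records; the helper names and the `min_output_chars` threshold are arbitrary illustrative choices, and deduplication here is by normalized instruction plus input, which will not catch paraphrases.

```python
def _norm(text) -> str:
    """Lowercase and collapse whitespace so near-identical prompts compare equal."""
    return " ".join(str(text or "").lower().split())


def clean_examples(records, min_output_chars=5):
    """Drop malformed, too-short, and duplicate records; keep first occurrences."""
    seen = set()
    kept = []
    for rec in records:
        if not isinstance(rec, dict):
            continue  # malformed: not a JSON object
        instruction = rec.get("instruction")
        output = rec.get("output")
        if not isinstance(instruction, str) or not isinstance(output, str):
            continue  # malformed: missing or non-string field
        if not instruction.strip() or len(output.strip()) < min_output_chars:
            continue  # empty prompt or too-short answer
        key = (_norm(instruction), _norm(rec.get("input")))
        if key in seen:
            continue  # duplicate prompt
        seen.add(key)
        kept.append(rec)
    return kept
```

Running it before training and logging how many records were dropped gives a quick read on how noisy the raw data was.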
Load your data:
from datasets import load_dataset
# Local JSON Lines file
dataset = load_dataset("json", data_files="your_data.jsonl", split="train")
# Or a public HF dataset instead (uncomment to use; it replaces the local file above)
# dataset = load_dataset("tatsu-lab/alpaca", split="train")
Step 4: Load the model with QLoRA
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,  # drop to 1024 if you hit OOM on 8GB VRAM
    dtype=None,           # auto-detect: bfloat16 on Ampere+, float16 on older GPUs
    load_in_4bit=True,    # QLoRA: model in 4-bit, adapters in 16-bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher = more capacity, more VRAM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # 0 is optimal for Unsloth's fused kernels
    bias="none",
    use_gradient_checkpointing="unsloth",  # long-context support, less VRAM
    random_state=3407,
)
load_in_4bit=True is the QLoRA switch. The base model loads compressed to 4-bit; the LoRA adapters, which are the only trainable parameters, remain in 16-bit. You're training a small fraction of the total parameter count (about 0.5% at r=16 with the seven target modules listed, a few percent at higher ranks), which is why 8GB is enough.
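To see how small the trainable slice is at a given rank, you can count the adapter parameters directly: each adapted linear layer adds r × (in_features + out_features) weights. The sketch below assumes the layer shapes from the published Llama 3.1 8B config (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers) and an approximate total of 8.03B parameters.

```python
# Llama 3.1 8B layer shapes (assumed from the published model config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dim (grouped-query attention)
LAYERS = 32

# (in_features, out_features) for each targeted projection.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r: int) -> int:
    """LoRA adds two matrices per layer: (in x r) and (r x out)."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in SHAPES.values())
    return per_layer * LAYERS


for r in (16, 32, 64):
    n = lora_trainable_params(r)
    print(f"r={r}: {n:,} trainable params ({n / 8.03e9:.2%} of ~8.03B)")
```

At r=16 this comes to 41,943,040 adapter parameters, roughly 0.5% of the model. The count scales linearly with rank, so r=64 is about 2%.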
LoRA rank (r): r=16 is the standard starting point. Raise it to 32 or 64 if you're doing style transfer or long-form generation and have the VRAM headroom. For simple format trai