
Jovan Chan

Posted on • Originally published at aifoss.dev

fine-tuning-llama3-unsloth-2026

This article was originally published on aifoss.dev

---
title: 'Fine-Tune Llama 3 with Unsloth 2026: Dataset to GGUF'
description: 'Step-by-step guide to fine-tuning Llama 3.1 8B with Unsloth on a consumer GPU: QLoRA setup, dataset prep, SFTTrainer config, GGUF export, and Ollama import.'
pubDate: 'May 28 2026'

tags: ["finetuning", "ai", "llm", "gpu", "python"]

TL;DR: Unsloth (v2026.5.8) cuts Llama 3.1 8B fine-tuning to 2–4 hours on a consumer GPU with 8GB+ VRAM, using 70% less memory than standard QLoRA. You get a GGUF you can drop straight into Ollama. The catch: output quality depends entirely on your dataset quality.

What you'll have running after this guide:

  • A domain-adapted Llama 3.1 8B model trained on your own dataset
  • A Q4_K_M GGUF file ready to run in Ollama, LM Studio, or Jan.ai
  • A repeatable training pipeline you can re-run when your data changes

Honest take: Unsloth is the right tool for single-GPU fine-tuning in 2026. axolotl has more knobs for complex pipelines; Unsloth is faster to get working and easier on VRAM.

Why fine-tune instead of just prompting?

Prompting a general model works until it doesn't. If your use case involves a specific writing style the model keeps drifting from, domain vocabulary it consistently mangles (medical codes, legal terms, proprietary jargon), or a structured output format it forgets mid-conversation, fine-tuning fixes those problems permanently instead of requiring a 500-token system prompt on every call.

The other option is RAG, which is the right answer when the knowledge lives in documents you want to retrieve. Fine-tuning is better when you want to change how the model behaves: its tone, its output structure, its fluency in a domain. These are different problems with different solutions.

Which fine-tuning framework to use

Before getting into the steps, here's where Unsloth sits relative to the alternatives:

| | Unsloth | axolotl | HF TRL (stock) |
|---|---|---|---|
| Single-GPU speed | 2–5× faster | 1× baseline | 1× baseline |
| VRAM usage (8B QLoRA) | ~8–10 GB | ~12–14 GB | ~14–18 GB |
| Setup complexity | Low (pip install) | Medium (config YAML) | Low |
| Multi-GPU support | Limited | Strong | Strong |
| Custom training loops | Limited | Full | Full |
| Best for | Fast iteration, single GPU | Production pipelines, multi-GPU | Research, custom objectives |

Unsloth wins on a single consumer GPU. If you're distributing across multiple cards or need custom training objectives (DPO, PPO, GRPO), axolotl or standard TRL give you more control. For this guide, single-GPU fine-tuning with Unsloth is the path.

Hardware requirements

QLoRA makes 8B-parameter fine-tuning possible on cards most developers already own:

| Model | Method | Minimum VRAM | Training time (1k examples, 3 epochs) |
|---|---|---|---|
| Llama 3.2 3B | QLoRA | 6 GB | ~30 min |
| Llama 3.1 8B | QLoRA (4-bit) | 8 GB | ~2 hours |
| Llama 3.1 8B | LoRA (16-bit) | 18 GB | ~2.5 hours |
| Llama 3.1 70B | QLoRA (4-bit) | 24 GB | ~8–12 hours |
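
A quick way to sanity-check the smaller rows of the table is to work out how much memory the frozen base weights alone occupy. The sketch below is a rough estimate, not a measurement: the parameter counts are rounded assumptions, the function name is illustrative, and activations, optimizer state, and CUDA overhead are left out, which is why the real minimums sit above these numbers.

```python
# Back-of-the-envelope size of the frozen base weights.
# Assumptions: rounded parameter counts (3.2e9 and 8.0e9); no allowance for
# activations, LoRA optimizer state, or CUDA context overhead.

def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return the size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for name, n_params in [("Llama 3.2 3B", 3.2e9), ("Llama 3.1 8B", 8.0e9)]:
    four_bit = weight_memory_gb(n_params, 4)
    sixteen_bit = weight_memory_gb(n_params, 16)
    print(f"{name}: {four_bit:.1f} GB at 4-bit, {sixteen_bit:.1f} GB at 16-bit")
```

The gap between the 4-bit and 16-bit figures for the 8B model is what separates the 8 GB QLoRA row from the 18 GB LoRA row.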

An RTX 3090 (24GB) handles the 8B run with room to spare. An RTX 4090 cuts training time roughly in half. If you're on 8GB VRAM (RTX 4060 or similar), the 8B run is at the edge of what fits: drop max_seq_length to 1024, and fall back to Llama 3.2 3B if you still run out of memory.

If you don't have a suitable local GPU, RunPod rents RTX 4090 and A100 instances by the hour. A full 8B fine-tune run typically costs under $3.

OS: Linux is the primary target. Windows via WSL2 works. macOS with Apple Silicon is supported through Unsloth Studio (MLX-based). Native Windows training works but is less tested.

Python: 3.9–3.14. PyTorch 2.5+ recommended.

Step 1: Install Unsloth

pip install unsloth

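Current version: 2026.5.8 (released May 26, 2026). The version number is date-based, but the last component does not track the day of the month (2026.5.8 shipped on May 26), so read it as year, month, and a release counter within that month rather than YYYY.MM.DD.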

Verify:

python -c "import unsloth; print(unsloth.__version__)"

Also install the training stack:

pip install trl transformers datasets accelerate

If you hit CUDA version mismatches, Unsloth's docs at unsloth.ai/docs have conda environment files for the most common CUDA + PyTorch combinations. The conda path is more reliable when your system has multiple CUDA versions installed.
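
Before reaching for conda, it can help to see what your environment actually reports. The following is a minimal sketch, assuming PyTorch may or may not be installed; the function name and dictionary keys are illustrative and not part of Unsloth.

```python
# Report the Python, PyTorch, and GPU details that matter for CUDA mismatches.
# The function name and dict keys are hypothetical, chosen for this example.
import platform


def environment_report() -> dict:
    report = {
        "python": platform.python_version(),
        "torch": None,
        "cuda_build": None,
        "gpu": None,
        "vram_gb": None,
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is missing, so there is nothing more to inspect

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        report["gpu"] = props.name
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If cuda_build is None, or gpu stays None on a machine that has an NVIDIA card, the installed PyTorch build likely does not match your driver, and that is the case where the conda environment files are worth trying.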

Step 2: Get access to Llama 3.1

Llama 3.1 is gated on Hugging Face. You need to request access once:

  1. Create an account at huggingface.co
  2. Visit meta-llama/Llama-3.1-8B-Instruct and accept the license
  3. Generate an access token at huggingface.co/settings/tokens
  4. Authenticate: huggingface-cli login

License note: Llama 3.1 uses the Meta Llama 3.1 Community License, not Apache or MIT. Commercial use is allowed in most cases, but companies with more than 700 million monthly active users must request a separate license from Meta, and any fine-tuned model you distribute must include "Llama" at the beginning of its name. Read the full terms at llama.com/llama3_1/license/ before shipping a product.

Alternatively, use Unsloth's pre-uploaded mirror, which bypasses the individual HF approval process:

model_name = "unsloth/Meta-Llama-3.1-8B-Instruct"

Step 3: Prepare your dataset

SFTTrainer (from TRL, which Unsloth patches for speed) accepts several dataset formats. The two you are most likely to use:

Alpaca format (instruction/input/output):

{"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"}
{"instruction": "Summarize this clause in plain English:", "input": "The party of the first part...", "output": "This clause means..."}
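
Alpaca records usually get flattened into a single training string before they reach the trainer. Here is one way that could look. The prompt template is the widely used Alpaca wording and is an assumption here, not something this guide prescribes; format_alpaca is a hypothetical helper name, and "<EOS>" stands in for your tokenizer's real end-of-sequence token.

```python
# Flatten one Alpaca-style record into a single "text" field.
# ALPACA_TEMPLATE and format_alpaca are illustrative, not an Unsloth API.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_alpaca(example: dict, eos_token: str) -> dict:
    """Return {'text': ...} with the EOS token appended so generation learns to stop."""
    text = ALPACA_TEMPLATE.format(
        instruction=example["instruction"],
        input=example.get("input", ""),  # the input field is optional
        output=example["output"],
    )
    return {"text": text + eos_token}


record = {
    "instruction": "Convert this date to ISO 8601:",
    "input": "March 15th, 2026",
    "output": "2026-03-15",
}
print(format_alpaca(record, eos_token="<EOS>")["text"])
```

With a loaded dataset and tokenizer, the same function can be applied to every row with dataset.map(lambda ex: format_alpaca(ex, tokenizer.eos_token)).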

ShareGPT format (multi-turn conversations):

{"conversations": [
  {"from": "human", "value": "What does EBITDA stand for?"},
  {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."}
]}
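
Chat templates generally expect role/content messages rather than ShareGPT's from/value pairs, so a small conversion step is common. The sketch below assumes the speaker labels "system", "human", and "gpt"; the mapping and the function name are illustrative, so adjust them to whatever labels your data actually uses.

```python
# Convert a ShareGPT-style record into role/content messages.
# ROLE_MAP and sharegpt_to_messages are hypothetical names for this example.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> list:
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of silently dropping an unexpected speaker
            raise ValueError(f"unknown speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


record = {"conversations": [
    {"from": "human", "value": "What does EBITDA stand for?"},
    {"from": "gpt", "value": "Earnings Before Interest, Taxes, Depreciation, and Amortization."},
]}
for message in sharegpt_to_messages(record):
    print(message)
```

The resulting list is the shape that tokenizer.apply_chat_template(messages, tokenize=False) accepts in transformers, which renders the conversation with the model's own chat markup.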

How much data?

  • Under 300 examples: fine-tune the Instruct model (style and behavior shaping)
  • 300–1,000 examples: Instruct or base model both work
  • Over 1,000 examples: base model preferred for deeper behavior change

More data doesn't reliably beat better data. If you have 10,000 mediocre examples and 500 carefully curated ones, the 500 will often produce a better model. Deduplicate, filter out short or malformed entries, and aim for consistent quality before you care about quantity.
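
The cleanup described above can be sketched in a few lines of plain Python. This is one possible filter for Alpaca-style rows, not a complete quality pipeline: clean_examples and the min_output_chars threshold are assumptions made for illustration, and deduplication here only catches rows that match after lowercasing and whitespace normalization.

```python
# Deduplicate and filter Alpaca-style rows before training.
# clean_examples and min_output_chars are hypothetical, chosen for this sketch.

def _normalize(value) -> str:
    """Lowercase and collapse whitespace so near-identical rows compare equal."""
    return " ".join(str(value or "").lower().split())


def clean_examples(examples, min_output_chars=2):
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = ex.get("instruction") if isinstance(ex, dict) else None
        output = ex.get("output") if isinstance(ex, dict) else None
        # Drop malformed rows: missing or non-string fields
        if not isinstance(instruction, str) or not isinstance(output, str):
            continue
        # Drop rows with an empty instruction or a too-short output
        if not instruction.strip() or len(output.strip()) < min_output_chars:
            continue
        key = (_normalize(instruction), _normalize(ex.get("input")), _normalize(output))
        if key in seen:
            continue  # duplicate of a row we already kept
        seen.add(key)
        cleaned.append(ex)
    return cleaned


raw = [
    {"instruction": "Convert this date to ISO 8601:", "input": "March 15th, 2026", "output": "2026-03-15"},
    {"instruction": "convert this date to ISO 8601: ", "input": "March 15th, 2026", "output": "2026-03-15"},
    {"instruction": "Summarize this clause in plain English:", "input": "...", "output": ""},
    {"instruction": "Row with no output field"},
]
print(f"kept {len(clean_examples(raw))} of {len(raw)} rows")
```

Reviewing a sample of what the filter drops is worth the time, since a threshold that is too aggressive can remove legitimately short answers.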

Load your data:

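from datasets import load_dataset

# Option A: a local JSON Lines file (one JSON object per line)
dataset = load_dataset("json", data_files="your_data.jsonl", split="train")

# Option B: a public HF dataset. Uncomment to use it instead of the local file;
# leaving both active would overwrite the dataset loaded above.
# dataset = load_dataset("tatsu-lab/alpaca", split="train")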
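
A single malformed line in a JSONL file can make loading fail with an error that is hard to trace back to the offending row, so a pre-flight check can save time. The sketch below assumes Alpaca-style rows; REQUIRED_KEYS and find_bad_lines are hypothetical names, and the demo writes a throwaway file so it runs without your real data.

```python
# Pre-flight check for a JSON Lines training file.
# REQUIRED_KEYS and find_bad_lines are illustrative, not part of any library.
import json
import os
import tempfile

REQUIRED_KEYS = {"instruction", "output"}  # assumption: Alpaca-style rows


def find_bad_lines(path):
    """Return (line_number, reason) for every line that would not train cleanly."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not reported
            try:
                row = json.loads(line)
            except json.JSONDecodeError as err:
                bad.append((line_no, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(row, dict):
                bad.append((line_no, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                bad.append((line_no, f"missing keys: {sorted(missing)}"))
    return bad


# Demo on a temporary file with one good row and two bad ones
demo_rows = [
    '{"instruction": "Say hi", "input": "", "output": "Hi"}',
    '{"instruction": "No output here"}',
    '{broken json',
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(demo_rows) + "\n")
    demo_path = tmp.name

for line_no, reason in find_bad_lines(demo_path):
    print(f"line {line_no}: {reason}")
os.remove(demo_path)
```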

Step 4: Load the model with QLoRA

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,   # drop to 1024 if you hit OOM on 8GB VRAM
    dtype=None,            # auto-detect: bfloat16 on Ampere+, float16 older
    load_in_4bit=True,     # QLoRA: model in 4-bit, adapters in 16-bit
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank — higher = more capacity, more VRAM
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,        # 0 is optimal for Unsloth's fused kernels
    bias="none",
    use_gradient_checkpointing="unsloth",  # long-context support, less VRAM
    random_state=3407,
)

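load_in_4bit=True is the QLoRA switch. The base model loads compressed to 4-bit and stays frozen; the LoRA adapters, the only trainable parameters, remain in 16-bit. With r=16 on all seven projection modules, that is about 42 million trainable parameters, roughly 0.5% of the model's 8 billion. The small trainable set and the 4-bit base together are why 8GB is enough.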
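
To see how the adapter size follows from the configuration, you can count the LoRA parameters by hand: each adapted linear layer adds r × (in_features + out_features) weights. The sketch below assumes the dimensions from Llama 3.1 8B's published config (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers) and a rounded total of 8.03 billion parameters; the names are illustrative.

```python
# Count trainable LoRA parameters for the target_modules used in this step.
# Assumed Llama 3.1 8B dimensions, taken from the model's published config.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
LAYERS = 32            # num_hidden_layers
TOTAL_PARAMS = 8.03e9  # rounded total for the base model

# (in_features, out_features) for each adapted projection in one layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r: int) -> int:
    """LoRA adds an (in x r) and an (r x out) matrix per adapted layer."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * LAYERS


for rank in (16, 32, 64):
    count = lora_param_count(rank)
    print(f"r={rank}: {count:,} trainable ({100 * count / TOTAL_PARAMS:.2f}% of the base model)")
```

The count scales linearly with r, so doubling the rank doubles the adapter's parameters and its optimizer state while the frozen 4-bit base stays the same size.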

LoRA rank (r): r=16 is the standard starting point. Raise it to 32 or 64 if you're doing style transfer or long-form generation and have the VRAM headroom. For simple format trai
