DEV Community

HK Lee

Posted on • Originally published at pockit.tools

Fine-Tuning Open-Source LLMs with QLoRA and Unsloth: The Complete 2026 Guide

You've built a prototype with GPT-4 or Claude. It works great. Then the invoice arrives: $12,000 for last month's API calls. And it's growing 40% month-over-month.

This is the moment every AI engineer hits: the prototype-to-production cliff, where managed API costs become unsustainable, latency requirements tighten, and you realize you need a model that actually understands your domain — not the entire internet's worth of knowledge.

Fine-tuning an open-source LLM is the answer. And thanks to QLoRA (Quantized Low-Rank Adaptation) and tools like Unsloth, you no longer need a cluster of A100 GPUs or a PhD in machine learning to do it. A single consumer GPU is enough: a 24GB RTX 4090, or even a free 16GB Google Colab T4, can fine-tune a multi-billion-parameter model to outperform GPT-4 on your specific task.

This guide covers everything from zero to production: why fine-tuning works, how QLoRA makes it feasible on consumer hardware, how to prepare your dataset, the exact training code, evaluation strategies, and deployment. Every code example is production-tested.

Why Fine-Tune Instead of Prompting?

Before diving into the how, let's be precise about when fine-tuning is the right choice. It's not always the answer.

Use prompting / RAG when:

  • Your task is general-purpose (summarization, translation, Q&A over documents)
  • Your data changes frequently (knowledge bases, support tickets)
  • You're still exploring what the model should do
  • You need to ship in days, not weeks

Use fine-tuning when:

  • The model needs to learn a specific style, format, or behavior that prompting can't reliably produce
  • You have a well-defined task with consistent input/output patterns
  • Latency and cost at scale matter (a fine-tuned 7B model is 10–50x cheaper per token than GPT-4)
  • You need the model to deeply understand domain-specific terminology
  • You want to reduce hallucinations on domain-specific facts

The most common fine-tuning use cases in production:

| Use Case | Why Prompting Falls Short | Fine-Tuning Advantage |
| --- | --- | --- |
| Code generation for internal APIs | Model doesn't know your SDK | Learns your specific patterns and conventions |
| Medical/Legal document analysis | Generic models hedge too much | Confident, domain-specific outputs |
| Structured data extraction | Prompt-based formatting is brittle | Consistent schema adherence |
| Customer support tone matching | System prompts drift over long conversations | Baked-in voice and personality |
| SQL generation for custom schemas | Schema in context eats tokens | Internalized schema knowledge |

The key insight: fine-tuning doesn't teach the model new knowledge per se. It teaches the model new behaviors. A fine-tuned model doesn't memorize your database — it learns how to reason about your domain's patterns, produce outputs in your specific format, and apply your organization's conventions consistently.

Understanding LoRA and QLoRA

The Problem: Full Fine-Tuning Is Expensive

Traditional full fine-tuning updates every parameter in the model. For a 7B parameter model, this means:

  • Memory: ~28GB just for the model weights in FP32, plus ~28GB for optimizer states, plus ~28GB for gradients. Total: ~84GB of VRAM minimum.
  • Hardware: Multiple A100 80GB GPUs.
  • Cost: $10–50/hour on cloud GPU instances, training runs lasting hours to days.
  • Risk: Catastrophic forgetting — the model loses its general capabilities while learning your specific task.

LoRA: The Breakthrough

LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, rests on a key insight: you don't need to update all parameters. When fine-tuning a pre-trained model, the weight changes tend to have a low intrinsic rank. This means the update matrix can be decomposed into two much smaller matrices.

Instead of updating a weight matrix W of dimensions d × k, LoRA freezes W and trains two small matrices A (d × r) and B (r × k), where r (the rank) is much smaller than both d and k:

Original:  W (4096 × 4096) → 16.7M parameters to update
LoRA:      A (4096 × 16) + B (16 × 4096) → 131K parameters to update

Reduction: 99.2% fewer trainable parameters
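The parameter arithmetic above is easy to verify (plain Python, no library assumed):

```python
def lora_params(d, k, r):
    """Trainable parameters for a rank-r LoRA adapter on a d x k weight matrix."""
    return d * r + r * k  # A is d x r, B is r x k

full = 4096 * 4096                  # 16,777,216 parameters in the frozen weight
lora = lora_params(4096, 4096, 16)  # 131,072 trainable parameters
reduction = 1 - lora / full         # 0.9921875 -> 99.2% fewer trainable parameters
```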

The forward pass becomes: output = W·x + (α/r)·B·A·x, where α is a scaling factor and r is the rank. During inference, you can merge the scaled B·A product back into W, so there's zero additional latency compared to the original model.

# Conceptual illustration of LoRA
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, original_layer: nn.Linear, rank: int = 16, alpha: float = 32):
        super().__init__()
        self.original = original_layer
        self.original.weight.requires_grad = False  # Freeze original weights

        d_in = original_layer.in_features
        d_out = original_layer.out_features

        # Low-rank decomposition matrices
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))

        self.scale = alpha / rank

    def forward(self, x):
        # Original computation (frozen) + low-rank update
        original_output = self.original(x)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scale
        return original_output + lora_output

    def merge(self):
        """Merge LoRA weights into original for zero-cost inference."""
        self.original.weight.data += (self.lora_A @ self.lora_B).T * self.scale

QLoRA: Making It Accessible

QLoRA (Quantized LoRA), introduced by Dettmers et al. in 2023, added three innovations that made fine-tuning accessible to consumer hardware:

  1. 4-bit NormalFloat (NF4) Quantization: The base model is quantized to 4 bits using a distribution-aware quantization scheme. This reduces a 7B model from ~14GB (FP16) to ~3.5GB.

  2. Double Quantization: The quantization constants themselves are quantized, saving an additional 0.37 bits per parameter (~325MB on a 7B model).

  3. Paged Optimizers: Optimizer states are offloaded to CPU RAM when GPU memory runs low, using NVIDIA unified memory. This prevents OOM crashes during training spikes.
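A quick sanity check on those numbers, using nothing but arithmetic (the 0.37 bits/parameter figure is the one quoted above from the QLoRA paper):

```python
params = 7e9  # 7B-parameter model

fp16_gb = params * 2 / 1e9                  # 16 bits/param -> ~14.0 GB
nf4_gb = params * 0.5 / 1e9                 # 4 bits/param  -> ~3.5 GB
double_quant_mb = params * 0.37 / 8 / 1e6   # ~324 MB saved by double quantization
```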

The result: fine-tune a 7B model on a single 24GB GPU, or a 13B model on a 48GB GPU. Here's the memory breakdown:

Full Fine-Tuning (7B model):
  Model weights (FP32):    ~28 GB
  Optimizer states:        ~28 GB
  Gradients:               ~28 GB
  Total:                   ~84 GB → Needs 2x A100 80GB

QLoRA Fine-Tuning (7B model):
  Model weights (NF4):     ~3.5 GB
  LoRA adapters (FP16):    ~0.1 GB
  Optimizer states:        ~0.4 GB
  Gradients + activations: ~4.0 GB
  Total:                   ~8.0 GB → Fits on RTX 4090 (24GB) with room to spare

The quality difference between full fine-tuning and QLoRA? In most benchmarks, it's within 1–2% — a negligible tradeoff for a 10x reduction in hardware requirements.

Setting Up Your Environment

Hardware Requirements

| GPU | VRAM | Maximum Model Size | Training Speed |
| --- | --- | --- | --- |
| T4 (Colab Free) | 16GB | 7B (tight) | ~1.5 hours/epoch on 10K samples |
| RTX 3090/4090 | 24GB | 7B (comfortable), 13B (tight) | ~45 min/epoch on 10K samples |
| A100 40GB | 40GB | 13B (comfortable), 34B (tight) | ~20 min/epoch on 10K samples |
| A100 80GB | 80GB | 70B with aggressive quantization | ~15 min/epoch on 10K samples |

Installation

Using Unsloth (recommended for 2–5x speedup over standard HuggingFace training):

# Create a fresh environment
conda create -n finetune python=3.11 -y
conda activate finetune

# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Unsloth (handles bitsandbytes, transformers, peft, trl automatically)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes

# For evaluation
pip install rouge-score nltk scikit-learn

For a standard HuggingFace setup (without Unsloth):

pip install transformers peft trl bitsandbytes accelerate datasets
pip install flash-attn --no-build-isolation  # Optional but recommended

Choosing a Base Model (March 2026)

The choice of base model matters more than most people realize. Here's the current landscape:

| Model | Parameters | Context Length | Best For | License |
| --- | --- | --- | --- | --- |
| Llama 4 Scout | 109B total / 17B active (MoE) | 10M | Latest Meta flagship, massive context, needs H100 | Llama 4 Community |
| Llama 3.1 8B | 8B | 128K | Best quality-to-size ratio for beginners | Llama 3.1 Community |
| Mistral Small 4 | 32B | 128K | Multilingual, strong reasoning, Apache licensed | Apache 2.0 |
| Gemma 3 27B | 27B | 128K | Multimodal, strong coding, Google ecosystem | Gemma License |
| Qwen 2.5 7B | 7B | 128K | Best for Chinese + English bilingual | Apache 2.0 |
| Phi-4 14B | 14B | 16K | Compact, strong reasoning | MIT |

Recommendation for first-timers: Start with Llama 3.1 8B. It has the largest community, the most tutorials, and the best quality-to-resources ratio. Once your pipeline works, swap in larger models like Llama 4 Scout or Gemma 3 27B. Note that Llama 4 Scout uses a Mixture-of-Experts (MoE) architecture — only 17B parameters are active per token, making it surprisingly efficient despite its 109B total parameter count.

Preparing Your Dataset

This is where most fine-tuning projects succeed or fail. The model is only as good as the data you feed it. A 500-example dataset of exceptional quality will outperform a 50,000-example dataset of mediocre quality.

Data Format: The Chat Template

Modern fine-tuning uses the conversational format (also called the "chat template" or "instruction format"). Here's the standard structure:

{
  "conversations": [
    {"role": "system", "content": "You are a senior backend engineer specializing in PostgreSQL optimization."},
    {"role": "user", "content": "This query is taking 30 seconds on a table with 50M rows:\n\nSELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending' ORDER BY created_at DESC LIMIT 10;"},
    {"role": "assistant", "content": "The issue is likely a missing composite index. Here's the analysis:\n\n**Problem:** Without an index on `(customer_id, status, created_at)`, PostgreSQL performs a sequential scan on 50M rows, then sorts the results.\n\n**Solution:**\n```sql\nCREATE INDEX CONCURRENTLY idx_orders_customer_status_created\nON orders (customer_id, status, created_at DESC);\n```\n\n**Why this works:**\n1. The index covers all three columns in the WHERE and ORDER BY clauses\n2. `created_at DESC` in the index matches your ORDER BY, eliminating a sort operation\n3. PostgreSQL can use an Index Only Scan if you replace `SELECT *` with specific columns\n4. `CONCURRENTLY` prevents locking the table during creation\n\n**Expected improvement:** From ~30s to <10ms for this query pattern."}
  ]
}

Data Quality Checklist

Before training, audit your dataset against these criteria:

def audit_dataset(dataset):
    issues = []

    for i, example in enumerate(dataset):
        conversations = example['conversations']

        # Check 1: Minimum conversation length
        if len(conversations) < 2:
            issues.append(f"Example {i}: Less than 2 turns")

        # Check 2: Response quality (length proxy)
        assistant_msgs = [c for c in conversations if c['role'] == 'assistant']
        for msg in assistant_msgs:
            if len(msg['content']) < 50:
                issues.append(f"Example {i}: Very short assistant response ({len(msg['content'])} chars)")
            if len(msg['content']) > 8000:
                issues.append(f"Example {i}: Very long assistant response ({len(msg['content'])} chars)")

        # Check 3: No empty messages
        for c in conversations:
            if not c['content'].strip():
                issues.append(f"Example {i}: Empty message from {c['role']}")

        # Check 4: Proper role alternation
        roles = [c['role'] for c in conversations if c['role'] != 'system']
        for j in range(1, len(roles)):
            if roles[j] == roles[j-1]:
                issues.append(f"Example {i}: Consecutive {roles[j]} messages")

        # Check 5: No data leakage (model shouldn't reference being fine-tuned)
        for c in conversations:
            if any(phrase in c['content'].lower() for phrase in ['as an ai', 'i am an ai', 'language model']):
                issues.append(f"Example {i}: Potential identity leakage in {c['role']} message")

    return issues

How Many Examples Do You Need?

The answer depends on your task:

| Task Type | Minimum | Sweet Spot | Diminishing Returns |
| --- | --- | --- | --- |
| Style/tone adaptation | 50–100 | 200–500 | >1,000 |
| Domain-specific Q&A | 200–500 | 1,000–3,000 | >10,000 |
| Code generation (specific SDK) | 500–1,000 | 2,000–5,000 | >15,000 |
| Complex reasoning chains | 1,000–2,000 | 5,000–10,000 | >20,000 |

The 80/10/10 rule for data splitting:

  • 80% for training
  • 10% for validation (monitored during training to prevent overfitting)
  • 10% for final evaluation (never seen during training)

from datasets import load_dataset, DatasetDict

def prepare_splits(dataset_path):
    dataset = load_dataset("json", data_files=dataset_path, split="train")
    dataset = dataset.shuffle(seed=42)

    # 80/10/10 split
    train_test = dataset.train_test_split(test_size=0.2, seed=42)
    val_test = train_test['test'].train_test_split(test_size=0.5, seed=42)

    return DatasetDict({
        'train': train_test['train'],
        'validation': val_test['train'],
        'test': val_test['test'],
    })

Generating Synthetic Training Data

If you don't have enough examples, you can bootstrap your dataset using a strong model (GPT-4, Claude) to generate training data for a smaller model. This technique, called knowledge distillation via synthetic data, is used extensively in production.

import openai
import json

SYSTEM_PROMPT = """You are generating training data for a fine-tuned model
that will act as a PostgreSQL optimization expert.

Generate realistic user questions about PostgreSQL performance issues and
provide expert-level responses. Include:
- Specific SQL queries with realistic table names and sizes
- EXPLAIN ANALYZE output interpretation
- Concrete index recommendations with CREATE INDEX statements
- Performance improvement estimates

Each response should be 200-500 words with code examples.

Return each example as a JSON object with a "conversations" array of
role/content messages."""

async def generate_training_examples(n_examples: int = 500):
    client = openai.AsyncOpenAI()
    examples = []

    topics = [
        "slow JOIN queries on large tables",
        "N+1 query problems in ORMs",
        "full table scans on indexed columns",
        "lock contention in high-write scenarios",
        "query plan regression after VACUUM",
        "connection pool exhaustion",
        "index bloat detection and remediation",
        # ... more topics
    ]

    for i in range(n_examples):
        topic = topics[i % len(topics)]

        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Generate a training example about: {topic}. "
                                             f"Vary the complexity and table schemas."}
            ],
            temperature=0.9,  # Higher temperature for diversity
            response_format={"type": "json_object"},
        )

        example = json.loads(response.choices[0].message.content)
        examples.append(example)

    return examples

Critical warning: Always manually review a sample of your synthetic data. LLMs can generate plausible-sounding but incorrect technical advice. Budget time for human review of at least 10–20% of synthetic examples.
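A reproducible way to draw that review sample, using only the standard library (`sample_for_review` is a hypothetical helper for illustration, not part of any package):

```python
import random

def sample_for_review(examples, frac=0.15, seed=42):
    """Draw a reproducible random sample of synthetic examples for human review."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * frac))
    return rng.sample(examples, k)

# e.g. 500 synthetic examples -> 75 flagged for expert review
review_batch = sample_for_review(list(range(500)))
```

Fixing the seed matters: reviewers can be handed the same batch across dataset revisions, so fixes are verifiable.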

Training with Unsloth

Now for the main event. Here's the complete training script:

import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ─────────────────────────────────────────
# 1. Load Model with 4-bit Quantization
# ─────────────────────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,      # Maximum sequence length for training
    dtype=None,                # Auto-detect: float16 for older GPUs, bfloat16 for Ampere+
    load_in_4bit=True,         # QLoRA: load base model in 4-bit NF4
    # token="hf_...",          # Uncomment if using gated models
)

# ─────────────────────────────────────────
# 2. Configure LoRA Adapters
# ─────────────────────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                       # LoRA rank — higher = more capacity, more VRAM
    lora_alpha=64,              # Scaling factor — typically 2x rank
    target_modules=[            # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",       # MLP layers
    ],
    lora_dropout=0.05,          # Slight dropout for regularization
    bias="none",                # Don't train bias terms
    use_gradient_checkpointing="unsloth",  # 60% less VRAM for long contexts
    random_state=42,
)

# Verify trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 83,886,080 || all params: 8,113,831,936 || trainable%: 1.034%

# ─────────────────────────────────────────
# 3. Load and Format Dataset
# ─────────────────────────────────────────
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def format_chat(example):
    """Format conversations into the Llama 3 chat template."""
    messages = example['conversations']
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat, remove_columns=dataset.column_names)

# ─────────────────────────────────────────
# 4. Configure Training
# ─────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=4,      # Adjust based on VRAM
    gradient_accumulation_steps=4,       # Effective batch size = 4 * 4 = 16
    num_train_epochs=3,                  # 2-4 epochs is typical
    learning_rate=2e-4,                  # Standard for QLoRA
    lr_scheduler_type="cosine",          # Cosine decay with warmup
    warmup_ratio=0.05,                   # 5% of steps for warmup
    weight_decay=0.01,                   # Mild regularization
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    optim="adamw_8bit",                  # 8-bit Adam saves VRAM
    seed=42,
    max_grad_norm=0.3,                   # Gradient clipping for stability
    report_to="wandb",                   # Optional: Weights & Biases logging
)

# ─────────────────────────────────────────
# 5. Initialize Trainer and Start
# ─────────────────────────────────────────
# Hold out a small validation set: eval_strategy="steps" requires one
split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    args=training_args,
    max_seq_length=4096,
    dataset_text_field="text",
    packing=True,              # Pack short examples together for efficiency
)

# Start training
print("Starting fine-tuning...")
stats = trainer.train()
print(f"Training completed in {stats.metrics['train_runtime']:.0f} seconds")
print(f"Final loss: {stats.metrics['train_loss']:.4f}")

Hyperparameter Tuning Guide

The defaults above work well for most cases, but here's how to tune them:

LoRA Rank (r):

  • r=8: Minimal adaptation. Good for simple style changes.
  • r=16: Default. Works for most tasks.
  • r=32: Higher capacity. Good for complex domain adaptation.
  • r=64+: Approaching full fine-tuning capacity. Rarely needed.

Rule of thumb: start with r=16 for a first run (the training script above uses r=32 for heavier domain adaptation). If validation loss plateaus early, increase to 32. If it overfits quickly, decrease to 8.

Learning Rate:

  • 2e-4: Standard QLoRA learning rate. Start here.
  • 1e-4: More conservative. Use if training is unstable.
  • 5e-5: Very conservative. Use for very small datasets (<200 examples).

Number of Epochs:

  • 1–2: Large datasets (>10K examples)
  • 2–4: Medium datasets (1K–10K examples)
  • 4–8: Small datasets (<1K examples)
  • Watch for overfitting: If validation loss starts increasing while training loss keeps decreasing, you're overfitting. Stop training.
# Monitor overfitting during training
from transformers import EarlyStoppingCallback

# Note: EarlyStoppingCallback requires load_best_model_at_end=True and
# metric_for_best_model="eval_loss" in TrainingArguments.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,        # Critical: provide validation set
    args=training_args,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=3,    # Stop after 3 evals without improvement
            early_stopping_threshold=0.01 # Minimum improvement threshold
        ),
    ],
    max_seq_length=4096,
    dataset_text_field="text",
    packing=True,
)

What the Training Loss Curve Should Look Like

A healthy training run looks like this:

Loss
 4.0 |X
     |  X
 3.0 |    X
     |      X
 2.0 |        X  X
     |            X  X  X
 1.0 |                    X  X  X  X  X  X  ← plateau (good, model converged)
     |
 0.0 +─────────────────────────────────────
     0    200   400   600   800   1000
                    Steps

Red flags:
- Loss doesn't decrease → Learning rate too low, or data has issues
- Loss drops to near 0 → Overfitting badly, reduce epochs or increase data
- Loss is very noisy → Batch size too small or learning rate too high
- Loss spikes suddenly → Gradient explosion, reduce learning rate
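The overfitting red flag can also be turned into a programmatic check. This is a heuristic sketch (not part of any training library; in practice the EarlyStoppingCallback shown earlier does this job):

```python
def is_overfitting(train_losses, val_losses, window=3):
    """Flag runs where validation loss rose for `window` consecutive evals
    while training loss kept falling."""
    if len(val_losses) < window + 1 or len(train_losses) < window + 1:
        return False
    val_rising = all(val_losses[-i] > val_losses[-i - 1] for i in range(1, window + 1))
    train_falling = train_losses[-1] < train_losses[-window - 1]
    return val_rising and train_falling

is_overfitting([2.0, 1.5, 1.2, 1.0], [1.0, 1.1, 1.2, 1.3])  # True: classic divergence
is_overfitting([2.0, 1.5, 1.2, 1.0], [1.0, 0.9, 0.8, 0.7])  # False: both still falling
```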

Evaluating Your Model

Training is only half the battle. Evaluation tells you whether your fine-tuned model actually improves on the base model for your specific task.

Automated Evaluation

import torch
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score
import json

def evaluate_model(model, tokenizer, test_dataset, max_samples=100):
    """Comprehensive evaluation of fine-tuned model."""
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    results = {
        'rouge1_scores': [],
        'rougeL_scores': [],
        'format_compliance': [],
        'avg_response_length': [],
        'examples': [],
    }

    for i, example in enumerate(test_dataset.select(range(min(max_samples, len(test_dataset))))):
        conversations = example['conversations']

        # Build prompt from all messages except the last assistant response
        prompt_messages = []
        expected_response = ""
        for msg in conversations:
            if msg['role'] == 'assistant' and msg == conversations[-1]:
                expected_response = msg['content']
            else:
                prompt_messages.append(msg)

        # Generate response
        inputs = tokenizer.apply_chat_template(
            prompt_messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=1024,
                do_sample=False,        # Greedy decoding for reproducibility
            )

        generated = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

        # Score
        rouge_scores = scorer.score(expected_response, generated)
        results['rouge1_scores'].append(rouge_scores['rouge1'].fmeasure)
        results['rougeL_scores'].append(rouge_scores['rougeL'].fmeasure)
        results['avg_response_length'].append(len(generated))

        # Check format compliance (e.g., does it include code blocks when expected?)
        expected_has_code = '```' in expected_response
        generated_has_code = '```' in generated
        results['format_compliance'].append(expected_has_code == generated_has_code)

        # Store examples for manual review
        if i < 10:
            results['examples'].append({
                'prompt': prompt_messages[-1]['content'][:200],
                'expected': expected_response[:300],
                'generated': generated[:300],
                'rouge1': rouge_scores['rouge1'].fmeasure,
            })

    # Aggregate results
    summary = {
        'avg_rouge1': sum(results['rouge1_scores']) / len(results['rouge1_scores']),
        'avg_rougeL': sum(results['rougeL_scores']) / len(results['rougeL_scores']),
        'format_compliance_rate': sum(results['format_compliance']) / len(results['format_compliance']),
        'avg_response_length': sum(results['avg_response_length']) / len(results['avg_response_length']),
        'examples': results['examples'],
    }

    return summary

A/B Comparison: Base vs. Fine-Tuned

The most informative evaluation compares your fine-tuned model against the base model on the same prompts:

def ab_comparison(base_model, finetuned_model, tokenizer, test_prompts):
    """Side-by-side comparison of base vs fine-tuned model responses."""
    results = []

    for prompt in test_prompts:
        messages = [
            {"role": "system", "content": "You are a PostgreSQL optimization expert."},
            {"role": "user", "content": prompt},
        ]

        # Generate from both models
        inputs = tokenizer.apply_chat_template(messages, tokenize=True,
                                                add_generation_prompt=True,
                                                return_tensors="pt")

        base_output = base_model.generate(inputs.to(base_model.device),
                                           max_new_tokens=512, temperature=0.1)
        ft_output = finetuned_model.generate(inputs.to(finetuned_model.device),
                                              max_new_tokens=512, temperature=0.1)

        base_text = tokenizer.decode(base_output[0][inputs.shape[1]:], skip_special_tokens=True)
        ft_text = tokenizer.decode(ft_output[0][inputs.shape[1]:], skip_special_tokens=True)

        results.append({
            'prompt': prompt,
            'base_response': base_text,
            'finetuned_response': ft_text,
        })

    return results

LLM-as-Judge Evaluation

For subjective quality assessment, use a stronger model to judge:

async def llm_judge_evaluation(examples, judge_model="gpt-4o"):
    """Use a strong LLM to evaluate response quality."""
    client = openai.AsyncOpenAI()
    scores = []

    for ex in examples:
        response = await client.chat.completions.create(
            model=judge_model,
            messages=[{
                "role": "system",
                "content": """You are evaluating the quality of a fine-tuned model's response.
                Rate each response on these dimensions (1-5 scale):
                1. Technical Accuracy: Are the technical claims correct?
                2. Completeness: Does it address all aspects of the question?
                3. Format Compliance: Does it follow the expected output format?
                4. Actionability: Can the user directly apply this advice?

                Respond as JSON: {"accuracy": N, "completeness": N, "format": N, "actionability": N, "reasoning": "..."}"""
            }, {
                "role": "user",
                "content": f"Question: {ex['prompt']}\n\nResponse: {ex['finetuned_response']}"
            }],
            response_format={"type": "json_object"},
        )

        score = json.loads(response.choices[0].message.content)
        scores.append(score)

    return scores

Saving and Deploying Your Model

Saving Options

After training, you have four saving options:

# Option 1: Save LoRA adapters only (~100-300MB)
# Best for: Version control, quick swapping between adapters
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

# Option 2: Merge and save full model in 16-bit (~14GB for 7B)
# Best for: Standard deployment with vLLM or TGI
model.save_pretrained_merged("./my-model-merged", tokenizer, save_method="merged_16bit")

# Option 3: Export as GGUF for llama.cpp / Ollama deployment
# Best for: CPU inference, edge deployment, local development
model.save_pretrained_gguf("./my-model-gguf", tokenizer, quantization_method="q4_k_m")

# Option 4: Push to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model", token="hf_...")
tokenizer.push_to_hub("your-username/my-fine-tuned-model", token="hf_...")

Deployment with vLLM (Production Recommended)

vLLM is the standard for production LLM serving. It supports continuous batching, PagedAttention, and speculative decoding for maximum throughput:

# Install vLLM
# pip install vllm

# Serve the merged model
# vllm serve ./my-model-merged --port 8000 --max-model-len 4096

# Or serve the base model with LoRA adapter (hot-swappable!)
# vllm serve unsloth/Meta-Llama-3.1-8B-Instruct \
#   --enable-lora \
#   --lora-modules my-adapter=./my-lora-adapter \
#   --port 8000

# Client code — uses OpenAI-compatible API
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="./my-model-merged",    # Or "my-adapter" for LoRA
    messages=[
        {"role": "system", "content": "You are a PostgreSQL optimization expert."},
        {"role": "user", "content": "My query is doing a sequential scan on 100M rows..."},
    ],
    temperature=0.3,
    max_tokens=1024,
)

print(response.choices[0].message.content)

Deployment with Ollama (Local/Edge)

For local development or edge deployment, export to GGUF and use Ollama:

# After exporting to GGUF
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./my-model-gguf/unsloth.Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096
SYSTEM "You are a PostgreSQL optimization expert."
EOF

# Create and run the Ollama model
ollama create my-pg-expert -f Modelfile
ollama run my-pg-expert "How do I optimize a slow GROUP BY query?"

Production Checklist

Before shipping your fine-tuned model to production, run through this checklist:

Pre-Deployment

  • [ ] Evaluation scores exceed baseline: Fine-tuned model outperforms base model on your test set by a meaningful margin
  • [ ] No catastrophic forgetting: Test on general tasks to ensure the model hasn't lost basic capabilities
  • [ ] Guardrails in place: Test for harmful, biased, or out-of-scope outputs and implement content filtering
  • [ ] Latency benchmarks: Measure P50, P95, and P99 latency under realistic load
  • [ ] Cost projection: Calculate per-request cost including GPU compute, and compare to API-based alternatives
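For the cost projection bullet, a back-of-the-envelope helper (a sketch; the $2/hour and 1,000 tokens/s figures below are illustrative assumptions, not benchmarks):

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Rough serving cost per 1M generated tokens on a dedicated GPU,
    ignoring idle time and batching inefficiencies."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# e.g. a GPU at $2/hour sustaining 1,000 tokens/s across batches:
cost = cost_per_million_tokens(2.0, 1000)  # ~$0.56 per 1M tokens
```

Compare that figure directly against your API provider's per-million-token price, and remember the GPU bills for idle time too.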

Infrastructure

# Health check endpoint for production deployment
# (assumes `model` and `tokenizer` are already loaded, as in the training script)
from fastapi import FastAPI
import torch
import time

app = FastAPI()

@app.get("/health")
async def health():
    start = time.time()
    # Quick inference test
    response = model.generate(
        tokenizer("test", return_tensors="pt")["input_ids"].to(model.device),
        max_new_tokens=10,
    )
    latency_ms = (time.time() - start) * 1000

    return {
        "status": "healthy",
        "model": "my-finetuned-model-v1",
        "inference_latency_ms": round(latency_ms, 2),
        "gpu_memory_used_gb": round(torch.cuda.memory_allocated() / 1e9, 2),
        "gpu_memory_total_gb": round(torch.cuda.get_device_properties(0).total_mem / 1e9, 2),
    }

Post-Deployment Monitoring

  • [ ] Track output quality over time: Set up automated evaluation on a random sample of production requests
  • [ ] Monitor for distribution drift: If user queries start differing from training data, model quality will degrade
  • [ ] Version your models: Use semantic versioning (v1.0.0, v1.1.0) and maintain rollback capability
  • [ ] Retrain cadence: Plan for periodic retraining as you accumulate more production data
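For the distribution-drift item, a cheap first-pass signal is vocabulary overlap between training prompts and recent production prompts. The `vocab_shift` helper below is an illustrative sketch, not a substitute for proper embedding-based drift detection:

```python
from collections import Counter

def vocab_shift(train_prompts, prod_prompts, k=200):
    """Crude drift signal: Jaccard overlap of the top-k most frequent
    tokens in training vs. production prompts. Lower = more drift."""
    def top_tokens(prompts):
        counts = Counter(tok for p in prompts for tok in p.lower().split())
        return {tok for tok, _ in counts.most_common(k)}
    a, b = top_tokens(train_prompts), top_tokens(prod_prompts)
    return len(a & b) / len(a | b)

# Alert when overlap drops below a threshold you calibrate on held-out data
overlap = vocab_shift(
    ["how do I optimize a slow GROUP BY query"],
    ["why is my sequential scan slow on 100M rows"],
)
```

If this score trends downward week over week, users are asking about things your training set never covered, and it is time to collect new examples and retrain.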

Common Pitfalls and How to Avoid Them

Pitfall 1: Training on Bad Data

Symptom: Model generates plausible-sounding but factually incorrect responses.
Cause: Synthetic training data was generated without human verification.
Fix: Always have domain experts review at least 10% of your training data. One incorrect example can poison hundreds of related outputs.

Pitfall 2: Overfitting on Small Datasets

Symptom: Training loss approaches 0, but model outputs are formulaic and seem to memorize training examples verbatim.
Cause: Too many epochs on too few examples.
Fix: Reduce epochs, increase LoRA dropout to 0.1, use a lower LoRA rank, or augment your dataset.

Pitfall 3: Chat Template Mismatch

Symptom: Model outputs garbage, repeated tokens, or ignores the system prompt.
Cause: Using a different chat template during training vs. inference.
Fix: Always use tokenizer.apply_chat_template() for both training data formatting and inference prompt construction. Never manually construct chat prompts.
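This fix is best enforced structurally: route every prompt, for both training-data serialization and inference, through one formatting function. The toy below illustrates the principle with a hand-rolled Llama-3-style template as a stand-in; in real code that single function should simply call `tokenizer.apply_chat_template()`:

```python
# Llama-3-style turn format, used here only to illustrate the principle.
LLAMA3_TURN = "<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

def format_chat(messages, add_generation_prompt=False):
    """Single source of truth for prompt formatting. Use this for BOTH
    training-data serialization and inference prompt construction."""
    text = "<|begin_of_text|>" + "".join(
        LLAMA3_TURN.format(role=m["role"], content=m["content"]) for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

msgs = [{"role": "user", "content": "How do I speed up this query?"}]
train_text = format_chat(msgs + [{"role": "assistant", "content": "Add an index."}])
infer_prompt = format_chat(msgs, add_generation_prompt=True)
```

Because training and inference share one function, a template change in one place cannot silently diverge from the other, which is exactly the failure mode behind this pitfall.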

Pitfall 4: Catastrophic Forgetting

Symptom: Model performs well on your specific task but can't handle basic tasks it could do before fine-tuning.
Cause: Aggressive fine-tuning that overwrites general knowledge.
Fix: Use a lower learning rate (5e-5 instead of 2e-4), fewer epochs, or mix in general-purpose data (10–20% of your training set should be general-purpose examples).
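A sketch of the data-mixing fix; the `mix_datasets` helper and the 15% default are illustrative choices, not from any particular library:

```python
import random

def mix_datasets(domain_examples, general_examples, general_fraction=0.15, seed=42):
    """Blend general-purpose examples into the domain training set so the
    final mix is ~general_fraction general data (here 15%, within the
    10-20% range suggested above)."""
    rng = random.Random(seed)
    n_general = round(len(domain_examples) * general_fraction / (1 - general_fraction))
    mixed = domain_examples + rng.sample(
        general_examples, min(n_general, len(general_examples))
    )
    rng.shuffle(mixed)
    return mixed

# 850 domain examples + 150 sampled general examples -> 15% general mix
train_set = mix_datasets([{"task": "pg"}] * 850, [{"task": "general"}] * 1000)
```

General-purpose examples can come from any permissively licensed instruction dataset; the point is simply that the model keeps seeing tasks outside your narrow domain during training.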

Pitfall 5: Ignoring Quantization Effects

Symptom: Fine-tuned model performs well in FP16 but degrades significantly after GGUF quantization for deployment.
Cause: Aggressive quantization (Q2_K, Q3_K) on models that weren't designed for it.
Fix: Use Q4_K_M or Q5_K_M for deployment quantization. Always benchmark after quantization to verify quality is maintained. If quality drops significantly, use Q6_K or Q8_0 at the cost of higher memory usage.
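To build intuition for why bit depth matters, here is a toy round-to-nearest quantizer. This is illustrative only; real GGUF quantization uses block-wise scales and k-quant schemes that are far more sophisticated:

```python
def fake_quantize(weights, bits):
    """Toy symmetric round-to-nearest quantization: fewer bits means a
    coarser grid and therefore larger reconstruction error."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

w = [0.013 * i - 0.5 for i in range(80)]       # synthetic weight values
err4 = mean_abs_error(w, fake_quantize(w, 4))  # ~Q4-style bit budget
err8 = mean_abs_error(w, fake_quantize(w, 8))  # ~Q8-style bit budget
```

The 4-bit error is roughly an order of magnitude larger per weight, which is why aggressive quantization can erase the subtle weight changes your fine-tuning introduced. Benchmarking after quantization is the only reliable check.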

Cost Comparison: Fine-Tuning vs. API

Let's do the math on a realistic production scenario:

Scenario: 100,000 requests/month, average 500 input tokens + 500 output tokens per request.

| Approach | Monthly Cost | Latency (P50) | Control |
| --- | --- | --- | --- |
| GPT-4o API | ~$1,500 | 800ms | Low (OpenAI controls the model) |
| Claude 3.5 Sonnet API | ~$1,800 | 600ms | Low |
| GPT-4o-mini API | ~$30 | 400ms | Low |
| Self-hosted Llama 3.1 8B (A10G) | ~$350 | 120ms | Full |
| Self-hosted fine-tuned 8B (A10G) | ~$350 | 120ms | Full + domain expertise |

The fine-tuned model costs the same to run as the base model — but produces higher-quality domain-specific outputs. On the A10G instance, you're paying ~$0.75/hour for dedicated GPU compute versus per-token API pricing that scales linearly with usage.

Break-even point: For most teams, self-hosting a fine-tuned model becomes cheaper than API calls at around 30,000–50,000 requests/month, depending on the API model and response length.
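The break-even arithmetic is worth sanity-checking yourself with your own numbers. Using the figures from the scenario above (~$1,500/month for GPT-4o at 100,000 requests, ~$350/month for a dedicated A10G):

```python
# Figures from the scenario above; substitute your own measurements.
API_MONTHLY_AT_100K = 1_500.0   # GPT-4o API cost at 100k requests/month
GPU_MONTHLY = 350.0             # flat cost of a dedicated A10G instance

per_request_api = API_MONTHLY_AT_100K / 100_000      # $0.015 per request
breakeven_requests = GPU_MONTHLY / per_request_api   # flat cost / per-request cost
print(round(breakeven_requests))  # roughly 23,000 requests/month with these numbers
```

Against GPT-4o pricing the crossover lands near 23,000 requests/month; cheaper API models lower the per-request cost and push the crossover higher, which is where the 30,000-50,000 range comes from.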

What's Next: Emerging Techniques in 2026

Fine-tuning is evolving fast. Here's what's on the horizon:

Unsloth Studio (March 2026): Unsloth just released an open-source, local, no-code interface that handles the entire fine-tuning lifecycle — data preparation, training, and deployment — in a single GUI. It claims 70% less VRAM and 2x faster training. If you're not comfortable with the Python scripts above, Studio is a game-changer for accessibility.

DoRA (Weight-Decomposed Low-Rank Adaptation): A 2024 innovation that separates magnitude and direction in weight updates, consistently outperforming LoRA by 1–3% with negligible additional overhead. Already integrated into the PEFT library.

GaLore (Gradient Low-Rank Projection): Promises full fine-tuning quality at LoRA-level memory costs by projecting gradients into a low-rank space. Still experimental but promising for users who want to avoid the LoRA quality ceiling.

LoRA Merging and Model Soups: Combining multiple LoRA adapters (each trained on different tasks) into a single model through weight averaging. Enables multi-task specialization without separate model deployments.
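The core idea behind LoRA merging is plain elementwise weight averaging. A toy sketch over Python dicts (real merging operates on adapter state dicts of tensors, typically via PEFT's adapter utilities):

```python
def average_adapters(adapters, weights=None):
    """Weighted elementwise average of adapter parameter dicts.
    adapters: list of {param_name: list of floats}, all with identical keys."""
    weights = weights or [1 / len(adapters)] * len(adapters)
    return {
        name: [
            sum(w * a[name][i] for w, a in zip(weights, adapters))
            for i in range(len(adapters[0][name]))
        ]
        for name in adapters[0]
    }

# Two toy "adapters" trained on different tasks, averaged 50/50
sql_adapter = {"lora_A": [1.0, 2.0]}
docs_adapter = {"lora_A": [3.0, 4.0]}
soup = average_adapters([sql_adapter, docs_adapter])
```

In practice you would load each adapter's tensors, verify the ranks and target modules match, average, and save the result as a new adapter; unequal weights let you bias the soup toward the task that matters most.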

GRPO (Group Relative Policy Optimization): An emerging technique for training "reasoning AI" models that can perform multi-step logic and chain-of-thought. Unsloth supports GRPO with as little as 5GB of VRAM, making it accessible on local hardware. This is how the next generation of reasoning models (like DeepSeek-R1) are being trained.

Reward Model Fine-Tuning (RLHF/DPO): After supervised fine-tuning (SFT), a second training pass using Direct Preference Optimization (DPO) aligns model outputs with human preferences. This is how production models are trained to be helpful, harmless, and honest — and the tooling for applying it to custom models is now accessible.

# DPO training after SFT (sketch)
from trl import DPOTrainer, DPOConfig

dpo_config = DPOConfig(
    output_dir="./dpo-output",
    beta=0.1,                    # KL-divergence penalty strength
    learning_rate=5e-6,          # Much lower than SFT
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# DPO dataset requires chosen/rejected pairs
# {"prompt": "...", "chosen": "good response", "rejected": "bad response"}
dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,              # Uses implicit reference with PEFT
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()

Final Summary

Fine-tuning open-source LLMs has gone from a research-lab activity to a standard engineering workflow. With QLoRA and Unsloth, you can fine-tune a model that outperforms GPT-4 on your specific task — on a single consumer GPU, in under an hour.

The key principles:

  1. Data quality over quantity. 500 excellent examples beat 50,000 mediocre ones.
  2. Start small. Use Llama 3.1 8B with QLoRA on your smallest viable dataset. Get the pipeline working before scaling.
  3. Evaluate rigorously. Automated metrics (ROUGE, format compliance) plus LLM-as-judge plus manual review. All three.
  4. Monitor in production. The model's performance will drift as user queries evolve. Plan for periodic retraining.
  5. Know when NOT to fine-tune. If RAG or better prompting solves your problem, it's cheaper and faster than fine-tuning.

The gap between closed-source and open-source models shrinks every month. With the techniques in this guide, you can build production AI systems that are cheaper, faster, more private, and more specialized than anything an API can give you.


🛠️ Developer Toolkit: This post first appeared on the Pockit Blog.

Need a Regex Tester, JWT Decoder, or Image Converter? Use them on Pockit.tools or install the Extension to avoid switching tabs. No signup required.
