Akhilesh

84. Fine-Tuning LLMs: Teaching Giants New Tricks

GPT-3 has 175 billion parameters.

Full fine-tuning updates all 175 billion with every gradient step. You need multiple A100 GPUs (each with 80GB memory) just to fit the model. Training for even a few epochs on a moderate dataset costs thousands of dollars. A startup cannot do this. A PhD student cannot do this.
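A rough back-of-envelope sketch makes the problem concrete (illustrative numbers, assuming standard mixed-precision Adam: fp16 weights and gradients plus fp32 master weights and two fp32 optimizer states):

params = 175e9                          # GPT-3 parameter count

weights_fp16 = params * 2               # bytes for fp16 weights
grads_fp16   = params * 2               # bytes for fp16 gradients
adam_states  = params * (4 + 4 + 4)     # fp32 master weights + Adam m and v

print(f"Weights alone:       {weights_fp16 / 1e9:,.0f} GB")   # ~350 GB just to hold the model
print(f"Full training state: {(weights_fp16 + grads_fp16 + adam_states) / 1e9:,.0f} GB")  # ~2,800 GB before activations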

Yet fine-tuned versions of large models consistently outperform their base versions on specific tasks. The performance benefit is real. The cost is prohibitive.

LoRA (Low-Rank Adaptation) resolves this. Instead of updating all 175 billion parameters, it adds small trainable adapter matrices to specific weight matrices while keeping everything else frozen. The adapters are tiny, often around 0.1% of the total parameters, so the number of trainable parameters drops by roughly 10,000x and the optimizer state all but disappears. A single consumer GPU handles the training, and performance on the target task approaches full fine-tuning.
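The "0.1%" figure falls straight out of the matrix shapes. A quick illustration using a GPT-3-scale hidden size (the walkthrough below repeats this with a smaller 768-dimensional layer):

d = 12288          # hidden size of a GPT-3-scale attention projection
r = 8              # LoRA rank

full_matrix  = d * d              # frozen weight W: d × d
lora_adapter = r * d + d * r      # trainable A (r × d) and B (d × r)

print(f"Frozen W:     {full_matrix:,} params")
print(f"LoRA adapter: {lora_adapter:,} params")
print(f"Fraction:     {lora_adapter / full_matrix:.2%} per adapted matrix")   # ~0.13%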


Why Fine-Tuning Works

import torch
import torch.nn as nn
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                           AutoModelForSequenceClassification,
                           TrainingArguments, Trainer)
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
from datasets import load_dataset
import numpy as np
import warnings
warnings.filterwarnings("ignore")

torch.manual_seed(42)

print("Why fine-tuning matters:")
print()
print("Pretrained LLMs know:")
print("  Grammar, facts, reasoning, coding, translation, summarization")
print("  Learned from trillions of tokens")
print()
print("Pretrained LLMs do NOT know:")
print("  Your company's specific writing style")
print("  Your domain's specialized terminology")
print("  The format you want for outputs")
print("  How to respond to your specific prompt patterns")
print()
print("Fine-tuning teaches the model your specific requirements.")
print("The general knowledge transfers. The specific behavior adapts.")
print()
print("Types of fine-tuning:")
ft_types = {
    "Full fine-tuning":     "Update all parameters. Best performance, highest cost.",
    "Instruction tuning":   "Fine-tune on (instruction, response) pairs. Teaches following.",
    "LoRA":                 "Add small adapters. 99% of params frozen. Most practical.",
    "QLoRA":                "LoRA + 4-bit quantization. Runs on a single GPU.",
    "Prefix tuning":        "Learn soft prompts prepended to input. No weight changes.",
    "Adapter layers":       "Insert small bottleneck layers. Older but related to LoRA.",
}
for method, description in ft_types.items():
    print(f"  {method:<22}: {description}")

LoRA: The Math

class LoRALayer(nn.Module):
    """
    LoRA adds low-rank matrices A and B to an existing weight matrix W.
    The adapted weight is: W' = W + (B @ A) * scaling
    Only A and B are trained. W is frozen.
    """
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.rank    = rank
        self.scaling = alpha / rank

        self.lora_A = nn.Linear(in_features, rank,         bias=False)
        self.lora_B = nn.Linear(rank,         out_features, bias=False)

        nn.init.kaiming_uniform_(self.lora_A.weight, a=np.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x, frozen_output):
        lora_out = self.lora_B(self.lora_A(x)) * self.scaling
        return frozen_output + lora_out

in_features  = 768
out_features = 768
rank         = 8

original_layer = nn.Linear(in_features, out_features)
lora_layer     = LoRALayer(in_features, out_features, rank=rank)

for param in original_layer.parameters():
    param.requires_grad = False

original_params = original_layer.weight.numel()   # weight only, to match the W comparison below
lora_params     = sum(p.numel() for p in lora_layer.parameters())

print("LoRA Parameter Comparison:")
print()
print(f"Original weight matrix W:  {in_features}×{out_features} = {original_params:,} params")
print(f"LoRA A matrix:             {in_features}×{rank}    = {rank*in_features:,} params")
print(f"LoRA B matrix:             {rank}×{out_features}   = {rank*out_features:,} params")
print(f"LoRA total:                {lora_params:,} params")
print()
print(f"Reduction: {original_params:,}{lora_params:,} "
      f"({lora_params/original_params:.1%} of original)")
print()

x_sample        = torch.randn(2, 10, in_features)
frozen_output   = original_layer(x_sample)
adapted_output  = lora_layer(x_sample, frozen_output)

print(f"Forward pass:")
print(f"  Input:          {x_sample.shape}")
print(f"  Frozen output:  {frozen_output.shape}")
print(f"  Adapted output: {adapted_output.shape}")
print()
print(f"Initially ΔW = B@A = 0 (B initialized to zeros)")
print(f"As training proceeds, ΔW grows to capture task-specific patterns")

Using PEFT for LoRA

print("Using HuggingFace PEFT (Parameter-Efficient Fine-Tuning):")
print()

model_name = "distilbert-base-uncased"
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

total_before = sum(p.numel() for p in base_model.parameters())
print(f"Base model parameters: {total_before:,}")

lora_config = LoraConfig(
    task_type   = TaskType.SEQ_CLS,
    r           = 8,
    lora_alpha  = 16,
    target_modules = ["q_lin", "v_lin"],
    lora_dropout   = 0.05,
    bias        = "none",
)

peft_model = get_peft_model(base_model, lora_config)

trainable   = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total_after = sum(p.numel() for p in peft_model.parameters())

print(f"After LoRA:")
print(f"  Total parameters:     {total_after:,}")
print(f"  Trainable parameters: {trainable:,}  ({trainable/total_after:.2%})")
print(f"  Frozen parameters:    {total_after - trainable:,}")
print()
print("LoRA configuration:")
print(f"  rank (r):     {lora_config.r}  (lower = fewer params, less capacity)")
print(f"  lora_alpha:   {lora_config.lora_alpha}  (scaling = alpha/r = {lora_config.lora_alpha/lora_config.r})")
print(f"  target:       {lora_config.target_modules}  (query and value projections)")
print()
print("Why target Q and V projections?")
print("  These attention weights learn WHAT to attend to.")
print("  Task-specific attention patterns live here.")
print("  Key projections change less between tasks.")
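One way to see exactly where PEFT injected the adapters is to list the trainable parameters. A minimal sketch, using the peft_model built above:

# Only the injected lora_A / lora_B weights (plus the classification head,
# which PEFT keeps trainable for SEQ_CLS tasks) should appear here.
for name, param in peft_model.named_parameters():
    if param.requires_grad:
        print(f"  {name}  {tuple(param.shape)}")

# PEFT also provides a one-line summary of the same counts:
peft_model.print_trainable_parameters()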

QLoRA: Fine-Tuning 7B Models on a Consumer GPU

print("QLoRA: Quantized LoRA")
print()
print("LoRA alone: model weights in float16 or float32.")
print("  A 7B model needs ~14GB GPU memory in float16.")
print("  Requires A100 or similar.")
print()
print("QLoRA: base model quantized to 4-bit, LoRA adapters in float16.")
print("  A 7B model needs ~4-5GB GPU memory.")
print("  Runs on a single 8GB RTX 3070 or Google Colab T4.")
print()

qlora_config_example = """
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
# Now train with standard Trainer
"""

print(qlora_config_example)
print()
print("Memory comparison for 7B model:")
print(f"  Full fine-tuning (bf16):  ~56GB  (needs 2×A100)")
print(f"  LoRA (bf16):              ~14GB  (needs A100)")
print(f"  QLoRA (4-bit + LoRA):     ~5GB   (works on RTX 3070 or Colab T4)")

Instruction Fine-Tuning

print("Instruction Fine-Tuning: Teaching Models to Follow Instructions")
print()
print("Base LLMs predict next tokens. They do not know they are assistants.")
print("Prompt: 'Summarize the following text:'")
print("Base model might continue: 'is a common NLP task...' (completing the prompt)")
print()
print("Instruction-tuned models:")
print("Prompt: 'Summarize the following text:'")
print("Response: [actual summary of the text]")
print()
print("Training format (Alpaca-style):")
print()

alpaca_template = """
### Instruction:
{instruction}

### Input:
{input}

### Response:
{response}
"""

examples = [
    {
        "instruction": "Translate the following text to French.",
        "input":       "The weather is beautiful today.",
        "response":    "Le temps est magnifique aujourd'hui."
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input":       "",
        "response":    "def reverse_string(s):\n    return s[::-1]"
    },
    {
        "instruction": "Summarize this paragraph in one sentence.",
        "input":       "Machine learning is a subset of AI...",
        "response":    "Machine learning enables computers to learn from data."
    }
]

for i, ex in enumerate(examples[:2]):
    filled = alpaca_template.format(**ex)
    print(f"Example {i+1}:")
    print(filled)
    print()

print("Key insight: format EVERY training example this way.")
print("The model learns: when I see this format, I should respond appropriately.")
print("This is instruction tuning. All ChatGPT-style models start here.")
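The template above shows the data format, but not how the loss is computed. A common pattern (shown here as a minimal sketch with illustrative names, assuming a causal-LM tokenizer loaded separately, such as the Mistral tokenizer from the QLoRA example) is to mask the prompt tokens so the loss only covers the response:

# Sketch: build (input_ids, labels) for causal-LM instruction tuning.
# Prompt tokens get label -100, the ignore_index of cross-entropy loss,
# so the model is only penalized on the response it should produce.
def build_instruction_example(tokenizer, prompt, response, max_length=512):
    prompt_ids   = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_length]
    labels    = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

# Here, prompt would be the filled Alpaca template up to and including "### Response:",
# and response would be the target text that follows it.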

Fine-Tuning for Sentiment Classification

from datasets import load_dataset, DatasetDict
import evaluate

print("Complete LoRA Fine-Tuning Pipeline:")
print()

# Small, shuffled subsets keep the demo fast.
# (IMDB is sorted by label, so slicing without shuffling would give a single-class subset.)
raw = load_dataset("imdb")
dataset = DatasetDict({
    "train": raw["train"].shuffle(seed=42).select(range(2000)),
    "test":  raw["test"].shuffle(seed=42).select(range(500)),
})

model_name  = "distilbert-base-uncased"
tokenizer   = AutoTokenizer.from_pretrained(model_name)
base_model  = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

lora_cfg = LoraConfig(
    task_type      = TaskType.SEQ_CLS,
    r              = 8,
    lora_alpha     = 16,
    target_modules = ["q_lin", "v_lin"],
    lora_dropout   = 0.05,
)
model_lora = get_peft_model(base_model, lora_cfg)

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True,
                      padding="max_length", max_length=128)

tokenized = dataset.map(tokenize_fn, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds  = np.argmax(logits, axis=-1)
    acc    = acc_metric.compute(predictions=preds, references=labels)
    f1     = f1_metric.compute(predictions=preds, references=labels)
    return {**acc, **f1}

training_args = TrainingArguments(
    output_dir             = "./lora_imdb",
    num_train_epochs       = 3,
    per_device_train_batch_size = 16,
    per_device_eval_batch_size  = 32,
    warmup_ratio           = 0.1,
    weight_decay           = 0.01,
    evaluation_strategy    = "epoch",
    save_strategy          = "epoch",
    load_best_model_at_end = True,
    report_to              = "none",
)

trainer = Trainer(
    model           = model_lora,
    args            = training_args,
    train_dataset   = tokenized["train"],
    eval_dataset    = tokenized["test"],
    compute_metrics = compute_metrics,
)

print(f"Model: {model_name} with LoRA (r=8, alpha=16)")
print(f"Trainable params: {sum(p.numel() for p in model_lora.parameters() if p.requires_grad):,}")
print(f"Dataset: IMDB ({len(tokenized['train'])} train, {len(tokenized['test'])} test)")
print()
print("Run: trainer.train()")
print("Expected results after 3 epochs:")
print("  LoRA accuracy: ~92-93%")
print("  Full fine-tune: ~93-94%")
print("  The gap is tiny. The savings are enormous.")
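Once trainer.train() has run, a quick smoke test confirms the adapted model works end to end. A minimal sketch using the objects defined above (the example sentence is arbitrary):

# Inference check: tokenize one review and read off the predicted class.
model_lora.eval()
device = next(model_lora.parameters()).device

text   = "This movie was surprisingly good. I would happily watch it again."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits = model_lora(**inputs).logits

pred = logits.argmax(dim=-1).item()
print("Prediction:", "positive" if pred == 1 else "negative")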

Saving and Loading LoRA Adapters

print("\nSaving and Loading LoRA Adapters:")
print()
print("After training:")
print("  model_lora.save_pretrained('./my_lora_adapter')")
print("  tokenizer.save_pretrained('./my_lora_adapter')")
print()
print("Loading later:")
print("  base = AutoModelForSequenceClassification.from_pretrained(model_name)")
print("  model = PeftModel.from_pretrained(base, './my_lora_adapter')")
print()
print("What gets saved: ONLY the LoRA adapters (~2MB for r=8)")
print("What does NOT get saved: the frozen base model (~250MB)")
print("When loading: download base from HuggingFace, add your adapters")
print()
print("This is why LoRA is practical for distribution:")
print("  Share 2MB adapter instead of 250MB full model")
print("  Anyone with the same base model can use your adapter")
print("  HuggingFace Hub hosts thousands of free adapters")

Choosing Rank and Alpha

print("LoRA Hyperparameters: What to Set")
print()

rank_guidance = {
    "r=4":   "Minimum. Very few params. Use for simple adaptations.",
    "r=8":   "Standard default. Good balance. Works for most tasks.",
    "r=16":  "More expressive. Good for complex task changes.",
    "r=32":  "High capacity. Use when r=16 underfits.",
    "r=64":  "Very high. Diminishing returns. Rarely needed.",
}

print("Rank selection:")
for r, note in rank_guidance.items():
    print(f"  {r:<8}: {note}")

print()
print("Alpha (lora_alpha):")
print("  Scaling factor. effective_scale = alpha / r")
print("  Common patterns:")
print("    alpha = r:      scale=1.0,  moderate learning")
print("    alpha = 2*r:    scale=2.0,  aggressive learning (most common)")
print("    alpha = 0.5*r:  scale=0.5,  conservative (avoid if underfitting)")
print()
print("Target modules:")
print("  q_proj, v_proj: always a good starting point")
print("  Adding k_proj:  more expressive, small cost")
print("  Adding o_proj:  covers full attention")
print("  Adding mlp:     maximum coverage, highest cost")
print()
print("Start with: r=8, alpha=16, target=[q, v projections]")
print("If performance is poor: increase r to 16, add more target modules")
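The parameter cost of each rank is easy to measure directly. A small sketch using the same imports as the pipeline above (reloading the base model each time so adapters do not stack):

# Count trainable parameters at several ranks, keeping alpha = 2*r.
for r in [4, 8, 16, 32]:
    base = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)
    cfg = LoraConfig(task_type=TaskType.SEQ_CLS, r=r, lora_alpha=2 * r,
                     target_modules=["q_lin", "v_lin"], lora_dropout=0.05)
    peft_m = get_peft_model(base, cfg)
    trainable = sum(p.numel() for p in peft_m.parameters() if p.requires_grad)
    print(f"r={r:<3} trainable params: {trainable:,}")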

A Resource Worth Reading

The original LoRA paper, "LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al. (2021) from Microsoft, is a short, readable paper that explains the mathematics, the motivation, and the empirical results clearly. It shows that LoRA matches full fine-tuning at GPT-3 scale with roughly 10,000x fewer trainable parameters. Search "Hu LoRA low-rank adaptation large language models 2021."

Tim Dettmers co-authored "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA," a blog post that explains QLoRA from a practitioner's perspective with GPU memory calculations and practical recommendations. He is the lead author of the QLoRA paper and the creator of bitsandbytes. Search "QLoRA bitsandbytes 4-bit quantization blog post."


Try This

Create lora_practice.py.

Part 1: LoRA from scratch. Implement a single LoRA-adapted linear layer in PyTorch. Initialize B to zeros, A randomly. Forward pass: add (B @ A) * scaling to the base layer output. Verify that initially the output is identical to the frozen base layer. Train it for 100 steps on a simple regression task. Verify the adapter learns while the base stays frozen.

Part 2: PEFT fine-tuning. Use HuggingFace PEFT to fine-tune distilbert-base-uncased on any text classification dataset. Compare three settings:

  • r=4, alpha=8
  • r=8, alpha=16
  • r=16, alpha=32

Report accuracy for each. Which rank gives the best performance?

Part 3: compare to full fine-tuning. Fine-tune the same model fully (all parameters trainable). Compare accuracy, training time, and memory usage. Is the LoRA accuracy within 2% of full fine-tuning?

Part 4: adapter merging. After LoRA training, merge the adapters back into the base model using model.merge_and_unload(). Compare inference speed before and after merging. (Merged models are faster at inference since they eliminate the adapter forward pass.)


What's Next

Fine-tuning adapts models to tasks. LoRA makes that adaptation cheap. The next post is about embeddings and vector search: storing text as dense vectors and finding semantically similar content at scale. This is the foundation of RAG (Retrieval-Augmented Generation), the most practical LLM deployment pattern.
