Akhilesh

96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop

The smallest GPT-2 has 117M parameters by the original paper's count (about 124M as loaded from Hugging Face). LLaMA-2 starts at 7B. GPT-3 has 175B.

Full fine-tuning means updating every single parameter. For GPT-2 that's manageable. For the 7B LLaMA-2, the gradients alone take 28GB of GPU memory in fp32, before you count weights, optimizer state, or activations. For GPT-3 it's basically impossible without a cluster.

LoRA (Low-Rank Adaptation) solves this. Instead of updating the full weight matrices, it adds tiny trainable modules next to them. The original weights stay frozen. Only the tiny modules train. At the end you merge them back.

You go from needing 8 A100s to needing a consumer GPU. Or sometimes just a CPU.


What You'll Learn Here

  • Why full fine-tuning doesn't scale
  • The math behind LoRA in plain English
  • Rank, alpha, and dropout: what they control
  • Which layers to apply LoRA to
  • Setting up LoRA with HuggingFace PEFT
  • QLoRA: quantization + LoRA for consumer hardware
  • Merging LoRA weights for deployment
  • Comparing LoRA to full fine-tuning

The Problem With Full Fine-Tuning at Scale

# Memory requirements for fine-tuning
def estimate_gpu_memory(n_params_billions, dtype='float32'):
    bytes_per_param = {
        'float32': 4,
        'float16': 2,
        'int8':    1,
        'int4':    0.5
    }

    bpp       = bytes_per_param[dtype]
    model_gb  = n_params_billions * 1e9 * bpp / 1e9

    # For full fine-tuning you also need:
    # - Gradients: same size as weights
    # - Adam optimizer states: 2x weight size
    # - Activations: depends on batch size (rough estimate 2x)
    total_gb = model_gb * (1 + 1 + 2 + 2)   # weights + grads + optimizer + activations

    return model_gb, total_gb

print(f"{'Model':<15} {'Params':<10} {'Weights':<12} {'Full FT Memory'}")
print("-" * 50)
for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    w_gb, total = estimate_gpu_memory(params, 'float32')
    print(f"{name:<15} {params:<10} {w_gb:.1f} GB      {total:.0f} GB")

Output:

Model           Params     Weights      Full FT Memory
--------------------------------------------------
GPT-2           0.117      0.5 GB      3 GB
LLaMA-7B        7          28.0 GB      168 GB
LLaMA-13B       13         52.0 GB      312 GB
GPT-3           175        700.0 GB      4200 GB

LLaMA-7B full fine-tuning needs 168GB of GPU memory. A single A100 has 80GB. You need at least 3 of them for $30,000+.

LoRA changes this dramatically.
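
To see roughly where the savings come from, here is a back-of-the-envelope sketch. It assumes a LLaMA-7B-like shape (32 layers, hidden size 4096), LoRA on two square attention projections per layer, and fp32 adapters trained with Adam. The function name and these defaults are illustrative assumptions, not measurements.

```python
def lora_training_state_gb(n_layers, d_model, rank, targets_per_layer, bytes_per_value=4):
    """Estimate memory for the trainable LoRA state only (frozen weights excluded)."""
    # Each adapted d_model x d_model matrix gets A (rank x d_model) and B (d_model x rank).
    lora_params = n_layers * targets_per_layer * rank * (d_model + d_model)
    # Trainable state = adapter weights + gradients + two Adam moment buffers.
    state_bytes = lora_params * bytes_per_value * (1 + 1 + 2)
    return lora_params, state_bytes / 1e9

# Assumed shape: 32 layers, hidden size 4096, LoRA on Q and V, rank 8.
lora_params, state_gb = lora_training_state_gb(
    n_layers=32, d_model=4096, rank=8, targets_per_layer=2
)
print(f"LoRA params:     {lora_params:,}")
print(f"Trainable state: {state_gb:.3f} GB")
```

The frozen base weights and the activations still have to fit in memory. What shrinks is the gradient and optimizer state, which for full fp32 fine-tuning of a 7B model is about three times the weights (roughly 84 GB by the estimate above).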


How LoRA Works: The Math

A pretrained weight matrix W has shape (d_out, d_in). Full fine-tuning updates W directly:

W_new = W_pretrained + ΔW

ΔW has the same shape as W. That's the problem. It's huge.

LoRA's insight: the update ΔW doesn't need to have full rank. Most meaningful weight changes during fine-tuning lie in a low-dimensional subspace.

Instead of learning ΔW directly, LoRA approximates it as the product of two small matrices:

ΔW ≈ B × A

where:
  A has shape (r, d_in)   - projects down to rank r
  B has shape (d_out, r)  - projects back up to d_out

r << min(d_in, d_out)

During forward pass:

output = x @ W^T + x @ A^T @ B^T × (alpha/r)
       = (pretrained part) + (LoRA part)

W stays frozen. Only A and B train. Total parameters: r * (d_in + d_out) instead of d_in * d_out.
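
As a worked example of that count, assume a square 4096 × 4096 projection (the hidden size of a 7B LLaMA-style model, used here purely for illustration) and r = 8:

```latex
\begin{aligned}
\text{full update: } & d_{in} \cdot d_{out} = 4096 \cdot 4096 = 16{,}777{,}216 \\
\text{LoRA update: } & r\,(d_{in} + d_{out}) = 8 \cdot (4096 + 4096) = 65{,}536 \\
\text{ratio: } & \frac{65{,}536}{16{,}777{,}216} = \frac{1}{256} \approx 0.39\%
\end{aligned}
```

The larger the matrix, the smaller the fraction, because the LoRA count grows linearly with the dimensions while the full count grows quadratically.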

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16, dropout=0.1):
        super().__init__()

        self.original = original_layer
        self.rank     = rank
        self.alpha    = alpha
        self.scaling  = alpha / rank

        # Freeze the original layer
        for param in self.original.parameters():
            param.requires_grad = False

        # LoRA matrices A and B
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Linear(in_features,  rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        self.dropout = nn.Dropout(dropout)

        # Initialize A with Kaiming-uniform noise (the LoRA paper uses a Gaussian) and B with zeros
        # B=0 makes the LoRA branch output zero, so the wrapped layer starts out identical to the original
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        # Original output (frozen)
        original_out = self.original(x)

        # LoRA delta
        lora_out = self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

        return original_out + lora_out

    def parameter_count(self):
        original_params = sum(p.numel() for p in self.original.parameters())
        lora_params     = sum(p.numel() for p in self.lora_A.parameters()) + \
                          sum(p.numel() for p in self.lora_B.parameters())
        return original_params, lora_params

# Test LoRA layer
original_linear = nn.Linear(768, 768)  # typical BERT attention dimension
lora_linear     = LoRALayer(original_linear, rank=8, alpha=16)

original_params, lora_params = lora_linear.parameter_count()
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters:     {lora_params:,}")
print(f"Parameter reduction: {lora_params/original_params:.1%} of original")

x   = torch.randn(2, 10, 768)
out = lora_linear(x)
print(f"\nInput shape:  {x.shape}")
print(f"Output shape: {out.shape}")

Output:

Original parameters: 590,592
LoRA parameters:     12,288
Parameter reduction: 2.1% of original

Input shape:  torch.Size([2, 10, 768])
Output shape: torch.Size([2, 10, 768])

12,288 parameters instead of 590,592. Same output shape. 2.1% of the original.


Rank, Alpha, and What They Control

import pandas as pd

# How rank affects parameter count for a 768x768 matrix
rows = []
for rank in [1, 2, 4, 8, 16, 32, 64]:
    d_in = d_out = 768
    original = d_in * d_out
    lora     = rank * (d_in + d_out)
    rows.append({
        'Rank': rank,
        'LoRA params': lora,
        'Original params': original,
        '% of original': f"{lora/original:.2%}",
        'Reduction factor': f"{original//lora}x"
    })

print(pd.DataFrame(rows).to_string(index=False))

Output:

 Rank  LoRA params  Original params  % of original  Reduction factor
    1         1536           589824           0.26%            384x
    2         3072           589824           0.52%            192x
    4         6144           589824           1.04%             96x
    8        12288           589824           2.08%             48x
   16        24576           589824           4.17%             24x
   32        49152           589824           8.33%             12x
   64        98304           589824          16.67%              6x

Rank (r): how many dimensions to use in the low-rank approximation. Higher rank = more parameters = more expressive but closer to full fine-tuning.

  • r=4 or r=8: most common starting point
  • r=16 to r=32: for harder tasks that need more capacity
  • r=64+: rarely needed; the savings shrink quickly (r=64 is already 16.67% of a 768x768 matrix)

Alpha (α): scaling factor for the LoRA output. Controls how much influence LoRA has relative to the frozen model.

  • Usually set to alpha = rank (scaling = 1.0)
  • Or alpha = 2 * rank (scaling = 2.0, LoRA has more influence)
  • Common: rank=8, alpha=16 (scaling=2)

Dropout: regularization inside LoRA. Typically 0.05 to 0.1.
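
Because the scaling is alpha / r, changing the rank while leaving alpha fixed also changes how strongly the LoRA update is weighted. A small sketch of that interaction (the helper name `lora_scaling` is just for illustration):

```python
def lora_scaling(alpha, rank):
    """Multiplier applied to the LoRA branch: alpha / rank."""
    return alpha / rank

# Fixed alpha, varying rank: the multiplier shrinks as rank grows.
for rank in [4, 8, 16, 32]:
    print(f"alpha=16, r={rank:<2} -> scaling={lora_scaling(16, rank)}")

# Keeping alpha = 2 * rank holds the multiplier at 2.0 for every rank.
for rank in [4, 8, 16, 32]:
    print(f"alpha={2 * rank:<2}, r={rank:<2} -> scaling={lora_scaling(2 * rank, rank)}")
```

If you sweep rank in an experiment, decide up front whether alpha should stay fixed or scale with it. The benchmark later in this post uses lora_alpha=rank*2, which keeps the scaling constant.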


Which Layers to Apply LoRA To

In transformers, the attention mechanism has four weight matrices per layer: Q, K, V, and the output projection. The feed-forward block adds two more (three in gated designs such as LLaMA: gate_proj, up_proj, down_proj).

# Common LoRA target modules for different architectures

lora_targets = {
    'BERT / RoBERTa': {
        'targets': ['query', 'key', 'value', 'dense'],
        'note': "Attention projections; 'dense' also matches the feed-forward layers"
    },
    'GPT-2': {
        'targets': ['c_attn', 'c_proj'],
        'note': 'Fused QKV (c_attn); c_proj matches both the attention and MLP output projections'
    },
    'LLaMA / Mistral': {
        'targets': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'note': 'All attention projections, sometimes gate_proj too'
    },
    'Minimal (fastest)': {
        'targets': ['q_proj', 'v_proj'],
        'note': 'Only query and value, fewer params but often enough'
    }
}

for arch, info in lora_targets.items():
    print(f"\n{arch}:")
    print(f"  Targets: {info['targets']}")
    print(f"  Note:    {info['note']}")

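The original LoRA paper found that, for a fixed parameter budget, adapting Q and V together worked as well as or better than the other combinations of attention matrices it tried. Q and V only (skipping K) is therefore a reasonable starting point, and it uses fewer parameters than adapting all four.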
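
Target names are matched against module names, so a short name can match more layers than you expect. The sketch below mimics suffix matching on a hand-written list of module names for one BERT-style layer. Both the `matches_target` helper and the exact matching rule are simplifying assumptions for illustration; check the documentation of your PEFT version for the real behaviour.

```python
def matches_target(module_name, targets):
    """Assumed rule: a module matches if its name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)

# Linear-layer names in one BERT-style encoder layer (written out by hand).
bert_layer_modules = [
    "encoder.layer.0.attention.self.query",
    "encoder.layer.0.attention.self.key",
    "encoder.layer.0.attention.self.value",
    "encoder.layer.0.attention.output.dense",
    "encoder.layer.0.intermediate.dense",
    "encoder.layer.0.output.dense",
]

for targets in (["query", "value"], ["query", "key", "value", "dense"]):
    hits = [m for m in bert_layer_modules if matches_target(m, targets)]
    print(f"{targets} -> {len(hits)} modules")
    for m in hits:
        print("   ", m)
```

With 'dense' in the list, the two feed-forward layers are adapted along with the attention output projection. On a real model, iterating over model.named_modules() gives you the actual names to check before you pick targets.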


LoRA With HuggingFace PEFT

pip install peft

Then load a base model and wrap it with a LoRA configuration:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = 'roberta-base'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,       # sequence classification
    r=8,                               # rank
    lora_alpha=16,                     # alpha
    lora_dropout=0.1,                  # dropout
    target_modules=['query', 'value'], # apply to Q and V only
    bias='none',                       # don't train biases
    inference_mode=False
)

# Wrap the model with LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()

Output:

trainable params: 629,764 || all params: 125,277,444 || trainable%: 0.5025

0.5% of parameters. Everything else is frozen.

# Training with LoRA is identical to regular fine-tuning
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset
import evaluate
import numpy as np

# Load data
dataset   = load_dataset('imdb')
small_train = dataset['train'].select(range(2000))
small_val   = dataset['test'].select(range(500))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir='./lora_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-4,          # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"LoRA fine-tuning accuracy: {results['eval_accuracy']:.3f}")

Saving and Loading LoRA Weights

LoRA's other big advantage: the saved checkpoint is tiny. You only save the LoRA matrices, not the full model.

from peft import PeftModel

# Save only the LoRA weights
model.save_pretrained('./lora_weights')   # saves adapter_config.json plus the adapter weights (adapter_model.bin, or adapter_model.safetensors in newer PEFT versions)
print("LoRA weights saved")

import os
for f in os.listdir('./lora_weights'):
    size = os.path.getsize(f'./lora_weights/{f}') / 1e6
    print(f"  {f}: {size:.1f} MB")

Output:

LoRA weights saved
  adapter_config.json: 0.0 MB
  adapter_model.bin: 2.4 MB     <- only 2.4 MB instead of 500+ MB!

Loading works in two steps: load the base model, then attach the saved adapter:

# Load: start with base model, then load LoRA adapter
base_model_for_load = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
loaded_lora_model = PeftModel.from_pretrained(base_model_for_load, './lora_weights')
loaded_lora_model.eval()
print("LoRA model loaded successfully")

Merging LoRA for Deployment

After training, you can merge the LoRA weights into the base model. Then you have one clean model with no overhead at inference time.

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Now merged_model is a regular model with no LoRA overhead
print(f"Type after merge: {type(merged_model)}")

# Save the merged model
merged_model.save_pretrained('./merged_model')
tokenizer.save_pretrained('./merged_model')

# Load it like any normal model
from transformers import AutoModelForSequenceClassification
final_model = AutoModelForSequenceClassification.from_pretrained('./merged_model')
print("Merged model loaded as regular model")

# Check: no LoRA parameters, just the full model
n_params = sum(p.numel() for p in final_model.parameters())
print(f"Parameters: {n_params:,}")
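
Merging works because the LoRA branch is linear: adding (alpha/r) · B × A to W gives the same output as running the two branches separately. Here is a dependency-free sketch with tiny random matrices. The sizes, seed, and helper names are arbitrary choices for illustration, and this shows only the arithmetic behind merging, not what `merge_and_unload()` does internally.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def rand_matrix(rows, cols):
    """Matrix of uniform random values in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_in, d_out, r, alpha = 4, 3, 2, 4
scaling = alpha / r

W = rand_matrix(d_out, d_in)   # frozen pretrained weight, shape (d_out, d_in)
A = rand_matrix(r, d_in)       # LoRA down-projection, shape (r, d_in)
B = rand_matrix(d_out, r)      # LoRA up-projection, shape (d_out, r)
x = rand_matrix(2, d_in)       # batch of 2 inputs

# Unmerged forward pass: x @ W^T + (x @ A^T @ B^T) * scaling
base = matmul(x, transpose(W))
lora = matmul(matmul(x, transpose(A)), transpose(B))
unmerged = [[b + scaling * l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(base, lora)]

# Merged weight: W + scaling * (B @ A), followed by a single matmul
BA = matmul(B, A)
W_merged = [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
merged = matmul(x, transpose(W_merged))

max_diff = max(abs(u - m) for u_row, m_row in zip(unmerged, merged)
               for u, m in zip(u_row, m_row))
print(f"max difference between merged and unmerged outputs: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, which is why a merged model needs no extra matrices at inference time.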

QLoRA: 4-bit Quantization + LoRA

QLoRA combines quantization (reducing weight precision to 4-bit) with LoRA. This lets you fine-tune 7B+ models on a single consumer GPU.
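
To get a feel for what 4-bit storage means, here is a deliberately simplified sketch: uniform absmax quantization of one block of weights to 16 integer levels. This is not the NF4 scheme that bitsandbytes uses (NF4 picks non-uniform levels), and the function names and weight values are made up for illustration.

```python
def quantize_4bit(block):
    """Map floats to integers in [-8, 7] using one absmax scale per block."""
    absmax = max(abs(v) for v in block) or 1.0
    scale = absmax / 7
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate floats from 4-bit codes and the block scale."""
    return [c * scale for c in codes]

# Hypothetical block of eight weights.
weights = [0.12, -0.40, 0.03, 0.27, -0.09, 0.35, -0.22, 0.01]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"max error: {max_err:.4f} (rounding bound scale / 2 = {scale / 2:.4f})")
```

Each weight now costs 4 bits plus a shared per-block scale, which is where the roughly 8x saving over fp32 storage comes from. In QLoRA the quantized weights stay frozen and only the LoRA matrices are trained in higher precision.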

pip install bitsandbytes

Then load the model in 4-bit and attach LoRA adapters:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize to 4-bit
    bnb_4bit_quant_type='nf4',            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in fp16
    bnb_4bit_use_double_quant=True        # double quantization (saves more memory)
)

# Load model in 4-bit (much less memory)
model_name = 'gpt2'   # swap with 'meta-llama/Llama-2-7b-hf' if you have access

qlora_base = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto'          # automatically handles multi-GPU or CPU offload
)

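# Disable the KV cache during training (it is only useful for generation)
qlora_base.config.use_cache           = False
# pretraining_tp is a LLaMA-specific setting; it has no effect on GPT-2
qlora_base.config.pretraining_tp      = 1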

# Prepare for LoRA training with quantized model
from peft import prepare_model_for_kbit_training
qlora_base = prepare_model_for_kbit_training(qlora_base)

# Apply LoRA config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn'],  # GPT-2's fused QKV projection; adding 'c_proj' adapts more layers and raises the count below
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

qlora_model = get_peft_model(qlora_base, qlora_config)
qlora_model.print_trainable_parameters()

Output:

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
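
Rough weight-storage figures for a 7B model under each approach (activations add more on top, depending on batch size):

# Memory savings with QLoRA
memory_estimates = {
    'Full fine-tuning (fp32)':     '~28 GB weights + ~84 GB gradients and Adam state',
    'Full fine-tuning (fp16)':     '~14 GB weights, plus gradients and optimizer state',
    'LoRA (fp16 base)':            '~14 GB frozen weights, tiny adapter state',
    'QLoRA (4-bit base + LoRA)':   '~3.5 GB frozen weights, tiny adapter state',
}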

print("Memory requirements for 7B parameter model:")
for method, memory in memory_estimates.items():
    print(f"  {method:<35}: {memory}")

LoRA vs Full Fine-Tuning: Benchmark Comparison

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate, numpy as np, time, torch

model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

dataset    = load_dataset('imdb')
small_train = dataset['train'].select(range(1000))
small_val   = dataset['test'].select(range(300))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

def run_experiment(use_lora, rank=8):
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    if use_lora:
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS, r=rank,
            lora_alpha=rank*2, lora_dropout=0.1,
            target_modules=['q_lin', 'v_lin'], bias='none'
        )
        model = get_peft_model(base, config)
        lr    = 3e-4
    else:
        model = base
        lr    = 2e-5

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())

    args = TrainingArguments(
        output_dir=f'./exp_{"lora" if use_lora else "full"}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy='epoch',  # renamed to eval_strategy in newer transformers releases
        report_to='none',
        logging_steps=999
    )

    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics
    )

    start   = time.time()
    trainer.train()
    elapsed = time.time() - start
    results = trainer.evaluate()

    return {
        'method':     f'LoRA (r={rank})' if use_lora else 'Full fine-tuning',
        'trainable':  f'{trainable:,} ({trainable/total:.1%})',
        'accuracy':   f"{results['eval_accuracy']:.3f}",
        'time_s':     f"{elapsed:.0f}s"
    }

print("Running comparison (this takes a few minutes)...")
results = [
    run_experiment(use_lora=False),
    run_experiment(use_lora=True, rank=4),
    run_experiment(use_lora=True, rank=8),
    run_experiment(use_lora=True, rank=16),
]

print(f"\n{'Method':<20} {'Trainable Params':<25} {'Accuracy':<12} {'Time'}")
print("-" * 70)
for r in results:
    print(f"{r['method']:<20} {r['trainable']:<25} {r['accuracy']:<12} {r['time_s']}")

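Typical output (illustrative; exact parameter counts, accuracy, and timings vary with library versions and hardware):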

Method               Trainable Params          Accuracy     Time
----------------------------------------------------------------------
Full fine-tuning     66,955,010 (100%)         0.934        148s
LoRA (r=4)           147,968 (0.22%)           0.921        102s
LoRA (r=8)           295,168 (0.44%)           0.928        108s
LoRA (r=16)          589,824 (0.88%)           0.931        115s

LoRA with r=8 gets 99.4% of full fine-tuning accuracy with 0.44% of the parameters and 73% of the training time. For larger models, the savings are even more dramatic.
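
The percentages in that summary come straight from the table. A quick check, with the accuracy and timing values copied from the typical output above:

```python
# (accuracy, training seconds) copied from the results table
results = {
    "Full fine-tuning": (0.934, 148),
    "LoRA (r=4)":       (0.921, 102),
    "LoRA (r=8)":       (0.928, 108),
    "LoRA (r=16)":      (0.931, 115),
}

full_acc, full_time = results["Full fine-tuning"]
for method, (acc, seconds) in results.items():
    rel_acc  = acc / full_acc        # accuracy relative to full fine-tuning
    rel_time = seconds / full_time   # training time relative to full fine-tuning
    print(f"{method:<18} accuracy: {rel_acc:.1%} of full   time: {rel_time:.0%} of full")
```

On a run this small (1,000 training examples, 300 validation examples), differences of a few tenths of a percent are within noise, so treat the ranking between ranks as indicative only.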


When to Use LoRA vs Full Fine-Tuning

Use LoRA when:
  - Model is large (> 1B parameters)
  - GPU memory is limited
  - You want to share adapters separately from the base model
  - You want to try many different tasks with one base model
  - Quick iteration is more important than peak accuracy

Use full fine-tuning when:
  - Model is small (< 500M parameters)
  - You have plenty of GPU memory
  - Peak accuracy matters more than speed
  - You only have one task to fine-tune for
  - You'll merge and ship a single final model

Quick Cheat Sheet

| Concept | What it means |
| --- | --- |
| Rank (r) | Dimensions of LoRA matrices. r=8 is a good default. |
| Alpha (α) | Scaling. Set to 2*r or same as r. |
| Target modules | Which weight matrices to apply LoRA to. Start with Q and V. |
| Scaling factor | alpha/rank. Controls LoRA strength. |
| Merge and unload | Bake LoRA into base weights. One clean model for deployment. |
| QLoRA | 4-bit quantization + LoRA. Fine-tune 7B models on a single consumer GPU. |

| Task | Code |
| --- | --- |
| Configure LoRA | LoraConfig(r=8, lora_alpha=16, target_modules=[...]) |
| Apply to model | get_peft_model(base_model, lora_config) |
| Check params | model.print_trainable_parameters() |
| Save adapters | model.save_pretrained('./lora_weights') |
| Load adapters | PeftModel.from_pretrained(base_model, './lora_weights') |
| Merge weights | model.merge_and_unload() |
| QLoRA setup | BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4') |

Practice Challenges

Level 1:
Apply LoRA to distilbert-base-uncased for a 3-class classification task. Use r=4 then r=16. Print the trainable parameter counts for both. Fine-tune each for 2 epochs. Compare accuracy vs parameter count.

Level 2:
Fine-tune the same dataset three ways: full fine-tuning, LoRA with r=8, and frozen backbone (only train the classification head). Plot a bar chart comparing accuracy, training time, and trainable parameter count for all three approaches.

Level 3:
Set up QLoRA with bitsandbytes on any GPT-style model. Verify it loads in 4-bit. Fine-tune on a small instruction dataset for 1 epoch. Generate 5 responses and compare quality to the non-fine-tuned base model. Report GPU memory usage before and after loading.



Next up, Post 97: Embeddings and Vector Search: Semantic Search That Works. How to turn sentences into vectors, find similar content with cosine similarity, and build a semantic search engine with FAISS or ChromaDB.
