Tech_Nuggets

Posted on Jun 9

LoRA and QLoRA fine-tuning: what they actually do under the hood

#finetuning #lora #qlora #llm

You spent three weeks curating a dataset of legal contract summaries: 12,000 pairs of dense legalese and plain-English counterparts. The model you picked -- a 7B parameter instruction-tuned Llama -- understands your prompts but produces summaries that read like a junior associate who memorized Blackstone but never saw a real merger clause. You reach for full fine-tuning, the obvious move. Then torch.cuda.OutOfMemoryError hits at step 20 on your RTX 4090. You try gradient checkpointing. You try a smaller batch. You try half-precision. Still OOM. Your colleague says "just use LoRA" and walks off, as if that explains anything.

This is the gap this post fills. You do not need another high-level "LoRA is a PEFT method" post. You need the math and the trade-offs that let you decide between LoRA, QLoRA, and full fine-tuning for your specific hardware and quality requirements.

Why parameter-efficient fine-tuning exists

The cost of full fine-tuning is straightforward: a model with P parameters requires storing, at minimum, the model weights (2P bytes in fp16), the gradients (2P bytes), and the optimizer states (8P bytes for Adam, which keeps two fp32 moment estimates per parameter). For Llama 3 8B with fp16 parameters, that is roughly 16 GB for weights plus 16 GB for gradients plus 64 GB for optimizer state -- 96 GB total, before counting activations. An RTX 4090 has 24 GB. Even a single 80 GB A100 falls short, which is why full fine-tuning at this scale usually means multiple GPUs or optimizer sharding and offloading.
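The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name and the round 8e9 parameter count are illustrative choices, 1 GB is taken as 1e9 bytes, and activations and framework overhead are ignored.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Static memory for full fine-tuning, in GB (1 GB = 1e9 bytes).

    Counts weights, gradients, and Adam state only; activations are excluded.
    """
    gb = 1e9
    weights = n_params * weight_bytes / gb
    grads = n_params * grad_bytes / gb
    optimizer = n_params * adam_bytes / gb
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


# Assumption: treat Llama 3 8B as exactly 8e9 parameters for round numbers.
breakdown = full_finetune_memory_gb(8e9)
for name, value in breakdown.items():
    print(f"{name:>10}: {value:6.1f} GB")
```

Changing `n_params` is enough to see where a given model stops fitting on a given card, as long as you remember that activations come on top.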

Parameter-efficient fine-tuning (PEFT) avoids this by keeping the vast majority of the model frozen and training only a small set of added parameters. LoRA's working hypothesis is that the weight update during fine-tuning, delta W, has low intrinsic rank, so you can approximate it as a product of two much smaller matrices.

LoRA: low-rank adaptation

The LoRA paper (Hu et al., 2021, arXiv 2106.09685) proposed freezing the pretrained weight matrix W in R^(d x d) and learning a low-rank decomposition:

W' = W + BA

where B in R^(d x r), A in R^(r x d), and r << d (typically r = 8 or r = 16). Instead of updating d^2 parameters per weight matrix, you update 2dr. For d = 4096 (a common hidden dimension) and r = 8, that is 65,536 parameters per adapted matrix instead of 16,777,216 -- a 256x reduction.

During the forward pass, the computation becomes:

h = xW' = xW + xBA

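The first term uses frozen weights: no gradient is computed or stored for W, although gradients still flow through it to reach earlier layers. The second term is the adapter path. Only A and B receive gradient updates. The original W stays intact, so you have two ways to serve an adapter: merge it once by adding BA to W, which leaves inference cost identical to the base model, or keep W shared and compute h = xW + xBA on the fly, which costs two small extra matrix multiplies but lets you swap adapters per request.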
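To make the two serving paths concrete, here is a small sketch in plain Python (no framework) that checks that the on-the-fly adapter path and the merged weight give the same output, and reproduces the parameter count for d = 4096, r = 8. The helper names, the toy sizes, and the random values are all illustrative; a real implementation would use a tensor library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r = 6, 2                      # toy sizes; real layers use d in the thousands

W = rand_matrix(d, d, rng)       # frozen pretrained weight, d x d
B = rand_matrix(d, r, rng)       # trainable, d x r
A = rand_matrix(r, d, rng)       # trainable, r x d
x = rand_matrix(1, d, rng)       # one input as a row vector

# Adapter path computed on the fly: h = xW + (xB)A
h_adapter = add(matmul(x, W), matmul(matmul(x, B), A))

# Merged path: fold BA into the weight once, then do a single matmul
W_merged = add(W, matmul(B, A))
h_merged = matmul(x, W_merged)

max_diff = max(abs(p - q) for p, q in zip(h_adapter[0], h_merged[0]))


def lora_param_counts(d, r):
    """Trainable parameters for one d x d matrix: full update vs. LoRA."""
    return d * d, 2 * d * r


full, lora = lora_param_counts(4096, 8)
print(f"max difference between the two paths: {max_diff:.2e}")
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Note the order of operations in the adapter path: multiplying x by B first keeps every intermediate at width r, so the d x d product BA never has to be materialized unless you choose to merge.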

Here is what the architecture looks like for a single Transformer attention layer:

flowchart LR
    subgraph Forward pass
        X[Input x] --> W[W frozen<br/>d x d]
        X --> B_adapt[B d x r]
        B_adapt --> A_adapt[A r x d]
        W --> ADD[Add]
        A_adapt --> ADD
        ADD --> OUT[Output h]
    end

    subgraph Gradient flow
        OUT --> GRAD_B[Gradients flow<br/>to B and A only]
        GRAD_B --> NO[No weight gradient<br/>stored for W]
    end

By default, LoRA is applied to the query and value projection matrices in each attention block. You can also extend it to the key and output projections and to the feed-forward layers. Empirically, setting r = 8 on Q and V covers most of the benefit; raising r beyond 16 rarely improves results by more than a trivial margin.

QLoRA: adding 4-bit quantization

QLoRA (Dettmers et al., 2023, arXiv 2305.14314) asked: what if instead of storing W in fp16, we stored it in 4 bits and still trained adapters on top? The result is a method that can fine-tune a 65B model on a single 48 GB GPU, where the paper puts regular 16-bit fine-tuning of the same model at more than 780 GB of GPU memory.

QLoRA makes three specific contributions that work together:

NF4 data type. NormalFloat4 is a quantization scheme designed for normally distributed weights. It maps the 4-bit values to the quantiles of a normal distribution, so the discretization error is minimized exactly where most weight values fall. Informally, NF4 allocates more of its 16 representable values around zero and fewer in the tails.

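Double quantization. The quantization constants themselves take space: one fp32 absmax scale per block of 64 weights adds 32/64 = 0.5 bits per parameter. QLoRA quantizes these constants to 8 bits (with a second-level block size of 256), which cuts the overhead to about 0.127 bits and saves roughly 0.37 bits per parameter. The total is ~4.13 bits per parameter for the base model -- about 3.6 GB for a 7B model instead of 14 GB in fp16.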

Paged optimizers. When GPU memory runs out during a long training run, the optimizer states are paged to CPU RAM and fetched back as needed. This prevents the OOM crash but can slow training; it is a safety net, not a performance feature.
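The first two ideas can be sketched with the standard library alone. The code below is a simplified illustration, not the exact NF4 table: it places 16 levels at the midpoints of equal-probability bins of a standard normal and rescales them to [-1, 1], whereas the real NF4 construction is asymmetric so that zero is represented exactly. The function names and the sample block are made up for the example; the block sizes in the overhead arithmetic (64 weights per block, 256 constants per second-level block) are the ones reported in the QLoRA paper.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """2**bits levels at the midpoints of equal-probability bins of N(0, 1),
    rescaled to [-1, 1]. A simplification, not the exact NF4 table."""
    n = 2 ** bits
    dist = NormalDist()
    raw = [dist.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_block(weights, levels):
    """Absmax-scale a block, then store the index of the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look up each code and undo the absmax scaling."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.01, 0.15, -0.30, 0.005, 0.07, -0.12, 0.0]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

# Overhead of the quantization constants, in bits per parameter.
block_size, second_level_block = 64, 256
single_quant_bits = 32 / block_size
double_quant_bits = 8 / block_size + 32 / (block_size * second_level_block)
saving_bits = single_quant_bits - double_quant_bits

print("gap between levels near zero:", round(levels[8] - levels[7], 3))
print("gap between levels in the tail:", round(levels[15] - levels[14], 3))
print("constant overhead without / with double quantization:",
      single_quant_bits, round(double_quant_bits, 3))
```

The two printed gaps show the point of a quantile-based grid: levels are packed tightly around zero, where most weights sit, and spread out in the tails.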

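During training, QLoRA dequantizes the 4-bit weights to the 16-bit compute dtype on the fly whenever a layer is used, in both the forward and the backward pass, adds the LoRA adapter contribution, and updates only the low-rank matrices. Gradients still flow through the dequantized weights to reach adapters in earlier layers, but no gradient or optimizer state is ever stored for W. That part of the saving is shared with plain LoRA; what QLoRA adds is the roughly 4x smaller storage of the frozen weights.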

Full comparison

Dimension	Full fine-tuning	LoRA (fp16)	QLoRA (4-bit base + LoRA)
Base model memory (7B)	14 GB (fp16)	14 GB (fp16, frozen)	~3.6 GB (NF4)
Gradient memory	14 GB	< 0.1 GB (adapters only)	< 0.1 GB (adapters only)
Adapter memory (r=8, all linear layers, Llama-style 7B)	0	~0.04 GB (about 20M parameters)	~0.04 GB
Optimizer state (Adam)	~56 GB	~0.2 GB (adapters only)	~0.2 GB
Total static VRAM (excluding activations)	~84 GB	~14.3 GB	~3.9 GB
Quality vs. full fine-tuning	Baseline	On par or within 0.5%	Within 1-2% on most benchmarks
Multi-task support	One copy per task	One base + N adapters	One base + N adapters
Training speed (7B, A100)	1.0x (baseline)	~1.4x	~0.8x (dequantization overhead)

The speed trade-off is worth calling out explicitly: QLoRA trains slower than LoRA because the base weights must be dequantized every time a layer is used. On a 7B model with a single A100, LoRA is roughly 1.4x faster than full fine-tuning (no weight gradients or optimizer step for the frozen weights), while QLoRA runs at about 0.8x the speed of full fine-tuning (dequantization overhead). The memory savings are enormous though, which is why QLoRA dominates the conversation for consumer-grade GPUs.
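As a rough cross-check on the memory side of the comparison, the sketch below estimates static memory (weights, gradients, Adam state) for the three regimes from first principles. Everything in it is an assumption made for illustration: a Llama-style 7B shape (32 blocks, hidden size 4096, feed-forward size 11008, square attention projections), rank 8 on every linear layer, fp16 adapters with 8 bytes of Adam state per trainable parameter, and about 4.127 bits per base parameter for NF4 with double quantization. Activations are excluded, so real usage is higher and grows with batch size and sequence length.

```python
GB = 1e9


def lora_params(n_layers=32, d=4096, d_ffn=11008, r=8):
    """Adapter parameters with rank r on every linear layer of each block.

    Each adapted (d_in x d_out) matrix adds r * (d_in + d_out) parameters.
    """
    attention = 4 * r * (d + d)          # q, k, v, o projections
    feed_forward = 3 * r * (d + d_ffn)   # gate, up, down projections
    return n_layers * (attention + feed_forward)


def static_memory_gb(n_params, method, adapter_params=0):
    """Weights + gradients + Adam state in GB; activations are excluded."""
    if method == "full":
        return n_params * (2 + 2 + 8) / GB
    base_bits = 16 if method == "lora" else 4.127   # NF4 + double quantization
    base = n_params * base_bits / 8 / GB
    adapters = adapter_params * (2 + 2 + 8) / GB     # fp16 weights, grads, Adam
    return base + adapters


n = 7e9
adapters = lora_params()
estimates = {m: static_memory_gb(n, m, adapters) for m in ("full", "lora", "qlora")}
print(f"adapter parameters: {adapters:,}")
for method, gb in estimates.items():
    print(f"{method:>5}: {gb:5.1f} GB")
```

The takeaway is that the adapters themselves are nearly free; what separates the three regimes is how the base weights are stored and whether they carry gradients and optimizer state.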

Common pitfalls

Rank selection is not magic. Setting r = 256 everywhere will not automatically improve results. Higher rank means more trainable parameters, more adapter memory, and more capacity to overfit a small dataset. The original LoRA paper found that a rank as low as 1 on the query and value projections was already competitive on several of its benchmarks. Start with r = 8 on Q and V, evaluate, and only increase rank if the model underfits.

Adapter merge at scale. You can merge LoRA weights into W ahead of serving by computing W' = W + BA for each adapted matrix (with BA multiplied by the alpha / r scaling factor described below) and discarding A and B. This eliminates the adapter inference overhead. But if you have 50 adapters for 50 different clients, you now need 50 copies of the full weights -- trading compute for storage. The right design depends on which resource you have more of.

QLoRA is not free. Quantizing the base weights to NF4 introduces rounding error that dequantization cannot undo. On most tasks the quality loss is within the noise floor (1-2% on MMLU, roughly 0.5% on domain-specific benchmarks). But if you are tuning a model for a precision-critical task such as medical diagnosis or code correctness verification, the trade-off may swing back to full-precision LoRA or full fine-tuning.

Bitsandbytes versions matter. QLoRA depends on the bitsandbytes library for its CUDA quantization kernels. As of June 2026, bitsandbytes is at v0.49.2 and PEFT is at v0.19.1. The API changed between v0.43 and v0.44 -- if you are using an older PEFT, pin to a compatible bitsandbytes version. A version mismatch silently falls back to CPU quantization, which runs orders of magnitude slower.

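Scaling the LoRA alpha. The adapter update is multiplied by alpha / r, so this ratio controls the magnitude of the adapter's contribution. A common mistake is setting alpha too low (adapter contribution vanishes) or too high (training destabilizes). A widely used starting point is alpha = 2r; the paper itself simply sets alpha to the first r it tries and does not tune it further, noting that tuning alpha is roughly equivalent to tuning the learning rate. Double-check this if your loss curve looks flat after 200 steps.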
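A small sketch of why alpha and r should be considered together. The function name and the specific values are illustrative; the only fact relied on is that the adapter path is multiplied by alpha / r.

```python
def lora_scale(alpha, r):
    """Multiplier applied to the adapter path: h = xW + (alpha / r) * xBA."""
    return alpha / r


ranks = (4, 8, 16, 32, 64)

# Holding alpha fixed while raising r shrinks the multiplier...
fixed_alpha = {r: lora_scale(16, r) for r in ranks}

# ...while tying alpha to r (here alpha = 2r) keeps it constant.
tied_alpha = {r: lora_scale(2 * r, r) for r in ranks}

print("alpha = 16:", fixed_alpha)
print("alpha = 2r:", tied_alpha)
```

If you sweep rank with alpha left at a fixed value, part of any change you observe comes from the changing multiplier rather than from the extra capacity, so it helps to decide up front which of the two you intend to vary.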

When NOT to use it

LoRA and QLoRA are the wrong choice when:

You need to change the model's internal representations fundamentally. If you are adding new knowledge that the base model does not have (a new language, a new domain with very different token statistics), low-rank updates may not have enough capacity. Continued pretraining or full fine-tuning will capture the distribution shift more effectively.

Inference latency is your binding constraint and you serve from CPU. LoRA merges into the weights easily on GPU, but on CPU with on-the-fly adapter computation, the extra matrix multiply for BA adds latency. You can merge ahead of time, but then every adapter becomes a separate weight file.

You are fine-tuning a model smaller than 1B parameters. The memory savings of PEFT are less dramatic on small models. A 350M-parameter model takes roughly 0.7 GB in fp16, and about 4.2 GB once gradients and Adam state are added, so full fine-tuning already fits on a consumer GPU with gradient checkpointing and a reasonable batch size. At that size the extra moving parts of adapters buy you little.

You need deterministic training across hardware. The quantization paths in QLoRA introduce non-determinism from the dequantization kernel. If you need perfectly reproducible training runs (for auditing or compliance), stick with full-precision LoRA or full fine-tuning with a fixed seed and deterministic CUDA backend.

TL;DR

LoRA approximates the fine-tuning weight update as a product of two low-rank matrices (B in d x r, A in r x d), reducing trainable parameters by 100x-1000x per layer with minimal quality loss.
QLoRA quantizes the frozen base model to 4-bit NF4, then trains LoRA adapters on top. A 65B model fits on a single 48 GB GPU.
The practical memory equation for a 7B model, counting weights, gradients, and Adam state but not activations: full fine-tuning ~84 GB, LoRA ~14 GB, QLoRA ~4 GB.
Start with r = 8 on Q and V projection layers. Increase rank only if you see clear underfitting on your validation set.
QLoRA trains slower than LoRA (dequantization overhead) but stores the frozen base weights in roughly a quarter of the memory. Pick based on whether you are memory-bound or time-bound.
Keep bitsandbytes and PEFT versions in sync. A version mismatch causes silent CPU fallback and catastrophic slowdown.
Do not use LoRA/QLoRA for small models (under 1B), for injecting fundamentally new knowledge, or for CPU-latency-sensitive serving where merge-ahead is impractical.

We covered how to adapt an existing model efficiently. The next step is knowing when that adaptation has actually worked -- and that means evaluation. Next post: building a reliable evaluation pipeline that catches regressions before they ship, with or without a labeled test set.

If you are deciding between LoRA and QLoRA for a project right now, the key variable is your GPU budget. 24 GB or less? QLoRA. 48 GB or more? 16-bit LoRA with a larger rank; keep full fine-tuning for when you have multi-GPU memory to spare. The code to make either choice work is a single pip install away.
