Table of Contents
- Why study LoRA? The challenge of fine-tuning massive models
- Conceptual overview: Full Fine-Tuning
- Link back to the Math Equation on Full Fine-tuning
- Conceptual Overview: Low-Rank Adaptation (LoRA)
- Mathematical Walk-through of Forward Pass with LoRA
- Mathematical Walk-through of the Backward Pass
- Link Back to the Math Equation on LoRA
- LoRA in the Self-Attention Module
- Appendix A: The "No Additional Inference Latency" Trick
- Appendix B: Why Does the Low-Rank Hypothesis Make Sense?
Why study LoRA? The challenge of fine-tuning massive models
Large Language Models (LLMs) present a significant challenge for training and deployment because of their sheer size. For example, a model like GPT-3 has 175 billion parameters, consumes roughly 350GB of storage in FP16, and can require up to 1.2TB of VRAM during training. (Note: I use GPT-3 as an example throughout this article to stay close to the original paper on LoRA.)
Fine-tuning is a way to adapt these pre-trained models to specific downstream tasks. Full fine-tuning involves updating all of the model's weights, which means that for every new task, you create a new, massive version of the model. Imagine a company wanting to offer 100 different specialized models to its customers. With full fine-tuning, this would require storing 100 separate 350GB models, consuming 35 terabytes of storage. This could make it difficult to deploy and manage customized LLMs at scale.
Low-Rank Adaptation (LoRA), introduced in the paper "LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al., allows for the adaptation of massive models with a tiny fraction of the trainable parameters, significantly reducing storage costs and training overhead while maintaining satisfactory performance.
This article provides an intuitive walkthrough of how LoRA achieves this. This article was written with the assistance of Google Gemini 2.5 Pro.
Conceptual overview: Full Fine-Tuning
To understand the innovation of LoRA, we first need to look at the standard fine-tuning process. A neural network, at its core, is composed of layers, many of which perform matrix multiplication using weight matrices.
Let's imagine a single weight matrix in a pre-trained model, which we'll call W_0. This matrix might have thousands of rows and columns.
When we fine-tune this model on a new task, we update these weights based on the new data. The process learns a "delta" or change matrix, ΔW, which is added to the original weights.
The new, fine-tuned weight matrix is: W_0 + ΔW
The crucial point here is that the change matrix ΔW has the exact same dimensions as the original matrix W_0. If W_0 has 100 million parameters, then we are training a ΔW that also has 100 million parameters. To save the fine-tuned model, we must save the entire ΔW matrix, which is just as large as the original.
A Simple Example
Imagine our pre-trained weight matrix is a simple 2x3 matrix:
After fine-tuning, we might learn the following update matrix:
The new, fully fine-tuned weight matrix would be:
To make this change, we had to train and store 6 new values for ΔW. For a model like GPT-3, this means training and storing 175 billion new values for every single task.
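To make the storage cost concrete, here is a minimal NumPy sketch of full fine-tuning at the single-matrix level. The numeric values are placeholders chosen for illustration; they are not the example's original numbers.

```python
import numpy as np

# A minimal sketch of full fine-tuning for a single weight matrix.
# Values are illustrative placeholders, not the article's original numbers.
W0 = np.array([[0.5, 0.1, -0.3],
               [0.2, 0.4,  0.8]])          # pre-trained 2x3 weight matrix

delta_W = np.array([[0.01, -0.02, 0.03],
                    [0.00,  0.05, -0.01]])  # learned update, same 2x3 shape

W_new = W0 + delta_W                         # fine-tuned weights

# The update we must store has exactly as many entries as W0 itself.
assert delta_W.size == W0.size == 6
```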
Link back to the Math Equation on Full Fine-tuning
The paper provides the equation below to describe the standard process of full fine-tuning:
max_Φ Σ_{(x,y)∈Z} Σ_{t=1}^{|y|} log P_Φ(y_t | x, y_{<t})
- What it means: This formula states that we want to find the best possible set of model parameters, denoted by Φ, that maximizes the probability of generating the correct output sequences (y) given the input sequences (x) and all the preceding correct tokens. We are optimizing over the entire set of parameters in the model.
Let's dissect each component:
- max_Φ: This means our goal is to maximize the following expression by changing the model's parameters, denoted by the set Φ. In full fine-tuning, Φ represents every single weight and bias in the entire model. For GPT-3, this is a set of ~175 billion parameters.
- Σ_{(x,y)∈Z}: This tells us to sum the results over our entire training dataset Z, which is a set of context-target pairs (x, y). For example, in a summarization task, x would be a long article and y would be its short summary.
- Σ_{t=1}^{|y|}: This accounts for the autoregressive nature of language models. For each target sequence y, we sum over every single token from the beginning (t = 1) to the end (|y|). The model tries to predict each token correctly, one by one.
- log: We use the logarithm of the probability. This is a standard technique that makes the math more stable and turns a long product of probabilities into a more manageable sum (the "log-likelihood"). Maximizing the log-probability is the same as maximizing the probability itself.
- P_Φ(y_t | x, y_{<t}): This represents the probability assigned by the model (with its current parameters Φ) to the correct next token y_t, given the input context x and all the preceding correct tokens y_{<t}.
In simple terms, this equation says: adjust all 175 billion parameters (Φ) to make the model as good as possible at predicting the next correct word in the sequence for all the examples in our training data. The model starts with pre-trained weights Φ₀ and learns an update ΔΦ, resulting in final weights Φ₀ + ΔΦ. The problem is that ΔΦ is just as large as Φ₀.
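For readers who prefer code, the structure of this objective (a sum over training pairs and over target positions of log-probabilities) can be sketched in a few lines of Python. The probabilities below are invented toy numbers standing in for P_Φ(y_t | x, y_{<t}); a real model would compute them with its forward pass.

```python
import numpy as np

# Toy stand-in for P_Phi(y_t | x, y_<t): the probability the model assigns
# to each correct next token of each training pair. Numbers are invented.
per_token_probs = {
    ("article_1", "summary_1"): [0.9, 0.7, 0.8],   # target with |y| = 3 tokens
    ("article_2", "summary_2"): [0.6, 0.95],       # target with |y| = 2 tokens
}

# The objective: sum log-probabilities over the dataset Z and over every
# target position t. Full fine-tuning adjusts all of Phi to maximize this.
objective = sum(
    np.log(p)
    for probs in per_token_probs.values()
    for p in probs
)
print(objective)
```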
- Link to this article: This equation is the mathematical representation of the concept described in the section "Conceptual overview: Full Fine-Tuning".
- The initial pre-trained weights, Φ₀, correspond to our simple example matrix W_0.
- The final, optimized parameters, Φ, are equivalent to our fine-tuned matrix W_0 + ΔW.
- The update, ΔΦ, which is what we learn during training, corresponds to ΔW.
- The paper's key point that |ΔΦ| equals |Φ₀| is exactly what we illustrated: the update matrix ΔW has the same large dimensions as the original matrix W_0, making it expensive to train and store.
Conceptual Overview: Low-Rank Adaptation (LoRA)
LoRA is built on a key insight: the update matrix ΔW does not need to have full rank to be effective. The authors hypothesize that the change in weights during adaptation has a low "intrinsic rank". This means that the large ΔW matrix can be approximated with high fidelity by multiplying two much smaller matrices.
Instead of learning ΔW directly, LoRA learns two smaller matrices, which we'll call A and B.
This is a low-rank decomposition. The "rank" (r) is a small number we choose (like 1, 2, 8, or 64) that determines the inner dimension of these thin matrices.
- If W_0 is a d × k matrix, then ΔW is also d × k.
- With LoRA, matrix A will have dimensions r × k, and matrix B will have dimensions d × r, so their product B * A has the same d × k shape as ΔW.
The number of trainable parameters is now the sum of the parameters in A and B (d × r + r × k = r × (d + k)), which is dramatically smaller than the d × k parameters in ΔW, especially when r is much smaller than d and k.
Crucially, during training with LoRA, the original weights W_0 are frozen and do not receive gradient updates. We only train the much smaller A and B matrices.
Figure 1 from the paper "LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al. The pretrained weights W are frozen. Only the low-rank matrices A and B are trained.
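To make the "frozen W, trainable A and B" idea concrete, here is a minimal PyTorch-style sketch of a LoRA-adapted linear layer. The class and variable names are my own, not from the paper's code release; the initialization (B at zero, A with small random values) follows the scheme described in the paper, so the update B * A starts at zero.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (names are illustrative)."""
    def __init__(self, d_out, d_in, r):
        super().__init__()
        # Frozen pre-trained weight W0 (d_out x d_in): no gradients are computed for it.
        self.W0 = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors: B (d_out x r) starts at zero, A (r x d_in)
        # starts with small random values, so B @ A is zero before training.
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)

    def forward(self, x):
        # h = W0 x + B A x  (the frozen path plus the low-rank update path)
        return x @ self.W0.T + x @ self.A.T @ self.B.T

layer = LoRALinear(d_out=2, d_in=3, r=1)
h = layer(torch.randn(5, 3))   # a batch of 5 inputs with dimension 3
print(h.shape)                 # torch.Size([5, 2])
```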
The Same Example with LoRA
Let's use our 2x3 weight matrix W_0 again.
Here, d = 2 and k = 3. Let's choose a tiny rank, r = 1.
- Matrix B will be 2 × 1.
- Matrix A will be 1 × 3.
Suppose after training, we learn the following A and B:
Now, let's compute our low-rank approximation of ΔW by multiplying them: ΔW ≈ B * A.
The key comparison is the number of parameters we had to train:
- Full Fine-Tuning: The ΔW matrix had 2 × 3 = 6 parameters.
- LoRA (r=1): We trained a 2 × 1 matrix B and a 1 × 3 matrix A. The total is 2 + 3 = 5 parameters.
While the savings seem small here, for a large matrix in a real model, the difference is significant. The paper notes that for GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times (from 175B to about 17M). This means the storage for each new task drops from 350GB to just a few dozen megabytes.
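A quick way to see how this scales is to compute the parameter counts directly. The helper below compares the d × k parameters of a full update against the d × r + r × k parameters of a LoRA update; the 12288 × 12288 case uses GPT-3's hidden dimension purely as an illustrative shape.

```python
# Parameter counts for a single d x k weight matrix adapted with rank r.
def lora_params(d, k, r):
    full = d * k            # size of delta_W in full fine-tuning
    lora = d * r + r * k    # size of B (d x r) plus A (r x k)
    return full, lora

print(lora_params(d=2, k=3, r=1))          # (6, 5)  -- the toy example above
print(lora_params(d=12288, k=12288, r=8))  # a GPT-3-sized matrix: ~151M vs ~197K
```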
Mathematical Walk-through of Forward Pass with LoRA
The standard forward pass for a layer is h = W * x, where h is the output, W is the weight matrix, and x is the input.
With LoRA, the output of the frozen pre-trained weights is computed as usual, and the output of the LoRA matrices is added to it.
The modified forward pass is: h = W_0 * x + ΔW * x
This can be written as h = W_0 * x + B * A * x. Let's use our example matrices to walk through the calculation.
1. Define Inputs
- Pre-trained weights W_0:
- Trained LoRA matrices A and B (with r = 1):
- An input vector x:
2. Calculate the original path
First, we compute the output from the frozen, pre-trained weights: W_0 * x.
3. Calculate the LoRA path
Next, we compute the update from our trained LoRA matrices. It is more efficient to multiply A * x first.
Now multiply that result by B:
4. Combine the outputs
Finally, we add the two results together to get the final output h = W_0 * x + B * (A * x).
This process, keeping W_0 frozen and only passing gradients through the B and A path, is how LoRA achieves its remarkable parameter efficiency during training.
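Here is the same forward pass as a short NumPy sketch. The shapes match the running example (d = 2, k = 3, r = 1), but the numeric values are placeholders chosen for illustration rather than the article's original matrices.

```python
import numpy as np

W0 = np.array([[0.5, 0.1, -0.3],
               [0.2, 0.4,  0.8]])   # frozen pre-trained weights, 2x3
B  = np.array([[0.4],
               [-0.2]])             # trainable LoRA matrix, 2x1
A  = np.array([[0.1, 0.3, 0.5]])    # trainable LoRA matrix, 1x3
x  = np.array([[1.0],
               [2.0],
               [0.5]])              # input vector, 3x1

h_frozen = W0 @ x                   # original path
h_lora   = B @ (A @ x)              # LoRA path: compute A @ x (a 1x1 result) first
h        = h_frozen + h_lora        # combined output, 2x1
print(h)
```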
Mathematical Walk-through of the Backward Pass
The "backward pass," or backpropagation, is the core mechanism by which a neural network learns. Its goal is to calculate how much each trainable parameter in the model contributed to the final error (or "loss"). Once we know this, we can adjust the parameters slightly to reduce that error.
In LoRA, the key efficiency gain comes from the fact that we only need to calculate these adjustments for the tiny A and B matrices. The massive pre-trained weight matrix, W_0, is frozen, so we can completely ignore it during the backward pass, saving immense amounts of computation and memory.
Let's walk through how the gradients for A and B are calculated.
1. The Setup
First, let's recall the forward pass equation: h = W_0 * x + B * A * x
The backward pass starts with a gradient signal coming from the next layer of the network. This signal, which we'll call grad_h, tells us how the final loss would change with respect to a small change in our output, h. Mathematically, this is the derivative dL/dh. Our task is to use this incoming gradient to figure out grad_A and grad_B.
We will use the same matrices and input vector from our forward pass example:
Let's assume the incoming gradient grad_h is:
2. The Chain Rule in Action
The gradient for the LoRA path B * A * x is the same as the gradient for the total output h, because W_0 * x is treated as a constant. The gradient flows back only through the parts of the computation that involve our trainable parameters.
3. Calculating the Gradient for B (grad_B)
To find how B affects the loss, we use the chain rule. The gradient of the loss with respect to B is found by multiplying the gradient of the output (grad_h) by how B affects the output.
The update term is B * (A * x). The derivative of this with respect to B involves the term it was multiplied by, which is A * x.
The formula is: grad_B = grad_h * (A * x)^T
Let's calculate this:
- First, we need the term A * x. We already computed this in the forward pass.
- The transpose, (A * x)^T, is that same single value, since a 1 × 1 matrix is its own transpose.
- Now we multiply: grad_B = grad_h * (A * x)^T.
The resulting grad_B matrix, which has the same shape as B, tells us how to adjust each element in B to reduce the loss.
4. Calculating the Gradient for A (grad_A)
Similarly, we find how A affects the loss. This time, the gradient has to pass back through B first.
The formula is: grad_A = B^T * grad_h * x^T
Let's calculate this step-by-step:
- First, we need the transpose of B: B^T.
- Next, we multiply B^T by grad_h.
- Finally, we multiply this result by the transpose of the input, x^T.
The resulting grad_A matrix, which has the same shape as A, tells us how to adjust A.
5. Updating the Weights
After calculating the gradients, the optimizer performs a weight update. Using a simple learning rate (lr), the update rule is:
B_new = B - lr * grad_B
A_new = A - lr * grad_A
And that's it! The crucial part is that W_0 is never updated. It is not part of the gradient calculation and requires no memory to store its gradients or optimizer states (like momentum). This is the source of LoRA's efficiency during the training process.
Link Back to the Math Equation on LoRA
The equation below from the paper describes the optimization objective of LoRA:
max_Θ Σ_{(x,y)∈Z} Σ_{t=1}^{|y|} log p_{Φ₀ + ΔΦ(Θ)}(y_t | x, y_{<t})
The key changes in this equation, compared to the previous equation for full fine-tuning, are:
- max_Θ: This is the most important change. Instead of optimizing over the massive set of parameters Φ, we are now optimizing over a much smaller set of parameters, denoted by Θ. As the paper notes, the size of Θ can be as small as 0.01% of the size of the original parameters (Φ₀). In our LoRA explainer, Θ represents the collection of all the trainable values in our small A and B matrices.
- Φ₀ + ΔΦ(Θ): This shows how the model's weights are constructed.
  - Φ₀: These are the original, pre-trained weights of the large model. They are treated as a fixed constant and are not trained. This corresponds to our frozen W_0 matrix.
  - ΔΦ(Θ): This represents the weight update, but it is no longer a huge matrix of trainable parameters. Instead, it is a function that generates the large update matrix from the small set of parameters Θ. For LoRA, this function is the matrix multiplication of our small matrices: ΔW = B * A. The parameters Θ are the entries of B and A.
- What it means:
  - The optimization is no longer over the massive parameter set Φ, but over a much smaller set of parameters denoted by Θ. The paper states |Θ| ≪ |Φ₀|.
  - The update to the weights, ΔΦ, is now a function of this small parameter set, written as ΔΦ(Θ). The original weights Φ₀ remain frozen.
- Link to this article: This equation is the mathematical foundation for the section "Conceptual Overview: Low-Rank Adaptation (LoRA)".
  - The small, trainable parameter set Θ represents the collection of all the elements in our low-rank matrices, A and B.
  - The function that generates the large update, ΔΦ(Θ), corresponds directly to the matrix multiplication B * A. This operation takes the small number of parameters in A and B and uses them to produce the full-sized update matrix ΔW.
  - The term Φ₀ + ΔΦ(Θ) in the equation is precisely what we illustrated in the forward pass: h = W_0 * x + B * A * x.
LoRA in the Self-Attention Module
The self-attention mechanism is a cornerstone of the Transformer architecture. For each input token, it computes Query (Q), Key (K), and Value (V) vectors. These are generated by multiplying the input embedding (x) with three distinct weight matrices: W_q, W_k, and W_v. After the attention scores are calculated and applied to the Value vectors, the result is passed through a final output projection matrix, W_o, to produce the layer's output.
While LoRA can be applied to any weight matrix, the paper's authors focus their study on the weight matrices in the self-attention module (W_q, W_k, W_v, W_o). They find that for maximum parameter efficiency, it is often sufficient to adapt only a subset of these matrices. For example, adapting only the query (W_q) and value (W_v) matrices can yield strong performance. For the purpose of a complete illustration, we will describe the process as if LoRA is applied to all four:
- For W_q, we add B_q * A_q
- For W_k, we add B_k * A_k
- For W_v, we add B_v * A_v
- For W_o, we add B_o * A_o
All eight of these LoRA matrices (A_q, B_q, A_k, B_k, etc.) are the only parameters that are trained. The original W matrices remain frozen. The calculations for each of the four paths are independent and can be performed in parallel, as the sketch below illustrates.
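Below is a minimal NumPy sketch of applying independent LoRA pairs to the query, key, and value projections. The dictionary layout and random placeholder values are my own illustration; W_o (shaped d_model × d_head) would be handled the same way with its own (B_o, A_o) pair.

```python
import numpy as np

d_model, d_head, r = 4, 3, 1
rng = np.random.default_rng(0)

# Frozen projection matrices for query, key, and value (random placeholder values).
W = {name: rng.normal(size=(d_head, d_model)) for name in ("q", "k", "v")}
# One trainable (B, A) pair per adapted projection; B starts at zero so the
# update B @ A is zero before training begins.
B = {name: np.zeros((d_head, r)) for name in W}
A = {name: rng.normal(size=(r, d_model)) * 0.01 for name in W}

def project(name, x):
    # Frozen path plus the independent low-rank update path for one projection.
    return W[name] @ x + B[name] @ (A[name] @ x)

x = rng.normal(size=(d_model, 1))                        # one token embedding
q, k, v = (project(name, x) for name in ("q", "k", "v"))
# W_o would get its own (B_o, A_o) pair and the same frozen-plus-update forward pass.
```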
Mathematical Walk-through of the Forward Pass
To keep the example clear, we will focus on the generation of a single Query vector. The exact same logic applies simultaneously to the Key and Value vectors.
1. Define Inputs
Let's assume our model has an embedding dimension (d_model) of 4, and the dimension of the Query/Key vectors (d_q or d_k) is 3.
- Input Embedding (x), for a single token:
- Pre-trained Query Weight Matrix (W_q):
- Trainable LoRA Matrices for Query (A_q, B_q), with rank r=1:
2. Calculate the Original Path (Frozen)
First, we compute the output from the pre-trained W_q matrix: W_q * x.
3. Calculate the LoRA Path (Trainable)
Next, we compute the update from our LoRA matrices.
4. Combine for the Final Query Vector
Finally, we add the outputs of the two paths.
This q_final is the Query vector that will be used in the attention score calculation.
5. The Bigger Picture
This exact process happens in parallel for W_k (with A_k, B_k) and W_v (with A_v, B_v) to produce k_final and v_final. Those three vectors then proceed to the standard attention calculation. The final output of the attention mechanism is then passed through the W_o layer, which has its own LoRA update (B_o * A_o).
Mathematical Walk-through of the Backward Pass
During backpropagation, the gradients flow backward from the loss function. The attention mechanism will provide an incoming gradient for q_final, k_final, and v_final. We'll demonstrate the process using the incoming gradient for our Query vector, grad_q.
1. The Setup
Let's assume the incoming gradient from the rest of the network for our Query vector is:
Our goal is to use grad_q to calculate the gradients for A_q and B_q. The gradient does not flow back into W_q.
2. Calculating the Gradient for B_q
The gradient for B_q is found by multiplying the incoming gradient grad_q by the term that B_q was multiplied by in the forward pass: (A_q * x).
We already know from the forward pass that A_q * x = [0.95]. Multiplying grad_q by its transpose gives grad_B_q = grad_q * (A_q * x)^T. This is the adjustment signal for the B_q matrix.
3. Calculating the Gradient for A_q
The gradient for A_q requires the gradient to pass back through B_q first.
First, let's compute B_q^T * grad_q. Now, multiply this result by x^T to obtain grad_A_q = B_q^T * grad_q * x^T. This is the adjustment signal for the A_q matrix.
4. Updating All LoRA Weights
The optimizer will use these computed gradients (grad_B_q, grad_A_q) to update the weights of B_q and A_q.
Crucially, this same backward pass logic is applied independently and simultaneously for the other LoRA pairs. The incoming gradient grad_k is used to update A_k and B_k, grad_v is used to update A_v and B_v, and the gradient from the next layer is used to update A_o and B_o. No gradients are computed for the massive W_q, W_k, W_v, and W_o matrices themselves, leading to large savings in time and memory.
Appendix A: The "No Additional Inference Latency" Trick
A common drawback of other adaptation methods (like adding "adapter" layers) is that they introduce extra layers or computations that permanently increase the model's inference latency. LoRA cleverly avoids this.
During training, the forward pass involves two paths, as we saw: h = W_0 * x + B * A * x. This does add a small amount of extra computation.
However, once training is complete, we can prepare the model for deployment (inference). Since matrices A and B are now fixed, we can perform their matrix multiplication once to get our final update matrix ΔW = B * A.
Then, we can add this final update matrix directly to the original pre-trained weights to create a new, fully merged weight matrix, W = W_0 + B * A.
Using our example numbers:
When the model is deployed, we just use this merged matrix. The forward pass becomes h = W * x, which is a single matrix multiplication. This has the exact same computational cost and latency as the original, non-fine-tuned model.
This also makes switching between tasks incredibly fast. As the paper notes, to switch from Task 1 (with matrices B_1, A_1) to Task 2 (with B_2, A_2), you can subtract B_1 * A_1 from the merged weights and add B_2 * A_2, which is a very fast operation compared to loading an entirely new 350GB model from disk.
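The merge-and-swap trick can be sketched in a few lines of NumPy. All values below are random placeholders; the point is only that merging and task switching are cheap matrix additions with the same shape as W_0.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 2, 3, 1

W0 = rng.normal(size=(d, k))                              # frozen pre-trained weights
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # LoRA weights for Task 1
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # LoRA weights for Task 2

# Merge once after training: inference then uses a single matrix multiply,
# exactly like the original model.
W_task1 = W0 + B1 @ A1

# Switching tasks: subtract one low-rank update and add the other,
# instead of loading an entirely new model from disk.
W_task2 = W_task1 - B1 @ A1 + B2 @ A2
assert np.allclose(W_task2, W0 + B2 @ A2)
```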
Appendix B: Why Does the Low-Rank Hypothesis Make Sense?
The idea that a massive, over-parameterized model can be adapted by changing only a small number of parameters might seem counter-intuitive, but it is supported by research into the dynamics of deep learning.
The LoRA paper builds on work by Aghajanyan et al. (2020) and others, which showed that pre-trained language models have a low "intrinsic dimension." This suggests that even though they exist in a very high-dimensional parameter space (e.g., 175 billion dimensions), they can learn new tasks effectively by moving along a much smaller, lower-dimensional manifold within that space.
LoRA takes this idea a step further by hypothesizing that the change in weights during adaptation (ΔW) also has a low "intrinsic rank." This means that the adjustments needed for a new task are not a complex, high-rank transformation but a simpler, low-rank one. The remarkable empirical success of LoRA, even with ranks as low as 1 or 2, provides strong evidence for this hypothesis. By decomposing the update into two small matrices, LoRA is essentially forcing the model to learn an update within this low-rank subspace, which proves to be a very effective and efficient constraint.