How does low-rank adaptation for large language models work

Lewis Won


Why study LoRA? The challenge of fine-tuning massive models

Large Language Models (LLMs), because of their sheer size, present a significant challenge for training and deployment. For example, a model like GPT-3 has 175 billion parameters, which take up roughly 350GB of storage in FP16 precision and can consume up to 1.2TB of VRAM during training. (Note: I use GPT-3 as an example throughout this article to stay close to the original paper on LoRA.)

Fine-tuning is a way to adapt these pre-trained models to specific downstream tasks. Full fine-tuning involves updating all of the model's weights, which means that for every new task, you create a new, massive version of the model. Imagine a company wanting to offer 100 different specialized models to its customers. With full fine-tuning, this would require storing 100 separate 350GB models, consuming 35 terabytes of storage. This could make it difficult to deploy and manage customized LLMs at scale.

Low-Rank Adaptation (LoRA), introduced in the paper "LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al., allows for the adaptation of massive models with a tiny fraction of the trainable parameters, significantly reducing storage costs and training overhead while maintaining satisfactory performance.

This article provides an intuitive walkthrough of how LoRA achieves this. This article was written with the assistance of Google Gemini 2.5 Pro.


Conceptual overview: Full Fine-Tuning

To understand the innovation of LoRA, we first need to look at the standard fine-tuning process. A neural network, at its core, is composed of layers, many of which perform matrix multiplication using weight matrices.

Let's imagine a single weight matrix in a pre-trained model, which we'll call W_0. This matrix might have thousands of rows and columns.

\text{Pre-trained Weights } (W_0)

When we fine-tune this model on a new task, we update these weights based on the new data. The process learns a "delta" or change matrix, ΔW, which is added to the original weights.

The new, fine-tuned weight matrix, W_ft, is:

W_{ft} = W_0 + \Delta W

The crucial point here is that the change matrix ΔW has the exact same dimensions as the original matrix W_0. If W_0 has 100 million parameters, then we are training a ΔW that also has 100 million parameters. To save the fine-tuned model, we must save the entire W_ft matrix, which is just as large as the original.

A Simple Example

Imagine our pre-trained weight matrix is a simple 2x3 matrix:

W_0 = \begin{bmatrix} 0.8 & 0.1 & 0.3 \newline 0.2 & 0.7 & 0.5 \end{bmatrix}

After fine-tuning, we might learn the following update matrix:

\Delta W = \begin{bmatrix} 0.1 & -0.2 & 0.05 \newline -0.05 & 0.15 & 0.1 \end{bmatrix}

The new, fully fine-tuned weight matrix would be:

W_{ft} = W_0 + \Delta W = \begin{bmatrix} 0.9 & -0.1 & 0.35 \newline 0.15 & 0.85 & 0.6 \end{bmatrix}

To make this change, we had to train and store 6 new values for ΔW. For a model like GPT-3, this means training and storing 175 billion new values for every single task.
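To see this in code, here is a minimal numpy sketch of the same toy example (the values come from the matrices above; the variable names are just for illustration):

```python
import numpy as np

# Pre-trained weights
W0 = np.array([[0.8, 0.1, 0.3],
               [0.2, 0.7, 0.5]])

# Update learned by full fine-tuning: same shape as W0
delta_W = np.array([[0.1, -0.2, 0.05],
                    [-0.05, 0.15, 0.1]])

W_ft = W0 + delta_W
print(W_ft)          # [[0.9 -0.1 0.35], [0.15 0.85 0.6]]
print(delta_W.size)  # 6 trainable values, i.e. as many as W0 itself
```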

Link back to the Math Equation on Full Fine-tuning

The paper provides the equation below to describe the standard process of full fine-tuning:

\max_{\Phi} \sum_{(x,y) \in Z} \sum_{t=1}^{|y|} \log(P_{\Phi}(y_t | x, y_{<t}))

  • What it means: This formula states that we want to find the best possible set of model parameters, denoted by Φ, that maximizes the probability of generating the correct output sequences (y) given the input sequences (x) and all the preceding correct tokens y_{<t}. We are optimizing over the entire set of parameters in the model.

Let's dissect each component:

  • max_Φ: This means our goal is to maximize the following expression by changing the model's parameters, denoted by the set Φ. In full fine-tuning, Φ represents every single weight and bias in the entire model. For GPT-3, this is a set of ~175 billion parameters.

  • ∑_{(x,y)∈Z}: This tells us to sum the results over our entire training dataset, which is a set Z of context-target pairs (x, y). For example, in a summarization task, x would be a long article and y would be its short summary.

  • ∑_{t=1}^{|y|}: This reflects the autoregressive nature of language models. For each target sequence y, we sum over every single token from the beginning (t=1) to the end (|y|). The model tries to predict each token correctly, one by one.

  • log(·): We use the logarithm of the probability. This is a standard technique that makes the math more stable and turns a long product of probabilities into a more manageable sum (the "log-likelihood"). Maximizing the log-probability is the same as maximizing the probability itself.

  • P_Φ(y_t | x, y_{<t}): This is the probability assigned by the model P (with its current parameters Φ) to the correct next token y_t, given the input context x and all the preceding correct tokens y_{<t}.

In simple terms, this equation says: adjust all 175 billion parameters (Φ) to make the model as good as possible at predicting the next correct word in the sequence for all the examples in our training data. The model starts with pre-trained weights Φ_0 and learns an update ΔΦ, resulting in final weights Φ_0 + ΔΦ. The problem is that ΔΦ is just as large as Φ_0.
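To make the objective concrete, here is a minimal numpy sketch of the inner log-likelihood sum for a single (x, y) pair (the toy vocabulary, probabilities, and target indices are invented for illustration):

```python
import numpy as np

# Toy example: the model's predicted next-token distributions for one (x, y) pair.
# Row t holds P(token | x, y_<t) over a 5-word vocabulary, for each position of y.
probs = np.array([
    [0.1, 0.6, 0.1, 0.1, 0.1],    # position t=1
    [0.2, 0.2, 0.5, 0.05, 0.05],  # position t=2
    [0.3, 0.1, 0.1, 0.4, 0.1],    # position t=3
])
y = np.array([1, 2, 3])  # the correct token index at each position

# Inner sum of the objective: log-likelihood of the target sequence
log_likelihood = np.sum(np.log(probs[np.arange(len(y)), y]))
print(log_likelihood)  # log(0.6) + log(0.5) + log(0.4) ≈ -2.12

# Full fine-tuning adjusts every parameter in Φ to maximize this,
# summed over all (x, y) pairs in the dataset Z.
```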

  • Link to this article: This equation is the mathematical representation of the concept described in the section "Conceptual overview: Full Fine-Tuning".
    • The initial pre-trained weights, Φ_0, correspond to our simple matrix example W_0.
    • The final, optimized parameters, Φ, are equivalent to our fine-tuned matrix W_ft.
    • The update, ΔΦ, which is what we learn during training, corresponds to ΔW.
    • The paper's key point that |ΔΦ| equals |Φ_0| is exactly what we illustrated: the update matrix ΔW has the same large dimensions as the original matrix W_0, making it expensive to train and store.

Conceptual Overview: Low-Rank Adaptation (LoRA)

LoRA is built on a key insight: the update matrix ΔW does not need to have full rank to be effective. The authors hypothesize that the change in weights during adaptation has a low "intrinsic rank". This means that the large ΔW matrix can be approximated well by multiplying two much smaller matrices.

Instead of learning ΔW directly, LoRA learns two smaller matrices, which we'll call A and B.

\Delta W \approx B \cdot A

This is a low-rank decomposition. The "rank" (r) is a small number we choose (like 1, 2, 8, or 64) that determines the inner dimension of these thin matrices.

  • If W_0 is a d × k matrix, then ΔW is also d × k.
  • With LoRA, matrix A will have dimensions r × k, and matrix B will have dimensions d × r.

The number of trainable parameters is now the sum of the parameters in A and B, (d × r) + (r × k), which is dramatically smaller than the d × k parameters in ΔW, especially when r is much smaller than d and k.
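As a quick sanity check on these counts, here is a small Python calculation for a hypothetical 4096 × 4096 weight matrix with rank r = 8 (the dimensions and rank are illustrative choices, not figures from the paper):

```python
d, k = 4096, 4096   # hypothetical dimensions of one weight matrix
r = 8               # chosen LoRA rank

full_update_params = d * k               # training ΔW directly
lora_params = d * r + r * k              # B (d x r) plus A (r x k)

print(full_update_params)                # 16777216
print(lora_params)                       # 65536
print(full_update_params / lora_params)  # 256x fewer trainable parameters
```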

Crucially, during training with LoRA, the original weights W_0 are frozen and do not receive gradient updates. We only train the much smaller A and B matrices.

LoRA figure one

Figure 1 from the paper "LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al. The pretrained weights W are frozen. Only the low-rank matrices A and B are trained.

The Same Example with LoRA

Let's use our 2x3 weight matrix W_0 again.

W_0 = \begin{bmatrix} 0.8 & 0.1 & 0.3 \newline 0.2 & 0.7 & 0.5 \end{bmatrix}

Here, d = 2 and k = 3. Let's choose a tiny rank, r = 1.

  • Matrix B will be d × r → 2 × 1.
  • Matrix A will be r × k → 1 × 3.

Suppose after training, we learn the following A and B:

B = \begin{bmatrix} 0.4 \newline 0.2 \end{bmatrix} ,\quad A = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix}

Now, let's compute our low-rank approximation of ΔW:

\begin{aligned} \Delta W &\approx B \cdot A \newline &= \begin{bmatrix} 0.4 \newline 0.2 \end{bmatrix} \cdot \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix} \newline &= \begin{bmatrix} 0.1 & -0.2 & 0.04 \newline 0.05 & -0.1 & 0.02 \end{bmatrix} \end{aligned}

The key comparison is the number of parameters we had to train:

  • Full Fine-Tuning: The ΔW matrix had 2 × 3 = 6 parameters.
  • LoRA (r=1): We trained a 2 × 1 matrix B and a 1 × 3 matrix A. The total is (2 × 1) + (1 × 3) = 5 parameters.

While the savings seem small here, for a large matrix in a real model, the difference is significant. The paper notes that for GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times (from 175B to about 17M). This means the storage for each new task drops from 350GB to just a few dozen megabytes.
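Here is the same toy decomposition in numpy (values copied from the example above; an illustrative sketch, not code from the paper):

```python
import numpy as np

B = np.array([[0.4],
              [0.2]])              # shape (2, 1), d x r
A = np.array([[0.25, -0.5, 0.1]])  # shape (1, 3), r x k

delta_W = B @ A                    # shape (2, 3), same as W_0
print(delta_W)
# [[ 0.1  -0.2   0.04]
#  [ 0.05 -0.1   0.02]]

print(B.size + A.size)             # 5 trainable parameters instead of 6
```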


Mathematical Walk-through of Forward Pass with LoRA

The standard forward pass for a layer is h = W · x, where h is the output, W is the weight matrix, and x is the input.

With LoRA, the output of the frozen pre-trained weights is computed as usual, and the output of the LoRA matrices is added to it.

The modified forward pass is:

h = W_0 \cdot x + B \cdot A \cdot x

This can be written as h = (W_0 + B · A) · x. Let's use our example matrices to walk through the calculation.

1. Define Inputs

  • Pre-trained weights W_0:
    W_0 = \begin{bmatrix} 0.8 & 0.1 & 0.3 \newline 0.2 & 0.7 & 0.5 \end{bmatrix}
  • Trained LoRA matrices A and B (with r=1):
    B = \begin{bmatrix} 0.4 \newline 0.2 \end{bmatrix} ,\quad A = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix}
  • An input vector x:
    x = \begin{bmatrix} 10 \newline 20 \newline 30 \end{bmatrix}

2. Calculate the original path
First, we compute the output from the frozen, pre-trained weights.

\begin{aligned} h_0 = W_0 \cdot x &= \begin{bmatrix} 0.8 & 0.1 & 0.3 \newline 0.2 & 0.7 & 0.5 \end{bmatrix} \cdot \begin{bmatrix} 10 \newline 20 \newline 30 \end{bmatrix} \newline &= \begin{bmatrix} (8 + 2 + 9) \newline (2 + 14 + 15) \end{bmatrix} \newline &= \begin{bmatrix} 19 \newline 31 \end{bmatrix} \end{aligned}

3. Calculate the LoRA path
Next, we compute the update from our trained LoRA matrices. It's more efficient to multiply A · x first.

\begin{aligned} A \cdot x &= \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 10 \newline 20 \newline 30 \end{bmatrix} \newline &= [ (2.5 - 10 + 3) ] \newline &= [-4.5] \end{aligned}

Now multiply that result by B:

\begin{aligned} \Delta h &= B \cdot (A \cdot x) \newline &= \begin{bmatrix} 0.4 \newline 0.2 \end{bmatrix} \cdot [-4.5] \newline &= \begin{bmatrix} -1.8 \newline -0.9 \end{bmatrix} \end{aligned}

4. Combine the outputs
Finally, we add the two results together to get the final output h.

\begin{aligned} h &= h_0 + \Delta h \newline &= \begin{bmatrix} 19 \newline 31 \end{bmatrix} + \begin{bmatrix} -1.8 \newline -0.9 \end{bmatrix} \newline &= \begin{bmatrix} 17.2 \newline 30.1 \end{bmatrix} \end{aligned}

This process, keeping W_0 frozen and only passing gradients through the B · A path, is how LoRA achieves its remarkable parameter efficiency during training.
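The whole forward pass fits in a few lines of numpy (same toy values as above; a sketch of the computation, not a production implementation):

```python
import numpy as np

W0 = np.array([[0.8, 0.1, 0.3],
               [0.2, 0.7, 0.5]])   # frozen pre-trained weights
B = np.array([[0.4], [0.2]])       # trainable, d x r
A = np.array([[0.25, -0.5, 0.1]])  # trainable, r x k
x = np.array([10.0, 20.0, 30.0])   # input vector

h0 = W0 @ x              # original path: [19. 31.]
delta_h = B @ (A @ x)    # LoRA path: [-1.8 -0.9]
h = h0 + delta_h         # combined output: [17.2 30.1]
print(h)
```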


Mathematical Walk-through of the Backward Pass

The "backward pass," or backpropagation, is the core mechanism by which a neural network learns. Its goal is to calculate how much each trainable parameter in the model contributed to the final error (or "loss"). Once we know this, we can adjust the parameters slightly to reduce that error.

In LoRA, the key efficiency gain comes from the fact that we only need to calculate these adjustments for the tiny A and B matrices. The massive pre-trained weight matrix, W_0, is frozen, so we can completely ignore it during the backward pass, saving immense amounts of computation and memory.

Let's walk through how the gradients for A and B are calculated.

1. The Setup

First, let's recall the forward pass equation:

h = W_0 \cdot x + B \cdot A \cdot x

The backward pass starts with a gradient signal coming from the next layer of the network. This signal, which we'll call grad_h, tells us how the final loss L would change with respect to a small change in our output, h. Mathematically, this is the derivative ∂L/∂h. Our task is to use this incoming gradient to figure out ∂L/∂A and ∂L/∂B.

We will use the same matrices and input vector from our forward pass example:

  • A = \begin{bmatrix} 0.25 & -0.5 & 0.1 \end{bmatrix}
  • B = \begin{bmatrix} 0.4 \newline 0.2 \end{bmatrix}
  • x = \begin{bmatrix} 10 \newline 20 \newline 30 \end{bmatrix}

Let's assume the incoming gradient grad_h is:

\begin{aligned} \text{grad\_h} &= \frac{\partial L}{\partial h} \newline &= \begin{bmatrix} 0.5 \newline -0.2 \end{bmatrix} \end{aligned}

2. The Chain Rule in Action

The gradient for the LoRA path (Δh = B · A · x) is the same as the gradient for the total output h, because W_0 · x is treated as a constant. The gradient flows back only through the parts of the computation that involve our trainable parameters.

3. Calculating the Gradient for B (grad_B)

To find how B affects the loss, we use the chain rule. The gradient of the loss with respect to B is found by multiplying the gradient of the output (grad_h) by how B affects the output.

The update term is Δh = B · (A · x). The derivative of this with respect to B involves the term it was multiplied by, which is (A · x).

The formula is:

\frac{\partial L}{\partial B} = \text{grad\_h} \cdot (A \cdot x)^T

Let's calculate this:

  • First, we need the term (A · x). We already computed this in the forward pass:
    A \cdot x = [-4.5]
  • The transpose, (A · x)^T, is [-4.5].
  • Now we multiply:
    \begin{aligned} \text{grad\_B} &= \frac{\partial L}{\partial B} \newline &= \begin{bmatrix} 0.5 \newline -0.2 \end{bmatrix} \cdot [-4.5] \newline &= \begin{bmatrix} 0.5 \times -4.5 \newline -0.2 \times -4.5 \end{bmatrix} \newline &= \begin{bmatrix} -2.25 \newline 0.9 \end{bmatrix} \end{aligned}
    This grad_B matrix, which has the same shape as B, tells us how to adjust each element in B to reduce the loss.

4. Calculating the Gradient for A (grad_A)

Similarly, we find how A affects the loss. This time, the gradient has to pass back through B first.

The formula is:

\frac{\partial L}{\partial A} = (B^T \cdot \text{grad\_h}) \cdot x^T

Let's calculate this step-by-step:

  • First, we need the transpose of B:
    B^T = \begin{bmatrix} 0.4 & 0.2 \end{bmatrix}
  • Next, we multiply B^T by grad_h:
    \begin{aligned} B^T \cdot \text{grad\_h} &= \begin{bmatrix} 0.4 & 0.2 \end{bmatrix} \cdot \begin{bmatrix} 0.5 \newline -0.2 \end{bmatrix} \newline &= [ (0.4 \times 0.5) + (0.2 \times -0.2) ] \newline &= [0.2 - 0.04] \newline &= [0.16] \end{aligned}
  • Finally, we multiply this result by the transpose of the input, x^T:
    \begin{aligned} \text{grad\_A} &= \frac{\partial L}{\partial A} \newline &= [0.16] \cdot \begin{bmatrix} 10 & 20 & 30 \end{bmatrix} \newline &= \begin{bmatrix} 1.6 & 3.2 & 4.8 \end{bmatrix} \end{aligned}
    This grad_A matrix, which has the same shape as A, tells us how to adjust A.

5. Updating the Weights

After calculating the gradients, the optimizer performs a weight update. Using a simple learning rate (lr), the update rule is:

A_{\text{new}} = A_{\text{old}} - \text{lr} \cdot \text{grad\_A}
B_{\text{new}} = B_{\text{old}} - \text{lr} \cdot \text{grad\_B}

And that's it! The crucial part is that W_0 is never updated. It is not part of the gradient calculation and requires no memory to store its gradients or optimizer states (like momentum). This is the source of LoRA's efficiency during the training process.
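Here is the same backward pass and update written out in numpy (toy values from the walkthrough plus a made-up learning rate; a sketch of the gradient math, not how a framework actually implements it):

```python
import numpy as np

A = np.array([[0.25, -0.5, 0.1]])       # r x k, trainable
B = np.array([[0.4], [0.2]])            # d x r, trainable
x = np.array([[10.0], [20.0], [30.0]])  # input as a column vector
grad_h = np.array([[0.5], [-0.2]])      # dL/dh from the next layer

Ax = A @ x                     # [[-4.5]], reused from the forward pass
grad_B = grad_h @ Ax.T         # dL/dB = grad_h · (A·x)^T -> [[-2.25], [0.9]]
grad_A = (B.T @ grad_h) @ x.T  # dL/dA = (B^T·grad_h) · x^T -> [[1.6, 3.2, 4.8]]

lr = 0.01                      # illustrative learning rate
A_new = A - lr * grad_A
B_new = B - lr * grad_B
print(grad_B.ravel(), grad_A.ravel())
# W0 never appears here: it receives no gradients and no optimizer state.
```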

Link Back to the Math Equation on LoRA

The equation below from the paper describes the optimization objective of LoRA:

\max_{\Theta} \sum_{(x,y) \in Z} \sum_{t=1}^{|y|} \log(P_{\Phi_0 + \Delta\Phi(\Theta)}(y_t | x, y_{<t}))

The key changes in this equation, compared to the previous equation for full-model fine-tuning, are:

  • max_Θ: This is the most important change. Instead of optimizing over the massive set of parameters Φ, we are now optimizing over a much smaller set of parameters, denoted by Θ. As the paper notes, the size of Θ can be as small as 0.01% of the size of the original parameters (|Θ| ≪ |Φ_0|). In our LoRA explainer, Θ represents the collection of all the trainable values in our small A and B matrices.

  • P_{Φ_0 + ΔΦ(Θ)}: This shows how the model's weights are constructed.

    • Φ_0: These are the original, pre-trained weights of the large model. They are treated as a fixed constant and are not trained. This corresponds to our frozen W_0 matrix.
    • ΔΦ(Θ): This represents the weight update, but it's no longer a huge matrix of trainable parameters. Instead, it's a function that generates the large update matrix from the small set of parameters Θ. For LoRA, this function is the matrix multiplication of our small matrices: ΔΦ(Θ) = B · A. The parameters Θ are the entries of B and A.
  • What it means:

    1. The optimization is no longer over the massive parameter set Φ, but over a much smaller set of parameters denoted by Θ. The paper states |Θ| ≪ |Φ_0|.
    2. The update to the weights, ΔΦ, is now a function of this small parameter set, written as ΔΦ(Θ). The original weights Φ_0 remain frozen.
  • Link to this article: This equation is the mathematical foundation for the section "Conceptual Overview: Low-Rank Adaptation (LoRA)".

    • The small, trainable parameter set Θ represents the collection of all the elements in our low-rank matrices, A and B.
    • The function that generates the large update, ΔΦ(Θ), corresponds directly to the matrix multiplication B · A. This operation takes the small number of parameters in A and B and uses them to produce the full-sized update matrix ΔW.
    • The term Φ_0 + ΔΦ(Θ) in the equation is precisely what we illustrated in the forward pass: h = (W_0 + B · A) · x.

LoRA in the Self-Attention Module

The self-attention mechanism is a cornerstone of the Transformer architecture. For each input token, it computes Query (Q), Key (K), and Value (V) vectors. These are generated by multiplying the input embedding (x) with three distinct weight matrices: W_q, W_k, W_v. After the attention scores are calculated and applied to the Value vectors, the result is passed through a final output projection matrix, W_o, to produce the layer's output.

While LoRA can be applied to any weight matrix, the paper's authors focus their study on the weight matrices in the self-attention module (W_q, W_k, W_v, W_o). They find that for maximum parameter-efficiency, it is often sufficient to adapt only a subset of these matrices. For example, adapting only the query (W_q) and value (W_v) matrices can yield strong performance. For the purpose of a complete illustration, we will describe the process as if LoRA is applied to all four:

  • For W_q, we add B_q · A_q
  • For W_k, we add B_k · A_k
  • For W_v, we add B_v · A_v
  • For W_o, we add B_o · A_o

All eight of these LoRA matrices (A_q, B_q, A_k, B_k, etc.) are the only parameters that are trained. The original W matrices remain frozen. The calculations for each of the four paths are independent and can be performed in parallel.

Mathematical Walk-through of the Forward Pass

To keep the example clear, we will focus on the generation of a single Query vector. The exact same logic applies simultaneously to the Key and Value vectors.

1. Define Inputs

Let's assume our model has an embedding dimension (d_model) of 4, and the dimension of the Query/Key vectors (d_q or d_k) is 3.

  • Input Embedding (x), for a single token:

    x = \begin{bmatrix} 2 \newline 1 \newline 0.5 \newline 1.5 \end{bmatrix} \quad (\text{shape } 4 \times 1)
  • Pre-trained Query Weight Matrix (W_q):

    W_q = \begin{bmatrix} 0.1 & 0.8 & 0.2 & 0.4 \newline 0.5 & 0.3 & 0.7 & 0.1 \newline 0.9 & 0.2 & 0.3 & 0.6 \end{bmatrix} \quad (\text{shape } 3 \times 4)
  • Trainable LoRA Matrices for Query (A_q, B_q), with rank r=1:

    B_q = \begin{bmatrix} 0.5 \newline -0.2 \newline 0.1 \end{bmatrix} \quad (\text{shape } 3 \times 1) \quad A_q = \begin{bmatrix} 0.2 & -0.1 & 0.4 & 0.3 \end{bmatrix} \quad (\text{shape } 1 \times 4)

2. Calculate the Original Path (Frozen)

First, we compute the output from the pre-trained W_q matrix.

q_0 = W_q \cdot x = \begin{bmatrix} 0.1 & 0.8 & 0.2 & 0.4 \newline 0.5 & 0.3 & 0.7 & 0.1 \newline 0.9 & 0.2 & 0.3 & 0.6 \end{bmatrix} \cdot \begin{bmatrix} 2 \newline 1 \newline 0.5 \newline 1.5 \end{bmatrix} = \begin{bmatrix} 0.2+0.8+0.1+0.6 \newline 1.0+0.3+0.35+0.15 \newline 1.8+0.2+0.15+0.9 \end{bmatrix} = \begin{bmatrix} 1.7 \newline 1.8 \newline 3.05 \end{bmatrix}

3. Calculate the LoRA Path (Trainable)

Next, we compute the update from our LoRA matrices.

A_q \cdot x = \begin{bmatrix} 0.2 & -0.1 & 0.4 & 0.3 \end{bmatrix} \cdot \begin{bmatrix} 2 \newline 1 \newline 0.5 \newline 1.5 \end{bmatrix} = [0.4 - 0.1 + 0.2 + 0.45] = [0.95]

\Delta q = B_q \cdot (A_q \cdot x) = \begin{bmatrix} 0.5 \newline -0.2 \newline 0.1 \end{bmatrix} \cdot [0.95] = \begin{bmatrix} 0.475 \newline -0.19 \newline 0.095 \end{bmatrix}

4. Combine for the Final Query Vector

Finally, we add the outputs of the two paths.

q_{final} = q_0 + \Delta q = \begin{bmatrix} 1.7 \newline 1.8 \newline 3.05 \end{bmatrix} + \begin{bmatrix} 0.475 \newline -0.19 \newline 0.095 \end{bmatrix} = \begin{bmatrix} 2.175 \newline 1.61 \newline 3.145 \end{bmatrix}

This q_final is the Query vector that will be used in the attention score calculation.

5. The Bigger Picture

This exact process happens in parallel for W_k (with A_k, B_k) and W_v (with A_v, B_v) to produce k_final and v_final. Those three vectors then proceed to the standard attention calculation. The final output of the attention mechanism is then passed through the W_o layer, which has its own LoRA update ( h_{\text{out}} = (W_o + B_o A_o) \cdot h_{\text{in}} ).
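Putting the pieces together, here is a minimal numpy sketch of LoRA applied to the query and value projections of one attention layer (the LoRAProjection helper class, the random stand-in weights, and the dimensions are assumptions for illustration; initializing B to zero so that ΔW starts at zero matches the paper's setup):

```python
import numpy as np

class LoRAProjection:
    """A frozen projection W plus a trainable low-rank update B @ A."""
    def __init__(self, W, r, seed=0):
        rng = np.random.default_rng(seed)
        d, k = W.shape
        self.W = W                                 # frozen pre-trained weights
        self.A = rng.normal(0, 0.01, size=(r, k))  # trainable, r x k
        self.B = np.zeros((d, r))                  # trainable, d x r (zero init => ΔW = 0 at start)

    def __call__(self, x):
        return self.W @ x + self.B @ (self.A @ x)

d_model, d_head, r = 4, 3, 1
rng = np.random.default_rng(42)
W_q = rng.normal(size=(d_head, d_model))  # stand-ins for pre-trained weights
W_v = rng.normal(size=(d_head, d_model))

q_proj = LoRAProjection(W_q, r)
v_proj = LoRAProjection(W_v, r)

x = np.array([2.0, 1.0, 0.5, 1.5])  # one token's embedding
q, v = q_proj(x), v_proj(x)         # q_final and v_final for this token
print(q.shape, v.shape)             # (3,) (3,)
```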

Mathematical Walk-through of the Backward Pass

During backpropagation, the gradients flow backward from the loss function. The attention mechanism will provide an incoming gradient for q_final, k_final, and v_final. We'll demonstrate the process using the incoming gradient for our Query vector, grad_q.

1. The Setup

Let's assume the incoming gradient from the rest of the network for our Query vector is:

\text{grad\_q} = \frac{\partial L}{\partial q_{final}} = \begin{bmatrix} 0.8 \newline 0.1 \newline -0.4 \end{bmatrix}

Our goal is to use grad_q to calculate the gradients for A_q and B_q. The gradient does not flow back into W_q.

2. Calculating the Gradient for B_q

The gradient for B_q is found by multiplying the incoming gradient grad_q by the term that B_q was multiplied by in the forward pass: (A_q * x).

\text{grad\_B\_q} = \frac{\partial L}{\partial B_q} = \text{grad\_q} \cdot (A_q \cdot x)^T

We already know from the forward pass that A_q * x = [0.95].
\text{grad\_B\_q} = \begin{bmatrix} 0.8 \newline 0.1 \newline -0.4 \end{bmatrix} \cdot [0.95] = \begin{bmatrix} 0.76 \newline 0.095 \newline -0.38 \end{bmatrix}

This is the adjustment signal for the B_q matrix.

3. Calculating the Gradient for A_q

The gradient for A_q requires the gradient to pass back through B_q first.

\text{grad\_A\_q} = \frac{\partial L}{\partial A_q} = (B_q^T \cdot \text{grad\_q}) \cdot x^T

First, let's compute B_q^T * grad_q:
B_q^T \cdot \text{grad\_q} = \begin{bmatrix} 0.5 & -0.2 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 0.8 \newline 0.1 \newline -0.4 \end{bmatrix} = [0.4 - 0.02 - 0.04] = [0.34]

Now, multiply this by x^T:
\text{grad\_A\_q} = [0.34] \cdot \begin{bmatrix} 2 & 1 & 0.5 & 1.5 \end{bmatrix} = \begin{bmatrix} 0.68 & 0.34 & 0.17 & 0.51 \end{bmatrix}

This is the adjustment signal for the A_q matrix.

4. Updating All LoRA Weights

The optimizer will use these computed gradients (grad_B_q, grad_A_q) to update the weights of B_q and A_q.

Crucially, this same backward pass logic is applied independently and simultaneously for the other LoRA pairs. The incoming gradient grad_k is used to update A_k and B_k, grad_v is used to update A_v and B_v, and the gradient from the next layer is used to update A_o and B_o. The massive W_q, W_k, W_v, W_o matrices are completely bypassed during the gradient computation, leading to massive savings in time and memory.
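As a quick numerical check, the same gradients fall out of a few lines of numpy (toy values from the walkthrough above):

```python
import numpy as np

A_q = np.array([[0.2, -0.1, 0.4, 0.3]])     # 1 x 4
B_q = np.array([[0.5], [-0.2], [0.1]])      # 3 x 1
x = np.array([[2.0], [1.0], [0.5], [1.5]])  # 4 x 1 column vector
grad_q = np.array([[0.8], [0.1], [-0.4]])   # dL/dq_final, 3 x 1

grad_B_q = grad_q @ (A_q @ x).T    # [[0.76], [0.095], [-0.38]]
grad_A_q = (B_q.T @ grad_q) @ x.T  # [[0.68, 0.34, 0.17, 0.51]]

print(grad_B_q.ravel())
print(grad_A_q.ravel())
# W_q itself never receives a gradient: only A_q and B_q are updated.
```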


Appendix A: The "No Additional Inference Latency" Trick

A common drawback of other adaptation methods (like adding "adapter" layers) is that they introduce extra layers or computations that permanently increase the model's inference latency. LoRA cleverly avoids this.

During training, the forward pass involves two paths, as we saw: h = W_0 · x + B · A · x. This does add a small amount of extra computation.

However, once training is complete, we can prepare the model for deployment (inference). Since matrices A and B are now fixed, we can perform their matrix multiplication once to get our final update matrix ΔW.

\Delta W = B \cdot A

Then, we can add this final update matrix directly to the original pre-trained weights to create a new, fully merged weight matrix, W_ft.

W_{ft} = W_0 + \Delta W

Using our example numbers:

\Delta W = \begin{bmatrix} 0.1 & -0.2 & 0.04 \newline 0.05 & -0.1 & 0.02 \end{bmatrix}

\begin{aligned} W_{ft} &= W_0 + \Delta W \newline &= \begin{bmatrix} 0.8 & 0.1 & 0.3 \newline 0.2 & 0.7 & 0.5 \end{bmatrix} + \begin{bmatrix} 0.1 & -0.2 & 0.04 \newline 0.05 & -0.1 & 0.02 \end{bmatrix} \newline &= \begin{bmatrix} 0.9 & -0.1 & 0.34 \newline 0.25 & 0.6 & 0.52 \end{bmatrix} \end{aligned}

When the model is deployed, we just use this merged W_ft matrix. The forward pass becomes h = W_ft · x, which is a single matrix multiplication. This has the exact same computational cost and latency as the original, non-fine-tuned model.

This also makes switching between tasks incredibly fast. As the paper notes, to switch from Task 1 (with matrices B_1, A_1) to Task 2 (with B_2, A_2), you can subtract B_1 · A_1 from the merged weights and add B_2 · A_2 to get W_0 + B_2 · A_2, which is a very fast operation compared to loading an entirely new 350GB model from disk.
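A minimal numpy sketch of the merge and task-switching trick (the Task 2 adapter matrices B_2 and A_2 are invented for illustration):

```python
import numpy as np

W0 = np.array([[0.8, 0.1, 0.3],
               [0.2, 0.7, 0.5]])
B1 = np.array([[0.4], [0.2]])
A1 = np.array([[0.25, -0.5, 0.1]])

# Merge for deployment: inference is a single matmul, no extra latency.
W_ft = W0 + B1 @ A1
x = np.array([10.0, 20.0, 30.0])
h = W_ft @ x                        # identical to W0 @ x + B1 @ (A1 @ x) -> [17.2 30.1]
print(h)

# Switch to a hypothetical Task 2 without reloading the 350GB base model:
B2 = np.array([[0.1], [-0.3]])      # invented Task 2 adapters
A2 = np.array([[0.05, 0.2, -0.1]])
W_task2 = W_ft - B1 @ A1 + B2 @ A2  # subtract old adapter, add new one
```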


Appendix B: Why Does the Low-Rank Hypothesis Make Sense?

The idea that a massive, over-parameterized model can be adapted by changing only a small number of parameters might seem counter-intuitive, but it is supported by research into the dynamics of deep learning.

The LoRA paper builds on work by Aghajanyan et al. (2020) and others, which showed that pre-trained language models have a low "intrinsic dimension." This suggests that even though they exist in a very high-dimensional parameter space (e.g., 175 billion dimensions), they can learn new tasks effectively by moving along a much smaller, lower-dimensional manifold within that space.

LoRA takes this idea a step further by hypothesizing that the change in weights during adaptation (ΔW) also has a low "intrinsic rank." This means that the adjustments needed for a new task are not a complex, high-rank transformation but a simpler, low-rank one. The remarkable empirical success of LoRA, even with ranks as low as 1 or 2, provides strong evidence for this hypothesis. By decomposing the update into two small matrices, LoRA is essentially forcing the model to learn an update within this low-rank subspace, which proves to be a very effective and efficient constraint.
