Jun Bae
Why LoRA? Understanding the representative PEFT

Why LoRA?

Low-Rank Adaptation (LoRA) has revolutionized the way we approach Large Language Models (LLMs). As the most prominent Parameter-Efficient Fine-Tuning (PEFT) method, LoRA allows developers to adapt massive models like Llama 3 or GPT-4 to specific tasks without needing a cluster of H100 GPUs.

But how exactly does it work, and why is it so effective? In this post, we’ll dive into the mathematical intuition behind LoRA, the concept of "intrinsic dimension," and why this method is a game-changer for AI engineers.


The Problem: The Cost of Scale

Traditional statistical models, such as simple or generalized linear models and tree-based ensembles (Random Forest, LightGBM, etc.), can be trained from scratch using only their blueprints and architectures. Historically, this was the standard approach, and it was not particularly difficult or cumbersome.

However, the Deep Learning revolution changed the landscape. As model sizes ballooned from millions to billions of parameters, retraining or even fine-tuning a model became logistically impossible for most individuals and even many companies.

Growth of Model Parameters

The Hypothesis of Intrinsic Dimension

Here is the crucial insight: We don't need to train all the parameters.

Research suggests that over-parameterized models reside on a low intrinsic dimension. Simply put, while a model might have billions of parameters, the "knowledge" required to solve a specific new task can be represented by a much smaller subset of variables.

This premise was validated before LoRA's invention by researchers like Li et al. (2018) and Aghajanyan et al. (2020), who showed that learning happens in a lower-dimensional subspace. LoRA operationalizes this theory.


What is LoRA? The Mechanics

So what exactly is LoRA?

The concept is remarkably straightforward. As mentioned, LoRA is a fine-tuning method that trains the model using only a fraction of the total parameters.

Concept of LoRA

The Mathematical Formulation

Let’s look at the linear algebra. Suppose we have a pre-trained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$.

In full fine-tuning, we update the weights by learning a cumulative gradient update $\Delta \mathbf{W}$:

$$h = (\mathbf{W}_0 + \Delta \mathbf{W})x$$

LoRA constrains this update $\Delta \mathbf{W}$ by representing it as the product of two smaller matrices, $\mathbf{B}$ and $\mathbf{A}$, with a low rank $r$:

$$\Delta \mathbf{W} = \mathbf{B}\mathbf{A}, \qquad h = \mathbf{W}_0 x + \mathbf{B}\mathbf{A}x$$

Where:

  • $\mathbf{B} \in \mathbb{R}^{d \times r}$

  • $\mathbf{A} \in \mathbb{R}^{r \times k}$

  • $r \ll \min(d, k)$ (the rank, usually set between 8 and 64)

All you need to do is train the $\mathbf{A}$ and $\mathbf{B}$ matrices. So what exactly are these matrices?
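The factorized update can be sanity-checked in a few lines of NumPy. This is a toy sketch with made-up sizes, not a training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4  # toy sizes; real layers are far larger

W0 = rng.standard_normal((d, k))  # frozen pre-trained weight
B = rng.standard_normal((d, r))   # trainable low-rank factor
A = rng.standard_normal((r, k))   # trainable low-rank factor

x = rng.standard_normal(k)

# The full update and the factored update give the same forward pass
h_full = (W0 + B @ A) @ x
h_lora = W0 @ x + B @ (A @ x)
assert np.allclose(h_full, h_lora)

# Trainable parameters: d*k for full fine-tuning vs. r*(d + k) for LoRA
print(d * k, r * (d + k))  # full = 4096, LoRA = 512
```

Even at these tiny sizes, the factored update needs an eighth of the parameters; the gap widens dramatically as $d$ and $k$ grow.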

Initialization Strategy

To ensure the training starts exactly where the pre-trained model left off (i.e., $\Delta \mathbf{W}$ is zero at the beginning), we initialize:

  • Matrix $\mathbf{A}$: random Gaussian, $\mathcal{N}(0, \sigma^2)$

  • Matrix $\mathbf{B}$: zeros

Therefore, $\mathbf{B}\mathbf{A} = 0$ initially, ensuring no destabilizing noise is added to the model at step zero.
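A tiny NumPy sketch of this initialization (the value of $\sigma$ here is just a typical small choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 8

W0 = rng.standard_normal((d, k))        # frozen base weight
A = rng.normal(0.0, 0.02, size=(r, k))  # Gaussian init (sigma = 0.02 is illustrative)
B = np.zeros((d, r))                    # zero init

x = rng.standard_normal(k)

# At step zero, Delta W = B A = 0, so the adapted model equals the base model
assert np.allclose(B @ A, 0.0)
assert np.allclose((W0 + B @ A) @ x, W0 @ x)
```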

The Scaling Factor $\alpha$

One detail often missed is the scaling factor $\alpha$. The update is actually scaled as:

$$\Delta \mathbf{W} = \frac{\alpha}{r}\,\mathbf{B}\mathbf{A}$$

This allows us to tune the learning rate effectively regardless of the rank rr we choose.
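As a quick sketch, with illustrative values for $\alpha$ and $r$ (in practice $\alpha$ is often set in the same ballpark as $r$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 8
alpha = 16  # illustrative value, not a recommendation

B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))

# Scaled LoRA update: Delta W = (alpha / r) * (B A)
delta_W = (alpha / r) * (B @ A)

W0 = rng.standard_normal((d, k))
x = rng.standard_normal(k)
h = W0 @ x + delta_W @ x  # adapted forward pass
assert h.shape == (d,)
```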

Then the question is: is it accurate enough?

The answer is yes. In some cases it is even more accurate than full fine-tuning.


Why is LoRA Effective?

Let’s talk numbers. Assume $d = 10{,}000$ and $k = 10{,}000$ with a rank $r = 16$.

  • Full fine-tuning: $10{,}000 \times 10{,}000 = 100{,}000{,}000$ trainable parameters.

  • LoRA: $(10{,}000 \times 16) + (16 \times 10{,}000) = 320{,}000$ trainable parameters.
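You can verify the arithmetic directly:

```python
d = k = 10_000
r = 16

full = d * k          # parameters updated by full fine-tuning
lora = d * r + r * k  # parameters trained by LoRA

reduction = 1 - lora / full
print(full, lora, f"{reduction:.2%}")  # 100000000 320000 99.68%
```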

That is a 99.68% reduction in trainable parameters. But is it accurate?

Performance vs. Efficiency

Yes, it is surprisingly accurate.

As shown in the original LoRA paper, on benchmarks like WikiSQL, LoRA performs neck-and-neck with full fine-tuning. In some cases, it even outperforms the baseline because it acts as a form of regularization, preventing the model from overfitting to the small training set.

Then why does it work? It might seem counterintuitive that tuning only a fraction of a percent of the parameters is sufficient, but in practice it is.

If you have a background in statistics, this concept may feel familiar. There is a classic and widely used method for reducing dimensionality: PCA (Principal Component Analysis).

The Intuition: PCA Analogy

Imagine a dataset containing "Biological Species" and "Number of Legs." These two features are highly correlated; knowing the species often tells you the leg count. You don't need two independent dimensions to represent this variance.
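This intuition can be checked numerically with synthetic correlated data. The sketch below does PCA via SVD, with made-up features standing in for the species/legs example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two highly correlated features: the second is almost a function of the first
t = rng.standard_normal(500)
X = np.column_stack([t, 2 * t + 0.05 * rng.standard_normal(500)])
X -= X.mean(axis=0)  # center before PCA

# Singular values measure the variance along each principal axis
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Nearly all the variance lies along a single direction
assert explained[0] > 0.99
```

One dimension captures essentially everything, which is exactly the situation LoRA bets on for weight updates.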

There are other dimensionality reduction techniques as well. The important point is that the underlying concept is valid: reducing dimensionality can improve not only training efficiency but also performance.

Similarly, in deep neural networks, weight updates often occur in a subspace with high correlations. We can project these high-dimensional updates into a lower-rank space without losing significant information.

LoRA takes advantage of this concept and maximizes its effect. Unlike PCA, it has no closed-form formula for finding the subspace; it simply initializes random low-rank matrices and learns the subspace during training.

According to the authors, the subspaces learned by LoRA at different ranks are more similar than previously thought: the directions learned with rank 1 overlap strongly with those learned at higher ranks. They report that even a rank of 1 worked quite well for fine-tuning GPT-3.

Mitigating Catastrophic Forgetting

A unique advantage of LoRA is that it freezes the original weights $\mathbf{W}_0$.

  • Full fine-tuning: overwrites $\mathbf{W}_0$. If you train heavily on coding tasks, the model might "forget" how to write poetry (catastrophic forgetting).

  • LoRA: keeps $\mathbf{W}_0$ intact. You can train one adapter for coding and another for poetry, and simply swap the adapters at runtime without reloading the base model.
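Conceptually, adapter swapping looks like this. This is a pure-NumPy sketch with hypothetical task names, not the peft API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 4

W0 = rng.standard_normal((d, k))  # frozen base weights, shared by all tasks

# One (B, A) adapter pair per task; the task names are illustrative
adapters = {
    "coding": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "poetry": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}

def forward(x, task):
    B, A = adapters[task]  # swap adapters without touching W0
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(k)
h_code = forward(x, "coding")
h_poem = forward(x, "poetry")
assert not np.allclose(h_code, h_poem)  # different behavior, same base model
```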


Conclusion

LoRA has become an indispensable tool in the AI toolkit. Today, most individuals—and even many companies—lack the resources to perform full fine-tuning on models with billions of parameters. Even with access to H100 or A100 GPUs, full fine-tuning is often prohibitively slow and, as demonstrated, rarely provides a better return on investment.

Furthermore, in generic models, an individual query effectively activates only a small portion of the parameters. This is a concept that has been further expanded upon in another famous method, Mixture of Experts (MoE), which I will cover in a future post.

You can leverage LoRA very easily because today’s libraries, such as peft and trl, are so convenient. Just try it: fine-tuning a model with about 10 billion parameters takes roughly 3–6 hours on a single H100, and the output adapter files are only hundreds of megabytes. (Of course, you have to merge them with the original model when you serve it.)
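For reference, a minimal peft setup might look like the following. This is a configuration sketch: the model name and hyperparameter values are illustrative assumptions, not recommendations, and running it requires downloading the base model.

```python
# Configuration sketch using the Hugging Face peft library.
# Model name and hyperparameters are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=16,                                # rank of the low-rank update
    lora_alpha=32,                       # scaling factor alpha
    target_modules=["q_proj", "v_proj"], # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction is trainable
```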


Reference

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the Intrinsic Dimension of Objective Landscapes. arXiv:1804.08838, April 2018. http://arxiv.org/abs/1804.08838

Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255, December 2020. http://arxiv.org/abs/2012.13255

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, June 2021. https://arxiv.org/abs/2106.09685
