Why LoRA?
Low-Rank Adaptation (LoRA) has revolutionized the way we approach Large Language Models (LLMs). As the most prominent Parameter-Efficient Fine-Tuning (PEFT) method, LoRA allows developers to adapt massive models like Llama 3 or GPT-4 to specific tasks without needing a cluster of H100 GPUs.
But how exactly does it work, and why is it so effective? In this post, we’ll dive into the mathematical intuition behind LoRA, the concept of "intrinsic dimension," and why this method is a game-changer for AI engineers.
The Problem: The Cost of Scale
With traditional statistical models, such as simple or generalized linear models and tree-based methods (Random Forest, LightGBM, etc.), we can train our models from scratch using only their blueprints and architectures. Historically, this was the standard approach, and it was not particularly difficult or cumbersome.
However, the Deep Learning revolution changed the landscape. As model sizes ballooned from millions to billions of parameters, retraining or even fine-tuning a model became logistically impossible for most individuals and even many companies.
The Hypothesis of Intrinsic Dimension
Here is the crucial insight: We don't need to train all the parameters.
Research suggests that over-parameterized models reside on a low intrinsic dimension. Simply put, while a model might have billions of parameters, the "knowledge" required to solve a specific new task can be represented by a much smaller subset of variables.
This premise was validated before LoRA's invention by researchers like Li et al. (2018) and Aghajanyan et al. (2020), who showed that learning happens in a lower-dimensional subspace. LoRA operationalizes this theory.
What is LoRA? The Mechanics
So what exactly is LoRA?
The concept is remarkably straightforward. As mentioned, LoRA is a fine-tuning method that trains the model using only a fraction of the total parameters.
The Mathematical Formulation
Let’s look at the linear algebra. Suppose we have a pre-trained weight matrix W₀ ∈ ℝ^(d×k).
In full fine-tuning, we update the weights by learning a cumulative gradient update ΔW:

W = W₀ + ΔW

LoRA constrains this update by representing it as the product of two smaller matrices, B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with a low rank r:

ΔW = BA, so W = W₀ + BA

Where:
r ≪ min(d, k) (the rank, usually set between 8 and 64)
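To make the shapes concrete, here is a small NumPy sketch (the dimensions d, k, and r below are arbitrary illustrative values) confirming that the product BA has the same shape as the full weight matrix but a rank of at most r:

```python
import numpy as np

d, k, r = 64, 32, 4          # illustrative dimensions; r << min(d, k)
B = np.random.randn(d, r)    # d x r
A = np.random.randn(r, k)    # r x k
delta_W = B @ A              # d x k, same shape as the frozen weight matrix

print(delta_W.shape)                   # (64, 32)
print(np.linalg.matrix_rank(delta_W))  # at most 4
```

Even though delta_W contains d × k numbers, it only ever has r(d + k) degrees of freedom, which is exactly where the savings come from.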
All you need to train are the matrices A and B. So what exactly are they, and how are they set up?
Initialization Strategy
To ensure the training starts exactly where the pre-trained model left off (i.e., ΔW is zero at the beginning), we initialize:
Matrix A: Random Gaussian distribution
Matrix B: Zeros
Therefore, BA = 0 initially, ensuring no destabilizing noise is added to the model at step zero.
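A quick sketch of this initialization (shapes and the Gaussian scale are illustrative), confirming the update contributes nothing at step zero:

```python
import numpy as np

d, k, r = 64, 32, 4
A = np.random.randn(r, k) * 0.01   # random Gaussian
B = np.zeros((d, r))               # zeros

delta_W = B @ A
print(np.all(delta_W == 0))        # True: the model starts unchanged
```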
The Scaling Factor
One detail often missed is the scaling factor α. The update is actually scaled as:

ΔW = (α / r) · BA
This allows us to tune the learning rate effectively regardless of the rank we choose.
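Putting the pieces together, here is a minimal, framework-free sketch of a LoRA linear layer (the class name, the Gaussian scale, and the alpha value are illustrative choices, not the exact implementation from the paper or from peft):

```python
import numpy as np

class LoRALinear:
    """y = x @ W0.T + (alpha / r) * x @ (B A).T, with W0 frozen."""

    def __init__(self, W0, r=8, alpha=16):
        d, k = W0.shape
        self.W0 = W0                            # frozen pre-trained weight
        self.A = np.random.randn(r, k) * 0.01   # trainable, Gaussian init
        self.B = np.zeros((d, r))               # trainable, zero init
        self.scale = alpha / r

    def forward(self, x):
        base = x @ self.W0.T
        lora = x @ (self.B @ self.A).T * self.scale
        return base + lora

W0 = np.random.randn(64, 32)
layer = LoRALinear(W0)
x = np.random.randn(2, 32)
# At step zero B = 0, so the layer matches the frozen base model exactly.
print(np.allclose(layer.forward(x), x @ W0.T))   # True
```

During training, only A and B receive gradients; W0 is never touched.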
Then the question becomes: is it accurate enough?
The answer is yes. In some cases, it is even more accurate.
Why is LoRA Effective?
Let’s talk numbers. Assume d = 5000 and k = 5000 with a rank r = 8.
Full Fine-Tuning: d × k = 25,000,000 trainable parameters.
LoRA: r × (d + k) = 80,000 trainable parameters.
That is a 99.68% reduction in trainable parameters. But is it accurate?
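The arithmetic behind that percentage, using illustrative dimensions consistent with the 99.68% figure (d = k = 5000 with rank r = 8):

```python
d, k, r = 5000, 5000, 8
full = d * k        # 25,000,000 trainable parameters in full fine-tuning
lora = r * (d + k)  # 80,000 trainable parameters with LoRA

print(f"{100 * (1 - lora / full):.2f}% reduction")  # 99.68% reduction
```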
Performance vs. Efficiency
Yes, it is surprisingly accurate.
As shown in the original LoRA paper, on benchmarks like WikiSQL, LoRA performs neck-and-neck with full fine-tuning. In some cases, it even outperforms the baseline because it acts as a form of regularization, preventing the model from overfitting to the small training set.
Then why does it work? It might seem counterintuitive that tuning such a tiny fraction of the parameters is sufficient, yet in practice it is.
For those with a background in statistics, this concept may feel familiar. There is a famous, well-established method for reducing dimensionality: PCA (Principal Component Analysis).
The Intuition: PCA Analogy
Imagine a dataset containing "Biological Species" and "Number of Legs." These two features are highly correlated; knowing the species often tells you the leg count. You don't need two independent dimensions to represent this variance.
There are a few other techniques in this family as well, such as clustering.
In any case, the important point is that the underlying concept is valid: reducing the dimension can improve not only training efficiency but also performance.
Similarly, in deep neural networks, weight updates often occur in a subspace with high correlations. We can project these high-dimensional updates into a lower-rank space without losing significant information.
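As a rough illustration of this idea (purely synthetic data, not real model weights), a matrix whose entries are highly correlated can be compressed to a low rank with almost no loss of information:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100 x 100 "update" matrix that secretly lives in a rank-5
# subspace, plus a little noise.
U = rng.standard_normal((100, 5))
V = rng.standard_normal((5, 100))
W = U @ V + 0.01 * rng.standard_normal((100, 100))

# Truncated SVD: keep only the top 5 singular directions.
u, s, vt = np.linalg.svd(W)
W_low = u[:, :5] @ np.diag(s[:5]) @ vt[:5]

rel_err = np.linalg.norm(W - W_low) / np.linalg.norm(W)
print(rel_err < 0.05)   # True: 5 of 100 dimensions capture almost everything
```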
LoRA takes advantage of this concept and maximizes its effect. Unlike PCA, it does not compute an explicit projection with a closed-form formula; it simply creates random low-rank matrices and trains them, letting gradient descent discover a useful subspace.
According to the authors, the subspaces of LLM weight updates are more similar than previously thought: the directions learned at rank 1 overlap substantially with those learned at higher ranks. That is why they report that even a rank of 1 worked quite well for fine-tuning GPT-3.
Mitigating Catastrophic Forgetting
A unique advantage of LoRA is that it freezes the original weights (W₀).
Full Fine-Tuning: Overwrites W₀. If you train heavily on coding tasks, the model might "forget" how to write poetry (Catastrophic Forgetting).
LoRA: Keeps W₀ intact. You can train one adapter for Coding and another for Poetry, and simply swap the adapters at runtime without reloading the base model.
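Conceptually, adapter swapping looks like this (a toy sketch with made-up adapter names, not the actual peft API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 16, 4, 8
W0 = rng.standard_normal((d, k))   # frozen base weight, never modified

# Each adapter is just its own (A, B) pair, a few kilobytes of numbers.
adapters = {
    "coding": (rng.standard_normal((r, k)), rng.standard_normal((d, r))),
    "poetry": (rng.standard_normal((r, k)), rng.standard_normal((d, r))),
}

def effective_weight(name):
    A, B = adapters[name]
    return W0 + (alpha / r) * (B @ A)

W_code = effective_weight("coding")   # swap in the coding adapter
W_poem = effective_weight("poetry")   # swap in the poetry adapter
print(np.allclose(W_code, W_poem))    # False: different tasks, same base
```

Because W0 is shared and untouched, switching tasks is just a dictionary lookup rather than a full model reload.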
Conclusion
LoRA has become an indispensable tool in the AI toolkit. Today, most individuals—and even many companies—lack the resources to perform full fine-tuning on models with billions of parameters. Even with access to H100 or A100 GPUs, full fine-tuning is often prohibitively slow and, as demonstrated, rarely provides a better return on investment.
Furthermore, in large general-purpose models, an individual query effectively activates only a small portion of the parameters. This concept has been expanded upon in another famous method, Mixture of Experts (MoE), which I will cover in a future post.
You can leverage LoRA very easily because today's libraries, such as peft and trl, make it convenient. Just try it: fine-tuning a model of around 10 billion parameters takes roughly 3-6 hours on a single H100, and the output adapter files are only hundreds of megabytes. (Of course, you have to merge them with the original model when you serve it.)
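For reference, a typical peft workflow looks roughly like the following configuration sketch (the model name and every hyperparameter here are illustrative choices, not recommendations; check the peft documentation for your setup):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=8,                                   # the rank r
    lora_alpha=16,                         # the scaling numerator alpha
    target_modules=["q_proj", "v_proj"],   # which layers get adapters
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # a tiny fraction of the total

# ... train with trl or the Hugging Face Trainer ...

merged = model.merge_and_unload()          # fold B A back into W0 for serving
```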
References
Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018). Measuring the Intrinsic Dimension of Objective Landscapes. arXiv:1804.08838.
Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2020). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. arXiv:2012.13255.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.