Bigger LLMs will no longer be performant

#ai #llm #adaption #openai

Recently, I came across an essay titled "On the Death of Scaling" by Sara Hooker (co-founder of Adaption Labs). In it, Sara explains the shortcomings of the simple path that frontier labs have followed to lead the market. She discusses where the notion that "scaling is dead" comes from and what to consider next.

Over the last decade, LLMs have emerged as the favored path toward AI, or what experts call AGI, yet they are still far from fully accurate. Nearly all LLM labs have followed one brute-force rule: add more weights and more compute to outperform other available models. Up to a point, this has worked; with more compute and data, LLMs have outperformed their predecessors and competitors. But the landscape is changing. Much smaller, more recent models, some with fewer than 13B parameters, now outperform older models with far more parameters. For example, Falcon 180B is easily outperformed by models like Llama 3 8B, Command R 35B, and Gemma 3 27B. Additionally, Aya 23 8B and Aya Expanse 8B have outperformed BLOOM 176B with about 95% fewer weights.

The image above, taken from the Hugging Face Open LLM Leaderboard, shows smaller models significantly outperforming larger ones, with scores for both groups leveling off as transformer performance approaches a plateau. This suggests that a bigger size does not always guarantee better performance.
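As a quick sanity check on the size gap between the 8B models and the 176B and 180B models mentioned earlier, the relative reduction in parameter count can be computed directly. This is only a sketch: it uses the nominal sizes from the model names rather than exact parameter counts, and the helper name `percent_fewer` is made up for illustration.

```python
def percent_fewer(small_params: float, large_params: float) -> float:
    """Return how many percent fewer parameters the smaller model has."""
    return (1 - small_params / large_params) * 100


# Nominal sizes in billions of parameters, read off the model names (an assumption).
aya_vs_bloom = percent_fewer(8, 176)
llama_vs_falcon = percent_fewer(8, 180)

print(f"Aya 23 8B vs BLOOM 176B: {aya_vs_bloom:.1f}% fewer parameters")
print(f"Llama 3 8B vs Falcon 180B: {llama_vs_falcon:.1f}% fewer parameters")
```

With these nominal sizes, both comparisons come out at a little over 95% fewer parameters.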

Why such a huge number of parameters?

Surprisingly, almost every new generation of LLMs has launched with many more parameters (10x, 100x, and so on), but no one clearly explains why so many weights are needed. Do we really need that many? Are labs treating size itself as the performance metric, assuming that bigger always means better? One of the first widely adopted deep neural network architectures was Inception (proposed by Google in 2014); its later Inception-v3 variant had only around 23 million weights. Today, a model like Qwen3-235B-A22B comes with 235 billion parameters in total, of which about 22 billion are active per token.

Parameter counts began to grow after a strange phenomenon, double descent, was observed when training deep neural networks. Classical statistical learning theory says that beyond a certain threshold of complexity a model starts to "overfit" its training data, and from that point on, adding complexity hurts performance on unseen data. Surprisingly, large neural networks do not follow this pattern: test error can fall again as the model keeps growing. These models now have billions of parameters, more than enough to fit even random labels, yet they perform much better on many tasks than smaller models. This behavior is still not fully understood, and there is no widely accepted mathematical explanation for it.

Another observation is that, after a neural network is trained, a small subset of its weights turns out to do about 95% of the work, while the rest are largely redundant.

This raises a question: why do we need so many additional parameters? Can't we train only that small subset of weights and achieve the same performance? So far, the answer is no. To match the performance of the trained model, you have to train again with the full number of weights. It is strange but true. The likely reason is that the training methods (and architectures) in use today are inefficient and poorly optimized. As a result, those seemingly unnecessary weights become a critical part of training the model and the reason for its large size.
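One common way this redundancy is exploited after training is magnitude pruning: keep the largest weights and zero out the rest. The sketch below is only an illustration of that idea on a toy list of numbers, not necessarily the method the essay has in mind. The function name `magnitude_prune`, the 5% keep fraction, and the weight distribution are all assumptions.

```python
import random


def magnitude_prune(weights: list[float], keep_fraction: float) -> list[float]:
    """Zero out all but the largest-magnitude weights."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    keep_count = round(len(weights) * keep_fraction)
    # Rank weight indices from largest to smallest absolute value.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:keep_count])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


random.seed(0)
# Toy "layer": mostly tiny weights plus a few large ones (an assumed distribution).
layer = [random.gauss(0, 0.01) for _ in range(95)] + [random.gauss(0, 1.0) for _ in range(5)]
pruned = magnitude_prune(layer, keep_fraction=0.05)

nonzero = sum(1 for w in pruned if w != 0.0)
print(f"kept {nonzero} of {len(layer)} weights")
```

Note that pruning like this is applied to a network that has already been trained at full size, which is consistent with the point above: the small subset is easy to identify afterwards, but hard to train on its own from scratch.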

Deep Neural Networks

Deep neural networks, or deep learning networks, are at the heart of every LLM you use today. However, they are incredibly inefficient learners. A deep network readily learns common facts but struggles to learn rare ones. All deep networks learn from data, and each training example is given equal weight, so common facts benefit from sheer abundance while rare facts receive very little learning signal. AI set out, from its first principles, to mimic human intelligence, and on this point LLMs fall short: humans can comprehend rare events easily, but these models cannot. To address this, researchers have started spending extra compute on such events, which is expensive. Yet the world is full of uncertainty, and it is impossible to learn every rare event from a static snapshot of it. This leads to failures like miscounting the "r"s in "strawberry," which illustrate the current state of LLM-based AI.
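To make the imbalance between common and rare facts concrete, here is a toy calculation of how one pass over a dataset distributes learning signal when every example counts equally. The corpus, the fact labels, and the 9,990-to-10 split are invented purely for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each entry stands for one training example about a fact.
corpus = ["common_fact"] * 9_990 + ["rare_fact"] * 10


def update_share(examples: list[str]) -> dict[str, float]:
    """Fraction of per-example updates each fact receives in one epoch,
    assuming every example is weighted equally."""
    counts = Counter(examples)
    total = sum(counts.values())
    return {fact: n / total for fact, n in counts.items()}


shares = update_share(corpus)
for fact, share in shares.items():
    print(f"{fact}: {share:.2%} of updates")
```

Under this assumption the rare fact receives only a tiny fraction of the updates, which is why extra compute or targeted data is needed to learn it well.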

We have the problems, but what about solutions?

Some solutions are emerging that could put LLMs on new growth trajectories without being so compute-hungry.

High-quality data can compensate for the extra compute of larger models. Various studies have shown that a training dataset can often be reduced in size without hurting performance, which shortens training time and therefore lowers the compute needed. Techniques such as model distillation, chain-of-thought reasoning, longer context lengths, retrieval-augmented generation, and preference training to align models with human feedback also reduce the need for more weights or for expensive, prolonged training. Models that use these techniques show improved performance over those that do not.
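Of the techniques listed above, model distillation is the easiest to sketch in a few lines: a small "student" model is trained to match the softened output distribution of a larger "teacher". The example below shows one common form of the loss, the KL divergence between temperature-scaled softmax outputs. The logits are made up, the function names are illustrative, and the extra scaling by the squared temperature that many implementations apply is left out for brevity.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits for a three-class example.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]

print("good student loss:", distillation_loss(teacher, good_student))
print("poor student loss:", distillation_loss(teacher, poor_student))
```

The student whose outputs resemble the teacher's gets a loss close to zero, while the mismatched student gets a much larger one. That loss is the signal that lets a small model inherit behavior from a large one.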

Architecture is a crucial part of AI research. Efforts by frontier labs to adopt new architectures can open up new paradigms of scalability. New model architectures, like Yann LeCun's world models, could be used, but this shift in focus comes with the heavy cost of the necessary research and the risk of losing market share to competitors. It also requires careful, innovative experimentation to bring such models to users as effectively as LLMs.

Note: While we discussed how less compute can yield good results, this does not mean the environmental impact of AI will shrink. The carbon footprint of AI compute comes less from training (a one-time, localized process) than from the widespread use of AI. As the number of AI users grows, so will its carbon footprint. This is an application-level issue, not a research-level issue.

Conclusion

Thank you for reading this blog, where I share my understanding of Sara Hooker's essay and the event held at the Hugging Face ML Club. I aimed to simplify the text and add my own viewpoints. If you find any contradictions or incorrect claims in the blog, please let me know. Feedback is greatly appreciated.

Once again, thank you!

Resources for further reading: