Maxim Saplin

Convergence of LLMs: 2024 Trend Solidified by Llama 3.1 Release

The recent release of Llama 3.1 was reminiscent of many releases this year. It underlined a trend that formed in the first half of 2024:

  • Closed SOTA LLMs (GPT-4o, Gemini 1.5, Claude 3.5) had marginal improvements over their predecessors, sometimes even falling behind (e.g. GPT-4o hallucinating more than previous versions).
  • Smaller open models were catching up across a range of evals.

There have been many releases this year. OpenAI introduced GPT-4o, Anthropic brought its well-received Claude 3.5 Sonnet, and Google's newer Gemini 1.5 boasted a 1 million token context window.

Among open models, we've seen CommandR, DBRX, Phi-3, Yi-1.5, Qwen2, DeepSeek v2, Mistral (NeMo, Large), Gemma 2, Llama 3, and Nemotron-4. Every time I read a post about a new model, there was a statement comparing its evals against OpenAI's models and challenging them.

Take these few evals from the Llama 3.1 blog post as an example:

Llama 3.1 evals

Notice how 7-9B models come close to or surpass the scores of GPT-3.5 - the King model behind the ChatGPT revolution. Also, see how models nearing 100B params confidently surpass GPT-3.5.

This is the pattern I noticed reading all those blog posts introducing new LLMs. Judging by their evals, models converge to the same levels of performance: LLMs around 10B params converge to GPT-3.5 performance, and LLMs around 100B and larger converge to GPT-4 scores.

Another colorful picture supporting this statement is Aider's recent eval of coding capabilities:

Aider coding eval

The Ceiling

The marginal improvements, eval scores fluctuating within the margin of error, the "vibe checks" and feedback users share on the SOTA LLMs... All of that suggests that the models' performance has hit some natural limit. LLMs do not get smarter. Take, for example, the SEAL LLM leaderboard:

SEAL leaderboard

Couple this saturated LLM performance with all the talk of a Gen AI bubble and the little tangible value the technology has brought so far... Titles like "Gen AI: too much spend, too little benefit?" or "So far the technology has had almost no economic impact"...

The technology of LLMs has hit the ceiling with no clear answer as to whether the $600B investment will ever have reasonable returns.

Efficiency, not Effectiveness

There's another evident trend: the cost of LLMs is going down while generation speed is going up, with performance across different evals maintained or slightly improved.

Take Anthropic and OpenAI models as an example:

| Model | Price (per mil. tok., input/output) | Speed (tok/sec) |
| --- | --- | --- |
| Claude 3 Sonnet | $3 / $15 | 63 |
| Claude 3.5 Sonnet | $3 / $15 | 79 |
| gpt-3.5-turbo-16k-0613 | $3 / $4 | ~40-50 |
| gpt-3.5-turbo-0125 | $0.5 / $1.5 | 83 |
| gpt-4-32k | $60 / $120 | ~22 |
| gpt-4o | $5 / $15 | 83 |

See how the successor either gets cheaper or faster (or both). The most drastic difference is within the GPT-4 family: judging by the table, gpt-4-32k is 8-12 times more expensive per token than gpt-4o, yet it generates roughly 4 times slower.
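To make the ratios concrete, here is a small Python sketch that recomputes them from the table above (the speed figures are the approximate tokens/sec values quoted there, not official numbers):

```python
# Cost/speed ratios within the GPT-4 family, using the figures from the table above.
# Prices are USD per million tokens (input, output); speeds are approximate tokens/sec.
models = {
    "gpt-4-32k": {"input": 60.0, "output": 120.0, "tok_per_sec": 22},
    "gpt-4o":    {"input": 5.0,  "output": 15.0,  "tok_per_sec": 83},
}

old, new = models["gpt-4-32k"], models["gpt-4o"]

print(f"input price ratio:   {old['input'] / new['input']:.0f}x")              # 12x cheaper
print(f"output price ratio:  {old['output'] / new['output']:.0f}x")            # 8x cheaper
print(f"generation speed-up: {new['tok_per_sec'] / old['tok_per_sec']:.1f}x")  # ~3.8x faster
```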

We see progress in efficiency - faster generation at lower cost - yet little improvement in effectiveness (evals).

What could be the reason? I can speculate that:

  • Closed models get smaller, i.e. get closer to their open-source counterparts.
    • The original GPT-3.5 had 175B params, yet the data floating around the internet puts the recent GPT-3.5-Turbo in the range of 20B to 96B params.
    • The original GPT-4 was rumored to have around 1.7T params, while GPT-4-Turbo may have as many as 1T params.
  • Closed models adopt the efficiency tricks the open-source world has brought over the past years, e.g. Flash Attention, quantisation, etc. (a minimal sketch of the latter follows below).
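To illustrate that second bullet, here is a minimal sketch of 4-bit quantisation with the Hugging Face transformers + bitsandbytes stack. The checkpoint name, prompt, and Flash Attention flag are illustrative assumptions (a CUDA GPU is assumed, and the flash-attn package is needed for the latter), not a description of what the closed labs actually run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                         # spread layers across available GPUs
    attn_implementation="flash_attention_2",   # optional, requires flash-attn installed
)

prompt = "Explain why smaller quantised models are cheaper to serve."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```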

Can it be another manifestation of convergence? This time the movement of old-big-fat-closed models towards new-small-slim-open models.

Top comments (4)

Ignacio García-Carrillo

Agree. My customers (telco) are asking for smaller models, much more focused on specific use cases, and distributed throughout the network in smaller devices. Superlarge, expensive, and generic models are not that useful for the enterprise, even for chats. Looks like we may see a reshape of AI tech in the coming year. If done right, the next AI winter could be mild.

Maxim Saplin • Edited

The promise and edge of LLMs is the pre-trained state - no need to collect and label data or spend time and money training your own specialised models - just prompt the LLM. Their ability to be fine-tuned with a few examples to specialise in narrow tasks is also fascinating (transfer learning). Yet fine-tuning has too high an entry point compared to simple API access and prompt engineering. I hope further distillation happens and we get great, capable models - perfect instruction followers in the 1-8B range. So far, models below 8B are way too basic compared to larger ones.
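For reference, "fine-tuning with a few examples" these days usually means parameter-efficient fine-tuning. A minimal sketch with the peft library's LoRA adapters follows; the checkpoint and target modules are illustrative assumptions for a Llama-style model, and the actual training loop is just the regular transformers Trainer:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # illustrative checkpoint
base = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach small low-rank adapters instead of updating all of the base weights.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base model's params

# From here, train `model` with the regular transformers Trainer on a handful of
# labelled examples, then ship just the adapter weights.
```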

Ignacio García-Carrillo

True, I'm guilty of mixing real LLMs with transfer learning. My point is that perhaps the way to make money out of this is not LLMs, or not only LLMs, but other creatures created through fine-tuning by big corporations (or not necessarily so big corporations). Agree on the distillation and optimization of models so smaller ones become capable enough and we don't need to spend a fortune (money and energy) on LLMs.

Akshay Ballal

I seriously believe that small language models need to be pushed more. To solve some real-world problems today, we need to tune specialized small models. Having these large models is good, but very few fundamental issues can be solved with them.