I’ve been thinking a lot recently about this LinkedIn post (shout out to my colleague, Michelle Frost, who first brought it to my attention), which makes the point that LLMs--initially built for translation and pattern matching--were then hyperscaled via Transformers and enormous amounts of money into what the general public now sees as the leading example of AI.
Are they truly a path to AGI (Artificial General Intelligence), a broad term for a system that matches or exceeds human capabilities across a wide range of tasks? Currently, we have ANI (Artificial Narrow Intelligence): systems that can far exceed humans in closed domains, such as games like Chess or Go. True AGI, however, remains in the future, even as companies like OpenAI publicly point to it as a goal for their products.
Can the LLM approach ever reach AGI? I suspect not. The underlying architecture is designed to approximate, pattern match, and guess; it was originally built for language processing and translation. There is no deep inference of meaning or logic there, no matter how many hidden states the models create.
The algorithms used to train LLMs have been around for quite a while. What was lacking until Transformers arrived in 2017 was the ability to scale them massively. OpenAI’s ChatGPT was among the first Transformer-based models released to the general public. While there are other LLMs out there, we will focus on GPT’s history here to give a sense of how quickly these models have grown.
Brief LLM History
Let’s use GPT as a proxy for advancements in the field in general. We can track the rapid changes based on their model releases from 2018 to the present.
GPT-1, released in 2018, had 117M parameters and was trained on BookCorpus, a dataset of 7,000 self-published books. A “parameter” is a numerical value the model learns during training. Broadly speaking, more parameters mean a model can store more complex patterns and perform better, although at a certain point (as we’ll see below) efficiency matters more than sheer size.
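To make “parameter” concrete, here is a minimal sketch using PyTorch. The layer sizes are made up for illustration--this is not GPT-1’s actual architecture--but the counting is real: every weight and bias the model learns is a parameter.

```python
# A toy model whose parameters we can count. Layer sizes are illustrative.
import torch.nn as nn

tiny_model = nn.Sequential(
    nn.Embedding(50_000, 768),   # token embeddings: 50,000 x 768 weights
    nn.Linear(768, 768),         # one dense layer: 768 x 768 weights + 768 biases
    nn.Linear(768, 50_000),      # projection back to the vocabulary
)

total = sum(p.numel() for p in tiny_model.parameters())
print(f"{total:,} learned parameters")  # ~77 million; GPT-1 had ~117M, GPT-3 175B
```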
GPT-2 followed in 2019 with a large leap in parameters (1.5B) and a much larger training dataset, WebText, containing 8 million documents and 40GB of text. The model produced more fluid text but still struggled with accuracy and relevancy. It was also the last open-source GPT model; with GPT-3, OpenAI shifted to proprietary, closed models and datasets.
GPT-3 (2020) was the big breakthrough: for the first time, a publicly available LLM generated responses that felt vaguely natural, though it still suffered from hallucinations and errors in back-and-forth conversations. It featured 175B parameters, a much larger (though proprietary) dataset that ChatGPT itself estimates at ~570GB, and a larger “context window” (the number of tokens a model can process at once). Context windows are increasingly important because LLMs are inherently stateless--each request is independent of previous ones, much like HTTP. On the web, we use cookies to remember “state,” such as whether a user is logged in to a website. In an LLM, the context window allows the model to process more information--vital for analyzing long documents--and to maintain a back-and-forth conversation history with the user.
To highlight this point: if I am using an LLM to help with programming, it is very helpful if I can send the entire codebase to the LLM so that it fits in the context window and all subsequent requests can reference the same code.
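To see this statelessness in practice, here is a rough sketch using the OpenAI Python client (the model name is illustrative, and an API key is assumed). Because the server remembers nothing between calls, the client must re-send the entire conversation with every request:

```python
# Sketch: the client re-sends the whole conversation history on every request,
# because each API call is independent of the last. Assumes the `openai`
# package and an API key; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=history)
    answer = response.choices[0].message.content
    # Store the reply so the next request carries the full context.
    history.append({"role": "assistant", "content": answer})
    return answer

ask("My project uses Django 5.")
ask("What ORM does my project use?")  # answerable only because history was re-sent
```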
Three years passed before GPT-4 was released in 2023. The parameter count was not publicly disclosed, but it was likely over one trillion; the focus shifted to efficiency and smarter parameters rather than simply more of them. GPT-4 demonstrated near-human-level performance on many exams, such as the US bar (law) exam, GREs, and AP courses. It processed both text and images quite well and featured a much larger context window of 8,192 tokens.
The Competition
Amidst these rapid improvements to GPT, a growing ecosystem of competing LLMs emerged. In 2023, several “open” models were released for the first time: Mistral, Falcon, and Meta’s Llama. Mistral and Falcon are available for commercial use and reveal their model weights (meaning anyone can run the models) and training code, though both disclose only limited information about their training data. Llama has open weights too, but with licensing restrictions for large companies, and it discloses neither its training code nor its training data.
In the proprietary LLM space, many competitors popped up, most notably Anthropic, founded by former OpenAI engineers. Anthropic’s Claude model was first released in March 2023 with a 9,000-token context window (more than GPT-4), and by September 2024 featured a 500,000-token context window, along with GitHub integration and other advanced features. This made it arguably better than GPT for many coding tasks, as well as for analyzing entire books, legal documents, financial reports, and so on, since all of the data could be fed directly into the context window for processing.
The prevailing thought at the time was that there were still unlimited gains to be had by increasing the computation and data behind LLMs. As a result, billions of dollars of VC and corporate money poured into this LLM arms race.
2025
In January 2025 there was a “Sputnik moment” with the release of DeepSeek from a small Chinese company. DeepSeek roughly matched the performance of leading LLMs, including GPT-4, while open-sourcing its model weights and allowing commercial use. The DeepSeek app was the most downloaded app on the Apple App Store the weekend it was released. Even more remarkably, DeepSeek claimed its training costs were only $6 million, versus the billions spent by competitors.
We don’t know how accurate that $6 million figure truly is, but training costs were clearly far lower. DeepSeek used a number of novel techniques, including “knowledge distillation”: training smaller student models to reproduce the behavior of larger teacher models, most notably Meta’s Llama and Alibaba’s Qwen.
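To give a flavor of how knowledge distillation works, here is a minimal PyTorch sketch--the models and data are stand-ins, not DeepSeek’s actual setup--in which a student’s output distribution is nudged toward a teacher’s:

```python
# Minimal knowledge distillation sketch: a small "student" learns to match
# the output distribution of a larger, frozen "teacher". Logits here are
# random stand-ins for real model outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then push the student toward the teacher.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 as in Hinton et al. (2015)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

teacher_logits = torch.randn(8, 50_000)                       # e.g. from a Llama or Qwen forward pass
student_logits = torch.randn(8, 50_000, requires_grad=True)   # the model being trained
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```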
DeepSeek’s release of a cost-effective LLM led to a $600B drop in market value for Nvidia, the company behind the GPUs used to train most models.
The following month, February, GPT-4.5 was released, featuring a 30x price increase over GPT-4 while providing, so far at least, only marginal improvements.
The Future
The back-to-back releases of DeepSeek and GPT-4.5 raise the question of whether we are finally in an era of diminishing returns for LLMs. Has the previous approach of ever-more computation and data finally reached a limit? Gary Marcus famously predicted this back in 2022 in his piece, Deep Learning Is Hitting a Wall.
So what’s next? I suspect LLMs will continue to make marginal improvements, training costs will decrease, and we’ll see more and more open-source options, as well as more LLMs trained on specific datasets and use cases, such as JetBrains’ Mellum. Will open-source models that are good enough win out over proprietary ones? Or will proprietary models continue to improve at a rate that justifies their growing price tag?
I suspect we will also see improvements in the broader LLM ecosystem, such as ever-increasing context windows, which open up new use cases across more industries.
The larger issue of model inaccuracies and hallucinations persists. If a user asks a sufficiently niche question that the model wasn’t trained on, it will struggle to generate a good response. Likewise, a model will not handle questions about events after its training cutoff well. But there are techniques, such as Retrieval-Augmented Generation (RAG), that promise partial solutions. RAG allows a model to pull in external data sources at query time--for example, by doing its own searches--to supplement or double-check its initial responses.
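Here is a stripped-down sketch of the RAG idea in Python. The retrieval step is a toy keyword match rather than the vector search real systems use, and call_llm is a hypothetical stand-in for any chat-completion API:

```python
# RAG sketch: retrieve relevant documents, then prepend them to the prompt
# so the model answers from fresh data instead of stale training data.
documents = [
    "GPT-4.5 was released in February 2025.",
    "DeepSeek open-sourced its model weights in January 2025.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Toy ranking: documents that share the most words with the query win.
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query, documents))
    prompt = f"Using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical stand-in for any LLM API call
```

The key design point is that the model itself is unchanged; it is the prompt that gets augmented with retrieved facts at query time.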
And in the AI-assisted coding realm--where I work at JetBrains--things are really heating up with the rise of agents that not only answer questions but update code on their own. I don’t think this means the end of programmers, something I wrote about previously in Thoughts on Vibe Coding and which Gary Marcus (again) wrote about more eloquently soon thereafter. But it is clear that programming tools are changing rapidly, and most professional programmers will likely use LLMs to help write code going forward.
But while LLMs have dominated the spotlight in the quest for AGI, it’s important to remember that they are not the only approach. There are other potential pathways, such as reinforcement learning, neuro-symbolic systems, and hybrid models that combine multiple approaches. It is fair to say that LLMs have demonstrated the power of scale, but given that they are still--by architecture and definition--trying to average their way to the next response, they are not a straight path to true AGI.
Will we ever get there? Personally, I have doubts. We still can’t even define intelligence in humans, let alone machines. Maybe there is some sort of incompleteness theorem that AGI attempts will bump up against. Or not. I am keeping an open mind. But it is clear to me that LLMs are a game-changer but not the end of the game when it comes to AI surpassing humans.
Big thanks to my colleague, Michelle Frost, for detailed feedback and suggestions on an earlier draft of this piece.