The AI field is celebrating benchmarks while the ceiling gets closer.
Every few months, a new model comes out, more often than not with more parameters than its predecessor. The growth is celebrated in the name of scaling, but scaling towards memorisation is antithetical to generalisation, ML's foundational goal.
Increasing parameter count works, but the fact that it is needed proves that current architectures, and more than that, the training paradigm, are flawed. A truly general architecture wouldn't need exponentially more parameters to handle new tasks; it would generalise.
In this article, I'll describe why this happens, particularly with Transformer-based models, why experimental alternatives work but fail at scale, and finally propose a new training paradigm that could help address the problem.
Transformers Are Sophisticated Memorisation Engines
Metaphorically, a Transformer can be described as a big library, with Attention as its librarian. Adding more layers is like adding more shelves: more space for more books, or in this case, more patterns to retrieve from.
When you ask a question, the Librarian doesn't think — it searches. It finds the most relevant answer for your query and hands it back. The answer was there. Nothing new was created.
In a 2025 study, Too Big to Think, presented at the ICML 2025 "Tiny Titans" workshop, Amazon scientist Joshua Barron and Devin White showed that bigger models memorise factual data correctly but fall short at extrapolation, while smaller models excel at extrapolation but lack the capacity to store factual data.
But even without the study, this phenomenon follows from ML's fundamental bias-variance trade-off: once a model becomes complex enough, it starts memorising even the noise in the data. Attention's context-based weighting of parameters makes this easy to overlook. Attention creates the illusion of generalisation: the model looks like it's reasoning, but it's only selecting from what it has already memorised, not deriving anything new. And it gets worse.
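The trade-off is easy to demonstrate. Here is a minimal sketch, a toy curve-fitting setup rather than anything Transformer-specific: a high-capacity polynomial drives its training error towards zero by memorising the noise, and pays for it on unseen points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a simple underlying function: y = sin(x) + noise.
x_train = np.sort(rng.uniform(0, 3, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 3, 20))
y_test = np.sin(x_test) + rng.normal(0, 0.2, 20)

def fit_and_score(degree):
    # Least-squares polynomial fit at the given capacity.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple = fit_and_score(2)     # low capacity: cannot memorise the noise
complex_ = fit_and_score(15)  # high capacity: fits the training noise too

print(f"degree 2:  train={simple[0]:.4f}  test={simple[1]:.4f}")
print(f"degree 15: train={complex_[0]:.4f}  test={complex_[1]:.4f}")
```

The high-degree fit wins on the data it has seen and loses on the data it hasn't, which is exactly the pattern the Barron and White study reports at model scale.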
Because there's no actual reasoning, LLMs mostly rely on their own training distribution to generate the answer. And when someone asks a roundabout question that wasn't present in that data, they trip. Take a recent meme about ChatGPT, where a user asked:
- "I need to get my car washed and the car wash is only 300 feet away. Should I walk or go by car?"
And ChatGPT told the user to walk. Initially I thought it was just a meme, and that its creator might have tweaked the answer. But while writing this article, I tried it myself and was shocked. With Google's Gemini as the lone outlier, ChatGPT, Claude, Kimi, and Mistral all chose walking.
This isn't a rare edge case; it's a systematic failure that appears whenever the problem requires modelling consequences rather than matching patterns. It shows how heavily models rely on their training data, and how little they actually reason about the user's query.
Recently, frontier labs have started focusing on reasoning, resulting in heavy use of internal reasoning, formally known as "chain-of-thought." But I'd argue it is a poor imitation of reasoning. It helps, to some degree, by moving the model's state to activate more areas that could be useful in the context, but it doesn't capture how a human would reason. Humans don't think like:
- "The user is showing....."
- "I'll check on this then work on that.."
- Etc.
A human's reasoning is more abstract and more associative. Humans don't only think about what's directly relevant to the context; they also think about what could be metaphorically correlated with it, like the library metaphor at the top of this section. An LLM wouldn't reason like that unless explicitly prompted to, something a human never needs.
Associative thinking is something a truly intelligent system, an AGI, would need. A system that can only connect the dots when explicitly prompted to isn't really reasoning. Association is also what would let models move beyond the data they've been trained on and extrapolate to new ideas.
CoT has improved models. But performance on benchmarks isn't the same as reasoning, as the next section will show.
The Benchmark Illusion
Every new model celebrates 90%+ scores on math, coding, and QA benchmarks. Yet the moment you put them in the real world — long contexts, ambiguous instructions, actual consequences — they misunderstand, drift, and cheerfully do exactly what they were told not to do.
One of the most infamous recent cases involved Meta Superintelligence Safety Director Summer Yue, whose agent started deleting emails from her inbox after being told only to "suggest" changes.
Now some might argue that the prompt was too vague. But that's the point: because the model didn't actually reason, it never understood the task. This wasn't just a vague-prompt problem but also a context-drift problem. Yue's inbox was so big that, while going through it, her agent lost track of the initial prompt, one it had never properly understood to begin with, and started deleting emails.
It's Goodhart's Law in action: "When a measure becomes a target, it ceases to be a good measure."
The focus has quietly shifted from imitating intelligence and reasoning to getting better scores on some benchmark. Even where benchmarks are used as the driving factor to move the field, they are chosen deliberately: labs handpick the best scores, or use multiple different models to show higher benchmark results. That may help secure funding, but it doesn't do much to move the field forward.
The problem isn't benchmark chasing by itself; the problem is that a whole world outside of coding, maths, and booking flight tickets is being forgotten. Ask ten models to write a fantasy story three times each and watch them reuse the same plots and tropes.
A recent paper in Computers in Human Behavior: Artificial Humans Vol. 6 showed that each additional human-written essay contributed more new ideas than a GPT-generated one. The problem here is the same as "garbage in, garbage out," except that neither the data going in nor the data coming out is garbage. It's just the same data. Memorisation of a data distribution can never produce outputs outside that distribution.
In my view, it is a trade-off. In the pursuit of reducing hallucinations, training paradigms have become brittle, focusing more on memorisation than on reasoning. RLHF and alignment techniques penalise uncertain outputs, which reduces hallucination but also reduces the model's willingness to venture beyond its training distribution. You get a more obedient retriever, not a better reasoner.
Why Alternatives Haven't Solved It
Transformers may be sophisticated memorisation engines, but they work, at least for in-context tasks, where alternatives like Mamba and RWKV fail because of lossy compression (formally, the state bottleneck). By compressing past tokens into a fixed-size, non-addressable hidden state, they lose the ability to recall exact information from a long conversation, a limitation routinely demonstrated by needle-in-a-haystack benchmarks.
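A toy recurrence makes the bottleneck concrete. This is not Mamba's actual update rule, just a minimal decaying-state sketch with random stand-in embeddings: whatever a token contributes to the fixed-size state shrinks by a decay factor at every later step, so a fact buried under enough filler becomes numerically unrecoverable.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DECAY = 32, 0.9

# Hypothetical token embeddings (random stand-ins for a learned table).
needle = rng.normal(size=DIM)                   # e.g. "the passcode is 7194"
filler = [rng.normal(size=DIM) for _ in range(200)]

def run(tokens):
    """Toy decaying recurrence: h = DECAY * h + token. The state never grows."""
    h = np.zeros(DIM)
    for t in tokens:
        h = DECAY * h + t
    return h

def similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

short_ctx = run([needle])            # needle was the last thing seen
long_ctx = run([needle] + filler)    # needle buried under 200 filler tokens

print(f"needle similarity, short context: {similarity(short_ctx, needle):.3f}")
print(f"needle similarity, long context:  {similarity(long_ctx, needle):.3f}")
```

After 200 steps the needle's contribution is scaled by 0.9^200, roughly 10^-9, so the state carries essentially no trace of it, which is the needle-in-a-haystack failure in miniature.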
So on one side we have models that remember everything but make computation quadratically expensive, and on the other, models that scale linearly but are only good at semantic recall, not verbatim recall. Yet this problem may already have been solved by biology, and even by us.
Reconstructive memory is a neuroscience concept suggesting that the brain does not store memories as they are. Instead, the human brain uses reconstructive recall: a sparse hippocampal index rebuilds relevant past information on demand, without storing the entire sequence or throwing it away forever.
Or, in computer science, data compression: zipping reduces gigabytes of data to megabytes by squeezing out redundancy. We could compress the redundant information in a chat session and, instead of decompressing all of it, apply context-aware decompression and feed the model the retrieved information rather than the whole conversation.
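The compression half of the analogy is completely standard; here is a rough illustration using Python's zlib on a deliberately redundant transcript. The context-aware decompression step is the speculative part and is not shown here.

```python
import zlib

# A transcript with heavy redundancy, as real chat sessions tend to have.
chat = ("User: any update on the ticket? Assistant: still in review. " * 500).encode()

packed = zlib.compress(chat, level=9)
print(f"raw: {len(chat)} bytes, compressed: {len(packed)} bytes")

# The round trip is lossless, unlike a fixed-size hidden state.
assert zlib.decompress(packed) == chat
```

Repetitive conversational filler compresses by orders of magnitude while remaining exactly recoverable, which is the property a fixed-size recurrent state gives up.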
Both approaches are implementable in current models, as a hybrid of Mamba-style hidden states and a RAG index over the actual conversation. The model keeps a hidden state, like Mamba, for a semantic understanding of the conversation, and uses the RAG index for factual data. This could be applied to Mamba models right away.
Because Mamba models are more prone to forget older data than recent data, we can retrieve context-aware factual data from the chat RAG and feed it back into Mamba, updating its hidden state and resurfacing those facts. This way, Mamba gains the ability to keep verbatim information. However, this is just a hack around Mamba's verbatim incapability; it doesn't solve the reasoning or AGI problem that is the main goal of this article.
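A minimal sketch of the hybrid, with every name here hypothetical: word-overlap retrieval stands in for a real vector index, and a decaying sum stands in for Mamba's learned state update. The point is only the shape of the design: a lossy summary plus a lossless store, with retrieval feeding the store back into the summary.

```python
import numpy as np

class HybridMemory:
    """Toy hybrid: a fixed-size lossy state for gist + a verbatim store for recall."""

    def __init__(self, state_dim=16, decay=0.9):
        self.state = np.zeros(state_dim)  # lossy semantic summary (Mamba-like)
        self.turns = []                   # lossless store of raw turns (the chat RAG)
        self.decay = decay
        self.state_dim = state_dim

    def _embed(self, text):
        # Deterministic toy embedding: hash each word into a bucket.
        v = np.zeros(self.state_dim)
        for word in text.lower().split():
            v[sum(ord(c) for c in word) % self.state_dim] += 1.0
        return v

    def observe(self, turn):
        self.turns.append(turn)  # keep the verbatim copy
        self.state = self.decay * self.state + self._embed(turn)  # lossy update

    def retrieve(self, query, k=1):
        # Word-overlap scoring stands in for real vector search.
        q = set(query.lower().split())
        ranked = sorted(self.turns,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return ranked[:k]

    def recall(self, query):
        # Feed the retrieved verbatim turn back in, resurfacing it in the state.
        hits = self.retrieve(query)
        for h in hits:
            self.state = self.decay * self.state + self._embed(h)
        return hits

mem = HybridMemory()
mem.observe("my flight lands at 18:45 on Friday")
for _ in range(200):
    mem.observe("some unrelated filler chatter")

hit = mem.recall("when does the flight land")[0]
print(hit)
```

Even after 200 filler turns have washed the fact out of the decaying state, the verbatim store still surfaces it, and `recall` re-injects it so the state reflects it again.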
Moving Beyond Next Token: Predicting The Future
This is the last section, and it proposes something different. Not a bigger model, not a better benchmark score — but a different way to train entirely.
The training paradigm is called Developmental Learning. The closest things I can compare it to are Yann LeCun's Joint Embedding Predictive Architecture (JEPA) and World Models. But instead of being confined to embodied systems, Developmental Learning extends the concept to the internal reasoning of an LLM.
Current training paradigms use static, IID data. While good for memorising factual information, this does not help the model build an understanding of the actual world, something Yann LeCun has long advocated for and that I have independently concluded myself. It doesn't train the model to learn that its actions have consequences. It doesn't train the model to learn cause and effect.
This is exactly what Developmental Learning addresses: training the model to predict not only the next token, but the next future state that will follow from its current action.
Consider a scenario where the model stands at a cliff edge and is asked to take an action and predict the resulting state. Say it chooses to take a step forward and predicts that nothing will happen, but instead it falls. Or leave it in an open-world game, where it decides to beat up bystanders and predicts nothing will happen, but instead the bystanders fight back.
Training the model on the divergence between its predicted future state and the actual future is the objective of Developmental Learning, with the goal of making it learn cause and effect.
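In its simplest possible form, the objective looks like the sketch below. This is a linear toy, nothing close to an LLM: `world` plays the environment's ground truth, `w` the model's internal predictor, and the training signal is the squared divergence between the predicted and actual next state.

```python
import numpy as np

rng = np.random.default_rng(0)

def world(pos, action):
    """Ground truth: the actual consequence of taking `action` at `pos`."""
    return pos + action  # stepping forward really does move you forward

# The model's internal predictor of the future state, initially wrong.
w = rng.normal(scale=0.1, size=2)

def predict(pos, action):
    return w @ np.array([pos, action])

lr = 0.01
for _ in range(5000):
    pos, action = rng.uniform(0, 5), rng.choice([0.0, 1.0])
    x = np.array([pos, action])
    predicted = predict(pos, action)  # "nothing will happen if I step forward"
    actual = world(pos, action)       # ...but the world disagrees
    error = predicted - actual        # the divergence that drives learning
    w -= lr * 2 * error * x           # gradient step on the squared divergence

# After training, the model anticipates consequences instead of guessing.
print(predict(4.0, 1.0))  # close to 5.0
```

The substance of the proposal is, of course, in making this work when the "state" is the latent situation described by a conversation rather than a scalar position, but the loss keeps this shape: predicted future versus actual future.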
By training models this way, they will be forced to produce more coherent outputs and actually reason about the user's query, instead of just retrieving memorised information. It would also help with minor hiccups like the earlier car-wash case, where instead of relying solely on its training distribution, the model would reason about the future its suggested action will create. And in high-stakes cases like Summer Yue's, the model would not take actions on its own, because doing so would not lead to the desired future state implied by the user's query.
While Developmental Learning is currently more of a theoretical concept that needs further refinement, it could help models actually reason about their actions.
If the only path forward is scale, then we are not building intelligence — we are building bigger libraries. At that point, we might as well hardcode every possible answer and be done with it. But intelligence does not work this way in nature. Einstein and a stranger on the street share roughly the same 86 billion neurons. The difference is not the size of the brain, but the organisation within it — how efficiently it learns to model consequences, not how much it can store. Developmental Learning is not a promise. It is a bet that the field has been optimising the wrong variable.


