DEV Community: Arjun Singh

Why Current LLMs Can't Reach AGI (and more)

Arjun Singh — Mon, 20 Apr 2026 05:11:41 +0000

The AI field is celebrating benchmarks while the ceiling gets closer.

Every few months, a new model comes out, and more often than not with more parameters than their previous versions. And it is being celebrated in the name of scaling, but scaling towards memorisation is antithetical to generalisation, which is ML's foundational goal.

Increasing parameter count works, but it proves that the current architectures, and more than that, the training paradigm is flawed. Because a truly general architecture wouldn't need exponentially more parameters to handle new tasks — it would generalise.

In this article, I'm going to describe why this happens, particularly with Transformer-based models, and why other experimental model works, but fail at scale. At last, I propose a new training paradigm which could help in addressing this problem.

Transformers Are Sophisticated Memorisation Engines

Metaphorically, Transformers can be described as big libraries, and Attention as its Librarian. Adding more layers is like adding more shelves, which increases the space to add more books, or in this case, more patterns to retrieve from.

When you ask a question, the Librarian doesn't think — it searches. It finds the most relevant answer for your query and hands it back. The answer was there. Nothing new was created.

In a 2025 study by Amazon Scientist Joshua Barron and Devin White, Too Big to Think, which was presented in ICML 2025 "Tiny Titans" workshop, they showed that bigger models memorise factual data correctly, but fall short in extrapolating, while smaller models excel in extrapolation but lack the size to store factual data.

But even without the study, this phenomenon can be understood by the fundamental "Bias-Variance" trade-off in ML, where if the model becomes complex enough, then it starts memorising even the noise present in the data. But, because of Attention's context based weighting of parameters, this gets overlooked. Attention creates the illusion of generalisation — the model looks like it's reasoning, but it's only selecting from what it already memorised, not deriving anything new. And it gets worse.

Because there's no actual reasoning, LLMs mostly rely on its own data distribution to generate the answer. And when someone asks a roundabout question that wasn't present in their training data, they trip. Like in a recent meme about ChatGPT, where a user asked:

"I need to get my car washed and the car wash is only 300 feet away. Should I walk or go by car?"

And ChatGPT told user to walk. Initially I thought it was just a meme, and the creator of it might have tweaked the answer. But while writing this article, I tried myself and was shocked. With Google's Gemini as an outlier, ChatGPT, Claude, Kimi, and Mistral all four chose walking

This isn't a rare edge case, it's a systematic failure that appears whenever the problem requires modeling consequences rather than matching patterns. This showcases how heavily models rely on their training data, and don't actually reason about the user query.

In recent advances, frontier labs have started focusing on reasoning, resulting in heavy use of internal reasoning, or formally known as "Chain-of-thoughts." But I'd argue, that it is a poor imitation of reasoning. While it helps, to some degree, by moving the model's state to activate more areas of the model that could be helpful in the context, it doesn't capture how a human would reason. Humans don't think like —

"The user is showing....."
"I'll check on this then work on that.."
Etc.

A human's reasoning is more abstract, and more associative. Humans don't only think what's directly relevant to the context, they also think what could be metaphorically correlated with the context. Like in the library metaphor at the top of this section. A LLM wouldn't reason like that, unless explicitly prompted to — something a human never needs.

Associative thinking is something that a truly intelligent system like AGI would need. Because a system that can only reason and connect dots when explicitly prompted to, shows it isn't reasoning. It is also needed for model to be able to move beyond the data they've been trained on and extrapolate to new ideas.

While, CoT have improved models. Performance on benchmarks isn't the same as reasoning — as the next section will show

The Benchmark Illusion

Every new model celebrates 90%+ scores on math, coding, and QA benchmarks. Yet the moment you put them in the real world — long contexts, ambiguous instructions, actual consequences — they misunderstand, drift, and cheerfully do exactly what they were told not to do.

One of the most infamous recent cases involved Meta Superintelligence Safety Director Summer Yue, where her agent started deleting emails from her inbox, when it was told only to "suggest" changes.

Now some might argue, that the prompt was too vague. But that's the point. Because the model didn't actually reason, it never understood the task. This wasn't just a vague prompt problem, but also a context drift problem. Because Yue's main email inbox was so big that, while going through it, her agent lost track of the initial prompt — one it had never properly understood to begin with, and started deleting emails.

It's the Goodhart’s Law in action. "When a measure becomes a target, it ceases to be a good measure."

The focus has been, quietly, shifted from imitating intelligence and reasoning, to getting better scores in some benchmark. Even if, the benchmarks are being used as the driving factor to move the field, those benchmarks are being chosen deliberately. Labs handpick the best scores, or use multiple different models to show higher benchmark results. While this could help in getting more funding, it doesn't help much in moving the field further.

The problem isn't benchmark chasing, the problem is there's a whole world outside of coding, maths and booking flight tickets, that is getting forgotten. Ask 10 models to write a fantasy story 3 times and see how they use the same plots and tropes.

A recent paper in Computers in Human Behavior: Artificial Humans Vol. 6, showed that each additional human-written essay contributed more new ideas than a GPT-generated essay. The problem here is the same as "garbage-in, garbage-out." Only difference is that neither the data going in nor is the data coming out is garbage. But it's the same data. Memorisation of a data distribution can never produce outputs outside that distribution.

In my view, it is a trade-off. In the pursuit to reducing hallucinations, training paradigms have become too brittle that focuses more on memorisation, than in the reasoning. RLHF and alignment techniques penalise uncertain outputs — which reduces hallucination but also reduces the model's willingness to venture beyond its training distribution. You get a more obedient retriever, not a better reasoner.

Why Alternatives Haven't Solved It

Transformers may be sophisticated memorisation engines — but they work. At least for in-context tasks, where alternatives like Mamba and RWKV fail because of lossy compression (Formally known as State Bottleneck). By compressing the past tokens into a fixed, non-accessible hidden state, they lose the ability to recall exact information in a long conversation, a limitation often demonstrated in needle-in-a-haystack benchmarks.

Now the problem is, on one side we have models that remember everything, but makes computation quadratically expensive, and on the other side, models that scale linearly but are only good in semantic recall, not in verbatim. But, this problem might have already been solved by biology, and even us.

Reconstructive Memory, a neuroscience concept that suggests that the brain does not store memories as it is. Instead, human brain uses reconstructive recall — a sparse hippocampal index that rebuilds relevant past information on demand without storing the entire sequence or throwing it away forever.

Or in Computer Science, data compression. How zipping reduces GBs of data into MBs by compressing redundant data. We can compress redundant information from a chat session, and instead of decompressing it all, apply context-aware decompression and feed that retrieved information instead of passing the whole conversation.

These both ways are implementable in current models, by using a hybrid of Mamba's hidden states and a RAG of the actual conversation. The model could keep a hidden state like Mamba to have semantic understanding of the conversation and use the RAG for factual data. This could be directly used in Mamba models right away.

Because Mamba models are more prone to forget older data than recent ones, we can retrieve the context-aware factual data from chat RAG and feed it into Mamba to update its hidden state and resurface the factual data. This way Mamba could have the ability to keep verbatim information. However, this is just a hack for Mamba to solve its verbatim incapability, and doesn't actually solve the reasoning or AGI problem, the main goal of this article.

Moving Beyond Next Token: Predicting The Future

This is the last section, and it proposes something different. Not a bigger model, not a better benchmark score — but a different way to train entirely.

The training paradigm is called Developmental Learning. The closest things that I can refer to as Developmental Learning are Yann LeCun's Joint Embedding Predictive Architecture (JEPA) and World Models. But instead of working in embodied systems, Developmental Learning moves beyond embodied systems and extends the concept to the internal reasoning of an LLM.

Current training paradigms use static-IID data. While good to memorise the factual information, it does not help model in getting understanding of the actual world, that Yann LeCun has long advocated for, and I, myself, have independently concluded. It doesn't train model to learn that their actions have consequences. It doesn't train model to learn cause and effect.

This is the same thing that Developmental Learning addresses. By training the model in way that it not only predicts the next token, but instead the next future state that'll come because of its current action.

Consider a scenario where the model is standing at a cliff edge, and asking it to take an action and predict what the future state would be. Say it predicts to take a step further and predicts that nothing will happen, but instead it falls. Or leaving it in a open-world game, where it decides to beat up bystanders and that predicts nothing will happen — but instead the bystanders fight back.

Training the model in the divergence of its predicted future state and the actual future is the objective of Developmental Learning, with the goal of making it learn cause and effect.

By training models this way, the models will be forced to make more coherent outputs, and actually reason about the user queries —instead of just retrieving the memorised information. It'll also help in minor hiccups like the earlier car-wash case, where instead of solely relying on its training data distribution, model will actually reason about the future that its suggested action will create. Or in the high-stake cases like Summer Yue's, where model would not take actions on it own because that will not be the desired future state based on the user query.

While currently Developmental Learning is more a theoretical concept that needs some more refinement, it could help the models to actually reason about their actions.

If the only path forward is scale, then we are not building intelligence — we are building bigger libraries. At that point, we might as well hardcode every possible answer and be done with it. But intelligence does not work this way in nature. Einstein and a stranger on the street share roughly the same 86 billion neurons. The difference is not the size of the brain, but the organisation within it — how efficiently it learns to model consequences, not how much it can store. Developmental Learning is not a promise. It is a bet that the field has been optimizing the wrong variable.

Your Agentic AI's Safety System Gets Dumber As It Thinks Longer (And how to fix it)

Arjun Singh — Mon, 30 Mar 2026 11:35:11 +0000

Agentic AI systems fail in production all the time. The usual fix? A strongly-worded system prompt. That's not safety engineering, that's hoping the model behaves. Here's why prompt-based guardrails are fundamentally broken, and what an actual architectural solution looks like.

The Problem

LLMs generate text by navigating a vector space, finding relevant regions based on input context. But, safety guardrails added via system prompts are also just tokens competing for attention like everything else.

It introduces two failure modes:

Jailbreaking — because all possible outputs exist somewhere in the model's vector space (it's a product of pretraining on human-generated text, including harmful content), prompt-based guardrails can only make certain regions harder to reach, but not impossible. With the right prompt framing you can always nudge the model's internal state toward those regions, which generates these harmful responses. You can't delete a region from the vector space with a prompt.
Context Window Dilution — transformers use attention, which is essentially a key-value lookup weighted by relevance. A guardrail system prompt at position 0 of a long context competes with everything that comes after it. As the context fills, nearby tokens dominate attention and the guardrail's influence weakens — it gets "forgotten" not because the model ignores it but because attention naturally prioritizes recent and contextually relevant tokens. The guardrail was never architecturally special, just another token sequence.

The Solution — Overseer Architecture

Instead of relying on a guardrail living inside the main model's context, use a separate small fine-tuned LLM as an external validator — the Overseer.

How it works:

Overseer is initialized once with just the guardrails, its state is set at that point
It never sees the full growing conversation context
It only ever receives prompt-response pairs from the main model
It's fine-tuned specifically to detect when a response violates the original guardrail intent

The Key Insight is separating the guardrail from the generation context entirely, rather than trying to make it persist inside the same context window where it will always eventually lose to dilution.

Connect on LinkedIn for further Discussion

Fine-tuning a 0.5B LLM to run on a potato laptop

Arjun Singh — Mon, 23 Mar 2026 15:03:10 +0000

I wanted a tool that could automate stuff on my laptop without sending everything to the cloud. Problem: my laptop is ancient. There's only so much a laptop with Intel Pentium CPU, 8GB RAM and HDD can do. Regardless of the constraints, I set out the sails to work on it.

I won't say it's SOTA quality, but it's decent, and it works.
The first problem was very clear: to find a SLM (Small Language Model) that is good, even for its size. And after searching, for a bit I settled with Qwen2.5-0.5B-Instruct.

Now, for the data generation and how I did it. This part was the most tedious thing. First, I prompted Qwen2.5-7B-Instruct to generate instructions (like- Open Reddit, Copy X file from Y to Z, etc). Then, I iterated through each instruction, and prompted model, to first generate some paraphrases, and then corresponding elements for input and output. But the data wasn't clean, I had to regenerate numerous times. Even after that, when I started fine-tuning the model, it kept overfitting. So I took a look at the data more closely and found that even after all the regeneration, the data wasn't consistent. Examples, where directory field had Windows file-paths, response still had Linux commands in it. So I decided to go back and regenerate the data from scratch again, and then, finally, I got data that was usable to fine-tune the model.
The structure I decided my model to train is like:
Input prompt

{ 
  "task": <task>,
  "directory": <directories>, # Needs to pass full paths right now
  "available_hotkeys": <list_of_hotkeys_relevant_for_this_task>,
  "iteration_context": <iteration_context> # Usually null only used when the task is repetitive
}

Response:

{
  "task_type": <task_type>, # Model predicts task types: atomic | repetitive | clarification 
  "output": {
    "execution_plan": {
      "hotkeys": [<hotkey_plan>],
        "cli": [<cli_plan>]
    }
  }
}

Well now, you might think why I chose this structured input-output instead of doing something like streaming the output and doing something what people usually call tool-calling. But let's be honest, there's a problem with that: What if model fails midway and starts emitting wrong output? Or worse, destructive outputs

You might be aware of a recent case that was trending, where the Meta researcher Summer Yue told her OpenClaw agent, quote: "Check this inbox too and suggest what you would archive or delete, don’t action until I tell you to", and it started deleting her mails without permission.

Instead, having the model output its full plan before execution, you get the power, to either let it execute or not. But this might be a bad UX choice, not for everyone, but some it might be. Though there are some ways, hypothetically to counter that too, at least in my POV. But it'll be off-topic to discuss them in this post.

So that's why I chose to go with having model output full execution plan before taking actions. Anyway, this is how ACE, yes that's what the tool is called, ACE - Adaptive Command Executor, works.

Though calling it Adaptive might be overkill, given that the model is a bit overfit and doesn't correctly emit hotkeys because of some of my beginner mistakes during fine-tuning.

Now for some minor details; other than Qwen as its main generative model, ACE uses all-MiniLM-L6-v2 to embed the tasks and hotkeys' descriptions, then uses cosine similarity to retrieve hotkeys for the current task. It might be inefficient to recompute the embedding of descriptions, so instead it uses .npz file to cache the description and uses that. During loading models, it also checks if there's been any changes to the hotkeys.json and recomputes the descriptions accordingly.

As for LoRA configuration, I used, r = 16, alpha = 32, and dropout = 0.2, and trained for 1750 steps.

In the future versions, I plan to do something similar to search for the files, that user wants to perform operations on, and remove the need of passing full paths.

All of this, going back and forth between data generation and fine-tuning took about 6 weeks and 2 extra weeks coding for model to actually work as an automation system. It might've taken less if I had incorporated more use of AI, but then there wouldn't have been any learning of things when facing errors. In all of this I learned quite a lot. Foremost is the fine-tuning, and got a gist of how RAG system works while implementing the hotkey retrieval. Other than that, I learned why data quality is the most important thing, not just in LLM fine-tuning, but in training and any kind of model.

Well, that's it. If you want to look at the code or models:
Code: GitHub
Models: HuggingFace