
Joshua Ballanco

Posted on • Originally published at manhattanmetric.com

LLMs - How Did They Get So Good?

In two earlier posts I covered a bit of the history of the current batch of AI models, what they are good at, and what they're not so good at. Had I published those posts a year earlier, we probably could have left the story there, but unless you've been living under a rock, it's clear that the situation has evolved rather rapidly. I will try, then, with this post to conclude the story of how we got to the place we are now (early 2026) and to provide maybe a hint of where we are going.

The Winter of Our AI Discontent

Lately it seems any moderately lengthy discussion of the current state of AI inevitably turns to the prospect of an "AI bubble". Whenever it does, I like to point out that, if it turns out that AI is being overhyped and that interest and investment in AI were to fall off a cliff at some point in the future, this wouldn't be the first time. In fact, in the field of AI there already exists a term to describe this phenomenon. It's called an "AI winter".

I also love to point out that the term "AI winter" was first coined...in 1984! If it seems like AI today came out of nowhere, it really is that classic case of an "overnight success ten years in the making". Except, in AI's case, it's been closer to eighty years.

Much of AI today is rooted in the idea of taking the way that neurons in biology operate and turning that into programs that computers can execute. This idea first appeared in a paper by McCulloch and Pitts published in 1943. If that date seems familiar, it should. That just happens to be the same year that the code breakers at Bletchley Park in the UK began building Colossus, arguably the first programmable electronic computer. In other words, the idea of how to make a computer "artificially intelligent" literally predates the first computer.

This is a pattern that would be repeated throughout the history of the development of AI. Researchers and mathematicians over the next several decades continued to devise new ways computers could potentially mimic human intelligence, only to run into the harsh reality that state-of-the-art computer technology of their time was woefully inadequate to execute on their plans. Each time, starting in the 1960s, excitement around the possibilities represented by these new techniques would generate hype, only for that hype to turn into despair.

Researchers, not typically the type to be discouraged, did not throw up their hands whenever this occurred. Instead, they would begin exploring new methods, new techniques for AI that might yield more immediate results. This produced approaches that go by names such as "symbolic logic", "expert systems", and "Bayesian networks". Eventually, though, in 2017 researchers at Google returned, once again, to the concept of having computers model the way that neurons work. Except this time they employed a bit of a mathematical trick that made it possible for computers to actually execute these programs at scale. The paper "Attention Is All You Need" introduced the world to the concept of the "transformer" and kicked off the most recent iteration of the AI hype cycle. Shortly after, in 2018, researchers at OpenAI released GPT-1, demonstrating the practical applicability of the technique.

Bigger is ... Better?

Almost as soon as GPT-1 was released, there were those who reflexively assumed that its arrival represented not a newly opened door in the pursuit of AI, but the apex of a pendulum swing that would reveal current computing power as incapable of unleashing its full promise. While the transformer was a revolutionary new technique, this line of reasoning went, it would not be the technique that realized the full promise of AI. After all, if the preceding decades had taught AI researchers one thing, it was that some new method, some new technique would always be necessary. An AI winter would arrive (by now the fourth or fifth, depending on how you count), the hype would die down, and the research community would re-enter hibernation until the next cleverly named approach could be tried.

That was not the approach that OpenAI took. Instead, the first thing they did after releasing GPT-1 was to make it bigger. A year later, GPT-2 was released and showed even better results than GPT-1. Yes, the cost of training GPT-2 likely exceeded tens, or possibly hundreds, of thousands of dollars...but it was better! Not willing to stop there, OpenAI did something crazy, audacious, and completely unprecedented in the history of AI research: they made it even bigger. GPT-3 was released in 2020 and it was even better (and more expensive to train) than GPT-2.

Around this time researchers at OpenAI began to realize that this wasn't a fluke, but a pattern. Unlike many prior approaches to AI that eventually hit a wall of exponential requirements for incremental benefits, the transformer model could be expected to continue improving in a steady fashion as more resources were dedicated to it, a pattern later formalized as the "scaling laws" of neural language models. What makes this observation so important is that, for perhaps the first time, it moved AI out of the realm of scientific research and into the realm of engineering. For each new data center, each advance in chip power or computer memory, we could expect concordant improvements in the capabilities of the AI models they would produce.

Of course, as I explored in my previous posts, one fundamental limitation to these systems was that they were still, ultimately, language models. As such, they could operate on language and concepts with ease, but ran into real challenges when it came to tasks that involved logic.

Fizzing the Buzz

I remember vividly the first time I interviewed a candidate for a software engineering job. I was still, myself, a very junior software engineer, but my boss came into my office one day and told me that I would be accompanying him in interviewing a new candidate for our team. To his credit, he actually let me take the lead in the interview. He didn't give me much more direction than, "ask the candidate to solve a programming problem so we can evaluate their skills." Thinking back to a problem I had faced a week or so earlier, I presented the candidate with a challenge that involved handling a stream of data, detecting certain events in the data stream, and adjusting the way the program would handle the data as a consequence. My goal was to see if this candidate could arrive at a solution that roughly approximated a state machine.

After presenting this problem to the candidate, I was met with a glassy stare. Thinking I hadn't done a good job of describing the problem, I started again from the beginning, this time trying to lay out a few more obvious hints as to the direction I was hoping they would take to arrive at a solution. As we all, my boss, the candidate, and I, stared at the white board without too many words exchanged, I continued to poke and prod the candidate toward a solution without much luck. It was at this point that, thankfully, my boss stepped in and took charge for the remainder of the interview.

Afterward, in his office, my boss said, "Josh, are you familiar with 'FizzBuzz'?" I admitted I was not, and so he described the problem to me: write a program that prints the numbers from 1 to 100, except that for every number divisible by 3 it prints "Fizz", for every number divisible by 5 it prints "Buzz", and for every number divisible by both 3 and 5 it prints "FizzBuzz".

"That's such a ridiculously easy problem!" I replied.

"Yes," he explained, "but most of the candidates we see for junior positions like this cannot solve it."

Now why do I tell this story? Alan Turing, often regarded as the founder of the field of computer science, proposed a test to determine when a computer had achieved human-level intelligence. Known as the "Turing Test", the idea is rather straightforward: if a human sitting at a computer terminal and carrying on a conversation with a partner cannot reliably determine whether that partner is a computer or a human, then the computer must possess human-level intelligence. There are myriad problems with using this as a definitive test of "artificial intelligence", not the least of which is that the average human is not that intelligent!

All joking aside, human intelligence differs from the strict logic-based intelligence that early computers excelled at. Ask someone to multiply two 10-digit numbers in their head and you might never in your life encounter a human capable of that task, but even the most rudimentary computer of the 1950s wouldn't blink before returning the answer. In light of this reality, one can see why Turing felt his "test" had merit. What Turing couldn't have foreseen was that in developing a computer program capable of conversing with another human in human-like terms, that computer program might actually lose the ability to perform instantaneous 10-digit number multiplication. And yet, that is precisely what was developed with transformers.

Crumbs of Logic

Many people can point to a singular teacher who had an out-sized impact on the course of their future education. For me, that was my middle-school math teacher, Mr. Ondas. It was he who let me sit in his classroom between the time the bus dropped me off in the morning and the start of classes to type out programs on his Apple IIe. It was also he who gave me my first book by Raymond Smullyan. If you're not familiar, Smullyan was most notable for being a proponent of "recreational mathematics". He wrote a number of books that presented various logic problems in fun and whimsical ways. One in particular that I remember fondly goes something like this:

Alice and Bob both have some cookies. Bob is upset because Alice has three-times as many cookies as him. So, Alice gives him one of her cookies, but Bob is still upset because now she has twice as many cookies as him. How many cookies did they each start out with?

What I love about this problem is that there are many approaches of varying complexity and sophistication that one can take to solving it. For example, you could write an equation for the starting state, A = 3B, and then another for the final state, (A - 1) = 2(B + 1). Substituting from the first into the second you get (3B - 1) = 2(B + 1), which, with a modest level of algebraic training, one can simplify and solve.

One can also recognize this as a system of equations that can be represented in linear algebra as:

$$
\begin{bmatrix}
1 & -3 \\
1 & -2
\end{bmatrix}
\begin{bmatrix}
A \\
B
\end{bmatrix}
=
\begin{bmatrix}
0 \\
3
\end{bmatrix}
$$

Of course, there's another way, as Smullyan points out. One can, quite simply, guess and check! If Bob starts with one cookie, then Alice would start with three. If she gives him one then she has two and he has two, so that's not it. If Bob starts with two, then Alice starts with six, and after the exchange she has five to his three. If Bob starts with three and Alice nine, then the exchange would conclude with Bob having four to Alice's eight, or twice as many. Eureka!
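
Guess-and-check translates directly into code. Here is a minimal brute-force sketch in Python (my own illustration, with the search bound of 100 being an arbitrary assumption):

```python
def solve_cookies(limit=100):
    """Brute-force search over Bob's possible starting cookie counts."""
    for bob in range(1, limit):
        alice = 3 * bob  # Alice starts with three times as many cookies
        # After Alice gives Bob one cookie, she has twice as many as he does.
        if alice - 1 == 2 * (bob + 1):
            return alice, bob
    return None

print(solve_cookies())  # → (9, 3): Alice started with nine, Bob with three
```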

Another reason I love this problem is that it fairly simply demonstrates how modern LLMs have managed to get around their inherently limited ability to perform logic. In a technique known as "Chain of Thought", LLMs are trained not to simply respond to inquiries that require logical deduction with a conceptual answer, but rather to rephrase the question to themselves in simpler terms until they are able to arrive at a logical answer. If you've ever encountered a problem like the one I presented above and found yourself immediately firing off an internal monologue about how to approach and solve the problem, well, that's exactly what LLMs now do.
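
At the prompting level, eliciting that internal monologue can be as simple as asking the model to show its work. Here is a sketch; the wording is my own, and `complete()` stands in for a hypothetical LLM call, not any real API:

```python
def chain_of_thought_prompt(question):
    # "Let's think step by step" is the classic chain-of-thought trigger
    # phrase from the research literature; the rest is illustrative wording.
    return (
        f"Question: {question}\n"
        "Let's think step by step, restating the problem in simpler terms "
        "before giving a final answer."
    )

prompt = chain_of_thought_prompt(
    "Alice has three times as many cookies as Bob. After she gives him one, "
    "she has twice as many. How many cookies did each start with?"
)
# answer = complete(prompt)  # `complete` is a hypothetical LLM call
print(prompt)
```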

But this is not the only approach they can take. If, with that cookie problem, you found yourself immediately writing out equations as in the first answer I provided, that is also a thing that LLMs can now do.

I had a friend in college whose father held a PhD in physics. Back "in the day", when a large portion of scientific literature was published in German or Russian or French, one requirement for being awarded a PhD was the ability to read and understand at least three different languages. My friend's father had, cheekily, gotten around this requirement when getting his PhD by claiming that his knowledge of Fortran and C qualified and, for whatever reason, his thesis committee agreed. Really, though, there was some foresight in this conclusion. It turns out that when we play in the realm of languages and concepts, just as an English phrase can map to a concept, and then that concept can map to a German word, that concept might also map to an algebraic equation or a Python program.

And so another way in which modern LLMs deal with difficulties in handling logical problems is that they can, quite literally, transform those problems into programs which they can then execute to arrive at the answer.

Finally, if you had a sense that the cookie problem could be represented as a linear system of equations, but you needed to call up your friendly local mathematician to figure out how to write out and solve the matrix equations, well, that is also something that modern LLMs can do. We now have the ability to give LLMs access to "tools", along with descriptions of what those tools can do, and the LLMs will delegate tasks that are beyond their reasoning ability to these tools when appropriate.
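
To make that concrete, here is a sketch of what such a tool might look like: a JSON-schema-style description the LLM reads, paired with the function it ultimately invokes. The names and schema layout are illustrative, not any specific vendor's API:

```python
# A tool description in the JSON-schema style several LLM APIs use.
# The model sees this description and decides when to delegate to the tool.
solve_linear_tool = {
    "name": "solve_linear_system",
    "description": "Solve the 2x2 linear system a*x + b*y = e, c*x + d*y = f.",
    "parameters": {
        "type": "object",
        "properties": {
            "a": {"type": "number"}, "b": {"type": "number"},
            "c": {"type": "number"}, "d": {"type": "number"},
            "e": {"type": "number"}, "f": {"type": "number"},
        },
        "required": ["a", "b", "c", "d", "e", "f"],
    },
}

def solve_linear_system(a, b, c, d, e, f):
    """Solve the 2x2 system exactly via Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("system is singular")
    x = (e * d - b * f) / det
    y = (a * f - e * c) / det
    return x, y

# The cookie problem as a linear system: A - 3B = 0 and A - 2B = 3.
print(solve_linear_system(1, -3, 1, -2, 0, 3))  # → (9.0, 3.0)
```

The point isn't this particular solver; it's that the LLM only needs to recognize that the cookie problem matches the tool's description, fill in the six coefficients, and hand the actual arithmetic off to ordinary, reliable code.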

What all of these approaches have in common is that they are, fundamentally, engineered solutions to the limitations that LLMs face today. Advancing AI is not just a problem of engineering chips, memory, or data centers, but also a question of engineering clever solutions to all the various challenges that LLMs still face. Of course, advancing AI even further into the future is not just an engineering problem. There are still fundamental questions that researchers have yet to solve, and for which engineering has so far only been able to deliver less-than-ideal solutions.

Memories...

I always enjoyed acting as a child. When I was in the sixth grade, only the seventh and eighth graders were allowed to participate in drama club. So, when I entered the seventh grade, I tried out for the school play that year: "Our Town". I landed the part of the paperboy, who has all of two lines toward the beginning of the play, but I really threw myself into that part. The next year the school was putting on "Willy Wonka" and I landed the much bigger part of Mike Teevee. Being one of the last of the children to meet their untimely end in Willy Wonka's factory, Mike Teevee had a lot of lines. I had no idea how I was going to memorize my lines!

Luckily, our drama teacher had a solution. Her instructions were to sit before bed each night and read the entire play, cover to cover. The idea was not to focus on just my lines, but to memorize the entire play. That way, the reasoning went, I could fully embody my character because I wouldn't be waiting for a specific cue. Instead, I could follow the action of the entire play, right alongside the audience, and all I had to do when it was my line was to speak.

At first I wasn't sure I was up to the task of memorizing an entire play, but I followed the teacher's instructions. It took a bit more than an hour, but each night before bed I read the full script, cover to cover. Much to my amazement, I found that this repetition eventually allowed me to recall, at a moment's notice, almost any line in the play. I didn't have to drill lines, write out flash cards, or focus on cues. I had memorized the play, almost without trying.

Now, today I couldn't recite a single line from that play. The intervening decades have given me myriad facts and other things to remember, and the pages of Willy Wonka have long since been purged, despite the many nights spent reading it in its entirety. What I can tell you, though, is what it felt like to play that part. I can tell you how ridiculous I felt wearing a cowboy hat and the rest of the costume, or how tedious it was to have to apply stage makeup each night. I can tell you how I discovered a jealous streak in me I didn't know was there when it was revealed that, for the first time, sixth graders were eligible to try out that year and that I was cast alongside a sixth grader who played Mike Teevee on the nights I was not. I can tell you how I ultimately decided that theater was not for me because, while I tried my hardest to put on the best performance I could, the "other Mike Teevee" got all the attention because he was cool and I was not.

I can tell you all these things, because I have learned and I can learn. This is, ultimately, the one thing that LLMs still cannot do.

If you asked an LLM to write an essay exploring the history of AI development, the current state of the industry, the ways in which challenges have been overcome, and where the industry is headed next, it could probably write an essay about Bletchley Park, Turing, OpenAI, and tool calling. It could do this because that information was in the dataset it was originally trained on, or because that information is retrievable from a tool. What it could not do is tell you stories about a helpful boss, or a favorite teacher, or eighth grade drama club.

What's Next?

So where do we go from here? I can tell you that "online learning", or the ability for LLMs to adjust their understanding and knowledge as they run, is a very active area of research. I can also tell you that engineers are every day inventing clever new ways for LLMs to work within their inherent limitations to give the appearance of "learning". But now we come to the most important question of all: is this what we want?

There's a saying that "old scientists never retire, they just become philosophers". Well, while I am older than I once was, I'm not yet at the point of delving into philosophy. That said, I know that those who perform research in the area of consciousness and sentience have suggested that LLMs cannot achieve either because to do so would require the ability to learn, the ability to forget, and the experience of impermanence, of death. I cannot really say much more about this subject, but I do wonder if consciousness, sentience, or even human-like intelligence is even what we want out of AI? After all, isn't the advantage of AI that it isn't human? AI doesn't get tired, doesn't forget, doesn't get bored. Maybe this is a good thing? Maybe not?

I cannot say what the next big advance in AI technology will be. What I can say is that, much more important a question than "Where can we go from here?" is "Where do we want to go?"
