This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.
Getting Started
With the rise of ChatGPT, the buzz around Generative AI and Large Language Models (LLMs) has been explosive. Just a few months ago, terms like “generative” or “language model” felt like insider jargon—something only AI folks tossed around. But now, you’ll hear them casually dropped in coffee shop conversations.
What’s even more surprising is that these aren’t just buzzwords anymore—people are actually putting these models to work in everyday life. Running an image-to-text model on your laptop? Spinning up something like LLaMA locally? That used to be rare.
Naturally, this flood of attention has also produced a flood of explainer content. So you might wonder, “Do we really need yet another article on this?” Maybe not. But I wanted to share something practical that might help you better understand what Large Language Models really are.
This article won’t dive into math equations or heavy code. Instead, we’ll cover the past, present, and future of LLMs, with some light technical explanations sprinkled in.
Foundation Models
To talk about LLMs, we first need to talk about Foundation Models. The term is pretty literal—foundation means base, so a foundation model is a “base model.”
But what does that actually mean? The best way to understand it is to look at what things were like before foundation models existed.
Life Before Foundation Models
Imagine you want to build a system that can tell whether a product review is positive or negative. Traditionally, you’d go through these steps (there’s a quick code sketch of the whole workflow after the list):
1. Collect a huge amount of review data from the internet. If you don’t have enough, sometimes you even generate synthetic reviews.
2. Label the data as positive or negative. This step is called annotation or tagging.
3. Pick a machine learning method (these days, usually deep learning) and train on the labeled data.
4. Evaluate the trained model, then repeat until performance is good enough:
   - If there’s not enough data, collect and tag more.
   - If the labels aren’t accurate, fix them.
   - If the algorithm isn’t working well, try another one.
   - Or tune the model hyperparameters.
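To make this concrete, here’s a minimal sketch of that workflow in Python with scikit-learn. The tiny inline dataset is purely illustrative, standing in for the thousands of reviews you’d actually have to collect and label:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Steps 1-2: collected and labeled data (toy examples here)
reviews = [
    "Absolutely loved it, works great",
    "Terrible quality, broke after one day",
    "Exceeded my expectations, highly recommend",
    "Waste of money, very disappointed",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Step 3: pick a method and train on the labeled data
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)

# Step 4: evaluate on unseen input (here, just a quick sanity check)
print(model.predict(["I really enjoyed using this"]))  # hopefully [1]
```

Notice that everything in this pipeline, from the data to the features to the model, exists only for this one task.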
Now take another example: say you want to build a system that extracts names of people and places from text. Same process—except this time, instead of tagging reviews as positive/negative, you highlight names and places.
In short, every task required its own dataset and its own model. Obvious? Yes. Efficient? Not really.
Enter John and Mike
Let’s illustrate this with a conversation:
John: “Man, it’s exhausting to collect and label new data for every single task. And the stronger the algorithm, the more data it demands.”
Mike: “That’s life as an AI engineer. It’s how it’s always been.”
John: “But think about it. At the end of the day, all these tasks deal with language. Isn’t there a shared foundation?”
Mike: “Not really. Sentiment classification and named-entity recognition are totally different.”
John: “Maybe. But look—my brother’s a humanities major, and I’m an engineer. Our fields are worlds apart, but we can still chat just fine in English.”
Mike: “Right, because you both learned English.”
John: “Exactly. What if we gave an AI that same foundation—train it once on the basics of language, and then just add task-specific knowledge on top? Like how I can switch between engineer mode at work and boyfriend mode with my girlfriend, but I’m still the same person.”
Mike: “Hmm… interesting point.”
That idea? That’s a Foundation Model.
Why It Matters
Traditionally, “Engineer John” and “Boyfriend John” would be two totally separate people—born differently, trained differently, living different lives. But in the foundation model view, it’s one John who can switch modes depending on the context.
Now imagine an NLP Foundation Model.
- Want to build a sentiment classifier? Just fine-tune it with a small set of positive/negative labels.
- Want to build a name extractor? Use the same foundation model, but fine-tune it with tagged sentences.
With this approach:
- You need less data per task.
- Or, with the same amount of data, you get better performance.
If neither is true, then there’s no point in building foundation models. But when it works, the payoff is huge: one base model, lightly tuned, powering many tasks.
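What does that fine-tuning look like in practice? Here’s a rough sketch using the Hugging Face transformers library. The model name, the two-example dataset, and the bare-bones training loop are all illustrative, not a production recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One shared foundation model (BERT here), plus a small
# randomly initialized classification head for our specific task
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A small labeled set is often enough to adapt the base model
texts = ["Great product, would buy again!", "Awful. Do not buy."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

inputs = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps, just to show the loop
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Swap in a different head and different labels, and the same base model becomes a name extractor instead. That’s the whole pitch.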
And in NLP, the strongest foundation models we have today are Large Language Models.
Large Language Models
So, LLMs are foundation models—specifically, foundation models for NLP. But what exactly is a language model?
What’s a Language Model?
The phrase “language model” sounds fuzzy, doesn’t it? It feels like it should mean “a model of language,” but that’s not very precise.
In NLP, a language model has a specific definition: It predicts the next word given some text.
For example:
- Input: “The flowers along the road bloomed …”
- Likely next word: “beautifully.”
- Unlikely next word: “punched.”
A good language model learns from tons of real sentences and picks the most natural continuation. Importantly, it doesn’t just memorize—it generalizes patterns from the data.
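You can see this directly with an off-the-shelf model. The sketch below uses GPT-2 via the transformers library (chosen only because it’s small and public) to list the five most probable next tokens after the flower sentence:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The flowers along the road bloomed"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Distribution over the *next* token, given the prompt so far
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {prob:.3f}")
```

Whatever the exact ranking, natural continuations will score orders of magnitude higher than “punched.”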
Two Main Uses
With a language model, you can:
- Predict the next word in a sequence (auto-completion).
- Judge how natural a sentence is (assign probabilities; there’s a code sketch after the examples below).
Examples:
- “The flowers bloomed beautifully.” → High probability
- “The flowers punched beautifully.” → Low probability
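This second use falls straight out of the same model: sum up the log-probability of each word given everything before it. A small sketch, again with GPT-2 as a stand-in; scoring by average per-token log-likelihood is one common recipe, not the only one:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def naturalness(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # average per-token negative log-likelihood as `loss`
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()  # higher = more natural

print(naturalness("The flowers bloomed beautifully."))  # higher score
print(naturalness("The flowers punched beautifully."))  # lower score
```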
Why This Matters for Foundation Models
Predicting the next word may sound simple, but it requires surprisingly deep abilities:
- Vocabulary knowledge: knowing many words.
- Grammar knowledge: picking the right form.
- Context awareness: using prior sentences to choose the right meaning.
If a model can do all that, then it has learned something pretty close to “understanding language.”
So, the next logical step? Make the model bigger. Much bigger.
Because one of the core beliefs in machine learning is: scale makes models smarter.
In the next post, I’ll take a brief look at the background that made the rise of LLMs possible.