
Jimin Lee


(2/4) LLM: Data, Transformers, and Relentless Compute

This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.

Large Language Models

So, what happens when a regular Language Model gets bigger? You get a Large Language Model (LLM).

But we can’t just blow these things up infinitely. Three big roadblocks stand in the way:

  • Training Data: You need a ridiculous amount of it.

  • Algorithms: Scaling requires smarter and more powerful algorithms.

  • Compute Power: Think massive clusters of top-tier GPUs/TPUs.

The fact that we can train LLMs today means these problems are being solved, at least partially. Interestingly, the same three factors—data, algorithms, and compute—are exactly what allowed the leap from traditional machine learning to deep learning. And history suggests they’ll be the levers again when the next paradigm shift comes.


Training Data

Every machine learning model needs data. And the stronger the model you want, the more data you need.

This has always been one of the hardest parts of ML: collecting data, and then labeling it with the right answers (positive/negative, named entity positions, etc.).

But here’s the twist: language models have a cheat code.


Self-Supervised Learning

Labeling by hand is expensive—time, money, human effort. Which means scaling is painful.

Take a simple sentence: “I went to school yesterday.” From that one sentence, you can generate your own training examples automatically:

  • “I” → predict “went”

  • “I went” → predict “to”

  • “I went to school” → predict “yesterday”

No humans required. As long as you have text, you can create training data automatically.
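
Here's a minimal Python sketch of that idea (plain Python, no ML library involved): every prefix of a sentence becomes an input, and the word that follows it becomes the label.

```python
def make_next_word_pairs(sentence):
    """Split a sentence into (prefix, next_word) training pairs.

    Every prefix becomes an input, and the word that follows it
    becomes the label -- no human annotation needed.
    """
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        prefix = " ".join(words[:i])   # e.g. "I went to school"
        target = words[i]              # e.g. "yesterday"
        pairs.append((prefix, target))
    return pairs


for prefix, target in make_next_word_pairs("I went to school yesterday"):
    print(f"{prefix!r} -> {target!r}")
# 'I' -> 'went'
# 'I went' -> 'to'
# 'I went to' -> 'school'
# 'I went to school' -> 'yesterday'
```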

This approach is called self-supervised learning. Unlike classic unsupervised learning (where no labels exist at all), self-supervised methods generate labels directly from the raw data itself. That’s why modern LM training almost always falls under the self-supervised umbrella.


Web-Scale Data

Back in the early days, available text datasets were tiny: a few MBs of news articles, some licensed books, or manually curated corpora. Even Wikipedia dumps in the single-digit GB range felt massive.

Then the internet changed everything. The web is fundamentally text-driven, and its scale is mind-boggling. Wikipedia? That’s just a drop in the bucket.

Projects like Common Crawl began collecting enormous swaths of web data—tens of terabytes and growing. And the best part? It’s freely available.

On top of that, many platforms have released their own cleaned-up datasets (within legal limits), which, while smaller, often have much higher quality than raw crawled text.


When the Two Collide

Now put these pieces together:

  • Self-supervised learning means we don’t need humans to label text.

  • Web-scale data means we suddenly have oceans of training material.

The result? A perfect storm for building today’s LLMs. That’s how we got from tiny datasets in the MB range to massive, automatically labeled corpora in the TB range—the fuel that makes GPTs, PaLMs, and LLaMAs possible.


Algorithms

Even if you have oceans of data, you still need a good algorithm to digest it.

And here enters the celebrity of modern AI: the Transformer.

Released by Google in 2017, the Transformer is both an architecture and a set of structural ideas. You’ll hear people call it the Transformer model, the Transformer architecture, or just Transformer. Doesn’t matter — it’s the engine under the hood of nearly every LLM today.

But why is this thing so powerful? To explain that, we need to take a quick detour into a core concept: Sequence-to-Sequence models.


Sequence to Sequence

First things first: what’s a sequence? It’s just a list of items in order.

  • 1, 5, 3, 2, 1 → a sequence of numbers

  • ant, frog, dog, horse → a sequence of animals

  • I, went, to, school, yesterday → a sequence of words

In NLP, our main interest is the last one: word sequences, or simply, sentences.

Now, Sequence-to-Sequence (seq2seq) literally means sequence → sequence. In NLP, that usually means one sentence gets transformed into another.

The most obvious example? Translation.

  • “나는 어제 학교에 갔다.” → “I went to school yesterday.”

But translation isn’t the only seq2seq task:

  • Summarization: long text → shorter text

  • Sentiment classification: review text → “positive” or “negative” (classification framed as text output)

  • Named Entity Recognition: input sentence → same sentence but with tags/highlighted entities (also cast as seq2seq)

Once you think this way, you realize seq2seq is almost a universal recipe for NLP tasks.
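
To make that "universal recipe" concrete, here is a small, hypothetical set of (input, output) string pairs. The tag format in the NER example is just one possible convention, not a standard.

```python
# Each task is just a mapping from an input string to an output string.
seq2seq_examples = [
    # Translation: source sentence -> target sentence
    ("나는 어제 학교에 갔다.", "I went to school yesterday."),
    # Summarization: long text -> shorter text
    ("The meeting covered budgets, hiring plans, and the product roadmap "
     "for next year in great detail.",
     "Meeting summary: budgets, hiring, roadmap."),
    # Sentiment classification framed as text output
    ("This movie was a complete waste of time.", "negative"),
    # Named Entity Recognition framed as tagged text (one possible convention)
    ("Barack Obama was born in Hawaii.",
     "[PER Barack Obama] was born in [LOC Hawaii]."),
]

for source, target in seq2seq_examples:
    print(f"{source}  ==>  {target}")
```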


How Do We Translate?

Here’s the million-dollar question: how does your brain actually translate “나는 어제 학교에 갔다” into “I went to school yesterday”?

Answer: we don’t really know.

A classic hypothesis is the Inter-lingua theory: instead of going word-by-word, the brain converts the sentence into some abstract, universal “meaning language” (the inter-lingua) and then expresses it in the target language.

Conceptually:

  1. Korean sentence → Inter-lingua

  2. Inter-lingua → English sentence

In ML terms:

  • Encoding: convert source text into a hidden representation

  • Decoding: generate the target text from that hidden representation

That hidden representation isn’t human language. It’s an abstract space — what we in ML usually call latent vectors. The “inter-lingua” metaphor is useful, but in practice it’s math.


Encoding and Decoding — With an Analogy

Think data formats:

  • Encoding: a camera turning light into a video file, or compressing files into ZIP format. Same content, different form.

  • Decoding: playing that video file, unzipping that compressed folder, or decrypting a spy’s coded message.

Sometimes we call decoding generation, especially in NLP, since the model isn’t just restoring the original — it’s producing new text in a different form.

That’s exactly what translation does:

  • Encode: Korean → latent meaning representation

  • Decode: latent meaning → English

Why does this matter?

This encode/decode structure is perfect for deep learning because ML excels at learning A → B mappings from lots of examples. With enough bilingual pairs (or any source–target pairs), a model can learn how to encode and decode by itself.
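
As a rough sketch of that A → B structure, the code below compresses a source sequence into a latent vector and lets a decoder turn it into scores over a target vocabulary. It deliberately uses a tiny GRU-based encoder and decoder rather than a real Transformer, and the vocabulary sizes and dimensions are made-up toy numbers.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HIDDEN = 1000, 1000, 32, 64

class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HIDDEN, batch_first=True)

    def forward(self, src_ids):
        _, latent = self.rnn(self.embed(src_ids))  # latent: (1, batch, HIDDEN)
        return latent                              # the "inter-lingua" stand-in

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, tgt_ids, latent):
        hidden, _ = self.rnn(self.embed(tgt_ids), latent)
        return self.out(hidden)                    # logits over target vocab

# Fake token IDs just to show the shapes flowing through encode -> decode.
src = torch.randint(0, SRC_VOCAB, (1, 6))   # "Korean" sentence, 6 tokens
tgt = torch.randint(0, TGT_VOCAB, (1, 5))   # "English" tokens generated so far
latent = TinyEncoder()(src)
logits = TinyDecoder()(tgt, latent)
print(logits.shape)                          # torch.Size([1, 5, 1000])
```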

And that’s where Transformers come in: they’re the best seq2seq engine we’ve discovered so far.


Back to Transformers

At its core, a Transformer is a very clever encoder–decoder model.

If you’ve Googled “Transformer” before, you’ve almost certainly seen the classic architecture diagram: an encoder stack on the left and a decoder stack on the right.

Flow:

  1. The input sentence (“I went to school yesterday”) goes into the encoder stack on the left.

  2. It passes through several encoder blocks, gradually turning into a Context representation.

  3. That Context is fed into the decoder stack on the right, where it guides generation.

  4. The decoder outputs probabilities for each possible next word.

Let’s make this concrete. Imagine our vocabulary has 10,000 words:

  • ID 1 = “a”

  • ID 2 = “apple”

  • ID 8789 = “went”

If the decoder is trying to predict the next word after “I,” it might output something like:

  • Word 1 (“a”): 0.00001

  • Word 2 (“apple”): 0.0004

  • Word 8789 (“went”): 0.901

Clearly, “went” is the winner.
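
Here's a toy version of that last step in PyTorch. The logits are invented numbers, but they show how a softmax turns the decoder's raw scores into a probability distribution over the 10,000-word vocabulary.

```python
import torch

VOCAB_SIZE = 10_000
id_to_word = {1: "a", 2: "apple", 8789: "went"}  # tiny slice of a toy vocabulary

# Pretend these are the raw scores (logits) the decoder produced for the
# position right after "I". Higher score = more likely next word.
logits = torch.full((VOCAB_SIZE,), -5.0)
logits[1], logits[2], logits[8789] = 0.5, 2.0, 9.0

probs = torch.softmax(logits, dim=0)       # turn scores into probabilities
next_id = int(torch.argmax(probs))         # pick the most likely word

print(id_to_word[next_id], probs[next_id].item())  # "went", with prob close to 1
```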


Stacking Blocks

Those “Encoder” and “Decoder” boxes are actually stacks of blocks:

  • Each Encoder Block feeds the next Encoder Block.

  • After stacking N of them, the final block’s output becomes the encoder’s Context.

  • The Decoder works the same way, with multiple Decoder Blocks stacked.

Why stack blocks? Same reason we make deep neural networks “deep.” More layers = more expressive power = better performance.

Caveats:

  • If you scale the model without enough data, performance can get worse.

  • Bigger models demand far more compute and memory for both training and serving.

In other words: stacking is powerful, but it comes with a bill—sometimes a very expensive one.
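
For a feel of what stacking looks like in practice, here's a minimal PyTorch sketch that stacks six encoder layers. The hyperparameters (d_model=512, 8 heads, N=6) roughly match the original paper's base configuration, but they're only illustrative here.

```python
import torch
import torch.nn as nn

# One encoder block: self-attention + feed-forward (plus the usual norms).
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

# Stack N identical blocks; the output of the last one is the "Context".
encoder = nn.TransformerEncoder(block, num_layers=6)

fake_embeddings = torch.randn(1, 5, 512)   # batch of 1 sentence, 5 tokens
context = encoder(fake_embeddings)
print(context.shape)                        # torch.Size([1, 5, 512])
```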


So, What’s the Relationship Between Transformers and Language Models?

We’re not studying Transformers in isolation — we’re asking why they made LLMs possible.

At its core, a Language Model (LM) has a very simple job:

👉 Predict the next word, given the text so far.

For example:

  • Input: “The flowers by the roadside are blooming …”

  • Bad guess: “punched.”

  • Good guess: “beautifully.”

And guess what? Transformers are ridiculously good at this game.


Back to Our Example

Take the sentence: “나는 어제 학교에 갔다.” (Korean: I went to school yesterday.)

  1. The encoder processes this input and produces a hidden Context representation (a compressed “meaning” in vector form).

  2. The decoder turns that Context into English — but it uses two inputs:

    • The Context (from the encoder)
    • Outputs (the decoder’s previously generated tokens)

Why “Outputs” as input? Because the decoder is auto-regressive — it feeds its own past predictions back in.


The Auto-Regressive Loop

Walkthrough:

  1. Encoding: The encoder converts “나는 어제 학교에 갔다” into Context vectors.

  2. First step: Context + <start> → predict “I.”

  3. Second step: Context + “I” → predict “went.”

  4. Third step: Context + “I went” → predict “to.”

  5. Repeat: Context + “I went to” → “school,” then “yesterday.”

  6. Stop at <end>.

Result: “I went to school yesterday.”
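
Here's what that loop looks like as code. `decoder_step` is a placeholder standing in for a real trained decoder, and the token IDs for <start> and <end> are made up; only the loop structure is the point.

```python
import torch

START, END, VOCAB_SIZE = 0, 1, 10_000

def decoder_step(context, generated_ids):
    # Placeholder: a real decoder would attend over `context` and
    # `generated_ids`. Here we return random logits so the loop runs.
    return torch.randn(VOCAB_SIZE)

def greedy_decode(context, max_len=20):
    generated = [START]
    while len(generated) < max_len:
        logits = decoder_step(context, generated)
        next_id = int(torch.argmax(logits))  # pick the most likely token
        generated.append(next_id)            # feed it back in at the next step
        if next_id == END:                   # stop at <end>
            break
    return generated

context = torch.randn(5, 512)                # fake encoder output
print(greedy_decode(context))
```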


Two Key Ideas

  1. One word at a time: The decoder generates token by token (why ChatGPT looks like it’s “typing”).

  2. Predictions feed back in: Each new token becomes input for the next step (auto-regression).

Perfect alignment:

  • The job of an LM is “predict the next word.”

  • The mechanism of the Transformer decoder is auto-regressive next-word prediction.

If we want a plain LM that predicts the next word in the same language, we make a small but important tweak…

In the next post, I’ll cover the background that made LLMs possible, including a closer look at Transformers that I couldn’t fully explore this time.
