Jimin Lee

(2/3) LLM: Data, Transformers, and Relentless Compute

📌 Note: This article was originally written in April 2023. Even though I’ve updated parts of it, some parts may feel a bit dated by today’s standards. However, most of the key ideas about LLMs remain just as relevant today.

Large Language Models

So, what happens when a regular Language Model gets bigger? You get a Large Language Model (LLM).

But we can’t just blow these things up infinitely. Three big roadblocks stand in the way:

  • Training Data: You need a ridiculous amount of it.

  • Algorithms: Scaling requires smarter and more powerful algorithms.

  • Compute Power: Think massive clusters of top-tier GPUs/TPUs.

The fact that we can train LLMs today means these problems are being solved, at least partially. Interestingly, the same three factors—data, algorithms, and compute—are exactly what allowed the leap from traditional machine learning to deep learning. And history suggests they’ll be the levers again when the next paradigm shift comes.


Training Data

Every machine learning model needs data. And the stronger the model you want, the more data you need.

This has always been one of the hardest parts of ML: collecting data, and then labeling it with the right answers (positive/negative, named entity positions, etc.).

But here’s the twist: language models have a cheat code.


Self-Supervised Learning

Labeling by hand is expensive—time, money, human effort. Which means scaling is painful.

Take a simple sentence: “I went to school yesterday.” From that one sentence, you can generate your own training examples automatically:

  • “I” → predict “went”

  • “I went” → predict “to”

  • “I went to school” → predict “yesterday”

No humans required. As long as you have text, you can create training data automatically.

This approach is called self-supervised learning. Unlike classic unsupervised learning (where no labels exist at all), self-supervised methods generate labels directly from the raw data itself. That’s why modern LM training almost always falls under the self-supervised umbrella.
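
To make that concrete, here's a minimal sketch, in plain Python with no ML library, of how next-word training pairs fall out of raw text with no human labeler involved:

```python
# Every prefix of a sentence becomes an input; the word that follows
# becomes its label. The raw text supplies both sides of the pair.

def make_training_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])   # e.g. "I went to school"
        target = words[i]               # e.g. "yesterday"
        pairs.append((context, target))
    return pairs

for context, target in make_training_pairs("I went to school yesterday"):
    print(f"{context!r} -> {target!r}")
# 'I' -> 'went'
# 'I went' -> 'to'
# 'I went to' -> 'school'
# 'I went to school' -> 'yesterday'
```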


Web-Scale Data

Back in the early days, available text datasets were tiny: a few MBs of news articles, some licensed books, or manually curated corpora. Even Wikipedia dumps in the single-digit GB range felt massive.

Then the internet changed everything. The web is fundamentally text-driven, and its scale is mind-boggling. Wikipedia? That’s just a drop in the bucket.

Projects like Common Crawl began collecting enormous swaths of web data—tens of terabytes and growing. And the best part? It’s freely available.

On top of that, many platforms have released their own cleaned-up datasets (within legal limits), which, while smaller, often have much higher quality than raw crawled text.


When the Two Collide

Now put these pieces together:

  • Self-supervised learning means we don’t need humans to label text.

  • Web-scale data means we suddenly have oceans of training material.

The result? A perfect storm for building today’s LLMs. That’s how we got from tiny datasets in the MB range to massive, automatically labeled corpora in the TB range—the fuel that makes GPTs, PaLMs, and LLaMAs possible.


⚠️ Spoiler Alert
This section dives into the geeky, technical weeds. If that’s not your thing, feel free to just take away this: Transformers are insanely powerful algorithms. Then skip ahead to the Compute Power section.

Algorithms

Even if you have oceans of data, you still need a good algorithm to digest it.

And here enters the celebrity of modern AI: the Transformer.

Released by Google in 2017, the Transformer is both an architecture and a set of structural ideas. You’ll hear people call it the Transformer model, the Transformer architecture, or just Transformer. Doesn’t matter — it’s the engine under the hood of nearly every LLM today.

But why is this thing so powerful? To explain that, we need to take a quick detour into a core concept: Sequence-to-Sequence models.


Sequence to Sequence

First things first: what’s a sequence? It’s just a list of items in order.

  • 1, 5, 3, 2, 1 → a sequence of numbers

  • ant, frog, dog, horse → a sequence of animals

  • I, went, to, school, yesterday → a sequence of words

In NLP, our main interest is the last one: word sequences, or simply, sentences.

Now, Sequence-to-Sequence (seq2seq) means exactly what it says: one sequence in, another sequence out. In NLP, that usually means one sentence gets transformed into another.

The most obvious example? Translation.

  • “나는 어제 학교에 갔다.” → “I went to school yesterday.”

But translation isn’t the only seq2seq task:

  • Summarization: long text → shorter text

  • Sentiment classification: review text → “positive” or “negative” (classification framed as text output)

  • Named Entity Recognition: input sentence → same sentence but with tags/highlighted entities (also cast as seq2seq)

Once you think this way, you realize seq2seq is almost a universal recipe for NLP tasks.
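
As a toy illustration of that universal recipe (the example pairs below are made up), here is what it looks like when every task is written down as an input-text to output-text pair:

```python
# Each NLP task reduced to a (source text, target text) pair.
seq2seq_tasks = {
    "translation":   ("나는 어제 학교에 갔다.", "I went to school yesterday."),
    "summarization": ("A very long article about transformers ...", "A short summary."),
    "sentiment":     ("This movie was a waste of two hours.", "negative"),
    "ner":           ("Jimin visited Seoul.", "[PER Jimin] visited [LOC Seoul]."),
}

for task, (src, tgt) in seq2seq_tasks.items():
    print(f"{task:14s} {src!r} -> {tgt!r}")
```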


How Do We Translate?

Here’s the million-dollar question: how does your brain actually translate “나는 어제 학교에 갔다” into “I went to school yesterday”?

Answer: we don’t really know.

A classic hypothesis is the Inter-lingua theory: instead of going word-by-word, the brain converts the sentence into some abstract, universal “meaning language” (the inter-lingua) and then expresses it in the target language.

Conceptually:

  1. Korean sentence → Inter-lingua

  2. Inter-lingua → English sentence

In ML terms:

  • Encoding: convert source text into a hidden representation

  • Decoding: generate the target text from that hidden representation

That hidden representation isn’t human language. It’s an abstract space — what we in ML usually call latent vectors. The “inter-lingua” metaphor is useful, but in practice it’s math.


Encoding and Decoding — With an Analogy

Think data formats:

  • Encoding: turning the light a camera captures into a video file, or compressing files into a ZIP archive. Same content, different form.

  • Decoding: playing that video file, unzipping that compressed folder, or decrypting a spy’s coded message.

Sometimes we call decoding generation, especially in NLP, since the model isn’t just restoring the original — it’s producing new text in a different form.

That’s exactly what translation does:

  • Encode: Korean → latent meaning representation

  • Decode: latent meaning → English

Why does this matter?

This encode/decode structure is perfect for deep learning because ML excels at learning A → B mappings from lots of examples. With enough bilingual pairs (or any source–target pairs), a model can learn how to encode and decode by itself.

And that’s where Transformers come in: they’re the best seq2seq engine we’ve discovered so far.
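
If you want to see that learned A → B mapping in action, here's a hedged sketch using the Hugging Face `transformers` library. The checkpoint name is my assumption, not something from this article; substitute any Korean-to-English seq2seq model you have access to.

```python
# A pretrained encoder-decoder model that has already learned the
# Korean -> English mapping from many sentence pairs.
from transformers import pipeline

# Assumed checkpoint; swap in any ko->en translation model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-ko-en")

print(translator("나는 어제 학교에 갔다.")[0]["translation_text"])
# expected: something like "I went to school yesterday."
```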


Back to Transformers

At its core, a Transformer is a very clever encoder–decoder model.

If you’ve Googled “Transformer” before, you’ve almost certainly seen the classic architecture diagram: an encoder stack on the left and a decoder stack on the right.

Flow:

  1. The input sentence (“I went to school yesterday”) goes into the encoder stack on the left.

  2. It passes through several encoder blocks, gradually turning into a Context representation.

  3. That Context is fed into the decoder stack on the right, where it guides generation.

  4. The decoder outputs probabilities for each possible next word.

Let’s make this concrete. Imagine our vocabulary has 10,000 words:

  • ID 1 = “a”

  • ID 2 = “apple”

  • ID 8789 = “went”

If the decoder is trying to predict the next word after “I,” it might output something like:

  • Word 1 (“a”): 0.00001

  • Word 2 (“apple”): 0.0004

  • Word 8789 (“went”): 0.901

Clearly, “went” is the winner.
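
Here's a tiny sketch of that final step, with made-up scores for just three of the 10,000 words: the decoder's raw scores (logits) go through a softmax, and the highest-probability word wins.

```python
import math

vocab = {1: "a", 2: "apple", 8789: "went"}
logits = {1: -5.2, 2: -1.3, 8789: 6.4}          # raw scores from the decoder (made up)

# Softmax: turn scores into probabilities that sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {i: math.exp(v) / total for i, v in logits.items()}

best = max(probs, key=probs.get)
print(vocab[best], round(probs[best], 3))        # -> went, with nearly all the probability mass
```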


Stacking Blocks

Those “Encoder” and “Decoder” boxes are actually stacks of blocks:

  • Each Encoder Block feeds the next Encoder Block.

  • After stacking N of them, the final block’s output becomes the encoder’s Context.

  • The Decoder works the same way, with multiple Decoder Blocks stacked.

Why stack blocks? Same reason we make deep neural networks “deep.” More layers = more expressive power = better performance.

Caveats:

  • If you scale the model without enough data, performance can get worse.

  • Bigger models demand far more compute and memory for both training and serving.

In other words: stacking is powerful, but it comes with a bill—sometimes a very expensive one.
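
For the curious, here is what "stacking N blocks" looks like as a minimal PyTorch sketch, with toy-sized dimensions (real LLMs are far larger):

```python
import torch
import torch.nn as nn

# One encoder block, repeated 6 times.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=6)

tokens = torch.randn(1, 5, 512)   # a fake "sentence": 5 token embeddings
context = encoder(tokens)         # the final block's output = the encoder's Context
print(context.shape)              # torch.Size([1, 5, 512])
```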


So, What’s the Relationship Between Transformers and Language Models?

We’re not studying Transformers in isolation — we’re asking why they made LLMs possible.

At its core, a Language Model (LM) has a very simple job:

Predict the next word, given the text so far.

For example:

  • Input: “The flowers by the roadside are blooming …”

  • Bad guess: “punched.”

  • Good guess: “beautifully.”

And guess what? Transformers are ridiculously good at this game.


Back to Our Example

Take the sentence: “나는 어제 학교에 갔다.” (Korean: I went to school yesterday.)

  1. The encoder processes this input and produces a hidden Context representation (a compressed “meaning” in vector form).

  2. The decoder turns that Context into English — but it uses two inputs:

    • The Context (from the encoder)
    • Outputs (the decoder’s previously generated tokens)

Why “Outputs” as input? Because the decoder is auto-regressive — it feeds its own past predictions back in.


The Auto-Regressive Loop

Walkthrough:

  1. Encoding: The encoder converts “나는 어제 학교에 갔다” into Context vectors.

  2. First step: Context + <start> → predict “I.”

  3. Second step: Context + “I” → predict “went.”

  4. Third step: Context + “I went” → predict “to.”

  5. Repeat until the remaining words (“school yesterday”) are generated.

  6. Stop at <end>.

Result: “I went to school yesterday.”
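
Here's the same loop as a schematic Python sketch. The `predict_next` function is a hard-coded stand-in for a real decoder, included only so the loop runs end to end:

```python
def predict_next(context, generated):
    # Stand-in for the decoder: pretend it always guesses the right next token.
    target = ["I", "went", "to", "school", "yesterday", "<end>"]
    return target[len(generated)]

context = "encoded(나는 어제 학교에 갔다)"        # stand-in for the encoder's Context vectors
generated = []

while True:
    token = predict_next(context, generated)   # Context + tokens so far -> next token
    if token == "<end>":
        break
    generated.append(token)                    # feed the prediction back in (auto-regression)

print(" ".join(generated))                     # -> I went to school yesterday
```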


Two Key Ideas

  1. One word at a time: The decoder generates token by token (why ChatGPT looks like it’s “typing”).

  2. Predictions feed back in: Each new token becomes input for the next step (auto-regression).

Perfect alignment:

  • The job of an LM is “predict the next word.”

  • The mechanism of the Transformer decoder is auto-regressive next-word prediction.

If we want a plain LM that predicts the next word in the same language, we make a small but important tweak…

Encoder-Only, Decoder-Only

The full Encoder–Decoder Transformer is powerful, but not everyone needs both halves. Researchers asked: What if we only used the encoder? What if we only used the decoder?


Encoder-Only

The most famous encoder-only model? BERT.

BERT keeps just the encoder stack. Sometimes all you need is a good representation of text (context vectors), not generation.

Great for classification tasks:

  • Is this review positive or negative?

  • Does this sentence contain a person’s name?

Classification works on embeddings. Better embeddings → better classifiers. BERT looks at text bidirectionally, encodes whole sentences, and produces rich representations. Plug them into a classifier and accuracy jumps.

Is BERT a language model? Strictly, no — it doesn’t do auto-regressive next-word prediction. It’s trained as a masked language model (predict the missing word), which is different from traditional LMs.
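
As a hedged sketch of that recipe with the Hugging Face `transformers` library: run BERT once, take the [CLS] vector as the sentence embedding, and hand it to whatever classifier you like (the classifier itself is omitted here):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("This movie was wonderful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]   # the [CLS] token's vector
print(cls_embedding.shape)                        # torch.Size([1, 768])
# Feed this vector into any downstream classifier (logistic regression, a small MLP, ...).
```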


Decoder-Only

On the other side: GPT.

GPT (GPT-2/3, ChatGPT, GPT-4…) keeps only the decoder stack.

Why drop the encoder? If your goal is just next-word prediction — the pure LM task — you can feed the decoder with the text so far and let it continue auto-regressively.

  • Input: “The flowers by the roadside are blooming”

  • Decoder predicts: “beautifully.”

  • That prediction feeds back in, and generation continues.

This is why GPT and its cousins (LaMDA, PaLM, LLaMA, Claude, etc.) follow the decoder-only recipe. It’s the simplest and most direct way to scale LMs into generative engines.
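
A hedged sketch of that recipe, again with the Hugging Face `transformers` library, using the small public GPT-2 checkpoint as a stand-in for its much larger cousins:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

out = generator("The flowers by the roadside are blooming",
                max_new_tokens=10, do_sample=False)   # greedy continuation
print(out[0]["generated_text"])
```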


Encoder + Decoder

Models like T5 and BART keep the full structure and shine at clear input → output transformations (translation, summarization, etc.).


Encoder vs. Decoder

Historically, encoder-only exploded first (BERT) because many NLP tasks were classification-heavy. Decoder-only models initially looked like “nonsense generators.”

Key difference:

  • Encoder-only models can’t generate text.

  • Decoder-only models can — and with scale, their potential is enormous. Even classification can be reframed as generation (“The review is … [positive/negative]”).

That’s why decoder-only LMs became the dominant LLMs.


A Long Tradition

Transformers didn’t invent encoder–decoder. Before 2017, RNNs/LSTMs/GRUs were the standard way to build it. Transformers replaced RNNs.

Biggest reason people cite: Self-Attention.


Why Do Transformers Work So Well? Self-Attention

Two concepts are central:

  • The Encoder–Decoder structure

  • Self-Attention

Let’s start with Attention itself.


Attention

Attention first showed up in RNN-based seq2seq models. Recall the pipeline:

Input → Encoder → Context → Decoder → Output

The decoder generates tokens one by one. Early models used a fixed Context for every step, but different output words need to “look back” at different parts of the input.

Example:

“나는 어제 학교에 갔습니다.” → “I went to school yesterday.”

If the model could focus on 갔습니다 (went) and 어제 (yesterday) at the right time, it would more reliably pick “went” (past tense) over “go.”

That’s Attention: at each step, re-weight which parts of the input matter most.


Self-Attention

Seq2seq Attention asks: Which parts of the source should I attend to while generating the target?

Self-Attention asks: Within a single sentence, which words should each word attend to?

Example:

“The animal didn’t cross the street because it was too tired.”

Here, “it” should link strongly to “animal”, but also relates to “tired.”

Why is this powerful for LMs?

  • To predict “bloomed” in “The flowers by the roadside … bloomed,” “flowers” should get the highest weight.

  • To pick tense, “yesterday” matters more than “school.”

Self-Attention lets the model discover this automatically.
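
For the mathematically curious, here is a minimal NumPy sketch of single-head self-attention. The projection matrices are random stand-ins for the weights a real model would learn:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Each token builds a query, key, and value vector.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scores: how relevant is token j to token i? (scaled dot product)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over each row (numerically stable).
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    # Output: a weighted mix of value vectors for every token.
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                       # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)                    # (5, 16) (5, 5)
```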


Multi-Head Self-Attention

Language has multiple relationship types:

  • Grammatical (subject ↔ verb)

  • Semantic (animal ↔ it)

  • Attributes (it ↔ tired)

One attention map can’t capture every view. The fix: run multiple attention heads in parallel, each with a different “view.”

Under the hood, word embeddings are split into subspaces (chunks of numbers). Each head attends within a different subspace, encouraging different aspects (grammar, meaning, style) to emerge.

Instead of one spotlight, give the model a dozen flashlights, each shining on a different relationship.

That’s the magic of Multi-Head Self-Attention, and one of the key reasons Transformers dethroned RNNs.
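
In code, PyTorch ships this as a ready-made building block. A hedged sketch with toy sizes: the same sentence goes in as queries, keys, and values, and 8 heads attend to it in parallel, each over its own 512/8 = 64-dimensional subspace.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

tokens = torch.randn(1, 10, 512)                  # a fake 10-token sentence
out, attn_weights = mha(tokens, tokens, tokens)   # self-attention: query = key = value
print(out.shape, attn_weights.shape)              # torch.Size([1, 10, 512]) torch.Size([1, 10, 10])
```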


175B? 540B? What Do Parameter Counts Actually Mean?

You’ll often hear sizes like 175B (GPT-3) or 540B (PaLM). These are the number of parameters — the weights in the Transformer.

More parameters → more capacity. Hence the popular (but flawed) shortcut:

Bigger model → better performance.

In reality, performance depends on more than size:

  • How much data was used?

  • How high-quality was that data?

  • Were the hyperparameters tuned well?

  • How long (and how thoroughly) was the model trained?

So why do parameter counts dominate? They’re easy to understand.

If someone asks, “Which model is better, A or B?” you could unpack data quality, training steps, and optimizers… or say:

“Model A is 70B. Model B is 200B. Model B is better.”

It’s not necessarily true — but it’s simple.

⚠️ Pro tip: If someone talks about model quality only in terms of parameter count, be cautious. They either don’t fully understand, or they’re trying to sell you something.
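
If you want to see what a parameter count literally is, here is a quick sketch that counts the trainable weights of the toy PyTorch encoder from earlier:

```python
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=6)

# "Parameters" are just the trainable weights; count every element.
n_params = sum(p.numel() for p in encoder.parameters())
print(f"{n_params:,} parameters")   # on the order of 19 million here; GPT-3 has ~175 billion
```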


Transformer in a Nutshell

  • Transformers were designed for Sequence-to-Sequence tasks.

  • The most common form is the Encoder–Decoder structure.

  • Variants exist: Encoder-only (BERT), Decoder-only (GPT), Encoder+Decoder (T5, BART).

  • To generate language, you need a Decoder.

  • A core innovation is Self-Attention.

  • To capture different perspectives (grammar, semantics, style), Transformers use Multi-Head Self-Attention.


Compute Power

The last ingredient: compute.

LLMs wouldn’t exist without massive progress in hardware and infrastructure:

  • GPUs (and TPUs) unlocked massively parallel training. GPUs were the rocket fuel of the deep learning boom, and today Nvidia still dominates with CUDA, optimized libraries, and cutting-edge hardware.

  • Parallel training techniques allow hundreds (or thousands) of GPUs to train a single model in sync.

  • Cloud infrastructure made it practical. Buying racks of GPUs is brutally expensive — and they start depreciating the moment you unbox them. Renting from AWS, Azure, or GCP lets teams scale without opening a hardware graveyard in the office.

In short: faster chips + smarter software + elastic cloud = the horsepower that makes LLMs possible.


Why LLMs Happened Now

We’ve walked through the three big ingredients:

  1. Data: Web-scale text + self-supervised learning → oceans of training material.

  2. Algorithms: Transformers (self-attention, scalable stacks) replaced RNNs.

  3. Compute: GPUs/TPUs + cloud infrastructure → enough horsepower to train monster models.

Each piece alone would’ve been impressive. Put together, they sparked a step-change.

A decade ago, we had:

  • Limited datasets (a few gigabytes at most).

  • Algorithms (RNNs, LSTMs) that struggled with long sequences.

  • GPUs that couldn’t realistically handle 100B+ parameter models.

Today, we have:

  • Tens of terabytes of training data at our fingertips.

  • Transformer architectures that scale beautifully.

  • GPU/TPU clusters that can train trillion-parameter models.

No single breakthrough “invented” LLMs. It was the intersection of trends — data, algorithms, compute — that finally clicked into place.

That’s why LLMs feel like they appeared “all of a sudden.” The truth is, researchers were laying the groundwork for years. The moment the three factors aligned, the field exploded.

And that’s where we are now: riding the wave of models that are bigger, smarter, and more capable than anyone thought possible five years ago.

In the next post, I’ll dive into zero-shot, few-shot, prompting, and the rest of the story.
