This post was written in April 2023, so some parts may now be a bit outdated. However, most of the key ideas about LLMs remain just as relevant today.
Encoder-Only, Decoder-Only
The full Encoder–Decoder Transformer is powerful, but not everyone needs both halves. Researchers asked: What if we only used the encoder? What if we only used the decoder?
Encoder-Only
The most famous encoder-only model? BERT.
BERT keeps just the encoder stack. Sometimes all you need is a good representation of text (context vectors), not generation.
Great for classification tasks:
Is this review positive or negative?
Does this sentence contain a person’s name?
Classification works on embeddings. Better embeddings → better classifiers. BERT looks at text bidirectionally, encodes whole sentences, and produces rich representations. Plug them into a classifier and accuracy jumps.
Is BERT a language model? Strictly, no — it doesn’t do auto-regressive next-word prediction. It’s trained as a masked language model (predict the missing word), which is different from traditional LMs.
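To make the masked-language-model objective concrete, here is a minimal sketch using the Hugging Face transformers library (my choice for illustration; the post itself doesn't assume any particular tooling):

```python
# A minimal sketch of BERT's masked-language-model objective, using the
# Hugging Face `transformers` library (chosen here for illustration only).
from transformers import pipeline

# BERT fills in the [MASK] token using context from BOTH directions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The flowers by the roadside are [MASK]."):
    # Each prediction carries the filled-in token and a confidence score.
    print(prediction["token_str"], round(prediction["score"], 3))
```

For classification, you would instead take BERT's sentence representation (for example the [CLS] vector) and train a small classifier on top of it.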
Decoder-Only
On the other side: GPT.
GPT (GPT-2/3, ChatGPT, GPT-4…) keeps only the decoder stack.
Why drop the encoder? If your goal is just next-word prediction — the pure LM task — you can feed the decoder with the text so far and let it continue auto-regressively.
Input: “The flowers by the roadside are blooming”
Decoder predicts: “beautifully.”
That prediction feeds back in, and generation continues.
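Here is what that feedback loop looks like in code: a minimal greedy-decoding sketch using GPT-2 through Hugging Face transformers (chosen purely for illustration; real systems usually sample rather than always picking the top token):

```python
# Auto-regressive (next-word) generation with a decoder-only model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The flowers by the roadside are blooming", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                       # generate 10 more tokens
        logits = model(ids).logits            # scores for every vocabulary token
        next_id = logits[0, -1].argmax()      # greedy: take the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)  # feed the prediction back in

print(tokenizer.decode(ids[0]))
```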
This is why GPT and its cousins (LaMDA, PaLM, LLaMA, Claude, etc.) follow the decoder-only recipe. It’s the simplest and most direct way to scale LMs into generative engines.
Encoder + Decoder
Models like T5 and BART keep the full structure and shine at clear input → output transformations (translation, summarization, etc.).
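As a quick illustration of that input → output pattern, here is a sketch that runs a BART summarization checkpoint through the Hugging Face pipeline API (library and checkpoint are my choices, not something the post prescribes):

```python
# An encoder-decoder model used as a pure input -> output transformation.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Transformers were introduced in 2017 and quickly replaced recurrent "
    "networks for most NLP tasks. They rely on self-attention to relate "
    "every token to every other token, which scales well on modern GPUs."
)
print(summarizer(article, max_length=60, min_length=10)[0]["summary_text"])
```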
Encoder vs. Decoder
Historically, encoder-only exploded first (BERT) because many NLP tasks were classification-heavy. Decoder-only models initially looked like “nonsense generators.”
Key difference:
Encoder-only models can’t generate text.
Decoder-only models can — and with scale, their potential is enormous. Even classification can be reframed as generation (“The review is … [positive/negative]”).
That’s why decoder-only LMs became the dominant LLMs.
A Long Tradition
Transformers didn’t invent the encoder–decoder structure. Before 2017, RNNs/LSTMs/GRUs were the standard way to build it; the Transformer simply swapped them out as the building block.
Biggest reason people cite: Self-Attention.
Why Do Transformers Work So Well? Self-Attention
Two concepts are central:
The Encoder–Decoder structure
Self-Attention
Let’s start with Attention itself.
Attention
Attention first showed up in RNN-based seq2seq models. Recall the pipeline:
Input → Encoder → Context → Decoder → Output
The decoder generates tokens one by one. Early models used a fixed Context for every step, but different output words need to “look back” at different parts of the input.
Example:
“나는 어제 학교에 갔습니다.” → “I went to school yesterday.”
If the model could focus on 갔습니다 (went) and 어제 (yesterday) at the right time, it would more reliably pick “went” (past tense) over “go.”
That’s Attention: at each step, re-weight which parts of the input matter most.
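A toy numpy sketch of that re-weighting (the scores are invented numbers, not real model outputs):

```python
# While generating "went", the decoder scores each source word; the softmax
# turns those scores into attention weights over the encoder states.
import numpy as np

src_words = ["나는", "어제", "학교에", "갔습니다"]
scores = np.array([0.5, 2.0, 0.3, 3.0])          # hypothetical relevance scores

weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
for word, w in zip(src_words, weights):
    print(f"{word}: {w:.2f}")
# "갔습니다" and "어제" get most of the weight, so the context vector at this
# step is dominated by exactly the words needed to choose the past tense.
```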
Self-Attention
Seq2seq Attention asks: Which parts of the source should I attend to while generating the target?
Self-Attention asks: Within a single sentence, which words should each word attend to?
Example:
“The animal didn’t cross the street because it was too tired.”
Here, “it” should link strongly to “animal”, but also relates to “tired.”
Why is this powerful for LMs?
To predict “bloomed” in “The flowers by the roadside … bloomed,” “flowers” should get the highest weight.
To pick tense, “yesterday” matters more than “school.”
Self-Attention lets the model discover this automatically.
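Here is a minimal single-head self-attention sketch in numpy, with toy shapes and random weights, just to show the mechanics:

```python
# Every token builds a query, key, and value; softmax(Q @ K.T / sqrt(d))
# decides how much each token attends to every other token.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens into Q/K/V spaces
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant is token j to token i?
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of value vectors

d = 8                                                # toy embedding size
X = np.random.randn(5, d)                            # 5 tokens, e.g. "the animal ... it ..."
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8): one updated vector per token
```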
Multi-Head Self-Attention
Language has multiple relationship types:
Grammatical (subject ↔ verb)
Semantic (animal ↔ it)
Attributes (it ↔ tired)
One attention map can’t capture every view. The fix: run multiple attention heads in parallel, each with a different “view.”
Under the hood, word embeddings are split into subspaces (chunks of numbers). Each head attends within a different subspace, encouraging different aspects (grammar, meaning, style) to emerge.
Instead of one spotlight, give the model a dozen flashlights, each shining on a different relationship.
That’s the magic of Multi-Head Self-Attention, one of the key reasons Transformers dethroned RNNs.
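For intuition, here is a rough continuation of the earlier numpy sketch: the embedding is split into subspaces and attention runs independently in each. (Per-head projection matrices are omitted for brevity, so this is a simplification of the real layer.)

```python
# Multi-head attention, simplified: split the embedding into subspaces,
# attend within each subspace, then concatenate the results.
import numpy as np

def multi_head(X, n_heads=2):
    d = X.shape[-1]
    head_dim = d // n_heads
    outputs = []
    for h in range(n_heads):
        chunk = X[:, h * head_dim:(h + 1) * head_dim]   # this head's subspace
        scores = chunk @ chunk.T / np.sqrt(head_dim)    # attention within the subspace
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ chunk)                 # each head has its own "view"
    return np.concatenate(outputs, axis=-1)             # back to the full embedding size

X = np.random.randn(5, 8)                               # 5 tokens, embedding size 8
print(multi_head(X).shape)                              # (5, 8)
```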
175B? 540B? What Do Parameter Counts Actually Mean?
You’ll often hear sizes like 175B (GPT-3) or 540B (PaLM). These are the number of parameters — the weights in the Transformer.
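You can sanity-check the 175B figure with back-of-envelope arithmetic, using the layer count and hidden size reported for GPT-3 and ignoring embeddings, biases, and layer norms:

```python
# Rough parameter count for a GPT-3-sized decoder stack.
# Per Transformer layer: ~4*d^2 for attention (Wq, Wk, Wv, Wo)
# plus ~8*d^2 for the MLP (d -> 4d -> d). Only an estimate.
d_model, n_layers = 12288, 96
params_per_layer = 4 * d_model**2 + 8 * d_model**2
total = params_per_layer * n_layers
print(f"{total / 1e9:.0f}B parameters")   # ~174B, close to the quoted 175B
```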
More parameters → more capacity. Hence the popular (but flawed) shortcut:
Bigger model → better performance.
In reality, performance depends on more than size:
How much data was used?
How high-quality was that data?
Were the hyperparameters tuned well?
How long (and how thoroughly) was the model trained?
So why do parameter counts dominate? They’re easy to understand.
If someone asks, “Which model is better, A or B?” you could unpack data quality, training steps, and optimizers… or say:
“Model A is 70B. Model B is 200B. Model B is better.”
It’s not necessarily true — but it’s simple.
⚠️ Pro tip: If someone talks about model quality only in terms of parameter count, be cautious. They either don’t fully understand, or they’re trying to sell you something.
Transformer in a Nutshell
Transformers were designed for Sequence-to-Sequence tasks.
The most common form is the Encoder–Decoder structure.
Variants exist: Encoder-only (BERT), Decoder-only (GPT), Encoder+Decoder (T5, BART).
To generate language, you need a Decoder.
A core innovation is Self-Attention.
To capture different perspectives (grammar, semantics, style), Transformers use Multi-Head Self-Attention.
Compute Power
The last ingredient: compute.
LLMs wouldn’t exist without massive progress in hardware and infrastructure:
GPUs (and TPUs) unlocked massively parallel training. GPUs were the rocket fuel of the deep learning boom, and today Nvidia still dominates with CUDA, optimized libraries, and cutting-edge hardware.
Parallel training techniques allow hundreds (or thousands) of GPUs to train a single model in sync.
Cloud infrastructure made it practical. Buying racks of GPUs is brutally expensive — and they start depreciating the moment you unbox them. Renting from AWS, Azure, or GCP lets teams scale without opening a hardware graveyard in the office.
In short: faster chips + smarter software + elastic cloud = the horsepower that makes LLMs possible.
Why LLMs Happened Now
We’ve walked through the three big ingredients:
Data: Web-scale text + self-supervised learning → oceans of training material.
Algorithms: Transformers (self-attention, scalable stacks) replaced RNNs.
Compute: GPUs/TPUs + cloud infrastructure → enough horsepower to train monster models.
Each piece alone would’ve been impressive. Put together, they sparked a step-change.
A decade ago, we had:
Limited datasets (a few gigabytes at most).
Algorithms (RNNs, LSTMs) that struggled with long sequences.
GPUs that couldn’t realistically handle 100B+ parameter models.
Today, we have:
Tens of terabytes of training data at our fingertips.
Transformer architectures that scale beautifully.
GPU/TPU clusters that can train trillion-parameter models.
No single breakthrough “invented” LLMs. It was the intersection of trends — data, algorithms, compute — that finally clicked into place.
That’s why LLMs feel like they appeared “all of a sudden.” The truth is, researchers were laying the groundwork for years. The moment the three factors aligned, the field exploded.
And that’s where we are now: riding the wave of models that are bigger, smarter, and more capable than anyone thought possible five years ago.
In the next post, I’ll dive into zero-shot, few-shot, prompting, and the rest of the story.