DEV Community

Vaishali
How Transformer Architecture Powers LLMs

We use LLMs every day, but most explanations stop at
“it’s a transformer” and move on.

What actually happens between a prompt and the next generated word?
How does the model decide what matters and what doesn’t?

This article breaks down that flow — step by step — without math,
and without hand-waving.


🧠 How Transformers Differ from Traditional Models

Older language models processed text sequentially, focusing mostly on neighboring words.

That meant:

  • Limited long-range understanding
  • Difficulty connecting distant words in a sentence

Transformers changed this by doing something radical:

They consider the relationship between every word and every other word — all at once.

Instead of asking only:
“What word comes next based on the previous one?”

They ask:
“How does every word relate to every other word in this sentence?”

This is what allows LLMs to understand context at scale.


🧩 Breakdown of the Transformer's Core Components

Below are the key components that transform raw text into predictions.

1. Tokenization - Turning Text Into Numbers

Before anything else, the prompt is converted into tokens.

Example:
Prompt: "Write a story about dragon"
Tokens: [9566, 261, 4869, 1078, 103944]

Why does this step exist?

Models don’t understand raw text.
They operate on numbers.

At this stage:

  • Tokens are just identifiers
  • They carry no meaning or context
  • “dragon” is just a number, not a concept

That limitation is solved in the next step.
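The lookup above can be sketched in a few lines of Python. The vocabulary and IDs are made up to match the example; real tokenizers use learned subword schemes like BPE rather than whole words:

```python
# Minimal sketch of tokenization with a toy, made-up vocabulary.
# Real tokenizers split text into subword pieces and learn their IDs.
vocab = {"Write": 9566, "a": 261, "story": 4869, "about": 1078, "dragon": 103944}

def tokenize(text):
    # Look up each whitespace-separated word. A real tokenizer would
    # break unknown words into smaller subword pieces instead of failing.
    return [vocab[word] for word in text.split()]

print(tokenize("Write a story about dragon"))
# → [9566, 261, 4869, 1078, 103944]
```

At this point the IDs really are just identifiers, which is why the next step is needed.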

2. Vector Embeddings - Adding Meaning Beyond Words

Vector embeddings capture semantic meaning — words with similar meanings end up closer together in vector space.

Consider these two sentences:

  • “He deposited money in the bank”
  • “They sat near the river bank”

Tokenization treats bank the same in both cases.

Why are embeddings needed?

Vector embeddings represent words in a multi-dimensional space where similar meanings sit close together. (Strictly speaking, the initial embedding for bank is the same in both sentences; it is the attention layers that follow that push the two senses apart.)

Example:
bank (finance) → [0.82, -0.14, 0.56, 0.09]
bank (river)   → [-0.21, 0.77, -0.63, 0.48]

The numbers themselves don’t matter.
What matters is distance and direction between vectors.

This is how the model distinguishes meaning.
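Closeness in vector space is usually measured with cosine similarity. A minimal sketch, reusing the bank vectors from the example above plus a made-up vector for money:

```python
import math

# Hypothetical 4-dimensional embeddings (real models use hundreds of dims).
bank_finance = [0.82, -0.14, 0.56, 0.09]
bank_river   = [-0.21, 0.77, -0.63, 0.48]
money        = [0.78, -0.10, 0.60, 0.12]   # made-up vector near the finance sense

def cosine_similarity(u, v):
    # 1.0 means "pointing the same way"; negative means "opposed".
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "money" sits close to the finance sense of "bank", far from the river sense.
print(cosine_similarity(money, bank_finance))  # close to 1.0
print(cosine_similarity(money, bank_river))    # negative
```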

3. Positional Encoding - Preserving Word Order

Embeddings capture meaning — but not order.
Without positional information, these two sentences look identical to the model:

  • “The dog chased the cat”
  • “The cat chased the dog”

Positional encoding injects order information into each word embedding.

So now we have:

Embedding + Position
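One common scheme (from the original Transformer paper) adds sinusoidal position vectors to the word embeddings. A minimal sketch, assuming a made-up 8-dimensional embedding:

```python
import math

def positional_encoding(position, dim=8):
    # Sinusoidal scheme: even indices use sine, odd indices use cosine,
    # at geometrically spaced frequencies, so every position gets a
    # distinct pattern.
    pe = []
    for i in range(dim):
        angle = position / (10000 ** ((i // 2 * 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The same word at positions 0 and 3 ends up with different inputs, so
# "dog chased cat" and "cat chased dog" no longer look identical.
embedding = [0.5] * 8  # made-up 8-dim word embedding
with_pos_0 = [e + p for e, p in zip(embedding, positional_encoding(0))]
with_pos_3 = [e + p for e, p in zip(embedding, positional_encoding(3))]
print(with_pos_0 != with_pos_3)  # True
```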

4. Self-Attention (The Core Idea)

Once embeddings + positional data are ready, they pass through the self-attention layer.

Self-attention assigns a weight to every word relative to every other word.

This allows the model to:

  • Focus on relevant relationships
  • Ignore irrelevant ones

Why does self-attention exist?

Not all words matter equally.

In the sentence:

“The fisherman caught the fish with a net”

The model needs to figure out:

  • Does “with a net” describe fisherman or fish?

[Image: self-attention weights between words]
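The weighting step can be sketched in miniature: raw attention scores (made up here) are turned into weights with softmax, so more relevant words get more influence:

```python
import math

def softmax(xs):
    # Turn arbitrary scores into positive weights that sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for how much "net" attends to earlier words:
scores = {"fisherman": 3.0, "caught": 1.0, "fish": 0.5}
weights = softmax(list(scores.values()))

for word, w in zip(scores, weights):
    print(f"net → {word}: {w:.2f}")
# "fisherman" receives by far the largest weight, which is how the model
# can resolve who used the net.
```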

5. Multi-Head Self-Attention - Looking at Multiple Meanings at Once

A single attention pattern isn’t enough.
Different relationships exist at the same time:

  • grammatical
  • semantic
  • long-range dependencies

Multi-head attention solves this by running multiple attention layers in parallel.

Each head learns a different aspect of language:

  • one may focus on subject–verb relationships
  • another on modifiers
  • another on overall context

[Image: multiple attention heads operating in parallel]
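The splitting idea can be sketched as follows, assuming a made-up 8-dimensional embedding and 2 heads; in real models each head then runs its own attention over its slice before the results are concatenated back together:

```python
# Multi-head sketch: split one embedding into H smaller chunks, one per head.
def split_into_heads(vector, num_heads):
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

embedding = [0.1, 0.4, -0.2, 0.9, 0.3, -0.7, 0.5, 0.0]  # made-up 8-dim vector
heads = split_into_heads(embedding, num_heads=2)
print(heads)  # two 4-dim sub-vectors, one per head
```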

6. Feed-Forward Network

After attention, each token's representation passes through a feed-forward network, which transforms it further.

What happens here?

  • The feed-forward layer refines each token's representation so the model can judge what word should come next.
  • At the end of the stack, a final linear layer (often called the LM head) assigns a score to every word in the model's vocabulary.
  • If the vocabulary contains 50,000 tokens, the output is a list of 50,000 scores.
  • These scores are called logits.
Example:

For sentence: "The cat is ..."
Logits →
[2.3, 4.97, 84.21, -5.65, ...]

where: 
- “sleeping” → very high score
- “running” → medium score
- “apple” → very low score

At this stage:

  • These are raw scores
  • They are not probabilities
  • Higher score = more likely next word

7. Softmax Output

The logits are passed through a softmax function.
Softmax:

  • converts scores into probabilities (0 → 1)
  • ensures they add up to 1

Now the model has a probability distribution over all possible next words.
In the simplest case, the word with the highest probability is selected; in practice, models often sample from this distribution instead, using settings such as temperature or top-k.
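The logits-to-probabilities step can be sketched with the tiny example vocabulary above:

```python
import math

# Hypothetical logits for a 3-word vocabulary after "The cat is ..."
logits = {"sleeping": 84.21, "running": 4.97, "apple": -5.65}

def softmax(scores):
    # Subtracting the max before exponentiating is standard practice
    # for numerical stability; it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = dict(zip(logits, softmax(list(logits.values()))))
next_word = max(probs, key=probs.get)
print(next_word)  # "sleeping"
```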


🔄 Putting It All Together: Encoder → Decoder Flow

[Image: Transformer architecture diagram]

Transformers are split into two major parts:

  • Encoder (Left side in the above image)
  • Decoder (Right side in the above image)

Let’s walk through them using an example.

Example Prompt: 
"Write a short story about dragon"

🔐 Encoder Flow

  1. Prompt → Tokens
  2. Tokens → Vector Embeddings
  3. Embeddings + Positional Encoding
  4. Multi-Head Self-Attention

The encoder produces a rich contextual representation.

It learns things like:

  • “story” relates to “dragon”
  • “short” modifies “story”
  • overall intent of the prompt

This output is not text — it’s meaning.


🎯 Decoder Flow (Word by Word Generation)

The decoder generates text one word at a time.

Step 1: Start Token

Initially, the decoder receives:

<START>

During training, the model learned patterns like:

  • “Write a story about…”
  • “Tell a story about…”

Many stories statistically start with:

"Once upon a time"

So the model predicts:

Once

The same process repeats for the next word, producing:

Once upon

Step 2: Masked Self-Attention

Masked self-attention ensures the model cannot see future words.

It allows:

  • “Once” to attend to <START>
  • “upon” to attend to both <START> and “Once”
  • but “Once” cannot attend to later tokens like “upon”, even though they are already part of the input

Step 3: Cross-Attention

Masked self-attention only looks at generated words.
But the model also needs to remember:

  • what the user asked for
  • what the prompt means

Why does cross-attention exist?

Cross-attention allows the decoder to:

  • look at the encoder’s output
  • align generated words with the prompt’s meaning

For example, the encoder representation contains:

  • “story”
  • “dragon”

So when generating words, the decoder is reminded:

  • this is a story
  • it must involve a dragon
  • tone should match the prompt

Without cross-attention:

  • the model could drift off-topic
  • or generate generic text unrelated to the prompt

Step 4: Predict Next Word

At this stage, the decoder predicts the next word in three clear steps:

1. Feed-Forward Network and Output Layer (Logits Generation)
Based on the prompt and previously generated words, the feed-forward layer refines the representation, and a final linear layer assigns a score to every word in the vocabulary.

2. Softmax (Probability Distribution)
The logits are passed through a softmax function, converting them into probabilities between 0 and 1, where all values sum to 1.

3. Token Selection
The word with the highest probability is chosen as the next token.

Example:

<START> Once upon
→ next token: "a"

The decoder input now becomes:

<START> Once upon a

This loop repeats token by token until the output is complete.
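The whole loop can be sketched end to end. Everything here is a stand-in: next_token replaces the full tokenize → embed → attend → softmax pipeline described above, and the canned continuations are hypothetical:

```python
# Sketch of the decoder's generation loop: repeatedly pick the most likely
# next token and append it to the context, until an end token appears.
CONTINUATIONS = {
    ("<START>",): "Once",
    ("<START>", "Once"): "upon",
    ("<START>", "Once", "upon"): "a",
    ("<START>", "Once", "upon", "a"): "time",
}

def next_token(context):
    # Stand-in for the real model: return the canned continuation, or <END>.
    return CONTINUATIONS.get(tuple(context), "<END>")

def generate(max_tokens=10):
    context = ["<START>"]
    while len(context) < max_tokens:
        token = next_token(context)
        if token == "<END>":
            break
        context.append(token)
    return " ".join(context[1:])  # drop the <START> marker

print(generate())  # "Once upon a time"
```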


📝 Note on Modern LLMs

The original Transformer architecture includes both an encoder and a decoder.

However, many modern large language models (like GPT models) use a decoder-only architecture.

In these models:

  • The prompt is treated as part of the input sequence
  • The model uses masked self-attention
  • There is no separate encoder block

Despite this difference, the core idea — self-attention — remains the foundation.


🌱 Final Takeaway

LLMs don’t “understand” language like humans.

They:

  • learn patterns
  • assign probabilities
  • repeat this process thousands of times per response

But the Transformer architecture makes this process powerful by allowing:

  • global context
  • parallel processing
  • deep relationships between words

Seeing how fast LLM apps like ChatGPT respond,
I never imagined such a large, iterative process was running underneath.

Once you understand this flow, LLMs stop feeling magical — and start feeling engineered.
