DEV Community

Vaishali
How Transformer Architecture Powers LLMs

We use LLMs every day, but most explanations stop at
“it’s a transformer” and move on.

What actually happens between a prompt and the next generated word?
How does the model decide what matters and what doesn’t?

This article breaks down that flow — step by step — without math,
and without hand-waving.


🧠 How Transformers Differ from Traditional Models

Older language models processed text sequentially, focusing mostly on neighboring words.

That meant:

  • Limited long-range understanding
  • Difficulty connecting distant words in a sentence

Transformers changed this by doing something radical:

They consider the relationship between every word and every other word — all at once.

Instead of asking only:
“What word comes next based on the previous one?”

They ask:
“How does every word relate to every other word in this sentence?”

This is what allows LLMs to understand context at scale.


🧩 Breakdown of the Transformer's Core Components

Below are the key components that transform raw text into predictions.

1. Tokenization - Turning Text Into Numbers

Before anything else, the prompt is converted into tokens.

Example:
Prompt: "Write a story about dragon"
Tokens: [9566, 261, 4869, 1078, 103944]

Why does this step exist?

Models don’t understand raw text.
They operate on numbers.

At this stage:

  • Tokens are just identifiers
  • They carry no meaning or context
  • “dragon” is just a number, not a concept

That limitation is solved in the next step.
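The lookup above can be sketched in a few lines of Python. The vocabulary and IDs are made up to match the example; real tokenizers use learned subword schemes like BPE rather than whole words:

```python
# Minimal sketch of tokenization with a toy, made-up vocabulary.
# Real tokenizers split text into subword pieces and learn their IDs.
vocab = {"Write": 9566, "a": 261, "story": 4869, "about": 1078, "dragon": 103944}

def tokenize(text):
    # Look up each whitespace-separated word. A real tokenizer would
    # break unknown words into smaller subword pieces instead of failing.
    return [vocab[word] for word in text.split()]

print(tokenize("Write a story about dragon"))
# → [9566, 261, 4869, 1078, 103944]
```

At this point the IDs really are just identifiers, which is why the next step is needed.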

2. Vector Embeddings - Adding Meaning Beyond Words

Vector embeddings capture semantic meaning — words with similar meanings end up closer together in vector space.

Consider these two sentences:

  • “He deposited money in the bank”
  • “They sat near the river bank”

Tokenization treats bank the same in both cases.

Why are embeddings needed?

Vector embeddings represent words in a multi-dimensional space where similar meanings sit close together. (Strictly speaking, the initial embedding for bank is the same in both sentences; it is the attention layers that follow that push the two senses apart.)

Example:
bank (finance) → [0.82, -0.14, 0.56, 0.09]
bank (river)   → [-0.21, 0.77, -0.63, 0.48]

The numbers themselves don’t matter.
What matters is distance and direction between vectors.

This is how the model distinguishes meaning.
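Closeness in vector space is usually measured with cosine similarity. A minimal sketch, reusing the bank vectors from the example above plus a made-up vector for money:

```python
import math

# Hypothetical 4-dimensional embeddings (real models use hundreds of dims).
bank_finance = [0.82, -0.14, 0.56, 0.09]
bank_river   = [-0.21, 0.77, -0.63, 0.48]
money        = [0.78, -0.10, 0.60, 0.12]   # made-up vector near the finance sense

def cosine_similarity(u, v):
    # 1.0 means "pointing the same way"; negative means "opposed".
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "money" sits close to the finance sense of "bank", far from the river sense.
print(cosine_similarity(money, bank_finance))  # close to 1.0
print(cosine_similarity(money, bank_river))    # negative
```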

3. Positional Encoding - Preserving Word Order

Embeddings capture meaning — but not order.
Without positional information, these two sentences look identical to the model:

  • “The dog chased the cat”
  • “The cat chased the dog”

Positional encoding injects order information into each word embedding.

So now we have:

Embedding + Position
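One common scheme (from the original Transformer paper) adds sinusoidal position vectors to the word embeddings. A minimal sketch, assuming a made-up 8-dimensional embedding:

```python
import math

def positional_encoding(position, dim=8):
    # Sinusoidal scheme: even indices use sine, odd indices use cosine,
    # at geometrically spaced frequencies, so every position gets a
    # distinct pattern.
    pe = []
    for i in range(dim):
        angle = position / (10000 ** ((i // 2 * 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The same word at positions 0 and 3 ends up with different inputs, so
# "dog chased cat" and "cat chased dog" no longer look identical.
embedding = [0.5] * 8  # made-up 8-dim word embedding
with_pos_0 = [e + p for e, p in zip(embedding, positional_encoding(0))]
with_pos_3 = [e + p for e, p in zip(embedding, positional_encoding(3))]
print(with_pos_0 != with_pos_3)  # True
```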

4. Self-Attention (The Core Idea)

Once embeddings + positional data are ready, they pass through the self-attention layer.

Self-attention assigns a weight to every word relative to every other word.

This allows the model to:

  • Focus on relevant relationships
  • Ignore irrelevant ones

Why does self-attention exist?

Not all words matter equally.

In the sentence:

“The fisherman caught the fish with a net”

The model needs to figure out:

  • Does “with a net” describe fisherman or fish?

[Image: self-attention weights between words]
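The weighting step can be sketched in miniature: raw attention scores (made up here) are turned into weights with softmax, so more relevant words get more influence:

```python
import math

def softmax(xs):
    # Turn arbitrary scores into positive weights that sum to 1.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for how much "net" attends to earlier words:
scores = {"fisherman": 3.0, "caught": 1.0, "fish": 0.5}
weights = softmax(list(scores.values()))

for word, w in zip(scores, weights):
    print(f"net → {word}: {w:.2f}")
# "fisherman" receives by far the largest weight, which is how the model
# can resolve who used the net.
```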

5. Multi-Head Self-Attention - Looking at Multiple Meanings at Once

A single attention pattern isn’t enough.
Different relationships exist at the same time:

  • grammatical
  • semantic
  • long-range dependencies

Multi-head attention solves this by running multiple attention layers in parallel.

Each head learns a different aspect of language:

  • one may focus on subject–verb relationships
  • another on modifiers
  • another on overall context

[Image: multiple attention heads operating in parallel]
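The splitting idea can be sketched as follows, assuming a made-up 8-dimensional embedding and 2 heads; in real models each head then runs its own attention over its slice before the results are concatenated back together:

```python
# Multi-head sketch: split one embedding into H smaller chunks, one per head.
def split_into_heads(vector, num_heads):
    size = len(vector) // num_heads
    return [vector[i * size:(i + 1) * size] for i in range(num_heads)]

embedding = [0.1, 0.4, -0.2, 0.9, 0.3, -0.7, 0.5, 0.0]  # made-up 8-dim vector
heads = split_into_heads(embedding, num_heads=2)
print(heads)  # two 4-dim sub-vectors, one per head
```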

6. Feed-Forward Network

After attention, each token's representation passes through a feed-forward network, which transforms it further.

What happens here?

  • The feed-forward layer refines each token's representation so the model can judge what word should come next.
  • At the end of the stack, a final linear layer (often called the LM head) assigns a score to every word in the model's vocabulary.
  • If the vocabulary contains 50,000 tokens, the output is a list of 50,000 scores.
  • These scores are called logits.
Example:

For sentence: "The cat is ..."
Logits →
[2.3, 4.97, 84.21, -5.65, ...]

where: 
- “sleeping” → very high score
- “running” → medium score
- “apple” → very low score

At this stage:

  • These are raw scores
  • They are not probabilities
  • Higher score = more likely next word

7. Softmax Output

The logits are passed through a softmax function.
Softmax:

  • converts scores into probabilities (0 → 1)
  • ensures they add up to 1

Now the model has a probability distribution over all possible next words.
In the simplest case, the word with the highest probability is selected; in practice, models often sample from this distribution instead, using settings such as temperature or top-k.
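The logits-to-probabilities step can be sketched with the tiny example vocabulary above:

```python
import math

# Hypothetical logits for a 3-word vocabulary after "The cat is ..."
logits = {"sleeping": 84.21, "running": 4.97, "apple": -5.65}

def softmax(scores):
    # Subtracting the max before exponentiating is standard practice
    # for numerical stability; it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = dict(zip(logits, softmax(list(logits.values()))))
next_word = max(probs, key=probs.get)
print(next_word)  # "sleeping"
```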


🔄 Putting It All Together: Encoder → Decoder Flow

[Image: Transformer architecture diagram]

Transformers are split into two major parts:

  • Encoder (Left side in the above image)
  • Decoder (Right side in the above image)

Let’s walk through them using an example.

Example Prompt: 
"Write a short story about dragon"

🔐 Encoder Flow

  1. Prompt → Tokens
  2. Tokens → Vector Embeddings
  3. Embeddings + Positional Encoding
  4. Multi-Head Self-Attention

The encoder produces a rich contextual representation.

It learns things like:

  • “story” relates to “dragon”
  • “short” modifies “story”
  • overall intent of the prompt

This output is not text — it’s meaning.


🎯 Decoder Flow (Word by Word Generation)

The decoder generates text one word at a time.

Step 1: Start Token

Initially, the decoder receives:

<START>

During training, the model learned patterns like:

  • “Write a story about…”
  • “Tell a story about…”

Many stories statistically start with:

"Once upon a time"

So the model predicts:

Once

The same process repeats for the next word, producing:

Once upon

Step 2: Masked Self-Attention

Masked self-attention ensures the model cannot see future words.

It allows:

  • “Once” to attend to <START>
  • “upon” to attend to both <START> and “Once”
  • but “Once” cannot attend to later tokens like “upon”, even though they are already part of the input

Step 3: Cross-Attention

Masked self-attention only looks at generated words.
But the model also needs to remember:

  • what the user asked for
  • what the prompt means

Why does cross-attention exist?

Cross-attention allows the decoder to:

  • look at the encoder’s output
  • align generated words with the prompt’s meaning

For example, the encoder representation contains:

  • “story”
  • “dragon”

So when generating words, the decoder is reminded:

  • this is a story
  • it must involve a dragon
  • tone should match the prompt

Without cross-attention:

  • the model could drift off-topic
  • or generate generic text unrelated to the prompt

Step 4: Predict Next Word

At this stage, the decoder predicts the next word in three clear steps:

1. Feed-Forward Network and Output Layer (Logits Generation)
Based on the prompt and previously generated words, the feed-forward layer refines the representation, and a final linear layer assigns a score to every word in the vocabulary.

2. Softmax (Probability Distribution)
The logits are passed through a softmax function, converting them into probabilities between 0 and 1, where all values sum to 1.

3. Token Selection
The word with the highest probability is chosen as the next token.

Example:

<START> Once upon
→ next token: "a"

The decoder input now becomes:

<START> Once upon a

This loop repeats token by token until the output is complete.
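The whole loop can be sketched end to end. Everything here is a stand-in: next_token replaces the full tokenize → embed → attend → softmax pipeline described above, and the canned continuations are hypothetical:

```python
# Sketch of the decoder's generation loop: repeatedly pick the most likely
# next token and append it to the context, until an end token appears.
CONTINUATIONS = {
    ("<START>",): "Once",
    ("<START>", "Once"): "upon",
    ("<START>", "Once", "upon"): "a",
    ("<START>", "Once", "upon", "a"): "time",
}

def next_token(context):
    # Stand-in for the real model: return the canned continuation, or <END>.
    return CONTINUATIONS.get(tuple(context), "<END>")

def generate(max_tokens=10):
    context = ["<START>"]
    while len(context) < max_tokens:
        token = next_token(context)
        if token == "<END>":
            break
        context.append(token)
    return " ".join(context[1:])  # drop the <START> marker

print(generate())  # "Once upon a time"
```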


📝 Note on Modern LLMs

The original Transformer architecture includes both an encoder and a decoder.

However, many modern large language models (like GPT models) use a decoder-only architecture.

In these models:

  • The prompt is treated as part of the input sequence
  • The model uses masked self-attention
  • There is no separate encoder block

Despite this difference, the core idea — self-attention — remains the foundation.


🌱 Final Takeaway

LLMs don’t “understand” language like humans.

They:

  • learn patterns
  • assign probabilities
  • repeat this process thousands of times per response

But the Transformer architecture makes this process powerful by allowing:

  • global context
  • parallel processing
  • deep relationships between words

Seeing how fast LLM apps like ChatGPT respond,
I never imagined such a large, iterative process was running underneath.

Once you understand this flow, LLMs stop feeling magical — and start feeling engineered.
