zeromathai

Posted on Jun 16 • Originally published at zeromathai.com

How Transformer Architecture Works — Encoder, Decoder, Tokens, and Context

#ai #machinelearning #nlp #transformers

Transformers changed NLP because they stopped treating text as a simple left-to-right chain.

Instead of reading one token at a time, they compare tokens directly.

That shift made modern language models faster, more scalable, and better at understanding context.

Core Idea

A Transformer is a sequence-to-sequence architecture.

It maps an input sequence to an output sequence.

For example:

English sentence → Korean sentence

Question → Answer

Document → Summary

But the key idea is not “replace one word with another word.”

The key idea is:

Transformers build contextual token representations first.

Then they generate or transform output from those representations.

That is why the architecture matters.

It gives the model a structured way to understand relationships inside text.

The Key Structure

A simplified Transformer flow looks like this:

Input Text

→ Tokens

→ Token Embeddings + Positional Information

→ Encoder

→ Contextual Representations

→ Decoder

→ Output Tokens

More compactly:

Transformer = tokenization + embeddings + positional information + attention + encoder-decoder structure

The model first converts raw text into tokens.

Then each token becomes a vector.

Then attention updates each vector based on relationships with other tokens.

The Encoder understands the input.

The Decoder generates the output.

Implementation View

At a high level, the architecture works like this:

split input text into tokens

convert tokens into embedding vectors and add positional information

pass embeddings through encoder layers

for each encoder layer:
    compute self-attention to mix information across tokens

    apply feed-forward transformation to each token

the final encoder layer outputs contextual token representations

pass previous output tokens into decoder

for each decoder layer:
    apply masked self-attention

    attend to encoder output with cross-attention

    apply feed-forward transformation

project the final decoder output to vocabulary scores and predict the next output token

This structure is practical because attention can be computed with matrix operations.

That makes Transformers much more GPU-friendly to train than step-by-step recurrent models, because all positions in a sequence can be processed at once.

This is one of the biggest reasons Transformers scaled so well.
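To make the "matrix operations" point concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The random weights, the tiny dimensions, and the function names are assumptions for illustration only. A real model learns the weights and uses many heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over all tokens at once."""
    Q = X @ W_q                                # (n_tokens, d_k) queries
    K = X @ W_k                                # (n_tokens, d_k) keys
    V = X @ W_v                                # (n_tokens, d_k) values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 8, 4               # e.g. "I", "love", "you"
X = rng.normal(size=(n_tokens, d_model))       # stand-in embeddings, not learned
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # (3, 4)
print(weights.shape)  # (3, 3)
```

There is no loop over tokens. Every pairwise comparison happens inside two matrix products, which is the kind of work GPUs handle well.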

Concrete Example

Take this sentence:

I love you.

An RNN reads it step by step:

I → love → you

A Transformer can compare all tokens directly.

When processing “love”, it can look at both “I” and “you” at the same time.

So “love” is not treated as an isolated word.

It becomes a contextual representation.

The model learns:

Who loves?

Who is loved?

Which tokens are related?

This matters because language is not just a sequence of words.

Language is a structure of relationships.

Sequence-to-Sequence View

A Transformer can be understood as a sequence-to-sequence model.

It receives one sequence.

It produces another sequence.

Examples:

translation
summarization
question answering
text generation
code generation

The input and output lengths do not need to match.

That is important.

A short sentence can become a long explanation.

A long document can become a short summary.

The model is not copying token positions.

It is transforming meaning.

RNN vs Transformer

This comparison explains why Transformers became dominant.

RNN:

processes tokens one by one
keeps information in a hidden state
naturally handles order
is hard to parallelize
can struggle with long-range dependencies

Transformer:

processes tokens in parallel
compares tokens directly
uses attention instead of recurrence
scales better on GPUs
models long-distance relationships more directly

The difference is simple:

RNN = memory through sequence steps

Transformer = relationships through attention

This is why Transformers are not just “faster RNNs.”

They represent sequence information in a different way.
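For contrast, here is a rough sketch of the recurrent side of this comparison. It is a vanilla tanh RNN with random weights, chosen only for illustration. The point is the loop: each step needs the hidden state from the step before it.

```python
import numpy as np

def rnn_forward(X, W_x, W_h):
    """Vanilla RNN: step t cannot start until step t-1 has produced h."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:                          # strictly sequential loop over tokens
        h = np.tanh(x_t @ W_x + h @ W_h)   # new state depends on the old state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 5 tokens, 8-dim stand-in embeddings
W_x = rng.normal(size=(8, 16)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1

H = rnn_forward(X, W_x, W_h)
print(H.shape)  # (5, 16)
```

Information about the first token reaches the last step only by passing through every hidden state in between. Attention skips that chain and compares the two tokens directly.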

Encoder-Decoder Architecture

The original Transformer uses an Encoder-Decoder structure.

The Encoder reads the input sequence.

The Decoder generates the output sequence.

Encoder:

receives input tokens
applies self-attention
builds contextual representations
outputs one vector per token

Decoder:

receives previously generated tokens
uses masked self-attention
attends to encoder output
predicts the next token

The Encoder answers:

What does the input mean?

The Decoder answers:

What should be generated next?

Transformer Encoder

The Transformer Encoder is a stack of repeated encoder layers.

Each layer has two main parts, each wrapped in a residual connection and layer normalization:

Self-Attention
Feed-Forward Network

Self-Attention lets each token look at other tokens in the same input.

The Feed-Forward Network transforms each token representation independently.

A simplified encoder layer looks like this:

Input

→ Self-Attention

→ Feed-Forward Network

→ Contextual Output

The important part is that every token representation becomes context-aware.

A word is no longer just a word vector.

It becomes a word vector shaped by the sentence around it.
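The claim that the Feed-Forward Network works on each token independently can be checked with a small sketch. The weights are random and the sizes are arbitrary assumptions. ReLU is used because the original Transformer used it.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: the same two-layer MLP is applied to every token row."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU activation
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 3, 8, 32
X = rng.normal(size=(n_tokens, d_model))    # stand-in for attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

batched = feed_forward(X, W1, b1, W2, b2)
one_by_one = np.stack([feed_forward(x, W1, b1, W2, b2) for x in X])
print(np.allclose(batched, one_by_one))  # True
```

Processing all tokens together gives the same result as processing them one at a time. So within an encoder layer, mixing information across tokens is the job of Self-Attention alone.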

Word Embedding, Tokens, and Vocabulary

A Transformer does not understand raw text directly.

It first splits text into tokens.

A token can be:

a word
a subword
a single character or byte
a special symbol

The full set of possible tokens is called the vocabulary.

Each token is mapped to a vector through an embedding layer.

The flow looks like this:

Raw text

→ Tokens

→ Token IDs

→ Embedding vectors

For example:

"I love you"

→ ["I", "love", "you"]

→ [token_id_1, token_id_2, token_id_3]

→ [vector_1, vector_2, vector_3]

This matters in practice.

When building with LLMs, tokenization affects cost, context length, latency, and output behavior.

So tokens are not just preprocessing details.

They are part of the model interface.
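The text-to-vector flow above can be sketched in a few lines of plain Python. The four-entry vocabulary, the whitespace tokenizer, and the random embedding table are all toy assumptions. Real tokenizers use learned subword vocabularies, and real embeddings are trained.

```python
import random

# Toy vocabulary; a real one has tens of thousands of subword entries.
vocab = {"<unk>": 0, "I": 1, "love": 2, "you": 3}

def tokenize(text):
    # Whitespace splitting is a simplification of real subword tokenization.
    return text.split()

def encode(tokens):
    # Unknown tokens fall back to the <unk> ID.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

random.seed(0)
d_model = 4
# One row of d_model numbers per vocabulary entry (learned in a real model).
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

tokens = tokenize("I love you")
ids = encode(tokens)
vectors = [embedding_table[i] for i in ids]   # embedding lookup is just indexing

print(tokens)                         # ['I', 'love', 'you']
print(ids)                            # [1, 2, 3]
print(len(vectors), len(vectors[0]))  # 3 4
```

The token count, not the character count, is what the model sees. That is why tokenization shows up in cost and context limits.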

Transformer Decoder

The Transformer Decoder generates output tokens.

It has three main components:

Masked Self-Attention
Cross-Attention
Feed-Forward Network

Masked Self-Attention prevents the model from seeing future tokens.

This is required for autoregressive generation.

When predicting the next token, the model can only use previous tokens.
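One common way to implement the mask, sketched here with NumPy, is to set the scores for future positions to negative infinity before the softmax. The uniform zero scores are an assumption that makes the effect easy to read.

```python
import numpy as np

def causal_mask(n):
    # True above the diagonal marks "future" positions that must be hidden.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)      # hidden positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                            # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                         # uniform raw scores, 4 tokens
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

Row 0 puts all its weight on token 0. Row 1 splits its weight between tokens 0 and 1. Only the last row can see all four tokens.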

The flow looks like this:

Previous output tokens

→ Masked Self-Attention

→ Cross-Attention with Encoder Output

→ Feed-Forward Network

→ Next Token Prediction

This is how the model generates text step by step.

It predicts one token.

Then it appends that token.

Then it predicts the next token.
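The predict-append-repeat loop can be sketched without any real model. Here `toy_model` is a hypothetical stand-in that counts upward and then emits an end-of-sequence ID. A real decoder would return the most likely next token given everything generated so far.

```python
def generate(next_token_fn, prompt_ids, eos_id, max_new_tokens=10):
    """Greedy autoregressive loop: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(ids)   # model sees only the tokens so far
        ids.append(next_id)
        if next_id == eos_id:          # stop once end-of-sequence is emitted
            break
    return ids

# Hypothetical stand-in for a trained decoder: counts up to 5, then emits EOS (0).
def toy_model(ids):
    return ids[-1] + 1 if ids[-1] < 5 else 0

print(generate(toy_model, prompt_ids=[1], eos_id=0))  # [1, 2, 3, 4, 5, 0]
```

The `max_new_tokens` limit matters in practice. Without it, a model that never emits the end-of-sequence token would generate forever.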

Cross-Attention

Cross-Attention connects the Decoder to the Encoder.

The Decoder asks:

Which part of the input should I focus on right now?

This is especially useful in translation.

The output word order may be different from the input word order.

A phrase in one language may correspond to several words in another language.

Cross-Attention helps the Decoder align output generation with the encoded input.

Without Cross-Attention, the Decoder could condition only on its own previous tokens.

With Cross-Attention, it can reference the input meaning directly.
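A minimal sketch of Cross-Attention, assuming random weights and made-up sizes: the queries come from the Decoder states, while the keys and values come from the Encoder output. That is the only difference from Self-Attention.

```python
import numpy as np

def cross_attention(dec_states, enc_states, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = dec_states @ W_q                       # (n_out, d_k)
    K = enc_states @ W_k                       # (n_in, d_k)
    V = enc_states @ W_v                       # (n_in, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (n_out, n_in)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
enc_states = rng.normal(size=(5, d_model))     # 5 input tokens (stand-in values)
dec_states = rng.normal(size=(2, d_model))     # 2 output tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = cross_attention(dec_states, enc_states, W_q, W_k, W_v)
print(weights.shape)  # (2, 5)
print(context.shape)  # (2, 4)
```

Each row of `weights` is one output position's answer to "which part of the input should I focus on right now?". The input and output lengths differ, and nothing in the computation requires them to match.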

Context Length

Context length means:

How many tokens the model can process at once.

A longer context allows the model to use more information.

This is useful for:

long documents
long conversations
code files
retrieval-augmented generation
summarization

But longer context is not free.

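In standard self-attention, every token is compared with every other token.

So the compute and memory for the attention scores grow quadratically with sequence length: doubling the context roughly quadruples that cost.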

This is why context length is both powerful and expensive.

In real systems, context length affects memory usage, latency, and price.
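A quick back-of-the-envelope sketch shows how fast this grows. It assumes standard attention that scores every token against every token, a naive implementation that stores the full score matrix, and 2 bytes per score. Optimized implementations can avoid storing the whole matrix, so treat the memory figures as an upper-bound illustration.

```python
def attention_score_entries(n_tokens):
    # Standard self-attention scores every token against every token.
    return n_tokens * n_tokens

for n in (1_000, 2_000, 4_000, 8_000):
    entries = attention_score_entries(n)
    mib = entries * 2 / 2**20          # assumption: 2 bytes per score (fp16)
    print(f"{n:>5} tokens -> {entries:>12,} scores, about {mib:.1f} MiB per head per layer")
```

Each doubling of the context multiplies the number of scores by four.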

Naive vs Practical View

Naive view:

A Transformer is a model that takes text and returns text.

Practical developer view:

A Transformer is a token-processing system with attention, context limits, and generation constraints.

Naive mindset:

input text
get output text

Practical mindset:

tokenize input

manage context length

understand attention cost

choose decoding strategy

optimize inference

control output quality

This matters because production AI systems are not only about model accuracy.

They are also about speed, memory, cost, and reliability.
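As one example of "choose decoding strategy", here is a small sketch comparing greedy decoding with temperature sampling. The logits are made up, and the function names are not from any particular library.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                               # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    # Always pick the highest-scoring token.
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, rng):
    # Draw one token ID according to the temperature-scaled probabilities.
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                          # made-up scores for 3 tokens
rng = random.Random(0)

print(greedy(logits))                             # 0
print(softmax_with_temperature(logits, 0.5))
print(softmax_with_temperature(logits, 2.0))
print(sample(logits, temperature=1.0, rng=rng))
```

Greedy decoding is deterministic. Sampling trades some of that predictability for variety, and temperature controls how much.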

Important Conditions and Limits

Transformers are powerful, but they have important constraints.

They need tokenization before processing text.

They need positional information because attention alone does not know order.

They can become expensive with long context.

Decoder generation is sequential during inference: each new token depends on the tokens generated before it.

Context length limits how much information the model can use at once.
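For the positional information point, one well-known option is the fixed sinusoidal encoding from the original Transformer, sketched below. Many models use learned or relative position schemes instead, so read this as one possible choice.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Fixed sinusoidal position encoding (d_model is assumed to be even)."""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)    # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(n_positions=4, d_model=8)
print(pe.shape)  # (4, 8)
print(pe[0])     # position 0: sine terms are 0, cosine terms are 1
```

These vectors are added to the token embeddings before the first layer. Every position gets a different vector, so the same word at two positions no longer looks identical to attention.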

These limits explain why modern LLM engineering focuses so much on:

efficient attention
KV Cache
long-context optimization
better tokenization
inference speed
memory reduction

The architecture is elegant.

But scaling it requires engineering.
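The KV Cache idea from the list above can be sketched in a few lines. This assumes a single attention head with random weights and stand-in token vectors. The point is only that keys and values for earlier tokens are computed once and reused at every later step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    # One query scored against all available keys, then a weighted sum of values.
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(6, d))            # stand-in vectors for 6 generated tokens

K_cache, V_cache, outputs = [], [], []
for x in tokens:                            # one decoding step per token
    K_cache.append(x @ W_k)                 # compute this token's key and value once
    V_cache.append(x @ W_v)
    q = x @ W_q
    outputs.append(attend(q, np.stack(K_cache), np.stack(V_cache)))

# Reference: recompute all keys and values from scratch for the last step.
reference = attend(tokens[-1] @ W_q, tokens @ W_k, tokens @ W_v)
print(np.allclose(outputs[-1], reference))  # True
```

The cached result matches the full recomputation. The trade-off is memory: the cache grows with every generated token, which ties back to the context length cost.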

Transformer vs Traditional Seq2Seq

Traditional Seq2Seq:

often uses RNN-based Encoder and Decoder
compresses the input into a fixed-size hidden state
processes sequence step by step
may lose information in long sequences

Transformer Seq2Seq:

uses attention-based Encoder and Decoder
keeps contextual representations for all tokens
supports parallel computation
models token relationships directly

The key difference:

Traditional Seq2Seq compresses through recurrence.

Transformer Seq2Seq connects through attention.

That is why Transformers became the foundation for modern NLP systems.

Takeaway

A Transformer works by turning tokens into contextual representations.

The Encoder understands the input.

The Decoder generates the output.

Self-Attention models relationships inside a sequence.

Cross-Attention connects generated output to encoded input.

Context length controls how much information the model can use.

If you remember one structure, remember this:

Text → Tokens → Embeddings → Attention → Contextual Representations → Output

That is the backbone of Transformer architecture.

Discussion

When learning Transformers, which part helped you understand the architecture fastest?

The Encoder-Decoder structure, Self-Attention, tokenization, or the generation loop?

Original article: https://zeromathai.com/en/transformer-architecture-core-components-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
