A student who studies well encodes knowledge. A student who writes the exam well decodes it.
Introduction
In 2017, a team of researchers at Google published a paper titled ‘Attention Is All You Need’. It introduced the Transformer, an architecture so powerful that it became the foundation of almost every major AI system built since: GPT, BERT, Claude, Gemini, and beyond. Today, when you talk to an AI assistant, translate a document, or use a search engine, a Transformer is almost certainly running underneath.
Yet for most people, the Transformer remains a mystery. Diagrams full of arrows, boxes labelled with words like ‘Multi-Head Attention’ and ‘Softmax’, and explanations drowning in matrix mathematics make it feel like something only a researcher could understand.
But here is the truth: every single stage of the Transformer maps almost perfectly onto something every human being has already lived through: going to school, studying for an exam, and sitting in the exam hall to write the answers.
The Encoder, the part that reads and understands the input, is the student in school: reading the textbook, building vocabulary, connecting subjects, revising repeatedly. The Decoder, the part that generates the output, is that same student in the exam hall: recalling what they studied, building the answer word by word, committing to each choice one at a time.
In this blog, I will walk through every stage of the Transformer using this analogy, from the very first step of Tokenization all the way to the final Softmax output. No equations. No tensors. Just a student, a school, and an exam hall.
The Encoder: Going to School
The Encoder’s job is to read, understand, and deeply memorize the input. Think of it as everything a student does before the exam: picking up textbooks, following the syllabus, connecting subjects, and revising multiple times.
Stage 0:
Tokenization: Reading the Textbook Word by Word

Before any learning begins, the raw input must be broken down into smaller, manageable units called tokens. A token is not always a full word: it can be a word, a part of a word, or even punctuation. For example, ‘unhappiness’ might become two tokens, ‘un’ and ‘happiness’, each carrying independent meaning.
This is the very first step. The model never sees raw text. It only ever receives a sequence of token IDs each one a known entry in the vocabulary before any further processing begins.
Analogy:
When a student sits down to study a new chapter, they do not absorb the whole page in one go. Their eyes move word by word, sometimes breaking an unfamiliar long word into recognizable parts: ‘photo’ and ‘synthesis’ from ‘photosynthesis’, or ‘micro’ and ‘biology’ from ‘microbiology’. They process one unit at a time, in sequence. Tokenization is exactly this natural reading process: breaking the incoming text into the smallest units the model can process, one at a time, before understanding begins.
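To make this concrete, here is a toy sketch in plain Python of a greedy longest-match subword tokenizer. The vocabulary here is hand-picked purely for illustration; real tokenizers such as WordPiece or BPE learn their vocabularies from data.

```python
# Hypothetical, hand-picked vocabulary; real models learn theirs from text.
VOCAB = {"un", "happiness", "photo", "synthesis", "micro", "biology", "the"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest known vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unhappiness"))     # ['un', 'happiness']
print(tokenize("photosynthesis"))  # ['photo', 'synthesis']
```

The fallback to single characters mirrors how real subword tokenizers guarantee that any input, even a word never seen before, can still be encoded.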
Stage 1:
Input Embeddings: The Personal Vocabulary Dictionary
Once the input is tokenized, each token must be converted into something the model can mathematically reason about: a dense numerical vector called an embedding. But where do these embeddings come from?
The model learns an Embedding Matrix during training: a giant lookup table where every token in the vocabulary maps to a unique, meaningful vector. Different models build this dictionary differently, and at different scales:
BERT uses WordPiece tokenization with a vocabulary of 30,522 tokens: roughly 30,000 word pieces covering English and common subwords. GPT models use Byte Pair Encoding (BPE), which starts from raw bytes and repeatedly merges the most frequent pairs, building a vocabulary of around 50,257 tokens for GPT-2. Google’s models such as T5 and PaLM use SentencePiece, a language-agnostic tokenizer that works directly on raw text without requiring pre-tokenization, typically with vocabularies of around 32,000 tokens.
Each of these is a different strategy for building the same thing: a pre-trained vocabulary dictionary that the model masters before learning anything task-specific. This matrix is the language foundation, the mother tongue that everything else is built upon.
Analogy:
Think of three students who each grew up speaking a different language. The English-speaking student (like BERT) has memorized around 30,000 words and word fragments in their mental dictionary. The student trained on a broader international curriculum (like GPT) has a larger dictionary of over 50,000 entries, built by merging the most common letter patterns they kept encountering. The Google-trained student uses a flexible script that works across any language without needing spaces or rules first. Each student has a differently sized and differently built vocabulary, but all three mastered their language dictionary completely before ever opening a subject textbook. That pre-built mental dictionary is the Embedding Matrix.
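The lookup itself is mechanically simple. Here is a minimal sketch, assuming a toy five-entry vocabulary and 4-dimensional random vectors standing in for trained vectors of 768 or more dimensions.

```python
import random

random.seed(0)

# A miniature embedding matrix: one vector per vocabulary entry.
# In a real model these vectors are learned during training; here they
# are random and tiny, purely to show the lookup mechanics.
vocab = ["[PAD]", "un", "happiness", "the", "cat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
EMBED_DIM = 4
embedding_matrix = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                    for _ in vocab]

def embed(tokens: list[str]) -> list[list[float]]:
    """Map each token to its ID, then look up its row in the matrix."""
    return [embedding_matrix[token_to_id[t]] for t in tokens]

vectors = embed(["un", "happiness"])
print(len(vectors), len(vectors[0]))  # 2 tokens, each a 4-dimensional vector
```

Note that embedding is literally a table lookup: the model never computes a vector from the letters of a token, it only retrieves the row it learned for that token ID.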
Stage 2:
Positional Encoding: Numbering the Notes
Here is a critical insight: the Transformer sees all tokens simultaneously; it has no built-in sense of what comes first or last. Left to itself, it would treat a sentence as an unordered bag of words. Positional Encoding solves this by injecting a unique position signal into each token’s embedding, so the model always knows token 1 from token 7, and understands that order carries meaning.
The same word in position 1 versus position 10 can mean something entirely different, and the model must know which is which.
Analogy:
A student takes notes while studying, writing down one key point after another. Smart students do one crucial thing: they number every point 1, 2, 3, 4. Without those numbers, the notes are just a pile of facts with no sequence: impossible to know what led to what, which is the cause and which is the effect. The numbers do not change the content of each point, but they tell you exactly where each one sits in the story. Positional Encoding is those numbers written next to every word: a small tag that tells the model ‘this is word 1’, ‘this is word 5’, ‘this is word 12’, so sequence and order are never lost.
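The original paper’s sinusoidal scheme can be sketched in a few lines. The formula is the real one; only the tiny dimension (4 instead of 512) is chosen for readability.

```python
import math

def positional_encoding(position: int, dim: int) -> list[float]:
    """Sinusoidal position signal from the original Transformer paper:
    even dimensions use sine, odd dimensions use cosine, each pair at a
    different frequency, so every position gets a unique fingerprint."""
    pe = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# The signal for position 0 differs from position 5: order is never lost.
print(positional_encoding(0, 4))
print(positional_encoding(5, 4))
```

This vector is simply added to the token’s embedding, so the same word carries a slightly different signature at every position.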
Stage 3:
Multi-Head Self-Attention: Deep Study with Cross-Topic Understanding

This is the beating heart of the Encoder. Self-Attention allows every word to look at every other word in the sentence and decide how much attention to pay to each, dynamically and contextually.
Multi-Head means this process happens in parallel across multiple attention ‘heads’. Each head focuses on a different type of relationship: one might focus on grammatical structure, another on semantic meaning, another on coreference. The outputs are then concatenated and projected.
Analogy:
A truly sharp student does not study Biology in isolation. While reading about cells, they connect to Chemistry for reactions, Physics for energy, and Mathematics for statistics. Each subject gives a different perspective on the same material. Multi-Head Attention works identically: each head reads the entire sentence from a different angle, one tracking grammar, another tracking meaning, another tracking which words refer to which.
What the original analogy nailed: multi-head attention = studying more than one subject at once.
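Here is a toy single-head sketch of scaled dot-product attention on plain Python lists, using hypothetical 2-dimensional vectors. A multi-head layer would run several such heads in parallel, each with its own learned projections, and concatenate the results.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every query scores every key,
    softmax turns scores into weights, and the output is a weighted
    mix of the values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy 2-dimensional vectors for a 3-token sentence (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Self-attention: the sequence attends to itself (Q = K = V = x).
print(attention(x, x, x))
```

In a real layer, Q, K, and V are not the raw embeddings but learned linear projections of them, one set per head.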
Stage 4:
Add & Norm: Merging Notes & Sanity Check
After each major sub-layer, Self-Attention and Feed-Forward, the Transformer performs two crucial operations: a Residual Connection (Add) and Layer Normalization (Norm).
The Residual Connection adds the layer’s input back to its output. This means knowledge from before the attention step is never lost; it is carried forward and merged with the new understanding. Layer Normalization then stabilizes the values, preventing any single dimension from dominating.
Analogy:
After every intense study session, a wise student does two things. First, they go back and merge their new notes with their original ones, so nothing previously learned is lost. Then they do a quick review to make sure their understanding is balanced and not wildly skewed by one session. This is exactly Add & Norm: carry forward what was already known, then stabilize before moving to the next round.
Why this matters: without residual connections, training very deep networks (6, 12, even 96 layers) becomes practically impossible because gradients vanish. This single component is what makes modern large language models feasible.
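A minimal sketch of one Add & Norm step on a single position’s vector, with the learned gain and bias of real LayerNorm omitted for simplicity:

```python
import math

def add_and_norm(layer_input, layer_output, eps=1e-5):
    """Residual connection (Add) followed by layer normalization (Norm),
    applied to one position's vector. Learned scale/shift omitted."""
    added = [x + y for x, y in zip(layer_input, layer_output)]
    mean = sum(added) / len(added)
    var = sum((a - mean) ** 2 for a in added) / len(added)
    return [(a - mean) / math.sqrt(var + eps) for a in added]

# Even if the layer produced nothing useful (all zeros), the residual
# path carries the original input through, so nothing is lost.
residual_only = add_and_norm([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
print(residual_only)
```

The zero-output case above is exactly why residuals help training: the identity path always exists, so gradients can flow straight through deep stacks.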
Stage 5:
Feed-Forward Network: Deep Individual Processing
After attention connects the words, each word’s representation passes independently through a position-wise Feed-Forward Network: two linear layers with a ReLU activation between them.
Attention discovers relationships. The FFN processes them. Together, they complete one full Encoder layer.
Analogy:
Once the student has connected all their subjects together, they sit down for focused, individual revision on each topic separately. Biology on its own. Chemistry on its own. The cross-subject connections are already made; now it is time to go deep on each one independently. The FFN does exactly this: it processes each word’s representation in isolation, after attention has already found all the relationships.
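A toy position-wise FFN sketch, assuming random weights and tiny dimensions (model size 2, hidden size 4) purely for illustration; the original Transformer used 512 and 2048:

```python
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def feed_forward(vec, w1, b1, w2, b2):
    """Position-wise FFN: Linear -> ReLU -> Linear, applied to one
    token's vector at a time, with no interaction between positions."""
    hidden = [relu(sum(v * w for v, w in zip(vec, col)) + b)
              for col, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(w2, b2)]

# Tiny random weights: real models learn these during training.
d_model, d_ff = 2, 4
w1 = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
w2 = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b2 = [0.0] * d_model

# Each position is processed independently: same weights, separate calls.
sentence = [[1.0, 0.5], [0.2, -0.3]]
outputs = [feed_forward(tok, w1, b1, w2, b2) for tok in sentence]
print(outputs)
```

The key structural point is visible in the last two lines: the same weights are applied to every position, but each position is transformed on its own.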
Stage 6:
Stacked Encoder Layers (N = 6): Multiple Rounds of Revision
The original Transformer stacks six Encoder layers on top of each other. Each layer takes the output of the previous one and refines it further, building progressively richer, more abstract representations.
Analogy:
No student masters a subject in one sitting. The first read gives a rough map. The second reveals nuance. Later passes uncover underlying principles. By the sixth revision, the student has the kind of deep, flexible understanding that can handle any question thrown at them. Each Encoder layer is one of those revision passes, each one refining and deepening what the previous one built.
The Decoder: Writing the Exam
The Decoder’s job is to generate the output one token at a time, using both the Encoder’s rich understanding of the input and everything it has already written so far. It is the student, sitting in the exam hall, translating months of preparation into a coherent, correct answer.
Stage 7:
Output Embeddings + Positional Encoding: Reading Your Answer Sheet
Just as the Encoder starts by embedding its inputs, the Decoder embeds the tokens already generated and adds positional encoding. The model needs to know what it has written so far and where each token sits in the output sequence.
Analogy:
Before writing each new sentence, the student quickly re-reads what they have already written. They know what point they made in paragraph one, where they are in the argument, and what logically comes next. The Decoder’s Output Embeddings and Positional Encoding do the same: they tell the model what has already been generated and exactly where each token sits in the output so far.
Stage 8:
Masked Multi-Head Self-Attention: No Peeking Ahead
This is Self-Attention within the Decoder, but with one critical constraint: it is masked. The model can only attend to tokens it has already generated, not to future tokens. This masking prevents the model from ‘cheating’ during training by looking ahead at tokens it has not yet produced.
Analogy:
During an exam, a student can only build on sentences they have already written they cannot reference an answer they have not yet reached. They work strictly forward. Masked Self-Attention enforces this same discipline: when generating word 7, the model can only see words 1 through 6. No looking ahead. No shortcuts. Every word is earned from what came before.
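The mask itself is simple to sketch: a lower-triangular visibility table saying which positions each position is allowed to look at.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Position i may attend only to positions 0..i
    (1 = visible, 0 = masked out)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# For a 4-token output: when generating token 3 (row index 2),
# only tokens 1-3 are visible; token 4 is hidden.
for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In practice the masked positions get their attention scores set to negative infinity before softmax, so their weights become exactly zero.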
Stage 9:
Cross-Attention: Recalling Study Notes Mid-Exam
This is the crucial bridge between Encoder and Decoder. The Decoder generates ‘queries’ from what it is currently writing. These queries attend to the Encoder’s ‘keys’ and ‘values’: the full, rich understanding of the input. Each output token is thus informed by the entire input context.
Analogy:
Here is the most important moment in the exam: the student reaches into memory and pulls out exactly what they studied. The question about photosynthesis instantly triggers recall of the Calvin cycle, ATP production, chlorophyll’s role. This retrieval of study knowledge while writing the exam is Cross-Attention: the Decoder actively reaching back into the Encoder’s deep understanding of the input to inform every single word it writes.
What the original analogy captured: ‘while writing the exam he gets info’. Yes, that is Cross-Attention, precisely.
Stage 10:
Feed-Forward Network + Add & Norm: Formulating the Answer
Identical in structure to the Encoder: after attention, each position goes through a Feed-Forward Network, followed by a Residual Connection and Layer Normalization.
Analogy:
With the relevant knowledge recalled, the student now formulates their answer mentally before committing pen to paper: organizing the ideas into coherent sentences, checking the logic, trimming the excess. The Decoder’s Feed-Forward Network does this same internal formulation: it processes and refines each position’s representation, turning raw recalled knowledge into something ready to be expressed as a word.
Stage 11:
Linear Layer: Scanning the Vocabulary
The Decoder’s final output vector is projected through a Linear layer that maps the internal representation to the full vocabulary size: potentially 50,000 or more dimensions, one per possible token.
Analogy:
The student’s understanding at this point is rich but still internal: a complex web of ideas in their mind. To write the answer, they must translate that internal understanding into actual words. The Linear Layer does exactly this: it projects the model’s complex internal representation outward, producing a raw score for every single word in the vocabulary, rating how well each one fits as the next token to write.
Stage 12:
Softmax: The Multiple-Choice Moment
Softmax converts the raw vocabulary scores produced by the Linear Layer into a clean probability distribution across all possible tokens. The token with the highest probability is typically selected as the next output word. Every other candidate is suppressed. The model commits to exactly one word and moves on.
This is not random selection. The better the model has been trained, the more sharply its probability concentrates on the correct token and the more confidently it writes.
Analogy:
Every student knows the feeling of a well-designed multiple-choice question. Four options sit in front of them: A, B, C, D. For a student who studied thoroughly, one option immediately stands out as unmistakably correct; the others feel obviously wrong the moment they are read. The student circles A without hesitation and moves on. For a student who barely revised, all four options feel plausible, none stands out, and they guess. Softmax is that moment of recognition: it takes the raw scores for every word in the vocabulary, converts them into a clean set of probabilities, and the one that rises clearly to the top gets written down. A well-trained model circles the right answer immediately. A poorly trained one is left guessing between equally likely options.
What the original analogy nailed: ‘curate in order Softmax’. The selection and commitment aspect is precisely right.
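Here is a minimal sketch of that moment, using hypothetical scores over a made-up four-word vocabulary, contrasting a confident model with an uncertain one:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the Linear layer for a toy vocabulary.
vocab = ["cat", "mat", "dog", "sat"]
confident = softmax([8.0, 1.0, 0.5, 1.5])  # well-trained: one clear winner
uncertain = softmax([1.0, 1.1, 0.9, 1.0])  # barely trained: nearly a guess

# Greedy selection: circle the option with the highest probability.
best = vocab[max(range(len(vocab)), key=lambda i: confident[i])]
print(best)                            # cat
print(max(confident), max(uncertain))  # sharp peak vs. nearly flat
```

The contrast between the two distributions is the whole story: training sharpens the peak, and the sharper the peak, the more confidently the model writes.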
Final Thought
What makes this analogy powerful is that it grounds an abstract mathematical architecture in a universal human experience. Every person who has crammed for an exam, drawn connections between subjects, or carefully chosen their words in an answer has intuitively lived through the Transformer architecture.
The Transformer does not just process text it studies, connects, recalls, and responds. Just like the best students do.
“The attention mechanism is not artificial intelligence imitating a machine. It is artificial intelligence imitating a student.”
Thanks
Sreeni Ramadorai

