How Google Translate & ChatGPT Work: The Transformer, Unboxed

What Exactly Is a Transformer? 🤔

Ever used Google Translate 🌍 or chatted with ChatGPT 💬?

Behind both lies the same breakthrough: the Transformer ⚡.

Imagine an AI that doesn’t read sentences like a robot 🤖—one word at a time—but like a human 🧠:

instantly grasping how every word connects to every other word.

That’s the Transformer.

It’s a revolutionary AI architecture that understands and generates language by focusing on the most important words at once—no slow, step-by-step reading required.

Born in 2017 from the paper that changed everything—“Attention Is All You Need”✨—the Transformer ditched old-school methods and bet everything on one powerful idea: attention 👀.

And it worked. So well, in fact, that it now powers the smartest language tools you use every day.

📄 Curious how it all began?

Read the original paper here: “Attention Is All You Need”


High-Level Architecture: The Two Main Parts

The Transformer has two big sections that work together, like two teams: the Encoder and the Decoder.

The process works in a few simple steps:

  1. Input Ready: We take the starting sentence (here, the English one: "How are you?"). We turn each word into a list of numbers (this is Embedding), and we also add a special code (Positional Encoding) so the Transformer knows the order of the words.
  2. The Encoder's Job: The Encoder reads the whole input sentence and figures out the complete meaning and context of every word. It creates a detailed "thought" for the sentence.
  3. The Decoder's Job: The Decoder starts with a "Start" signal. It looks at the Encoder's "thought" and starts writing the new sentence (the Hindi one), one word at a time.
  4. Final Output: A simple layer (Linear and Softmax) at the end chooses the most likely word to be the next one in the sentence. (A code sketch of this loop follows the list.)
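If you prefer to see this as code, here is a tiny, runnable Python sketch of that four-step loop. Everything in it is a stand-in: `encode` just returns random vectors of the right shape, and `decode_step` simply walks through the reference translation, so only the overall flow mirrors a real Transformer.

```python
import numpy as np

# Toy reference translation; real decoders pick words from a learned vocabulary.
reference_translation = ["तुम", "कैसे", "हो", "<end>"]

def encode(source_tokens, d_model=512):
    """Stand-in for the Encoder: one 'thought' vector per input word.
    A real encoder uses embeddings + attention; random numbers are used
    here only to show the shapes involved."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(source_tokens), d_model))

def decode_step(memory, generated_so_far):
    """Stand-in for Decoder + Linear + Softmax: returns the next word.
    Here it just reads off the reference translation."""
    return reference_translation[len(generated_so_far)]

source = ["How", "are", "you"]
memory = encode(source)            # Step 2: the Encoder's "thought"
generated = []                     # Step 3: the Decoder starts empty...
next_word = "<start>"              # ...apart from the <start> signal
while next_word != "<end>":
    next_word = decode_step(memory, generated)   # Step 4: pick the next word
    generated.append(next_word)

print(" ".join(generated[:-1]))    # तुम कैसे हो
```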

🧑‍🍳 A Quick Analogy: The Bilingual Cooking Show

Imagine a chef who must recreate a dish from a foreign recipe—but doesn’t speak the language.

  • The Encoder is like a team of expert tasters who read the whole original recipe at once and create a Master Flavor Map.
  • The Decoder is the recreating chef who:
    • Starts with a <start> note,
    • Can only look at what they’ve already cooked (no peeking ahead!),
    • And keeps glancing at the Flavor Map to decide the next ingredient.
  • Finally, a pantry assistant (Linear + Softmax) picks the most likely Hindi word (ingredient) for each step.

This is exactly how the Transformer translates "How are you?" → "तुम कैसे हो?" — one smart, attentive step at a time!

High-Level Block Diagram

This shows the big picture. Now, let's open up these big blocks and see the smaller, powerful layers inside!


🏗️ Inside the Input Block (Encoder Input)

The Input Block is the very first step on both the Encoder and Decoder side of the Transformer. It takes the original words and prepares them for the attention layers.

Let’s follow the flow in the diagram — from bottom to top — to see exactly how the input sentence gets ready for the Transformer.

Step 1: Tokenizer

  • The sentence "How are you" goes into the Tokenizer.
  • It breaks the sentence into individual words (or subwords): → "How", "are", "you"

💡 Think of this like cutting a sandwich into pieces before eating it: the model works on these word-sized pieces (tokens), not on the raw sentence.

Step 2: Embedding (512 dim)

  • Each word ("How", "are", "you") is sent to the Embedding layer.
  • This turns each word into a 512-number list (called a vector).
  • These are called Word Embeddings: E1, E2, E3 (each has shape (512,), i.e. a list of 512 numbers).

✅ Example:

  • "How"E1 = [0.2, -0.8, 0.9, ..., 0.1]
  • "are"E2 = [0.7, 0.3, -0.6, ..., 0.4]
  • "you"E3 = [-0.1, 0.9, 0.2, ..., -0.7]

Step 3: Positional Embeddings

  • At the same time, each word gets a Positional Embedding: P1, P2, P3.
  • These are not learned — they’re precomputed using special math (sine/cosine waves) so every position has a unique pattern.
  • Why? So the model knows that "How" is first, "are" is second, "you" is third.

🧩 Without this, "Cat chases dog" and "Dog chases cat" would look identical!

Step 4: Add Them Together → Positional Encoded Vectors

  • For each word, we add its Word Embedding and Positional Embedding:
    • X1 = E1 + P1
    • X2 = E2 + P2
    • X3 = E3 + P3

These final vectors — X1, X2, X3 — are called Positional Encoded Vectors.

  • Each is still 512 numbers — but now they contain both meaning AND position.

🎯 This is the magic: the model now has all the info it needs to start paying attention!
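Here is a small NumPy sketch of these four steps. The three-word vocabulary and the randomly initialised embedding table are assumptions made so the snippet runs on its own; the sine/cosine positional encoding follows the formula from the original paper.

```python
import numpy as np

d_model = 512
vocab = {"How": 0, "are": 1, "you": 2}            # toy vocabulary (assumption)
tokens = ["How", "are", "you"]

# Step 2: word embeddings. In a real model this table is learned.
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), d_model))
E = embedding_table[[vocab[t] for t in tokens]]    # shape (3, 512)

# Step 3: sinusoidal positional embeddings, as in the original paper:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
positions = np.arange(len(tokens))[:, None]        # (3, 1)
two_i = np.arange(0, d_model, 2)[None, :]          # (1, 256)
angles = positions / np.power(10000, two_i / d_model)
P = np.zeros((len(tokens), d_model))
P[:, 0::2] = np.sin(angles)
P[:, 1::2] = np.cos(angles)

# Step 4: add them together to get the positional encoded vectors X1, X2, X3.
X = E + P
print(X.shape)                                     # (3, 512)
```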

📌 What Happens Next?

These X1, X2, X3 vectors are now ready to go into the first Encoder block — where the real “attention” begins!


🧱 Inside the Encoder Block

The Encoder takes your input sentence (like "How are you?") and turns it into a deep, contextual understanding of every word.

It does this in two main steps: first, a Multi-Head Attention Block lets each word understand its relationship to all others. Then, a Feed Forward Neural Network Block refines that meaning further.

This whole process repeats 6 times — each time making the understanding richer.

Let’s walk through one full Encoder block using the detailed diagram, from bottom to top.

➡️ Step 1: Input — Positional Encoded Vectors (X1, X2, X3)

  • Input shape: (3, 512) → 3 words, each as a 512-number vector.
  • These come from the Input Block (after adding Word + Positional Embeddings).

🟢 Step 2: Multi Head Attention

  • Each word (X1, X2, X3) looks at all other words to understand context.
  • Output: Contextual Embeddings → Z1, Z2, Z3 (still (3, 512)).
  • This is where the model learns that "you" should pay attention to "How" and "are".
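For the curious, here is a hedged NumPy sketch of what one attention head computes. The weight matrices are random stand-ins for learned ones, the output projection that real multi-head attention applies after concatenation is omitted, and the head size of 64 comes from the paper's 512 / 8 split.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, n_heads = 512, 8
d_k = d_model // n_heads                      # 64 dimensions per head

rng = np.random.default_rng(0)
X = rng.normal(size=(3, d_model))             # X1, X2, X3 from the Input Block

def one_head(X):
    """A single attention head; multi-head attention runs 8 of these
    in parallel and concatenates the results back to 512 dims."""
    Wq = rng.normal(size=(d_model, d_k))      # learned in a real model
    Wk = rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (3, 64)
    scores = Q @ K.T / np.sqrt(d_k)           # (3, 3): every word vs. every word
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (3, 64): context-mixed vectors

Z = np.concatenate([one_head(X) for _ in range(n_heads)], axis=-1)
print(Z.shape)                                # (3, 512), i.e. Z1, Z2, Z3
```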

➕ Step 3: Residual Connection + Layer Normalisation

  • Add the original input back:
    • Z1' = Z1 + X1
    • Z2' = Z2 + X2
    • Z3' = Z3 + X3
  • Apply Layer Normalisation → Z1norm, Z2norm, Z3norm

✅ This helps the model train better — keeps information flowing without getting lost.

🟣 Step 4: Feed Forward Neural Network (FFNN) Block

This is where each word gets its own private “thinking room”:

A. First Linear Layer + ReLU

  • Input: Z1norm, Z2norm, Z3norm → (3, 512)
  • Multiply by weight matrix W1 (size 512 × 2048)
  • Add bias B1
  • Apply ReLU → adds non-linearity → output shape: (3, 2048)

B. Second Linear Layer

  • Multiply by weight matrix W2 (size 2048 × 512)
  • Add bias B2
  • Output: Y1, Y2, Y3 → (3, 512)

💡 Think of this as a small brain for each word — refining its meaning after the group discussion.
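As a quick sketch, the whole FFNN block is just two matrix multiplications with a ReLU in between. The matrices below are random stand-ins for the learned W1, W2, B1, B2.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(1)
Z_norm = rng.normal(size=(3, d_model))        # stand-in for Z1norm..Z3norm

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

hidden = np.maximum(0, Z_norm @ W1 + b1)      # First Linear + ReLU: (3, 2048)
Y = hidden @ W2 + b2                          # Second Linear back down: (3, 512)
print(hidden.shape, Y.shape)                  # (3, 2048) (3, 512)
```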

➕ Step 5: Final Residual + Layer Normalisation

  • Add the input (Z1norm, Z2norm, Z3norm) back to the FFN output:

Y1' = Y1 + Z1norm
Y2' = Y2 + Z2norm
Y3' = Y3 + Z3norm

  • Then apply Layer Normalisation → Y1norm, Y2norm, Y3norm

These become the final output of one Encoder block.

📌 Important: In the original “Attention Is All You Need” paper, this entire Encoder block is repeated 6 times in a chain:

Input → Encoder 1 → Output 1 → Encoder 2 → Output 2 → Encoder 3 → Output 3 → Encoder 4 → Output 4 → Encoder 5 → Output 5 → Encoder 6 → Final Encoder Output

Each encoder takes the output of the previous one as its input, building deeper and richer understanding at every stage.
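Here is a minimal sketch of that chaining. The `attention` and `ffn` arguments are identity-function placeholders (a real block would plug in the layers described above); the point is only the Add & Norm wiring and the 6× loop.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each vector to zero mean / unit variance
    (the learned scale and shift parameters are omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, attention, ffn):
    """One Encoder block: Attention -> Add & Norm -> FFN -> Add & Norm."""
    z = layer_norm(x + attention(x))   # Steps 2-3
    return layer_norm(z + ffn(z))      # Steps 4-5

def identity(t):                       # placeholder sub-layer
    return t

x = np.random.default_rng(0).normal(size=(3, 512))
for _ in range(6):                     # Encoder 1 -> ... -> Encoder 6
    x = encoder_block(x, attention=identity, ffn=identity)
print(x.shape)                         # (3, 512): the final Encoder output
```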


🧱 Inside the Decoder Input Block

Now that the Encoder has finished its job, it’s time for the Decoder to start writing the output sentence — but not quite yet. First, it needs its own special input.

This is where the Decoder Input Block comes in — and as the diagram shows, it’s almost identical to the Encoder Input Block… with one very important twist: the Right Shift.

The Decoder Input Block prepares the target sentence (e.g., "तुम कैसे हो") so the Decoder can learn to generate it one word at a time — without cheating by looking ahead.

It does this by adding a <start> token and shifting everything right, so each step only sees what came before.

➡️ Step 1: Right Shift

  • Start with the target sentence: "तुम कैसे हो"
  • Add a special <start> token at the beginning: → "<start> तुम कैसे हो"
  • Adding this token shifts the entire sequence one position to the right, so at every step the decoder only sees words that come before the one it must predict.

The result is a new input sequence for the decoder:

| Position | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Decoder Input | <start> | तुम | कैसे | हो |
| Target Output | तुम | कैसे | हो | <end> |

💡 Why?

During training, the Decoder uses this shifted input to predict the next word:

  • To predict "तुम", it only sees <start>
  • To predict "कैसे", it sees <start> + तुम
  • It never sees "हो" when predicting "कैसे"

This forces the model to generate text causally—just like writing a sentence from left to right.
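A tiny, runnable sketch of this right shift, using plain Python lists, makes the "never peek ahead" rule easy to see:

```python
target = ["तुम", "कैसे", "हो"]

decoder_input = ["<start>"] + target      # what the Decoder is fed
target_output = target + ["<end>"]        # what it must learn to predict

for step, word_to_predict in enumerate(target_output, start=1):
    visible = decoder_input[:step]        # only earlier positions are visible
    print(f"step {step}: sees {visible} -> must predict {word_to_predict!r}")

# step 1: sees ['<start>'] -> must predict 'तुम'
# step 2: sees ['<start>', 'तुम'] -> must predict 'कैसे'
# step 3: sees ['<start>', 'तुम', 'कैसे'] -> must predict 'हो'
# step 4: sees ['<start>', 'तुम', 'कैसे', 'हो'] -> must predict '<end>'
```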

➡️ Step 2: Tokenizer

  • The shifted sequence (<start> तुम कैसे हो) goes into the Tokenizer.
  • It breaks it into individual tokens: → <start>, तुम, कैसे, हो

➡️ Step 3: Embedding (512 dim)

  • Each token gets turned into a 512-number vector via the Embedding layer.
  • These are called Word Embeddings: E1, E2, E3, E4

✅ Example:

  • <start> → E1 = [0.1, -0.9, 0.3, ..., 0.7]
  • तुम → E2 = [0.8, 0.2, -0.6, ..., 0.1]
  • कैसे → E3 = [-0.4, 0.9, 0.5, ..., -0.3]
  • हो → E4 = [0.6, -0.1, 0.8, ..., 0.2]

➡️ Step 4: Positional Embeddings

  • Just like the Encoder, each token also gets a Positional Embedding: P1, P2, P3, P4
  • These are precomputed (using sine/cosine waves) to tell the model the position of each token.

➡️ Step 5: Add Them Together → Positional Encoded Vectors

  • For each token, we add its Word Embedding and Positional Embedding:
  • X1 = E1 + P1 → for <start>
  • X2 = E2 + P2 → for तुम
  • X3 = E3 + P3 → for कैसे
  • X4 = E4 + P4 → for हो

These final vectors — X1, X2, X3, X4 — are called Positional Encoded Vectors.

  • Each is still 512 numbers — but now they contain both meaning AND position.

🎯 This is the magic: the Decoder now has all the info it needs to start generating — one word at a time, without peeking ahead!

📌 What Happens Next?

These X1, X2, X3, X4 vectors are now ready to enter the first Decoder Block — where they’ll meet the Encoder’s “thought” through Cross-Attention.


🧱 Inside the Decoder Block

Now that we have our Positional Encoded Vectors (X1, X2, X3, X4) from the Decoder Input Block, they’re ready to enter the Decoder Block.

This block has three main parts, stacked one after another:

  1. Masked Self-Attention Block
  2. Cross Attention Block
  3. Feed Forward Neural Network Block

And just like the Encoder, this whole structure is repeated 6 times (Decoder 1 → Decoder 6).

Let’s walk through one full Decoder block — from bottom to top — using the detailed diagram.

➡️ Step 1: Input — Positional Encoded Vectors (X1, X2, X3, X4)

  • Input shape: (4, 512) → 4 words (including <start>), each as a 512-number vector.
  • These come from the Decoder Input Block (after adding Word + Positional Embeddings).

🟢 Step 2: Masked Multi Head Attention

  • Each word (X1, X2, X3, X4) looks at all previous words — but not future ones.
  • Why? Because during training, the model must predict the next word without seeing it!
  • This is called Masked Self-Attention — the “mask” blocks out future positions.
  • Output: Contextual Embeddings → Z1, Z2, Z3, Z4

💡 Example:

  • When predicting "कैसे", it can see <start> and तुम — but not हो.
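Here is a small NumPy sketch of how the mask is typically applied: positions above the diagonal of the score matrix are set to minus infinity before the softmax, so their attention weights come out as exactly zero. The scores below are random stand-ins for real Query·Key products.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4                                    # <start>, तुम, कैसे, हो
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for the Q·K scores

# Causal mask: -inf above the diagonal, so softmax sends those weights to 0.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask, axis=-1)

print(np.round(weights, 2))
# Row i has non-zero weights only on columns 0..i, so the position fed "तुम"
# (which must predict "कैसे") can attend to <start> and तुम, but never to हो.
```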

➕ Step 3: Residual Connection + Layer Normalisation

  • Add the original input back:
    • Z1' = Z1 + X1
    • Z2' = Z2 + X2
    • Z3' = Z3 + X3
    • Z4' = Z4 + X4
  • Apply Layer Normalisation → Z1norm, Z2norm, Z3norm, Z4norm

✅ This helps the model train better — keeps information flowing without getting lost.

🟠 Step 4: Cross Attention

This is where the magic happens — the Decoder talks to the Encoder!

  • The Decoder takes its own normalized vectors (Z1norm, Z2norm, Z3norm, Z4norm) as Queries.
  • It uses the Encoder’s final output (from Encoder 6) as Keys and Values.
  • This lets the Decoder focus on the most relevant parts of the input sentence.
    • For example: when generating "हो", it might look back at the Encoder’s understanding of "you".

Output: Cross-Attention Embeddings → Zc1, Zc2, Zc3, Zc4

💡 Think of this as the Decoder asking: “Hey Encoder — what part of the English sentence should I focus on right now?”
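A hedged, single-head NumPy sketch of cross-attention follows (real cross-attention is multi-head with 64-dimensional heads, and the weight matrices below are random stand-ins for learned ones). The key point: Queries come from the Decoder's 4 positions, while Keys and Values come from the Encoder's 3 positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 512
rng = np.random.default_rng(0)
decoder_state = rng.normal(size=(4, d_model))   # Z1norm..Z4norm (4 target tokens)
encoder_output = rng.normal(size=(3, d_model))  # final output of Encoder 6

Wq = rng.normal(size=(d_model, d_model))        # learned in a real model
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

Q = decoder_state @ Wq                          # Queries come from the Decoder
K = encoder_output @ Wk                         # Keys come from the Encoder
V = encoder_output @ Wv                         # Values come from the Encoder

weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)   # (4, 3)
Zc = weights @ V                                # (4, 512): Zc1..Zc4
print(weights.shape, Zc.shape)
# Each of the 4 target positions gets a weighting over the 3 source words.
```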

➕ Step 5: Residual Connection + Layer Normalisation

  • Add the input (Z1norm, Z2norm, Z3norm, Z4norm) back to the cross-attention output:

Zc1' = Zc1 + Z1norm
Zc2' = Zc2 + Z2norm
Zc3' = Zc3 + Z3norm
Zc4' = Zc4 + Z4norm

  • Apply Layer Normalisation → Zc1norm, Zc2norm, Zc3norm, Zc4norm

🟣 Step 6: Feed Forward Neural Network (FFNN) Block

This is where each word gets its own private “thinking room” — same as in the Encoder:

A. First Linear Layer + ReLU

  • Input: Zc1norm, Zc2norm, Zc3norm, Zc4norm → (4, 512)
  • Multiply by weight matrix W1 (size 512 × 2048)
  • Add bias B1
  • Apply ReLU → adds non-linearity → output shape: (4, 2048)

B. Second Linear Layer

  • Multiply by weight matrix W2 (size 2048 × 512)
  • Add bias B2
  • Output: Y1, Y2, Y3, Y4 → (4, 512)

💡 Think of this as refining each word’s meaning after listening to both itself (self-attention) and the Encoder (cross-attention).

➕ Step 7: Final Residual + Layer Normalisation

  • Add the input (Zc1norm, Zc2norm, Zc3norm, Zc4norm) back to the FFN output:

Y1' = Y1 + Zc1norm
Y2' = Y2 + Zc2norm
Y3' = Y3 + Zc3norm
Y4' = Y4 + Zc4norm

  • Apply Layer Normalisation → Y1norm, Y2norm, Y3norm, Y4norm

These become the final output of one Decoder block.

🔄 Repeat 6 Times

This entire process — Masked Self-Attention → Residual → Norm → Cross-Attention → Residual → Norm → FFN → Residual → Norm — happens 6 times in a row.

After Decoder 6, the model has a rich, context-aware understanding of what to generate next — ready for the Final Output Block.
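As a sketch, one Decoder block is just three Add & Norm sub-layers chained together, and the stack is a 6-iteration loop. The sub-layers below are placeholders so the wiring runs on its own; a real block would plug in the layers described above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / (x.std(axis=-1, keepdims=True) + eps)

def decoder_block(x, memory, masked_self_attn, cross_attn, ffn):
    """One Decoder block: Masked Self-Attention -> Add & Norm ->
    Cross-Attention -> Add & Norm -> FFN -> Add & Norm."""
    z = layer_norm(x + masked_self_attn(x))        # Steps 2-3
    zc = layer_norm(z + cross_attn(z, memory))     # Steps 4-5
    return layer_norm(zc + ffn(zc))                # Steps 6-7

def identity(t):                                   # placeholder sub-layer
    return t

def ignore_memory(t, memory):                      # a real cross-attention layer
    return t                                       # would attend to memory

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))        # X1..X4 from the Decoder Input Block
memory = rng.normal(size=(3, 512))   # final output of Encoder 6

for _ in range(6):                   # Decoder 1 -> ... -> Decoder 6
    x = decoder_block(x, memory, identity, ignore_memory, identity)
print(x.shape)                       # (4, 512): ready for the Final Output Block
```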


🎯 Final Output Block: Turning Numbers into Words

After the last Decoder block (Decoder 6), we have four final vectors: Y1fnorm, Y2fnorm, Y3fnorm, Y4fnorm — each of shape (512,).

These vectors are the model’s “best guess” for what each word in the output sentence should be. But they’re still just numbers. To turn them into actual words like "तुम", "कैसे", "हो", and <end>, we need the Final Output Block.

The same Linear + Softmax block is applied once for each output position — so in our example it runs 4 times, once for each word.

Let’s walk through the first block — the one that predicts the very first word: "तुम".

➡️ Step 1: Input — Y1fnorm

  • This is the final vector from Decoder 6 for the first position.
  • Shape: (512,) → 512 numbers.

🟣 Step 2: Linear Layer (512 → V)

  • The vector goes into a linear layer with weights of size 512 × V.
  • V = number of unique words in the output vocabulary (e.g., all Hindi words + <start>, <end>).
  • Output: V values — one score for every possible word.

💡 Think of this as a giant lookup table: it asks, “Given these 512 numbers, how likely is each word to be the next one?”

🟠 Step 3: Softmax

  • The V values go through a softmax function.
  • This turns the scores into probabilities — adding up to 1.0.
  • Output: V probability values — e.g., 90% chance of "तुम", 5% of "कैसे", etc.

🟢 Step 4: Normalisation

  • The "Normalisation" shown in the diagram is really the job the softmax already does: each score is exponentiated and divided by the sum of all of them.
  • So no extra layer is needed here: the softmax itself guarantees the V values are well-scaled probabilities that add up to 1.0.

🎯 Step 5: Return Highest Probability Value

  • The model picks the word with the highest probability.
  • For the first position → it picks "तुम"

✅ This is how the Transformer generates its first word!
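Here is a minimal NumPy sketch of this Linear → Softmax → pick-the-best step, using a toy five-word vocabulary and random weights in place of the trained 512 × V projection (so the word it actually picks here is meaningless; only the shapes and the argmax mirror the real thing).

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d_model = 512
vocab = ["<start>", "तुम", "कैसे", "हो", "<end>"]   # toy output vocabulary, V = 5
V = len(vocab)

rng = np.random.default_rng(0)
y1 = rng.normal(size=d_model)                 # Y1fnorm from Decoder 6

W = rng.normal(size=(d_model, V))             # random stand-in for the 512 x V weights
b = np.zeros(V)

scores = y1 @ W + b                           # one score per vocabulary word
probs = softmax(scores)                       # probabilities summing to 1.0
predicted = vocab[int(np.argmax(probs))]      # pick the highest-probability word

print(np.round(probs, 3), "->", predicted)
# With real trained weights this would look something like
# [0.01 0.90 0.05 0.03 0.01] -> तुम
```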

🔁 Repeat for All Positions

The same process happens for the other three positions:

  • Position 2 → Y2fnorm → predicts "कैसे"
  • Position 3 → Y3fnorm → predicts "हो"
  • Position 4 → Y4fnorm → predicts <end>

Each block is identical — only the input vector (Y1fnorm, Y2fnorm, etc.) changes.

📌 Why This Works

  • The model doesn’t generate all words at once — it does one at a time.
  • Each prediction is based on the full context built by the Encoder and Decoder.
  • The final linear + softmax layer is like a “vocabulary selector” — turning abstract numbers into real words.

This is the final step in the Transformer — where numbers become language!


🌟 Conclusion: The Transformer, Demystified

You’ve just walked through the entire Transformer — from raw words to fluent translation — one block at a time.

You can check the whole diagram here: https://drive.google.com/file/d/1lz68fKBnUtsqi9_9q_7J6MrikSu2oA8e/view?usp=sharing

No magic. No mystery. Just smart design:

  • Attention that sees relationships,
  • Positional codes that preserve order,
  • Residual connections that keep learning stable,
  • And parallel processing that makes it fast.

What started as a sentence — "How are you?" — became numbers, then context, then meaning, and finally: "तुम कैसे हो?"

And the best part?

You now understand how it works — not just at a high level, but deep down to the vectors, layers, and shapes.

The Transformer isn’t just a model. It’s the foundation of modern AI — from translation and chatbots to code generation and beyond.

And you?

You didn’t just read about it.

You followed the data all the way through.

Go ahead — share what you’ve learned.

Because now, you truly see the machine behind the magic. 💫
