How Google Translate & ChatGPT Work: The Transformer, Unboxed

What Exactly Is a Transformer? 🤔

Ever used Google Translate 🌍 or chatted with ChatGPT 💬?

Behind both lies the same breakthrough: the Transformer ⚡.

Imagine an AI that doesn’t read sentences like a robot 🤖—one word at a time—but like a human 🧠:

instantly grasping how every word connects to every other word.

That’s the Transformer.

It’s a revolutionary AI architecture that understands and generates language by focusing on the most important words at once—no slow, step-by-step reading required.

Born in 2017 from the paper that changed everything—“Attention Is All You Need”✨—the Transformer ditched old-school methods and bet everything on one powerful idea: attention 👀.

And it worked. So well, in fact, that it now powers the smartest language tools you use every day.

📄 Curious how it all began?

Read the original paper here: “Attention Is All You Need”


High-Level Architecture: The Two Main Parts

The Transformer has two big sections that work together, like two teams: the Encoder and the Decoder.

The process works in a few simple steps:

  1. Input Ready: We take the starting sentence (here, the English one: "How are you?"). We turn each word into a list of numbers (this is Embedding), and we also add a special code (Positional Encoding) so the Transformer knows the order of the words.
  2. The Encoder's Job: The Encoder reads the whole input sentence and figures out the complete meaning and context of every word. It creates a detailed "thought" for the sentence.
  3. The Decoder's Job: The Decoder starts with a "Start" signal. It looks at the Encoder's "thought" and starts writing the new sentence (the Hindi one), one word at a time.
  4. Final Output: A simple layer (Linear and Softmax) at the end chooses the most likely word to be the next one in the sentence. (A code sketch of this loop follows the list.)
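If you prefer to see this as code, here is a tiny, runnable Python sketch of that four-step loop. Everything in it is a stand-in: `encode` just returns random vectors of the right shape, and `decode_step` simply walks through the reference translation, so only the overall flow mirrors a real Transformer.

```python
import numpy as np

# Toy reference translation; real decoders pick words from a learned vocabulary.
reference_translation = ["तुम", "कैसे", "हो", "<end>"]

def encode(source_tokens, d_model=512):
    """Stand-in for the Encoder: one 'thought' vector per input word.
    A real encoder uses embeddings + attention; random numbers are used
    here only to show the shapes involved."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(source_tokens), d_model))

def decode_step(memory, generated_so_far):
    """Stand-in for Decoder + Linear + Softmax: returns the next word.
    Here it just reads off the reference translation."""
    return reference_translation[len(generated_so_far)]

source = ["How", "are", "you"]
memory = encode(source)            # Step 2: the Encoder's "thought"
generated = []                     # Step 3: the Decoder starts empty...
next_word = "<start>"              # ...apart from the <start> signal
while next_word != "<end>":
    next_word = decode_step(memory, generated)   # Step 4: pick the next word
    generated.append(next_word)

print(" ".join(generated[:-1]))    # तुम कैसे हो
```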

🧑‍🍳 A Quick Analogy: The Bilingual Cooking Show

Imagine a chef who must recreate a dish from a foreign recipe—but doesn’t speak the language.

  • The Encoder is like a team of expert tasters who read the whole original recipe at once and create a Master Flavor Map.
  • The Decoder is the recreating chef who:
    • Starts with a <start> note,
    • Can only look at what they’ve already cooked (no peeking ahead!),
    • And keeps glancing at the Flavor Map to decide the next ingredient.
  • Finally, a pantry assistant (Linear + Softmax) picks the most likely Hindi word (ingredient) for each step.

This is exactly how the Transformer translates "How are you?" → "तुम कैसे हो?" — one smart, attentive step at a time!

High-Level Block Diagram

This shows the big picture. Now, let's open up these big blocks and see the smaller, powerful layers inside!


🏗️ Inside the Input Block (Encoder Input)

The Input Block is the very first step on both the Encoder and Decoder side of the Transformer. It takes the original words and prepares them for the attention layers.

Let’s follow the flow in the diagram — from bottom to top — to see exactly how the input sentence gets ready for the Transformer.

Step 1: Tokenizer

  • The sentence "How are you" goes into the Tokenizer.
  • It breaks the sentence into individual words (or subwords): → "How", "are", "you"

💡 Think of this like cutting a sandwich into pieces before eating it: the model works on these word-sized pieces (tokens), not on the raw sentence.

Step 2: Embedding (512 dim)

  • Each word ("How", "are", "you") is sent to the Embedding layer.
  • This turns each word into a 512-number list (called a vector).
  • These are called Word Embeddings: E1, E2, E3 (each has shape (512,), i.e. a list of 512 numbers).

✅ Example:

  • "How"E1 = [0.2, -0.8, 0.9, ..., 0.1]
  • "are"E2 = [0.7, 0.3, -0.6, ..., 0.4]
  • "you"E3 = [-0.1, 0.9, 0.2, ..., -0.7]

Step 3: Positional Embeddings

  • At the same time, each word gets a Positional Embedding: P1, P2, P3.
  • These are not learned — they’re precomputed using special math (sine/cosine waves) so every position has a unique pattern.
  • Why? So the model knows that "How" is first, "are" is second, "you" is third.

🧩 Without this, "Cat chases dog" and "Dog chases cat" would look identical!

Step 4: Add Them Together → Positional Encoded Vectors

  • For each word, we add its Word Embedding and Positional Embedding:
    • X1 = E1 + P1
    • X2 = E2 + P2
    • X3 = E3 + P3

These final vectors — X1, X2, X3 — are called Positional Encoded Vectors.

  • Each is still 512 numbers — but now they contain both meaning AND position.

🎯 This is the magic: the model now has all the info it needs to start paying attention!
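Here is a small NumPy sketch of these four steps. The three-word vocabulary and the randomly initialised embedding table are assumptions made so the snippet runs on its own; the sine/cosine positional encoding follows the formula from the original paper.

```python
import numpy as np

d_model = 512
vocab = {"How": 0, "are": 1, "you": 2}            # toy vocabulary (assumption)
tokens = ["How", "are", "you"]

# Step 2: word embeddings. In a real model this table is learned.
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), d_model))
E = embedding_table[[vocab[t] for t in tokens]]    # shape (3, 512)

# Step 3: sinusoidal positional embeddings, as in the original paper:
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
positions = np.arange(len(tokens))[:, None]        # (3, 1)
two_i = np.arange(0, d_model, 2)[None, :]          # (1, 256)
angles = positions / np.power(10000, two_i / d_model)
P = np.zeros((len(tokens), d_model))
P[:, 0::2] = np.sin(angles)
P[:, 1::2] = np.cos(angles)

# Step 4: add them together to get the positional encoded vectors X1, X2, X3.
X = E + P
print(X.shape)                                     # (3, 512)
```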

📌 What Happens Next?

These X1, X2, X3 vectors are now ready to go into the first Encoder block — where the real “attention” begins!


🧱 Inside the Encoder Block

The Encoder takes your input sentence (like "How are you?") and turns it into a deep, contextual understanding of every word.

It does this in two main steps: first, a Multi-Head Attention Block lets each word understand its relationship to all others. Then, a Feed Forward Neural Network Block refines that meaning further.

This whole process repeats 6 times — each time making the understanding richer.

Let’s walk through one full Encoder block using the detailed diagram, from bottom to top.

➡️ Step 1: Input — Positional Encoded Vectors (X1, X2, X3)

  • Input shape: (3, 512) → 3 words, each as a 512-number vector.
  • These come from the Input Block (after adding Word + Positional Embeddings).

🟢 Step 2: Multi Head Attention

  • Each word (X1, X2, X3) looks at all other words to understand context.
  • Output: Contextual Embeddings → Z1, Z2, Z3 (still (3, 512)).
  • This is where the model learns that "you" should pay attention to "How" and "are".
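For the curious, here is a hedged NumPy sketch of what one attention head computes. The weight matrices are random stand-ins for learned ones, the output projection that real multi-head attention applies after concatenation is omitted, and the head size of 64 comes from the paper's 512 / 8 split.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, n_heads = 512, 8
d_k = d_model // n_heads                      # 64 dimensions per head

rng = np.random.default_rng(0)
X = rng.normal(size=(3, d_model))             # X1, X2, X3 from the Input Block

def one_head(X):
    """A single attention head; multi-head attention runs 8 of these
    in parallel and concatenates the results back to 512 dims."""
    Wq = rng.normal(size=(d_model, d_k))      # learned in a real model
    Wk = rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_k))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (3, 64)
    scores = Q @ K.T / np.sqrt(d_k)           # (3, 3): every word vs. every word
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (3, 64): context-mixed vectors

Z = np.concatenate([one_head(X) for _ in range(n_heads)], axis=-1)
print(Z.shape)                                # (3, 512), i.e. Z1, Z2, Z3
```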

➕ Step 3: Residual Connection + Layer Normalisation

  • Add the original input back:
    • Z1' = Z1 + X1
    • Z2' = Z2 + X2
    • Z3' = Z3 + X3
  • Apply Layer Normalisation → Z1norm, Z2norm, Z3norm

✅ This helps the model train better — keeps information flowing without getting lost.

🟣 Step 4: Feed Forward Neural Network (FFNN) Block

This is where each word gets its own private “thinking room”:

A. First Linear Layer + ReLU

  • Input: Z1norm, Z2norm, Z3norm → (3, 512)
  • Multiply by weight matrix W1 (size 512 × 2048)
  • Add bias B1
  • Apply ReLU → adds non-linearity → output shape: (3, 2048)

B. Second Linear Layer

  • Multiply by weight matrix W2 (size 2048 × 512)
  • Add bias B2
  • Output: Y1, Y2, Y3 → (3, 512)

💡 Think of this as a small brain for each word — refining its meaning after the group discussion.
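As a quick sketch, the whole FFNN block is just two matrix multiplications with a ReLU in between. The matrices below are random stand-ins for the learned W1, W2, B1, B2.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(1)
Z_norm = rng.normal(size=(3, d_model))        # stand-in for Z1norm..Z3norm

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

hidden = np.maximum(0, Z_norm @ W1 + b1)      # First Linear + ReLU: (3, 2048)
Y = hidden @ W2 + b2                          # Second Linear back down: (3, 512)
print(hidden.shape, Y.shape)                  # (3, 2048) (3, 512)
```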

➕ Step 5: Final Residual + Layer Normalisation

  • Add the input (Z1norm, Z2norm, Z3norm) back to the FFN output:

Y1' = Y1 + Z1norm
Y2' = Y2 + Z2norm
Y3' = Y3 + Z3norm

  • Then apply Layer Normalisation → Y1norm, Y2norm, Y3norm

These become the final output of one Encoder block.

📌 Important: In the original “Attention Is All You Need” paper, this entire Encoder block is repeated 6 times in a chain:

Input → Encoder 1 → Output 1 → Encoder 2 → Output 2 → Encoder 3 → Output 3 → Encoder 4 → Output 4 → Encoder 5 → Output 5 → Encoder 6 → Final Encoder Output

Each encoder takes the output of the previous one as its input, building deeper and richer understanding at every stage.
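Here is a minimal sketch of that chaining. The `attention` and `ffn` arguments are identity-function placeholders (a real block would plug in the layers described above); the point is only the Add & Norm wiring and the 6× loop.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each vector to zero mean / unit variance
    (the learned scale and shift parameters are omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, attention, ffn):
    """One Encoder block: Attention -> Add & Norm -> FFN -> Add & Norm."""
    z = layer_norm(x + attention(x))   # Steps 2-3
    return layer_norm(z + ffn(z))      # Steps 4-5

def identity(t):                       # placeholder sub-layer
    return t

x = np.random.default_rng(0).normal(size=(3, 512))
for _ in range(6):                     # Encoder 1 -> ... -> Encoder 6
    x = encoder_block(x, attention=identity, ffn=identity)
print(x.shape)                         # (3, 512): the final Encoder output
```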


🧱 Inside the Decoder Input Block

Now that the Encoder has finished its job, it’s time for the Decoder to start writing the output sentence — but not quite yet. First, it needs its own special input.

This is where the Decoder Input Block comes in — and as the diagram shows, it’s almost identical to the Encoder Input Block… with one very important twist: the Right Shift.

The Decoder Input Block prepares the target sentence (e.g., "तुम कैसे हो") so the Decoder can learn to generate it one word at a time — without cheating by looking ahead.

It does this by adding a <start> token and shifting everything right, so each step only sees what came before.

➡️ Step 1: Right Shift

  • Start with the target sentence: "तुम कैसे हो"
  • Add a special <start> token at the beginning: → "<start> तुम कैसे हो"
  • Adding this token shifts the entire sequence one position to the right, so at every step the decoder only sees words that come before the one it must predict.

The result is a new input sequence for the decoder:

| Position | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| Decoder Input | <start> | तुम | कैसे | हो |
| Target Output | तुम | कैसे | हो | <end> |

💡 Why?

During training, the Decoder uses this shifted input to predict the next word:

  • To predict "तुम", it only sees <start>
  • To predict "कैसे", it sees <start> + तुम
  • It never sees "हो" when predicting "कैसे"

This forces the model to generate text causally—just like writing a sentence from left to right.
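A tiny, runnable sketch of this right shift, using plain Python lists, makes the "never peek ahead" rule easy to see:

```python
target = ["तुम", "कैसे", "हो"]

decoder_input = ["<start>"] + target      # what the Decoder is fed
target_output = target + ["<end>"]        # what it must learn to predict

for step, word_to_predict in enumerate(target_output, start=1):
    visible = decoder_input[:step]        # only earlier positions are visible
    print(f"step {step}: sees {visible} -> must predict {word_to_predict!r}")

# step 1: sees ['<start>'] -> must predict 'तुम'
# step 2: sees ['<start>', 'तुम'] -> must predict 'कैसे'
# step 3: sees ['<start>', 'तुम', 'कैसे'] -> must predict 'हो'
# step 4: sees ['<start>', 'तुम', 'कैसे', 'हो'] -> must predict '<end>'
```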

➡️ Step 2: Tokenizer

  • The shifted sequence (<start> तुम कैसे हो) goes into the Tokenizer.
  • It breaks it into individual tokens: → <start>, तुम, कैसे, हो

➡️ Step 3: Embedding (512 dim)

  • Each token gets turned into a 512-number vector via the Embedding layer.
  • These are called Word Embeddings: E1, E2, E3, E4

✅ Example:

  • <start> → E1 = [0.1, -0.9, 0.3, ..., 0.7]
  • तुम → E2 = [0.8, 0.2, -0.6, ..., 0.1]
  • कैसे → E3 = [-0.4, 0.9, 0.5, ..., -0.3]
  • हो → E4 = [0.6, -0.1, 0.8, ..., 0.2]

➡️ Step 4: Positional Embeddings

  • Just like the Encoder, each token also gets a Positional Embedding: P1, P2, P3, P4
  • These are precomputed (using sine/cosine waves) to tell the model the position of each token.

➡️ Step 5: Add Them Together → Positional Encoded Vectors

  • For each token, we add its Word Embedding and Positional Embedding:
  • X1 = E1 + P1 → for <start>
  • X2 = E2 + P2 → for तुम
  • X3 = E3 + P3 → for कैसे
  • X4 = E4 + P4 → for हो

These final vectors — X1, X2, X3, X4 — are called Positional Encoded Vectors.

  • Each is still 512 numbers — but now they contain both meaning AND position.

🎯 This is the magic: the Decoder now has all the info it needs to start generating — one word at a time, without peeking ahead!

📌 What Happens Next?

These X1, X2, X3, X4 vectors are now ready to enter the first Decoder Block — where they’ll meet the Encoder’s “thought” through Cross-Attention.


🧱 Inside the Decoder Block

Now that we have our Positional Encoded Vectors (X1, X2, X3, X4) from the Decoder Input Block, they’re ready to enter the Decoder Block.

This block has three main parts, stacked one after another:

  1. Masked Self-Attention Block
  2. Cross Attention Block
  3. Feed Forward Neural Network Block

And just like the Encoder, this whole structure is repeated 6 times (Decoder 1 → Decoder 6).

Let’s walk through one full Decoder block — from bottom to top — using the detailed diagram.

➡️ Step 1: Input — Positional Encoded Vectors (X1, X2, X3, X4)

  • Input shape: (4, 512) → 4 words (including <start>), each as a 512-number vector.
  • These come from the Decoder Input Block (after adding Word + Positional Embeddings).

🟢 Step 2: Masked Multi Head Attention

  • Each word (X1, X2, X3, X4) looks at all previous words — but not future ones.
  • Why? Because during training, the model must predict the next word without seeing it!
  • This is called Masked Self-Attention — the “mask” blocks out future positions.
  • Output: Contextual Embeddings → Z1, Z2, Z3, Z4

💡 Example:

  • When predicting "कैसे", it can see <start> and तुम — but not हो.
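Here is a small NumPy sketch of how the mask is typically applied: positions above the diagonal of the score matrix are set to minus infinity before the softmax, so their attention weights come out as exactly zero. The scores below are random stand-ins for real Query·Key products.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4                                    # <start>, तुम, कैसे, हो
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for the Q·K scores

# Causal mask: -inf above the diagonal, so softmax sends those weights to 0.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask, axis=-1)

print(np.round(weights, 2))
# Row i has non-zero weights only on columns 0..i, so the position fed "तुम"
# (which must predict "कैसे") can attend to <start> and तुम, but never to हो.
```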

➕ Step 3: Residual Connection + Layer Normalisation

  • Add the original input back:
    • Z1' = Z1 + X1
    • Z2' = Z2 + X2
    • Z3' = Z3 + X3
    • Z4' = Z4 + X4
  • Apply Layer Normalisation → Z1norm, Z2norm, Z3norm, Z4norm

✅ This helps the model train better — keeps information flowing without getting lost.

🟠 Step 4: Cross Attention

This is where the magic happens — the Decoder talks to the Encoder!

  • The Decoder takes its own normalized vectors (Z1norm, Z2norm, Z3norm, Z4norm) as Queries.
  • It uses the Encoder’s final output (from Encoder 6) as Keys and Values.
  • This lets the Decoder focus on the most relevant parts of the input sentence.
    • For example: when generating "हो", it might look back at the Encoder’s understanding of "you".

Output: Cross-Attention Embeddings → Zc1, Zc2, Zc3, Zc4

💡 Think of this as the Decoder asking: “Hey Encoder — what part of the English sentence should I focus on right now?”
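A hedged, single-head NumPy sketch of cross-attention follows (real cross-attention is multi-head with 64-dimensional heads, and the weight matrices below are random stand-ins for learned ones). The key point: Queries come from the Decoder's 4 positions, while Keys and Values come from the Encoder's 3 positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model = 512
rng = np.random.default_rng(0)
decoder_state = rng.normal(size=(4, d_model))   # Z1norm..Z4norm (4 target tokens)
encoder_output = rng.normal(size=(3, d_model))  # final output of Encoder 6

Wq = rng.normal(size=(d_model, d_model))        # learned in a real model
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

Q = decoder_state @ Wq                          # Queries come from the Decoder
K = encoder_output @ Wk                         # Keys come from the Encoder
V = encoder_output @ Wv                         # Values come from the Encoder

weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)   # (4, 3)
Zc = weights @ V                                # (4, 512): Zc1..Zc4
print(weights.shape, Zc.shape)
# Each of the 4 target positions gets a weighting over the 3 source words.
```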

➕ Step 5: Residual Connection + Layer Normalisation

  • Add the input (Z1norm, Z2norm, Z3norm, Z4norm) back to the cross-attention output:

Zc1' = Zc1 + Z1norm
Zc2' = Zc2 + Z2norm
Zc3' = Zc3 + Z3norm
Zc4' = Zc4 + Z4norm

  • Apply Layer Normalisation → Zc1norm, Zc2norm, Zc3norm, Zc4norm

🟣 Step 6: Feed Forward Neural Network (FFNN) Block

This is where each word gets its own private “thinking room” — same as in the Encoder:

A. First Linear Layer + ReLU

  • Input: Zc1norm, Zc2norm, Zc3norm, Zc4norm → (4, 512)
  • Multiply by weight matrix W1 (size 512 × 2048)
  • Add bias B1
  • Apply ReLU → adds non-linearity → output shape: (4, 2048)

B. Second Linear Layer

  • Multiply by weight matrix W2 (size 2048 × 512)
  • Add bias B2
  • Output: Y1, Y2, Y3, Y4 → (4, 512)

💡 Think of this as refining each word’s meaning after listening to both itself (self-attention) and the Encoder (cross-attention).

➕ Step 7: Final Residual + Layer Normalisation

  • Add the input (Zc1norm, Zc2norm, Zc3norm, Zc4norm) back to the FFN output:

Y1' = Y1 + Zc1norm
Y2' = Y2 + Zc2norm
Y3' = Y3 + Zc3norm
Y4' = Y4 + Zc4norm

  • Apply Layer Normalisation → Y1norm, Y2norm, Y3norm, Y4norm

These become the final output of one Decoder block.

🔄 Repeat 6 Times

This entire process — Masked Self-Attention → Residual → Norm → Cross-Attention → Residual → Norm → FFN → Residual → Norm — happens 6 times in a row.

After Decoder 6, the model has a rich, context-aware understanding of what to generate next — ready for the Final Output Block.
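As a sketch, one Decoder block is just three Add & Norm sub-layers chained together, and the stack is a 6-iteration loop. The sub-layers below are placeholders so the wiring runs on its own; a real block would plug in the layers described above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / (x.std(axis=-1, keepdims=True) + eps)

def decoder_block(x, memory, masked_self_attn, cross_attn, ffn):
    """One Decoder block: Masked Self-Attention -> Add & Norm ->
    Cross-Attention -> Add & Norm -> FFN -> Add & Norm."""
    z = layer_norm(x + masked_self_attn(x))        # Steps 2-3
    zc = layer_norm(z + cross_attn(z, memory))     # Steps 4-5
    return layer_norm(zc + ffn(zc))                # Steps 6-7

def identity(t):                                   # placeholder sub-layer
    return t

def ignore_memory(t, memory):                      # a real cross-attention layer
    return t                                       # would attend to memory

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))        # X1..X4 from the Decoder Input Block
memory = rng.normal(size=(3, 512))   # final output of Encoder 6

for _ in range(6):                   # Decoder 1 -> ... -> Decoder 6
    x = decoder_block(x, memory, identity, ignore_memory, identity)
print(x.shape)                       # (4, 512): ready for the Final Output Block
```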


🎯 Final Output Block: Turning Numbers into Words

After the last Decoder block (Decoder 6), we have four final vectors: Y1fnorm, Y2fnorm, Y3fnorm, Y4fnorm — each of shape (512,).

These vectors are the model’s “best guess” for what each word in the output sentence should be. But they’re still just numbers. To turn them into actual words like "तुम", "कैसे", "हो", and <end>, we need the Final Output Block.

The same Linear + Softmax block is applied once for each output position — so in our example it runs 4 times, once for each word.

Let’s walk through the first block — the one that predicts the very first word: "तुम".

➡️ Step 1: Input — Y1fnorm

  • This is the final vector from Decoder 6 for the first position.
  • Shape: (512,) → 512 numbers.

🟣 Step 2: Linear Layer (512 → V)

  • The vector goes into a linear layer with weights of size 512 × V.
  • V = number of unique words in the output vocabulary (e.g., all Hindi words + <start>, <end>).
  • Output: V values — one score for every possible word.

💡 Think of this as a giant lookup table: it asks, “Given these 512 numbers, how likely is each word to be the next one?”

🟠 Step 3: Softmax

  • The V values go through a softmax function.
  • This turns the scores into probabilities — adding up to 1.0.
  • Output: V probability values — e.g., 90% chance of "तुम", 5% of "कैसे", etc.

🟢 Step 4: Normalisation

  • The "Normalisation" shown in the diagram is really the job the softmax already does: each score is exponentiated and divided by the sum of all of them.
  • So no extra layer is needed here: the softmax itself guarantees the V values are well-scaled probabilities that add up to 1.0.

🎯 Step 5: Return Highest Probability Value

  • The model picks the word with the highest probability.
  • For the first position → it picks "तुम"

✅ This is how the Transformer generates its first word!
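Here is a minimal NumPy sketch of this Linear → Softmax → pick-the-best step, using a toy five-word vocabulary and random weights in place of the trained 512 × V projection (so the word it actually picks here is meaningless; only the shapes and the argmax mirror the real thing).

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d_model = 512
vocab = ["<start>", "तुम", "कैसे", "हो", "<end>"]   # toy output vocabulary, V = 5
V = len(vocab)

rng = np.random.default_rng(0)
y1 = rng.normal(size=d_model)                 # Y1fnorm from Decoder 6

W = rng.normal(size=(d_model, V))             # random stand-in for the 512 x V weights
b = np.zeros(V)

scores = y1 @ W + b                           # one score per vocabulary word
probs = softmax(scores)                       # probabilities summing to 1.0
predicted = vocab[int(np.argmax(probs))]      # pick the highest-probability word

print(np.round(probs, 3), "->", predicted)
# With real trained weights this would look something like
# [0.01 0.90 0.05 0.03 0.01] -> तुम
```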

🔁 Repeat for All Positions

The same process happens for the other three positions:

  • Position 2 → Y2fnorm → predicts "कैसे"
  • Position 3 → Y3fnorm → predicts "हो"
  • Position 4 → Y4fnorm → predicts <end>

Each block is identical — only the input vector (Y1fnorm, Y2fnorm, etc.) changes.

📌 Why This Works

  • The model doesn’t generate all words at once — it does one at a time.
  • Each prediction is based on the full context built by the Encoder and Decoder.
  • The final linear + softmax layer is like a “vocabulary selector” — turning abstract numbers into real words.

This is the final step in the Transformer — where numbers become language!


🌟 Conclusion: The Transformer, Demystified

You’ve just walked through the entire Transformer — from raw words to fluent translation — one block at a time.

You can check the whole diagram here: https://drive.google.com/file/d/1lz68fKBnUtsqi9_9q_7J6MrikSu2oA8e/view?usp=sharing

No magic. No mystery. Just smart design:

  • Attention that sees relationships,
  • Positional codes that preserve order,
  • Residual connections that keep learning stable,
  • And parallel processing that makes it fast.

What started as a sentence — "How are you?" — became numbers, then context, then meaning, and finally: "तुम कैसे हो?"

And the best part?

You now understand how it works — not just at a high level, but deep down to the vectors, layers, and shapes.

The Transformer isn’t just a model. It’s the foundation of modern AI — from translation and chatbots to code generation and beyond.

And you?

You didn’t just read about it.

You followed the data all the way through.

Go ahead — share what you’ve learned.

Because now, you truly see the machine behind the magic. 💫
