To understand how modern Large Language Models (LLMs) like ChatGPT work, we must first understand the architecture that changed everything: the Transformer. Before we dive into the complex layers, we need to establish why we moved away from previous methods and how the model initially processes language.
1. The Predecessor: Recurrent Neural Networks (RNNs) and Their Limitations
Before the "Attention Is All You Need" paper, the standard for processing sequential data (like text) was the Recurrent Neural Network (RNN). In an RNN, data is processed sequentially: we give the network an initial state (State 0) along with an input x1 to produce an output y1 and a hidden state. This hidden state is passed forward to the next step, allowing the network to "remember" previous inputs.
The Vanishing Gradient Problem
While intuitive, RNNs suffer from severe limitations, specifically slow computation for long sequences and the vanishing or exploding gradient problem.
To understand this, let's look at calculus, specifically the Chain Rule.
If we have a composite function y = f(g(x)), the chain rule tells us that dy/dx = f'(g(x)) · g'(x): the derivative of the whole is a product of the derivatives of its parts.
In a deep neural network, backpropagation applies the chain rule repeatedly, multiplying gradients layer by layer (or time step by time step). If we have many layers, we are essentially multiplying many numbers together.
Imagine multiplying fractions like 0.5 × 0.5 × 0.5 × … over and over, or, conversely, numbers slightly larger than 1 like 1.5 × 1.5 × 1.5 × …
As the number of layers (or time steps) increases, this number becomes infinitesimally small ("vanishes") or massively large ("explodes"). This makes it incredibly difficult for the model to access or learn from information that appeared early in a long sequence.
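To make this concrete, here is a minimal Python sketch. The 0.5 and 1.5 factors are illustrative stand-ins for per-step gradient terms, not values from a real network; the point is only to show how the product behaves as the number of steps grows.

```python
# Illustrative sketch: repeated multiplication of a per-step gradient,
# as backpropagation through time effectively does for an RNN.

def product_of_gradients(factor: float, steps: int) -> float:
    """Multiply the same per-step gradient 'steps' times."""
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

for steps in (10, 50, 100):
    vanishing = product_of_gradients(0.5, steps)   # per-step gradient < 1
    exploding = product_of_gradients(1.5, steps)   # per-step gradient > 1
    print(f"{steps:3d} steps: vanishes to {vanishing:.3e}, explodes to {exploding:.3e}")
```

Even at 100 steps the first product is already on the order of 10^-31, while the second is on the order of 10^17, which is exactly the vanishing/exploding behaviour described above.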
2. The Transformer Architecture
The Transformer abandons recurrence entirely, relying instead on an Encoder-Decoder architecture. It processes the entire sequence at once, which solves the speed and long-term dependency issues of RNNs.
The Input Matrix
Let's look at how data enters the model.
If we have an input sentence of length 6 (Sequence Length) and a model dimension (d_model) of 512, our input is a matrix of shape (6 × 512).
Each row represents a word, and the columns (length 512) represent that word as a vector. You might ask: Why 512 dimensions?
We need high-dimensional space to capture:
- Semantic Meaning: What the word actually means.
- Syntactic Role: Is it a noun, verb, or adjective?
- Relationships: How it relates to other words (e.g., "King" vs "Queen").
- Context: Multiple contexts the word can appear in.
Input Embedding
Computers don't understand strings; they understand numbers. We take our original sentence:
"Your cat is a lovely cat"
First, we map each word to an Input ID, its position in the model's vocabulary.
We then map these IDs into a vector of size 512. Note that these vectors are not fixed; they are learned parameters that change during training to better represent the word's meaning.
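As a rough illustration, here is how that lookup might look with PyTorch's nn.Embedding. The vocabulary, the specific IDs, and the vocabulary size of 10,000 below are all made up for the example; a real tokenizer assigns its own indices.

```python
import torch
import torch.nn as nn

# Hypothetical Input IDs for "Your cat is a lovely cat" (one batch of 6 tokens).
input_ids = torch.tensor([[105, 6587, 5475, 3578, 65, 6587]])  # shape: (1, 6)

d_model = 512
# The embedding table is a learned parameter: it is updated during training.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=d_model)

x = embedding(input_ids)   # shape: (1, 6, 512) -- one 512-dimensional vector per token
print(x.shape)             # torch.Size([1, 6, 512])
```

Note that the two occurrences of "cat" (ID 6587) map to the same embedding vector at this stage; it is only later, through attention, that they acquire different contextual representations.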
3. Positional Encoding
Since the Transformer processes words in parallel (not sequentially like an RNN), it has no inherent concept of "order." It doesn't know that "Your" comes before "cat." We must inject this information manually using Positional Encodings.
We want the model to treat words that appear close to each other as "close" mathematically. To do this, we use trigonometric functions because they naturally represent continuous patterns that the model can easily learn to extrapolate.
We add this positional vector to our embedding vector. The formula used in the paper is:
- For even dimensions (2i) of the embedding vector: PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
- For odd dimensions (2i + 1): PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))
This ensures that every position has a unique encoding that is consistent across training and inference.
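Here is a small NumPy sketch of those sinusoidal encodings, following the paper's formula; the helper name positional_encoding is just for illustration.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings, PE(pos, 2i) = sin(...), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]       # (1, d_model/2) -- the '2i' indices
    angles = positions / np.power(10000, dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=512)
print(pe.shape)   # (6, 512) -- added element-wise to the (6, 512) embedding matrix
```

Because the encodings are computed from a fixed formula rather than learned, the same position always gets the same vector, which is exactly the consistency property mentioned above.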
4. Self-Attention: The Core Mechanism
This is the "magic" of the architecture. Self-attention allows the model to relate words to each other within the same sentence. It determines how much "focus" the word "lovely" should have on the word "cat."
The formula for Scaled Dot-Product Attention is:

Attention(Q, K, V) = softmax(Q K^T / √d_k) · V
Where:
- Q (Query): What I am looking for.
- K (Key): What I contain.
- V (Value): The actual content I will pass along.
The Matrix Math
For a sequence length of 6 and dimension 512:
- We multiply Q (6 × 512) by K^T (512 × 6).
- This results in a (6 × 6) matrix.
- We divide by √d_k (here √512, the scaling in the formula) and apply the Softmax function row-wise. This turns each row of scores into probabilities (summing up to 1).
This (6 × 6) matrix captures the interaction between every word and every other word. When we multiply this by V, we get a weighted sum of the values, where the weights are determined by the compatibility of the Query and Key.
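Putting the pieces together, here is a minimal NumPy sketch of a single scaled dot-product attention pass over our 6 × 512 example; the random matrix below is just a stand-in for the embedded, position-encoded sentence.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V for a single attention pass."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (6, 6) interaction matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax, each row sums to 1
    return weights @ V                                         # weighted sum of values: (6, 512)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 512))                  # stand-in for the (6, 512) input matrix
out = scaled_dot_product_attention(x, x, x)    # self-attention: Q = K = V = x
print(out.shape)                               # (6, 512)
```

In self-attention Q, K, and V all come from the same input sequence, which is why a single matrix x is passed three times here; in the full Transformer, separate learned linear layers first project x into Q, K, and V.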
Key Benefits of Self-Attention:
- Permutation Invariant: It treats the sequence as a set of relationships rather than a strict list.
- Parameter Efficiency: Pure self-attention requires no learnable parameters (though the linear layers surrounding it do).
- Long-range Dependencies: Words at the start of a sentence can attend to words at the end just as easily as adjacent words.
Summary & Looking Ahead
We have successfully moved away from the sequential limitations of RNNs and embraced the parallel nature of Transformers. We've learned how to convert text into meaningful vector spaces, inject order using positional encoding, and, most importantly, derive the mathematical foundation of how words "pay attention" to each other using Queries, Keys, and Values.
But there is a catch.
The mechanism we just described, a single pass of softmax(Q K^T / √d_k) · V, gives the model only one way of relating the words in a sentence.
Real-world language is too complex for a single "gaze." To build a model like ChatGPT, we need it to look at the sentence through multiple lenses simultaneously.
In Part 2, we will take the self-attention mechanism and clone it, creating Multi-Head Attention. We will then see how these attention scores are processed through Feed-Forward Networks to finally construct the complete Transformer block.
Stay tuned.


