This might be the last of the blog posts I've made based on Andrej's videos, so shout out to him one last time. Respect!
A word of warning: I'm currently busy with projects (going around fixing bugs every day), so things have gotten messy. It's hard to note things down systematically, and sometimes I just want to talk about a topic really briefly. So you may not find detailed explanations in my blog posts anymore; some notions will be under-explained or even wrongly explained. I'm just sharing my thoughts and some of the insights I came up with, so don't take it too seriously, okay :).
Now we know that the Bigram model uses only the previous token to predict the next one, which really limits the context. In other words, the context window is too small, and we really want to increase it. But when we try to scale up the N-gram model, we hit a big problem: the table of counts grows exponentially with the context length. For example, with a character vocabulary of around 65 (like the Shakespeare dataset from the lecture), a bigram table has 65² contexts, while conditioning on 8 previous characters would need 65⁸ of them, roughly 3×10¹⁴.
And this is where our Attention boy comes in. Attention lets every position look at everything in the past (within a sequence of size block_size), and we will walk through it from the naive approach to the actual version.
The Naive Approach
Let me be specific: what we want is, for each timestep (as we walk along the time dimension), to be able to look at every character behind us in order to make our decision.
We do that in a fairly simple way: just carry the information of all the previous characters with us via a weighted sum, and in this first version, we'll simply take the mean. Here's how it looks with a toy example:
import torch

B,T,C = 4,8,2  # batch, time, channels
x = torch.randn(B,T,C)
xbow = torch.zeros((B,T,C))  # "bow" = bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C) -- everything up to and including position t
        xbow[b,t] = torch.mean(xprev,0)
>>>x[0]
tensor([[ 0.3624, -1.6396],
[-0.3599, -1.2083],
[-2.0063, 0.8857],
[ 0.0144, 1.5211],
[ 0.6635, -0.5929],
[ 1.0220, 0.8683],
[ 1.2717, -0.6242],
[-1.4394, 0.2805]])
>>>xbow[0]
tensor([[ 3.6235e-01, -1.6396e+00],
[ 1.2276e-03, -1.4240e+00],
[-6.6794e-01, -6.5406e-01],
[-4.9734e-01, -1.1026e-01],
[-2.6517e-01, -2.0679e-01],
[-5.0639e-02, -2.7610e-02],
[ 1.3827e-01, -1.1284e-01],
[-5.8944e-02, -6.3670e-02]])
And you know what? All of that can be done with matrix multiplication. I won't go deep into the details here (that's your linear algebra lecturer's job), so let me just put the code right under this:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))       # lower triangular matrix of ones
a = a / torch.sum(a, 1, keepdim=True)  # normalize each row so it sums to 1
b = torch.randint(0,10,(3,2)).float()  # some random "data"
c = a @ b                              # row t of c is the mean of rows 0..t of b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
A few things to notice here. First, we have that tril call, which gives us a lower triangular matrix. The intuition is that no token is allowed to talk to the ones in the future; it can only look at the past.
Second, notice that we divide each row by its sum. This ensures that each row takes the mean of the tokens it attends to, rather than just summing them up.
Now let's implement that in our real input matrix x:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
That will return True, trust me. And notice the comment I made: there's a bit of broadcasting going on, because the dimensions don't exactly match (wei is (T, T) while x carries a batch dimension). Broadcasting effectively replicates the (T, T) matrix across the batch dimension, so everything works out.
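If you want to convince yourself, just print the shapes (this reuses the wei and x from the snippet above):
print(wei.shape, x.shape, (wei @ x).shape)
# torch.Size([8, 8]) torch.Size([4, 8, 2]) torch.Size([4, 8, 2])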
In practice, though, we would get rid of that normalizing part and rather apply the softmax, which I will show you:
import torch.nn.functional as F

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # future positions get -inf
wei = F.softmax(wei, dim=-1)                     # each row becomes a uniform average over the past
xbow3 = wei @ x
torch.allclose(xbow, xbow3)
Remember that we still want a lower triangular matrix that gets normalized, but how do we tell softmax that some entries should be zero? We put -inf in every place where a 0 is supposed to be, and softmax ends up producing exact zeros there, because it exponentiates the values (and e^-inf is 0).
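To see the trick in isolation, here's a tiny check on a single row (it just assumes the torch and F imports from above):
row = torch.tensor([0.0, 0.0, float('-inf')])  # a token that can see positions 0 and 1, but not 2
print(F.softmax(row, dim=0))                   # tensor([0.5000, 0.5000, 0.0000])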
A small note before we jump into the Attention part: we need to somehow tell each token which position it's in. Take "He sat on the mat, with the cat" versus "The cat sat on the mat": the relative order between "sat" and the nouns ("he", "cat") is crucial for understanding the meaning of the sentence. So order matters, and that's why we introduce the Positional Embedding, which is just an extra Embedding layer indexed by position; we add its output to our token embedding and we're all set:
# It kinda goes like this
pos_emb = posemb_matrix(torch.arange(T))
x = tok_emb + pos_emb
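If that pseudocode is too terse, here's a slightly fuller sketch of how the two embedding tables might be wired up in the model's forward pass. The names vocab_size, n_embd, block_size and the (B, T) tensor of token indices idx are the usual hyperparameters from the lecture code, so treat this as an illustration rather than the exact implementation:
import torch.nn as nn

token_embedding_table = nn.Embedding(vocab_size, n_embd)     # one learned vector per token id
position_embedding_table = nn.Embedding(block_size, n_embd)  # one learned vector per position
# inside forward(idx), where idx has shape (B, T):
tok_emb = token_embedding_table(idx)                                    # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(T, device=idx.device))  # (T, n_embd)
x = tok_emb + pos_emb                                                   # broadcasts to (B, T, n_embd)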
The Crux of Self Attention
Now, there is a problem with our naive approach: it treats each of the previous tokens as an equal contributor, while in practice some tokens are more valuable in the eyes of other tokens. In other words, we'd love to replace the naive everything-equal averaging with a properly weighted sum.
To compute that weighted sum, we need to find the relationship between the token we're examining and all the previous ones. We do that in a fairly easy-to-understand way: let each token have two vectors, Q (for query) and K (for key). It goes like this:
- As we march through the timesteps, each token emits its query vector, as if asking the previous tokens: "I'm looking for A, B, C, ..." It might be looking for an adjective, or a noun, or a word that makes its context more specific, etc.
- After the query is emitted, we take each of the keys from the previous tokens, which are simply their answers to the query. We measure how well each answer matches by taking the dot product of the two vectors (remember that dot products measure the similarity between vectors). And then we have our weights!
But we won't apply those weights to x directly. Instead, we introduce another vector, V, which represents the value of each token. The original vector of each token is kind of private information, and the value is a way of saying: "If you're interested in me, here's what I'll give you, but you won't take me with you, ain't no way bud."
And all of that, my friend, is the crux of self-attention. You might want to check out the great video and visualization from 3Blue1Brown (and his whole series); those videos are phenomenal: Attention in transformers
Now let's implement it in code:
head_size = 16
# nn.Linear here is just a convenient way to
# construct the weight matrices we need
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
# Need to transpose k so the matmul computes query-key dot products
wei = q @ k.transpose(-2,-1)  # (B, T, head_size) @ (B, head_size, T) ---> (B, T, T)
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
# Creating the value vectors
value = nn.Linear(C, head_size, bias=False)
v = value(x)  # (B, T, head_size)
out = wei @ v # (B, T, head_size)
A small thing to notice here: we transpose K before multiplying it with Q, so that the matrix multiplication computes the dot product of every query with every key in one go, giving us the (T, T) matrix of weights.
That's all of the basics of self-attention, and there are a few notes from Andrej that you should read. I won't go into specifics because I'm lazy (sorry):
Notes:
- Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across batch dimension is of course processed completely independently and never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with
tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling. - "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additional divides
weiby 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much.
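That last note translates into a tiny change to the wei line from the snippet above; here's the scaled version (just a sketch of that one line):
wei = q @ k.transpose(-2,-1) * head_size**-0.5  # (B, T, T), roughly unit variance before the softmax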
Toward the Transformer
Now, we're not done yet! There's a lot left to do to arrive at the final model. First, let's take a look at the architecture in the paper Attention Is All You Need.
Multi-headed attention
We do that by running several self-attention heads (each one we call a Head) in parallel. The intuition is that we want the model to ask as many questions as possible, and thus understand more about the context.
In code, we usually divide the embedding dimension by the number of heads we want to get each head_size, and after the multi-head layer we concatenate the head outputs along the C dimension. It's like splitting the features into smaller chunks for the model to process in parallel.
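The Head module itself never appears explicitly in this post, even though the multi-head code below builds on it, so here's a sketch of what it might look like: basically the earlier attention snippet wrapped into an nn.Module, with the 1/sqrt(head_size) scaling folded in. It assumes n_embd and block_size are defined globally, as in the lecture code.
class Head(nn.Module):
    """ one head of self-attention (a sketch, following the lecture code) """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # the mask is not a trainable parameter, so we register it as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                    # (B, T, head_size)
        q = self.query(x)                                  # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                  # (B, T, head_size)
        out = wei @ v                                      # (B, T, head_size)
        return out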
FeedForward layer
After a layer of attention, we want to give our neurons some extra time to process the information they've gathered, and even to communicate with each other a bit more, in order to arrive at something good. I know I'm describing this in an overly metaphoric way, but I think that's the intuition behind adding an MLP after the multi-head.
So let's implement it, shall we? We just need a Linear layer followed by a non-linear activation function (in this case, ReLU).
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)
Residual connections
The multi-head attention and the feedforward are the bare bones of our model, so now we know the most important parts of the Transformer.
Let's move on to some smaller, yet still important, details of the architecture. The first one I'd like to tell you about is the residual connection.
Remember the vanishing gradient problem? It happens when the gradient has to pass through many consecutive layers, gets squashed and sometimes zeroed out (mainly because of saturated activation functions), and eventually produces minimal to zero updates for the parameters.
In other words, the gradient has to flow along winding, twisty roads to bring updates back to our layers. So why don't we build another road that lets the gradient flow smoothly, without any extra obstacles? That's exactly what the residual pathway is: a highway for gradients. If you look back at the architecture illustration, you can see extra arrows on the side of the model, showing that we add the input right back onto the output of each sub-layer.
# In our block, we modify this
x = x + self.sa(x)
x = x + self.ffwd(x)
If you're not comfortable with this metaphoric way of thinking, you can think in mathematical terms: when the gradient flows back, the addition operation distributes the gradient equally to its inputs. So the input still receives an unmodified, unsquashed gradient through the backward pass. This residual connection actually enabled a whole family of very deep architectures, the Residual Networks (ResNets). (You can look that up, it's really interesting.)
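If you want to see that "addition routes the gradient through untouched" claim with your own eyes, here's a tiny check:
x1 = torch.tensor([2.0], requires_grad=True)
x2 = torch.tensor([3.0], requires_grad=True)
z = x1 + x2              # the addition node, just like x + self.sa(x)
z.backward()
print(x1.grad, x2.grad)  # tensor([1.]) tensor([1.]) -- both branches get the full gradient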
One small detail: we also need two projection layers, one after the multi-head and one after the feedforward. These act as mixers; they mix things up to allow more exchange of information among the heads. Each is just a linear layer of size n_embd x n_embd (notice that the input and output sizes are equal, so basically we're just mixing things up).
# Adding a projection layer
self.proj = nn.Linear(n_embd, n_embd)
Also notice that in the paper, the researchers make the inner layer of the feedforward 4 times wider (creating a much larger working space for our neurons, I think), so we should include that in our FeedForward layer as well.
def __init__(self, n_embd):
    super().__init__()
    self.net = nn.Sequential(
        nn.Linear(n_embd, 4 * n_embd),
        nn.ReLU(),
        nn.Linear(4 * n_embd, n_embd),
    )
LayerNorm
There is actually a whole family of normalization techniques, and we already know one of them: BatchNorm. But we also know that BatchNorm has problems of its own, including the side effect of coupling the instances of a batch together, and the need to keep track of running population statistics (mean and variance). It turns out BatchNorm doesn't work well with sequential inputs like ours, especially since at generation time we feed the model one instance at a time rather than a nice big batch, and researchers found a way around the issue.
That way is LayerNorm, where we don't normalize across the batch dimension, but across the feature dimension, which in our case is the channel or embedding dimension. We don't even need running means and variances anymore, because each instance now has its own mean and variance, regardless of the batch it's in.
To compute it, we just need to change the dimension when normalizing: Normalizing rows rather than columns. And that's it.
class LayerNorm:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum  # leftover from the BatchNorm version, unused here
        self.training = True      # also unused: LayerNorm behaves the same at train and eval time
        # parameters (trained with backprop)
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
    def __call__(self, x):
        xmean = x.mean(1, keepdim=True) # mean over the feature dimension
        xvar = x.var(1, keepdim=True)   # variance over the feature dimension
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        return self.out
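A quick sanity check: feed it a batch and verify that each row (each instance) now has roughly zero mean and unit standard deviation:
ln = LayerNorm(100)
xb = torch.randn(32, 100)               # 32 instances, 100 features each
out = ln(xb)
print(out[0,:].mean(), out[0,:].std())  # roughly tensor(0.) and tensor(1.)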
In recent years there have been many improvements and new findings. The first is that we don't put the norm layer after the multi-head or the feedforward, but before them (the so-called pre-norm formulation).
class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        # Creating our LayerNorms
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        # We apply the LN first, then the sub-layer, then the residual add
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
Researchers also found that LayerNorm has issues of its own (for instance around model interpretability), which led to numerous other normalization techniques: InstanceNorm, GroupNorm (somewhere between LayerNorm and InstanceNorm), and RMSNorm (widely used today; it's LayerNorm without the mean subtraction, dividing the inputs by their root mean square rather than their standard deviation).
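RMSNorm isn't part of the lecture, but since it came up, here's a minimal sketch of it just for comparison: no mean subtraction, and we scale by the root mean square of the features, with a learnable gain and no bias.
class RMSNorm(nn.Module):
    """ minimal RMSNorm sketch: no centering, scale by the root mean square """
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain, no bias
    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x / rms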
Dropout
The last small thing before we wrap up the model is Dropout. Dropout is just an additional layer used mainly for regularization. It works by randomly "killing" some neurons during training, making the collaboration game harder and eventually forcing the neurons to become more independent, and thus stronger (no pain, no gain, right? I often wonder why these neurons are always treated so softly; we should be strict sometimes, and Dropout really does the trick).
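Here's what that "killing" looks like in isolation. The exact zeros are random on every call, and note that the survivors get scaled by 1/(1-p) so the expected value stays the same:
drop = nn.Dropout(0.5)
print(drop(torch.ones(8)))
# something like tensor([2., 0., 2., 2., 0., 0., 2., 0.]) -- your zeros will land elsewhere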
Dropout is also applied to the query-key weight matrix in the attention layer (right after the softmax), so that when it's multiplied with the value vectors, some of the communication between tokens is randomly cut off.
# Dropout in the Multihead and the Linear layer
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
    def forward(self, x):
        return self.net(x)
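For completeness, the dropout over the attention weights lives inside each Head. Here's how the Head sketch from earlier would pick it up (it assumes a global dropout rate, like the other modules above); only the two lines marked "new" change:
class Head(nn.Module):
    """ one head of self-attention, now with dropout on the attention weights """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)                 # new
    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)                            # new: randomly cut some token-to-token links
        v = self.value(x)
        out = wei @ v
        return out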
And that's all I wanted to tell you.
Summary
I know this kind of explanation may not fully satisfy you; I introduced concepts rather abruptly and sometimes even skipped parts. But if you understood what I meant, then congrats, my friend, you've just walked through one of the most brilliant architectures in human history.
The Transformer is just blocks of multi-head attention, feedforward, and the other bits stacked on top of each other, that's it! Now I think I'd better build this as one of my projects, so goodbye for now.
Thanks for reading and, as usual, have a good day!
