Hưng Lê Tiến

LANGUAGE MODELS USING MLP (Part 1)

Welcome to the third part of the series! As a reminder, this blog follows Andrej Karpathy's series on YouTube, and I'm just taking notes from his videos. Today we will explore neural nets in more depth and make some great improvements to our language model.

We will split this topic into two chapters, mainly because the content is long and there is a lot to cover in Andrej's video. I will try my best to provide a comprehensive view of the topic, and to explain every concept involved as clearly as I can.

Here's the link to the video: Building makemore Part 2: MLP

MLP (Multi-layer Perceptron)

MLPs

Remember the architecture you saw in the previous chapter? That was a multi-layer perceptron. The term sounds fancier than it is if you're already familiar with neural nets.

But in the last chapter we built just one layer of perceptrons, which explains why the model was so simple and produced some disappointing results. In this blog we will build a multi-layer perceptron (well, just 2 layers), and we shall see how much more powerful the model becomes with more hidden layers.

A Revolutionary Approach

Before we jump right into the code and blindly adjust parameters in search of the best result, I think we really need to take a step back and address the main problems with our past approach, and why the model is so bad at generating names even when the loss is fairly well optimized. So take a sip, and let's go through some important insights about our models:

Why Don't We Scale the Model?

This is the most naive question we could come up with when we want to make a model more powerful. Maybe we would love to have a 4-gram or 5-gram language model rather than just a bigram, so that the model can take into account more preceding characters and hence arrive at better results.

But imagine what the table of counts would look like. It scales exponentially with context length, with a base of 27. For your lovely 4-gram or 5-gram, that would be 27^4 = 531,441 or 27^5 ≈ 14.3 million entries, which is quite intimidating to deal with.
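If you want to make this concrete, here's a quick back-of-the-envelope check (just plain Python, nothing model-specific) showing how fast the table blows up:

# Rough number of entries in the count table for an n-gram over 27 characters
for n in range(2, 6):
    print(f"{n}-gram: 27^{n} = {27**n:,} entries")
# ...
# 5-gram: 27^5 = 14,348,907 entries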

So naive scaling is not great, but are there any cleverer approaches? Well, yes: a great deal of research has been carried out, and the most ground-breaking work, I would say, is the paper from Bengio and his colleagues.

A Brief Overview of Bengio's Paper

Reading papers can be boring, and intimidating at times. But a paper is just an approach to some problem that got on the nerves of scientists, so we don't need to know all the maths or every experiment to understand one. When reading a paper, we should focus on two questions: What problems are being addressed? and What is the intuition behind the proposed solution? Of course, it would be great if we could also follow the implementation and the proofs through long stretches of formulas and theorems. For now, let's talk about just the intuition.

The paper first points out that there are two main problems with the then-current language model, the n-gram, and those problems were, to your surprise, already mentioned in our previous chapters when we built the model:

  • The model does not take into account context farther than 1 or 2 words. Well, we talked about that.
  • The model does not take into account the similarity between words.

Hold up. You may object that similarity, or synonyms, are not really relevant to our task of predicting the next word. But it is actually of paramount importance, specifically for improving the model's ability to generalize.

Imagine the model encounters the sequence "A dog was running in the ..." and it has no data about that exact sequence in the training set. What will it do? Suppose it has a training instance "The cat is walking in the bedroom", and that, having been trained on a huge text corpus, it knows that "cat" is similar to "dog", "walking" is just like "running", "A" and "The" are virtually the same, and so on. With that, maybe the model will come up with something like "room" or "living room", which share similarities with "bedroom" from the training set. And that would be a great prediction! The model performs well even on data it has never seen before. In other words, we say that the model has generalized well to the test set, just by using the similarities between words.

So how can we create a model which can perform that magical task? Now, we shall see the most ground-breaking part of the paper.

Feature vectors - Embeddings

We start by associating each word in the sequence with a vector, usually of a lower dimension than the vocabulary size (that's actually a clever way to combat the Curse of Dimensionality), and we work with that vector from then on.

In our previous chapter we did this kind of conversion with one-hot encoding, which turns each character into a one-hot vector to feed into the neural net. But a one-hot vector captures nothing but the position of the character in the alphabet, which is of no help to our task of predicting.

So imagine their approach as a smarter way to encode our words. They convert words to vectors just like we did, but first they store the words in a lower-dimensional space, and second, they arrange the vectors so that they capture similarities. What I mean by "capture the similarity" is that in this vector space, the vectors representing words with similar meanings, or the same semantics, end up close to each other, and that is the main point of all of this. We get big clusters of synonyms, and also some cool tricks that play with the meanings of words.
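To make "closeness" concrete, here's a tiny, made-up sketch: the vectors below are invented purely for illustration (not real learned embeddings), but they show the property we want, namely that similar words get similar vectors and cosine similarity reflects that.

import torch
import torch.nn.functional as F

# Toy, hand-made 3-d "embeddings", purely for illustration
emb_toy = {
    'dog': torch.tensor([0.9, 0.1, 0.3]),
    'cat': torch.tensor([0.8, 0.2, 0.25]),
    'car': torch.tensor([-0.7, 0.9, -0.4]),
}

def cos(a, b):
    return F.cosine_similarity(a, b, dim=0).item()

print(cos(emb_toy['dog'], emb_toy['cat']))  # high (close to 1): semantically similar
print(cos(emb_toy['dog'], emb_toy['car']))  # much lower: not similar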

This "feature vector" trick is still applied in our modern world, but it have been improved significantly, now it is prevalent with the name of Embeddings, so we will call this method as embeddings from now. Embeddings itself involves some complicated mechanisms, but from the very root, it is just a modern version of the feature vectors.

Exploring Bengio's Model Architecture

Bengio's architecture

First, in the Input layer, we still have the inputs as indices of the words (or, in our case, the characters); this is just the lookup table we created with the mapping from a-z to 1-26. So no big-brain stuff here.

In the next layer, interesting things happen; I will call this the Embedding layer. This is different from our previous model, and it's the key ingredient that brings about the significant improvements later on. We have a matrix C for our embedding process; its parameters are tweaked and tuned during the learning process, and it is shared across all words. Later on, in our model, we will explore some interesting patterns that the neural net learned during training by looking at this matrix.

A notable thing to mention here is the embedding size: we have to decide what dimension we will "squish" our words into. In the paper, they embedded a vocabulary of 17,000 words into just a 30-dimensional vector space. This is a form of Dimensionality Reduction, and there are numerous other methods that implement it. But you should also note that when we transform a high-dimensional vector space into a lower-dimensional one, there will inevitably be some information loss. So there's a tradeoff, and we should choose carefully.
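As a rough sketch of what that lookup table amounts to (using the paper's numbers; the variable names here are just mine), it is nothing more than a 17,000 x 30 matrix of learnable parameters:

import torch

vocab_size, emb_dim = 17_000, 30            # numbers from the paper
C_words = torch.randn(vocab_size, emb_dim)  # the shared embedding matrix

print(C_words.shape)    # torch.Size([17000, 30])
print(C_words.numel())  # 510,000 learnable parameters
# Each word is now described by 30 numbers instead of a length-17,000 one-hot vector.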

The next layer is the Hidden layer of our net. This time we can choose the size of the layer to be any number we want. Note that in this layer they used an activation function called tanh, a classic one; we will talk about the whole family of activation functions, perhaps in the next blog.

The last layer is the Output layer, and you can see the note "most computation here". It is an expensive layer, as we have to produce a score for every word in the vocabulary, which amounts to a total of 17,000 logits, and then we apply the softmax to all of them to get the final probability distribution.
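To see why that last step dominates the computation, here's a small sketch; the hidden size of 100 is just a placeholder I picked for illustration, the paper experiments with several sizes.

import torch

hidden_size, vocab_size = 100, 17_000        # hidden size is my placeholder choice

h = torch.randn(1, hidden_size)              # hidden activations for one context
W_out = torch.randn(hidden_size, vocab_size) # output weights: 1.7 million parameters
logits = h @ W_out                           # 17,000 logits for a single prediction
probs = torch.softmax(logits, dim=1)         # softmax over the entire vocabulary
print(logits.shape, probs.sum().item())      # torch.Size([1, 17000]), ~1.0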

So that is the bare bones of their model, and we will now start to rebuild it in our small project. Let's begin!

Load the data & Libraries

Moving on to the third chapter, we need some improvements in the structure of our project. For convenience, we should import all of the libraries at the very beginning, and then load the dataset.

Importing libraries

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt # for making figures
%matplotlib inline

Loading the dataset

# download the names.txt file from github
!wget https://raw.githubusercontent.com/karpathy/makemore/master/names.txt
words = open('names.txt', 'r').read().splitlines()
words[:8]

So now we have our data!

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']



Prepare the Data

Indexing

First things first, we need to reuse the mapping we created to index the characters and feed them into our model.

# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(''.join(words))))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0

# Also remember to map backward
itos = {i:s for s,i in stoi.items()}
print(itos)
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}

Build the dataset

There is an important note here: instead of the single character of context in the bigram model, we will now use a context of three characters, so we're scaling up the context window a little bit.

With that in mind, we will need some modifications when preparing our data. We have a new variable block_size, which is simply the number of characters in the context window. We set it to 3 for our model.

We will also need a technique to capture 3 characters at a time in each iteration, rather than just one. It is fairly simple: we start with a window of size 3 filled with 0s, and then we slide the window across the training data, storing the contexts in our tensor X and the corresponding labels in the tensor Y.

Let's look at how we can implement it in Python; it's just cropping a list and appending a new element:

# build the dataset
block_size = 3 # context length: how many characters do we take to predict the next one?
X, Y = [], []

# We will just deal with 5 names for now
for w in words[:5]:
  context = [0] * block_size
  for ch in w + '.':
    ix = stoi[ch]
    X.append(context)
    Y.append(ix)
    print(''.join(itos[i] for i in context), '--->', itos[ix])
    context = context[1:] + [ix] # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)

Here's our data:

... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
...
torch.Size([32, 3]) torch.Size([32])

We have 32 windows, each of size 3, which is just enough to move on to the next stage.

Embeddings & Data Manipulation with PyTorch

This section will take us somewhat off track, as we will not dive into the project or fine-tuning; instead we will focus on data manipulation, which is really interesting.

PyTorch is often the go-to library for neural nets and even more complicated models, as it offers some really convenient operations and efficient memory handling when storing data. In our previous project we explored just a bit of the magic PyTorch offers, namely the broadcasting rules and the backward() function. Those only scratch the surface: there is a whole world of convenience in PyTorch, and some interesting manipulations that I think are worth learning, given that we will have to do a lot of things to our data in these projects.

So let's walk through some of them while we're building our model:

Initializing the embedding vector

The most important thing here is the size of this matrix: we need to determine the dimension we want to project our data onto. In this case, we will choose 2.

# Initialize the embedding matrix randomly
C = torch.randn(27,2)

Now we need a good understanding of the sizes of the matrices we create, so that we understand the underlying process. This matrix C has size 27x2, which means it stores 27 rows, each row being the vector that embeds one character. The 27 appears because we have 27 characters in total (a-z plus the '.' token), and the 2 is the dimension of the embedding vector for each character.

What if we want to embed a character using this matrix? There are two ways for doing this:

  • Using the indexing
C[5]
  • Multiply the C matrix by a one-hot vector of size 27
# One-hot -> Remember to convert to float
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

Both of these return a 2-dimensional vector:

tensor([0.5262, 1.0655])

For convenience, we will stick with the former; and in fact, PyTorch has some great indexing techniques that offer even more flexibility.
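A quick sanity check, just to convince ourselves the two routes really are equivalent:

# The one-hot multiplication picks out exactly the same row as plain indexing
row = F.one_hot(torch.tensor(5), num_classes=27).float() @ C
print(torch.allclose(C[5], row))  # True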

Indexing with PyTorch

We can index using lists: we pass in a list of integers representing the indices we want, and it returns the corresponding sequence of 2-dimensional embedding vectors.

# Index using List
C[[5,6,7]]
tensor([[ 0.5262,  1.0655],
        [-2.2277, -0.5293],
        [-0.6665, -1.0212]])

The list can contain duplicates:

# Can be a tensor of int, we can repeat
C[torch.tensor([5,6,7,7,7,7,7])]
tensor([[ 0.5262,  1.0655],
        [-2.2277, -0.5293],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212]])

What if we pass in a whole matrix? PyTorch can handle that too! It will iterate through the matrix and replace each entry with the corresponding embedding vector. So the indexing returns a new tensor with an additional dimension at the end, equal to the embedding size. Take a moment to think about it, and then look at the implementation to see what I mean:

# Indexing with a multidimensional tensor
C[X].shape
torch.Size([32, 3, 2])

Let's take a look at what it is actually doing inside the matrix:

>>>X[:5]
tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1]])
>>>C[X][:5]
tensor([[[-2.3175, -0.5157],
         [-2.3175, -0.5157],
         [-2.3175, -0.5157]],

        [[-2.3175, -0.5157],
         [-2.3175, -0.5157],
         [ 0.6433,  0.8121]],

        [[-2.3175, -0.5157],
         [ 0.6433,  0.8121],
         [ 1.0138, -0.7526]],

        [[ 0.6433,  0.8121],
         [ 1.0138, -0.7526],
         [ 1.0138, -0.7526]],

        [[ 1.0138, -0.7526],
         [ 1.0138, -0.7526],
         [ 1.5965, -0.8861]]])

So in X we have a bunch of rows of 3 indices, and in C[X] each of those rows has become a 3x2 matrix: PyTorch simply adds an extra dimension to store the embedding of each element! I also got confused by this part, so don't be shy about taking a moment to think it through.

And we can pick out a specific embedding vector by indexing, just like working with a multidimensional array in Python:

# The embedding for the window at index 13, at its third character (index 2)
C[X][13][2]
tensor([-0.0371,  0.8457])
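If that chained indexing feels mysterious, here's a quick check of what it's really doing: X[13, 2] is just a character index, and looking that index up directly in C gives back the very same vector.

ix = X[13, 2]   # a scalar tensor holding the character index (0..26) at window 13, position 2
print(ix)
print(C[ix])    # the same vector as C[X][13][2] above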

Now we should name the embedding of X:

emb = C[X]
emb.shape

Creating the first layer

As you can see from the architecture, a pack of 3 characters is fed into the model, but we embedded each of them into a 2-dimensional vector, so that makes a total of 3 x 2 = 6 input values.

Okay, so our layer takes in 6 input values; and what is the output, or more precisely, how many neurons does it have? This is totally up to us, and I will choose 100. And we're ready to go!

You may wonder why we need to think about all of this in the first place. Well, it is crucial to think about the matrix we want to create and how we can create it; there are lots of insights hiding in the sizes of these matrices. Moreover, when we perform matrix multiplication it is important to keep track of the dimensions, as the operation is only defined when the inner dimensions match. Also, I want to clarify each step we're taking, so that we won't get lost among randomly-popping-up numbers; it's a way to make sure we're on track.

Initializing the Weights and Bias

In the previous chapter we dealt with the weights only, but in practice we need an additional term called the bias, so that we get the classic formula "wx + b" that appears in virtually every ML/DL book.

# Creating the first layer

# Number of inputs: 3x2, because we have a 2-dim embedding and 3 chars
# Number of neurons: 100 (totally up to us)
W1 = torch.randn((6,100))
b1 = torch.randn(100)

Fitting the Dimensions

So now we need to feed our data into the first layer, right? We will do that with a matrix multiplication. But wait, there's something wrong.

We can't do the matrix multiplication because the dimensions don't fit! Note that the embedding tensor has shape [32, 3, 2] while our W1 matrix has shape [6, 100]; they aren't even the same kind of object. So what do we do? We would love our embedding tensor to have shape [32, 6]; specifically, we want a way to "merge" the three embedding vectors in each 3-character window into one. In other words, the number of examples stays the same, but for each example we want to turn a sequence of three 2-d vectors into a single row of 6 values that we can feed into our first hidden layer. And guess what? PyTorch saves the day again!

There is a function in the PyTorch library called torch.cat, and it does exactly the merging we discussed above. First we take the slice of embedding vectors for each position in the window, then we concatenate them along the second dimension. The code goes like this:

# Take the three embedding slices and concatenate them along dim 1
torch.cat([emb[:,0,:], emb[:,1,:],emb[:,2,:]], 1)

Let's see the shape of this vector:

torch.Size([32, 6])

Exactly what we want! But hard-coding the list of embedding slices to pass into the function isn't great, and we'd like to avoid that too. So another function comes to the rescue that does exactly this job; it is called unbind:

# unbind splits the tensor along dim 1, returning a tuple of tensors
torch.cat(torch.unbind(emb,1),1)

Just spend a few seconds contemplating how the library lets us do a labor-intensive task in one line of code. That's amazing.

Internals of Tensors

Actually, there is an even more convenient way to implement the task above, and it doesn't require any fancy functions. First, though, a word about the internals of tensors; I would recommend reading ezyang's blog for a more comprehensive understanding. This is the blog.

Briefly speaking, a PyTorch tensor has an interesting way of storing data: it keeps all the values in a one-dimensional array, irrespective of the tensor's shape. So you can imagine everything being flattened out into one single long sequence. We can get access to this storage through the storage() method:

emb.storage()
 -0.6637181043624878
 0.31748151779174805
 -0.6637181043624878
 0.31748151779174805
 -0.6637181043624878
 0.31748151779174805
 ...

But what is the point of all this? It turns out there is a method called view() that can hand us back a tensor of any shape we like, built on top of the same underlying data. It's much more efficient, since it just presents the data differently rather than creating a new tensor to work with. One note: the new shape can be anything you want, as long as the total number of elements matches the original. Let's try this black magic:

emb.view(32,6)
tensor([[-1.3302, -1.3333, -1.3302, -1.3333, -1.3302, -1.3333],
        [-1.3302, -1.3333, -1.3302, -1.3333, -0.1033, -1.4972],
        [-1.3302, -1.3333, -0.1033, -1.4972, -0.9485,  0.6885],
        [-0.1033, -1.4972, -0.9485,  0.6885, -0.9485,  0.6885],
        ......

We can also use emb.view(-1, 6); it gives the same result and eliminates the need to hard-code the 32 (a number that won't hold once we move to the whole dataset). But remember, we can't ask for shapes like emb.view(5, 6): the product of the new dimensions has to equal the total number of elements in the original tensor (here 32 x 3 x 2 = 192).
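A couple of quick checks (a small sketch of my own) make this tangible: the reshaped tensor points at the same storage, and the only rule is that the element counts must agree.

# view() does not copy anything: the reshaped tensor shares the original storage
print(emb.view(-1, 6).data_ptr() == emb.data_ptr())  # True

# The total number of elements must be preserved
print(emb.numel())   # 192, i.e. 32 * 3 * 2
# emb.view(5, 6)     # would raise a RuntimeError: 5 * 6 = 30 != 192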

Now we should give the new matrix a name, and also take the tanh of the values, just like in the architecture.

# Hidden layer: reshape, multiply by W1, add b1, then squash with tanh
h = torch.tanh(emb.view(-1,6) @ W1 + b1)

If you're thinking about the dimensions here, you may wonder why b1 can be added at all, given that it's just a vector of 100 values while everything else is two-dimensional. This is actually a valid operation (at least in PyTorch), and it relies on something we already know: broadcasting. The b1 vector is treated as a 1x100 row vector and replicated across all 32 rows, so each of its 100 values gets added to the corresponding column of every row, which is exactly what we want!
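Here's a tiny check of those shapes, a sketch built on top of the tensors we already have:

pre_act = emb.view(-1, 6) @ W1   # shape (32, 100)
print(pre_act.shape, b1.shape)   # torch.Size([32, 100]) torch.Size([100])
print((pre_act + b1).shape)      # torch.Size([32, 100]): b1 is broadcast across
                                 # all 32 rows, one bias value per neuron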

And that is everything about the magic of the PyTorch library for manipulating data. I hope you've gotten a sense of how amazing this library is; you can try some operations yourself, they are all on the PyTorch website here.

Create the Final Layer

Think about the dimensions again: what should the W2 matrix look like? First, it should output a score for each of the 27 characters, so there should be a 27 in its shape. Second, it has to match the output of the previous layer, h, which has 100 columns (one per hidden neuron). Hence the size of the final weight matrix is 100x27, and the bias is, of course, a vector of 27 elements.

Let's implement that in our code:

W2 = torch.randn((100,27))
b2 = torch.randn(27)

And now, we're ready to calculate our probability!

logits = h @ W2 + b2

# Applying softmax
counts = logits.exp()
prob = counts / counts.sum(1, keepdim=True)

And finally, the negative log likelihood:

loss = -prob[torch.arange(32),Y].log().mean()
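As a side note (and a handy sanity check), PyTorch can compute this exact classification loss straight from the logits with F.cross_entropy; I believe we'll lean on it properly in the next part, but here's a preview:

# Should print the same value as the hand-rolled negative log likelihood above
loss_check = F.cross_entropy(logits, Y)
print(loss_check)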

The result is, maybe some drum roll for this moment:

tensor(14.3920)

That is TERRIFYING. But we haven't done any optimization yet, zero, zip, nada. And that, my friend, is going to be the story of the next part. In this part we've already equipped ourselves with a whole bunch of knowledge, from Bengio's revolutionary approach, to building a dataset of 3-character contexts, to great data manipulation techniques in PyTorch. That's really enough for today, so congratulate yourself for making it this far.

Thanks for reading, and see you in the next blog!
