Hưng Lê Tiến

BIGRAM LANGUAGE MODELS USING A NEURAL NET

Welcome to the second chapter in the series. Today I will present a different approach to our task of predicting names, based on the video from Andrej Karpathy. In this blog we will do more than merely counting and normalizing to get probabilities, which isn't really machine learning; we will take a step further, a big step, and jump right into neural networks.

A Brief Introduction to Neural Nets

I will give a short overview of this wonderful architecture, just enough of the basics to make you feel comfortable reading this blog. Further details will be discussed in later posts, on things like MLPs, CNNs, and LSTMs. So let's begin, shall we?

A neural network is a kind of architecture that some brilliant people created in order to mimic the human nervous system. To be more precise, we're talking about the Artificial Neural Network (ANN), which consists of a bunch of nodes connected to each other, as you can see here:

Neural Network

You just need to remember that:

  1. Each node in the neural net is called a neuron, and from the image we can see that the neurons are nicely grouped into layers. A neural net typically has an Input layer, an Output layer, and some Hidden layers in the middle (in this blog we will only need a single layer of neurons, so there is no hidden layer at all).
  2. "What are those neurons doing and why do we need to pass information through layers of a bunch of neurons?" might be the question in your mind. Well this is hard to tell, I would say that the neurons take the inputs and apply a function to it, like wx + b, so we have a bunch of functions (sometimes those can be non-linear, we will talk about that later) and the values of w,b that we assign for each function is what we would call the parameters. A complex neural net has thousands of parameters, and those are changeable. This is perhaps the most important thing, we can change the parameters in the neural net.
  3. Our goal is to tweak and tune the parameters in order to get our desired result, and we use a technique called backpropagation to do so. Backpropagation is the method we use when we want to minimize the loss (the difference) between our prediction and the actual result: we work backward from the prediction all the way to the input, computing how each parameter should change along the way. But to get a prediction in the first place, we need to plug in the input and pass it through the layers of neurons; that's called a forward pass.
  4. But how do we actually "get the desired result" from a neural net? What do we expect from the output layer? It depends on the task, but typically, in a classification problem or in this specific makemore project, the output layer produces a probability distribution over the possible values. We will discuss that later when we dig into the code.
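To make point 2 above a bit more concrete, here is a toy sketch of a single neuron in PyTorch. Everything in it (the input, the weights, the bias) is made up purely for illustration:

import torch

x = torch.tensor([0.5, -1.0, 2.0])   # the neuron's inputs
w = torch.tensor([0.1,  0.4, -0.2])  # its weights (parameters we can change)
b = torch.tensor(0.3)                # its bias (another parameter)

out = (w * x).sum() + b              # the neuron computes w·x + b
print(out.item())                    # roughly -0.45

A real network just stacks many of these, layer after layer, and learning means adjusting all those w's and b's.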

Okay, if you understand all of the above, then you are ready for the next part! (Actually, I think my explanation is not that good.)

Building Makemore with a Neural Net

When dealing with neural nets, things get complicated. We can't just "tell" the net what we want; there is a huge language barrier, right? So we need to come up with a task, a really specific task, to hand to the machine, such that when the machine does the task, it produces the result we want.

Normally a task in ML or DL involves minimizing a function, so we need a function to minimize. Where do we find one? Remember the negative log-likelihood that we talked about in the previous chapter? There it is! That also explains why we use the negative log-likelihood rather than the log-likelihood itself: our objective is to maximize the likelihood, which is equivalent to minimizing the negative log-likelihood.
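In symbols: if the model assigns probability p_i to the correct next character of the i-th example, over N examples,

$$\max_W \prod_{i=1}^{N} p_i \;\iff\; \max_W \sum_{i=1}^{N} \log p_i \;\iff\; \min_W \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_i \right)$$

because log is monotonically increasing (it doesn't change where the maximum sits) and flipping the sign turns a maximization into a minimization; averaging over N changes nothing either.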

In the previous chapter we solved the problem in a rather explicit way: we just used raw counts of the data to get the result. In this chapter, and in many chapters to come, we will solve the task in an implicit way, using a kind of hidden mechanism, a bit of black magic, to arrive at the final answer.

We need to restructure our data a bit before putting it into a neural network. Remember that this is still a bigram language model: we have one character as input and we output a probability for each possible next character. We also need to mark the cases where the model is correct, meaning it assigns a high probability to the actual next character, so those correct next characters become our labels, and our goal is to maximize the probability assigned to the labels.

Create the training set

Now we shall begin:

# Create the training set
# Inputs and the labels
xs,ys = [],[]

for w in words[:1]: # just the first word ('emma') for now, to keep things small
  chs = ['.'] + list(w) + ['.']
  for ch1,ch2 in zip(chs,chs[1:]):
    ix1 = stoi[ch1]
    ix2 = stoi[ch2]
    xs.append(ix1)
    ys.append(ix2)

# Create tensors
xs = torch.tensor(xs)
ys = torch.tensor(ys)

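A quick note on dependencies: the snippet above reuses words and stoi from the previous chapter (and we'll use itos later for sampling). If you are starting from a fresh notebook, they were built roughly like this, assuming the names.txt file from the makemore repo:

import torch

# one name per line
words = open('names.txt', 'r').read().splitlines()

# character <-> integer mappings; '.' is the start/end token at index 0
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}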

So we have our inputs and labels; the characters are encoded as integers:

tensor([ 0,  5, 13, 13,  1])
tensor([ 5, 13, 13,  1,  0])


One-hot encoding

But is this good data to feed into our neural net? The answer is NO. Briefly speaking, our boy doesn't like strings or integers, it prefers vectors. And if the vectors are nicely normalized, even better!

Just like we turned text into integers for our counting-based bigram model, here we need a way to encode those integers as vectors to feed into the neural net. A convenient way to do that is one-hot encoding.

One-hot encoding is simple: for an integer with value i, it creates an array full of zeros and sets the i-th element to 1. You can see it in the code below.

import torch.nn.functional as F

xenc = F.one_hot(xs, num_classes = 27)
xenc
tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])

We also need to cast the entries of our vectors to floating point numbers, which is what the neural net computes with.

xenc = F.one_hot(xs, num_classes = 27).float()

So the ingredients are fully prepared, let's cook!

Making the net

The parameters can be stored in one big matrix, and the input vectors are stacked into a big input matrix too. We put things into matrices for computational convenience, and if you scrutinize the behavior of a neural net, or even an MLP or Attention, you'll see that the operations are mostly just a bunch of matrix multiplications.

Let's create our own matrix for the parameter w. Actually, this fellow has a name: the weight.

Initialize the weights

There are numerous ways to initialize the weights, including some quite strategic ones. But oftentimes we simply draw the weights randomly from a normal distribution.

# Initialize the weights
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27,27), generator = g, requires_grad = True)
# randn draws from a standard normal distribution;
# requires_grad=True tells PyTorch to track gradients of W for backpropagation later

Here's a piece of our weight matrix:

tensor([[ 1.5674, -0.2373, -0.0274, -1.1008,  0.2859, -0.0296, -1.5471,  0.6049,
          0.0791,  0.9046, -0.4713,  0.7868, -0.3284, -0.4330,  1.3729,  2.9334,
          1.5618, -1.6261,  0.6772, -0.8404,  0.9849, -0.1484, -1.4795,  0.4483,
         -0.0707,  2.4968,  2.4448],
        [-0.6701, -1.2199,  0.3031, -1.0725,  0.7276,  0.0511,  1.3095, -0.8022,
         -0.8504, -1.8068,  1.2523, -1.2256,  1.2165, -0.9648, -0.2321, -0.3476,
          0.3324, -1.3263,  1.1224,  0.5964,  0.4585,  0.0540, -1.7400,  0.1156,
          0.8032,  0.5411, -1.1646],

And then we modify our inputs with the weights, using matrix multiplication:

(xenc @ W)
tensor([[-0.6701, -1.2199,  0.3031, -1.0725,  0.7276,  0.0511,  1.3095, -0.8022,
         -0.8504, -1.8068,  1.2523, -1.2256,  1.2165, -0.9648, -0.2321, -0.3476,
          0.3324, -1.3263,  1.1224,  0.5964,  0.4585,  0.0540, -1.7400,  0.1156,
          0.8032,  0.5411, -1.1646]])
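A small sanity check before moving on: because each row of xenc is one-hot, the multiplication doesn't really mix anything together, it simply plucks out the row of W that corresponds to the input character. For example, using our tiny 'emma' training set, where xs[1] is 5 (the letter 'e'):

# the second row of xenc @ W should be exactly row 5 of W
print(torch.allclose((xenc @ W)[1], W[5]))  # True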

Adding the non-linearity - Softmax

We won't use just that for our neural net's layers! The problem is this: if you keep multiplying the input by a weight w, over and over again, then at the end of the day, after hundreds of multiplications, you still end up with a mere multiple of x. Applying linear functions repeatedly still gives you something linear, which keeps our model simple and unable to capture complicated patterns.

Hence, we should add some non-linearity to the process. We will use a function called Softmax in this blog.

So what is Softmax? It exists to produce the probability distribution that we want. Remember that after a matrix multiplication the outputs of the neural net can be virtually anything, while we want positive numbers that sum up to 1. Softmax does the trick: it takes the exponential of each output and normalizes by the sum of all the exponentials. There are some other interesting details about this function, but I'll save them for another blog. Let's see the formula:

Softmax
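Written out, for a vector of logits z:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Every output is positive (exponentials always are) and they sum to 1 by construction.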

Now we implement Softmax on our own:

logits =  (xenc @ W) # log-counts
counts = logits.exp() 
probs = counts / counts.sum(1,keepdims = True)
probs

And there it is, nice-looking positive numbers that sum up to 1:

tensor([[0.0607, 0.0100, 0.0123, 0.0042, 0.0169, 0.0126, 0.0027, 0.0232, 0.0137,
         0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2372, 0.0604, 0.0025,
         0.0250, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1535, 0.1458],
        [0.0290, 0.0796, 0.0248, 0.0521, 0.1983, 0.0289, 0.0094, 0.0335, 0.0097,
         0.0301, 0.0701, 0.0229, 0.0115, 0.0184, 0.0108, 0.0315, 0.0291, 0.0045,
         0.0915, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472],

Tweak and Tune the Parameters

Gradient

Before we dive into the main part of today's project, we need to take a step back and think about a very important concept in Machine Learning and Deep Learning in general: the Gradient, and Gradient Descent.

You may have heard about the gradient in your Calculus class: for a multivariable function, the gradient is a vector that points in the direction of steepest ascent. To put it simply, imagine you are standing on a hill; the Gradient friend is the guy who always points you in the steepest direction and tells you to climb up. When doing ML we prefer to climb down, because we're minimizing rather than maximizing, and that's where Gradient Descent comes in.

Gradient descent is an optimization algorithm that uses the gradient to iteratively find the minimum of a function by taking steps in the direction opposite to the gradient.

This is the formula for Gradient Descent:

Gradient Descent
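Written out, with θ the parameters, η the learning rate (how big a step we take) and L the loss:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)$$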

To be honest, this is an over-simplified view of the matter; Gradient Descent is much more involved and has numerous variants. But the main idea stands: it's a way to drive the loss down as far as possible.

Backpropagation

We might need a whole blog post to talk about the beauty of this algorithm: it's the thing that revolutionized the world of neural nets, and it lies at the core of almost every model you see in the world today.

What is it? When calculating the gradient to minimize the loss, we can't realistically write the whole thing down in one shot. The forward pass, the chain of computations that produces the prediction, involves so many consecutive operations that it's hard to treat it as one big function and differentiate it directly.

So a brilliant way around that was proposed: we compute the gradient step by step, starting from the output (the prediction) and passing it backward through the neurons, one after another, until we reach the very first one. This is achieved with the Chain Rule from Calculus, and since the whole neural net is built out of easy-to-differentiate functions (linear maps, sigmoids, ...), we can work our way backward!

Another thing that contributes to the success of this algorithm is autodiff, which stands for automatic differentiation. Andrej also has a video about that, and maybe I will cover it in another blog.
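To make the chain rule + autodiff idea concrete, here is a tiny toy example (nothing to do with our bigram model, just an illustration): PyTorch records every operation of the forward pass and then walks back through them to produce the gradient.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x + 1) ** 2   # forward pass: a composition of simple functions
y.backward()           # backward pass: the chain rule, applied automatically

# by hand: dy/dx = 2 * (3x + 1) * 3 = 6 * (3*2 + 1) = 42
print(x.grad)          # tensor(42.)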

Illustration of Backpropagation

You can watch the videos from 3Blue1Brown, I really admire him!
Backpropagation (Intuitive)
Backpropagation (Calculus)

Implementation in code

Enough theory, let's get back to our main project. Actually, there isn't that much left for us to do, because the library already provides some really convenient functions. Well, at least we understand the underlying process.

We start by backpropagating just once. Remember our forward pass?

  num = xs.nelement() # number of examples in the training set
  xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean() # average negative log-likelihood

Now we just need to perform a backward pass:

W.grad = None # reset the gradient (PyTorch treats None as zero here)
loss.backward() # magical things happen when applying this function

Remember that we set requires_grad = True when we initialized W: that is what allows backward() to backpropagate through everything we computed from W and fill in W.grad. We can have a look at our W.grad:

tensor([[ 0.0121,  0.0020,  0.0025,  0.0008,  0.0034, -0.1975,  0.0005,  0.0046,
          0.0027,  0.0063,  0.0016,  0.0056,  0.0018,  0.0016,  0.0100,  0.0476,
          0.0121,  0.0005,  0.0050,  0.0011,  0.0068,  0.0022,  0.0006,  0.0040,
          0.0024,  0.0307,  0.0292],
        [-0.1970,  0.0017,  0.0079,  0.0020,  0.0121,  0.0062,  0.0217,  0.0026,
          0.0025,  0.0010,  0.0205,  0.0017,  0.0198,  0.0022,  0.0046,  0.0041,
          0.0082,  0.0016,  0.0180,  0.0106,  0.0093,  0.0062,  0.0010,  0.0066,
          0.0131,  0.0101,  0.0018],

As you can see, it is filled with non-zero values, which means we have successfully performed backpropagation on this training set.

Now we need to update our W matrix by subtracting the gradient, scaled by the learning rate (think of it as the length of the step we want to take):

W.data += -50 * W.grad

And we repeat the whole process until we're satisfied with the result. People usually count the repetitions in epochs; since we feed the entire training set through the network at every step here, each iteration of our loop is one epoch (one full pass over the data).
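One caveat before the full loop: so far xs and ys were built from just the first word (words[:1]) to keep the printouts small. To actually train, we rebuild the training set from all the words with the same recipe as before, and record the number of examples (the num used in the loss):

# rebuild the training set from ALL the words this time
xs, ys = [], []
for w in words:
  chs = ['.'] + list(w) + ['.']
  for ch1, ch2 in zip(chs, chs[1:]):
    xs.append(stoi[ch1])
    ys.append(stoi[ch2])
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()  # number of bigram examples

With that in place, here is the full training loop: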

# gradient descent
epochs = 100
for k in range(epochs):

  # forward pass
  xenc = F.one_hot(xs, num_classes=27).float() # input to the network: one-hot encoding
  logits = xenc @ W # predict log-counts
  counts = logits.exp() # counts, equivalent to N
  probs = counts / counts.sum(1, keepdims=True) # probabilities for next character
  loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
  print(loss.item())

  # backward pass
  W.grad = None # set to zero the gradient
  loss.backward()

  # update
  W.data += -50 * W.grad

And that's basically it! Let's see how our model performs. Here are some of the last iterations:

2.4908623695373535
2.4906723499298096
2.4904870986938477
2.4903063774108887
2.4901304244995117
2.489959478378296

Nice! The loss ends up very close to what we got by counting, which makes sense: both approaches fit the exact same bigram model, just by different means. So we're good now.

Sampling

And lastly, here is how we sample names from this model; it's the same procedure we used to sample from the previous bigram model:

# finally, sample from the 'neural net' model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):

  out = []
  ix = 0
  while True:

    # ----------
    # BEFORE:
    #p = P[ix]
    # ----------
    # NOW:
    xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    p = counts / counts.sum(1, keepdims=True) # probabilities for next character
    # ----------

    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    out.append(itos[ix])
    if ix == 0:
      break
  print(''.join(out))

Let's look at our results:

cexze.
momasurailezityha.
konimittain.
llayn.
ka.

They're the same as the bigram model's! Well, another disappointing result. But this is just a one-layer neural network, so maybe it's doing its best for now.

Key notes

Bigram versus Neural Net

The samples from both models are the same, and the losses are nearly identical too, which means that even though the two approaches take different paths, they converge to essentially the same place: gradient descent ends up finding roughly the same bigram probabilities that we previously obtained by counting.

Even though neural nets can be harder to understand, the real win is flexibility: the exact same framework scales up to much bigger models (remember, this is just a plain vanilla one-layer neural net; imagine when the big bros MLP and Transformer come in) and to longer contexts, which the counting approach simply cannot handle, and it can be applied to numerous tasks.

Some insights from the W matrix

So far we've treated this matrix as nothing but a complicated black box that gets tweaked and tuned all the time to produce satisfying results. But there is an intuitive way to understand these parameters, which can help us gain insight into what the model is actually doing.

A weight w represents the importance, or influence, of one neuron on another. Since the next neuron is just a linear combination of the previous ones, we have a function of many variables, each multiplied by its corresponding weight. So when a weight is large and positive, that input strongly boosts the next neuron, and when it is strongly negative, it suppresses it.

When we want the model to produce, for example, the '.' that marks the end of a word, we may not want that decision to be driven by, say, the letter 'j', because only a few names end with 'j'; we want it to be driven more by 'n' or 'm', because those are common ending characters. So what do we do? We lower the weight connecting 'j' to '.' and raise the weights connecting 'n' and 'm' to '.'. And that's exactly what our model is doing under the hood.
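If you want to poke at this yourself, here is one way (my own little experiment, not from the video) to peek at what the trained model has learned: take the row of W for a given input character, push it through the same exp-and-normalize step, and look at the most likely next characters.

# which characters does the model think are most likely to follow 'n'?
row = W[stoi['n']].detach()
p_next = row.exp() / row.exp().sum()   # softmax over the 27 possible next characters
top = torch.topk(p_next, 5)
for prob, idx in zip(top.values, top.indices):
  print(itos[idx.item()], f'{prob.item():.3f}')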

Regularization

If our model is confident about a prediction, that's exactly what we want, but is it good when the model is over-confident? Not really. Sometimes we allow a little uncertainty in order to make the model perform better in the real world. This helps prevent overfitting, and we call it by a fancy name: Regularization.

Allowing some uncertainty in the decision-making process is equivalent to smoothing out the probability distribution. And, perhaps to your surprise, it's quite similar to the smoothing technique we discussed in the previous chapter; this time we just do it in a slightly different way.

Notice that when the W matrix is all zeros, the probability distribution comes out perfectly uniform (we'll sanity-check this in a moment), so we try to pull the values in that matrix toward zero to get a smoother, more uniform distribution. This is done in the loss function, where we now have two objectives:

loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
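As a quick sanity check of the "all-zero W gives a uniform distribution" claim: the softmax of a row of zeros is exactly 1/27 everywhere.

zero_logits = torch.zeros(27)
uniform = zero_logits.exp() / zero_logits.exp().sum()
print(uniform[:5])   # every entry is 1/27 ≈ 0.037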

And when you run the code again, you'll see that the loss is a bit higher than the loss without regularization. It's a small tradeoff we make to ensure that the model performs well in a real-world setting. (You can look up the bias/variance tradeoff.)

Regularization is actually an interesting topic that deserves a blog post of its own (just two posts in and my to-do list has already grown considerably). There are numerous regularization techniques, each with its own beauty; go look them up if you're curious.

Summary

We've come so far in our journey! We learned tons of things: about neural nets, the Softmax function, Gradient Descent, Backpropagation, and Regularization too. That's a lot packed into this tiny session about a small, simple language model. We opened up a whole new horizon of knowledge just by diving deep enough, and we had some fun playing with the code along the way. Life is good, my friend!

In the next chapter we will discuss the MLP (multi-layer perceptron), following Andrej's series, and I think it will be a lot of fun!

Stay tuned, and thanks for reading. Have a good day!
