<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hưng Lê Tiến</title>
    <description>The latest articles on DEV Community by Hưng Lê Tiến (@blackbidz).</description>
    <link>https://dev.to/blackbidz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3603844%2F8e042e5e-07ae-44cc-8d29-087ac8c4c423.jpg</url>
      <title>DEV Community: Hưng Lê Tiến</title>
      <link>https://dev.to/blackbidz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blackbidz"/>
    <language>en</language>
    <item>
      <title>BIG STEPS TO TRANSFORMER (PART 2): BUILDING THE TRANSFORMER</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Sat, 06 Dec 2025 10:02:29 +0000</pubDate>
      <link>https://dev.to/blackbidz/big-steps-to-transformer-part-2-building-the-transformer-302f</link>
      <guid>https://dev.to/blackbidz/big-steps-to-transformer-part-2-building-the-transformer-302f</guid>
      <description>&lt;p&gt;This might be the end of the blogposts I made based on the videos of Andrej, so shout out to him the last time. Respect!&lt;/p&gt;

&lt;p&gt;A word of warning: I'm currently working on projects, so things have gotten messy (I'm going around fixing bugs every day). It's hard to systematically note things down, and sometimes I just want to talk really briefly about a topic. So you may not find detailed explanations in my blog posts anymore; some notions will be under-explained or even wrongly explained. I'm just sharing my thoughts and some of the insights that I came up with, so don't be too hard on me, okay :).&lt;/p&gt;

&lt;p&gt;Now we know that, in the Bigram model, the model uses &lt;strong&gt;just the previous token&lt;/strong&gt; to predict the next one, which really limits the context. In other words, the &lt;strong&gt;context window&lt;/strong&gt; is too small, and we really want to increase it. But when we discuss scaling the N-gram model, we hit a big problem: the number of contexts the model has to keep track of &lt;strong&gt;grows exponentially&lt;/strong&gt; with the context length. &lt;/p&gt;
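
&lt;p&gt;To make that concrete, here's a quick back-of-the-envelope count (assuming a character-level vocabulary of about 65 symbols, roughly what the tiny Shakespeare dataset gives us):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough count of contexts an N-gram table would need to cover
vocab_size = 65  # assumption: character-level Shakespeare vocab
for n in [2, 3, 8]:
    print(n, vocab_size ** n)
# 2 gives 4,225 contexts; 3 gives 274,625; 8 already gives about 3.2e14
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;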

&lt;p&gt;And this is when our Attention boy comes in. Attention allows us to look at everything in the past (within a sequence of size &lt;code&gt;block_size&lt;/code&gt;), and we will walk through it from the naive approach to the actual version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Naive Approach
&lt;/h2&gt;

&lt;p&gt;Let me be specific: what we want here is, for each timestep (as we walk through the time dimension), to see &lt;strong&gt;every character&lt;/strong&gt; behind us in order to make our decision. &lt;/p&gt;

&lt;p&gt;We do that in a fairly simple way: just carry the data of all the previous characters with us via a &lt;strong&gt;weighted sum&lt;/strong&gt;, and in our first case we will just take the mean. Here's how it looks with a toy example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xbow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;xprev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,:&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (t,C)
&lt;/span&gt;    &lt;span class="n"&gt;xbow&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xprev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;x[0]
tensor([[ 0.3624, -1.6396],
        [-0.3599, -1.2083],
        [-2.0063,  0.8857],
        [ 0.0144,  1.5211],
        [ 0.6635, -0.5929],
        [ 1.0220,  0.8683],
        [ 1.2717, -0.6242],
        [-1.4394,  0.2805]])
&amp;gt;&amp;gt;&amp;gt;xbow[0]
tensor([[ 3.6235e-01, -1.6396e+00],
        [ 1.2276e-03, -1.4240e+00],
        [-6.6794e-01, -6.5406e-01],
        [-4.9734e-01, -1.1026e-01],
        [-2.6517e-01, -2.0679e-01],
        [-5.0639e-02, -2.7610e-02],
        [ 1.3827e-01, -1.1284e-01],
        [-5.8944e-02, -6.3670e-02]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you know what? All of that can be done with matrix multiplication. I won't go deep into the details here, that's the responsibility of your Linear Algebra lecturer. So let me just put the code right under this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tril&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;c=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to notice here. First, we have that &lt;code&gt;tril&lt;/code&gt; thing, which does the job of converting the matrix into a &lt;em&gt;lower triangular&lt;/em&gt; one. The intuition behind this is that no token is allowed to talk to the ones in the future; it can &lt;strong&gt;only look to the past&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Second, notice that we divided each row by its sum. This ensures that we take the &lt;em&gt;mean&lt;/em&gt; rather than just the aggregated sum.&lt;/p&gt;

&lt;p&gt;Now let's implement that in our real input matrix &lt;code&gt;x&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tril&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xbow2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="c1"&gt;# (B, T, T) @ (B, T, C) ----&amp;gt; (B, T, C)
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xbow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xbow2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That will return &lt;code&gt;True&lt;/code&gt;, trust me. And notice the comment I made: there is a bit of broadcasting there, because the dimensions don't really match (we're dealing with batches). Broadcasting just replicates the &lt;code&gt;TxT&lt;/code&gt; matrix across the batch dimension, so things are good here.&lt;/p&gt;

&lt;p&gt;In practice, though, we would get rid of that normalizing part and instead apply the &lt;strong&gt;softmax&lt;/strong&gt;, which I will show you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;masked_fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tril&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wei&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xbow3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;allclose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xbow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xbow3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember that we still want to create a lower triangular matrix and then normalize, but how can we tell softmax where the zeros should be? We put &lt;code&gt;-inf&lt;/code&gt; in each place that is supposed to hold a &lt;code&gt;0&lt;/code&gt;; in doing so, softmax will produce exact zeros there, because softmax takes the &lt;strong&gt;exponential&lt;/strong&gt; of the values, and the exponential of -inf is 0. &lt;/p&gt;
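
&lt;p&gt;Here's a tiny sanity check of that claim (just a sketch you can paste into a REPL):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

# one row of masked scores: only the first slot is visible
row = torch.tensor([0.0, float('-inf'), float('-inf')])
print(F.softmax(row, dim=-1))  # tensor([1., 0., 0.]): the -inf entries contribute nothing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;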

&lt;p&gt;A small note here before we jump into the Attention part: we need to somehow tell each token which &lt;em&gt;position&lt;/em&gt; it's in. Take "He sat on the mat, with the cat" versus "The cat sat on the mat", for example: the relative order between "sat" and the nouns ("he", "cat") is crucial to understanding the meaning of the sentence. So &lt;em&gt;order matters&lt;/em&gt;, and thus we shall introduce &lt;strong&gt;Positional Embedding&lt;/strong&gt;, which is just an &lt;code&gt;Embedding&lt;/code&gt; layer over positions whose output we add to our &lt;code&gt;token_embedding&lt;/code&gt;. Then we're all set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# It kinda goes like this
&lt;/span&gt;&lt;span class="n"&gt;pos_emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;posemb_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tok_emd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pos_emb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Crux of Self Attention
&lt;/h2&gt;

&lt;p&gt;Now, first, there are some problems with our naive approach: it treats each of the previous tokens as an equal contributor, while in practice some tokens can be &lt;strong&gt;more valuable&lt;/strong&gt; in the eyes of other tokens. In other words, we would love to get rid of our naive everything-equal approach and adopt a proper &lt;strong&gt;weighted sum&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To compute that weighted sum, we need to find the &lt;strong&gt;relationship&lt;/strong&gt; of the examined token with all the previous ones. And we do that in a fairly easy-to-understand way: let each token have two vectors, &lt;strong&gt;Q&lt;/strong&gt; (for query) and &lt;strong&gt;K&lt;/strong&gt; (for key). It goes like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As we march through the time steps, each token emits its &lt;strong&gt;query vector&lt;/strong&gt;, like asking the previous tokens: I'm looking for A, B, C, ... It might look for an adjective, or a noun, or a word to make its context more specific, etc.&lt;/li&gt;
&lt;li&gt;After emitting the query vector, we take each of the &lt;strong&gt;keys&lt;/strong&gt; from the previous tokens, which are simply the answers to the query. And we measure each answer by taking the &lt;strong&gt;dot product&lt;/strong&gt; of the two vectors (remember that dot products actually measure the &lt;em&gt;similarity&lt;/em&gt; between vectors). And then we have our weights!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But we won't apply those weights directly to our &lt;code&gt;x&lt;/code&gt;; instead, we introduce another vector: &lt;strong&gt;V&lt;/strong&gt;, which represents the value of each token. The original vector of each token is kinda private information, and the value is a way of saying "If you are interested in me, here's what I will give you, but I won't take myself with you, ain't no way bud".&lt;/p&gt;

&lt;p&gt;And all of that, my friend, is the crux of self-attention. You might want to check a great video and visualization from 3Blue1Brown (and check his series), those videos are phenomenal: &lt;a href="https://youtu.be/eMlx5fFNoYc?si=i7zVZObOevccKiQ-" rel="noopener noreferrer"&gt;Attention in transformers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's implement in the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;head_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="c1"&gt;# This is not a layer, it's just a nice way to
# construct our weight matrices
&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# B,T,16
&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Need to tranpose k
&lt;/span&gt;&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tril&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tril&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;masked_fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tril&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wei&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Creating the value vectors
&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wei&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A small thing to notice here is that we &lt;strong&gt;transpose&lt;/strong&gt; the K matrix (over its last two dimensions) and then multiply it with Q; that's just a convenient way of taking all the query-key dot products at once.&lt;/p&gt;

&lt;p&gt;That's all of the basics about Self-attention, and there are a few notes from Andrej that you should read. I won't go into specifics because I'm lazy (sorry):&lt;/p&gt;

&lt;h3&gt;
  
  
  Notes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Attention is a &lt;strong&gt;communication mechanism&lt;/strong&gt;. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.&lt;/li&gt;
&lt;li&gt;There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.&lt;/li&gt;
&lt;li&gt;Each example along the batch dimension is, of course, processed completely independently; the examples never "talk" to each other.&lt;/li&gt;
&lt;li&gt;In an "encoder" attention block just delete the single line that does masking with &lt;code&gt;tril&lt;/code&gt;, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.&lt;/li&gt;
&lt;li&gt;"self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)&lt;/li&gt;
&lt;li&gt;"Scaled" attention additional divides &lt;code&gt;wei&lt;/code&gt; by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. &lt;/li&gt;
&lt;/ul&gt;
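
&lt;p&gt;That last note amounts to a one-line change in the head code above (a sketch, reusing the &lt;code&gt;q&lt;/code&gt; and &lt;code&gt;k&lt;/code&gt; tensors we already computed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Scaled attention: divide the raw scores by sqrt(head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;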

&lt;h2&gt;
  
  
  Toward the Transformer
&lt;/h2&gt;

&lt;p&gt;Now, we're not done yet! We have a lot of things to do to arrive at the final model. First, let's take a look at the architecture in the paper &lt;a href="https://arxiv.org/pdf/1706.03762" rel="noopener noreferrer"&gt;Attention Is All You Need&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ci6wueqcmgmkxx122ia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ci6wueqcmgmkxx122ia.png" alt="Architecture" width="695" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-headed attention
&lt;/h3&gt;

&lt;p&gt;We do that by running several layers of self-attention (each of which we call a &lt;code&gt;Head&lt;/code&gt;) in parallel. The intuition is that we want the model to ask as many questions as possible, and thus understand more about the context. &lt;/p&gt;

&lt;p&gt;For the code, oftentimes we divide the embedding dimension by the number of heads that we want to get each &lt;code&gt;head_size&lt;/code&gt;, and after the multi-head layer we concatenate the heads' outputs &lt;strong&gt;in the C dimension&lt;/strong&gt;. It's like dividing the features into small chunks for the model to process, as in the sketch below.&lt;/p&gt;
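
&lt;p&gt;Here's a minimal sketch of that idea, assuming a &lt;code&gt;Head&lt;/code&gt; module wrapping the single-head code from earlier (the fuller version, with projection and dropout, comes later in this post):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class MultiHeadAttention(nn.Module):
    """ several heads of self-attention, running in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        # assumes a Head module wrapping the single-head code above
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # each head returns (B, T, head_size); concatenate along the channel dim
        return torch.cat([h(x) for h in self.heads], dim=-1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;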

&lt;h3&gt;
  
  
  FeedForward layer
&lt;/h3&gt;

&lt;p&gt;After some layers of attention, we need to allow our neurons some additional time to process the information that they have, and even to communicate with each other more, in order to arrive at something good. I know that I'm describing this in an overly metaphoric way, but I think that's the intuition behind adding an MLP after the Multi-head.&lt;/p&gt;

&lt;p&gt;So let's implement it, shall we? We just need to add a Linear layer followed by a non-linear activation function (which in this case is the &lt;em&gt;ReLU&lt;/em&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeedFoward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; a simple linear layer followed by a non-linearity &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;net&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Residual connections
&lt;/h3&gt;

&lt;p&gt;The Multi-head and the FeedForward are the bare bones of our model, so now we know the important parts of the Transformer.&lt;/p&gt;

&lt;p&gt;We shall move on to some smaller, yet still important, details of the architecture. First, I would like to tell you about the &lt;strong&gt;Residual Connection&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Remember the problem of the &lt;em&gt;Vanishing Gradient&lt;/em&gt;? It happens when the gradient has to be passed back through numerous consecutive layers, getting squashed and sometimes zeroed out along the way (mainly because of saturated activation functions), eventually resulting in minimal to zero updates for the parameters.&lt;/p&gt;

&lt;p&gt;In other words, the gradient has to flow along some winding, twisty roads to bring updates back to our early layers. Well, so why don't we construct another road that allows our gradient to flow smoothly without having to go through any additional pathway? That's how we arrive at the &lt;strong&gt;residual pathway&lt;/strong&gt;, which is really like a highway for the gradients. Look back at the architecture illustration: the additional arrows at the side of the model show that we add the input back to the output of each sub-layer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In our block, we modify this
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ffwd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not really comfortable with this metaphoric way of thinking, then you can think in mathematical terms: when the gradient flows back, the &lt;em&gt;addition operation&lt;/em&gt; distributes the gradient &lt;strong&gt;equally&lt;/strong&gt; to the previous nodes. So the input still receives some unmodified, unsquashed gradient through the backward pass. This residual idea also opened up new kinds of architectures, like the &lt;strong&gt;Wide and Deep Neural Network&lt;/strong&gt; (you can search for that, it's really interesting).&lt;/p&gt;
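
&lt;p&gt;You can verify that "unmodified gradient" claim with a tiny autograd sketch (the quadratic branch here is just a hypothetical stand-in for a sub-layer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x = torch.ones(3, requires_grad=True)
# "residual" identity path plus a branch, like x + sublayer(x)
y = (x + x.pow(2)).sum()
y.backward()
print(x.grad)  # tensor([3., 3., 3.]) = 1 (identity path) + 2*x (branch path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;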

&lt;p&gt;A small detail here is that we also need two projection layers, one after the Multi-head and one after the FeedForward. These act as &lt;strong&gt;mixers&lt;/strong&gt;: we mix things up in order to allow more exchange of information among heads. We just add a linear layer of size &lt;code&gt;n_embd x n_embd&lt;/code&gt; (notice that the input and output sizes are equal, so basically we're just mixing things up).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Adding a projection layer
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also notice that in the paper, in the FeedForward layer, the researchers make the inner layer 4 times wider than the model dimension (d_ff = 2048 versus d_model = 512, creating a much larger working space for our neurons, I think), so we should include that in our FeedForward layer too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  LayerNorm
&lt;/h3&gt;

&lt;p&gt;There is actually a whole family of normalization techniques, and we already know one of them: BatchNorm. But we also know that BatchNorm has some problems of its own, including the side effect of &lt;em&gt;coupling instances&lt;/em&gt; together, and having to keep track of the &lt;em&gt;population mean and variance&lt;/em&gt;. It turns out that BatchNorm doesn't work well with sequential inputs like ours, mainly because at generation time we don't pass data in &lt;em&gt;in big batches&lt;/em&gt;, but rather &lt;em&gt;one instance at a time&lt;/em&gt;, and researchers have found a way to combat the issue.&lt;/p&gt;

&lt;p&gt;It is LayerNorm, where we don't normalize data along the batch dimension, but rather along the &lt;strong&gt;feature dimension&lt;/strong&gt;, which in our case is the &lt;em&gt;channels&lt;/em&gt;, i.e. the &lt;em&gt;embedding dimension&lt;/em&gt;. So we don't even need to compute the running mean and variance anymore, because now each instance has its own mean and variance, regardless of the batch it's in.&lt;/p&gt;

&lt;p&gt;To compute it, we just need to change the dimension when normalizing: Normalizing rows rather than columns. And that's it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eps&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;momentum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;momentum&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="c1"&gt;# parameters (trained with backprop)
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;xmean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# layer mean
&lt;/span&gt;    &lt;span class="n"&gt;xvar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# layer variance
&lt;/span&gt;    &lt;span class="n"&gt;xhat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xmean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xvar&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;xhat&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In recent years there have been many improvements and new findings. The first is that we no longer add the Norm layer after the Multi-head or the FeedForward, but &lt;strong&gt;before&lt;/strong&gt; them (the so-called pre-norm formulation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; Transformer block: communication followed by computation &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_head&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;head_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;n_head&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiHeadAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_head&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ffwd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FeedFoward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Creating our LayerNorm
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ln1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ln2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LayerNorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# We apply the LN first, then others
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ffwd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ln2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They also discovered that LayerNorm has some problems of its own, mainly around interpretability. Researchers took that seriously and invented numerous other normalization techniques, namely &lt;em&gt;InstanceNorm&lt;/em&gt;, &lt;em&gt;GroupNorm&lt;/em&gt; (which is part LayerNorm, part InstanceNorm), and &lt;em&gt;RMSNorm&lt;/em&gt; (widely used in current models: LayerNorm, but we don't subtract the mean, and we divide the inputs by the root mean square rather than the standard deviation).&lt;/p&gt;
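
&lt;p&gt;To make that last one concrete, here's a minimal RMSNorm sketch matching the description above (just an illustration, not any particular library's implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class RMSNorm(nn.Module):
    """ like LayerNorm, but no mean subtraction: scale by the root mean square """

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.gamma * x / rms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;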

&lt;h3&gt;
  
  
  Dropout
&lt;/h3&gt;

&lt;p&gt;The last small thing, before we wrap up the model, is Dropout. Dropout is just an additional layer, mainly used for regularization. It works by randomly "killing" some neurons during training, making the collaboration game harder, and eventually developing a kind of &lt;em&gt;independence&lt;/em&gt; among neurons, thus making them stronger (no pain no gain, right? I often wonder why these neurons are always treated so softly, I mean we should be strict sometimes, and Dropout really does the trick).&lt;/p&gt;

&lt;p&gt;Dropout is also applied right after we softmax the key-query weight matrix in the Attention layer, so that when multiplying with the value vectors, it randomly prevents some communication between tokens.&lt;br&gt;
&lt;/p&gt;
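
&lt;p&gt;For reference, since this post never spells the single head out in full, here's a sketch of a &lt;code&gt;Head&lt;/code&gt; with that dropout in place (assuming globals &lt;code&gt;n_embd&lt;/code&gt;, &lt;code&gt;block_size&lt;/code&gt; and &lt;code&gt;dropout&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class Head(nn.Module):
    """ one head of self-attention, with dropout on the attention weights """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                      # (B, T, head_size)
        q = self.query(x)                                    # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled scores (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)  # randomly drop some attention weights
        v = self.value(x)
        return wei @ v           # (B, T, head_size)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;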

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dropout in the Multihead and the Linear layer
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MultiHeadAttention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; multiple heads of self-attention in parallel &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heads&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;Head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;head_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_heads&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nf"&gt;h&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heads&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeedFoward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt; a simple linear layer followed by a non-linearity &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ReLU&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dropout&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;net&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's all I want to cover here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I know this kind of explanation won't please everyone: I introduced concepts rather abruptly, and sometimes I even skipped some parts. But if you understood what I meant, then congrats, my friend, you've just walked through one of the most brilliant architectures in human history.&lt;/p&gt;

&lt;p&gt;The Transformer is just blocks of Multihead and FeedForward (and all the other stuff) stacked onto each other, that's it! Now I think I gotta build this as one of my projects, so goodbye for now.&lt;/p&gt;
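
&lt;p&gt;If you want to see that stacking a bit more concretely, here's a minimal sketch of what one such block could look like. This is my own rough sketch, assuming the multi-head attention module we built above is named &lt;code&gt;MultiHeadAttention&lt;/code&gt;; the layer norms and residual connections are part of the "other stuff":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class Block(nn.Module):
    """ one Transformer block: communication (attention) then computation (MLP) """

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)  # assumed name of our multi-head module
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # residual connections: add each sub-layer's output back onto its input
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;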

&lt;p&gt;Thanks for reading and, as usual, have a good day!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>deeplearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>BIG STEPS TO TRANSFORMER (PART 1): BUILDING THE BIGRAM</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Sun, 30 Nov 2025 09:39:32 +0000</pubDate>
      <link>https://dev.to/blackbidz/big-steps-to-transformer-part-1-building-the-bigram-1g9b</link>
      <guid>https://dev.to/blackbidz/big-steps-to-transformer-part-1-building-the-bigram-1g9b</guid>
      <description>&lt;p&gt;I'm back! It took a while for me to grasp all of the basics about Transformer, and we're now ready to jump right into that big boy. The days of playing with casual neural net stuff are gone, we shall behold the most modern architecture in this world, and let's break it down step by step.&lt;/p&gt;

&lt;p&gt;This blogpost is, again, based on the series about Neural Nets by Andrej Karpathy. Big respect to him.&lt;/p&gt;

&lt;p&gt;Now let's start our journey. We will start simple, really simple, by calling up our primitive language model again: &lt;strong&gt;The Bigram Language Model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You may ask: &lt;em&gt;What is the point of this?&lt;/em&gt; Then relax, my friend: even though the bigram is an extremely simple model, it provides a great setup for our big boy. To be more specific, in this blog we will rebuild the bigram in a more &lt;em&gt;general&lt;/em&gt; way, which we will then adjust step by step until it eventually converges to a Transformer. So bear with me for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load the Data
&lt;/h2&gt;

&lt;p&gt;We will consider a much bigger dataset: not a list of American names, but &lt;strong&gt;scripts from Shakespeare&lt;/strong&gt;. It's the one Andrej prefers, so we will stick with that. And as you can probably guess, it is &lt;em&gt;way larger&lt;/em&gt; than the corpus we used in the previous project.&lt;/p&gt;

&lt;p&gt;Let's download the data for our Colab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wget&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;githubusercontent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;karpathy&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;rnn&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tinyshakespeare&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then read the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# read it in to inspect it
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tokenize
&lt;/h2&gt;

&lt;p&gt;Just like in all of our previous projects, we have to find a way to &lt;em&gt;encode&lt;/em&gt; our text so that it can be fed into the neural net, and of course we should also have a &lt;em&gt;decode&lt;/em&gt; step to convert the numerical representation back into text. First we just store the characters, to construct the mapping later and also to get the &lt;code&gt;vocab_size&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Construct the list of chars + vocab_size
&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;vocab_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We might want to see our chars, which are different from those in the previous projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
 !$&amp;amp;',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alongside alphabetical chars, we also have numeric and even special ones. Actually &lt;code&gt;chars[0]&lt;/code&gt; is the newline character &lt;code&gt;'\n'&lt;/code&gt;, which is interesting, and our &lt;code&gt;vocab_size&lt;/code&gt; is now 65, not 27 anymore.&lt;/p&gt;
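
&lt;p&gt;You can quickly verify that yourself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;print(repr(chars[0]))  # '\n', the newline character
print(chars[:3])       # ['\n', ' ', '!']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;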

&lt;p&gt;Now we should create a mapping to convert all characters into integers, and then apply a function to encode a string into numbers (remember the inverse mapping and the decode step). Unlike the previous project, this time we create explicit &lt;code&gt;encode&lt;/code&gt; and &lt;code&gt;decode&lt;/code&gt; functions, because we are dealing with a lot of characters and things get complicated.&lt;/p&gt;

&lt;p&gt;Here's the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# create a mapping from characters to integers
&lt;/span&gt;&lt;span class="n"&gt;stoi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;itos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;encode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# encoder: take a string, output a list of integers
&lt;/span&gt;&lt;span class="n"&gt;decode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# decoder: take a list of integers, output a string
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hii there&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hii there&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works great! Now let's turn the text into integers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# let's now encode the entire text dataset and store it into a torch.Tensor
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt; &lt;span class="c1"&gt;# we use PyTorch: https://pytorch.org
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;long&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# the 10 characters we looked at earier will to the GPT look like this
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Beautiful! Just one more note here: this process is called &lt;strong&gt;Tokenization&lt;/strong&gt;, and the tokenization we're using is the most naive and simplest form: character-level tokenization. In practice, researchers have developed a whole bunch of methods for tokenizing text. The most groundbreaking one, and the one actually used in LLMs today, is &lt;strong&gt;sub-word tokenization&lt;/strong&gt; (like &lt;em&gt;SentencePiece&lt;/em&gt; from Google or &lt;em&gt;tiktoken&lt;/em&gt; from OpenAI). There is a lot to talk about regarding Tokenization, so I will do a whole new blog for that (promise, because Andrej has a video on that too).&lt;/p&gt;
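
&lt;p&gt;Just to give you a taste of the sub-word flavor, here's a small side demo (assuming you have the &lt;code&gt;tiktoken&lt;/code&gt; package installed; we won't use it in this project):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the BPE tokenizer used by GPT-2
tokens = enc.encode("hii there")
print(tokens)              # a much shorter list than our 9 character-level integers
print(enc.decode(tokens))  # 'hii there'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;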

&lt;h2&gt;
  
  
  Splitting the dataset
&lt;/h2&gt;

&lt;p&gt;From our previous project, we learned that in order to train the model effectively and reduce the risk of overfitting, we should split the dataset into subsets: a "trainset", a "devset", and a "testset". We will apply the same idea to this project by splitting the data into two parts, one for training and one for validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Let's now split up the data into train and validation sets
&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# first 90% will be train, rest val
&lt;/span&gt;&lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;val_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Context Window
&lt;/h2&gt;

&lt;p&gt;Now, remember that in the previous model we have the &lt;code&gt;block_size&lt;/code&gt;, which is simply the length of the chunk of text that we feed into our model. In this project we will increase our &lt;code&gt;block_size&lt;/code&gt; (or, as people in the modern world call it, the &lt;code&gt;context_window&lt;/code&gt;) to 8. This allows our model to look back further, thus having a better understanding of the &lt;em&gt;context&lt;/em&gt; that it's in.&lt;br&gt;
Another thing to note here is that we're not going to predict a char by looking at just its previous 8 characters like in the past model; instead we will predict the next char at &lt;strong&gt;each position&lt;/strong&gt; in the context window. That means, in a chunk of 8 characters, first we show the first char and the model predicts the next, then we show the first 2 chars and the model predicts again, and the list goes on until the model has predicted across the whole sequence of 8. &lt;br&gt;
In other words, we are not using the "sliding window" technique anymore; rather, we're revealing one character at a time in a chunk of text. Now this is an important moment: because of that sequential processing, we would rather refer to the &lt;code&gt;block_size&lt;/code&gt; as the number of &lt;strong&gt;time steps&lt;/strong&gt;, specifying the &lt;em&gt;time&lt;/em&gt; of our predictions and thus signaling a kind of sequential processing. (This is a cross-over with the &lt;strong&gt;Recurrent Neural Network (RNN)&lt;/strong&gt;, if you know it.)&lt;/p&gt;

&lt;p&gt;Now to make clear to you what I mean, we will introduce that in the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="c1"&gt;# We also construct the target,
# which is just the x but we offset by a character
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;when input is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; the target: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a different way of prediction, and keep it in mind, because it is the backbone of the Transformer when we scrutinize it more deeply. Remember that we're doing the setup for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Batches
&lt;/h2&gt;

&lt;p&gt;Now that we know about the time dimension, we should also know about the &lt;strong&gt;Batch dimension&lt;/strong&gt;. In training, we would love to do things &lt;em&gt;in parallel&lt;/em&gt;, meaning that at each time step we want to process more than one thing, separately (it's like creating numerous parallel universes working independently). And that's actually the &lt;code&gt;batch_size&lt;/code&gt;; the "creating parallel universes" part is simply feeding the chunks of text &lt;em&gt;in batches&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We create batches just like we did in the previous project, by initializing a tensor of random offsets, then using those values to index the chunks of text. For simplicity, the batch dimension will be just 4:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For reproducibility, we introduce a manual seed
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1337&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;val_data&lt;/span&gt;
  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
  &lt;span class="c1"&gt;# The stack method creates the batch dimension for our data
&lt;/span&gt;  &lt;span class="c1"&gt;# simply by stacking chunks onto each other
&lt;/span&gt;  &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 
  &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;
&lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;yb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 4x8 tensor
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we can use that data for our neural nets! &lt;/p&gt;
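
&lt;p&gt;A quick sanity check you can run here: the targets &lt;code&gt;yb&lt;/code&gt; are just the same chunks shifted one character to the right, so each row of &lt;code&gt;yb&lt;/code&gt; overlaps with its row in &lt;code&gt;xb&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;print(yb.shape)                            # same 4x8 shape as xb
print(torch.equal(xb[0, 1:], yb[0, :-1]))  # True: yb is xb shifted by one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;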

&lt;h2&gt;
  
  
  Build the model
&lt;/h2&gt;

&lt;p&gt;Let's look at the code first, and then I will explain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1337&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BigramLanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# each token directly reads off the logits for the next token from a lookup table
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_embedding_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# idx and targets are both (B,T) tensor of integers
&lt;/span&gt;    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_embedding_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# B,T,C
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BigramLanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;yb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First we import &lt;code&gt;Module&lt;/code&gt; from &lt;code&gt;torch.nn&lt;/code&gt; to get some methods that are convenient for our project. The thing to look at here is the &lt;code&gt;Embedding&lt;/code&gt; layer; it works just like when we embedded our characters manually: it creates a lookup table, which is just a bunch of vectors (with the dimension depending on our initialization), and then when we want to embed something, we just look it up in the table, and the table returns a unique vector for it. &lt;/p&gt;

&lt;p&gt;In this project we create a 65x65 lookup table (as the &lt;code&gt;vocab_size&lt;/code&gt; is 65), so each character is assigned a vector storing 65 values; that's the &lt;strong&gt;embedding size&lt;/strong&gt;. The embedding size is up to our liking, it can be any number. &lt;em&gt;But why 65?&lt;/em&gt; you may ask. Well, we're trying to replicate the &lt;strong&gt;Bigram model&lt;/strong&gt;, remember? And in the Bigram model we also have a lookup table of &lt;code&gt;vocab_size&lt;/code&gt;x&lt;code&gt;vocab_size&lt;/code&gt;, but that one is just for counting the occurrence of pairs. And the hard truth is, we're not actually counting anything here.&lt;/p&gt;

&lt;p&gt;To be clear, this is not a &lt;strong&gt;count-based&lt;/strong&gt; Bigram model, but rather a &lt;strong&gt;neural net that replicates the bigram&lt;/strong&gt;, just like we built in the second blogpost. But if we optimize it well, then at the end of the day the table will converge to nearly the same thing as the actual count table (and our table represents the &lt;em&gt;log-counts&lt;/em&gt;, not the raw counts; I think we talked about that earlier).&lt;/p&gt;

&lt;p&gt;The last thing to keep in mind here, which is really important later: note that the dimensionality of the data increases after embedding, with an additional dimension for storing the embedding vectors. So the shape of the log-counts, or the logits, will be &lt;code&gt;4x8x65&lt;/code&gt;, and you might notice the &lt;code&gt;B,T,C&lt;/code&gt; next to it. That's important, and we will use it a lot later in the blog. The &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;T&lt;/code&gt;, as you might know, are the &lt;strong&gt;batch dimension&lt;/strong&gt; and the &lt;strong&gt;time dimension&lt;/strong&gt; of our data; what about &lt;code&gt;C&lt;/code&gt;? The &lt;code&gt;C&lt;/code&gt; here stands for the &lt;strong&gt;Channels&lt;/strong&gt;, which denote the &lt;em&gt;depth of the meanings&lt;/em&gt; that each character holds. This shares a resemblance with the &lt;strong&gt;Convolutional Neural Net&lt;/strong&gt; (CNN), in which we have data of the size &lt;code&gt;(H,W,C)&lt;/code&gt;: the first 2 dimensions represent Height and Width, while the last represents the Channels (in a typical RGB image, one pixel has 3 channels: red, green, and blue). Feel free to pause and ponder, and think about &lt;code&gt;T&lt;/code&gt; and the RNN as well; then you will see that this is perhaps the second greatest unification in history (the greatest, in my opinion, is in The Avengers: Endgame).&lt;/p&gt;

&lt;p&gt;And a small note here: when we plug the index of our character into the lookup table, it actually returns the &lt;em&gt;row&lt;/em&gt; at that index, with 65 values corresponding to the logits, and then we can apply softmax to find the probability of the next character.&lt;/p&gt;
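
&lt;p&gt;Here's a tiny standalone illustration of that row lookup (with a fresh, untrained table, so the numbers are random):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn
from torch.nn import functional as F

table = nn.Embedding(65, 65)     # a 65x65 lookup table, like ours
row = table(torch.tensor([18]))  # the row for character index 18, shape (1, 65)
probs = F.softmax(row, dim=-1)   # turn the 65 logits into probabilities
print(probs.shape, probs.sum())  # torch.Size([1, 65]), and the probabilities sum to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;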

&lt;p&gt;Now, for every neural network implementation, we would like to see the loss. We might simply use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that would cause an error (try it yourself!). The reason is that the &lt;code&gt;cross_entropy&lt;/code&gt; function in PyTorch expects the Channels as the second dimension; in other words, it expects the logits to be of the shape &lt;code&gt;B,C,T&lt;/code&gt; rather than our &lt;code&gt;B,T,C&lt;/code&gt;. Kinda tiring, but now we have to adjust our logits, and the targets correspondingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate the loss
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1337&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BigramLanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# each token directly reads off the logits for the next token from a lookup table
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token_embedding_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# idx and targets are both (B,T) tensor of integers
&lt;/span&gt;    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_embedding_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# B,T,C
&lt;/span&gt;    &lt;span class="c1"&gt;# loss = F.cross_entropy(logits, targets)
&lt;/span&gt;    &lt;span class="c1"&gt;# An error -&amp;gt; Pytorch expect the channel to be the second dim
&lt;/span&gt;    &lt;span class="c1"&gt;# Want B, T, C intead of B, C, T
&lt;/span&gt;    &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Stretch the 4x8 into a 1-dim of 32, preserve the channels
&lt;/span&gt;    &lt;span class="c1"&gt;# Do the same for the target (B,T)
&lt;/span&gt;    &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BigramLanguageModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;yb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(4.8786, grad_fn=&amp;lt;NllLossBackward0&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to know whether this is a good initial loss or not (remember the hockey stick). So let's consider the uniform case, which is when the model assigns equal probabilities to everything. In that case, you can do some arithmetic in your head: the log-loss would be &lt;code&gt;-ln(1/65)&lt;/code&gt;, which is approximately &lt;code&gt;4.17&lt;/code&gt;. So we're starting with a not-too-bad loss! &lt;/p&gt;
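
&lt;p&gt;If you don't trust your mental arithmetic, a one-liner confirms it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
print(-math.log(1/65))  # 4.174..., the expected loss for uniform guessing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;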

&lt;h2&gt;
  
  
  Sampling
&lt;/h2&gt;

&lt;p&gt;Now we will talk about our method for sampling from the model. A key difference here, though, is that we have an additional dimension for the &lt;em&gt;time steps&lt;/em&gt;, and also, we're processing the predictions &lt;em&gt;in parallel&lt;/em&gt;, so when we get a prediction, it's actually a bunch of predictions in different universes, and we will have to concat all of those things.&lt;/p&gt;

&lt;p&gt;Let's look at the code first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# idx is (B, T) array of indices in the current context
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# get the predictions
&lt;/span&gt;            &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;self&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;# focus only on the last time step
&lt;/span&gt;            &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="c1"&gt;# becomes (B, C)
&lt;/span&gt;            &lt;span class="c1"&gt;# apply softmax to get probabilities
&lt;/span&gt;            &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (B, C)
&lt;/span&gt;            &lt;span class="c1"&gt;# sample from the distribution
&lt;/span&gt;            &lt;span class="n"&gt;idx_next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (B, 1)
&lt;/span&gt;            &lt;span class="c1"&gt;# append sampled index to the running sequence
&lt;/span&gt;            &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx_next&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (B, T+1)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;long&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we have the &lt;code&gt;idx&lt;/code&gt;, which is simply the initial index; note that in the last line of code we pass in &lt;code&gt;idx = torch.zeros((1, 1))&lt;/code&gt;. This means that we have just one batch row, and we're starting with a single time step. The max number of new time steps is predetermined by &lt;code&gt;max_new_tokens&lt;/code&gt;, and we're actually starting with the index &lt;code&gt;0&lt;/code&gt;, which is, interestingly, the &lt;code&gt;newline&lt;/code&gt; character (like: given a new line, give me the full script).&lt;/p&gt;

&lt;p&gt;We perform a forward pass with our &lt;code&gt;idx&lt;/code&gt;, and you can see that we pass in just the &lt;code&gt;idx&lt;/code&gt; alone, no &lt;code&gt;targets&lt;/code&gt;. We have to handle that case in the &lt;code&gt;forward&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="c1"&gt;# idx and targets are both (B,T) tensor of integers
&lt;/span&gt;        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_embedding_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (B,T,C)
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
            &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, when taking the logits, remember that the logits give the probability of the next character at &lt;strong&gt;each given position&lt;/strong&gt;, but as we're marching through &lt;strong&gt;time&lt;/strong&gt;, we just need the prediction for the &lt;strong&gt;last time step&lt;/strong&gt;; that's why we have the &lt;code&gt;-1&lt;/code&gt; over there. You may find this approach rather ridiculous and funny, because every time we make a prediction, we just keep the last one and ignore all of the previous ones (which is clearly a waste of computation). And talking about the bigram: remember that the bigram only needs one character to predict another? So what's the point here?&lt;/p&gt;

&lt;p&gt;It turns out that the Transformer works just like that (remember, we're trying to replicate it). And the waste of computation is there in the Transformer too! But researchers dealt with that by introducing the &lt;strong&gt;KV Cache&lt;/strong&gt;, which does the trick. We won't talk about that here; it's beyond the scope of this blog.&lt;/p&gt;

&lt;p&gt;Next is the sampling and concat part, which is easy to understand. Note that in the last line of code we have the &lt;code&gt;[0]&lt;/code&gt;; it's there because the function returns a &lt;strong&gt;batch&lt;/strong&gt; of predictions, one row per batch element. And since we have just one row, indexing at zero is all we need.&lt;/p&gt;

&lt;p&gt;Let's take a look at our sample, shall we?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, a monkey could do better than that. But we haven't trained the model yet, so let's make it better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the Model with AdamW Optimizer
&lt;/h2&gt;

&lt;p&gt;I will spend a whole blog talking about this, but for now you just need to know that this is the thing that handles our learning rate and parameter updates, so that we can make the model stronger. Let me introduce this man Adam; he's great at optimizing models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create optimizer
&lt;/span&gt;
&lt;span class="n"&gt;optimizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AdamW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will change our implementation a bit, to fit with our Optimizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# increase number of steps for good results...
&lt;/span&gt;
    &lt;span class="c1"&gt;# sample a batch of data
&lt;/span&gt;    &lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# evaluate the loss
&lt;/span&gt;    &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;m&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_to_none&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might need to rerun this numerous times, and at the end, here's what we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.4956934452056885
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a significant improvement! Now let's look at the samples again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
As thee.
ARKis ve durre wiouro d wimys:
Thaueerrpanat thathe, s
WAnd Fithee.

Nobys
I ld mb, qun mod y wousa darketwave nghof IORitok has whodirate are atit G t hant m,
HAn
CELUSt y XENTha bu.
Wh blinouke sth thiviglecldewist, trveayokeanguror mepes ' wexfalle spprdswhyealaiate foulokiou:
BE an, dop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's developing structure! And that's good. Honestly, sometimes I think building this kind of thing is like having a baby: you get to watch him grow through time, through your dedication and a few lessons. I really feel like a responsible father right now!&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We're done with our model! Even though we know that the Bigram is weak and the result is not really beautiful, throughout this blog we got to know a lot of new notions: from the sequential and parallel processing of the model (which shares some resemblance with the RNN), to the Big Three &lt;code&gt;B,T,C&lt;/code&gt; and the crossover with CNNs. We also learned about the sampling method, and about creating an optimizer for our model. All in all, our first steps are good.&lt;/p&gt;

&lt;p&gt;Thanks for reading, and have a good day!&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>tutorial</category>
      <category>architecture</category>
      <category>beginners</category>
    </item>
    <item>
      <title>BATCHNORM IN LANGUAGE MODELS</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Mon, 24 Nov 2025 09:20:19 +0000</pubDate>
      <link>https://dev.to/blackbidz/batchnorm-in-language-models-539c</link>
      <guid>https://dev.to/blackbidz/batchnorm-in-language-models-539c</guid>
      <description>&lt;p&gt;Welcome to the next chapter! This is a really important topic for LLMs and even DL in general. Batchnorm, alongside with other innovative normalization techniques, is a must-know in Deep Learning. Maybe we should jump right in to the topic, as there is a lot to talk about.&lt;/p&gt;

&lt;p&gt;A note here, this is still the part from the Makemore series of Andrej Karpathy, so shout out to him with a big respect.&lt;/p&gt;

&lt;p&gt;Also, I assume that you already have the code for the previous model, because we won't code everything all over again or make huge modifications; we will just take the previous one and make some small changes. For the sake of convenience, I won't show you the previous code (it's in the previous chapters, I'm lazy), but rather the small parts of it that need to be modified.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problems with the Previous Model
&lt;/h2&gt;

&lt;p&gt;When scrutinizing a newly-invented technique, we should really think about the problems that it solves. Normalization has been linked with a wide range of problems regarding neural networks, so let's see &lt;em&gt;why people need to normalize things when playing with neural networks&lt;/em&gt;, or, as a similar question, &lt;em&gt;what are the problems if we let everything behave freely in the net?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Logits
&lt;/h3&gt;

&lt;p&gt;We are creating the &lt;strong&gt;weights&lt;/strong&gt; and &lt;strong&gt;biases&lt;/strong&gt; of the output layer like a total rookie in the neural net game, and it's just so bad. First let's look at our first iteration and see the loss that we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    0/ 200000: 27.8817
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That is TOO FAR away from the minimal achievable loss, and to make it more alarming, we should look at the graph of the loss over the iterations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuw0ljh7o5ui9ccn88ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuw0ljh7o5ui9ccn88ek.png" alt="Lossi"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look familiar? That is a hockey-stick graph, but for a loss curve it spells disaster: we have a way-too-high loss right from the beginning, and the model struggles for thousands of iterations just to get that unfathomable loss down. Can we prevent that? Right from the beginning?&lt;/p&gt;

&lt;p&gt;Well, actually we can bring a little bit of Maths into the initialization process, to put our model at ease in its very first step, and maybe even make it more efficient and arrive at a better loss. Researchers think like that too, and oftentimes, in the process of initializing the parameters, they have a kind of &lt;strong&gt;Expectation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What do we expect from the first step of the model? In the most naive way? Well, we may think that the model should treat each character &lt;strong&gt;equally&lt;/strong&gt;, right? Every character gets the same probability, and then we can drive some of them up or down to make the model stronger. As there are 27 of them, each character in the output layer is supposedly assigned a probability of &lt;code&gt;1/27&lt;/code&gt;. Now let's see what the loss is given that expectation, which is just the negative log of 1/27 (no magic number or 'aha' moment here, you can work it out in your head, it's fairly straightforward):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;-torch.tensor(1/27.0).log()
tensor(3.2958)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's a GREAT loss to start! So why did our model start so poorly? Think about it.&lt;/p&gt;

&lt;p&gt;It turns out that, in the initialization process, when we randomly sample from the Gaussian distribution, there are definitely cases where some characters are given &lt;em&gt;a higher value&lt;/em&gt; than others while they don't deserve to be. So it's not a good idea to let that &lt;strong&gt;bias&lt;/strong&gt; affect us in the first step; actually, we want everything to be &lt;strong&gt;equally judged&lt;/strong&gt;, which produces a smaller initial loss.&lt;/p&gt;

&lt;p&gt;Let's see a simple example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(tensor([0.2500, 0.2500, 0.2500, 0.2500]), tensor(1.3863))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's good, but what if we introduce some &lt;strong&gt;extreme values&lt;/strong&gt;?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(tensor([1.3888e-11, 3.0590e-07, 3.0590e-07, 1.0000e+00]), tensor(15.0000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The loss increases &lt;em&gt;significantly&lt;/em&gt;! So the problem here is the &lt;strong&gt;extreme values&lt;/strong&gt;, or in other words (we should really translate our language into math-like statements so that we can modify the code), we don't want some values to be &lt;strong&gt;too far away from the average&lt;/strong&gt;. To put it more precisely, we want the &lt;strong&gt;standard deviation&lt;/strong&gt; of these statistics to be really small, like 1 for example.&lt;/p&gt;

&lt;p&gt;Also, the &lt;strong&gt;mean&lt;/strong&gt; should be set around zero; it is useful later and we will discuss that. Actually, if you read enough ML/DL books or Statistics books, you might be familiar with the &lt;em&gt;Mean 0, Std 1&lt;/em&gt; kind of stuff. Well, for some reason, researchers love that thing, and it has a beautiful name: the &lt;strong&gt;Unit Gaussian Distribution&lt;/strong&gt;. That's exactly what we do when we normalize our data; it's really about making the data look like the &lt;strong&gt;Unit Gaussian&lt;/strong&gt;.&lt;/p&gt;
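
&lt;p&gt;As a quick sanity check (my own toy experiment, reusing the extreme logits from above), watch the loss fall towards the uniform baseline as we scale the logits towards zero:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

logits = torch.tensor([-10.0, 0.0, 0.0, 15.0])   # the extreme logits from before
for scale in [1.0, 0.1, 0.01]:
    probs = torch.softmax(logits * scale, dim=0)
    print(scale, -probs[2].log().item())         # ~15.0, then ~1.92, then ~1.40
# the uniform baseline is -log(1/4) ~ 1.386
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;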

&lt;p&gt;Now let's turn back to our code: what will we do with the weights and biases to make the values &lt;strong&gt;less extreme&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;First, the output &lt;code&gt;bias&lt;/code&gt; (&lt;code&gt;b2&lt;/code&gt;) should be set to zero, as we discussed earlier. Second, the output &lt;code&gt;weight&lt;/code&gt; (&lt;code&gt;W2&lt;/code&gt;) should be smaller, because we don't want some numbers to 'explode' after the matrix multiplication; we want to keep the logits as small as possible. To this end, we multiply the &lt;code&gt;W2&lt;/code&gt; matrix by &lt;code&gt;0.01&lt;/code&gt;, which makes sense.&lt;/p&gt;

&lt;p&gt;You may wonder: &lt;em&gt;Why don't we set the weight matrix to 0?&lt;/em&gt; Actually, even though we said that &lt;strong&gt;everything should be equal&lt;/strong&gt;, we won't do that in an extreme way. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It's not good to go extreme, my friend"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Researchers didn't say it that way, but that's what they mean when they put it like: &lt;em&gt;We won't set the matrices at initialization to all zeros because we want some &lt;strong&gt;entropy&lt;/strong&gt; in our loss.&lt;/em&gt; &lt;br&gt;
I won't go deep into this; just think of it as a kind of regularization on the other end.&lt;/p&gt;

&lt;p&gt;Now let's modify our code&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;# the dimensionality of the character embedding vectors
&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="c1"&gt;# the number of neurons in the hidden layer of the MLP
&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                        &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the &lt;code&gt;*0.01&lt;/code&gt; and the &lt;code&gt;*0&lt;/code&gt; added to the last two matrices (those of the output layer). Let's see our loss now:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      0/ 200000: 3.3221
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Perfect! Now look at the graph&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4ieqxyqsxgxzqwipot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j4ieqxyqsxgxzqwipot.png" alt="Graph loss"&gt;&lt;/a&gt;&lt;br&gt;
No hockey stick! And maybe our model even performs better, as it doesn't waste thousands of initial iterations. But we need to move on to another problem in our model.&lt;/p&gt;
&lt;h3&gt;
  
  
  Saturated Tanh &amp;amp; Vanishing Gradient
&lt;/h3&gt;

&lt;p&gt;Logits are not the only thing that went wrong in the process. Let's look at the hidden states, which are simply the &lt;code&gt;tanh&lt;/code&gt; of the pre-activations coming out of the first linear layer:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypagnrldrpf51eng0fcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypagnrldrpf51eng0fcc.png" alt="Tanh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Woah, what happened? The values are polarized into the two ends: &lt;code&gt;-1.00&lt;/code&gt; and &lt;code&gt;1.00&lt;/code&gt;. We know that something has gone wrong here, and like some brilliant detectives, we trace one step back, to the hidden states &lt;em&gt;before the tanh&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zzpagqbl1g9ns6p9x8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zzpagqbl1g9ns6p9x8i.png" alt="Tanh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And for the last piece of the picture, let me remind you of the &lt;code&gt;tanh()&lt;/code&gt; function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kd2dypr5gfrg746274s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kd2dypr5gfrg746274s.png" alt="Tanh"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tanh()&lt;/code&gt; is a &lt;strong&gt;squashing function&lt;/strong&gt;: it squashes and caps every input into the range between &lt;code&gt;-1.0&lt;/code&gt; and &lt;code&gt;1.0&lt;/code&gt;. Seeing the graph of the hidden states before activation, we know exactly what the problem is: before the activation there are large numbers, and those numbers enter the &lt;code&gt;tanh()&lt;/code&gt; and get stuck at one of the two ends. This phenomenon is called &lt;strong&gt;Saturation&lt;/strong&gt;, where values are mostly pushed towards the extreme ends of the function.&lt;/p&gt;

&lt;p&gt;And how does that affect our model exactly? Think about the &lt;em&gt;gradient&lt;/em&gt;. The gradient of &lt;code&gt;tanh()&lt;/code&gt; is actually &lt;code&gt;1-t^2&lt;/code&gt;, with &lt;code&gt;t&lt;/code&gt; being the output value of &lt;code&gt;tanh()&lt;/code&gt;, so for every output close to 1 or -1, the gradient will be close to zero! In fact, after backpropagating through many layers of neurons, the gradient can really &lt;strong&gt;shrink to zero&lt;/strong&gt;. That is the &lt;strong&gt;Vanishing Gradient&lt;/strong&gt; problem, and when no gradient is passed down, things get updated really slowly, driving down the efficiency of the model.&lt;/p&gt;
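
&lt;p&gt;Here's a minimal sketch of that effect (my own toy check, not from the lecture): we push a few values through &lt;code&gt;tanh()&lt;/code&gt; and let autograd hand back the gradients, which are exactly &lt;code&gt;1-t^2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x = torch.tensor([-5.0, -2.0, 0.0, 2.0, 5.0], requires_grad=True)
t = torch.tanh(x)
t.sum().backward()   # d/dx of tanh(x) is 1 - tanh(x)**2, element-wise
print(t.data)        # outputs:   ~[-1.0, -0.96, 0.0, 0.96, 1.0]
print(x.grad)        # gradients: ~[0.0002, 0.07, 1.0, 0.07, 0.0002]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The saturated inputs at the ends barely pass any gradient back; that's exactly the "dead neuron" situation we are about to visualize.&lt;/p&gt;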

&lt;p&gt;Now let's see what I mean here; just look at the proportion of the hidden states that have an absolute value larger than &lt;code&gt;0.99&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;interpolation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nearest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcuo46nn89dfhqnhclerg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcuo46nn89dfhqnhclerg.png" alt="Hidden states"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The white part is where the gradients zero out and the neurons are dead. You can see that our model is largely saturated.&lt;/p&gt;

&lt;p&gt;So far we've been chilling with our model because it is just a small one. But oftentimes we will have to deal with neural nets with tons of layers, and the problem can accumulate really fast.&lt;/p&gt;

&lt;p&gt;Now let's fix this: We should bring everything close to zero, and make the variance small so that the &lt;code&gt;tanh()&lt;/code&gt; does not push anything to the ends (you can see why this is a good choice by again looking at the graph of the &lt;code&gt;tanh()&lt;/code&gt;). In doing so, we're &lt;strong&gt;normalizing&lt;/strong&gt; the hidden states before activation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="c1"&gt;# the dimensionality of the character embedding vectors
&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="c1"&gt;# the number of neurons in the hidden layer of the MLP
&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;            &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                        &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocab_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nelement&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# number of parameters in total
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That is the same as what we did last time, but in the process we scale &lt;code&gt;W1&lt;/code&gt; down as well, and we multiply &lt;code&gt;b1&lt;/code&gt; by &lt;code&gt;0.01&lt;/code&gt;, not &lt;code&gt;0&lt;/code&gt;; this is, again, an attempt to "&lt;em&gt;introduce some entropy to the model&lt;/em&gt;". Now let's run and see our hidden states, after and before activation respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0umvpsh80o9b1eofc3h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0umvpsh80o9b1eofc3h.png" alt="h"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn8it5e4shc50tcdkwoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn8it5e4shc50tcdkwoy.png" alt="hpreact"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# And we should also see if there are some dead neurons
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mf"&gt;0.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;interpolation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nearest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2a1jxu3iszz3nw8y7zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2a1jxu3iszz3nw8y7zj.png" alt="h.abs()"&gt;&lt;/a&gt;&lt;br&gt;
Everything looks great! The &lt;code&gt;tanh()&lt;/code&gt; doesn't behave badly like before, as we successfully normalized its input, and we don't see any saturated values in the last plot (actually, we will have to bump the weight's scalar up a little bit so that a few neurons come close to saturation; another attempt to prevent going too extreme).&lt;/p&gt;

&lt;p&gt;As for the loss, we don't expect much here, because we're just dealing with a fairly small MLP.&lt;/p&gt;
&lt;h2&gt;
  
  
  Initialization
&lt;/h2&gt;
&lt;h3&gt;
  
  
  A small example
&lt;/h3&gt;

&lt;p&gt;You might question me: &lt;em&gt;"How do you know those magic 0.01 scalars that are multiplied with the weights?"&lt;/em&gt; Actually that is just a random small value (and yet it worked!), but there is a lot of research around this, really interesting stuff.&lt;/p&gt;

&lt;p&gt;Let's consider a small set of data:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;121&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;122&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(-0.0027) tensor(1.0000)
tensor(0.0031) tensor(3.1318)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5txv309sthr1foe780p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5txv309sthr1foe780p1.png" alt="Illustration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that while &lt;code&gt;x&lt;/code&gt; is Unit Gaussian, everything gets messed up after the multiplication with &lt;code&gt;w&lt;/code&gt;. Specifically, the &lt;em&gt;mean&lt;/em&gt; is good, it's around zero, but the &lt;em&gt;standard deviation&lt;/em&gt; spells disaster.&lt;/p&gt;

&lt;p&gt;Remember, our goal is to make everything "Unit-Gaussian-like", so the real task here is to &lt;em&gt;find the value of the matrix &lt;code&gt;w&lt;/code&gt; such that the Unit Gaussian is preserved after matrix multiplication&lt;/em&gt;. Now let's try this thing: divide the weight by &lt;code&gt;sqrt(fan_in)&lt;/code&gt;, where the &lt;em&gt;fan_in&lt;/em&gt;, by convention, is the number of inputs.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Scale the weights by 1/sqrt(fan_in)
&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;  
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhqeh91c688y7kcvgq74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhqeh91c688y7kcvgq74.png" alt="After dividing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output looks beautiful! But how? &lt;/p&gt;
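
&lt;p&gt;Before the official answer, here is the back-of-the-envelope reasoning (my own sketch of the standard argument): each output element is &lt;code&gt;y = x1*w1 + ... + x10*w10&lt;/code&gt;, a sum of &lt;code&gt;fan_in&lt;/code&gt; independent zero-mean terms, so &lt;code&gt;var(y) = fan_in * var(w)&lt;/code&gt; when &lt;code&gt;x&lt;/code&gt; is unit Gaussian. With &lt;code&gt;w&lt;/code&gt; drawn from N(0,1), the std of &lt;code&gt;y&lt;/code&gt; is &lt;code&gt;sqrt(10) ~ 3.16&lt;/code&gt;, close to the &lt;code&gt;3.1318&lt;/code&gt; we measured; scaling &lt;code&gt;w&lt;/code&gt; by &lt;code&gt;1/sqrt(10)&lt;/code&gt; brings it back to 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

fan_in = 10
x = torch.randn(10000, fan_in)            # unit Gaussian input
for scale in [1.0, fan_in ** -0.5]:
    w = torch.randn(fan_in, 200) * scale
    print(scale, (x @ w).std().item())    # ~3.16 for scale 1, ~1.0 after scaling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
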
&lt;h3&gt;
  
  
  Kaiming_Init
&lt;/h3&gt;

&lt;p&gt;One of the proposed initialization strategies is the Kaiming Initialization, or the He Initialization. It is also available in the PyTorch library as &lt;code&gt;torch.nn.init.kaiming_normal_&lt;/code&gt;. This method has a &lt;code&gt;nonlinearity&lt;/code&gt; argument, which specifies the activation function that we would like to use. Note that each activation function has its own &lt;strong&gt;gain&lt;/strong&gt;, which is simply a scalar to combat the "squish" that the activation function causes. The gain for our &lt;code&gt;tanh()&lt;/code&gt; is approximately 5/3. Also, there is an argument called &lt;code&gt;mode&lt;/code&gt;: we can pass in either &lt;code&gt;'fan_in'&lt;/code&gt; or &lt;code&gt;'fan_out'&lt;/code&gt;, depending on the flow, forward or backward. In our case we use &lt;code&gt;mode = 'fan_in'&lt;/code&gt;, which is the default value. The standard deviation is calculated following the formula &lt;code&gt;std = gain / sqrt(fan)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can see the whole paper &lt;a href="https://arxiv.org/abs/1502.01852" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let's change our &lt;code&gt;W1&lt;/code&gt; matrix:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_embd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
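
&lt;p&gt;If you'd rather lean on the library, this should be roughly equivalent (a sketch under the shapes of our model; I'm hard-coding &lt;code&gt;block_size = 3&lt;/code&gt; here just for the demo):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

n_embd, block_size, n_hidden = 10, 3, 200   # assumed shapes, matching the model above

W1 = torch.empty(n_embd * block_size, n_hidden)
torch.nn.init.kaiming_normal_(W1, mode='fan_in', nonlinearity='tanh')
print(W1.std().item())   # ~0.304, i.e. (5/3) / (n_embd * block_size)**0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;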


&lt;p&gt;And that's basically it! We solved some of the problems regarding neural nets. Note that all of these are just tweaks that scale down the weights and even zero out the biases, in order to make the data more &lt;strong&gt;Unit Gaussian&lt;/strong&gt;, which is crucial to maintain the &lt;strong&gt;stability&lt;/strong&gt; of the model.&lt;/p&gt;
&lt;h2&gt;
  
  
  Batch Normalization
&lt;/h2&gt;

&lt;p&gt;Now let's talk about today's topic, shall we? We went such a long way to get here. Actually, this is one of the modern innovations that solves all of our previous problems, and the idea behind it is fairly simple. &lt;/p&gt;

&lt;p&gt;If we want the hidden states to be Unit Gaussian, why bother with initializing and tweaking the weight and bias matrices? Why don't we just &lt;strong&gt;SIMPLY NORMALIZE&lt;/strong&gt; them? And that's how a brilliant idea came into place.&lt;/p&gt;

&lt;p&gt;You can see the paper &lt;a href="https://arxiv.org/abs/1502.03167" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And this is the implementation of it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel5e5yra1thtw56ipacj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fel5e5yra1thtw56ipacj.png" alt="BatchNorm"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's just the procedure of normalization: subtract the mean, and divide by the standard deviation (which is the square root of the variance). Note that in the dividing part (line 3), we have the term &lt;strong&gt;epsilon&lt;/strong&gt; in the denominator: that term is added to prevent the case of zero variance, in which we would have to divide by zero. Usually we have &lt;code&gt;epsilon = 1e-5&lt;/code&gt;.&lt;/p&gt;
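
&lt;p&gt;Here is a minimal sketch of those normalization lines on a fake batch (my own toy example, without the scale-and-shift yet; that part comes next):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

eps = 1e-5
hpreact = torch.randn(32, 200) * 5.0 + 3.0       # a messy batch: mean ~3, std ~5
mean = hpreact.mean(0, keepdim=True)             # per-neuron mean over the batch
var = hpreact.var(0, keepdim=True)               # per-neuron variance over the batch
hnorm = (hpreact - mean) / torch.sqrt(var + eps)
print(hnorm.mean().item(), hnorm.std().item())   # ~0.0 and ~1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;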

&lt;p&gt;Another important thing to note here is the last line: we have two parameters called &lt;strong&gt;gamma&lt;/strong&gt; and &lt;strong&gt;beta&lt;/strong&gt;, which are there to &lt;em&gt;"scale and shift"&lt;/em&gt;. What are they for?&lt;/p&gt;

&lt;p&gt;Well, we don't really want our data to be &lt;em&gt;exactly Gaussian&lt;/em&gt;; we would like the neural net to be able to move it around a bit, to make it more diffuse. Those parameters are learned during training, so the model decides how to move the data around. Maybe it's just some kind of flexibility added to the model.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We call those parameters "gain" and "bias", "gain"
# is like the "weight", but it's not a matrix
# In fact, both are 1-dim row vectors
&lt;/span&gt;
&lt;span class="c1"&gt;# In the initialize part, we expect to keep things unchanged
# So everything is multiplied by 1 and added with zero
&lt;/span&gt;
&lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;bnbias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Also, remember to include these in the &lt;code&gt;parameters&lt;/code&gt;, as they are trainable (notice that &lt;code&gt;b1&lt;/code&gt; is left out: the mean subtraction in Batch Norm cancels any per-neuron constant, so &lt;code&gt;bnbias&lt;/code&gt; takes over its job):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Include in the parameters
&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's modify our model a bit: Before applying the &lt;code&gt;tanh()&lt;/code&gt; function, we normalize the hidden states:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# same optimization as last time
&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200000&lt;/span&gt;
&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="n"&gt;lossi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="c1"&gt;# minibatch construct
&lt;/span&gt;  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Xb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Yb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Ytr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# batch X,Y
&lt;/span&gt;
  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xb&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# embed the characters into vectors
&lt;/span&gt;  &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# concatenate the vectors
&lt;/span&gt;  &lt;span class="c1"&gt;# Linear layer
&lt;/span&gt;  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="c1"&gt;# hidden layer pre-activation
&lt;/span&gt;  &lt;span class="c1"&gt;# Batch Norm
&lt;/span&gt;  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;
  &lt;span class="c1"&gt;# Non-linearity
&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# hidden layer
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# output layer
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Yb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# loss function
&lt;/span&gt;
  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="c1"&gt;# step learning rate decay
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;

  &lt;span class="c1"&gt;# track stats
&lt;/span&gt;  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# print every once in a while
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log10&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the &lt;code&gt;Batch Norm&lt;/code&gt; layer sitting between the &lt;code&gt;Linear&lt;/code&gt; and the &lt;code&gt;Non-linearity&lt;/code&gt; layers.&lt;/p&gt;

&lt;p&gt;Batch Norm must also be applied on the &lt;strong&gt;test set&lt;/strong&gt;, i.e., in evaluation mode.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@torch.no_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ytr&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xdev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ydev&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Xte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Yte&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}[&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (N, block_size, n_embd)
&lt;/span&gt;  &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# concat into (N, block_size * n_embd)
&lt;/span&gt;  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;  &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;

  &lt;span class="c1"&gt;# Adding the BatchNorm layer here
&lt;/span&gt;  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;
  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (N, n_hidden)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (N, vocab_size)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nf"&gt;split_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;split_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In fact, when a network has many hidden layers, we should sprinkle BatchNorm across the whole net, right after each linear layer, as sketched below. &lt;/p&gt;
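
&lt;p&gt;As a rough sketch of that pattern (my own toy example using PyTorch's built-in modules, with made-up layer sizes, not code from the video):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn as nn

# A hypothetical deeper MLP: every Linear is followed by BatchNorm1d,
# and only then by the tanh non-linearity.
model = nn.Sequential(
    nn.Linear(30, 200), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 200), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 200), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 27),  # output layer, no BatchNorm needed here
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;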

&lt;p&gt;So far we've been praising this technique a lot. But does it live up to the hype?&lt;/p&gt;
&lt;h3&gt;
  
  
  Some issues with BatchNorm
&lt;/h3&gt;

&lt;p&gt;First, just look at the name of the technique: &lt;em&gt;BatchNorm&lt;/em&gt;. Yes, we normalize our data, but we do it &lt;strong&gt;in batches&lt;/strong&gt;. What is the problem with that?&lt;/p&gt;

&lt;p&gt;Our batches are generated in a &lt;strong&gt;purely random&lt;/strong&gt; way, and we pass our data through in &lt;em&gt;groups&lt;/em&gt;. In our previous model this was fine: we never did anything with the &lt;em&gt;group an instance belongs to&lt;/em&gt;, but rather treated each one &lt;strong&gt;independently&lt;/strong&gt;, so even though we were dealing with batches, the examples stayed independent. But look at what BatchNorm does: we normalize the data &lt;em&gt;with respect to their batches&lt;/em&gt;, so each example's activations become &lt;strong&gt;dependent&lt;/strong&gt; on the other examples in the same group.&lt;/p&gt;

&lt;p&gt;That is rather ugly: it is unnatural to couple examples into groups like that, and we're even generating the groups randomly. But, big moment here, it turns out that in practice this coupling acts as &lt;strong&gt;a regularization technique&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because each example is normalized with the statistics of a randomly drawn batch, its activations get jittered a little every time it is seen, which acts as a kind of &lt;strong&gt;noise&lt;/strong&gt;. That makes it harder for the model to memorize individual examples and pushes it to learn robust features instead, and that means regularization. (Actually I'm not really clear about this, so you can search on Google or ask ChatGPT for a better explanation.)&lt;/p&gt;

&lt;p&gt;Over time, numerous normalization techniques (like LayerNorm) have been developed to replace BatchNorm, mainly because people dislike this batch-coupling quirk. But let's stick with this amazing normalization technique for this blog post.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problems with Inference/Evaluation
&lt;/h3&gt;

&lt;p&gt;This is a hard-to-grasp concept, and it took me hours to understand. I will do my best to give you a good sense of what this is.&lt;/p&gt;

&lt;p&gt;Remember how in test mode we still used &lt;code&gt;hpreact.mean&lt;/code&gt; and &lt;code&gt;hpreact.std&lt;/code&gt;? But what should our model be doing when making predictions? It should predict based on &lt;strong&gt;only that input, independently&lt;/strong&gt;, right? So again, what our model is doing is rather unnatural and strange: somehow we're taking the &lt;strong&gt;batch&lt;/strong&gt; into consideration when predicting a single instance.&lt;/p&gt;

&lt;p&gt;So, in order to preserve the &lt;em&gt;independence&lt;/em&gt; of the data when evaluating, we should use a &lt;strong&gt;population mean and std&lt;/strong&gt; to normalize our data. One way to implement this is to wait until the end of training and run the whole training set through the network once more, this time computing the mean and std over the entire population.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# calibrate the batch norm at the end of training
&lt;/span&gt;
&lt;span class="c1"&gt;# Remember to use torch.no_grad because we won't backpropagate through this during training, this will save a lot of memory
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="c1"&gt;# pass the training set through
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embcat&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="c1"&gt;# + b1
&lt;/span&gt;  &lt;span class="c1"&gt;# measure the mean/std over the entire training set
&lt;/span&gt;  &lt;span class="n"&gt;bnmean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;bnstd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And then in the evaluation mode, we replace the mean and std of the batch with the population mean and std:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;bnmean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;bnstd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That would solve the problem!&lt;/p&gt;

&lt;p&gt;But researchers would not settle for that: running the whole dataset through the network again just to find the mean and std is laborious. In practice there is a nicer way to &lt;em&gt;approximate&lt;/em&gt; those values, right during training.&lt;/p&gt;

&lt;p&gt;The technique is to keep an &lt;strong&gt;exponential moving average&lt;/strong&gt; (EMA) throughout training. We update the running mean and std as each batch arrives, and the strength of each update is determined by the hyperparameter &lt;strong&gt;momentum&lt;/strong&gt;, the &lt;strong&gt;beta&lt;/strong&gt; in the formulas below:&lt;/p&gt;

&lt;div class="katex-element"&gt;
\mu_{\text{running}} = \beta \cdot \mu_{\text{running}} + (1-\beta) \cdot \mu_B
&lt;/div&gt;



&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;σrunning2=β.σrunning2+(1−β).σB2
\sigma_{\text{running}}^2 = \beta.\sigma_{\text{running}}^2 + (1-\beta).\sigma_B^2
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;running&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mord"&gt;.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;running&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;.&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
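
&lt;p&gt;As a quick sanity check (my own toy snippet, not from the video), here's the EMA of per-batch means converging toward the true population mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

torch.manual_seed(42)
data = torch.randn(100000) + 3.0   # population mean is roughly 3.0

beta = 0.99                        # the momentum hyperparameter
mu_running = torch.tensor(0.0)
for batch in data.split(32):       # a stream of small batches
    mu_running = beta * mu_running + (1 - beta) * batch.mean()

print(mu_running)                  # close to 3.0 after ~3000 updates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a momentum as high as 0.999 (what we use below), the running estimate moves very slowly, which is fine because our training loop runs for many thousands of steps.&lt;/p&gt;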



&lt;h3&gt;
  
  
  Implementing the Running Mean and Standard Deviation
&lt;/h3&gt;

&lt;p&gt;Now let's turn back to our code for the (nearly) last optimization. When training our model, we have two jobs to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We need to calculate the mean and std for each batch, and then normalize the values just like we did when implementing BatchNorm; this is for training.&lt;/li&gt;
&lt;li&gt;We also have to update two values during this process, which are later used in testing (or inference): &lt;code&gt;bnmean_running&lt;/code&gt; and &lt;code&gt;bnstd_running&lt;/code&gt;, calculated by the &lt;em&gt;exponential moving average&lt;/em&gt;. Remember that these are not parameters, so update them under &lt;code&gt;torch.no_grad()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With everything prepared, let's jump right into the code. First we initialize the running mean and std.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize the running mean and std
&lt;/span&gt;&lt;span class="n"&gt;bnmean_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;bnstd_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_hidden&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's change the BatchNorm layer a bit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BatchNorm layer
&lt;/span&gt;  &lt;span class="c1"&gt;# -------------------------------------------------------------
&lt;/span&gt;  &lt;span class="c1"&gt;# This is for normalization
&lt;/span&gt;  &lt;span class="n"&gt;bnmeani&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;bnstdi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;bnmeani&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;bnstdi&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;

 &lt;span class="c1"&gt;# This is later used for testing, notice the torch.no_grad()
&lt;/span&gt;  &lt;span class="c1"&gt;# We set the momentum = 0.999
&lt;/span&gt;  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;bnmean_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.999&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;bnmean_running&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;bnmeani&lt;/span&gt;
    &lt;span class="n"&gt;bnstd_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.999&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;bnstd_running&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;bnstdi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we plug the running mean and std into our evaluation code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@torch.no_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# this decorator disables gradient tracking
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;split_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bngain&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;bnmean_running&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;bnstd_running&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bnbias&lt;/span&gt;
  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hpreact&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (N, n_hidden)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (N, vocab_size)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's basically it! We can compare our approximation of the mean (the std behaves the same way) against the exact &lt;code&gt;bnmean&lt;/code&gt; we calculated earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;bnmean_running
tensor([[-2.3338,  0.6988, -0.9011,  0.9966,  1.0906,  1.0759,  1.7426, -2.1253,...
&amp;gt;&amp;gt;&amp;gt;bnmean
tensor([[-2.3145,  0.6885, -0.9134,  0.9972,  1.0878,  1.0841,  1.7470, -2.1102,...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The approximations were really good! And let's see our final training and validation loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train 2.0666308403015137
val 2.1051523685455322
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great!&lt;/p&gt;

&lt;h3&gt;
  
  
  Removing the Bias
&lt;/h3&gt;

&lt;p&gt;One small thing to note here, maybe the last thing, is that we can actually &lt;strong&gt;remove the bias&lt;/strong&gt; if we're going to apply BatchNorm. This is simply because normalization doesn't care about our data being offset by a constant: we subtract the mean from everything anyway. (Think of the bias as something that shifts the distribution left and right, which is meaningless under normalization because we re-center the distribution around zero anyway. A quick numerical check follows below.)&lt;/p&gt;
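
&lt;p&gt;Here's that check (again my own snippet, not from the video): a bias added before BatchNorm is cancelled exactly by the mean subtraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

torch.manual_seed(0)
x = torch.randn(32, 100)      # a batch of pre-activations
b = torch.randn(100) * 5.0    # an arbitrary (large) bias

def batchnorm(h):
    return (h - h.mean(0, keepdim=True)) / h.std(0, keepdim=True)

# The bias shifts every column by a constant, and the mean subtraction
# removes that constant exactly, so both versions normalize identically.
print(torch.allclose(batchnorm(x), batchnorm(x + b), atol=1e-6))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;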

&lt;p&gt;So let's just comment out the initialization of &lt;code&gt;b1&lt;/code&gt;, which saves a bit of memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#b1 = torch.randn(n_hidden, generator=g) * 0.01
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you scrutinize some advanced models that use BatchNorm, like ResNet, you will notice that the layers feeding into a BatchNorm are created with &lt;code&gt;bias = False&lt;/code&gt;.&lt;/p&gt;
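
&lt;p&gt;With PyTorch's built-in modules the pattern looks roughly like this (a generic sketch, not ResNet's actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn as nn

# The bias is omitted because the BatchNorm right after it would cancel
# it anyway; the BatchNorm's own shift (our bnbias) plays that role.
block = nn.Sequential(
    nn.Linear(100, 200, bias=False),
    nn.BatchNorm1d(200),
    nn.Tanh(),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;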

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;When I first learned about BatchNorm in the book, I didn't even know why we needed to normalize anything, and I was so confused at the time. But now, actually implementing a model of my own (well, it's from Andrej, sorry if that bothers you), I realize that BatchNorm is a really significant milestone that solves a lot of problems in neural nets, and those problems are easy to notice once you hit them.&lt;/p&gt;

&lt;p&gt;Today we fixed a lot of bugs in our model and brought the holy BatchNorm to light: we went from problems with initialization and the vanishing gradient to BatchNorm and its implementation, its pros and cons, the EMA trick, removing the bias, and more.&lt;/p&gt;

&lt;p&gt;Well, another productive day for us! Thanks for reading and have a good day. &lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LANGUAGE MODELS USING MLP (Part 2)</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Thu, 20 Nov 2025 22:21:42 +0000</pubDate>
      <link>https://dev.to/blackbidz/language-models-using-mlp-part-2-g8a</link>
      <guid>https://dev.to/blackbidz/language-models-using-mlp-part-2-g8a</guid>
      <description>&lt;p&gt;Welcome back! Today we will train our multi-layer perceptron, as well as exploring some techniques to fine-tune the model. Remember the terrifying loss that we have in our previous chapter? Keep that in mind, we will try our best to minimize that loss in this blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  CROSS-ENTROPY
&lt;/h2&gt;

&lt;p&gt;We first introduce a new loss function called &lt;code&gt;cross_entropy&lt;/code&gt;. Let's see what it actually computes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(14.3920)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is the same as our negative log likelihood! In fact, &lt;code&gt;cross_entropy&lt;/code&gt; is just a compact implementation of our loss. Rather than hand-coding all of the intermediate steps, we wrap everything up in one line of code after calculating the logits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculating the logits
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;

&lt;span class="c1"&gt;# Our previous code
&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keepdims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# New version
&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function &lt;code&gt;cross_entropy&lt;/code&gt; actually offers benefits beyond convenience:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;By eliminating the intermediate steps, we also &lt;strong&gt;save a lot of memory&lt;/strong&gt;. In our previous code we created several intermediate tensors, storing the &lt;code&gt;counts&lt;/code&gt;, then the &lt;code&gt;prob&lt;/code&gt;, and eventually the &lt;code&gt;loss&lt;/code&gt;, which is itself a waste of memory. The &lt;code&gt;cross_entropy&lt;/code&gt; function, however, fuses these steps (at least that's what I was told), so there's a big memory saving.&lt;/li&gt;
&lt;li&gt;The function also allows for a more &lt;strong&gt;efficient backward pass&lt;/strong&gt;. We don't have to backpropagate through numerous intermediate operations like in our previous code; instead, we backpropagate through a single fused computation, which is faster.&lt;/li&gt;
&lt;li&gt;It is more &lt;strong&gt;numerically well-behaved&lt;/strong&gt;. This is the point we really need to dive into, so let's spend a few minutes on what &lt;em&gt;numerically well-behaved&lt;/em&gt; means.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's consider the softmax probabilities of a sample tensor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([1.5123e-08, 3.3311e-04, 6.6906e-03, 9.9298e-01])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far so good, no problems here. But let's see what happens when we have some &lt;strong&gt;large positive values&lt;/strong&gt; in our tensor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([0., 0., 0., nan])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Something is wrong! Scrutinizing the counts further, we find that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;counts
tensor([5.6028e-09, 1.2341e-04, 2.4788e-03,        inf])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The count for the value &lt;code&gt;100&lt;/code&gt; goes to infinity! That is understandable, because we took the &lt;em&gt;exponential&lt;/em&gt; of each value, so huge positive numbers tend to blow up, which is called &lt;strong&gt;numerical overflow&lt;/strong&gt; (note that exp behaves well with negative numbers, which simply underflow toward zero). &lt;/p&gt;

&lt;p&gt;We don't want that bug in our project, so we need a way to &lt;strong&gt;scale down&lt;/strong&gt; all of the values. This relies on a subtle property of the softmax: &lt;em&gt;the output doesn't change when we offset our logits by a constant value&lt;/em&gt;. Take a moment to think about why: exp(x - c) = exp(x) / exp(c), so the constant factor exp(c) appears in both the numerator and the denominator and cancels out. &lt;/p&gt;

&lt;p&gt;With that in mind, the simplest fix is to &lt;strong&gt;subtract the maximum value&lt;/strong&gt; of the whole tensor. From then on, every entry is at most zero, and exp is happy to deal with that. &lt;/p&gt;
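
&lt;p&gt;Here's a small demonstration of that trick on the tensor that blew up above (my own snippet):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

logits = torch.tensor([-13., -3., 0., 100.]) - 6

def softmax(x):
    counts = x.exp()
    return counts / counts.sum()

# Subtracting the max changes nothing mathematically, but it keeps
# exp() from overflowing: all inputs are now at most zero.
print(softmax(logits - logits.max()))  # roughly [0, 0, 0, 1], no nan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;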

&lt;p&gt;Turning back to &lt;code&gt;cross_entropy&lt;/code&gt;: the max-subtraction trick above is built into the function itself, so it handles this for us automatically.&lt;/p&gt;

&lt;p&gt;Now we shall turn back to our project. &lt;/p&gt;

&lt;h2&gt;
  
  
  Putting Things Together
&lt;/h2&gt;

&lt;p&gt;While implementing our project, we initialized tensors and functions in a stream-of-thought fashion, so our code is rather messy and unorganized. In practice, organizing the project is crucial for &lt;em&gt;readability&lt;/em&gt; and &lt;em&gt;debugging&lt;/em&gt; later on. Moreover, when we want to modify our model, we will know exactly where to go. So if you're working on your own project, don't be lazy: take time to put things together, as it will be immensely useful later on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initializing all the parameters
&lt;/h3&gt;

&lt;p&gt;We put all of the matrices we have to initialize in a single block, and we also fix the &lt;code&gt;seed&lt;/code&gt; of the Generator so that our results are reproducible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we shall build our model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build the model
&lt;/h3&gt;

&lt;p&gt;First, remember to set &lt;code&gt;requires_grad = True&lt;/code&gt; so that we can run our backward pass.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we build the training loop with the forward and backward passes, and we also update the parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Run the code
&lt;/h3&gt;

&lt;p&gt;So now we shall run the code and see what happens. Here are the last few results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.26593196392059326
0.26574623584747314
0.265565425157547
0.26538926362991333
0.2652176320552826
0.2650502920150757
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Astonishing? No. Remember that the data is tiny (only 32 context windows), so when the model arrives at an extremely small loss, we shouldn't praise it; we should suspect that the model is &lt;strong&gt;overfitting&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Overfitting means the model tries to memorize the data rather than capture the underlying patterns. So even though the model is extremely good on its own training set, it will perform poorly on real-world data, where it sees things it hasn't learnt before and can do nothing with them, since everything it has done so far is pure memorization (imagine a student who memorizes the homework answers and then faces new questions on the test).&lt;/p&gt;

&lt;p&gt;But why doesn't the loss converge to 0? If the model can memorize everything in the training set, why doesn't it reach 100% accuracy? Normally, overfitting this hard would drive the loss to zero. However, in this particular project, there is a subtle detail that prevents our network from guessing perfectly.&lt;/p&gt;

&lt;p&gt;Let's look at our data again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... ---&amp;gt; e
..e ---&amp;gt; m
.em ---&amp;gt; m
emm ---&amp;gt; a
mma ---&amp;gt; .
... ---&amp;gt; o
..o ---&amp;gt; l
.ol ---&amp;gt; i
oli ---&amp;gt; v
liv ---&amp;gt; i
ivi ---&amp;gt; a
via ---&amp;gt; .
... ---&amp;gt; a
..a ---&amp;gt; v
.av ---&amp;gt; a
ava ---&amp;gt; .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Can you notice the key here? Take a guess.&lt;/p&gt;

&lt;p&gt;So here's the answer: look at the instance &lt;code&gt;...&lt;/code&gt;, denoting the start of a word. You simply cannot guess the character that comes after it! Try it yourself: the same context &lt;code&gt;...&lt;/code&gt; maps to different targets (e, o, a above), so the best the model can do is output a probability distribution over them, which keeps the loss above zero. That's where our model struggled. &lt;/p&gt;

&lt;h3&gt;
  
  
  Working with the whole dataset
&lt;/h3&gt;

&lt;p&gt;So far we've been dealing with a mere handful of words from the dataset, so let's go big this time.&lt;/p&gt;

&lt;p&gt;From the code that constructs our dataset, we should remove the &lt;code&gt;words[:5]&lt;/code&gt; and pass in everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# rebuild the dataset
&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Now we iterate through all the words
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#print(''.join(itos[i] for i in context), '---&amp;gt;', itos[ix])
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# crop and append
&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's look at the size of our data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([228146, 3]) torch.Size([228146])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's huge. &lt;/p&gt;

&lt;p&gt;Also, when calculating the loss, make sure to remove the number &lt;code&gt;32&lt;/code&gt; and replace it with the appropriate dimension. That's actually why we always try to avoid hard-coding numbers: it saves a lot of time otherwise spent going around and fixing minor bugs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's run our model and see what happens. I recommend setting the number of iterations to 10; we will discuss why shortly.&lt;br&gt;
Here are the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;19.505226135253906
17.08449363708496
15.776531219482422
14.833340644836426
14.002603530883789
13.253260612487793
12.57991886138916
11.983101844787598
11.47049331665039
11.051856994628906
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When running the code, it turns out that the model becomes slower, and that's understandable because it has to calculate the gradient for the &lt;strong&gt;whole dataset of roughly 228,000 instances&lt;/strong&gt;; that's really a big deal. And we're just iterating 10 times, so this approach is really questionable at scale.&lt;/p&gt;

&lt;p&gt;A clever way to combat this is passing the input in &lt;strong&gt;mini batches&lt;/strong&gt;, rather than the whole batch of data. So we're dealing with a group of instances at a time, not every single one. The gradient this gives us is, surely, not as accurate as the full-batch one, but it is far cheaper to compute and puts less strain on the computer, and it is actually what's used in practice. In other words, it's much better to approximate the gradient and take many more steps than to compute the exact gradient and take fewer steps. &lt;/p&gt;

&lt;p&gt;So let us construct the mini-batch: We will randomly generate a tensor of 32 indices ranging from zero to our number of training instances. Here's how it goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([200670, 191458, 142413, 156993, 217108, 174176, 143298,  30653, 148878,
        158381,  11828,  75183, 115824,  49455,  91737, 216958, 142564, 224086,
         73948, 217108, 174951, 170926, 180371, 224631, 167595, 173195, 116182,
        192239, 158702,  43879,  45633, 165950])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That looks tasty. Let's include that in our code, and modify our model a little bit so that it takes exactly 32 examples, no more, no less.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# mini batch construct
&lt;/span&gt;  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the last few losses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.3569700717926025
3.668323516845703
3.5134356021881104
3.7804458141326904
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Promising! And it takes just a few seconds to arrive at that loss. &lt;/p&gt;

&lt;h2&gt;
  
  
  Learning Rate
&lt;/h2&gt;

&lt;p&gt;Let's look at the implementation again: there's actually one thing that we haven't talked about yet. It's the number &lt;code&gt;0.1&lt;/code&gt; we multiply by when we update the parameters. It is called the &lt;strong&gt;learning rate&lt;/strong&gt;, and we will spend this section discussing it.&lt;/p&gt;

&lt;p&gt;The learning rate, as you can easily notice by yourself, is simply a number that controls the &lt;em&gt;length&lt;/em&gt; of our step. If the learning rate is too low, which means that we take small steps, then it will take a huge number of iterations to converge to the minimum. By contrast, if the learning rate is too high, indicating a big step size, then we will potentially "slip" past the minimum and end up bouncing back and forth, not converging but rather &lt;em&gt;diverging&lt;/em&gt;. If it is hard for you to visualize, here's a cool visualization that might help:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdir3bb7f85nm23nqs2j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdir3bb7f85nm23nqs2j0.png" alt="Learning rate" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;*Note: Actually the image is from a really dedicated blog post about learning rates (100x better than mine); I recommend reading it, as it dives much deeper into the topic. &lt;a href="https://www.jeremyjordan.me/nn-learning-rate/" rel="noopener noreferrer"&gt;Here's the link.&lt;/a&gt;&lt;/p&gt;
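
&lt;p&gt;To make this concrete, here's a tiny toy demo (my own illustration, not from the video): gradient descent on the function &lt;code&gt;f(x) = x**2&lt;/code&gt;, whose gradient is &lt;code&gt;2*x&lt;/code&gt; and whose minimum sits at 0. Watch what different learning rates do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def descend(lr, steps=20, x=1.0):
    # each step moves against the gradient of f(x) = x**2
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(descend(0.01))  # ~0.667: tiny steps, still far from the minimum
print(descend(0.1))   # ~0.012: reasonable, nearly converged
print(descend(1.1))   # ~38.3: too big, |x| grows every step (divergence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;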

&lt;h3&gt;
  
  
  Learning rate decay
&lt;/h3&gt;

&lt;p&gt;There are many ways to schedule the learning rate during training in order to get the most out of the model. The simplest one, which we will use in our project, is perhaps &lt;strong&gt;step decay&lt;/strong&gt; (a form of &lt;strong&gt;learning rate decay&lt;/strong&gt;). The technique involves decreasing the learning rate after a fixed number of steps (like 100000), thus making the model move slower when the loss is small and potentially near the minimum. This is kind of intuitive: we take big steps when we're far from the minimum, and smaller ones when we're close.&lt;/p&gt;

&lt;p&gt;Even though there are numerous variants of step decay, we will keep it simple by just dividing the learning rate by 10 after the first 10000 iterations, as sketched below.&lt;/p&gt;
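
&lt;p&gt;Here's a minimal sketch of that schedule (the helper name &lt;code&gt;lr_at&lt;/code&gt; is mine, and the step counts are just the ones mentioned above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def lr_at(step):
    # step decay: 0.1 for the first 10000 steps, then a 10x smaller step
    return 0.1 if step &lt; 10000 else 0.01

# inside the training loop, the update would become:
#   lr = lr_at(i)
#   for p in parameters:
#       p.data += -lr * p.grad
print(lr_at(0), lr_at(9999), lr_at(10000))  # 0.1 0.1 0.01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;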

&lt;h3&gt;
  
  
  Find the right learning rate
&lt;/h3&gt;

&lt;p&gt;Now we know the importance of choosing the right learning rate and adjusting it so that the model performs well. But how exactly can we do that? &lt;/p&gt;

&lt;p&gt;We will follow Andrej's way of finding a good learning rate; it's actually not the most optimal method, but it's really intuitive and we can easily follow all the steps. First we have to determine a &lt;strong&gt;range&lt;/strong&gt; of possible good values. There will be some points where the learning rate is too low and the model barely decreases the loss, and there will be points where the learning rate is too high and the loss is bouncing around. We can find those points by plugging in different values; this is pure trial and error. And after playing around with some values of the learning rate (actually Andrej did this in his video, not me), we have our range: from &lt;code&gt;0.001&lt;/code&gt; to &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The next step is examining values across that range and seeing what happens to the loss; the optimal value should be the one where the loss is at its minimum. So let's run the model with different values of the learning rate. We can use PyTorch to construct an array of values in that range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Want to examine the range from 0.001 to 1
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plotting the lr-loss Curve
&lt;/h3&gt;

&lt;p&gt;After we have the array we want, let's plug it into our model. Also, we need to construct two lists, storing the values of the learning rate and the loss respectively, in order to graph things out.&lt;br&gt;
&lt;strong&gt;A BIG NOTE:&lt;/strong&gt; After messing things up in my own project, I have to warn you so that you won't make the same silly mistake: &lt;em&gt;REMEMBER TO RERUN THE ENTIRE MODEL AGAIN&lt;/em&gt;. Don't reuse the previous weights and biases for a rerun; you have to INITIALIZE THEM ALL OVER AGAIN. Keep this in mind every time we rerun the model (see the sketch after the code below for one way to make this automatic).&lt;br&gt;
Here's the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Keep track of lr and loss for each iter
&lt;/span&gt;&lt;span class="n"&gt;lri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;lossi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# mini batch construct
&lt;/span&gt;  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
  &lt;span class="c1"&gt;# tracking stats
&lt;/span&gt;  &lt;span class="n"&gt;lri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
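
&lt;p&gt;By the way, one way to make the big note above foolproof (a sketch of mine, not from the video; the helper name &lt;code&gt;init_params&lt;/code&gt; is hypothetical) is to wrap the initialization in a function and call it before every sweep, so stale weights can never leak into the next experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

def init_params():
    # fresh parameters on every call, same shapes as the network above
    g = torch.Generator().manual_seed(2147483647)
    C  = torch.randn((27, 2), generator=g)
    W1 = torch.randn((6, 100), generator=g)
    b1 = torch.randn(100, generator=g)
    W2 = torch.randn((100, 27), generator=g)
    b2 = torch.randn(27, generator=g)
    parameters = [C, W1, b1, W2, b2]
    for p in parameters:
        p.requires_grad = True
    return parameters

parameters = init_params()  # call this before every rerun
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;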



&lt;p&gt;Here comes the plot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plot the lr
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc42fu0mj361e377fksy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc42fu0mj361e377fksy.png" alt="LR-LOSS plot" width="556" height="418"&gt;&lt;/a&gt;&lt;br&gt;
This is a hockey-stick-like kind of graph: it has a steep slope at the origin (nearly a straight drop), then small fluctuations along the middle part, before bouncing back and forth uncontrollably toward the end of the range. &lt;/p&gt;

&lt;p&gt;Let's interpret this graph, and you will see that our conclusions so far are all correct. First, notice the beginning of the range, where the learning rate is small: the loss is not even close to the minimum, indicating that we're not making large enough steps to converge. Conversely, in the bulk of the remaining range, where the learning rate increases, the loss bounces between values, and it bounces even harder when the learning rate is near 1, which is an indicator of divergence. &lt;/p&gt;

&lt;p&gt;Now our mission is to look at this graph and find the value where the learning rate gets the loss to its minimum. By eye we can see that the value is somewhere around &lt;code&gt;0.1&lt;/code&gt;. If it's not too obvious, we can plot against the &lt;strong&gt;log&lt;/strong&gt; of the learning rate instead, so that the values spread out more evenly. Let's implement that in our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Take the log of our range of values
&lt;/span&gt;
&lt;span class="n"&gt;lre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;lre&lt;/span&gt;

&lt;span class="c1"&gt;# Intilize a new list 
&lt;/span&gt;&lt;span class="n"&gt;lrlog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Update this list in when tracking stats
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="bp"&gt;...&lt;/span&gt;
  &lt;span class="n"&gt;lrlog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lre&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;lri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Plotting things out
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lrlog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favvvz27vse10niyz45yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favvvz27vse10niyz45yt.png" alt="LRLOG-LOSS curve" width="543" height="418"&gt;&lt;/a&gt;&lt;br&gt;
Now it's getting obvious. So we've come full circle in this section: the magic number &lt;code&gt;0.1&lt;/code&gt; from the beginning is actually the optimal learning rate for this model.&lt;/p&gt;

&lt;p&gt;And after rerunning the model with this optimal learning rate, plus the &lt;strong&gt;learning rate decay&lt;/strong&gt; technique, we have our loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.3681914806365967
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's the best result we've got so far! It's amazing that just this one small number can have such a huge effect on our model, even when we're playing with the simple methods (imagine how crazy it gets when you try some advanced ones; worth a try!).&lt;/p&gt;

&lt;h2&gt;
  
  
  Splitting the Data
&lt;/h2&gt;

&lt;p&gt;So far we've been getting the model to study a lot, and we've also tuned it in order to bring down the loss. But does this loss mean that our model will do well in a real-world setting? Actually, we don't really know, because there's no &lt;strong&gt;test&lt;/strong&gt; yet. How can we create a test for the model?&lt;/p&gt;

&lt;p&gt;The answer is that we will stop feeding the model every data point; instead, we will hold back a small portion of the dataset. That's going to be our test, because the model doesn't get to see that data during training. Normally the split is around &lt;code&gt;8:2&lt;/code&gt; or &lt;code&gt;7:3&lt;/code&gt; for training:testing. The evaluation metric on the test set is still our &lt;code&gt;cross_entropy&lt;/code&gt;. One thing to keep in mind: if the validation loss is &lt;em&gt;high&lt;/em&gt; while the training loss is &lt;em&gt;low&lt;/em&gt;, then there's a high chance of &lt;em&gt;overfitting&lt;/em&gt; (the model does well only on the data it has seen before while performing poorly on the test, indicating that it is &lt;strong&gt;memorizing&lt;/strong&gt; rather than &lt;strong&gt;generalizing&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Moreover, when training, we also want to hold out another bit of data in order to tune the model; it's like another test set, used not for the final test but for evaluations while searching for good hyperparameters (just like what we did with the learning rate). &lt;em&gt;"But why another held-out set? Why don't we evaluate the hyperparameters right on the training set, like we've been doing?"&lt;/em&gt; - you may ask. The thing is, you're tweaking and tuning the hyperparameters so that they make the model powerful &lt;strong&gt;on just the training data&lt;/strong&gt;; there's no guarantee that these hyperparameters will work in practice, on data the model has never seen before. So we create another split called the &lt;strong&gt;dev set&lt;/strong&gt;, designed specifically for choosing the hyperparameters.&lt;/p&gt;

&lt;p&gt;Another side note here: what happens if the dataset is too small to split into 3 sets? How can we accurately evaluate the model then? The solution is quite interesting: if the data is too small, we have to "reuse" it when testing. To be more specific, rather than having just one fixed split and evaluating the model on that, we create a kind of parallel universe for each of several different splits, then evaluate the model on each one and look at the big picture. This clever method is &lt;strong&gt;cross-validation&lt;/strong&gt;, which is widely used in ML tasks. &lt;/p&gt;

&lt;p&gt;Concretely, we first randomly split the data into &lt;code&gt;k&lt;/code&gt; folds; then we train &lt;code&gt;k&lt;/code&gt; times, each time holding out a different one of the &lt;code&gt;k&lt;/code&gt; subsets for testing, which gives us &lt;code&gt;k&lt;/code&gt; evaluation scores. That is called &lt;strong&gt;k-fold cross validation&lt;/strong&gt;. &lt;/p&gt;
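
&lt;p&gt;Here's a minimal sketch of the idea using scikit-learn's &lt;code&gt;KFold&lt;/code&gt; (an assumption on my part; this post itself doesn't use scikit-learn):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.model_selection import KFold
import numpy as np

data = np.arange(100)  # stand-in for a small dataset
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(data)):
    # train on data[train_idx], evaluate on data[test_idx]
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;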

&lt;h3&gt;
  
  
  Split the data
&lt;/h3&gt;

&lt;p&gt;Just to remind you, here's a brief recap of our three sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training set: This is for training the &lt;strong&gt;parameters&lt;/strong&gt;, i.e. the model itself.&lt;/li&gt;
&lt;li&gt;Dev set: This is for tuning the &lt;strong&gt;hyperparameters&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Test set: This is for the final evaluation of the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's implement the split in our code: We will first shuffle the data and compute the counts for each of our sets, depending on the split we want (in this project we will use &lt;code&gt;8:1:1&lt;/code&gt;). Then we build the three sets of data from the counts we've got. Here's how it looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 

&lt;span class="c1"&gt;# Create a function for the dataset construction 
# For convenience purposes
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;#print(w)
&lt;/span&gt;    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="c1"&gt;#print(''.join(itos[i] for i in context), '---&amp;gt;', itos[ix])
&lt;/span&gt;      &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# crop and append
&lt;/span&gt;
  &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;

&lt;span class="c1"&gt;# Finding the counts in each sets
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;n1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;n2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Split
&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ytr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;n1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Xdev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ydev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Xte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Yte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n2&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those are our splits! Now we have to retrain the model, and remember to use the data in &lt;strong&gt;the training set only&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now we train on just the Xtr and Ytr
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="c1"&gt;# minibatch construct
&lt;/span&gt;  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;

  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ytr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="c1"&gt;#print(loss.item())
&lt;/span&gt;
  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.3609843254089355
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now comes the evaluation; remember to &lt;strong&gt;use the dev set&lt;/strong&gt; for this task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xdev&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ydev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(2.4578, grad_fn=&amp;lt;NllLossBackward0&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that the validation loss is a bit higher than the training loss, which indicates that the model is slightly overfitting. When a model is overfitting, we often reach for &lt;strong&gt;regularization techniques&lt;/strong&gt;; we actually used one in our previous project. But let's leave that part here; in the next section we will continue exploring ways to tune the model.&lt;/p&gt;
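
&lt;p&gt;For the curious, here's what one simple option would look like (a sketch of mine, not something we'll apply here): &lt;strong&gt;L2 weight decay&lt;/strong&gt; adds a penalty on the squared weights to the loss, nudging the model toward smaller, smoother weights. The strength &lt;code&gt;alpha&lt;/code&gt; is a hypothetical hyperparameter, and the other names come from the training loop above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# inside the training loop, the loss line would become:
alpha = 0.01  # regularization strength (hypothetical value)
loss = F.cross_entropy(logits, Ytr[ix]) \
       + alpha * ((W1**2).mean() + (W2**2).mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;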

&lt;h2&gt;
  
  
  Fine-tuning the Model
&lt;/h2&gt;

&lt;p&gt;How can we make the model more powerful? The first thing that comes to mind is perhaps: &lt;strong&gt;Just scale it!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale the model
&lt;/h3&gt;

&lt;p&gt;Let's modify the matrices that we initialize: We will increase the number of neurons in our hidden layer from &lt;code&gt;100&lt;/code&gt; to &lt;code&gt;300&lt;/code&gt;. Here are the modifications to the matrices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's run our model again (remember to set &lt;code&gt;requires_grad&lt;/code&gt;), this time keeping track of the steps and the loss, just to see how the loss changes over the iterations. We might take the &lt;strong&gt;log&lt;/strong&gt; of the loss, for better visualization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now we track the step, use the log for the loss
&lt;/span&gt;&lt;span class="n"&gt;lossi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;stepi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="c1"&gt;# minibatch construct
&lt;/span&gt;  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;

  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xtr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; 
  &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ytr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="c1"&gt;#print(loss.item())
&lt;/span&gt;
  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="c1"&gt;#lr = lrs[i]
&lt;/span&gt;  &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;

  &lt;span class="c1"&gt;# track stats
&lt;/span&gt;
  &lt;span class="n"&gt;stepi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stepi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;lossi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ophudlwy41wcnm79ep2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ophudlwy41wcnm79ep2.png" alt="STEP-LOSS" width="547" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see how the loss is quickly driven down and then fluctuates. The jitter is expected: each point is the loss on a single minibatch of 32 examples, not on the whole training set.&lt;/p&gt;
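
&lt;p&gt;A quick way to read through that noise, as a sketch that reuses the &lt;code&gt;lossi&lt;/code&gt; list we just logged: average the last chunk of logged log-losses and exponentiate, which gives a steadier estimate of where training ended up than a single minibatch loss.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Each point of the curve is the loss on one minibatch of 32 examples,
# hence the jitter. Averaging the tail of the log gives a steadier
# estimate. (Sketch; assumes lossi holds the loss.log().item() values.)
recent = torch.tensor(lossi[-1000:])
print(recent.mean().exp().item())  # back from log scale to loss scale
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;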

&lt;p&gt;We should see our final loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.799872875213623
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's worse; maybe that's because we held some data out for the splits, so the model now trains on less data. Let's evaluate this model on the &lt;strong&gt;dev set&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluate using the dev set
&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xdev&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 3, 2)
&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 100)
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 27)
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ydev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(2.5693, grad_fn=&amp;lt;NllLossBackward0&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not bad! Our model performs even better on the validation set than on the training batch. That's a huge improvement!&lt;/p&gt;

&lt;h3&gt;
  
  
  Changing the embedding size
&lt;/h3&gt;

&lt;p&gt;We scaled the net and yet we did not get a really good result. So one thing that comes to mind is that maybe the bottleneck is not the size of the net, but the &lt;strong&gt;embedding size&lt;/strong&gt;. Remember that in the first place we squished all 27 characters (27-dimensional, if one-hot encoded) into just two dimensions, so there might be a lot of information loss. &lt;/p&gt;

&lt;p&gt;But before we change the embedding size, let's take a look at how our model learned during the training phase. Let's visualize our embedding matrix &lt;code&gt;C&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# visualize dimensions 0 and 1 of the embedding matrix C for all characters
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;white&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;minor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k1p7vjeae8wnuy7k7xa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k1p7vjeae8wnuy7k7xa.png" alt="EMB MAP" width="671" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model learns to cluster characters, and if you look closely, it even separates the vowels &lt;code&gt;a,i,u,e,o&lt;/code&gt;, which is astonishing to see. And &lt;code&gt;g&lt;/code&gt; sits far apart; maybe the model thinks &lt;code&gt;g&lt;/code&gt; is not really a common character in names? It is still amazing that meaningful structure emerges in our net, just from tweaking and tuning a whole bunch of numbers.&lt;/p&gt;
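
&lt;p&gt;If you want to quantify what the plot shows, here is a small sketch (it assumes the trained &lt;code&gt;C&lt;/code&gt; and the &lt;code&gt;stoi&lt;/code&gt;/&lt;code&gt;itos&lt;/code&gt; mappings we built earlier) that lists the nearest neighbours of a character in embedding space:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Nearest neighbours in the 2-d embedding space, by Euclidean distance.
# (The exact neighbours depend on your training run.)
def neighbours(ch, k=5):
    v = C.data[stoi[ch]]
    d = (C.data - v).norm(dim=1)  # distance from ch to every character
    idx = d.argsort()[1:k+1]      # skip the closest row: ch itself
    return [itos[i.item()] for i in idx]

print(neighbours('a'))  # if the plot is right, mostly other vowels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;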

&lt;p&gt;Now let's increase the embedding size to &lt;code&gt;10&lt;/code&gt;, which changes the shapes of the &lt;code&gt;C&lt;/code&gt; matrix and &lt;code&gt;W1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# for reproducibility
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
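
&lt;p&gt;The training loop itself barely changes; the only thing to adjust is the flattened width of the context, since 3 characters times 10 dimensions is now 30 features. A sketch of the affected lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Only the view() width changes: block_size * embedding size = 3 * 10 = 30.
# (The rest of the training loop stays exactly as before.)
emb = C[Xtr[ix]]                            # (32, 3, 10)
h = torch.tanh(emb.view(-1, 30) @ W1 + b1)  # (32, 300)
logits = h @ W2 + b2                        # (32, 27)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;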



&lt;p&gt;After adjusting the dimensions accordingly (as in the sketch above) and rerunning the code, here is our loss:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# training loss
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.395942211151123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# validation loss
&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xdev&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 3, 10)
&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 300)
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (Ndev, 27)
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Ydev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(2.4487, grad_fn=&amp;lt;NllLossBackward0&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Xte&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# (32, 3, 2)
&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (32, 100)
&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="c1"&gt;# (32, 27)
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Yte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;loss&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(2.4514, grad_fn=&amp;lt;NllLossBackward0&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are better than the previous results! We did great, man.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sampling from the model
&lt;/h2&gt;

&lt;p&gt;Now we should meet our babies. The sampling method is the same as in the previous project, so let's just look at the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sample from the model
&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="c1"&gt;# initialize with all ...
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt; &lt;span class="c1"&gt;# (1,block_size,d)
&lt;/span&gt;      &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;
      &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
      &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;carpah.
quelle.
khi.
mila.
tety.
salaysa.
jazhnen.
amerahtia.
qui.
nellana.
chaiiv.
kaneel.
hham.
pein.
quinn.
sron.
taivanbi.
watell.
dearisi.
fine.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's way better! Even though some names are nonsense, a lot of them start to sound name-like, and that is a huge improvement for our model. We're moving forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We've come so far in this journey: we learned numerous things about the MLP, manipulated data using tensors, then fine-tuned our model with a wide range of methods, including feeding data in minibatches, tweaking the learning rate, splitting the data, scaling the model, and increasing the embedding size. It's a huge amount of knowledge! &lt;/p&gt;

&lt;p&gt;Thanks for reading, see you in the next chapter!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>LANGUAGE MODELS USING MLP (Part 1)</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Mon, 17 Nov 2025 08:31:48 +0000</pubDate>
      <link>https://dev.to/blackbidz/language-models-using-mlp-part-1-39bd</link>
      <guid>https://dev.to/blackbidz/language-models-using-mlp-part-1-39bd</guid>
      <description>&lt;p&gt;Welcome to the third part of the series! Just to remind, this blog follows the series from Andrej Karpathy on Youtube, and I'm just taking notes from his videos. Today we will explore deeper about the neural nets, and have great improvements regarding our language model.&lt;/p&gt;

&lt;p&gt;We will have two chapters for this topic, mainly because the content is way too long and there are so many things to mention in Andrej's video. I will try my best to provide a comprehensive view of this topic, as well as clearly explain every single concept involved.&lt;/p&gt;

&lt;p&gt;Here's the link to the video: &lt;a href="https://www.youtube.com/watch?v=TCH_1BHY58I&amp;amp;pp=ygURYnVpbGRpbmcgbWFrZW1vcmU%3D" rel="noopener noreferrer"&gt;Building makemore Part 2: MLP&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  MLP (Multi-layer Perceptron)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t49fitatqt2b1a6wnu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t49fitatqt2b1a6wnu1.png" alt="MLPs" width="597" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember the architecture you saw in the previous chapter? That was a multi-layer perceptron. Actually this term is nothing fancy if you're already familiar with neural nets.&lt;/p&gt;

&lt;p&gt;But in the last chapter we built just one layer of perceptrons, which explains why the model is so simple and produces such disappointing results. In this blog we will build a multilayer perceptron (actually just 2 layers, as sketched below), and we shall see how much more powerful the model becomes with hidden layers. &lt;/p&gt;
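
&lt;p&gt;To make the term concrete, here is a minimal sketch of a 2-layer forward pass with made-up sizes and random weights: a hidden layer with a nonlinearity, followed by a linear output layer. This is exactly the shape of the model we are about to build.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x = torch.randn(32, 6)                          # a batch of 32 inputs, 6 features each
W1, b1 = torch.randn(6, 100), torch.randn(100)  # layer 1 (hidden)
W2, b2 = torch.randn(100, 27), torch.randn(27)  # layer 2 (output)
h = torch.tanh(x @ W1 + b1)                     # hidden activations
logits = h @ W2 + b2                            # output scores
print(logits.shape)                             # torch.Size([32, 27])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;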

&lt;h2&gt;
  
  
  A Revolutionizing Approach
&lt;/h2&gt;

&lt;p&gt;Before we jump right into coding and blindly adjusting parameters to find the best result, I think we really need to take a step back and address the main problems of our past approach, and why the model is so bad at generating names even when the loss is fairly well optimized. So take a sip, and we will go through some important insights for our models:&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Don't We Scale the Model?
&lt;/h3&gt;

&lt;p&gt;This is the most naive question we could come up with when we want to make a model more powerful. Maybe we would love to have a 4-gram or 5-gram language model rather than just a bigram, so that the model can take into account more preceding characters and hence arrive at better results. &lt;/p&gt;

&lt;p&gt;But imagine what the table of counts would look like. It scales &lt;strong&gt;exponentially&lt;/strong&gt;, with a base of 27. For your lovely 4-gram or 5-gram, that would be 27^4 or 27^5 rows, which is quite intimidating to deal with.&lt;/p&gt;
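
&lt;p&gt;A quick back-of-the-envelope check of that growth:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rows in the count table for an n-gram over 27 characters: 27**n
for n in [2, 3, 4, 5]:
    print(n, 27**n)
# 2 729
# 3 19683
# 4 531441
# 5 14348907
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;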

&lt;p&gt;So naive scaling is not really an option, but are there any clever approaches? Well yes, a lot of research has been carried out, and the most ground-breaking work, I would say, is the one from Bengio and his colleagues.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Brief Overview of &lt;a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" rel="noopener noreferrer"&gt;Bengio's Paper&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Reading papers can be boring, and intimidating at times. But a paper is just an approach to some problem that got on the nerves of scientists, so we don't really need to know all the math or the experiments to understand one. When reading a paper, we should focus on &lt;em&gt;What problems are being addressed?&lt;/em&gt; and &lt;em&gt;What is the intuition behind the solution to those problems?&lt;/em&gt; Of course, it would be great if we could further understand the implementation and the proofs through long stretches of formulas and theorems. For now, let's talk about just the intuition.&lt;/p&gt;

&lt;p&gt;The paper first points out two main problems with the dominant language model of the time, the n-gram, and those problems were, to your surprise, already mentioned in our previous chapters when we built the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model does not take into account &lt;strong&gt;context&lt;/strong&gt; farther than 1 or 2 words. Well, we talked about that.&lt;/li&gt;
&lt;li&gt;The model does not take into account the &lt;strong&gt;similarity&lt;/strong&gt; between words.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hold up. You may object that similarity, or synonyms, are not really relevant to our task of predicting the next word. But actually they are of &lt;em&gt;paramount importance&lt;/em&gt; to our main task, specifically in improving the model's ability to &lt;em&gt;generalize&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Imagine the model encounters the sequence "A dog was running in the ..." and has no data about that sequence in the training set. What will it do? Suppose it has a training instance "The cat is walking in the bedroom", that it was trained on a huge text corpus, and that it knows "cat" is similar to "dog", "walking" is just like "running", "A" and "The" are virtually the same, and so on. With that, maybe the model will come up with something like "room" or "living room", which share similarities with "bedroom" from the training set. And that would be a great prediction! The model performs extremely well even though it encounters data it has never seen before. In other words, we say that the model has &lt;strong&gt;generalized&lt;/strong&gt; well to the test set, just by using the &lt;strong&gt;similarities&lt;/strong&gt; between words.&lt;/p&gt;

&lt;p&gt;So how can we create a model which can perform that magical task? Now, we shall see the most ground-breaking part of the paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature vectors - Embeddings
&lt;/h3&gt;

&lt;p&gt;We start by associating each word in the sequence with a vector, usually of lower dimension than the vocabulary size (that's actually a clever way to combat the &lt;strong&gt;Curse of Dimensionality&lt;/strong&gt;), and we work with these vectors from then on.&lt;/p&gt;

&lt;p&gt;In our previous chapter, we did this kind of conversion with &lt;em&gt;one-hot encoding&lt;/em&gt;, which turns each character into a one-hot vector to feed into the neural net. But a one-hot vector captures nothing but the &lt;em&gt;position&lt;/em&gt; of the character in the alphabet, which is of no avail to our prediction task.&lt;/p&gt;

&lt;p&gt;So now imagine their approach as a smarter way to encode our words. They convert words to vectors just like us, but first they store the words in a lower-dimensional space, and second, they arrange the vectors so that they capture &lt;strong&gt;similarities&lt;/strong&gt;. What I mean by "capture the similarities" is that in the vector space, vectors representing words with similar meanings, or the same &lt;strong&gt;semantics&lt;/strong&gt;, end up &lt;strong&gt;closer&lt;/strong&gt; to each other, and that is the main point of all of this. We get huge clusters of synonyms, and not only that, but also some cool tricks that play with the meaning of words.&lt;/p&gt;

&lt;p&gt;This "feature vector" trick is still applied in our modern world, but it have been improved significantly, now it is prevalent with the name of &lt;strong&gt;&lt;em&gt;Embeddings&lt;/em&gt;&lt;/strong&gt;, so we will call this method as &lt;em&gt;embeddings&lt;/em&gt; from now. Embeddings itself involves some complicated mechanisms, but from the very root, it is just a modern version of the feature vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explore Bengio's model's architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0t7u2vfm4x0vky8ynpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0t7u2vfm4x0vky8ynpm.png" alt="Bengio's architecture" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, in the &lt;strong&gt;Input layer&lt;/strong&gt;, the inputs are still the indices of the words (or in our case, the characters); this is just the lookup table we created with the mapping from a-z to 1-26. So no big-brain stuff here.&lt;/p&gt;

&lt;p&gt;In the next layer, interesting things happen; I will call this the &lt;strong&gt;Embedding layer&lt;/strong&gt;. This is different from our previous model, and it's the key point that brings about the significant improvements later on. We have a matrix &lt;code&gt;C&lt;/code&gt; for our embedding process; its parameters are tweaked and tuned during the learning process, and the matrix is shared across all words. Later on, in our model, we will explore some interesting patterns that our neural net learned during the training phase by looking at this matrix.  &lt;/p&gt;

&lt;p&gt;A notable thing to mention here is the &lt;em&gt;embedding size&lt;/em&gt;: we decide what dimension we "squish" our words into. In the paper, they embedded a vocabulary of 17,000 words into just a 30-dimensional vector space. This is called &lt;strong&gt;Dimensionality Reduction&lt;/strong&gt;, and there are numerous other methods that implement it. But you should also note that when we map a high-dimensional vector space to a lower-dimensional one, there will definitely be some information loss. So there's a tradeoff, and we should choose carefully.&lt;/p&gt;

&lt;p&gt;The next layer is the &lt;strong&gt;Hidden layer&lt;/strong&gt; of our net. This time we can choose any size we want for the layer. Note that in this layer they used an &lt;strong&gt;activation function&lt;/strong&gt; called &lt;strong&gt;tanh&lt;/strong&gt;, a classic one; we will talk about the whole family of activation functions perhaps in the next blog. &lt;/p&gt;

&lt;p&gt;The last layer is the &lt;strong&gt;Output layer&lt;/strong&gt;, and you can see the note &lt;em&gt;"most computation here"&lt;/em&gt;. It is an expensive layer, as we have to compute a logit for &lt;strong&gt;&lt;em&gt;every word in the vocabulary&lt;/em&gt;&lt;/strong&gt; (a total of 17,000 logits), and then apply the &lt;em&gt;Softmax&lt;/em&gt; over all of them to get the final probability distribution. &lt;/p&gt;
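
&lt;p&gt;In miniature, the output layer does this (a sketch with random logits, using the paper's vocabulary size):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

V = 17000                         # vocabulary size from the paper
logits = torch.randn(V)           # one logit per word: the expensive part
probs = F.softmax(logits, dim=0)  # normalize into probabilities
print(probs.sum())                # sums to 1: a valid distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;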

&lt;p&gt;So that is the bare bones of their model, and we will now rebuild it in our small project. Let's begin!&lt;/p&gt;

&lt;h2&gt;
  
  
  Load the data &amp;amp; Libraries
&lt;/h2&gt;

&lt;p&gt;Moving on to the third chapter, we need some improvements in the structure of our project. For convenience, we should import all of the libraries at the very beginning, and then load the dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt; &lt;span class="c1"&gt;# for making figures
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Loading the dataset
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# download the names.txt file from github
&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;wget&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;githubusercontent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;com&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;karpathy&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;makemore&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;names.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So now we have our data!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prepare the Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Indexing
&lt;/h3&gt;

&lt;p&gt;First things first: we need to reuse the mapping that we created for the characters, to index them and feed them into our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# build the vocabulary of characters and mappings to/from integers
&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;
&lt;span class="n"&gt;stoi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Also remember to map backward
&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build the dataset
&lt;/h3&gt;

&lt;p&gt;There is an important note here: we will no longer condition on just one previous character as in the bigram model; our new model will look at the &lt;strong&gt;three previous characters&lt;/strong&gt;, so we're scaling the &lt;em&gt;context window&lt;/em&gt; a little bit.&lt;/p&gt;

&lt;p&gt;With that in mind, we will need some modifications when preparing our data. We have a new variable &lt;code&gt;block_size&lt;/code&gt;, which is simply the number of characters in the context window. We set it to 3 for our model.&lt;/p&gt;

&lt;p&gt;We also need a technique to capture 3 characters at a time in each iteration rather than just one. It is fairly simple: we start with a window of size 3 filled with 0, and then we &lt;strong&gt;slide the window&lt;/strong&gt; across the training data, storing the contexts in our tensor &lt;code&gt;X&lt;/code&gt; and the corresponding labels in the tensor &lt;code&gt;Y&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Let's look at how we can implement it in Python; it is just cropping a list and appending a new element:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# build the dataset
&lt;/span&gt;&lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="c1"&gt;# context length: how many characters do we take to predict the next one?
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# We will just deal with 5 names for now
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
  &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;---&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# crop and append
&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's our data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;... ---&amp;gt; e
..e ---&amp;gt; m
.em ---&amp;gt; m
emm ---&amp;gt; a
mma ---&amp;gt; .
... ---&amp;gt; o
..o ---&amp;gt; l
.ol ---&amp;gt; i
oli ---&amp;gt; v
liv ---&amp;gt; i
ivi ---&amp;gt; a
via ---&amp;gt; .
...
torch.Size([32, 3]) torch.Size([32])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have 32 windows, each of size 3. That is enough to move on to the next stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embeddings &amp;amp; Data Manipulation with Pytorch
&lt;/h2&gt;

&lt;p&gt;This section will take us somewhere off the track: we will not dive into the project or fine-tuning, but will instead focus on &lt;strong&gt;Data Manipulation&lt;/strong&gt;, which is really interesting. &lt;/p&gt;

&lt;p&gt;Pytorch is often the go-to place for neural nets and even more complicated models, as it offers some really convenient operations and better use of memory when storing data. In our previous project we explored just a bit of the magic that Pytorch offers, namely the &lt;strong&gt;Broadcasting rules&lt;/strong&gt; and &lt;strong&gt;the backward() function&lt;/strong&gt;. Those are just the surface: there is a whole world of convenience in Pytorch, and some interesting manipulations that I think are worth learning, given that we will have to do a lot of things to our data in our projects.&lt;/p&gt;

&lt;p&gt;So let's walk through some of them while we're building our model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Initializing the embedding matrix
&lt;/h3&gt;

&lt;p&gt;The most important thing here is the &lt;em&gt;size&lt;/em&gt; of this matrix: we need to determine the dimension that we want to project our data onto. In this case, we will choose 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initilize randomly
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to have a good understanding of the sizes of the matrices that we create, and thus understand the underlying process. This matrix &lt;code&gt;C&lt;/code&gt; has size &lt;code&gt;27x2&lt;/code&gt;, which means it stores 27 rows, where each row is a &lt;em&gt;vector&lt;/em&gt; that embeds one character. The number &lt;code&gt;27&lt;/code&gt; appears because we have 27 characters in total, and the number &lt;code&gt;2&lt;/code&gt; is the dimension of the vector that represents each character. &lt;/p&gt;

&lt;p&gt;What if we want to embed a character using this matrix? There are two ways for doing this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using indexing
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Multiplying a one-hot vector of size 27 by the C matrix
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One-hot -&amp;gt; Remember to convert to float
&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both return a 2-dimensional vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([0.5262, 1.0655])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For convenience, we will stick with the former; Pytorch actually has some great indexing techniques that offer even more flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Indexing with Pytorch
&lt;/h3&gt;

&lt;p&gt;We can index using lists: we pass in a list of integers representing the indices we want, and it returns the corresponding sequence of 2-dimensional embedding vectors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Index using List
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[ 0.5262,  1.0655],
        [-2.2277, -0.5293],
        [-0.6665, -1.0212]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The list can contain duplicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Can be a tensor of int, we can repeat
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[ 0.5262,  1.0655],
        [-2.2277, -0.5293],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212],
        [-0.6665, -1.0212]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if we pass in a whole matrix? Pytorch can handle that too! It will iterate through the whole matrix and replace each entry with the corresponding embedding vector. So the indexing should return a new matrix with an additional dimension at the end, equal to the embedding size. Take a moment to think about it, and then look at the implementation to understand what I mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Indexing with a multidimensionl tensor
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([32, 3, 2])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a look at what it is actually doing inside the matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;X[:5]
tensor([[ 0,  0,  0],
        [ 0,  0,  5],
        [ 0,  5, 13],
        [ 5, 13, 13],
        [13, 13,  1]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;C[X][:5]
tensor([[[-2.3175, -0.5157],
         [-2.3175, -0.5157],
         [-2.3175, -0.5157]],

        [[-2.3175, -0.5157],
         [-2.3175, -0.5157],
         [ 0.6433,  0.8121]],

        [[-2.3175, -0.5157],
         [ 0.6433,  0.8121],
         [ 1.0138, -0.7526]],

        [[ 0.6433,  0.8121],
         [ 1.0138, -0.7526],
         [ 1.0138, -0.7526]],

        [[ 1.0138, -0.7526],
         [ 1.0138, -0.7526],
         [ 1.5965, -0.8861]]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in X we have a bunch of 3-element arrays, and in C[X] we have a bunch of 3x2 matrices: Pytorch simply adds an additional dimension to store the embedding of each element! I also got confused at this part, so don't be shy to take a moment to think about it.&lt;/p&gt;

&lt;p&gt;And we can get a specific vector by indexing, just like when we're working with multidimensional arrays in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We get the embedding for the 13th window, at the third character
&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([-0.0371,  0.8457])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we should give the embedding of X a name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
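&lt;p&gt;This gives back the shape we saw earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([32, 3, 2])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;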



&lt;h2&gt;
  
  
  Creating the first layer
&lt;/h2&gt;

&lt;p&gt;As you can see from the architecture, a pack of 3 characters is fed into the model, but we embedded those into 2-dimensional vectors, so that contributes to a total of 3 x 2 = 6 input values. &lt;/p&gt;

&lt;p&gt;Okay, so our layer takes in 6 input values, and what is the output, or more precisely, how many &lt;em&gt;neurons&lt;/em&gt; does it have? This is totally up to us, and I will choose 100. And we're ready to go!&lt;/p&gt;

&lt;p&gt;You may question why we need to think about all of this in the first place. Well, it is crucial to think about the matrix that we want to create and how we can create it; there are lots of insights to gain when we look at the &lt;em&gt;size&lt;/em&gt; of the matrix. Moreover, when we perform matrix multiplication, it is important to keep track of the dimensions, as we cannot perform that operation if the dimensions don't match. Also, I want to clarify each step that we're doing, so that we won't get lost among some randomly-popped-up numbers; it's a way to assure ourselves that we're on track.&lt;/p&gt;
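&lt;p&gt;As a quick illustration of that habit (a toy sanity check of my own, not part of the original notebook), you can verify that the inner dimensions agree before multiplying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Matrix multiplication only works when the inner dimensions agree:
# [32, 6] @ [6, 100] is fine, [32, 6] @ [5, 100] raises a RuntimeError
A = torch.randn(32, 6)
B = torch.randn(6, 100)
assert A.shape[1] == B.shape[0]   # inner dims match
print((A @ B).shape)              # torch.Size([32, 100])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;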

&lt;h3&gt;
  
  
  Initializing the Weights and Bias
&lt;/h3&gt;

&lt;p&gt;In the previous chapter, we dealt with the weights only, but in practice, we need an additional term called &lt;strong&gt;Bias&lt;/strong&gt;, so that we will have the classic formula of "wx + b" that appears in virtually every ML/DL book.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating the first layer
&lt;/span&gt;
&lt;span class="c1"&gt;# Number of input : 3x2 because we have 2 dim embedd and 3 chars
# Number of neurons: 100 (totally up to us)
&lt;/span&gt;&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;b1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fitting the Dimension
&lt;/h3&gt;

&lt;p&gt;So now we need to feed our data into the first layer, right? We will do that by matrix multiplication. But wait, there's something wrong.&lt;/p&gt;

&lt;p&gt;We can't do matrix multiplication because the dimensions don't fit! Note that the embedding matrix is of the size &lt;code&gt;[32,3,2]&lt;/code&gt; while our W1 matrix is of the size &lt;code&gt;[6,100]&lt;/code&gt;. Those are not even matrices of the same type. So what will we do? We would love for our embedding matrix to have the size &lt;code&gt;[32,6]&lt;/code&gt;; specifically, we want to find a way to "merge" all of the embedding vectors in the 3-char window into one. In other words, the number of trigrams remains, but for each of them, we want to convert a sequence of three 2-d vectors into a sequence of 6 values in order to feed our first hidden layer. And guess what? Pytorch saves the day again!&lt;/p&gt;

&lt;p&gt;There is a function in the Pytorch library called &lt;code&gt;cat&lt;/code&gt; (short for concatenate), and it does the merging stuff that we discussed above. First we take the sequence of embedding vectors for &lt;em&gt;each character&lt;/em&gt; in the window, then apply the &lt;code&gt;cat&lt;/code&gt; function to it. The code goes like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Take the three and concat, in the second dim
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,:],&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,:],&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,:]],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's see the shape of this tensor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;torch.Size([32, 6])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exactly what we want! But hard-coding the sequence of embedding vectors to pass into the function is not really good, and we want to fix that too. Another function comes to the rescue which does exactly that task; it is called &lt;code&gt;unbind&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Unbind take the lists, return tuples of tensors
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unbind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just spend a few seconds contemplating how our library helps us implement a labor-intensive task in one line of code. That's amazing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Internals of Tensors
&lt;/h3&gt;

&lt;p&gt;Actually there is an even &lt;em&gt;more convenient&lt;/em&gt; way to implement the task above, which doesn't require any fancy functions. I will first introduce the &lt;strong&gt;Internals of Tensors&lt;/strong&gt;, but I would recommend reading ezyang's blog for a more comprehensive understanding. &lt;a href="https://blog.ezyang.com/2019/05/pytorch-internals/" rel="noopener noreferrer"&gt;This is the blog.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Briefly speaking, the tensor in Pytorch has an interesting way of storing data. Specifically, it stores every value in the matrix in a one-dimensional array, &lt;em&gt;irrespective of the shape&lt;/em&gt;. So you can imagine that everything gets flattened out into one single long series. We can access this storage via the function &lt;code&gt;storage&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; -0.6637181043624878
 0.31748151779174805
 -0.6637181043624878
 0.31748151779174805
 -0.6637181043624878
 0.31748151779174805
 ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what is the point of all of this? It turns out that there is a specific method called &lt;code&gt;view()&lt;/code&gt; that can return a matrix of any shape, using the data from our matrix. It's a much more efficient way since it's just representing the same data differently; it doesn't create new tensors to work with. A note here is that your new matrix can be of any shape that you want &lt;em&gt;as long as the total number of elements matches the original&lt;/em&gt;. Let's implement this black magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[-1.3302, -1.3333, -1.3302, -1.3333, -1.3302, -1.3333],
        [-1.3302, -1.3333, -1.3302, -1.3333, -0.1033, -1.4972],
        [-1.3302, -1.3333, -0.1033, -1.4972, -0.9485,  0.6885],
        [-0.1033, -1.4972, -0.9485,  0.6885, -0.9485,  0.6885],
        ......
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also use &lt;code&gt;emb.view(-1,6)&lt;/code&gt;; it generates the same result and also eliminates the need to hard-code the number 32 (that number won't hold later when we consider the whole dataset). But remember, we can't have matrices like &lt;code&gt;emb.view(5,6)&lt;/code&gt;: the product of the new dimensions must equal the number of elements in the original matrix (32 x 3 x 2 = 192 = 32 x 6).&lt;/p&gt;
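&lt;p&gt;Here is a small sketch of what works and what doesn't (the failing case is my own illustration, commented out so the snippet runs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Both describe the same 32 x 6 layout over the same storage
assert torch.equal(emb.view(32, 6), emb.view(-1, 6))

# This would fail: 5 * 6 = 30 elements, but emb holds 32 * 3 * 2 = 192
# emb.view(5, 6)   # RuntimeError: shape '[5, 6]' is invalid for input of size 192
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;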

&lt;p&gt;Now we need to name the new matrix, and also get the &lt;strong&gt;tanh&lt;/strong&gt; of the values, just like in the architecture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the tanh
&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are thinking about the dimensions here, you may question why the &lt;code&gt;b1&lt;/code&gt; term can be added at all: it's just a vector of size 100, while the others are two-dimensional. This is actually a valid operation (at least in Pytorch), and it involves a thing that we already know: &lt;strong&gt;Broadcasting&lt;/strong&gt;. The &lt;code&gt;b1&lt;/code&gt; vector is broadcast as a row vector, so the same 100 bias values are added to every row of the matrix, and it's exactly what we want! &lt;/p&gt;
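&lt;p&gt;A toy demonstration of that broadcast, with shapes matching ours (my own example, not from the notebook):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

scores = torch.randn(32, 100)   # stands in for emb.view(-1,6) @ W1
bias = torch.randn(100)         # like b1: a plain 1-d vector
out = scores + bias             # bias is treated as [1, 100] and stretched to [32, 100]
assert out.shape == (32, 100)
# every row i receives the same bias: out[i] equals scores[i] + bias
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;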

&lt;p&gt;And that is everything about the magic of the Pytorch library for manipulating data. I hope you get a sense of how amazing this library is; you can try some operations yourself, they are all documented &lt;a href="https://docs.pytorch.org/docs/stable/torch.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the Final Layer
&lt;/h2&gt;

&lt;p&gt;Think about the dimensions again: what should the &lt;code&gt;W2&lt;/code&gt; matrix look like? First, it should output the probability for 27 characters, so there should be a 27 in its size. Moreover, it has to match the output of the previous layer, which produces 100 values per example (since &lt;code&gt;W1&lt;/code&gt; is of size &lt;code&gt;6x100&lt;/code&gt;). Hence, we have the size of the final weight matrix: &lt;code&gt;100x27&lt;/code&gt;, and the bias is, surely, an array of 27 elements.&lt;/p&gt;

&lt;p&gt;Let's implement that in our code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, we're ready to calculate our probability!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;

&lt;span class="c1"&gt;# Applying softmax
&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keepdims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
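&lt;p&gt;As an aside (not something this notebook relies on), this exponentiate-and-normalize pair is exactly what the built-in softmax computes, so the two lines above should agree with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn.functional as F

# Same result as counts / counts.sum(1, keepdims=True)
prob = F.softmax(logits, dim=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;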



&lt;p&gt;And finally, the negative log likelihood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
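&lt;p&gt;For what it's worth, Pytorch also ships a fused routine that goes straight from logits to this loss; it should give the same value (up to numerical precision), and we could swap it in here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch.nn.functional as F

# softmax + pick out the label + negative log + mean, all in one call
loss = F.cross_entropy(logits, Y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;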



&lt;p&gt;The result is (maybe some drum roll for this moment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor(14.3920)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is TERRIFYING. But we haven't done any optimization yet, zero, zip, nada. And that, my friend, is gonna be the story of the next part. In this part, we've already equipped ourselves with a whole bunch of knowledge, from the revolutionary approach of Bengio, to generating a trigram dataset, to great data manipulation techniques in Pytorch. That's really enough for today, so congratulate yourself for reaching this far.&lt;/p&gt;

&lt;p&gt;Thanks for reading, and see you in the next blog!&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>llm</category>
      <category>tutorial</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>BIGRAM LANGUAGE MODELS USING A NEURAL NET</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Fri, 14 Nov 2025 14:13:12 +0000</pubDate>
      <link>https://dev.to/blackbidz/bigram-language-models-using-a-neural-net-452f</link>
      <guid>https://dev.to/blackbidz/bigram-language-models-using-a-neural-net-452f</guid>
<description>&lt;p&gt;Welcome to the second chapter in the series. Today I will present a different approach to our task of predicting names, based on the video from Andrej Karpathy. In this blog, we will do more than the mere task of counting and normalizing the sum to get the probability, which is not really machine-learning-like; we will take a step further, a big step, and jump right into Neural Networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Brief Introduction for Neural Nets
&lt;/h2&gt;

&lt;p&gt;I will give a short overview of this wonderful architecture, just some basics to make you feel comfortable when reading this blog. Further details will be discussed in later blogs, like the MLP, or the CNN, LSTM, things like that. So let's begin, shall we?&lt;/p&gt;

&lt;p&gt;A neural network is a kind of architecture that some brilliant people created in order to replicate the neural system of humans. To be more precise, we're talking about the &lt;em&gt;Artificial Neural Network (ANN)&lt;/em&gt;, and it consists of a bunch of nodes connected to each other, as you can see here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9keqmygw8oh2ucbnvfhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9keqmygw8oh2ucbnvfhm.png" alt="Neural Network" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You just need to remember that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each node in the neural net is called a &lt;em&gt;neuron&lt;/em&gt;, and from the image we can see that those are nicely separated into parts that are the &lt;em&gt;layers&lt;/em&gt;. The layers in the neural net are often the Input layer, the Output layer, and some Hidden layers in the middle (but in this blog today we will just need to construct one hidden layer, so it's not really hidden)&lt;/li&gt;
&lt;li&gt;"What are those neurons doing and why do we need to pass information through layers of a bunch of neurons?" might be the question in your mind. Well this is hard to tell, I would say that the neurons take the inputs and apply a function to it, like wx + b, so we have a  bunch of functions (sometimes those can be non-linear, we will talk about that later) and the values of w,b that we assign for each function is what we would call the &lt;em&gt;parameters&lt;/em&gt;. A complex neural net has thousands of parameters, and those are changeable. This is perhaps the most important thing, &lt;em&gt;we can change the parameters in the neural net&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Our goal is to &lt;em&gt;tweak and tune&lt;/em&gt; the parameters in order to get our desired result, and we use a technique called &lt;em&gt;backpropagation&lt;/em&gt; to do so. Backpropagation is just a method we use when we want to minimize the loss (the difference) between our prediction and the actual result, where we work backward from the prediction all the way to the input, probably tweaking the parameters along the way. But to get the prediction, first we need to plug in the input and pass it through layers of neurons, that's called a &lt;em&gt;forward pass&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;But how can we really "get the desired result" from a neural net? What do we expect from the output layer? It really depends on the task, but normally, in the classification problem or in this specific makemore project, the output layer would produce a kind of &lt;em&gt;probability distribution&lt;/em&gt; for the possible values. We will discuss about that later when we delve into the code.&lt;/li&gt;
&lt;/ol&gt;
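&lt;p&gt;To make point 2 concrete, here is a tiny sketch of a single neuron with made-up numbers (my own toy illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x = torch.tensor([0.5, -1.2, 2.0])   # inputs to the neuron
w = torch.tensor([0.1, 0.4, -0.3])   # weights: parameters we can change
b = torch.tensor(0.25)               # bias: also a parameter

out = w @ x + b                      # the classic "wx + b", a single number
print(out)                           # tensor(-0.7800)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;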

&lt;p&gt;Okay if you understand all of the things above, then you are ready for the next part! (actually I think my explanation is not that good)&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Makemore with a Neural Net
&lt;/h2&gt;

&lt;p&gt;When dealing with neural nets, things get complicated. We can't just "tell" the net to do the things that we want, I mean, there is a huge language barrier, right? So we need to come up with a task, a really specific task, to give to the machine, such that when the machine does the task, it generates the result we want. &lt;/p&gt;

&lt;p&gt;Normally the task in ML or DL involves &lt;em&gt;minimizing a function&lt;/em&gt;, so we need a function to minimize. Where do we find one? Remember the negative log-likelihood that we talked about in the previous chapter? There it is! That also explains why we use the &lt;em&gt;negative&lt;/em&gt; log-likelihood, not the log-likelihood itself: our objective is to &lt;em&gt;maximize the likelihood&lt;/em&gt;, which is equivalent to &lt;em&gt;minimizing the negative log-likelihood&lt;/em&gt;.&lt;/p&gt;
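&lt;p&gt;A tiny numeric illustration of that equivalence (my own example, not from the video):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

probs = torch.tensor([0.9, 0.8, 0.7])   # probabilities our model gave the true labels
likelihood = probs.prod()               # about 0.504, we want this HIGH
nll = -probs.log().sum()                # about 0.685, equivalently we want this LOW
# log is monotonic, so parameters that maximize one minimize the other
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;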

&lt;p&gt;During the previous chapter, we solved the problem in a rather &lt;em&gt;explicit way&lt;/em&gt;, which is, we are just using mere counts or explicit data in order to get the result. In this chapter and numerous chapters later on, we will solve the task in an &lt;em&gt;implicit way&lt;/em&gt;, which is like using a secret mechanism, a kind of black magic, to arrive at the final answer. &lt;/p&gt;

&lt;p&gt;We need to restructure our data a bit, as we're putting it into a neural network. Now, remember that this is still a bigram language model: we still have one character as an input, and we output the &lt;em&gt;probability&lt;/em&gt; of the next character. We also have to signal the cases where the model is correct, meaning it assigns a high probability to the correct next character, so those correct characters should be our &lt;em&gt;labels&lt;/em&gt;, and our goal is to maximize the probability for the &lt;em&gt;labels&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the training set
&lt;/h3&gt;

&lt;p&gt;Now we shall begin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create the training set
# Inputs and the labels
&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;ix1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ix2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create tensors
&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we have our inputs and labels, the characters are encoded in the form of integers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([ 0,  5, 13, 13,  1])
tensor([ 5, 13, 13,  1,  0])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  One-hot encoding
&lt;/h3&gt;

&lt;p&gt;But is this good data to feed into our neural net? The answer is NO. Briefly speaking, our boy doesn't like strings or integers, it prefers &lt;strong&gt;vectors&lt;/strong&gt;. And if we normalize the vector then it is even better!&lt;/p&gt;

&lt;p&gt;Just like when we have to turn text into integers to feed in our bigram language model, in this case, we need to find a way to encode integers into some sorts of vectors to feed in the neural net. A convenient way to do that is &lt;strong&gt;One-hot encoding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One-hot encoding is simple: it makes an array full of zeros and turns the i-th element into 1 for an integer with the value i. You can take a look at it in the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xenc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to cast the entries of our vector to &lt;strong&gt;floating point numbers&lt;/strong&gt;, which are convenient for computing inside a neural net.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the ingredients are fully prepared, let's cook!&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the net
&lt;/h2&gt;

&lt;p&gt;The parameters can be stored in a huge matrix, and the input vectors are parts of a huge input matrix too. For convenience when computing, we are putting things in &lt;strong&gt;matrices&lt;/strong&gt;, and if you scrutinize the behavior of the neural net, or even the MLP or Attention, you can see that the operations are just a bunch of &lt;strong&gt;matrix multiplications&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's create our own matrix of the parameter w. Actually, this guy has a name: it is the &lt;strong&gt;weight&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialize the weights
&lt;/h3&gt;

&lt;p&gt;There are numerous ways to initialize the weights, and even some strategic ones. But oftentimes we draw the weights randomly from a normal distribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize the weights
&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# That randn function draw from a normal distribution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's a piece of our weight matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[ 1.5674, -0.2373, -0.0274, -1.1008,  0.2859, -0.0296, -1.5471,  0.6049,
          0.0791,  0.9046, -0.4713,  0.7868, -0.3284, -0.4330,  1.3729,  2.9334,
          1.5618, -1.6261,  0.6772, -0.8404,  0.9849, -0.1484, -1.4795,  0.4483,
         -0.0707,  2.4968,  2.4448],
        [-0.6701, -1.2199,  0.3031, -1.0725,  0.7276,  0.0511,  1.3095, -0.8022,
         -0.8504, -1.8068,  1.2523, -1.2256,  1.2165, -0.9648, -0.2321, -0.3476,
          0.3324, -1.3263,  1.1224,  0.5964,  0.4585,  0.0540, -1.7400,  0.1156,
          0.8032,  0.5411, -1.1646],
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we will have our inputs modified by the weights, using matrix multiplication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[-0.6701, -1.2199,  0.3031, -1.0725,  0.7276,  0.0511,  1.3095, -0.8022,
         -0.8504, -1.8068,  1.2523, -1.2256,  1.2165, -0.9648, -0.2321, -0.3476,
          0.3324, -1.3263,  1.1224,  0.5964,  0.4585,  0.0540, -1.7400,  0.1156,
          0.8032,  0.5411, -1.1646]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding the non-linearity - Softmax
&lt;/h3&gt;

&lt;p&gt;We won't use just that for our neural net's layers! The problem is this: when you keep multiplying the input by a weight w, and you do that consecutively, then at the end of the day, after hundreds of multiplications, you still get a mere multiple of x. Applying linear functions several times still keeps things linear, which makes our model rather simple and unable to capture any complicated patterns.&lt;/p&gt;

&lt;p&gt;Hence, we should add some non-linearity to the process. We will use a function called &lt;strong&gt;&lt;em&gt;Softmax&lt;/em&gt;&lt;/strong&gt; in this blog. &lt;/p&gt;

&lt;p&gt;So what is Softmax? It is designed to generate the probability distribution that we want. Remember that when we do matrix multiplication, the output of the neural net can be virtually anything, while we just want it to output some positive numbers that sum up to 1. Softmax does the trick: it takes the exponential of the output and normalizes it by the sum of all exponentials. There are some other interesting details about this function, but I will cover those in another blog. Let's see the formula of the function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy3qra1ddg0d20rjo7sd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy3qra1ddg0d20rjo7sd.png" alt="Softmax" width="640" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we create Softmax on our own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# log-counts
&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keepdims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;probs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there it is, nice-looking positive numbers that sum up to 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[0.0607, 0.0100, 0.0123, 0.0042, 0.0169, 0.0126, 0.0027, 0.0232, 0.0137,
         0.0313, 0.0079, 0.0278, 0.0091, 0.0082, 0.0500, 0.2372, 0.0604, 0.0025,
         0.0250, 0.0055, 0.0339, 0.0109, 0.0029, 0.0198, 0.0118, 0.1535, 0.1458],
        [0.0290, 0.0796, 0.0248, 0.0521, 0.1983, 0.0289, 0.0094, 0.0335, 0.0097,
         0.0301, 0.0701, 0.0229, 0.0115, 0.0184, 0.0108, 0.0315, 0.0291, 0.0045,
         0.0915, 0.0215, 0.0486, 0.0300, 0.0501, 0.0027, 0.0118, 0.0022, 0.0472],
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tweak and Tune the Parameters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gradient
&lt;/h3&gt;

&lt;p&gt;Before we dive into the main part of our project today, we need to take a step back to think about a very important concept of Machine Learning and Deep Learning in general: The Gradient or Gradient Descent.&lt;/p&gt;

&lt;p&gt;You may have heard about the gradient in your Calculus class: for a multivariable function, the gradient is a vector that points in the direction of steepest ascent. Well, to put it simply, imagine you are on a hill; that Gradient friend is the guy who always points you in the steepest direction and tells you to climb up. When doing ML stuff, we prefer to climb down, because we're minimizing, not maximizing, and that's where Gradient Descent comes in. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gradient descent is an optimization algorithm that uses the gradient to iteratively find the minimum of a function by taking steps in the direction opposite to the gradient.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the formula for Gradient Descent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbki5ar483apkwuajku9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbki5ar483apkwuajku9.png" alt="Gradient Descent" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be honest, this is just an over-simplified view of the matter; Gradient Descent is much more complicated and has numerous variants. But we understand the main idea: it's a way to get the loss down as much as possible.&lt;/p&gt;
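&lt;p&gt;To see the main idea in action, here is a deliberately tiny gradient-descent loop on f(x) = x**2, sketched by me; the real thing later in this blog steps a 27x27 weight matrix instead of a single number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimize f(x) = x**2 by stepping against the gradient f'(x) = 2x
x = 5.0
lr = 0.1                  # learning rate: how big each step is
for _ in range(50):
    grad = 2 * x          # gradient of x**2 at the current x
    x -= lr * grad        # step downhill, against the gradient
print(x)                  # very close to 0, the minimum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;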

&lt;h3&gt;
  
  
  Backpropagation
&lt;/h3&gt;

&lt;p&gt;We might need a whole blog to talk about the beauty of this algorithm; this is the thing that revolutionized the world of neural nets, and it lies at the core of almost every model you see in the world today.&lt;/p&gt;

&lt;p&gt;What is it? When calculating the gradient to minimize the loss, we actually cannot calculate it directly. The &lt;em&gt;forward pass&lt;/em&gt;, or the computing process to get to the prediction, involves numerous consecutive computations that are really difficult to connect into one main function to differentiate. &lt;/p&gt;

&lt;p&gt;So a brilliant way to solve that problem was proposed: we adjust the parameters step by step, starting from the neurons at the prediction, and pass the gradient down through each neuron consecutively until we reach the very first one. This is conveniently achieved by the &lt;em&gt;Chain rule&lt;/em&gt; from Calculus, and given the fact that the whole neural net is built from easy-to-differentiate functions (linear, sigmoid, ...), we can work our way backward!&lt;/p&gt;

&lt;p&gt;Another thing that contributes to the success of this algorithm is &lt;code&gt;autodiff&lt;/code&gt;, which stands for &lt;em&gt;automatic differentiation&lt;/em&gt;. Andrej also has a video about this, and maybe I will cover it later in another blog.&lt;/p&gt;
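&lt;p&gt;Here is the chain rule being handled by autodiff in a tiny toy example (mine, not Andrej's):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x ** 2).log()    # a small chain of easy-to-differentiate ops
y.backward()              # backpropagation: apply the chain rule backwards
print(x.grad)             # dy/dx = 2/x, so tensor(1.) at x = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;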

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaj6ab647rfnac00tkgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaj6ab647rfnac00tkgi.png" alt="Illustration of Backpropagation" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can watch the videos from 3blue1brown, I really admire him!&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=Ilg3gGewQ5U" rel="noopener noreferrer"&gt;Backpropagation (Intuitive)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=tIeHLnjs5U8" rel="noopener noreferrer"&gt;Backpropagation (Calculus)&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Implementation on the code
&lt;/h3&gt;

&lt;p&gt;Enough theory, let's turn back to our main project. Actually, there are not so many things we have to do, because the library already includes some really convenient functions for us. Well, at least we understand the underlying process.&lt;/p&gt;

&lt;p&gt;We start by backpropagating just one time. Remember our forward pass?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# input to the network: one-hot encoding
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="c1"&gt;# predict log-counts
&lt;/span&gt;  &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# counts, equivalent to N
&lt;/span&gt;  &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# probabilities for next character
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we just need to perform a backward pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="c1"&gt;# set to zero the gradient
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# magical things happen when applying this function
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to set &lt;code&gt;requires_grad = True&lt;/code&gt; when you initialize &lt;code&gt;W&lt;/code&gt;; then when we call &lt;code&gt;backward()&lt;/code&gt;, it applies backpropagation across the whole training set. We can have a look at our &lt;code&gt;W.grad&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[ 0.0121,  0.0020,  0.0025,  0.0008,  0.0034, -0.1975,  0.0005,  0.0046,
          0.0027,  0.0063,  0.0016,  0.0056,  0.0018,  0.0016,  0.0100,  0.0476,
          0.0121,  0.0005,  0.0050,  0.0011,  0.0068,  0.0022,  0.0006,  0.0040,
          0.0024,  0.0307,  0.0292],
        [-0.1970,  0.0017,  0.0079,  0.0020,  0.0121,  0.0062,  0.0217,  0.0026,
          0.0025,  0.0010,  0.0205,  0.0017,  0.0198,  0.0022,  0.0046,  0.0041,
          0.0082,  0.0016,  0.0180,  0.0106,  0.0093,  0.0062,  0.0010,  0.0066,
          0.0131,  0.0101,  0.0018],
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it is filled with non-zero values, which means we have successfully performed backpropagation on this training set.&lt;/p&gt;
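&lt;p&gt;For completeness, that &lt;code&gt;requires_grad&lt;/code&gt; flag is a one-argument tweak to the earlier initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Same initialization as before, but now Pytorch tracks gradients for W
W = torch.randn((27, 27), generator=g, requires_grad=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;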

&lt;p&gt;Now we need to update our &lt;code&gt;W&lt;/code&gt; matrix by &lt;em&gt;subtracting&lt;/em&gt; the gradient, multiplied by the &lt;em&gt;learning-rate&lt;/em&gt; (imagine it like the length of the step we want to make)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we repeat the whole process until we're satisfied with the result. Note that there is this thing called &lt;code&gt;epochs&lt;/code&gt;, which is simply the number of iterations we want to make. (One more detail: in the loop below, the loss gains a small extra term, &lt;code&gt;0.01*(W**2).mean()&lt;/code&gt;, a regularization that nudges the weights toward zero, playing a similar role to the smoothing we used in the counting model.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# gradient descent
&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="c1"&gt;# forward pass
&lt;/span&gt;  &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;xs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# input to the network: one-hot encoding
&lt;/span&gt;  &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="c1"&gt;# predict log-counts
&lt;/span&gt;  &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# counts, equivalent to N
&lt;/span&gt;  &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# probabilities for next character
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

  &lt;span class="c1"&gt;# backward pass
&lt;/span&gt;  &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="c1"&gt;# set to zero the gradient
&lt;/span&gt;  &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

  &lt;span class="c1"&gt;# update
&lt;/span&gt;  &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's basically it! We should see how our model performs. Here are some of its last iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2.4908623695373535
2.4906723499298096
2.4904870986938477
2.4903063774108887
2.4901304244995117
2.489959478378296
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wonderful! Actually it's even better than counting! So we're good now.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sampling
&lt;/h3&gt;

&lt;p&gt;And lastly, here is how we sample names from this model; it's just the same stuff we did to sample from the previous bigram model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# finally, sample from the 'neural net' model
&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

  &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="c1"&gt;# ----------
&lt;/span&gt;    &lt;span class="c1"&gt;# BEFORE:
&lt;/span&gt;    &lt;span class="c1"&gt;#p = P[ix]
&lt;/span&gt;    &lt;span class="c1"&gt;# ----------
&lt;/span&gt;    &lt;span class="c1"&gt;# NOW:
&lt;/span&gt;    &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;num_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xenc&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="c1"&gt;# predict log-counts
&lt;/span&gt;    &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# counts, equivalent to N
&lt;/span&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# probabilities for next character
&lt;/span&gt;    &lt;span class="c1"&gt;# ----------
&lt;/span&gt;
    &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_samples&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at our results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cexze.
momasurailezityha.
konimittain.
llayn.
ka.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's the same as the bigram model! Well, another disappointing result. Then again, this is just a one-layer neural network, so maybe it's doing its best for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key notes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bigram versus Neural Net
&lt;/h3&gt;

&lt;p&gt;The samples from both models are the same, which means that even though they take different paths, they still share huge similarities. In the neural net, the loss is slightly lower, which indicates slightly better performance.&lt;/p&gt;

&lt;p&gt;Even though neural nets can be hard to interpret, they really outperform their counterparts (remember, this is just a plain vanilla one-layer neural net; imagine when the big bros MLP and Transformer come in), and they are also flexible: we can apply them to numerous tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some insights from the W matrix
&lt;/h3&gt;

&lt;p&gt;So far we've treated this matrix as nothing but a complicated black box that is tweaked and tuned all the time to produce some satisfying results. But there is an intuitive way to understand these parameters, which can help us gain insights into what the model is actually doing.&lt;/p&gt;

&lt;p&gt;The parameter &lt;code&gt;w&lt;/code&gt; can be read as the &lt;em&gt;importance&lt;/em&gt; or the &lt;em&gt;influence&lt;/em&gt; of one neuron on another. Since the next neuron is just a linear combination of the previous ones, we get a function of numerous variables, each multiplied by its corresponding weight. So when a weight is large and positive, its neuron greatly influences the next one, and vice versa.&lt;/p&gt;

&lt;p&gt;When we want the model to produce, for example, the &lt;code&gt;'.'&lt;/code&gt; that marks the end of a word, we may not want it to be greatly influenced by, let's say, the letter &lt;code&gt;'j'&lt;/code&gt;, because there are just a few names that end with &lt;code&gt;'j'&lt;/code&gt;; we want it to be more influenced by &lt;code&gt;'n'&lt;/code&gt; or &lt;code&gt;'m'&lt;/code&gt;, because those are common ending characters. So what do we do? We just lower the weight for &lt;code&gt;j&lt;/code&gt; and increase the weights for &lt;code&gt;n&lt;/code&gt; and &lt;code&gt;m&lt;/code&gt;. And that's what our model is trying to do under the hood.&lt;/p&gt;
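
&lt;p&gt;Here is a tiny sketch (not from the video) of how you could probe this yourself, assuming the trained &lt;code&gt;W&lt;/code&gt; and the &lt;code&gt;stoi&lt;/code&gt; lookup from earlier in the post, with &lt;code&gt;'.'&lt;/code&gt; at index 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# hypothetical probe: row ix of W holds the logits for the next
# character given the current character ix (since logits = xenc @ W)
print(W[stoi['n'], stoi['.']].item())  # expect a larger value: many names end in 'n'
print(W[stoi['j'], stoi['.']].item())  # expect a smaller value: few names end in 'j'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;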

&lt;h3&gt;
  
  
  Regularization
&lt;/h3&gt;

&lt;p&gt;If our model is confident about its predictions, that's exactly what we want. But is it good when the model is &lt;em&gt;over-confident&lt;/em&gt;? Not really; sometimes we allow for a little bit of uncertainty in order to make the model perform better in real life. That is to &lt;em&gt;prevent overfitting&lt;/em&gt;, and we call it by a fancy name: &lt;strong&gt;Regularization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Allowing for some uncertainty in the decision-making process is equivalent to smoothing out the probability distribution. And, perhaps to your surprise, it's actually similar to the &lt;strong&gt;Smoothing technique&lt;/strong&gt; we discussed in the previous chapter. This time we will do it in a slightly different way.&lt;/p&gt;

&lt;p&gt;Notice that when the &lt;code&gt;W&lt;/code&gt; matrix is full of zeros, the probability distribution is uniform (we'll sanity-check this claim right after the code), so we will try to drive the values in that matrix down to get a smoother, more uniform distribution. This is done in the loss function, where we now have two objectives, and the technique is called &lt;strong&gt;&lt;em&gt;Weight decay Regularization&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;probs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ys&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
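
&lt;p&gt;As a quick sanity check of the uniformity claim, here is a minimal sketch, independent of the trained model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn.functional as F

# an all-zero W gives logits of 0, so exp() gives equal counts,
# which normalize to a uniform distribution over the 27 characters
W0 = torch.zeros((27, 27))
xenc = F.one_hot(torch.tensor([5]), num_classes=27).float()
logits = xenc @ W0
counts = logits.exp()
p = counts / counts.sum(1, keepdims=True)
print(p[0, :5])  # tensor([0.0370, 0.0370, 0.0370, 0.0370, 0.0370]), i.e. 1/27 each
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;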



&lt;p&gt;And when you run the code again, you can see that the loss is a bit higher than the loss without regularization. It's a small tradeoff we make to ensure that the model performs well in a real-world setting. (You can read more about the Bias/Variance tradeoff.)&lt;/p&gt;

&lt;p&gt;Regularization is actually an interesting topic that deserves a blog post of its own (well, just 2 blog posts in and my to-do list has already grown considerably). There are numerous regularization techniques, each with its own beauty. You can explore them if you are curious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We've come so far in our journey! We learned tons of things: about neural nets, the Softmax function, Gradient Descent and Backpropagation, and Regularization too. There are a lot of things packed into this tiny session about a small, simple language model. We broadened a whole new horizon of knowledge just by diving deep enough, and we had some fun playing with the code along the way. Life is good, my friend!&lt;/p&gt;

&lt;p&gt;In the next chapter we will discuss the MLP (multi-layer perceptron), following Andrej's series, and I think it will be a lot of fun!&lt;/p&gt;

&lt;p&gt;Stay tuned, and thanks for reading. Have a good day!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>BUILDING A BIGRAM LANGUAGE MODEL</title>
      <dc:creator>Hưng Lê Tiến</dc:creator>
      <pubDate>Thu, 13 Nov 2025 11:48:20 +0000</pubDate>
      <link>https://dev.to/blackbidz/building-a-bigram-language-model-48f6</link>
      <guid>https://dev.to/blackbidz/building-a-bigram-language-model-48f6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is deeply inspired by Andrej Karpathy's series about makemore, actually this post is just the notes of Andrej's first video, together with some insights that I came up with.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  An Introduction To N-gram Language Model
&lt;/h2&gt;

&lt;p&gt;The N-gram language model is perhaps the simplest form of language model that any human could think of. Imagine the predictive text that pops up above your keyboard for each word you are typing: it's just a plain prediction given the previous word or phrase. The N-gram language model is just that!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An N-gram language model considers your &lt;em&gt;n-1&lt;/em&gt; previous words and makes a prediction about the next word in the sequence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some N-gram language models that we should know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Unigram language models&lt;/em&gt;: 1-gram, considers no previous words to make a prediction; honestly, this is a funny language model.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Bigram language models&lt;/em&gt;: considers just the previous word, no more no less; this is the model we will examine today.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Trigram language models&lt;/em&gt;: looks back at the 2 previous words.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the world of powerful and ever-developing LLMs, N-gram language models are just children's toys. As you can immediately notice from the way they work, and later when we scrutinize the making of one, they do not really take into account the &lt;strong&gt;&lt;em&gt;context&lt;/em&gt;&lt;/strong&gt; of the sequence. Sometimes they don't even know the meaning of each word, which is sad given the complex nature of language.&lt;/p&gt;

&lt;p&gt;So why should we stick with this overly-simplified language model? Well, everyone starts somewhere, mate. This is a great start for learning about language models, and later we will gradually dive deeper, step by step, so feel free to skip this blog if you already know it all too well.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works (The Intuition Behind)
&lt;/h2&gt;

&lt;p&gt;Just by counting, no more no less.&lt;/p&gt;

&lt;p&gt;That is basically the main idea! We just need a huge text corpus to train on, and we count the &lt;strong&gt;relative frequency&lt;/strong&gt; of each combination of words. Next, we assign a probability to each upcoming word based on the frequencies we counted earlier, and our model is ready to go!&lt;/p&gt;

&lt;p&gt;If your training set consists of conversations between two people who love each other, for example, then when the model sees &lt;strong&gt;"I"&lt;/strong&gt; it will confidently think that the next word is &lt;strong&gt;"love"&lt;/strong&gt;, and when there is &lt;strong&gt;"love"&lt;/strong&gt; the next should probably be &lt;strong&gt;"you"&lt;/strong&gt;. The model works by pure memorization and a bit of probability.&lt;/p&gt;

&lt;p&gt;But don't criticize it too much; we can still get some insights from the model, like some of the &lt;strong&gt;syntactic nature&lt;/strong&gt; of language, e.g. Nouns and Adjectives often come after Verbs, and Verbs often come after Nouns.&lt;/p&gt;

&lt;p&gt;So let's make things complicated by doing some maths.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works (The Math Behind)
&lt;/h2&gt;

&lt;p&gt;Consider the task of predicting the next word of the sequence &lt;strong&gt;"Your eyes are..."&lt;/strong&gt;&lt;br&gt;
Well, maybe your eyes are beautiful (are they?), but in a probabilistic sense, we should ask:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What is the probability that the word &lt;strong&gt;"beautiful"&lt;/strong&gt; appears after the sequence of &lt;strong&gt;"Your eyes are"&lt;/strong&gt;?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Mathematicians have a nice way to ask this question, using conditional probability:&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(beautiful∣Your eyes are)
        P(\text{beautiful}|\text{Your eyes are}) 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;beautiful&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Your eyes are&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Another way to interpret this: in our text corpus, among all of the cases where "Your eyes are" appears, how many times is the word "beautiful" put right after it?&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(beautiful∣Your eyes are)=C(Your eyes are beautiful)C(Your eyes are)
        P(\text{beautiful}|\text{Your eyes are}) = \frac{C(\text{Your eyes are beautiful})}{C(\text{Your eyes are})} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;beautiful&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Your eyes are&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Your eyes are&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Your eyes are beautiful&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;But this won't work well!&lt;/strong&gt;, you may think. Well, you are right. We cannot count like that, because we won't get a good estimate from a medium-sized corpus: many perfectly reasonable sequences never appear in it even once. Another reason is that language is &lt;strong&gt;creative&lt;/strong&gt;, which means that new things are invented all the time. So, all in all, the approach should be more flexible.&lt;/p&gt;

&lt;p&gt;A cleverer way is to calculate the probability recursively and take the product of the factors, using the &lt;strong&gt;&lt;em&gt;Chain rule&lt;/em&gt;&lt;/strong&gt; of Probability:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(w1:n)=P(w1)P(w2∣w1)P(w3∣w1:2)...P(wn∣w1:n−1)=∏k=1nP(wk∣w1:k−1)
        P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1}) = \prod\limits_{k=1}^n P(w_k|w_{1:k-1}) 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;...&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol 
large-op"&gt;∏&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Well, actually this doesn't solve the problem; it even makes it worse, as we now need to keep track of a bunch of long prefixes before we come up with the final probabilities. And this is where our friend Markov comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Markov Assumption
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Rather than considering the whole history of our sequence, we can approximate it by just looking at the last few words.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Markov didn't say exactly that, but it was his main idea. In our example, rather than taking the whole sentence "Your eyes are so" into account, we just need to look at the word "so" to make a prediction (we're using the bigram model, of course; if it were a trigram we would need to consider the 2 preceding words).&lt;/p&gt;

&lt;p&gt;To wrap up, we made this assumption (for the bigram):&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wn∣w1:n−1)≈P(wn∣wn−1)
P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;span class="mrel mtight"&gt;:&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;≈&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;And with that in mind, we can easily calculate the probability that we want, just by counting:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;P(wn∣wn−1)=C(wn−1wn)∑wC(wn−1wn)=C(wn−1wn)C(wn−1)
P (w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w_n)} = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;w&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin 
mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;span class="mbin mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;In other words, we calculate the ratio between the frequency of the combination we have in mind and the frequency of the preceding word, and that's what we call &lt;strong&gt;&lt;em&gt;relative frequency&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
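
&lt;p&gt;Here is a minimal sketch of that counting on a made-up toy corpus (the corpus and the numbers are purely for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter

corpus = "your eyes are beautiful . your eyes are kind .".split()

# count bigrams C(w_{n-1} w_n) and unigrams C(w_{n-1})
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# P(beautiful | are) = C(are beautiful) / C(are)
p = bigrams[('are', 'beautiful')] / unigrams['are']
print(p)  # 0.5 -- 'are' appears twice, once followed by 'beautiful'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;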

&lt;p&gt;It's getting boring talking about theoretical concepts, so let's jump right in and play around with the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making Makemore - A Bigram Language Model
&lt;/h2&gt;

&lt;p&gt;You should check this video out; it is the main inspiration for this post, and shout out to Andrej Karpathy!&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=PaCmpygFfXo&amp;amp;t=3847s" rel="noopener noreferrer"&gt;The spelled-out intro to language modeling: building makemore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To start, makemore is a simple language model that generates new versions of whatever data is fed into it. In this case we are dealing with names, so you can think of makemore as a kind of name-generating language model.&lt;/p&gt;

&lt;p&gt;Makemore works just like a bigram LM, but instead of looking at the previous word (it can't), it looks at the previous character and predicts the next one.&lt;/p&gt;

&lt;p&gt;We first load and explore our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;names.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;splitlines&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So we have this, which is a list of American names:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['emma',
 'olivia',
 'ava',
 'isabella',
 'sophia',
 'charlotte',
 'mia',
 'amelia',
 'harper',
 'evelyn']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we create our own bigram:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating the bigram, just a kind of data preprocessing
&lt;/span&gt;
&lt;span class="c1"&gt;# Make a dictionary to store all the pairs
&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="c1"&gt;# We have the zip function for convenience pairing, and we should
# signal the start/end of words by some additional characters
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;S&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;E&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;bigram&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bigram&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bigram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's look at our bigram, shall we?&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We sort the pairs by their frequency
&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;kv&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;kv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Here's the result
&lt;/span&gt;&lt;span class="p"&gt;[((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;E&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6763&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;E&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;6640&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;5438&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;S&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4410&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;e&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;E&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;3983&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Tensor
&lt;/h3&gt;

&lt;p&gt;For better data manipulation, we use the Tensor from PyTorch, which is a special data structure that works great for ML/DL projects.&lt;/p&gt;

&lt;p&gt;First we need to create a Tensor of our own:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="c1"&gt;# Create the tensor for our project
&lt;/span&gt;
&lt;span class="c1"&gt;# We have 26 letters in the alphabet and 2 charr &amp;lt;S&amp;gt; and &amp;lt;E&amp;gt; -&amp;gt; 28
&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This should be a 27x27 dimensional array storing 32-bit integers, normally the array will store the floating points number so we need to specify the dtype when calling the function
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
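
&lt;p&gt;A quick sanity check (just a tiny sketch, assuming the cell above has run) that the tensor looks the way we expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;print(N.shape)  # torch.Size([28, 28])
print(N.dtype)  # torch.int32
print(N.sum())  # tensor(0) -- nothing counted yet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;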

&lt;h3&gt;
  
  
  Indexing characters - from strings to integers
&lt;/h3&gt;

&lt;p&gt;Computers talk in numbers, not words, so we need a way to "encode" our characters as integers. The simplest way is perhaps to index each character by its order in the alphabet, so each pair becomes just a tuple of 2 integers.&lt;/p&gt;

&lt;p&gt;Here's how it goes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating a lookup table from characters to integers
&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;

&lt;span class="c1"&gt;# Map from char to int, indexing
&lt;/span&gt;&lt;span class="n"&gt;stoi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;span class="c1"&gt;# Remember the additional characters
&lt;/span&gt;&lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;S&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;
&lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;E&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;

&lt;span class="c1"&gt;# Take out our previous code
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;S&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;S&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;ix1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ix2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# Reverse mapping
&lt;/span&gt;
&lt;span class="n"&gt;itos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's basically done! We can use &lt;code&gt;matplotlib&lt;/code&gt; to contemplate our beautifully crafted bigram.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Blues&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;chstr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chstr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bottom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And there you have it!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqmeayft33zfkzv27ebk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqmeayft33zfkzv27ebk.png" alt="Our lovely bigram"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  A small modification
&lt;/h3&gt;

&lt;p&gt;As you can see from our bigram, the last row is all zeros, and so is the column for &lt;code&gt;&amp;lt;S&amp;gt;&lt;/code&gt;. What happened?&lt;/p&gt;

&lt;p&gt;Notice that when we use &lt;code&gt;&amp;lt;S&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;E&amp;gt;&lt;/code&gt; to mark the start and end of a word, &lt;code&gt;&amp;lt;S&amp;gt;&lt;/code&gt; can never appear as the second character of a pair (nothing ever comes before the start), and &lt;code&gt;&amp;lt;E&amp;gt;&lt;/code&gt; can never appear as the first (nothing ever comes after the end). That is exactly why we get one column and one row of zeros.&lt;/p&gt;
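&lt;p&gt;If you want to verify that on the tensor itself, here's a quick check (a sketch, assuming the 28x28 &lt;code&gt;N&lt;/code&gt; from above with &lt;code&gt;stoi['&amp;lt;S&amp;gt;'] = 26&lt;/code&gt; and &lt;code&gt;stoi['&amp;lt;E&amp;gt;'] = 27&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Nothing ever follows the end token, so its row is all zeros ...
print(N[27].sum().item())     # 0
# ... and nothing ever precedes the start token, so its column is all zeros
print(N[:, 26].sum().item())  # 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;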

&lt;p&gt;To solve this, following our friend Andrej in the video, we replace these two tokens with a single one: just a &lt;code&gt;.&lt;/code&gt; marking both the start and the end of a word.&lt;/p&gt;

&lt;p&gt;So we should make some changes to our code&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Now the tensor is just 27x27
&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# We also want to set the dot to 0 (personal preference)
# And with that we have to adjust our dictionary
&lt;/span&gt;&lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;stoi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;span class="c1"&gt;# Now change the chs
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;ix1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ix2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="c1"&gt;# And adjust the range of i,j, then we're done
&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imshow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Blues&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;chstr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;chstr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bottom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gray&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So now we have our final result:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbntsquv4plr0r1y77xym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbntsquv4plr0r1y77xym.png" alt="Perfect!"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Calculating the probability
&lt;/h3&gt;

&lt;p&gt;Now the serious things kick in: we need to calculate the relative frequency of each pair of characters. The most convenient way is to create a probability matrix &lt;code&gt;P&lt;/code&gt; of the same size as &lt;code&gt;N&lt;/code&gt;, where each entry holds the probability of the corresponding pair counted in &lt;code&gt;N&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So how do we actually calculate the probability? Remember the formula where we divide the frequency of the pair by the frequency of the &lt;em&gt;preceding character&lt;/em&gt;? In matrix terms, that is just the count of the pair divided by the &lt;strong&gt;sum of its row&lt;/strong&gt; in the frequency matrix (&lt;code&gt;N&lt;/code&gt;). Take a moment to convince yourself of that.&lt;/p&gt;
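&lt;p&gt;To make that concrete before we vectorize it, here is the scalar version for one single pair (a small sketch using the &lt;code&gt;stoi&lt;/code&gt; and &lt;code&gt;N&lt;/code&gt; we already built):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# P(next='b' | current='a') = count of the pair 'ab'
# divided by the count of everything that starts with 'a'
ix1, ix2 = stoi['a'], stoi['b']
p_b_given_a = N[ix1, ix2] / N[ix1].sum()
print(p_b_given_a)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The vectorized version below does exactly this for all 27x27 entries at once.&lt;/p&gt;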
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Create the P matrix
&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Take the sum for each row
# Notice that it is really convenient when using tensor
&lt;/span&gt;
&lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keepdim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# And then we normalize every entry
# Notice, again, how clean the code is
&lt;/span&gt;
&lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;keepdim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Just a small note: why do we set &lt;code&gt;keepdim = True&lt;/code&gt;? This could take a whole blog post to explain (and maybe I will write it if I'm not lazy). It has to do with something called &lt;strong&gt;&lt;em&gt;Broadcasting&lt;/em&gt;&lt;/strong&gt;, a really convenient mechanism for combining tensors of different shapes, used all over the place in data processing. You can check Andrej's video if you're curious; he goes into it really deeply. For now, a quick taste of the pitfall before we move on.&lt;/p&gt;
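&lt;p&gt;Here is a minimal sketch of that pitfall (a toy tensor, not our real &lt;code&gt;P&lt;/code&gt;): with &lt;code&gt;keepdim = False&lt;/code&gt; the row sums come out with shape &lt;code&gt;(27,)&lt;/code&gt;, broadcasting lines them up against the columns, and the division silently normalizes the wrong axis.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

P_demo = torch.rand(27, 27)

# (27,27) / (27,1): each ROW is divided by its own sum -- what we want
good = P_demo / P_demo.sum(1, keepdim=True)

# (27,27) / (27,): the (27,) is treated as (1,27), so each COLUMN
# gets divided by a row sum instead -- silently wrong
bad = P_demo / P_demo.sum(1)

print(good[0].sum())  # tensor(1.) -- a proper probability distribution
print(bad[0].sum())   # generally not 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;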
&lt;h3&gt;
  
  
  Sampling from our model
&lt;/h3&gt;

&lt;p&gt;We've created our probabilistic model! It may not be obvious, but we've already done all the essential parts. Now you might ask: "How can we get new names from this model?" You take a character (we start from the dot) and work recursively to predict the next. By that I mean: given the current character, look at the &lt;strong&gt;row&lt;/strong&gt; where it is the preceding character, &lt;strong&gt;sample&lt;/strong&gt; from that distribution to get the next character, then repeat with the new character until we hit the dot again, which marks the end of the word.&lt;/p&gt;

&lt;p&gt;All of that is neatly nested in our code below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We need to use a Generator in Pytorch to get the samples
&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2147483647&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the first 50 names
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="c1"&gt;# Get the distribution of the word and sample from it
&lt;/span&gt;    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multinomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;num_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replacement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;itos&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
   &lt;span class="c1"&gt;# End the loop when we see a dot
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ix&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you use the same seed for the Generator as me, you will get this list of names:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cexze.
momasurailezitynn.
konimittain.
llayn.
ka.
da.
staiyaubrtthrigotai.
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That is disappointing! But some of the names are actually acceptable, and we can't expect too much from such a simple model. &lt;/p&gt;
&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;We need some kind of metric to evaluate this model, and don't even think about MSE, MAE, or RMSE: those are regression metrics, not made for probabilities, let alone language.&lt;/p&gt;

&lt;p&gt;When dealing with probabilities, a few evaluation ideas should come to mind. One of them comes from &lt;strong&gt;&lt;em&gt;Maximum Likelihood Estimation&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Maximum Likelihood Estimation (MLE)
&lt;/h3&gt;

&lt;p&gt;This is another topic that deserves a blog post of its own, given how widely it is applied in ML/DL and beyond. To be honest, I don't know it too well myself, so I will just scratch the surface here.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;likelihood&lt;/em&gt; of the data is a function, and it is not the same thing as a &lt;em&gt;probability&lt;/em&gt;; in many tasks we fit the model by maximizing it. For our purposes, you just need to know that the likelihood is the product of a bunch of probabilities, one per bigram.&lt;/p&gt;

&lt;p&gt;But notice the scaling problem: every probability lies between 0 and 1, so when we multiply many of them together, the result becomes &lt;strong&gt;extremely small&lt;/strong&gt; and eventually rounds to exactly zero. That is called &lt;strong&gt;numerical underflow&lt;/strong&gt;.&lt;/p&gt;
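&lt;p&gt;You can watch the underflow happen with plain Python floats (a tiny sketch, unrelated to our actual model):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

# Multiply 200 probabilities of 0.01: the true value is 1e-400,
# far below what a float can represent, so we get exactly 0.0
prod = 1.0
for _ in range(200):
    prod *= 0.01
print(prod)                  # 0.0 -- underflow

# Summing the logs instead stays perfectly well-behaved
print(200 * math.log(0.01))  # about -921.03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;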

&lt;p&gt;To solve the problem, we use the &lt;em&gt;log&lt;/em&gt; of the likelihood instead, which turns the giant product into a well-behaved sum. We will also compute the &lt;em&gt;negative log likelihood&lt;/em&gt;; it is not strictly needed for today's task, but it will be of immense importance in the next blog, when we train a neural network and need a loss to minimize. And that's it!&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate the log_likelihood
&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;ix1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ix2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logprob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log_likelihood&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;logprob&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So we have our log-likelihood, and we should normalize it by &lt;code&gt;n&lt;/code&gt;, the number of bigrams we scored.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_likelihood=tensor(-559891.7500)
nll=tensor(559891.7500)
2.454094171524048
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Not bad! This is about as well as a model of this kind can do at char-to-char prediction; perhaps we expected too much from it.&lt;/p&gt;
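&lt;p&gt;For a sense of scale (a quick back-of-the-envelope check, not from the video): a model that guessed uniformly over our 27 tokens would have an average negative log-likelihood of log(27).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

# Average NLL of a uniform guess over 27 characters
print(math.log(27))  # about 3.2958 -- our bigram's 2.4541 beats this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;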
&lt;h2&gt;
  
  
  Perplexity
&lt;/h2&gt;

&lt;p&gt;Another metric we can use is Perplexity (PP or PPL), which shares its core idea with the notion of "Entropy".&lt;/p&gt;

&lt;p&gt;Think of this metric as the &lt;em&gt;level of surprise&lt;/em&gt; the model experiences on new text. If the model does well and assigns a high probability to a word, it is &lt;em&gt;less surprised&lt;/em&gt; when that word actually shows up, and vice versa. So a good model is one that is &lt;strong&gt;&lt;em&gt;least surprised&lt;/em&gt;&lt;/strong&gt; by the test set (less perplexed, which I think is the main idea behind the term Perplexity).&lt;/p&gt;

&lt;p&gt;So how do we measure surprise? When the model assigns a &lt;em&gt;high probability&lt;/em&gt;, it is &lt;em&gt;less surprised&lt;/em&gt;, and vice versa. So we can think of an &lt;strong&gt;inverse relationship&lt;/strong&gt;, and that's exactly what scientists do! They take the inverse of the probability, normalized by the sequence length via the N-th root.&lt;/p&gt;

&lt;p&gt;Here's the formula:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;perplexity(W)=P(w1w2...wn)−1N=1P(w1w2...wn)N
\text{perplexity}(W) = {P(w_1w_2...w_n)}^{\frac{-1}{N}} = \sqrt[N]{\frac{1}{P(w_1w_2...w_n)}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;perplexity&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;...&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line mtight"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord 
mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="root"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;...&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;Now that we've come this far, I shall tell you this fact: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Perplexity is actually the exponential of the &lt;em&gt;average&lt;/em&gt; negative log-likelihood. The two metrics carry &lt;strong&gt;&lt;em&gt;the same&lt;/em&gt;&lt;/strong&gt; information.&lt;/p&gt;
&lt;/blockquote&gt;
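&lt;p&gt;We can check that claim directly with the &lt;code&gt;nll&lt;/code&gt; and &lt;code&gt;n&lt;/code&gt; we computed earlier (a small sketch):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# Perplexity is exp of the AVERAGE negative log-likelihood per bigram
ppl = torch.exp(nll / n)
print(ppl)  # exp(2.4541) is about 11.64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;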

&lt;p&gt;We'll meet this fact again soon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smoothing
&lt;/h2&gt;

&lt;p&gt;Our code is not quite perfect yet! What if we need the log-likelihood of a name containing a pair the model has never seen in training? Let's see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# SMOOTHING
&lt;/span&gt;
&lt;span class="c1"&gt;# Check this word and calculate the loss
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bidjz&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
  &lt;span class="n"&gt;chs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ch2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]):&lt;/span&gt;
    &lt;span class="n"&gt;ix1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ix2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stoi&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ix1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ix2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logprob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log_likelihood&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;logprob&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ch1&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;ch2&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;logprob&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nll&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;log_likelihood&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;nll&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's look at the result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;b: 0.0408 -3.1998
bi: 0.0820 -2.5005
id: 0.0249 -3.6946
dj: 0.0016 -6.4146
jz: 0.0000 -inf
z.: 0.0667 -2.7072
log_likelihood=tensor(-inf)
nll=tensor(inf)
inf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log_likelihood goes to negative infinity! When the model encounters a pair it has never seen before, that pair's probability is zero, its log is negative infinity, and that single term sinks the whole sum. Of course we don't want that, so we need to get rid of all the zeros.&lt;/p&gt;
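&lt;p&gt;The culprit, in a single line:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

# The log of a zero probability is negative infinity,
# and one -inf term sinks the whole sum
print(torch.log(torch.tensor(0.0)))  # tensor(-inf)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;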

&lt;p&gt;The simplest way to do that is to add 1 to every entry of the &lt;code&gt;N&lt;/code&gt; matrix before normalizing. This is called &lt;strong&gt;Laplace Smoothing&lt;/strong&gt; (well, guess how brilliant Laplace felt when he came up with that!).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Just a small modification here
&lt;/span&gt;&lt;span class="n"&gt;P&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when we run our code again, we should have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.b: 0.0408 -3.1999
bi: 0.0816 -2.5061
id: 0.0249 -3.6939
dj: 0.0018 -6.3141
jz: 0.0003 -7.9817
z.: 0.0664 -2.7122
log_likelihood=tensor(-559977.9375)
nll=tensor(559977.9375)
2.454407215118408
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're good for now!&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;For this model there actually isn't much to say; the reason this post got so long is that I wanted to dig into some surrounding topics as well, like the Markov assumption, tensors and broadcasting, and simply playing around with the model. All in all, I just wanted to wrap up the knowledge I picked up from Andrej's video and share these cool things with you, my friend.&lt;/p&gt;

&lt;p&gt;Thanks for reading, and have a good day!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>beginners</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
