This post is deeply inspired by Andrej Karpathy's makemore series; it is essentially my notes on Andrej's first video, together with some insights of my own.
An Introduction To N-gram Language Models
The N-gram language model is perhaps the simplest form of language model anyone could think of. Imagine the predictive text that pops up above your keyboard as you type, which is just a plain prediction based on the previous word or phrase. That is exactly what an N-gram language model does!
An N-gram language model looks at the previous n-1 words and makes a prediction about the next word in the sequence.
Some N-gram language models that we should know:
- Unigram language models: 1-gram, consider no previous words at all when making a prediction; honestly, a funny kind of language model.
- Bigram language models: consider just the previous word, no more, no less; this is the model we will examine today.
- Trigram language models: look back at the 2 previous words.
In the world of powerful and ever-developing LLMs, N-gram language models are just children's toys. As you can immediately notice from the way they work, and later when we scrutinize how they are built, they do not really take the context of the sequence into account. They don't even know the meaning of individual words, which is sad given the complex nature of language.
So why should we stick with this overly-simplified language model? Well, everyone starts somewhere, mate. This should be a great start for learning about language models, and later we will gradually dive deeper, step by step, so feel free to skip this post if you already know this material well.
How It Works (The Intuition Behind)
Just by counting, no more no less.
That is basically the whole idea! We just need a huge text corpus to train on, and we count the relative frequency of each combination of words. Next, we turn those frequencies into probabilities for upcoming words, and our model is ready to go!
If the training set consists of conversations between two people in love, for example, then when the model sees "I" it will confidently guess that the next word is "love", and after "love" the next word should probably be "you". The model works by pure memorization and a bit of probability.
But don't criticize it too much; we can still get some insights from the model about the syntactic nature of language, e.g. nouns and adjectives often come after verbs, and verbs often come after nouns.
So let's make things complicated by doing some maths.
How It Works (The Math Behind)
Consider the task of predicting the next word of the sequence "Your eyes are..."
Well, maybe your eyes are beautiful (are they?), but in a probabilistic sense, we should ask:
What is the probability that the word "beautiful" appears after the sequence of "Your eyes are"?
Mathematicians have a nice way to ask this question, using conditional probability:
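$$P(\text{beautiful} \mid \text{Your eyes are})$$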
Another way to interpret this: in our text corpus, among all the occurrences of "Your eyes are", how many times is the word "beautiful" the very next word?
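Written as a ratio of counts, where $C(\cdot)$ denotes how many times a phrase appears in the corpus:
$$P(\text{beautiful} \mid \text{Your eyes are}) = \frac{C(\text{Your eyes are beautiful})}{C(\text{Your eyes are})}$$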
"But this won't work well!", you may think. Well, you are right. We cannot count like that because we won't get good estimates from a medium-sized corpus: most long phrases simply never appear in it. Another reason is that language is creative, so new phrases are invented all the time. All in all, the approach needs to be more flexible.
A cleverer way is to decompose the probability of the whole sequence into conditional probabilities and take their product, using the chain rule of probability:
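$$P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\cdots P(w_n \mid w_1 \ldots w_{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1 \ldots w_{k-1})$$
Here $w_1 \ldots w_n$ are the words of the sequence.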
Well, actually this doesn't solve the problem on its own; it even looks worse, since we still have to estimate probabilities conditioned on a bunch of long histories. And this is where our friend Markov comes in.
The Markov Assumption
Rather than considering the whole history of our sequence, we can approximate it by just looking at the last few words.
Markov didn't say that, but that was his main idea. In our example, rather than taking the whole sentence "Your eyes are so" into account, we just need to look at the word "so" to predict (of course we're using the bigram model, if it is trigram then we need to consider 2 preceding words).
To wrap up, we made this assumption (for the bigram):
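$$P(w_n \mid w_1 w_2 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$$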
And with that in mind, we can easily calculate the probability that we want, just by counting:
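$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$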
In other words, we calculate the ratio between the count of the bigram and the count of the preceding word, and that's what we call the relative frequency.
It's getting boring talking about these theoretical concepts, so let's jump right in and play around with the code.
Making Makemore - A Bigram Language Model
You should check this video out; it is the main inspiration for this post, and a shout-out to Andrej Karpathy!
The spelled-out intro to language modeling: building makemore
To start, makemore is a simple language model that generates new samples of whatever data is fed into it. In this case we are dealing with names, so you can think of makemore as a kind of name-generating language model.
Makemore works just like a bigram LM, but instead of looking at the previous word (it can't), it looks at the previous character and predicts the next one.
We first load and explore our dataset:
words = open('names.txt','r').read().splitlines()
words[:10]
So we have this, which is a list of American names:
['emma',
'olivia',
'ava',
'isabella',
'sophia',
'charlotte',
'mia',
'amelia',
'harper',
'evelyn']
Now we create our own bigram
# Creating the bigram, just a kind of data preprocessing
# Make a dictionary to store all the pairs
b = {}
# We use the zip function for convenient pairing, and we should
# signal the start/end of words with some additional characters
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1,ch2 in zip(chs,chs[1:]):
        bigram = (ch1,ch2)
        b[bigram] = b.get(bigram,0) + 1
Let's look at our bigram, shall we?
# We sort the pairs by their frequency
sorted(b.items(), key = lambda kv : -kv[1])[:5]
# Here's the result
[(('n', '<E>'), 6763),
(('a', '<E>'), 6640),
(('a', 'n'), 5438),
(('<S>', 'a'), 4410),
(('e', '<E>'), 3983)]
Tensor
For better data manipulation, we use tensors from PyTorch, a data structure that works great in ML/DL projects.
First we need to create a tensor of our own:
import torch
# Create the tensor for our project
# We have 26 letters in the alphabet plus the 2 special chars <S> and <E> -> 28
N = torch.zeros((28,28), dtype = torch.int32)
# This is a 28x28 array storing 32-bit integers; by default torch.zeros creates floating-point numbers, so we need to specify the dtype when calling the function
Indexing characters - from strings to integers
Computers talk in numbers, not words, so we need a way to "encode" our characters as integers. The simplest way is perhaps to index each character by its order in the alphabet, so each pair becomes just a tuple of 2 integers.
Here's how it goes:
# Creating a lookup table from characters to integers
chars = sorted(list(set(''.join(words))))
# Map from char to int, indexing
stoi = {s:i for i,s in enumerate(chars)}
# Remember the additional characters
stoi['<S>'] = 26
stoi['<E>'] = 27
# Reuse our previous code, now filling in the tensor
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1,ix2] += 1
# Reverse mapping
itos = {i:s for s,i in stoi.items()}
It's basically done! We can use matplotlib to contemplate our beautifully crafted bigram.
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize = (16,16))
plt.imshow(N,cmap = 'Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]
        plt.text(j,i,chstr, ha = "center", va = "bottom", color = 'gray')
        plt.text(j,i,N[i,j].item(), ha = "center", va = "top", color = 'gray')
plt.axis('off')
And there you have it!
A small modification
As you can see from our bigram matrix, the last row is all zeros, and so is the column for <S> near the end. What happened?
Notice that when we use <S> and <E> to denote the start and end of words, <S> can never appear as the second character of a pair, and <E> can never appear as the first. And that is exactly why we have a row and a column of zeros.
To solve this, following our friend Andrej in the video, we replace these two special tokens with a single dot: just a . for both the starting and the ending position.
So we should make some changes to our code:
# Now the tensor is just 27x27
N = torch.zeros((27,27), dtype = torch.int32)
# We want the dot to have index 0 (personal preference), so we have to rebuild our dictionary:
# letters get indices 1..26 first, and then the dot takes index 0
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi['.'] = 0
# Don't forget to rebuild the reverse mapping as well
itos = {i:s for s,i in stoi.items()}
# Now change the chs to use the dot
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1,ch2 in zip(chs,chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1,ix2] += 1
# And adjust the range of i,j, then we're done
plt.figure(figsize = (16,16))
plt.imshow(N,cmap = 'Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j,i,chstr, ha = "center", va = "bottom", color = 'gray')
        plt.text(j,i,N[i,j].item(), ha = "center", va = "top", color = 'gray')
plt.axis('off')
So now we have our final result:

Calculating the probability
Now the serious stuff kicks in: we need to calculate the relative frequency of each pair of characters. The most convenient way is to create a probability matrix P of the same size as N, where each entry represents the probability of the corresponding pair counted in N.
So how do we actually calculate the probability? Remember the formula where we divide the count of the pair by the count of the preceding word? Here, that is just the count of the pair divided by the sum of its row in the count matrix N. Take a moment to think about it.
# Create the P matrix
P = N.float()
# Take the sum for each row
# Notice how convenient this is with tensors
P.sum(1,keepdim = True)
# And then we normalize every entry
# Notice, again, how clean the code is
P = P / P.sum(1,keepdim = True)
Just a small note here: why should we set keepdim = True? This could take a whole blog post to explain (and I will probably write one if I'm not lazy). It has to do with something called broadcasting, a really convenient mechanism that is used all over the place in data processing. You can check the video from Andrej if you're curious; he goes into this really deeply. For now, here is just a tiny sketch of what can go wrong, and then we move on to the next part.
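This is a minimal toy example of my own (not from the video), just to show the shapes involved; I use fresh variable names so we don't overwrite our real N and P:
import torch
# A tiny 2x2 count matrix, just for illustration
M = torch.tensor([[1., 3.],
                  [2., 6.]])
# keepdim=True keeps the summed dimension, so the row sums have shape (2, 1).
# Broadcasting then divides each row by its own sum, and every row sums to 1.
probs_right = M / M.sum(1, keepdim=True)
print(probs_right.sum(1))  # tensor([1., 1.])
# keepdim=False gives shape (2,), which broadcasts like a row vector:
# entry [i, j] gets divided by the sum of row j instead of row i.
# The code runs without an error, but the result is silently wrong.
probs_wrong = M / M.sum(1)
print(probs_wrong.sum(1))  # tensor([0.6250, 1.2500]) -- rows no longer sum to 1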
Sampling from our model
We've created our probabilistic model! Maybe it's not obvious yet, but we've already done all the essential parts. Now you might ask: "How can we get new names from this model?". You start from the start token (the dot) and work iteratively to predict each next character. What I mean by "work iteratively to predict" is: given the current character, look at the row of P where it is the preceding character, sample from that distribution to get the next character, then repeat with that new character until we hit the dot again, which marks the end of the word.
All of that is neatly nested in our code below:
# We use a Generator in PyTorch so the sampling is reproducible
g = torch.Generator().manual_seed(2147483647)
# Generate the first 50 names
for i in range(50):
    out = []
    ix = 0
    while True:
        # Get the distribution for the current character and sample the next one from it
        p = P[ix]
        ix = torch.multinomial(p, num_samples = 1, replacement = True, generator = g).item()
        out.append(itos[ix])
        # End the loop when we see a dot
        if ix == 0:
            break
    print(''.join(out))
If you use the same seed for the Generator as I did, you will get this sequence of names:
cexze.
momasurailezitynn.
konimittain.
llayn.
ka.
da.
staiyaubrtthrigotai.
...
That is disappointing! But some of the names are acceptable, and we cannot expect too much from this simple model.
Evaluation
We need some metric to evaluate this model, and don't even think about MSE, MAE, or RMSE; those are not designed for probabilities, let alone language.
When dealing with probabilities, a few evaluation ideas should come to mind. One of them is Maximum Likelihood Estimation.
Maximum Likelihood Estimation (MLE)
This is also a topic that deserves a blog post of its own, given how widely it is applied in ML/DL and beyond. To be honest, I don't know this topic too well myself, so I will just scratch the surface here.
The likelihood of the data is a kind of function; it is subtly different from a probability, and in many tasks we want to maximize it to find the best model. For our purposes, you just need to know that it is the product of a bunch of probabilities: the probability the model assigns to each bigram it sees in the data.
But there is a scaling problem if you look closely: each probability ranges from 0 to 1, so when we multiply lots of them together we get an extremely small result that the computer eventually cannot represent. That is called numerical underflow.
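Here is a tiny illustration of my own (the numbers are made up, not from the video): multiplying many small probabilities underflows to zero, while summing their logs stays perfectly representable.
import math
# Multiplying many probabilities in (0, 1] quickly underflows to 0.0
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value, 1e-400, is too small for a float
# Summing their logs instead is no problem at all
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921.03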
To solve the problem, we just work with the log of the likelihood instead. We also compute the negative log-likelihood; it doesn't add much in today's task, but it will be of immense importance in the next post (when we use a neural network). And that's it!
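For our character-level bigram model, this means:
$$\log L = \sum_{(ch_1,\, ch_2)} \log P(ch_2 \mid ch_1), \qquad \text{NLL} = -\log L$$
where the sum runs over every bigram in the dataset; we will also report the NLL averaged over the number of bigrams, which is easier to compare across datasets of different sizes.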
# Calculate the log_likelihood
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')
So we have our log-likelihood, and we normalize the negative log-likelihood by the number of bigrams we summed over.
log_likelihood=tensor(-559891.7500)
nll=tensor(559891.7500)
2.454094171524048
Not really bad! This is about the best we can do with this char-to-char prediction setup; perhaps we expected too much from it.
Perplexity
Another metric we can use is Perplexity (PP or PPL), which shares its underlying idea with the definition of entropy.
We can think of this metric as the level of surprise the model experiences when it sees the next word. If the model does well and assigns a high probability to that word, it is less surprised when the word appears, and vice versa. So a good model is the one that is least surprised by the test set (or we can say it is less perplexed, which I think is where the term perplexity comes from).
So how can we measure surprise? High probability means low surprise, and vice versa, so we can think of an inverse relationship, and that's exactly what researchers do! They take the inverse of the probability of the whole sequence, together with a normalization by its length.
Here's the formula:
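$$\text{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
Since we computed the average negative log-likelihood with natural logarithms, the perplexity is just its exponential. A quick check of my own (not from the video), reusing the nll and n we already have:
# Perplexity is exp(average NLL) when the NLL uses natural logs
ppl = torch.exp(nll / n)
print(ppl)  # roughly exp(2.454), i.e. about 11.6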
Smoothing
Our model is not perfect yet! What happens if we calculate the log-likelihood of a name containing a pair that the model has never seen before? Let's see:
# SMOOTHING
# Check this word and calculate the loss
for w in ['bidjz']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        logprob = torch.log(prob)
        log_likelihood += logprob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f} {logprob:.4f}')
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
print(f'{nll/n}')
Let's look at the result
.b: 0.0408 -3.1998
bi: 0.0820 -2.5005
id: 0.0249 -3.6946
dj: 0.0016 -6.4146
jz: 0.0000 -inf
z.: 0.0667 -2.7072
log_likelihood=tensor(-inf)
nll=tensor(inf)
inf
The log_likelihood goes to negative infinity! This is because when the model encounters a pair it has never seen before, the probability is zero, and hence so is the likelihood. Of course we don't want that, so we need to get rid of all the zeros.
The simplest way to do that is to add 1 to every entry of the N matrix (which is called Laplace smoothing; well, guess how brilliant he felt when he came up with that!).
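With this add-one smoothing, each bigram probability becomes
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$$
where $V$ is the vocabulary size (27 characters in our case), since adding 1 to every cell of a row also adds $V$ to that row's sum.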
# Just a small modification here
P = (N+1).float()
# And don't forget to re-normalize the rows afterwards
P = P / P.sum(1, keepdim = True)
And when we run our code again, we should have:
.b: 0.0408 -3.1999
bi: 0.0816 -2.5061
id: 0.0249 -3.6939
dj: 0.0018 -6.3141
jz: 0.0003 -7.9817
z.: 0.0664 -2.7122
log_likelihood=tensor(-559977.9375)
nll=tensor(559977.9375)
2.454407215118408
We're good for now!
Summary
For this model there actually isn't much more to say; the reason this post is so long is that I wanted to go deeper into some relevant topics as well, like the Markov assumption, tensors and broadcasting, and playing around with the model. All in all, I just wanted to wrap up the knowledge I learned from Andrej's video and share these cool things with you, my friend.
Thanks for reading, have a good day!
