DEV Community

omkar

My Notes: Makemore - Character Level Language Model

Introduction

A character-level language model predicts the next character given the previous characters.

Example: For 'isabella':

  • i likely comes first
  • s after i
  • a after is
  • b after isa, and so on

Representation: <START>isabella<END>


1. Loading the Dataset

```python
with open('names.txt', 'r') as file:
    words = file.read().splitlines()
```

Check dataset statistics:

```python
words[:10]
min(len(word) for word in words)
max(len(word) for word in words)
```

2. Bigram Language Model

Bigram language model: work with two characters at a time; given one character, predict the next.

Bigrams: Two characters in a row

Example with a single word

```python
word = words[0]

zips = zip(word, word[1:])
print(word, word[1:])
print(*zips)
```

Adding special tokens

```python
word, ['<S>'] + list(word) + ['<E>']
```

Extracting bigrams from multiple words

```python
# Two consecutive characters
for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        print(c1, c2)
```

Note: zip stops as soon as the shortest iterable is exhausted, so any trailing elements of the longer one are dropped.
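A quick sketch of that truncation, using the first name in the dataset:

```python
# zip stops at the shortest iterable: "emma" has 4 chars, "mma" has 3,
# so we get exactly 3 pairs and the final 'a' of "emma" is dropped
pairs = list(zip("emma", "mma"))
print(pairs)  # [('e', 'm'), ('m', 'm'), ('m', 'a')]
```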


3. Counting Bigrams

A simple way to learn a bigram model is to count how many times each bigram occurs in the training set.

Count bigrams for first 3 words

```python
b = {}

for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1
```

Count bigrams for all words

```python
# Now let's do this for all the words
b = {}

for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1

# Get (bigram, counts) tuples
items = b.items()
```

Sort by counts

```python
# sorted() by default sorts by the first element of each item (here the bigram),
# so we pass a key to sort by count instead
sorted_by_counts_asc = sorted(items, key=lambda kv: kv[1])
sorted_by_counts_desc = sorted(items, key=lambda kv: -kv[1])
```

4. 2D Count Array with PyTorch

Goal: Put counts in a 2D array where:

  • Rows correspond to the first character of the bigram
  • Columns correspond to the second character
  • Each entry counts how many times that bigram appears

```python
import torch

# 26 letters of the alphabet plus 2 special tokens <S> and <E>,
# so we need a (28, 28) array

# Count array
N = torch.zeros((28, 28), dtype=torch.int32)
```

Creating character lookup tables

We need a lookup table from characters to integers so that we can index into the tensor.

```python
# ''.join(words) concatenates the whole dataset into one big string;
# set() removes duplicates, giving the set of unique characters
chars = sorted(list(set(''.join(words))))

# Lookup table: character -> integer
stoi = {s: i for i, s in enumerate(chars)}
stoi['<S>'] = 26
stoi['<E>'] = 27
```

Populate the count array

```python
# Map both chars to their integers
for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
```

5. Visualizing Bigram Counts

```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(N)
```

Detailed visualization with labels

```python
itos = {i: s for s, i in stoi.items()}

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')

for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')

plt.axis('off');
```

Observations:

  • The <E> row (the last row) is entirely zero, because <E> never appears as the first character of a bigram
  • The <S> column is entirely zero, because <S> never appears as the second character of a bigram
  • The only bigram that could use both, <S><E>, corresponds to a word with no letters, which never occurs

6. Using Special Token '.'

Solution: use a single special token '.' for both the start and the end of a word, shrinking the table to 27×27.

```python
N = torch.zeros((27, 27), dtype=torch.int32)

stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

# Map both chars to their integers
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')

for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')

plt.axis('off');
```

Observations:

  • The first row shows the counts of characters that start a word
  • The first column shows the counts of characters that end a word

7. Converting Counts to Probabilities

```python
# first row
N[0]

# Convert these counts to probabilities
p = N[0].float()
p /= p.sum()
```

Interpretation: this gives the probability of each character being the first letter of a word.


8. Sampling from Probability Distribution

Understanding torch.multinomial

```python
# Deterministic way of creating a torch generator object
h = torch.Generator().manual_seed(2147483647)

# We use generator object as source of randomness in following function
p = torch.rand(3, generator=h)  # Gives 3 random numbers between 0 and 1

p = p / p.sum()
```

Each index will be sampled in proportion to its probability.

```python
# Use torch.multinomial to draw samples from the distribution p above
# Samples indexes according to p
torch.multinomial(p, num_samples=100, replacement=True, generator=h)
```

Sampling first character

Now we sample from the first row of our count matrix in the same way.

```python
# Convert these counts to probabilities
p = N[0].float()
p = p / p.sum()
print(p)
g = torch.Generator().manual_seed(2147483647)
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
```

Note: this result differs from the one in the video, probably due to changes in the library itself.

Sampling next character

Since the first sampled character is 'c', we move to the row for 'c' (index 3).

```python
p = N[3].float()
p /= p.sum()

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
```

9. Generating Words with Loop

Algorithm:

  1. Set ix = 0 (the '.' token), so we start by sampling from row 0 (starting characters)
  2. Loop:
    • Sample the next character index from row ix
    • Set ix to the sampled index; stop once we sample '.' again

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = N[ix].float()
        p /= p.sum()

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Result: as you can see, the bigram samples are terrible and we should do better. Still, bigrams beat an untrained model.


10. Comparison with Uniform Sampling

Following model samples uniformly from 27 characters:

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        # p = N[ix].float()
        # p /= p.sum()
        p = torch.ones(27)
        p /= 27

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

This is garbage. Bigrams are one step better than this, but still terrible.


11. Broadcasting and Efficient Normalization

Don't normalize every row on every sampling step; instead, compute all the probabilities once, so that each row of P holds the distribution over the 27 characters given the previous character.

Understanding torch.sum with dimensions

torch.sum(input, dim, keepdim) (method form: P.sum(dim, keepdim=True)):

  • When given dim, the sum is performed across that dimension
  • dim=0 (for a [27, 27] matrix): sum across the rows, i.e. down each column
    • Vertical sum, resulting in a [1, 27] row vector
  • dim=1: sum across the columns, i.e. along each row
    • Horizontal sum, resulting in a [27, 1] column vector
  • keepdim=False: the result is a (27,) 1D vector
  • keepdim=True: the result is a [1, 27] or [27, 1] 2D matrix

```python
P = N.float()
P.sum(dim=1, keepdim=True).shape  # [27, 1]

P = N.float()
P /= P.sum(dim=1, keepdim=True)
P.shape  # [27, 27]

P[0].sum()  # Should be 1.0
```

Broadcasting Rules

Broadcasting in PyTorch follows specific rules; see the broadcasting semantics page in the PyTorch docs.

Rule 1: Align all dimensions from the right:

```
[27, 27]
[27]
→
[27, 27]
[    27]
```

Rule 2: Iterate over all dimensions starting from right to left. Each dimension must be: equal, one of them is 1, or one of them does not exist.

Example 1: Broadcasting with [27, 1]

```
[27, 27]
[27, 1]
```

Broadcasting copies [27, 1] vector 27 times to make it [27, 27] where last dimension is now matched. Then it does element-wise division.

Example 2: Why keepdim=False causes issues

P.sum(dim=1, keepdim=False) sums across the columns (giving row sums), and the output shape is [27]: a 1D vector, not a matrix.

Consider P / P.sum(1):

```
[27, 27]
[27]
```

After alignment:

```
[27, 27]
[ 1, 27]
```

The [1, 27] row vector gets copied 27 times down the rows to match the first dimension.

Problem: P.sum(1) with keepdim=False is a [27] vector holding the sum of each row.

  • We want to divide each element of row i by the sum of row i
  • After broadcasting, the row sums line up along a single row and get copied downward, so element (i, j) is divided by the sum of row j, not row i
  • That is, the first entry of every row is divided by the sum of row 0, the second entry by the sum of row 1, and so on
  • The columns end up scaled by the wrong row sums, and no row sums to 1
  • This is not our desired behavior
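The bug can be demonstrated on a small made-up 2×2 matrix:

```python
import torch

M = torch.tensor([[1., 3.],
                  [2., 4.]])

good = M / M.sum(1, keepdim=True)  # [2, 1] column vector: each row divided by its own sum
bad  = M / M.sum(1)                # [2] aligns as [1, 2]: entry (i, j) divided by row j's sum

print(good.sum(1))  # rows are proper distributions: tensor([1., 1.])
print(bad.sum(1))   # rows no longer sum to 1
```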

Lesson: Have respect for broadcasting, check your work, understand how it works under the hood, and make sure broadcasting is working in the direction that you want, otherwise you'll introduce very subtle and hard to detect bugs.

Using probability matrix for sampling

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

We get exactly the same results as above, without having to normalize at every iteration.

Note:

  • P = P / P.sum(...) creates a new tensor and rebinds the name P
  • P /= P.sum(...) operates in place
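A quick check of the difference, using the storage address to tell the two apart:

```python
import torch

P = torch.ones(3)
addr = P.data_ptr()          # address of the underlying storage

P /= P.sum()                 # in-place: the same storage is modified
same_storage = (P.data_ptr() == addr)

P = P / P.sum()              # out-of-place: P now names a brand-new tensor
new_storage = (P.data_ptr() != addr)

print(same_storage, new_storage)  # True True
```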

12. Model Summary

So, now we have trained a bigram model by counting frequency of pairs and then normalizing counts to get probability distribution.

Elements of P are really the parameters of our bigram model, summarizing statistics of bigrams.


13. Evaluating Quality of Model

Using Negative Log Likelihood

Now we need to summarize the quality of this trained model in a single number, i.e. how good the model is at predicting the training set.

One such number is the training loss, which tells us how well the model fits the training data.

```python
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        print(f'{ch1}{ch2}: {prob:.4f}')
```

Interpretation:

  • These are the probabilities the model assigns to each bigram in the dataset
  • If every bigram were equally likely, each probability would be 1/27 ≈ 0.0370, roughly 4%
  • Any probability above 4% means we have learned something useful from the bigram statistics
  • The model assigns fairly good probabilities to what's in the training set (some are 4%, some 17%, 35%, 40%)
  • A very good model would assign probabilities near 1 to every bigram in the training set

Maximum Likelihood Estimation

In the literature this is called maximum likelihood estimation.

Likelihood = the product of the probabilities above.

The likelihood is the probability the trained model assigns to the entire dataset.

For a good model, this product should be as high as possible.

For convenience we use log of probs:

  • log(1) = 0
  • As we go below 1, log falls to negative values till log(0) = -inf
  • log(a*b*c) = log(a) + log(b) + log(c)

If all probs are near 1:

  • log likelihood → 0
  • If probs are near 0, log likelihood → more negative

We want a loss function that we can minimize, so we negate the log likelihood:

  • When probs → 1: log likelihood → 0
  • When probs → 1: -log likelihood → 0, which is the behavior we want from a loss (and it grows as the probs shrink)
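These properties can be checked numerically with a few made-up probabilities:

```python
import math

probs = [0.4, 0.17, 0.35]            # hypothetical per-bigram probabilities
likelihood = 0.4 * 0.17 * 0.35       # product over the "dataset"
log_likelihood = sum(math.log(p) for p in probs)

# log of a product = sum of the logs
assert abs(math.log(likelihood) - log_likelihood) < 1e-9

nll = -log_likelihood                # always >= 0; 0 only when every prob is 1
print(nll / len(probs))              # average NLL, the loss we will minimize
```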

Computing Negative Log Likelihood

```python
log_likelihood = 0.0
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
```

Why NLL is a good loss function:

  • It's always ≥ 0
  • When the probabilities are near 1, it's near 0
  • As the probabilities move away from 1, it grows away from 0
  • The higher the NLL, the worse the predictions

For convenience, we use average negative log likelihood.

Average Negative Log Likelihood

```python
# Average negative log likelihood
log_likelihood = 0.0
n = 0

for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

Thus we use average negative log likelihood as our loss function.

Our aim is to minimize this loss to get high quality model.

Optimization Goal

GOAL: Maximize likelihood of the data wrt model parameters (statistical modelling)

(Later these parameters (counts here) will be calculated by a NN and we want to tune these parameters to maximize likelihood of training data)

Equivalences:

  • Maximize likelihood
  • ≡ Maximize log likelihood (because log is a monotonic function)
  • ≡ Minimize negative log likelihood
  • ≡ Minimize average negative log likelihood

Loss on Entire Training Set

```python
# Average negative log likelihood for the entire training set
log_likelihood = 0.0
n = 0

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

Testing on New Data

```python
# Test on a name not in dataset
log_likelihood = 0.0
n = 0

for word in ["andrejq"]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

14. Laplace Smoothing

Problem: if any count is 0:

  • p(bigram) = 0
  • log p(bigram) = -inf
  • -log p(bigram) = +inf
  • Any sequence containing that bigram then gets infinite loss

Solution: Laplace smoothing: add 1 to every count so that no count is 0.

```python
P = (N+1).float()  # Add one here for smoothing
P /= P.sum(dim=1, keepdim=True)

g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Test with smoothing:

```python
# NLL for a held-out name, now finite thanks to smoothing
log_likelihood = 0.0
n = 0

for word in ['andndrejq']:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

15. Neural Network Approach

So far we have trained a bigram character-level model by counting, normalized the counts into probability distributions, sampled from those distributions to generate words, and evaluated the model using the negative log likelihood.

Now we frame the character-level bigram model in a NN framework:

  • Input: one character
  • Output: a probability distribution over the next character
  • We have bigrams as the training set, so we know the next character given the first
  • We can evaluate the model on this data
  • The NN outputs a probability distribution over the next character; we have targets and a loss function: the negative log likelihood
  • The model should assign high probability to the correct next character, i.e. the loss should be low

Creating Training Set

Let's first create a training set of all bigrams (x, y):

```
(x, y)
x: input (int)
y: target (int)

Given x, predict y
```

```python
# create training set of bigrams (x, y)

xs, ys = [], []

for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

xs, ys
```

16. One-Hot Encoding

Note: xs and ys hold character indexes, which are integers.

Problem with integers as input:

  • Feeding raw integers into a NN is not recommended
  • We multiply inputs by weights, which are floats, so the inputs should be floats
  • Integers imply a numerical relationship between indexes
  • If 'a' has index 1 and 'b' has index 2, then numerically 'b' is "greater than" 'a'
  • Character 'm' (index 13) is not "halfway" between 'a' (index 1) and 'y' (index 25)
  • All characters should be treated as equally distinct from each other
  • Characters are categorical data, not continuous

Solution: the common way to encode categorical integers is one-hot encoding.

```python
import torch.nn.functional as F

xenc = F.one_hot(xs, num_classes=27)
xenc.shape  # torch.Size([5, 27])

plt.imshow(xenc)
```

Interpretation:

  • 5 examples (inputs) encoded as 5 row vectors
  • We will feed each such example to the NN
  • We want the inputs to be floats that can take on a range of values (integers can't)

```python
xenc.dtype  # torch.int64

# Cast one-hot vectors to float dtype
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc.shape, xenc.dtype)
plt.imshow(xenc)
```

17. Neural Network Forward Pass

Single Neuron

```python
# initialize weights for a single neuron that will receive the vectors above
W = torch.randn((27, 1))
xenc @ W  # Output: (5, 1)
```

@ is the matrix multiplication operator in PyTorch.

Here we fed all 5 inputs to this neuron, and it produced activations of shape (5, 1).

27 Neurons (Full Layer)

```python
# neurons stacked as columns
W = torch.randn((27, 27))
xenc @ W  # Output: (5, 27)
```

We can efficiently compute the activations for a whole batch by stacking the inputs as rows and multiplying by a weight matrix whose columns are the individual neurons' weights.

Network Architecture

Our NN for now will be:

  • A 27-dimensional input
  • 27 neurons in a single linear layer, whose outputs we turn into a probability distribution over the next character

We will treat the 27 output numbers as log counts (not integer counts, since a NN should not output integers):

  • Log counts are also called logits

From Logits to Probabilities

So how we interpret 27 numbers: they are log counts.

Exponentiate log counts and you get counts.

Exponential function:

  • Negative numbers map to (0, 1)
  • 0 maps to 1, and positive numbers map to (1, +inf)

So exponentials are good candidates for counts: never negative, and able to take on a wide range of values depending on the setting of W.

```python
(xenc @ W).exp()
```

These numbers can be interpreted as equivalent of counts.


Complete Transformation

```python
logits = xenc @ W  # log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)

probs.shape  # (5, 27)
probs[1].sum()  # Should be 1.0
```

So for every one of the 5 examples we now have a row that came out of the NN, and the transformations above ensure the outputs can be interpreted as probabilities.

All of the above operations are differentiable, so we can backpropagate through them.

Process:

  • Fed . by getting its index
  • One hot encoded it
  • Fed to NN
  • Output prob dist after transformations

These probs are NN's assignment of prob for next character.

We now want to tune W such that good probs are output.


18. Training the Neural Network

Setup

```python
# create training set
xs, ys = [], []

for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

# Randomly initialize 27 neurons' weights, each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
logits = xenc @ W  # predict log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character

# Last two lines here are together called a 'softmax'
```

Softmax

Softmax: Takes logits, exponentiates them and normalizes them.

It takes the outputs of the NN, which can be positive or negative, and produces a probability distribution: only positive numbers, summing to one, just like probabilities.
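Packaged as a standalone function (my naming, not from the notebook), the operation looks like this; torch.softmax(logits, dim=1) computes the same thing:

```python
import torch

def softmax(logits: torch.Tensor) -> torch.Tensor:
    # exponentiate the logits into positive "counts",
    # then normalize each row into a probability distribution
    counts = logits.exp()
    return counts / counts.sum(dim=1, keepdim=True)

logits = torch.randn(5, 27)
probs = softmax(logits)
print(probs.shape)  # torch.Size([5, 27])
```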

Computing Loss

```python
nlls = torch.zeros(5)
for i in range(5):
    # i-th bigram:
    x = xs[i].item()  # input character index
    y = ys[i].item()  # label character index
    print('--------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')
    print('input to the neural net:', x)
    print('output probabilities from the neural net:', probs[i])
    print('label (actual next character):', y)
    p = probs[i, y]
    print('probability assigned by the net to the correct character:', p.item())
    logp = torch.log(p)
    print('log likelihood:', logp.item())
    nll = -logp
    print('negative log likelihood:', nll.item())
    nlls[i] = nll

print('=========')
print('average negative log likelihood, i.e. loss =', nlls.mean().item())
```

This is not a very good setting of W, as our loss (average negative log likelihood) is much higher than 0.

This loss is made up of differentiable functions, so we can minimize the loss by tuning W parameters.


19. Gradient-Based Optimization

Efficient Loss Computation

```python
# Probs to calculate loss
probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]

# Better way to index
probs[torch.arange(5), ys]
```

These are the probabilities the NN assigns to the correct next character.

```python
# Loss
loss = -probs[torch.arange(5), ys].log().mean()
```

Training Loop Setup

```python
# Randomly initialize 27 neurons' weights, each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)  # By default requires_grad is False
```

Forward and Backward Pass

```python
# Forward pass
xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
logits = xenc @ W  # predict log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
loss = -probs[torch.arange(5), ys].log().mean()

# Backward pass
# Make sure all gradients are reset to 0
# Setting grads to None is efficient; PyTorch interprets it as the absence
# of a gradient, which is the same as zeros
W.grad = None
loss.backward()

# Update
W.data += -0.1 * W.grad
```

Note: a low loss means the network is assigning high probabilities to the correct targets.


20. Full Training on Dataset

Create Full Dataset

```python
# create the dataset
xs, ys = [], []

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('Number of examples:', num)

# Initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
```

Gradient Descent

```python
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
    logits = xenc @ W  # predict log-counts
    counts = logits.exp()  # counts equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -50 * W.grad
```

Results:

  • The counting-based model earlier achieved a loss of around 2.47 (roughly 2.45 before smoothing)
  • So we have reached the same performance with gradient-based optimization

Comparison:

  • Counting was straightforward and fast for this problem; we could keep all the probabilities in a table
  • But the NN is the more flexible approach

Future improvements:

  • What we can do now is to complexify the NN by feeding multiple previous characters into increasingly complex neural nets
  • But output of NN will always just be logits, which will go through exact same transformation as above
  • The only thing that will change is how we do forward pass, everything else remains same

21. Neural Network vs Counting Approach

Scalability

If we condition on multiple previous characters, keeping a counts table for every combination becomes infeasible; the table grows exponentially with context length, so the approach doesn't scale.

The NN approach, on the other hand, is scalable and can be improved over time.

Mathematical Equivalence

```python
logits = xenc @ W
```

Key insight: Multiplying one hot vector of say 5th character, with W, plucks out 5th row of W, because of how matrix multiplication works.

So logits just become 5th row of W.
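This row-plucking can be checked directly (27 classes as in the notebook; index 5 is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)
xenc = F.one_hot(torch.tensor([5]), num_classes=27).float()

logits = xenc @ W  # shape (1, 27)
# the one-hot matmul simply selects row 5 of W
print(torch.allclose(logits[0], W[5]))  # True
```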

In the counting approach:

  • Given a first character, say the one with index 5
  • We would go to row 5 of N, which gives the probability distribution for the next character
  • So the first character served as a lookup into the matrix N

A similar thing happens in the NN:

  • We take index, encode one hot and multiply by W
  • So logits become appropriate row of W
  • Which are then exponentiated into counts and normalized into probability

Conclusion: at the end of optimization, W.exp() plays the same role as the count array N (the rows agree up to a per-row scale factor, since shifting a row of logits does not change the softmax output).

  • N was filled in by counting
  • W was initialized randomly, and the loss guided it toward effectively the same array as N

22. Regularization

Smoothing Equivalence

In smoothing, if we add 10000 to every count, where max count was around 900, then every count will approximately look the same and upon normalization we will have nearly same prob for each character, i.e. we would get a uniform distribution.

Same thing can happen in NN approach:

  • W initialized to all 0s
  • logits become all 0s
  • counts = logits.exp() become all 1s
  • probs = count/count.sum(1, keepdim=True) become all uniform

So incentivizing W to be near 0 is equivalent to the Laplace-style smoothing we applied to the counts.

The more you incentivize this in the loss function, the smoother (more uniform) the resulting distribution.
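The all-zero case is easy to verify as a sketch:

```python
import torch

W = torch.zeros(27, 27)
xenc = torch.eye(27)                         # all 27 one-hot inputs at once
logits = xenc @ W                            # all zeros
counts = logits.exp()                        # all ones
probs = counts / counts.sum(1, keepdim=True)

# every row is the uniform distribution 1/27
print(probs[0].min().item(), probs[0].max().item())
```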

Regularization Loss

This is called regularization: we add a small extra component to the loss, called the regularization loss.

This is done by adding a term like (W**2).mean() to the loss function.

  • The regularization term is 0 only when W is exactly the zero matrix
  • Any non-zero entries in W accumulate loss
  • A regularization parameter controls how strongly this term affects the total loss
  • During optimization, this component pulls all the weights toward 0

So:

  • W wants to be 0
  • Probs want to be uniform
  • But also match up your training data

The regularization strength plays the same role as the added count in Laplace smoothing.


23. Sampling from Neural Network

```python
# Finally sample from the model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    out = []
    ix = 0

    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W  # predict log-counts
        counts = logits.exp()  # counts equivalent to N
        p = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Result: we get the same samples as from the bigram counting model.

So these are fundamentally the same model; we just arrived at it in two different ways, with two different interpretations.


24. Extension to Trigrams (Work in Progress)

Creating Trigram Dataset

```python
# create the dataset: the input is a pair of characters, the target is the third
xs, ys = [], []

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.shape[0]  # number of examples (xs.nelement() would count both characters of each pair)
print('Number of examples:', num)
print('shape of xs:', xs.shape)

# Initialize the network: the input is two one-hot vectors
# concatenated (2 * 27 = 54), so W must be (54, 27)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((54, 27), generator=g, requires_grad=True)
```

One-Hot Encoding for Pairs

```python
# one-hot of a pair has shape (2, 27); we flatten it to 54 before feeding the net
xs[1], F.one_hot(xs[1], num_classes=27)
```

Trigram Training

```python
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float().view(-1, 2 * 27)  # (num, 54)
    logits = xenc @ W  # predict log-counts
    counts = logits.exp()  # counts equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
    loss = -probs[torch.arange(xs.shape[0]), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -50 * W.grad
```

Summary

This notebook covered:

  1. Bigram Language Model: Building character-level model using counting
  2. Probability Distributions: Converting counts to probabilities
  3. Sampling: Generating new names from the model
  4. Evaluation: Using negative log likelihood as loss function
  5. Neural Network Approach: Reimplementing bigrams using gradient descent
  6. One-Hot Encoding: Proper way to feed categorical data to neural networks
  7. Softmax: Converting logits to probabilities
  8. Regularization: Smoothing probabilities and preventing overfitting
  9. Equivalence: Understanding how counting and NN approaches are fundamentally the same

The key insight is that while counting is straightforward for bigrams, the neural network approach is more scalable and can be extended to handle longer context (trigrams, n-grams, etc.) and more complex architectures.
