DEV Community

omkar

My Notes: Makemore - Character Level Language Model

Introduction

A character-level language model predicts the next character given the previous characters.

Example: For 'isabella':

  • i likely comes first
  • s after i
  • a after is
  • b after isa, and so on

Representation: <START>isabella<END>


1. Loading the Dataset

```python
with open('names.txt', 'r') as file:
    words = file.read().splitlines()
```

Check dataset statistics:

```python
words[:10]
min(len(word) for word in words)
max(len(word) for word in words)
```

2. Bigram Language Model

Bigram language model: work with two characters at a time; given one character, predict the next.

Bigrams: Two characters in a row

Example with a single word

```python
word = words[0]

zips = zip(word, word[1:])
print(word, word[1:])
print(*zips)
```

Adding special tokens

```python
word, ['<S>'] + list(word) + ['<E>']
```

Extracting bigrams from multiple words

```python
# Two consecutive characters
for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        print(c1, c2)
```

Note: zip stops as soon as the shortest iterable is exhausted, so any trailing elements of the longer one are dropped.
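A quick sketch of that truncation, using the first name in the dataset:

```python
# zip stops at the shortest iterable: "emma" has 4 chars, "mma" has 3,
# so we get exactly 3 pairs and the final 'a' of "emma" is dropped
pairs = list(zip("emma", "mma"))
print(pairs)  # [('e', 'm'), ('m', 'm'), ('m', 'a')]
```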


3. Counting Bigrams

A simple way to learn a bigram model is to count how many times each bigram occurs in the training set.

Count bigrams for first 3 words

```python
b = {}

for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1
```

Count bigrams for all words

```python
# Now let's do this for all the words
b = {}

for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1

# Get (bigram, counts) tuples
items = b.items()
```

Sort by counts

```python
# sorted() by default sorts by the first element of each item (here the bigram),
# so we pass a key to sort by count instead
sorted_by_counts_asc = sorted(items, key=lambda kv: kv[1])
sorted_by_counts_desc = sorted(items, key=lambda kv: -kv[1])
```

4. 2D Count Array with PyTorch

Goal: Put counts in a 2D array where:

  • Rows correspond to the first character of the bigram
  • Columns correspond to the second character
  • Each entry counts how many times that bigram appears

```python
import torch

# 26 letters of the alphabet plus 2 special tokens <S> and <E>,
# so we need a (28, 28) array

# Count array
N = torch.zeros((28, 28), dtype=torch.int32)
```

Creating character lookup tables

We need a lookup table from characters to integers so that we can index into the tensor.

```python
# ''.join(words) concatenates the whole dataset into one big string;
# set() removes duplicates, giving the set of unique characters
chars = sorted(list(set(''.join(words))))

# Lookup table: character -> integer
stoi = {s: i for i, s in enumerate(chars)}
stoi['<S>'] = 26
stoi['<E>'] = 27
```

Populate the count array

```python
# Map both chars to their integers
for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
```

5. Visualizing Bigram Counts

```python
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(N)
```

Detailed visualization with labels

```python
itos = {i: s for s, i in stoi.items()}

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')

for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')

plt.axis('off');
```

Observations:

  • The <E> row (the last row) is entirely zero, because <E> never appears as the first character of a bigram
  • The <S> column is entirely zero, because <S> never appears as the second character of a bigram
  • The only bigram that could use both, <S><E>, corresponds to a word with no letters, which never occurs

6. Using Special Token '.'

Solution: use a single special token '.' for both the start and the end of a word, shrinking the table to 27×27.

```python
N = torch.zeros((27, 27), dtype=torch.int32)

stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

# Map both chars to their integers
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1

plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')

for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')

plt.axis('off');
```

Observations:

  • The first row shows the counts of characters that start a word
  • The first column shows the counts of characters that end a word

7. Converting Counts to Probabilities

```python
# first row
N[0]

# Convert these counts to probabilities
p = N[0].float()
p /= p.sum()
```

Interpretation: this gives the probability of each character being the first letter of a word.


8. Sampling from Probability Distribution

Understanding torch.multinomial

```python
# Deterministic way of creating a torch generator object
h = torch.Generator().manual_seed(2147483647)

# We use generator object as source of randomness in following function
p = torch.rand(3, generator=h)  # Gives 3 random numbers between 0 and 1

p = p / p.sum()
```

Each index will be sampled in proportion to its probability.

```python
# Use torch.multinomial to draw samples from the distribution p above
# Samples indexes according to p
torch.multinomial(p, num_samples=100, replacement=True, generator=h)
```

Sampling first character

Now we sample from the first row of our count matrix in the same way.

```python
# Convert these counts to probabilities
p = N[0].float()
p = p / p.sum()
print(p)
g = torch.Generator().manual_seed(2147483647)
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
```

Note: this result differs from the one in the video, probably due to changes in the library itself.

Sampling next character

Since the first sampled character is 'c', we move to the row for 'c' (index 3).

```python
p = N[3].float()
p /= p.sum()

ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
```

9. Generating Words with Loop

Algorithm:

  1. Set ix = 0 (the '.' token), so we start by sampling from row 0 (starting characters)
  2. Loop:
    • Sample the next character index from row ix
    • Set ix to the sampled index; stop once we sample '.' again

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = N[ix].float()
        p /= p.sum()

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Result: as you can see, the bigram samples are terrible and we should do better. Still, bigrams beat an untrained model.


10. Comparison with Uniform Sampling

Following model samples uniformly from 27 characters:

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        # p = N[ix].float()
        # p /= p.sum()
        p = torch.ones(27)
        p /= 27

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

This is garbage. Bigrams are one step better than this, but still terrible.


11. Broadcasting and Efficient Normalization

Don't normalize every row on every sampling step; instead, compute all the probabilities once, so that each row of P holds the distribution over the 27 characters given the previous character.

Understanding torch.sum with dimensions

torch.sum(input, dim, keepdim) (method form: P.sum(dim, keepdim=True)):

  • When given dim, the sum is performed across that dimension
  • dim=0 (for a [27, 27] matrix): sum across the rows, i.e. down each column
    • Vertical sum, resulting in a [1, 27] row vector
  • dim=1: sum across the columns, i.e. along each row
    • Horizontal sum, resulting in a [27, 1] column vector
  • keepdim=False: the result is a (27,) 1D vector
  • keepdim=True: the result is a [1, 27] or [27, 1] 2D matrix

```python
P = N.float()
P.sum(dim=1, keepdim=True).shape  # [27, 1]

P = N.float()
P /= P.sum(dim=1, keepdim=True)
P.shape  # [27, 27]

P[0].sum()  # Should be 1.0
```

Broadcasting Rules

Broadcasting in PyTorch follows specific rules; see the broadcasting semantics page in the PyTorch docs.

Rule 1: Align all dimensions from the right:

```
[27, 27]
[27]
→
[27, 27]
[    27]
```

Rule 2: Iterate over all dimensions starting from right to left. Each dimension must be: equal, one of them is 1, or one of them does not exist.

Example 1: Broadcasting with [27, 1]

```
[27, 27]
[27, 1]
```

Broadcasting copies [27, 1] vector 27 times to make it [27, 27] where last dimension is now matched. Then it does element-wise division.

Example 2: Why keepdim=False causes issues

P.sum(dim=1, keepdim=False) sums across the columns (giving row sums), and the output shape is [27]: a 1D vector, not a matrix.

Consider P / P.sum(1):

```
[27, 27]
[27]
```

After alignment:

```
[27, 27]
[ 1, 27]
```

The [1, 27] row vector gets copied 27 times down the rows to match the first dimension.

Problem: P.sum(1) with keepdim=False is a [27] vector holding the sum of each row.

  • We want to divide each element of row i by the sum of row i
  • After broadcasting, the row sums line up along a single row and get copied downward, so element (i, j) is divided by the sum of row j, not row i
  • That is, the first entry of every row is divided by the sum of row 0, the second entry by the sum of row 1, and so on
  • The columns end up scaled by the wrong row sums, and no row sums to 1
  • This is not our desired behavior
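The bug can be demonstrated on a small made-up 2×2 matrix:

```python
import torch

M = torch.tensor([[1., 3.],
                  [2., 4.]])

good = M / M.sum(1, keepdim=True)  # [2, 1] column vector: each row divided by its own sum
bad  = M / M.sum(1)                # [2] aligns as [1, 2]: entry (i, j) divided by row j's sum

print(good.sum(1))  # rows are proper distributions: tensor([1., 1.])
print(bad.sum(1))   # rows no longer sum to 1
```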

Lesson: Have respect for broadcasting, check your work, understand how it works under the hood, and make sure broadcasting is working in the direction that you want, otherwise you'll introduce very subtle and hard to detect bugs.

Using probability matrix for sampling

```python
g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

We get exactly the same results as above, without having to normalize at every iteration.

Note:

  • P = P / P.sum(...) creates a new tensor and rebinds the name P
  • P /= P.sum(...) operates in place
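A quick check of the difference, using the storage address to tell the two apart:

```python
import torch

P = torch.ones(3)
addr = P.data_ptr()          # address of the underlying storage

P /= P.sum()                 # in-place: the same storage is modified
same_storage = (P.data_ptr() == addr)

P = P / P.sum()              # out-of-place: P now names a brand-new tensor
new_storage = (P.data_ptr() != addr)

print(same_storage, new_storage)  # True True
```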

12. Model Summary

So, now we have trained a bigram model by counting frequency of pairs and then normalizing counts to get probability distribution.

Elements of P are really the parameters of our bigram model, summarizing statistics of bigrams.


13. Evaluating Quality of Model

Using Negative Log Likelihood

Now we need to summarize the quality of this trained model in a single number, i.e. how good the model is at predicting the training set.

One such number is the training loss, which tells us how well the model fits the training data.

```python
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        print(f'{ch1}{ch2}: {prob:.4f}')
```

Interpretation:

  • These are the probabilities the model assigns to each bigram in the dataset
  • If every bigram were equally likely, each probability would be 1/27 ≈ 0.0370, roughly 4%
  • Any probability above 4% means we have learned something useful from the bigram statistics
  • The model assigns fairly good probabilities to what's in the training set (some are 4%, some 17%, 35%, 40%)
  • A very good model would assign probabilities near 1 to every bigram in the training set

Maximum Likelihood Estimation

In the literature this is called maximum likelihood estimation.

Likelihood = the product of the probabilities above.

The likelihood is the probability the trained model assigns to the entire dataset.

For a good model, this product should be as high as possible.

For convenience we use log of probs:

  • log(1) = 0
  • As we go below 1, log falls to negative values till log(0) = -inf
  • log(a*b*c) = log(a) + log(b) + log(c)

If all probs are near 1:

  • log likelihood → 0
  • If probs are near 0, log likelihood → more negative

We want a loss function that we can minimize, so we negate the log likelihood:

  • When probs → 1: log likelihood → 0
  • When probs → 1: -log likelihood → 0, which is the behavior we want from a loss (and it grows as the probs shrink)
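These properties can be checked numerically with a few made-up probabilities:

```python
import math

probs = [0.4, 0.17, 0.35]            # hypothetical per-bigram probabilities
likelihood = 0.4 * 0.17 * 0.35       # product over the "dataset"
log_likelihood = sum(math.log(p) for p in probs)

# log of a product = sum of the logs
assert abs(math.log(likelihood) - log_likelihood) < 1e-9

nll = -log_likelihood                # always >= 0; 0 only when every prob is 1
print(nll / len(probs))              # average NLL, the loss we will minimize
```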

Computing Negative Log Likelihood

```python
log_likelihood = 0.0
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
```

Why NLL is a good loss function:

  • It's always ≥ 0
  • When the probabilities are near 1, it's near 0
  • As the probabilities move away from 1, it grows away from 0
  • The higher the NLL, the worse the predictions

For convenience, we use average negative log likelihood.

Average Negative Log Likelihood

```python
# Average negative log likelihood
log_likelihood = 0.0
n = 0

for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

Thus we use average negative log likelihood as our loss function.

Our aim is to minimize this loss to get high quality model.

Optimization Goal

GOAL: Maximize likelihood of the data wrt model parameters (statistical modelling)

(Later these parameters (counts here) will be calculated by a NN and we want to tune these parameters to maximize likelihood of training data)

Equivalences:

  • Maximize likelihood
  • ≡ Maximize log likelihood (because log is a monotonic function)
  • ≡ Minimize negative log likelihood
  • ≡ Minimize average negative log likelihood

Loss on Entire Training Set

```python
# Average negative log likelihood for the entire training set
log_likelihood = 0.0
n = 0

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

Testing on New Data

```python
# Test on a name not in dataset
log_likelihood = 0.0
n = 0

for word in ["andrejq"]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

14. Laplace Smoothing

Problem: if any count is 0:

  • p(bigram) = 0
  • log p(bigram) = -inf
  • -log p(bigram) = +inf
  • Any sequence containing that bigram then gets infinite loss

Solution: Laplace smoothing: add 1 to every count so that no count is 0.

```python
P = (N+1).float()  # Add one here for smoothing
P /= P.sum(dim=1, keepdim=True)

g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0

    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Test with smoothing:

```python
# NLL for a held-out name, now finite thanks to smoothing
log_likelihood = 0.0
n = 0

for word in ['andndrejq']:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1

print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
```

15. Neural Network Approach

So far we have trained a bigram character-level model by counting, normalized the counts into probability distributions, sampled from those distributions to generate words, and evaluated the model using the negative log likelihood.

Now we frame the character-level bigram model in a NN framework:

  • Input: one character
  • Output: a probability distribution over the next character
  • We have bigrams as the training set, so we know the next character given the first
  • We can evaluate the model on this data
  • The NN outputs a probability distribution over the next character; we have targets and a loss function: the negative log likelihood
  • The model should assign high probability to the correct next character, i.e. the loss should be low

Creating Training Set

Let's first create a training set of all bigrams (x, y):

```
(x, y)
x: input (int)
y: target (int)

Given x, predict y
```

```python
# create training set of bigrams (x, y)

xs, ys = [], []

for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

xs, ys
```

16. One-Hot Encoding

Note: xs and ys hold character indexes, which are integers.

Problem with integers as input:

  • Feeding raw integers into a NN is not recommended
  • We multiply inputs by weights, which are floats, so the inputs should be floats
  • Integers imply a numerical relationship between indexes
  • If 'a' has index 1 and 'b' has index 2, then numerically 'b' is "greater than" 'a'
  • Character 'm' (index 13) is not "halfway" between 'a' (index 1) and 'y' (index 25)
  • All characters should be treated as equally distinct from each other
  • Characters are categorical data, not continuous

Solution: the common way to encode categorical integers is one-hot encoding.

```python
import torch.nn.functional as F

xenc = F.one_hot(xs, num_classes=27)
xenc.shape  # torch.Size([5, 27])

plt.imshow(xenc)
```

Interpretation:

  • 5 examples (inputs) encoded as 5 row vectors
  • We will feed each such example to the NN
  • We want the inputs to be floats that can take on a range of values (integers can't)

```python
xenc.dtype  # torch.int64

# Cast one-hot vectors to float dtype
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc.shape, xenc.dtype)
plt.imshow(xenc)
```

17. Neural Network Forward Pass

Single Neuron

```python
# initialize weights for a single neuron that will receive the vectors above
W = torch.randn((27, 1))
xenc @ W  # Output: (5, 1)
```

@ is the matrix multiplication operator in PyTorch.

Here we fed all 5 inputs to this neuron, and it produced activations of shape (5, 1).

27 Neurons (Full Layer)

```python
# neurons stacked as columns
W = torch.randn((27, 27))
xenc @ W  # Output: (5, 27)
```

We can efficiently compute the activations for a whole batch by stacking the inputs as rows and multiplying by a weight matrix whose columns are the individual neurons' weights.

Network Architecture

Our NN for now will be:

  • A 27-dimensional input
  • 27 neurons in a single linear layer, whose outputs we turn into a probability distribution over the next character

We will treat the 27 output numbers as log counts (not integer counts, since a NN should not output integers):

  • Log counts are also called logits

From Logits to Probabilities

So how we interpret 27 numbers: they are log counts.

Exponentiate log counts and you get counts.

Exponential function:

  • Negative numbers map to (0, 1)
  • 0 maps to 1, and positive numbers map to (1, +inf)

So exponentials are good candidates for counts: never negative, and able to take on a wide range of values depending on the setting of W.

```python
(xenc @ W).exp()
```

These numbers can be interpreted as equivalent of counts.


Complete Transformation

```python
logits = xenc @ W  # log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)

probs.shape  # (5, 27)
probs[1].sum()  # Should be 1.0
```

So for every one of the 5 examples we now have a row that came out of the NN, and the transformations above ensure the outputs can be interpreted as probabilities.

All of the above operations are differentiable, so we can backpropagate through them.

Process:

  • Fed . by getting its index
  • One hot encoded it
  • Fed to NN
  • Output prob dist after transformations

These probs are NN's assignment of prob for next character.

We now want to tune W such that good probs are output.


18. Training the Neural Network

Setup

```python
# create training set
xs, ys = [], []

for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)

# Randomly initialize 27 neurons' weights, each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
logits = xenc @ W  # predict log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character

# Last two lines here are together called a 'softmax'
```

Softmax

Softmax: Takes logits, exponentiates them and normalizes them.

It takes the outputs of the NN, which can be positive or negative, and produces a probability distribution: only positive numbers, summing to one, just like probabilities.
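Packaged as a standalone function (my naming, not from the notebook), the operation looks like this; torch.softmax(logits, dim=1) computes the same thing:

```python
import torch

def softmax(logits: torch.Tensor) -> torch.Tensor:
    # exponentiate the logits into positive "counts",
    # then normalize each row into a probability distribution
    counts = logits.exp()
    return counts / counts.sum(dim=1, keepdim=True)

logits = torch.randn(5, 27)
probs = softmax(logits)
print(probs.shape)  # torch.Size([5, 27])
```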

Computing Loss

```python
nlls = torch.zeros(5)
for i in range(5):
    # i-th bigram:
    x = xs[i].item()  # input character index
    y = ys[i].item()  # label character index
    print('--------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')
    print('input to the neural net:', x)
    print('output probabilities from the neural net:', probs[i])
    print('label (actual next character):', y)
    p = probs[i, y]
    print('probability assigned by the net to the correct character:', p.item())
    logp = torch.log(p)
    print('log likelihood:', logp.item())
    nll = -logp
    print('negative log likelihood:', nll.item())
    nlls[i] = nll

print('=========')
print('average negative log likelihood, i.e. loss =', nlls.mean().item())
```

This is not a very good setting of W, as our loss (average negative log likelihood) is much higher than 0.

This loss is made up of differentiable functions, so we can minimize the loss by tuning W parameters.


19. Gradient-Based Optimization

Efficient Loss Computation

```python
# Probs to calculate loss
probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]

# Better way to index
probs[torch.arange(5), ys]
```

These are the probabilities the NN assigns to the correct next character.

```python
# Loss
loss = -probs[torch.arange(5), ys].log().mean()
```

Training Loop Setup

```python
# Randomly initialize 27 neurons' weights, each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)  # By default requires_grad is False
```

Forward and Backward Pass

```python
# Forward pass
xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
logits = xenc @ W  # predict log-counts
counts = logits.exp()  # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
loss = -probs[torch.arange(5), ys].log().mean()

# Backward pass
# Make sure all gradients are reset to 0
# Setting grads to None is efficient; PyTorch interprets it as the absence
# of a gradient, which is the same as zeros
W.grad = None
loss.backward()

# Update
W.data += -0.1 * W.grad
```

Note: a low loss means the network is assigning high probabilities to the correct targets.


20. Full Training on Dataset

Create Full Dataset

```python
# create the dataset
xs, ys = [], []

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('Number of examples:', num)

# Initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
```

Gradient Descent

```python
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()  # input to network one hot encoded
    logits = xenc @ W  # predict log-counts
    counts = logits.exp()  # counts equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -50 * W.grad
```

Results:

  • The counting-based model earlier achieved a loss of around 2.47 (roughly 2.45 before smoothing)
  • So we have reached the same performance with gradient-based optimization

Comparison:

  • Counting was straightforward and fast for this problem; we could keep all the probabilities in a table
  • But the NN is the more flexible approach

Future improvements:

  • What we can do now is to complexify the NN by feeding multiple previous characters into increasingly complex neural nets
  • But output of NN will always just be logits, which will go through exact same transformation as above
  • The only thing that will change is how we do forward pass, everything else remains same

21. Neural Network vs Counting Approach

Scalability

If we condition on multiple previous characters, keeping a counts table for every combination becomes infeasible; the table grows exponentially with context length, so the approach doesn't scale.

The NN approach, on the other hand, is scalable and can be improved over time.

Mathematical Equivalence

```python
logits = xenc @ W
```

Key insight: Multiplying one hot vector of say 5th character, with W, plucks out 5th row of W, because of how matrix multiplication works.

So logits just become 5th row of W.
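This row-plucking can be checked directly (27 classes as in the notebook; index 5 is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)
xenc = F.one_hot(torch.tensor([5]), num_classes=27).float()

logits = xenc @ W  # shape (1, 27)
# the one-hot matmul simply selects row 5 of W
print(torch.allclose(logits[0], W[5]))  # True
```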

In the counting approach:

  • Given a first character, say the one with index 5
  • We would go to row 5 of N, which gives the probability distribution for the next character
  • So the first character served as a lookup into the matrix N

A similar thing happens in the NN:

  • We take index, encode one hot and multiply by W
  • So logits become appropriate row of W
  • Which are then exponentiated into counts and normalized into probability

Conclusion: at the end of optimization, W.exp() plays the same role as the count array N (the rows agree up to a per-row scale factor, since shifting a row of logits does not change the softmax output).

  • N was filled in by counting
  • W was initialized randomly, and the loss guided it toward effectively the same array as N

22. Regularization

Smoothing Equivalence

In smoothing, if we add 10000 to every count, where max count was around 900, then every count will approximately look the same and upon normalization we will have nearly same prob for each character, i.e. we would get a uniform distribution.

Same thing can happen in NN approach:

  • W initialized to all 0s
  • logits become all 0s
  • counts = logits.exp() become all 1s
  • probs = count/count.sum(1, keepdim=True) become all uniform

So incentivizing W to be near 0 is equivalent to the Laplace-style smoothing we applied to the counts.

The more you incentivize this in the loss function, the smoother (more uniform) the resulting distribution.
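The all-zero case is easy to verify as a sketch:

```python
import torch

W = torch.zeros(27, 27)
xenc = torch.eye(27)                         # all 27 one-hot inputs at once
logits = xenc @ W                            # all zeros
counts = logits.exp()                        # all ones
probs = counts / counts.sum(1, keepdim=True)

# every row is the uniform distribution 1/27
print(probs[0].min().item(), probs[0].max().item())
```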

Regularization Loss

This is called regularization: we add a small extra component to the loss, called the regularization loss.

This is done by adding a term like (W**2).mean() to the loss function.

  • The regularization term is 0 only when W is exactly the zero matrix
  • Any non-zero entries in W accumulate loss
  • A regularization parameter controls how strongly this term affects the total loss
  • During optimization, this component pulls all the weights toward 0

So:

  • W wants to be 0
  • Probs want to be uniform
  • But also match up your training data

The regularization strength plays the same role as the added count in Laplace smoothing.


23. Sampling from Neural Network

```python
# Finally sample from the model
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    out = []
    ix = 0

    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W  # predict log-counts
        counts = logits.exp()  # counts equivalent to N
        p = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])

        if ix == 0:
            break

    print(''.join(out))
```

Result: we get the same samples as from the bigram counting model.

So these are fundamentally the same model; we just arrived at it in two different ways, with two different interpretations.


24. Extension to Trigrams (Work in Progress)

Creating Trigram Dataset

```python
# create the dataset: the input is a pair of characters, the target is the third
xs, ys = [], []

for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.shape[0]  # number of examples (xs.nelement() would count both characters of each pair)
print('Number of examples:', num)
print('shape of xs:', xs.shape)

# Initialize the network: the input is two one-hot vectors
# concatenated (2 * 27 = 54), so W must be (54, 27)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((54, 27), generator=g, requires_grad=True)
```

One-Hot Encoding for Pairs

```python
# one-hot of a pair has shape (2, 27); we flatten it to 54 before feeding the net
xs[1], F.one_hot(xs[1], num_classes=27)
```

Trigram Training

```python
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float().view(-1, 2 * 27)  # (num, 54)
    logits = xenc @ W  # predict log-counts
    counts = logits.exp()  # counts equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True)  # Probabilities for next character
    loss = -probs[torch.arange(xs.shape[0]), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())

    # backward pass
    W.grad = None
    loss.backward()

    # update
    W.data += -50 * W.grad
```

Summary

This notebook covered:

  1. Bigram Language Model: Building character-level model using counting
  2. Probability Distributions: Converting counts to probabilities
  3. Sampling: Generating new names from the model
  4. Evaluation: Using negative log likelihood as loss function
  5. Neural Network Approach: Reimplementing bigrams using gradient descent
  6. One-Hot Encoding: Proper way to feed categorical data to neural networks
  7. Softmax: Converting logits to probabilities
  8. Regularization: Smoothing probabilities and preventing overfitting
  9. Equivalence: Understanding how counting and NN approaches are fundamentally the same

The key insight is that while counting is straightforward for bigrams, the neural network approach is more scalable and can be extended to handle longer context (trigrams, n-grams, etc.) and more complex architectures.
