Introduction
A character-level language model predicts the next character given the previous characters.
Example: For 'isabella':
- i likely comes first
- s after i
- a after is
- b after isa, and so on
Representation: <START>isabella<END>
1. Loading the Dataset
with open('names.txt', 'r') as file:
    words = file.read().splitlines()
Check dataset statistics:
words[:10]
min(len(word) for word in words)
max(len(word) for word in words)
2. Bigram Language Model
A bigram language model works with two characters at a time: given one character, it predicts the next.
Bigrams: Two characters in a row
Example with a single word
word = words[0]
zips = zip(word, word[1:])
print(word, word[1:])
print(*zips)
Adding special tokens
word, ['<S>'] + list(word) + ['<E>']
Extracting bigrams from multiple words
# Two consecutive characters
for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        print(c1, c2)
Note: zip stops as soon as the shorter of its iterables is exhausted.
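A quick check of that truncation behavior, using plain standard-library Python ("emma" here is just an example word):

```python
# zip stops as soon as the shorter iterable runs out, so the last
# character of the word never appears as the first element of a pair
word = "emma"  # example word; any string works
pairs = list(zip(word, word[1:]))
print(pairs)  # [('e', 'm'), ('m', 'm'), ('m', 'a')]
```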
3. Counting Bigrams
A simple way to learn a bigram model is to count how many times each bigram occurs in the training set.
Count bigrams for first 3 words
b = {}
for word in words[:3]:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1
Count bigrams for all words
# Now lets do this for all the words
b = {}
for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for c1, c2 in zip(chs, chs[1:]):
        bigram = (c1, c2)
        b[bigram] = b.get(bigram, 0) + 1
# Get (bigram, counts) tuples
items = b.items()
Sort by counts
# sort by count
# by default, sorted() sorts tuples by their first element — here that would be the bigram itself
sorted_by_counts_asc = sorted(items, key=lambda kv: kv[1])
sorted_by_counts_desc = sorted(items, key=lambda kv: -kv[1])
4. 2D Count Array with PyTorch
Goal: Put counts in a 2D array where:
- Rows are the first char of the bigram
- Columns are the second char of the bigram
- Each entry counts how often that bigram appears
import torch
# 26 letters of alphabet and 2 special tokens <S> and <E>
# so we need (28, 28) array for above purpose
# Count array
N = torch.zeros((28, 28), dtype=torch.int32)
Creating character lookup tables
We need a lookup table from characters to integers so that we can index into the tensor.
# Set of all lowercase characters
# This joins all dataset into one big string and set() removes all duplicate characters from that string
# This way we have set of unique characters in dataset.
chars = sorted(list(set(''.join(words))))
# Lookup table
stoi = {s: i for i, s in enumerate(chars)}
stoi['<S>'] = 26
stoi['<E>'] = 27
Populate the count array
# Map both chars to their integers
for word in words:
    chs = ['<S>'] + list(word) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
5. Visualizing Bigram Counts
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(N)
Detailed visualization with labels
itos = {i: s for s, i in stoi.items()}
plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')
plt.axis('off');
Observations:
- The last row is entirely zero because <E> never comes first in a bigram
- One column is entirely zero because <S> never comes second in a bigram
- The only combination using both positions would be <S><E>, i.e. a word with no letters, which never occurs
6. Using Special Token '.'
Solution: use a single special token '.' for both the start and the end of a word.
N = torch.zeros((27, 27), dtype=torch.int32)
stoi = {s: i+1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}
# Map both chars to their integers
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1
plt.figure(figsize=(16,16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        chstr = itos[i] + itos[j]
        plt.text(j, i, chstr, ha='center', va='bottom', color='grey')
        plt.text(j, i, N[i, j].item(), ha='center', va='top', color='grey')
plt.axis('off');
Observations:
- First row shows counts of chars that start the word
- First column shows count of chars that end the word
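Both observations can be sanity-checked on a tiny hypothetical word list (a sketch, not the real names.txt): each word contributes exactly one starting and one ending bigram, so the first row and the first column must each sum to the number of words.

```python
import torch

words = ["emma", "ava"]  # tiny stand-in dataset (hypothetical)
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for word in words:
    chs = ["."] + list(word) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# first row = starting-character counts, first column = ending-character counts
print(N[0].sum().item(), N[:, 0].sum().item())  # 2 2
```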
7. Converting Counts to Probabilities
# first row
N[0]
# Convert these counts to probabilities
p = N[0].float()
p /= p.sum()
Interpretation: This gives the probability that a given character is the first letter of a word.
8. Sampling from Probability Distribution
Understanding torch.multinomial
# Deterministic way of creating a torch generator object
h = torch.Generator().manual_seed(2147483647)
# We use generator object as source of randomness in following function
p = torch.rand(3, generator=h) # Gives 3 random numbers between 0 and 1
p = p / p.sum()
Each index is sampled in proportion to its probability.
# Use torch multinomial to draw samples from above p
# Samples indexes according to p
torch.multinomial(p, num_samples=100, replacement=True, generator=h)
Sampling first character
Now we sample from the first row of N in the same way.
# Convert these counts to probabilities
p = N[0].float()
p = p / p.sum()
print(p)
g = torch.Generator().manual_seed(2147483647)
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
Note: the sampled result may differ from the one in the video, likely due to changes in the library itself.
Sampling next character
Now that our first sampled char is 'c', we go to row that starts with 'c'.
p = N[3].float()
p /= p.sum()
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
ix, itos[ix]
9. Generating Words with Loop
Algorithm:
- Set ix = 0: sample from the 0th row (starting characters)
- Then in a loop:
  - Sample the next char from row ix
  - Set ix to the sampled char's index
  - Stop when ix == 0 again (the end token)
g = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        p = N[ix].float()
        p /= p.sum()
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
Result: As you can see, the bigram samples are terrible and we should do better. Still, bigrams beat an untrained model.
10. Comparison with Uniform Sampling
Following model samples uniformly from 27 characters:
g = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        # p = N[ix].float()
        # p /= p.sum()
        p = torch.ones(27)
        p /= 27
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
This is garbage, so the bigram model is one step better than uniform sampling, but still terrible.
11. Broadcasting and Efficient Normalization
Don't normalize each row on every iteration; instead compute all the probabilities at once, so that every row of the matrix holds a probability distribution over the 27 characters given the previous character.
Understanding torch.sum with dimensions
P.sum(dim, keepdim=True):
- When given dim, the sum is performed across that dimension
- dim=0 (rows of a [27, 27] matrix): sums across rows, i.e. a vertical sum, collapsing to a [1, 27] row vector
- dim=1 (columns of a [27, 27] matrix): sums across columns, i.e. a horizontal sum, collapsing to a [27, 1] column vector
- keepdim=False: the result is a (27,) 1D vector
- keepdim=True: the result stays 2D, [1, 27] or [27, 1]
P = N.float()
P.sum(dim=1, keepdim=True).shape # [27, 1]
P = N.float()
P /= P.sum(dim=1, keepdim=True)
P.shape # [27, 27]
P[0].sum() # Should be 1.0
Broadcasting Rules
Broadcasting follows specific rules in PyTorch; see the documentation.
Rule 1: Align all dimensions from the right:
[27, 27]
[27]
→
[27, 27]
[ 27]
Rule 2: Iterate over all dimensions starting from right to left. Each dimension must be: equal, one of them is 1, or one of them does not exist.
Example 1: Broadcasting with [27, 1]
[27, 27]
[27, 1]
Broadcasting (conceptually) copies the [27, 1] column vector 27 times across to form a [27, 27] matrix, so the last dimension matches; then it does element-wise division.
Example 2: Why keepdim=False causes issues
P.sum(dim=1, keepdim=False) sums across columns, but the output shape is a [27] vector, not a matrix.
Consider P / P.sum(1):
[27, 27]
[27]
After alignment:
[27, 27]
[ 1, 27]
[1, 27] is a row vector that gets copied 27 times downward to match the first dimension.
Problem: P.sum(1) is a vector of row sums, now laid out along the columns
- We want to divide each element of a row by the sum of that row
- The broadcast instead lines the row sums up along the columns
- So the first entry of every row is divided by the sum of the first row, the second entry by the sum of the second row, and so on
- i.e. we end up dividing column-wise instead of normalizing the rows
- This is not our desired behavior
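The pitfall is easy to demonstrate on a tiny made-up 2x2 count matrix (values are arbitrary):

```python
import torch

P = torch.tensor([[1.0, 3.0],
                  [2.0, 6.0]])

good = P / P.sum(1, keepdim=True)  # [2, 1] broadcasts across columns: each row sums to 1
bad = P / P.sum(1)                 # [2] aligns as [1, 2]: entries get divided by the wrong sums

print(good.sum(1))  # tensor([1., 1.])
print(bad.sum(1))   # rows no longer sum to 1
```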
Lesson: have respect for broadcasting. Check your work, understand how it operates under the hood, and make sure it is broadcasting in the direction you want; otherwise you'll introduce very subtle, hard-to-detect bugs.
Using probability matrix for sampling
g = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
We get the exact same results as above, without having to normalize on every iteration.
Note:
- P = P / P.sum() creates a new tensor and rebinds the name P to it
- P /= P.sum() operates in place on the existing tensor
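A minimal sketch of that difference, using id() to track object identity:

```python
import torch

P = torch.ones(3)
before = id(P)

P /= P.sum()            # in-place: the same tensor object is modified
print(id(P) == before)  # True

P = P / P.sum()         # out-of-place: the name P is rebound to a brand new tensor
print(id(P) == before)  # False
```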
12. Model Summary
So now we have trained a bigram model by counting the frequency of pairs and normalizing the counts into probability distributions.
Elements of P are really the parameters of our bigram model, summarizing statistics of bigrams.
13. Evaluating Quality of Model
Using Negative Log Likelihood
Now we need to summarize the quality of this trained model in a single number, i.e. how good the model is at predicting the training set.
One such number is the training loss, which tells us how well the model fits the training data.
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        print(f'{ch1}{ch2}: {prob:.4f}')
Interpretation:
- These are the probabilities the model assigned to each bigram in the dataset
- If every bigram were equally likely, each probability would be 1/27 ≈ 0.0370, roughly 4%
- Any probability above 4% means we have learned something useful from the bigram statistics
- The model assigns pretty good probabilities to what's in the training set (some are 4%, some 17%, 35%, 40%)
- A very good model would assign probabilities near 1 to every bigram in the training set
Maximum Likelihood Estimation
In literature, we use Maximum Likelihood Estimation.
Likelihood = product of above probs
Likelihood tells us probability of entire dataset assigned by the trained model.
Product of these probs should be as high as possible to have a good model.
For convenience we use the log of the probs:
- log(1) = 0
- As probabilities go below 1, the log falls to negative values, down to log(0) = -inf
- log(a*b*c) = log(a) + log(b) + log(c)
If all probs are near 1, the log likelihood → 0; if probs are near 0, the log likelihood → more and more negative.
We want a loss function we can minimize, so we negate the log likelihood:
- When probs → 1: log likelihood → 0, so -log likelihood → 0 (low loss, what we want for a good model)
- When probs → 0: -log likelihood → +inf (high loss for a bad model)
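Both properties can be checked numerically with the standard library (the probability values below are arbitrary examples):

```python
import math

# -log(p) approaches 0 as p -> 1 and grows without bound as p -> 0
for p in [0.99, 0.5, 0.04, 0.001]:
    print(f"p={p}: -log(p)={-math.log(p):.3f}")

# log turns a product of probabilities into a sum of logs
a, b, c = 0.4, 0.17, 0.35
print(math.isclose(math.log(a * b * c), math.log(a) + math.log(b) + math.log(c)))  # True
```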
Computing Negative Log Likelihood
log_likelihood = 0.0
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll=}')
Why NLL is a good loss function:
- It's always ≥ 0
- When probs are near 1, it's near to 0
- When probs are away from 1, it increases away from 0
- The higher the NLL, the worse the predictions
For convenience, we use average negative log likelihood.
Average Negative Log Likelihood
# Average negative log likelihood
log_likelihood = 0.0
n = 0
for word in words[:3]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
        print(f'{ch1}{ch2}: {prob:.4f}, {log_prob}')
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
Thus we use average negative log likelihood as our loss function.
Our aim is to minimize this loss to get high quality model.
Optimization Goal
GOAL: Maximize likelihood of the data wrt model parameters (statistical modelling)
(Later these parameters (counts here) will be calculated by a NN and we want to tune these parameters to maximize likelihood of training data)
Equivalences:
- Maximize likelihood
- ≡ Maximize log likelihood (because log is a monotonic function)
- ≡ Minimize negative log likelihood
- ≡ Minimize average negative log likelihood
Loss on Entire Training Set
# Average negative log likelihood for the entire training set
log_likelihood = 0.0
n = 0
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
Testing on New Data
# Test on a name not in the dataset
log_likelihood = 0.0
n = 0
for word in ["andrejq"]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
14. Laplace Smoothing
Problem: If any count is 0:
- p((ai, aj)) = 0
- log(p(ai, aj)) = -inf
- -log(p(ai, aj)) = inf
- A single zero-probability bigram gives the entire sequence infinite loss
Solution: Add Laplace smoothing - Add 1 to every count so no count is 0.
P = (N+1).float() # Add one here for smoothing
P /= P.sum(dim=1, keepdim=True)
g = torch.Generator().manual_seed(2147483647)
for i in range(10):
    out = []
    ix = 0
    while True:
        p = P[ix]
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
Test with smoothing:
# Average negative log likelihood on a made-up name, with smoothing
log_likelihood = 0.0
n = 0
for word in ['andndrejq']:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        prob = P[ix1, ix2]
        log_prob = torch.log(prob)
        log_likelihood += log_prob
        n += 1
print(f'{log_likelihood=}')
nll = -log_likelihood
print(f'{nll/n=}')
15. Neural Network Approach
So far we trained a bigram character-level model by counting, normalized the counts into probability distributions, sampled from those distributions to generate words, and evaluated the model using the negative log likelihood.
Now we frame character level bigram model in NN framework:
- It takes one character as input
- It outputs a probability distribution over the next character
- We have bigrams as the training set, so we know the next character given the first
- We can evaluate the model against these targets
- The NN outputs a distribution over the next char; with targets in hand, our loss function is the negative log likelihood
- The model should assign high probability to the correct next char, i.e. the loss should be low
Creating Training Set
Let's first create training set of all bigrams (x, y):
(x, y)
x: input (int)
y: target (int)
Given x, predict y
# create a training set of bigrams (x, y)
xs, ys = [], []
for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
xs, ys
16. One-Hot Encoding
Note: xs and ys hold character indexes, which are integers.
Problem with integers as input:
- It's not recommended to input an integer to NN
- We multiply them with weights which are floats, so they should be float
- Integers imply numerical relationship between indexes
- If 'a' index is 1 and 'b' index is 2, numerically 'b' is greater than 'a'
- Character 'm' (index 13) is not "halfway" between 'a' (index 1) and 'y' (index 25)
- All characters should be treated as equally distinct from each other
- Characters are categorical data, not continuous
Solution: Common way of encoding integers is one-hot encoding.
import torch.nn.functional as F
xenc = F.one_hot(xs, num_classes=27)
xenc.shape # torch.Size([5, 27])
plt.imshow(xenc)
Interpretation:
- 5 examples (inputs) encoded as 5 row vectors
- We will feed each such example to NN
- We want the inputs to be floats, since they will be multiplied by float weights
xenc.dtype # torch.int64
# Cast one hot vectors to float dtype
xenc = F.one_hot(xs, num_classes=27).float()
print(xenc.shape, xenc.dtype)
plt.imshow(xenc)
17. Neural Network Forward Pass
Single Neuron
# initialize Weights for a single neuron that will input above vectors
W = torch.randn((27, 1))
xenc @ W # Output: (5, 1)
@ is matrix multiplication operator in pytorch.
Here we fed all 5 inputs to this neuron and it produced its activations (5, 1).
27 Neurons (Full Layer)
# neurons stacked as columns
W = torch.randn((27, 27))
xenc @ W # Output: (5, 27)
We can efficiently calculate activations by passing inputs stacked as rows as a batch and multiplying them with weights of neurons stacked as columns.
Network Architecture
Our NN for now will be:
- A 27-dimensional input
- 27 neurons in a single linear layer
We will treat the 27 output numbers as logs of counts (not integer counts, because a NN should not output an int):
- Logs of counts are also called logits
From Logits to Probabilities
So how do we interpret the 27 numbers? They are log-counts.
Exponentiate the log-counts and you get counts.
The exponential function:
- maps negative numbers into (0, 1)
- maps 0 to exactly 1
- maps positive numbers into (1, +inf)
So exponentials are good candidates for counts: never below 0, and able to take on a range of values depending on the setting of W.
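A quick numerical check of those ranges, using only the standard library:

```python
import math

print(math.exp(-3.0))  # between 0 and 1
print(math.exp(0.0))   # exactly 1.0
print(math.exp(3.0))   # greater than 1, growing without bound for larger inputs
```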
(xenc @ W).exp()
These numbers can be interpreted as the equivalent of counts.
Complete Transformation
logits = xenc @ W # log-counts
counts = logits.exp() # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True)
probs.shape # (5, 27)
probs[1].sum() # Should be 1.0
So for every one of the 5 examples we have a row output by the NN, and the transformations above ensure each row can be interpreted as probabilities.
All of the above operations are differentiable, so we can backpropagate through them.
Process:
- Fed in '.' by taking its index
- One-hot encoded it
- Fed it to the NN
- Got a probability distribution out after the transformations
These probs are NN's assignment of prob for next character.
We now want to tune W such that good probs are output.
18. Training the Neural Network
Setup
# create the training set
xs, ys = [], []
for word in words[:1]:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
# Randomly initialize 27 neurons' weights; each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)
xenc = F.one_hot(xs, num_classes=27).float() # input to the network, one-hot encoded
logits = xenc @ W # predict log-counts
counts = logits.exp() # counts, equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True) # probabilities for the next character
# The last two lines together are called a 'softmax'
Softmax
Softmax: Takes logits, exponentiates them and normalizes them.
It takes the outputs of the NN, which can be positive or negative, and produces a probability distribution: something that sums to one and contains only positive numbers, just like probabilities.
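As a sketch, the same softmax can be written as a standalone helper (PyTorch also ships this as F.softmax, whose implementation additionally subtracts the max logit for numerical stability):

```python
import torch

def softmax(logits: torch.Tensor) -> torch.Tensor:
    # exponentiate the logits into "counts", then normalize each row
    counts = logits.exp()
    return counts / counts.sum(dim=1, keepdim=True)

logits = torch.tensor([[2.0, 0.0, -1.0]])  # arbitrary example logits
probs = softmax(logits)
print(probs)  # all entries positive, each row sums to 1
```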
Computing Loss
nlls = torch.zeros(5)
for i in range(5):
    # i-th bigram:
    x = xs[i].item() # input character index
    y = ys[i].item() # label character index
    print('--------')
    print(f'bigram example {i+1}: {itos[x]}{itos[y]} (indexes {x},{y})')
    print('input to the neural net:', x)
    print('output probabilities from the neural net:', probs[i])
    print('label (actual next character):', y)
    p = probs[i, y]
    print('probability assigned by the net to the correct character:', p.item())
    logp = torch.log(p)
    print('log likelihood:', logp.item())
    nll = -logp
    print('negative log likelihood:', nll.item())
    nlls[i] = nll
print('=========')
print('average negative log likelihood, i.e. loss =', nlls.mean().item())
This is not a very good setting of W, as our loss (average negative log likelihood) is much higher than 0.
This loss is made up of differentiable functions, so we can minimize the loss by tuning W parameters.
19. Gradient-Based Optimization
Efficient Loss Computation
# Probs to calculate loss
probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]
# Better way to index
probs[torch.arange(5), ys]
These are probs that NN assigns to correct next character.
# Loss
loss = -probs[torch.arange(5), ys].log().mean()
Training Loop Setup
# Randomly initialize 27 neurons' weights, each neuron receives 27 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True) # By default requires_grad is False
Forward and Backward Pass
# Forward pass
xenc = F.one_hot(xs, num_classes=27).float() # input to network one hot encoded
logits = xenc @ W # predict log-counts
counts = logits.exp() # counts equivalent to N
probs = counts / counts.sum(dim=1, keepdim=True) # Probabilities for next character
loss = -probs[torch.arange(5), ys].log().mean()
# Backward pass
# Make sure all gradients are reset to 0
# Setting grads to None is efficient and interpreted by pytorch as lack of gradient and is same as 0s
W.grad = None
loss.backward()
# Update
W.data += -0.1 * W.grad
Note: Having low loss means network is assigning high probs to correct targets.
20. Full Training on Dataset
Create Full Dataset
# create the dataset
xs, ys = [], []
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        xs.append(ix1)
        ys.append(ix2)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('Number of examples:', num)
# Initialize the network
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)
Gradient Descent
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float() # input to the network, one-hot encoded
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True) # probabilities for the next character
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())
    # backward pass
    W.grad = None
    loss.backward()
    # update
    W.data += -50 * W.grad
Results:
- With the counting approach, the loss was around 2.47 (roughly 2.45 before smoothing)
- So we have achieved comparable performance with gradient-based optimization
Comparison:
- Counting was straightforward and fast for this problem; we could keep the probabilities in a table
- But the NN is a more flexible approach
Future improvements:
- What we can do now is to complexify the NN by feeding multiple previous characters into increasingly complex neural nets
- But output of NN will always just be logits, which will go through exact same transformation as above
- The only thing that will change is how we do forward pass, everything else remains same
21. Neural Network vs Counting Approach
Scalability
If we condition on multiple previous characters, keeping a count table for every combination becomes infeasible — the approach doesn't scale.
The NN approach, on the other hand, is scalable and can be improved over time.
Mathematical Equivalence
logits = xenc @ W
Key insight: Multiplying one hot vector of say 5th character, with W, plucks out 5th row of W, because of how matrix multiplication works.
So logits just become 5th row of W.
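This claim is easy to verify directly (a sketch with a freshly seeded random W, separate from the trained one above):

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W = torch.randn((27, 27), generator=g)

# a one-hot vector has a single 1, so the matmul selects that row of W
ix = 5
xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
print(torch.allclose(xenc @ W, W[ix:ix + 1]))  # True
```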
In the counting approach:
- We had a first char, say the 5th one
- We went to the 5th row of N, which gave us the prob dist for the next char
- So the first char was used as a lookup into the matrix N
Similar thing is happening in NN:
- We take index, encode one hot and multiply by W
- So logits become appropriate row of W
- Which are then exponentiated into counts and normalized into probability
Conclusion: W.exp() at the end of optimization plays the same role as the N array of counts.
- N was filled by counting
- W was initialized randomly, and the loss guided it toward the same array as N
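A minimal consistency check on a tiny made-up count matrix: if W holds the log-counts, the NN's softmax reproduces the counting model's probabilities exactly.

```python
import torch

N = torch.tensor([[1.0, 4.0],
                  [3.0, 2.0]])  # hypothetical 2-character count matrix
P_count = N / N.sum(1, keepdim=True)

W = N.log()  # pretend optimization arrived at the log-counts
P_nn = W.exp() / W.exp().sum(1, keepdim=True)

print(torch.allclose(P_count, P_nn))  # True
```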
22. Regularization
Smoothing Equivalence
In smoothing, if we add 10000 to every count (where the max count was around 900), every count looks approximately the same, and upon normalization we get nearly the same probability for each character, i.e. a uniform distribution.
The same thing can happen in the NN approach:
- W initialized to all 0s
- logits become all 0s
- counts = logits.exp() become all 1s
- probs = counts / counts.sum(1, keepdim=True) become uniform
So incentivizing W to be near 0 is equivalent to smoothing the counts.
The more you incentivize this in the loss function, the smoother (more uniform) the distribution you get.
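The zero-weight case is easy to check directly (a sketch; the input index 3 is arbitrary):

```python
import torch
import torch.nn.functional as F

W = torch.zeros((27, 27))
xenc = F.one_hot(torch.tensor([3]), num_classes=27).float()

logits = xenc @ W                             # all zeros
counts = logits.exp()                         # all ones
probs = counts / counts.sum(1, keepdim=True)  # uniform: every entry is 1/27

print(probs[0, 0].item())  # ≈ 0.0370
```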
Regularization Loss
This is called regularization where we can add small component to loss called regularization loss.
This is done by adding something like (W**2).mean() to loss function.
- You achieve 0 regularization loss only if W is exactly a zero matrix
- If W has non-zero entries, you accumulate loss
- A regularization parameter decides how strongly this component affects the total loss
- It pushes all the weights toward 0 during optimization
So:
- W wants to be 0
- Probs want to be uniform
- But also match up your training data
Regularization parameter is similar to addition of count in Laplace smoothing.
23. Sampling from Neural Network
# Finally sample from the model
g = torch.Generator().manual_seed(2147483647)
for i in range(5):
    out = []
    ix = 0
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        logits = xenc @ W # predict log-counts
        counts = logits.exp() # counts, equivalent to N
        p = counts / counts.sum(dim=1, keepdim=True) # probabilities for the next character
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:
            break
    print(''.join(out))
Result: we get the same samples as from the bigram counting model.
These are fundamentally the same model — we just arrived at it in two different ways, with two different interpretations.
24. Extension to Trigrams (Work in Progress)
Creating Trigram Dataset
# create the dataset
xs, ys = [], []
for word in words:
    chs = ['.'] + list(word) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        ix3 = stoi[ch3]
        xs.append((ix1, ix2))
        ys.append(ix3)
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.shape[0] # number of examples (xs.nelement() would count both indexes per example)
print('Number of examples:', num)
print('shape of xs:', xs.shape) # (num, 2)
# Initialize the network: the two one-hot vectors are concatenated,
# so each neuron receives 2*27 = 54 inputs
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((54, 27), generator=g, requires_grad=True)
One-Hot Encoding for Pairs
xs[1], F.one_hot(xs[1], num_classes=27)
Trigram Training
# gradient descent
for k in range(100):
    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float().view(-1, 54) # (num, 2, 27) -> (num, 54)
    logits = xenc @ W # predict log-counts
    counts = logits.exp() # counts, equivalent to N
    probs = counts / counts.sum(dim=1, keepdim=True) # probabilities for the next character
    loss = -probs[torch.arange(xs.shape[0]), ys].log().mean() + 0.01*(W**2).mean()
    print(loss.item())
    # backward pass
    W.grad = None
    loss.backward()
    # update
    W.data += -50 * W.grad
Summary
This notebook covered:
- Bigram Language Model: Building character-level model using counting
- Probability Distributions: Converting counts to probabilities
- Sampling: Generating new names from the model
- Evaluation: Using negative log likelihood as loss function
- Neural Network Approach: Reimplementing bigrams using gradient descent
- One-Hot Encoding: Proper way to feed categorical data to neural networks
- Softmax: Converting logits to probabilities
- Regularization: Smoothing probabilities and preventing overfitting
- Equivalence: Understanding how counting and NN approaches are fundamentally the same
The key insight is that while counting is straightforward for bigrams, the neural network approach is more scalable and can be extended to handle longer context (trigrams, n-grams, etc.) and more complex architectures.