Hưng Lê Tiến

Posted on Nov 30, 2025

BIG STEPS TO TRANSFORMER (PART 1): BUILDING THE BIGRAM

#deeplearning #tutorial #architecture #beginners

I'm back! It took me a while to grasp the basics of the Transformer, and we're now ready to jump right into that big boy. The days of playing with casual neural net stuff are gone. We shall behold the most modern architecture in this world, so let's break it down step by step.

This blog post is, again, based on Andrej Karpathy's series about neural nets. Big respect to him.

Now let's start our journey. We will start simple, really simple, by calling on our primitive language model again: the Bigram Language Model.

You may ask: What is the point of this? Relax, my friend. Even though the bigram is an extremely simple model, it provides a great setup for our big boy. To be more specific, in this blog we will rebuild the bigram in a more general way, which we will then adjust step by step until it eventually turns into a Transformer. So hold your beer for now.

Load the Data

We will consider a much bigger dataset: not a list of American names, but scripts from Shakespeare. Andrej seems to prefer it, so we will stick with that. And as you can probably guess, it is way larger than the corpus we used in the previous project.

Let's download the data for our Colab:

# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

And then read the data:

# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

Tokenize

Just like in all of our previous projects, we have to find a way to encode our text so that it can be fed into the neural net, and of course we also need a decode step to convert the numerical representation back into text. First, we store the unique characters so we can construct the mapping later, and also get the vocab_size:

# Construct the list of chars + vocab_size
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

Let's take a look at our chars, which are different from the ones in the previous projects:


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65

Alongside the alphabetical chars, we also get a digit and some special characters. The chars[0] is actually the newline character (\n), which is why the first printed line looks empty, and our vocab_size is now 65, not 27 anymore.
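If you want to convince yourself that the invisible first character really is a newline, repr() helps, since it prints whitespace as escape sequences. This is a minimal sketch on a short stand-in string (sample and sample_chars are names I made up for illustration; the real run would use text and chars):

```python
# Toy stand-in for the dataset (assumption: the real text lives in input.txt)
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# Same recipe as in the post: unique characters, sorted by code point
sample_chars = sorted(set(sample))

# repr() makes whitespace characters visible as escape sequences
print(repr(sample_chars[0]))  # the newline sorts first
print(repr(sample_chars[1]))  # the space comes right after it
```

The newline sorts first simply because it has the smallest code point of all characters in the text.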

Now we should create a mapping to convert all characters into integers, and then a function to encode a string into numbers (remember the inverse mapping and the decode step, too). Unlike the previous project, we now wrap these in dedicated encode and decode functions, because we are dealing with a lot more characters and things get complicated.

Here's the code:

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there

It works great! Now let's turn the text into integers:

# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:10]) # the first 10 characters of the text, as the model will see them

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])

Beautiful! Just one more note here: this process is called tokenization. The tokenization we're using is the simplest and most naive form, character-level tokenization. In practice, researchers have developed a whole bunch of methods for tokenizing text. The most successful one, and the one actually used in LLMs today, is sub-word tokenization (as implemented in SentencePiece from Google or tiktoken from OpenAI). There are a lot of things to say about tokenization, so I will do a whole new blog on that (promise, because Andrej has a video on it too).

Splitting the dataset

From our previous project, we learned that in order to train the model effectively and keep an eye on overfitting, we should split the dataset into subsets called "trainset", "devset", and "testset". We will apply the same idea here, but keep it simple and split the data into just two parts: one for training and one for validation.

# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
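A quick sanity check of the split, sketched on a toy tensor rather than the real data (toy_data, toy_train and toy_val are hypothetical names used only here):

```python
import torch

# Stand-in for the encoded text (assumption: 1000 tokens instead of ~1.1M)
toy_data = torch.arange(1000)

# Same 90/10 rule as in the post
n = int(0.9 * len(toy_data))
toy_train, toy_val = toy_data[:n], toy_data[n:]

print(len(toy_train), len(toy_val))

# The validation part starts exactly where the training part ends,
# so the two slices neither overlap nor leave a gap
boundary_ok = (toy_train[-1] + 1 == toy_val[0]).item()
print(boundary_ok)
```

Note that this is a split by position, not a random split: the validation set is simply the last 10% of the text.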

Creating the Context Window

Now, remember that in the previous model we had the block_size, which is simply the length of the chunk of text that we feed into our model. In this project, we will increase our block_size, which people nowadays call the context window, to 8. This will eventually allow our model to look back further, and thus get a better understanding of the context that it's in.

Another thing to note here is that we're not going to predict just one char from its previous 8 characters like in the past model. Instead, we will predict the next char at each position in the context window. That means, in a chunk of 8 characters, first we show the first char and the model predicts the next one, then we show the first 2 chars and the model predicts the next one, and so on until the model has made a prediction at all 8 positions.

In other words, instead of producing one example per "sliding window", we're revealing one character at a time within a chunk of text. Now this is an important moment: because of that sequential flavor, we will also call the block_size dimension the Time dimension, with each position being a time step. (This is a cross-over with the Recurrent Neural Network (RNN), if you know it.)

Now to make clear to you what I mean, we will introduce that in the code:

block_size = 8
x = train_data[:block_size]
# We also construct the target,
# which is just the x but we offset by a character
y = train_data[1:block_size +1] 

for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f'when input is {context} the target: {target}')

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58

This is a different way of framing the prediction. Keep it in mind, because it is the backbone of the Transformer when we dig deeper. Remember that we're doing the setup for that.

Creating Batches

Now that we know about the time dimension, we should also know about the Batch dimension. In training, we would love to do things in parallel, meaning that at each step we want to process more than one chunk, independently of each other (it's like creating numerous parallel universes that never talk to each other). The number of chunks is the batch_size, and the "creating parallel universes" part is simply feeding the chunks of text in batches.

We create batches just like we did in the previous project: we draw a tensor of random offsets, then use those offsets to index chunks of text. For simplicity, the batch dimension will be just 4:

# For reproducibility, we introduce a manual seed
torch.manual_seed(1337)
batch_size = 4

def get_batch(split):
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size,(batch_size,))
  # The stack method creates the batch dimension for our data
  # simply by stacking chunks onto each other
  x = torch.stack([data[i:i+block_size] for i in ix]) 
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x,y
xb,yb = get_batch('train')
print(xb)
# 4x8 tensor

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
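To see how much training signal sits inside one such batch, we can unpack it into its individual (context, target) pairs. The sketch below rebuilds a batch from a toy tensor so that it runs on its own (toy_data is a hypothetical stand-in for train_data, and the seed is arbitrary):

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)          # hypothetical stand-in for train_data
block_size, batch_size = 8, 4

# Same batching recipe as get_batch in the post
ix = torch.randint(len(toy_data) - block_size, (batch_size,))
xb = torch.stack([toy_data[i:i + block_size] for i in ix])
yb = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])

# Every (row, time step) position is its own training example
examples = []
for b in range(batch_size):           # batch dimension
    for t in range(block_size):       # time dimension
        context = xb[b, :t + 1]
        target = yb[b, t]
        examples.append((context.tolist(), target.item()))

print(len(examples))                  # batch_size * block_size examples
print(examples[0])
print(examples[block_size - 1])
```

So a 4x8 batch is really 32 independent examples, and rows never share context with each other.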

And now we can use that data for our neural nets!

Build the model

Let's look at the code first, and then I will explain later

import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):
  def __init__(self,vocab_size):
    super().__init__()
    # each token directly reads off the logits for the next token from a lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
  def forward(self,idx,targets):
    # idx and targets are both (B,T) tensor of integers
    logits = self.token_embedding_table(idx) # B,T,C
    return logits
model = BigramLanguageModel(vocab_size)
out = model(xb,yb)
print(out.shape)

First we import nn from torch to get some building blocks that are convenient for our project. The thing to look at here is nn.Embedding. It works just like when we embedded our characters manually: it creates a lookup table, which is just a bunch of vectors (whose dimension we choose at initialization), and when we want to embed a token, we pass its index to the lookup table and get back the vector stored for it.

In this project we create a 65x65 lookup table (as the vocab_size is 65), so each character is assigned a vector of 65 values; that's the embedding size. In general the embedding size is up to us and can be any number. But why 65, you may ask? Well, we're trying to replicate the Bigram model, remember? The count-based Bigram model also had a lookup table of size vocab_size x vocab_size, but there it was for counting the occurrences of pairs. And the hard truth is, we're not actually counting anything here.

To be clear, this is not a count-based Bigram model, but rather a neural net that replicates the bigram, just like we did in the second blog post. If we optimize it well, the probabilities we get from the table (after a softmax over each row) should end up close to the ones from the normalized count table. That is because our table stores log-counts (logits) rather than raw counts; I think we talked about that earlier.
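As a reminder of what the count-based version looks like, here is a tiny sketch on a toy string (toy, pair_counts and next_char_probs are names I made up for illustration). The normalized counts it prints are the kind of probabilities a well-trained row of the neural table should approximate after softmax:

```python
from collections import Counter, defaultdict

toy = "hello hello help"              # hypothetical mini corpus

# Count how often each character follows each other character
pair_counts = defaultdict(Counter)
for a, b in zip(toy, toy[1:]):
    pair_counts[a][b] += 1

def next_char_probs(ch):
    """Normalize the counts of one row into a probability distribution."""
    total = sum(pair_counts[ch].values())
    return {nxt: count / total for nxt, count in pair_counts[ch].items()}

probs_after_l = next_char_probs("l")
print(probs_after_l)
```

In the toy corpus, "l" is followed twice by "l", twice by "o" and once by "p", so the row for "l" splits its probability 2:2:1 between them.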

One last thing to keep in mind here, which will be really important later. Note that the data gains a dimension after embedding, an additional dimension that stores the embedding vectors. So the shape of the log-counts, or the logits, is 4x8x65, and you might notice the B,T,C comment next to it. That's important, and we will use it a lot later in the blog. The B and T, as you know, are the batch dimension and the time dimension of our data. What about C? The C stands for Channels, which you can think of as the depth of information that each character carries. This resembles a Convolutional Neural Net (CNN), in which the data has size (H,W,C): the first 2 dimensions are Height and Width, while the last is the Channel (in a typical RGB image, one pixel has 3 channels: red, green, and blue). Feel free to pause and ponder, and think about the T and the RNN too. Then you will see that this is perhaps the second greatest unification in history (the greatest, in my opinion, is in Avengers: Endgame).

And a small note here: when we plug the index of a character into the lookup table, it returns the row at that index, with 65 values that are the logits for the next character. We can then apply softmax to them to get the probability of each possible next character.
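Here is a minimal sketch of that row lookup, assuming a freshly initialized table (table, same_row and the example indices are just for illustration). It checks that embedding an index is the same as picking a row of the weight matrix, and that softmax turns each row into a proper distribution:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)   # same layer type as in the model

idx = torch.tensor([[18, 47]])                 # (B=1, T=2) token indices
logits = table(idx)                            # (B, T, C) = (1, 2, 65)

# Looking up index 18 simply returns row 18 of the weight matrix
same_row = torch.equal(logits[0, 0], table.weight[18])

# Softmax over the channel dimension gives next-character probabilities
probs = F.softmax(logits, dim=-1)
print(logits.shape, same_row)
print(probs.sum(dim=-1))                       # each position sums to 1 (up to float error)
```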

Now, for every neural network implementation, we would like to see the loss. We might simply use:

loss = F.cross_entropy(logits, targets)

But that would cause an error (try it yourself!). The reason is that cross_entropy in PyTorch expects the Channels to be the second dimension. In other words, for multi-dimensional input it expects the logits to have shape B,C,T rather than B,T,C. Kind of annoying, but it means we have to reshape our logits, and the targets correspondingly.

# Evaluate the loss
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):
  def __init__(self,vocab_size):
    super().__init__()
    # each token directly reads off the logits for the next token from a lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
  def forward(self,idx,targets):
    # idx and targets are both (B,T) tensor of integers
    logits = self.token_embedding_table(idx) # B,T,C
    # loss = F.cross_entropy(logits, targets)
    # An error -> PyTorch expects the channels to be the second dim
    # It wants (B, C, T) instead of our (B, T, C), so we flatten to (B*T, C)
    B, T, C = logits.shape
    logits = logits.view(B*T,C)
    # Stretch the 4x8 into a 1-dim of 32, preserve the channels
    # Do the same for the target (B,T)
    targets = targets.view(-1)
    loss = F.cross_entropy(logits, targets)
    return loss
model = BigramLanguageModel(vocab_size)
out = model(xb,yb)
print(out)

tensor(4.8786, grad_fn=<NllLossBackward0>)
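If you'd like to see the shape issue in isolation, here is a small sketch with random logits and targets (all values are made up; only the shapes matter). It shows the failing call and two equivalent fixes: flattening to (B*T, C) as in the model, or moving the channels to the second dimension with permute:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)              # stand-in for the model output
targets = torch.randint(C, (B, T))         # stand-in for yb

# Passing (B, T, C) directly fails: PyTorch reads dim 1 as the class dim
try:
    F.cross_entropy(logits, targets)
    raised = False
except (RuntimeError, ValueError) as err:
    raised = True
    print("error:", err)

# Fix 1 (used in the post): flatten batch and time into one dimension
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Fix 2: keep the batch structure, but put the channels second -> (B, C, T)
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), targets)

print(loss_flat.item(), loss_perm.item())
```

Both fixes average over the same B*T predictions, so they give the same number; flattening is just the version Andrej uses.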

We want to know whether this is a good initial loss or not (remember the hockey stick). So let's consider the baseline case where the model assigns equal probability to every character. In that case, you can do the arithmetic in your head: the loss would be -ln(1/65), which is approximately 4.17. Our 4.88 is a bit above that, because the randomly initialized table is not perfectly uniform, so we're starting with a not really bad loss!
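For those who don't trust mental arithmetic, the baseline is a one-liner (uniform_loss and gap are names I picked; 4.8786 is the initial loss printed above):

```python
import math

vocab_size = 65

# Cross-entropy of a model that gives every character probability 1/65
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))   # 4.1744

# How far the untrained model is from that baseline
gap = 4.8786 - uniform_loss
print(round(gap, 2))
```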

Sampling

Now we will talk about our method for sampling from the model. A key difference here, though, is that we have an additional dimension, the time dimension, and we're also processing the predictions in parallel. So when we get a prediction, it's actually a bunch of predictions in different universes, and we have to concatenate each of them onto its own sequence.

Let's look at the code first:

 def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

First, we have the idx, which holds the indices of the context so far. Note that in the last line of code we pass in idx = torch.zeros((1, 1), dtype=torch.long). This means we have a batch of just one sequence, and it starts with a single time step. The number of steps we generate is set by max_new_tokens. The starting index is 0, which is, interestingly, the newline character (like: given a new line, give me the full script).

We perform a forward pass with our idx, and you can see that we pass in just the idx alone, no targets. We should cover that in the forward:

def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

Then, when taking the logits, remember that they hold the scores for the next character at each of the given positions. As we're marching forward in time, we only need the prediction at the last time step; that's why we have the -1 over there. You may find this approach rather ridiculous and funny, because every time we make a prediction we only use the last one and ignore all of the previous ones (which is clearly a waste of computation). And speaking of the bigram, remember that it only needs one character to predict the next? So what's the point here?

It turns out that the Transformer generates text in just the same way (remember we're setting things up for it?). The waste of computation shows up in the Transformer too! But researchers dealt with that by introducing the KV cache, which does the trick. We won't talk about it here; that's beyond the scope of this blog.
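To make the "waste" concrete for the bigram, here is a small sketch (the indices and the table are illustrative only). It feeds the lookup table a 4-token context and then only the last token, and checks that the last-step logits are identical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 65
table = nn.Embedding(vocab_size, vocab_size)     # stands in for the bigram model

full_context = torch.tensor([[18, 47, 56, 57]])  # (B=1, T=4)
last_only = full_context[:, -1:]                 # (B=1, T=1), just the last token

# Logits for the next character, taken at the last time step
logits_full = table(full_context)[:, -1, :]
logits_last = table(last_only)[:, -1, :]

same = torch.equal(logits_full, logits_last)
print(same)
```

This shortcut only holds for the bigram, where each position is looked up on its own. Once attention enters the picture, the last position depends on the whole context, which is why generate is written to carry the full history.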

Next is the sampling and concat part, which is easy to understand. Note that in the last line of code we index with [0]. That's because generate returns a (B, T) tensor with one row of indices per sequence in the batch. We only have one sequence in the batch, so indexing at zero gives us exactly that.
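Here is one sampling step in isolation, sketched with a batch of 2 so that the batch dimension is visible (the random logits stand in for the model output and are an assumption of this example):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, C = 2, 65
idx = torch.zeros((B, 1), dtype=torch.long)          # two sequences, each starting at index 0
logits = torch.randn(B, C)                           # stand-in for the last-step logits

probs = F.softmax(logits, dim=-1)                    # (B, C)
idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1), one sample per row
idx = torch.cat((idx, idx_next), dim=1)              # (B, 2), appended along time

first_row = idx[0].tolist()                          # [0] picks the first sequence
print(idx.shape, first_row)
```

Each row samples its own next token, and torch.cat grows every sequence by one time step.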

Let's take a look at our sample, shall we?

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ

Well, a monkey can do better than that. But we haven't trained the model yet, so let's make it better.

Run the Model with AdamW Optimizer

I will spend a whole blog talking about this, but for now you just need to know that the optimizer is the thing that takes care of the parameter updates (learning rate and stuff), so that we can make the model stronger. Let me introduce this man Adam, he's the best at optimizing models:

# Create optimizer

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

We will change our implementation a bit, to fit with our Optimizer:

batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

You might need to rerun this cell a number of times, and at the end, here's what we get:

2.4956934452056885

That's significant! Now let's look at the sample again


As thee.
ARKis ve durre wiouro d wimys:
Thaueerrpanat thathe, s
WAnd Fithee.

Nobys
I ld mb, qun mod y wousa darketwave nghof IORitok has whodirate are atit G t hant m,
HAn
CELUSt y XENTha bu.
Wh blinouke sth thiviglecldewist, trveayokeanguror mepes ' wexfalle spprdswhyealaiate foulokiou:
BE an, dop

It has structure! And that's good. Honestly, sometimes I think building this kind of thing is like having a baby: you get to watch him grow over time, through your dedication and some lessons. I really feel like a responsible father right now!
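One caveat: the loss we printed comes from a single training batch, so it is noisy, and we never touched the validation split. A common remedy is to average the loss over several batches of each split with gradients turned off. The sketch below is my own addition, not code from this post: estimate_loss, eval_iters and the random stand-in data are all assumptions, and a bare nn.Embedding plays the role of the bigram model so that the block runs on its own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 8, 32

# Hypothetical stand-in data; in the post these would be train_data and val_data
splits = {
    'train': torch.randint(vocab_size, (2000,)),
    'val': torch.randint(vocab_size, (500,)),
}

def get_batch(split):
    data = splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

table = nn.Embedding(vocab_size, vocab_size)   # stands in for the bigram model

@torch.no_grad()                               # no gradients needed for evaluation
def estimate_loss(eval_iters=20):
    out = {}
    for split in splits:
        batch_losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = table(xb)                 # (B, T, C)
            batch_losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = batch_losses.mean().item()  # average over many batches
    return out

losses = estimate_loss()
print(losses)
```

Calling something like this every few hundred steps gives a much smoother picture than a single batch, and comparing the train and val numbers tells you whether the model is overfitting.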

Summarize

We're done with our model! Even though we know that the Bigram is bad and the result is not really beautiful, we got to know a lot of new notions throughout this blog: from the sequential and parallel processing of the model (which shares some resemblance with the RNN), to the Big Three B,T,C and the cross-over with the CNN. We also learned about the sampling method and how to create an optimizer for our model. All in all, our first steps are good.

Thanks for reading, and have a good day!

DEV Community