Ujjwal Raj

Part II : Building My First Large Language Model from Scratch

Welcome to another article on building an LLM from scratch.

There’s been a little delay in bringing you the second part of this series, as I took some time off to enjoy my Diwali vacation - a much-needed break filled with lights before diving back into building LLMs from scratch.

If you haven’t read the first one, it’s a quick read - you can go ahead and check it out: Building My First Large Language Model from Scratch

Up to this point, we have coded the multi-head attention mechanism.

Overview of Building LLM Architecture

We will start building the LLM architecture now. After the multi-head attention mechanism, we get a tensor that we process through several deep learning layers. The weights of these layers are collectively called parameters in the context of LLMs.
You must have heard that GPT-2 Small has 124M parameters - that’s the total number of trainable weights and biases in the model.

We use the same configuration values as GPT-2 Small as the basis for our model.

GPT_CONFIG_124M = {
      "vocab_size" : 50257,    # vocab size of tiktoken
      "context_length" : 1024, # maximum input tokens handled by positional embedding layer (see previous blog)
      "emb_dim" : 768,         # each token is converted into a 768-dimensional vector
      "n_heads" : 12,          # number of heads in the attention mechanism (see previous blog)
      "n_layers" : 12,         # number of transformer blocks in the model
      "drop_rates" : 0.1,      # dropout rate for masking
      "qkv_bias" : False       # Query-Key-Value bias in attention mechanism
}

Using the above config, we will build a model containing a stack of 12 transformer blocks. Let’s start by building a single such block.


Applying Layer Normalization

Training deep learning neural networks with multiple layers can sometimes be challenging due to problems like vanishing or exploding gradients. The learning process may struggle to minimize the loss function during backpropagation.

We use layer normalization to improve the stability and efficiency of neural network training. The idea is to normalize a layer’s outputs so they have a mean of 0 and a variance of 1 (unit variance). In GPT-2, layer normalization is applied before the multi-head attention module and before the feedforward module, with one final normalization before the output layer.

You can learn more about layer normalization and its implementation here: https://medium.com/@sujathamudadla1213/layer-normalization-48ee115a14a4

Just like in GPT-2, layer normalization is chosen over batch normalization for greater flexibility and stability: it normalizes each token’s features independently of the batch size, which also makes it beneficial in distributed training.
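Below is a minimal sketch of such a normalization module in PyTorch (the class name, the learnable scale/shift parameters, and the eps value are my own choices for illustration; PyTorch’s built-in nn.LayerNorm behaves equivalently):

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalizes the last dimension to zero mean and unit variance,
    then applies a learnable scale and shift."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps                                   # avoids division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # learnable gain
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # learnable bias

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift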


Implementing the Activation Function (GELU) in the Feedforward Network

A computationally cheaper approximation of GELU (the tanh-based version) is used as the activation function.

The smoothness of GELU leads to better optimization during training than ReLU (as shown in the figure below).

ReLU has a sharp corner at zero and outputs exactly zero for any negative input, which can make optimization harder during training. In contrast, with GELU, neurons that receive negative inputs still contribute to the learning process to a small extent.
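As a rough sketch, this tanh approximation can be written as a tiny PyTorch module (PyTorch also ships nn.GELU(approximate="tanh"), which computes the same thing):

import math
import torch
import torch.nn as nn

class GELU(nn.Module):
    """Tanh-based approximation of the GELU activation."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
        ))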


A Feedforward Layer Architecture

The above figure shows how a feedforward layer looks. It contains three parts - a linear layer, followed by the non-linear GELU activation, and then another linear layer. The first linear layer expands the embedding dimension by a factor of four, and the second linear layer projects it back down to the original size. This expansion provides a richer representation space.
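A minimal sketch of such a feedforward module, reusing the GPT_CONFIG_124M dictionary from above (the class name is my own; I use PyTorch’s built-in GELU here, but the custom GELU sketch from the previous section would work just as well):

import torch.nn as nn

class FeedForward(nn.Module):
    """Expand to 4x the embedding dimension, apply GELU, project back down."""
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),   # 768 -> 3072
            nn.GELU(approximate="tanh"),                     # smooth non-linearity
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),   # 3072 -> 768
        )

    def forward(self, x):
        return self.layers(x)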


Shortcut Connection

We provide an alternate path that skips the feedforward layer. This is achieved by adding a layer’s input to its output (as shown in the figure), so the original signal bypasses the layers in between. This helps prevent the vanishing gradient problem.
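Purely as an illustration, here is what a shortcut connection looks like in code, using a stand-in feedforward block (the shapes and the stand-in module are assumptions for the example):

import torch
import torch.nn as nn

# A shortcut (residual) connection adds a block's input back to its output,
# giving gradients a direct path around the block.
emb_dim = 768
block = nn.Sequential(                      # stand-in for the feedforward block above
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(approximate="tanh"),
    nn.Linear(4 * emb_dim, emb_dim),
)

x = torch.randn(2, 4, emb_dim)              # (batch, num_tokens, emb_dim) - made-up shape
out = x + block(x)                          # residual add keeps the original signal
print(out.shape)                            # torch.Size([2, 4, 768])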


Assembling the Transformer Block

Now we will assemble one unit of the transformer block. This will be repeated 12 times in the final LLM architecture.

Inside the block, the input first goes through a layer normalization layer, the multi-head attention module, and a dropout layer, with a shortcut connection added around them. The result then goes through a second layer normalization layer, the feedforward layer, and another dropout layer, again wrapped in a shortcut connection, as shown in the figure.

The idea is that the attention mechanism identifies and analyzes the relationships between elements in the input sequence, while the feedforward network modifies the data individually at each position. This way, the model is enhanced to handle complex patterns.

The dropout layer helps prevent overfitting.
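Here is a minimal sketch of one such block. Note that for the attention part I substitute PyTorch’s built-in nn.MultiheadAttention with a causal mask as a stand-in for the multi-head attention module we coded in the previous post, so treat this as an approximation of the assembly rather than the exact code:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer block: a pre-norm attention sub-layer and a pre-norm
    feedforward sub-layer, each followed by dropout and a shortcut connection."""
    def __init__(self, cfg):
        super().__init__()
        # Stand-in for the multi-head attention module from the previous post.
        self.att = nn.MultiheadAttention(
            embed_dim=cfg["emb_dim"], num_heads=cfg["n_heads"],
            dropout=cfg["drop_rates"], bias=cfg["qkv_bias"], batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            nn.GELU(approximate="tanh"),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]))
        self.norm1 = nn.LayerNorm(cfg["emb_dim"])
        self.norm2 = nn.LayerNorm(cfg["emb_dim"])
        self.drop = nn.Dropout(cfg["drop_rates"])

    def forward(self, x):
        num_tokens = x.shape[1]
        # Causal mask: True marks positions a token is not allowed to attend to.
        causal_mask = torch.triu(
            torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=x.device),
            diagonal=1)

        shortcut = x                                  # attention sub-layer
        x = self.norm1(x)
        x, _ = self.att(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.drop(x)
        x = x + shortcut

        shortcut = x                                  # feedforward sub-layer
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop(x)
        x = x + shortcut
        return x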


Generating Text: Greedy Decoding

The LLM is assembled as shown in the figure below.
The final output normalization is applied, followed by a linear output layer that converts each token vector (768-dim) to a vocabulary-sized vector (~50k-dim). These output vectors are called logits.
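Putting the pieces together, a minimal sketch of the full model could look like the following (class and attribute names are my own; TransformerBlock refers to the sketch above, and GPT_CONFIG_124M is the config from earlier):

import torch
import torch.nn as nn

class GPTModel(nn.Module):
    """Token + positional embeddings, a stack of transformer blocks,
    a final layer norm, and a linear output head producing logits."""
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rates"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = nn.LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, token_ids):                 # token_ids: (batch, num_tokens)
        _, num_tokens = token_ids.shape
        tok_embeds = self.tok_emb(token_ids)
        pos_embeds = self.pos_emb(torch.arange(num_tokens, device=token_ids.device))
        x = self.drop_emb(tok_embeds + pos_embeds)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        return self.out_head(x)                   # logits: (batch, num_tokens, vocab_size)

Calling sum(p.numel() for p in model.parameters()) on an instance gives the total parameter count; whether it lands at exactly 124M depends on reusing the token-embedding weights in the output head, which is one of the topics teased in the conclusion below.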

Now we’ll understand how the final tensor is used to compute the next token, which forms the LLM’s response.

To compute the next generated token, the model produces a logits vector. The logits vector has a dimension equal to the vocabulary size (~50k in our case) and holds an unnormalized score for each token; applying softmax turns these scores into probabilities.
For example, if index 2 of the resulting probability vector has a value of 0.12, it means the probability of token ID 2 being next is 0.12. So, we simply select the token ID with the highest value as the next generated token.

The following figure from Towards AI helps illustrate the flow:

The softmax function is used to convert the logits into a probability distribution. Since softmax is a monotonic function, taking the argmax of the logits directly gives the same token. The idea remains the same - pick the most probable next token.
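A tiny check of this equivalence (the logit values are made up):

import torch

logits = torch.tensor([2.0, -1.0, 0.5])     # made-up scores for a 3-token vocabulary
probs = torch.softmax(logits, dim=-1)       # normalize into a probability distribution
print(torch.argmax(logits).item())          # 0
print(torch.argmax(probs).item())           # 0 - same pick, since softmax is monotonic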

A typical logits tensor will have the shape:
(batch size, sequence length, vocab size)

The above figure shows how the logits are generated iteratively, producing one new token in every iteration.
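A minimal sketch of this greedy decoding loop (the function name and signature are my own; model is assumed to be something like the GPTModel sketch above):

import torch

def generate_greedy(model, token_ids, max_new_tokens, context_length):
    """Repeatedly run the model on the current sequence and append the
    token with the highest logit at the last position."""
    model.eval()                                             # disable dropout
    for _ in range(max_new_tokens):
        context = token_ids[:, -context_length:]            # crop to the supported context
        with torch.no_grad():
            logits = model(context)                         # (batch, num_tokens, vocab_size)
        last_logits = logits[:, -1, :]                      # scores for the next token only
        next_id = torch.argmax(last_logits, dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)  # append and repeat
    return token_ids

# Hypothetical usage with the GPTModel sketch above and tiktoken's GPT-2 tokenizer:
# model = GPTModel(GPT_CONFIG_124M)
# ids = torch.tensor([tokenizer.encode("Hello, I am")])
# out = generate_greedy(model, ids, max_new_tokens=10, context_length=1024)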


Conclusion

With this, we’ve laid the foundation for how individual transformer blocks come together to form a complete LLM.

Next, we’ll explore how we label an LLM based on its parameters, how weights can be reused in input and output layers, and the most interesting part - pretraining the whole architecture.

Till then, stay tuned.
