What are Transformer Models?
Transformer models are a type of AI architecture that revolutionized how computers understand and generate text. Think of them as super-powered pattern recognition systems that can process entire sentences at once, rather than word by word like older models. Their ability to understand context and capture meaning is what sets them apart.
Just a quick disclaimer… This is a simplified explanation.
So let’s get started!
Tokens and vectors
Firstly, let’s briefly understand what vectors and tokens are, as we will be using these terms frequently.
Consider this sentence: “I love coding”
Tokenization
When this gets fed into a tokenization engine, it splits this into pieces like ‘I’, ‘love’, ‘coding’, or sometimes even smaller chunks like ‘I’, ‘lo’, ‘ve’, ‘cod’, ‘ing’, depending on the tokenization method. These splits are called tokens.
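Here's a minimal sketch of that idea in Python. Real tokenizers use learned subword algorithms like BPE, and the tiny vocabulary below is completely made up for illustration:

```python
# A toy tokenizer. Real tokenizers use learned subword algorithms
# (like BPE); this made-up vocabulary just shows the idea.
vocab = {"I": 0, "love": 1, "coding": 2}

def tokenize(sentence):
    # Split on spaces and look up each piece's ID in the vocabulary.
    return [vocab[word] for word in sentence.split()]

print(tokenize("I love coding"))  # [0, 1, 2]
```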
And when we represent each token as a list of numbers, we get vectors!
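To make that concrete, here's a small sketch using NumPy. The numbers in the embedding table are invented for illustration; in a real model they're learned during training and each vector has hundreds or thousands of dimensions:

```python
import numpy as np

# Made-up embedding table: one row (a vector) per token ID.
embedding_table = np.array([
    [0.1, -0.3, 0.7, 0.2],   # vector for token 0 ("I")
    [0.5, 0.9, -0.1, 0.4],   # vector for token 1 ("love")
    [-0.2, 0.6, 0.3, -0.8],  # vector for token 2 ("coding")
])

token_ids = [0, 1, 2]             # the tokenizer's output from above
vectors = embedding_table[token_ids]
print(vectors.shape)              # (3, 4): three tokens, four numbers each
```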
Model Weights
Before moving further, let’s also take a look at model weights.
Think of knobs that the model "tweaks" during training based on its generated output. These knobs are what we call model weights. Here's how training works: we give the model some input text and show it what the expected output should be. The model generates its output, and then we measure how different it is from what we wanted. Based on that difference, the model adjusts all those knobs (weights) to get better at producing the right output. This happens millions of times until the model gets good at it!
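Here's a toy sketch of that loop with a single knob instead of billions. Real models use backpropagation to work out which way to nudge every weight, but the predict-measure-adjust cycle is the same idea:

```python
# A minimal sketch of training with one "knob" (weight).
weight = 0.0            # the knob, starting at an arbitrary value
learning_rate = 0.1

x, expected = 2.0, 6.0  # toy example: we want the model to learn weight = 3

for step in range(50):
    output = weight * x                  # the model generates its output
    error = output - expected            # how different is it from what we wanted?
    gradient = 2 * error * x             # which way to nudge the knob (from calculus)
    weight -= learning_rate * gradient   # adjust the knob to do better next time

print(round(weight, 3))  # ~3.0: the model "learned" the right setting
```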
Query, Key, Value vectors & Attention Mechanism
After the input is tokenized and converted to vectors, here’s where it gets interesting. Using the model weights, we transform each input vector into three different types of vectors:
Query vector (what this token is “asking about”)
Key vector (what this token is supposed to “represent”)
Value vector (the actual “info” this token carries)
This transformation is done by multiplying each input vector by three different sets of model weights, one for each vector type. Then we feed all of these into the attention mechanism: the model takes each token's query vector and compares it with ALL the key vectors in the sentence (including its own). This gives us attention scores, aka attention weights.
For each token, these scores form a vector that tells the model which tokens to pay attention to.
Here's an example to understand this better. Each attention score basically tells the model:
"When processing this word, pay **THIS** much attention to that other word."
For example, in
“The cat sat on the mat”
When processing “sat,” the AI might pay more attention to “cat” (who’s doing the sitting) and “mat” (where the sitting happens).
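Here's a rough NumPy sketch of the steps so far, with random made-up weights standing in for trained ones. (Real attention also divides the scores by the square root of the vector size to keep them stable; that detail is included below.)

```python
import numpy as np

np.random.seed(0)
d = 4                       # toy vector size; real models use far more
x = np.random.randn(6, d)   # 6 token vectors: "The cat sat on the mat"

# Made-up weight matrices standing in for trained model weights.
W_q = np.random.randn(d, d)
W_k = np.random.randn(d, d)
W_v = np.random.randn(d, d)

Q = x @ W_q   # what each token is "asking about"
K = x @ W_k   # what each token "represents"
V = x @ W_v   # the actual "info" each token carries

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Compare every query with every key (including its own). Softmax turns
# each row of scores into attention weights that sum to 1.
scores = Q @ K.T / np.sqrt(d)
attn = softmax(scores)
print(attn[2].round(2))  # how much "sat" attends to each of the 6 tokens
```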
Contextual Representation and Prediction
These attention scores get multiplied by the corresponding value vectors and added together. This creates what we call a contextualized representation: a vector that captures not just what this word means, but what it means in this specific context, given everything around it.
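Continuing the sketch above (this reuses `attn` and `V` from the previous snippet), that step is just one more matrix multiplication:

```python
# Weight each value vector by its attention score and add them up.
# Each row of `context` is now that token's vector *in this sentence*.
context = attn @ V
print(context.shape)  # (6, 4): one contextualized vector per token
```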
Then the model takes the dot product of this contextualized vector with a matrix (think of this as a collection of vectors) containing a vector for every token the model knows. For simplicity, you can think of a dot product as multiplying matching numbers and adding them up, but it helps to know exactly what it means. The attention layers and all the other components of such a model, including these vectors, are also adjusted and learned during training.
That dot product gives a score for each token the model knows. The model then takes the token with the highest score, or uses sampling techniques to randomly choose among the highest-ranking tokens, and provides it as the output!
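Here's a hedged sketch of that final step, again with made-up numbers standing in for trained ones:

```python
import numpy as np

np.random.seed(1)
vocab_size, d = 10, 4   # a toy vocabulary of 10 tokens

# Made-up matrix with one vector per token the model knows;
# in a real model these vectors are learned during training.
output_matrix = np.random.randn(vocab_size, d)

h = np.random.randn(d)   # the contextualized vector for the last position

logits = output_matrix @ h             # dot product with every token's vector
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax: scores become probabilities

greedy = int(np.argmax(probs))                        # take the highest score...
sampled = int(np.random.choice(vocab_size, p=probs))  # ...or sample by probability
print(greedy, sampled)
```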
And that's how AI generates the next word! This process repeats, one token at a time, until the model produces a special stop token that signals it's done.
Thank you for reading!