Sandeep Salwan

Transformer Architecture

Before Transformers, sequence models were built with RNNs. Transformers won out because they fix RNNs' biggest problems: RNNs process tokens one at a time, which makes them hard to parallelize, and they suffer from vanishing and exploding gradients on long sequences.

Line 1: “The person executed the swap because it was trained to do so.”
Line 2: “The person executed the swap because it was an effective hedge.”

Look carefully at those two lines. Notice how in line 1, “it” refers to the person.
In line 2, “it” refers to the swap.

Transformers figure out what “it” refers to entirely through numbers, by measuring how strongly each pair of words in the sentence is related.

These numbers are stored in tensors: a vector is a 1D tensor, a matrix is a 2D tensor, and higher-dimensional arrays are ND tensors. Each input word is turned into an embedding, a learned vector whose values capture how the word is used and which words it tends to co-occur with.
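
To make those shapes concrete, here's a quick NumPy sketch; the tiny vocabulary, the 3-dimensional embeddings, and the random values are all made up for illustration:

```python
import numpy as np

# A vector is a 1D tensor, a matrix is a 2D tensor, higher-dimensional arrays are ND tensors.
vector = np.array([0.2, -1.3, 0.7])   # shape (3,)      -> 1D tensor
matrix = np.zeros((4, 3))             # shape (4, 3)    -> 2D tensor
batch = np.zeros((2, 4, 3))           # shape (2, 4, 3) -> 3D tensor

# Toy embedding table: each row is the learned vector for one word in the vocabulary.
vocab = {"the": 0, "person": 1, "executed": 2, "swap": 3, "because": 4, "it": 5}
embedding_table = np.random.randn(len(vocab), 3)  # 6 words x 3-dimensional embeddings

# The input sentence becomes a 2D tensor: one embedding row per word.
sentence = ["the", "person", "executed", "the", "swap"]
embeddings = embedding_table[[vocab[w] for w in sentence]]
print(embeddings.shape)  # (5, 3)
```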

This architecture relies on three key matrices, each computed from the input embeddings: the Query matrix, the Key matrix, and the Value matrix.

Imagine you are a detective. The Query is like your list of questions (Who or what is “it”?). The Key is the evidence each word carries (what every word offers as a clue). When you multiply Query by Key, you get a set of attention scores (numbers showing which clues are most relevant).

A fair amount of math happens here: the scores are scaled by the square root of the key dimension (to keep them numerically stable), normalized with softmax (so they become probabilities that sum to 1), and then used as weights.

Finally, the Value is the actual content of the evidence (the meaning each word carries, e.g., “person” is a living being and “swap” is an action). Multiplying the attention weights by the Value matrix gives the final information the model carries forward to make the right decision about “it.”
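
Putting the Query, Key, and Value steps together, here's a minimal NumPy sketch of scaled dot-product attention. The 5-word sentence, the 4-dimensional vectors, and the random weight matrices are just placeholders, not values from a real trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how relevant each Key is to each Query
    weights = softmax(scores, axis=-1)  # each row becomes probabilities summing to 1
    return weights @ V, weights         # weighted mix of the Values, plus the weights

# Toy setup: 5 words, each represented by a 4-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))    # word embeddings for the sentence
W_q = rng.normal(size=(4, 4))  # learned projection for Queries
W_k = rng.normal(size=(4, 4))  # learned projection for Keys
W_v = rng.normal(size=(4, 4))  # learned projection for Values
Q, K, V = X @ W_q, X @ W_k, X @ W_v

output, attention = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (5, 4): one updated vector per word
print(attention[-1])  # how much the last word "attends" to every word in the sentence
```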

All of these abstract Q, K, and V matrix values are learned through backpropagation. Training works by predicting an output, comparing it to the true label, measuring the loss (the larger the difference between the predicted and actual output, the higher the loss and the worse the prediction), calculating gradients (slopes showing how much each weight contributed to that error), and then updating each weight in the opposite direction of its gradient (e.g., if the gradient of the loss with respect to a weight is +2, that weight is nudged downward).
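
Here's a toy-sized illustration of that update rule, using a single weight and a made-up squared-error loss rather than a real Transformer loss:

```python
# Toy example: one weight w, prediction = w * x, squared-error loss.
x, target = 3.0, 6.0
w = 0.5               # current weight
learning_rate = 0.1

prediction = w * x
loss = (prediction - target) ** 2         # bigger gap -> higher loss -> worse prediction
gradient = 2 * (prediction - target) * x  # slope of the loss with respect to w

# Step the weight in the opposite direction of the gradient.
w = w - learning_rate * gradient
print(loss, gradient, w)  # after the step, w * x is closer to the target
```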

Now you know at a high level how Transformers (used by the top LLMs today) work: they’re just predicting the next word in a sequence.
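
To show what “predicting the next word” looks like numerically, here's one last sketch: a final projection turns the last position's vector into one score per vocabulary word, and softmax turns those scores into probabilities. The six-word vocabulary and random weights are purely illustrative:

```python
import numpy as np

vocab = ["the", "person", "executed", "swap", "it", "hedge"]

hidden = np.random.randn(4)             # vector for the last position in the sequence
W_out = np.random.randn(4, len(vocab))  # final projection: hidden size -> vocabulary size

logits = hidden @ W_out                 # one raw score per vocabulary word
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()             # softmax: probabilities that sum to 1

next_word = vocab[int(np.argmax(probs))]  # the model's guess for the next word
print(dict(zip(vocab, probs.round(3))), next_word)
```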
