About Me
I have been working as a software engineer for almost 8 years, mostly on backend and infrastructure, including distributed systems, nearline processing, batch processing, etc. I learned some basic ML in school but have no experience with complex ML use cases. This series will note what I learn about LLMs as a general software engineer. Feel free to comment if anything seems wrong, and leave your questions.
Transformer
This is a good source to understand each component in the transformer: Mastering Tensor Dimensions in Transformers. Decoder-only models (GPT family, Llama, Claude) are used for generation. Encoder-decoder models (BART, the original "Attention Is All You Need" Transformer) handle translation and summarization. Encoder-only models like BERT are used for classification and embeddings.
Here we focus on decoder-only LLMs. To summarize the architecture, the transformer block has two main components: Masked Multi-Head Attention (MMHA) and the Feed Forward Network (FFN).
Masked Multi-Head Attention (MMHA)
The attention formula involves a query (Q), key (K), and value (V):
Q (Query) → what this token is looking for
K (Key) → what this token offers / represents
V (Value) → the actual content to retrieve
In the attention weight calculation, softmax over the scaled Q·K scores produces the attention weights, and the output is the weighted sum of the values: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The intuition for training and inference (a small code sketch follows the list):
For training, everyone asks questions (Q) at the same time about everyone else (K/V), with masking;
For inference, only the newest token asks a question (Q), using stored memory (K/V) from the past.
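To make the weighted sum concrete, here is a minimal NumPy sketch of masked (causal) scaled dot-product attention for a single head, with no learned projections; the shapes and names are illustrative, not taken from the posts linked above.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention. Q, K, V: (seq_len, d_head)."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)            # (seq_len, seq_len) similarity scores
    future = np.triu(np.ones_like(scores), k=1)   # 1s above the diagonal mark future positions
    scores = np.where(future == 1, -1e9, scores)  # mask: a token cannot attend to later tokens
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                              # weighted sum of values

# Toy example: 4 token positions, head dimension 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = masked_attention(x, x, x)  # in a real layer, Q/K/V come from learned projections of x
print(out.shape)                 # (4, 8)
```

During training this runs over the whole sequence at once; during decode, the query is just the newest token's row.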
Q/K/V in Training vs Inference
In training, because the full training sequence is available, the model can process all token positions in parallel. For each transformer layer, Q, K, and V are computed from the same input sequence of hidden states. A causal mask prevents each position from attending to future tokens.
In inference, there are two phases:
Prefill phase:
The model processes the whole prompt. Q, K, and V are all computed from the prompt tokens. The model stores/caches only K/V for future generation; Q is used temporarily during the prompt forward pass and then discarded.
Decode/generation phase:
For each newly generated token, the model computes Q/K/V for that new token. The new token's Q attends to the cached K/V from the prompt plus previous generated tokens. Then the new token's own K/V are appended to the KV cache for future tokens.
KV Caching in the inference
The same author has another post on KV caching: KV Caching Explained: Optimizing Transformer Inference Efficiency. Without caching, K/V for every past token would have to be recomputed every step — a waste, since they don't change. KV caching stores them so each new step only computes Q/K/V for the current token and reuses the rest, which speeds up inference substantially.
Like we mentioned before, inference has two distinct phases: prefill (processing the prompt, where all prompt tokens compute Q in parallel just like training) and decode (autoregressive generation, one token at a time). This split is a foundational concept for inference systems — it drives latency characteristics, batching strategy, and how the KV cache gets populated.
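Here is a rough NumPy sketch of the two phases with a KV cache. It ignores multiple heads, layers, and the actual model that produces hidden states; the projection matrices and token vectors are random stand-ins.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
# Stand-ins for the learned Q/K/V projection weights of one layer/head.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """One query attending over all cached keys/values (no mask needed: the cache holds only the past)."""
    scores = q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# --- Prefill: process the whole prompt, cache K/V, discard Q afterwards ---
prompt = rng.normal(size=(5, d))        # stand-in for 5 prompt-token hidden states
K_cache = prompt @ W_k
V_cache = prompt @ W_v

# --- Decode: one token per step, reuse the cache instead of recomputing past K/V ---
for step in range(3):
    h_new = rng.normal(size=(1, d))     # stand-in for the newest token's hidden state
    q = h_new @ W_q                     # only the new token computes a query
    K_cache = np.vstack([K_cache, h_new @ W_k])  # append the new token's K/V to the cache
    V_cache = np.vstack([V_cache, h_new @ W_v])
    out = attend(q, K_cache, V_cache)
    print(step, out.shape)              # (1, 8) each step
```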
Feed Forward Network (FFN)
This is an expand → nonlinearity → contract process.
-> expand weights W1: project from d_model up to d_ff
-> contraction weights W2: project from d_ff back down to d_model
It’s like:
Expand = generate many candidate features
Activate = choose which matter
Contract = compress back into the residual stream
What's the target expansion dimension?
This is a hyperparameter, but not an arbitrary one. The standard rule of thumb is 4x the model dimension, as used in GPT-3.
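A minimal NumPy sketch of the FFN with the 4x expansion; I use a GELU nonlinearity here (GPT-2/3 style), which is an assumption since other models use variants like SwiGLU.

```python
import numpy as np

d_model = 16
d_ff = 4 * d_model                             # the 4x rule of thumb

rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand weights
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # contraction weights
b2 = np.zeros(d_model)

def gelu(x):
    # tanh approximation of GELU (used in GPT-2)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x):
    h = gelu(x @ W1 + b1)   # expand + activate: generate candidate features, pick which matter
    return h @ W2 + b2      # contract: compress back to d_model for the residual stream

x = rng.normal(size=(4, d_model))   # 4 token positions
print(ffn(x).shape)                 # (4, 16): same shape as the input, ready to add to the residual
```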
Weights vs Hyperparameters
The transformer is learning (tuning) the weights below (their shapes are sketched after the list):
- Attention projections: W_Q, W_K, W_V (per head) and the attention output projection W_O
- Token + positional embeddings (positional only if learned, e.g. GPT-2; RoPE has no learned params)
- LayerNorm scale/bias (γ, β)
- Final output / unembedding matrix (often tied with the input embedding)
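To make the inventory concrete, here is a sketch listing the learned parameter shapes for one decoder block plus the embeddings; the GPT-2-small-like sizes are chosen for illustration, and exact parameter names vary by implementation.

```python
import math

d_model, d_ff, vocab, max_pos = 768, 3072, 50257, 1024

learned_params = {
    # attention projections (per block; per-head slices are views of these matrices)
    "W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
    "W_o": (d_model, d_model),                 # attention output projection
    # feed forward network
    "ffn_W1": (d_model, d_ff), "ffn_W2": (d_ff, d_model),
    # layer norms (scale gamma, bias beta), two per block
    "ln1_gamma": (d_model,), "ln1_beta": (d_model,),
    "ln2_gamma": (d_model,), "ln2_beta": (d_model,),
    # embeddings (shared across the whole model, not per block)
    "token_embedding": (vocab, d_model),       # often tied with the output/unembedding matrix
    "position_embedding": (max_pos, d_model),  # only present if positions are learned (GPT-2 style)
}

total = sum(math.prod(shape) for shape in learned_params.values())
print(f"parameters in this sketch: {total:,}")  # one block + embeddings, not the whole model
```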
Loss Function
For a decoder-only LLM, the loss is the cross-entropy between the predicted next-token distribution and the token that actually comes next. Backpropagation pushes gradients from that loss back through every layer, updating all of these weights jointly to make the error smaller. Hyperparameters, in contrast, are things like the learning rate, batch size, embedding dimension, expansion dimension, number of layers, and number of heads: they define the shape of the network, while the weights are what gradient descent fills in.
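A tiny sketch of the loss itself: the model's logits over the vocabulary at each position are compared against the token that actually came next, using cross-entropy (toy numbers here, no real model behind the logits).

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """logits: (seq_len, vocab_size) predictions for the next token at each position.
    targets: (seq_len,) ids of the tokens that actually came next."""
    logits = logits - logits.max(axis=-1, keepdims=True)              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()        # average negative log-likelihood

rng = np.random.default_rng(0)
vocab_size, seq_len = 100, 6
logits = rng.normal(size=(seq_len, vocab_size))
targets = rng.integers(0, vocab_size, size=seq_len)
print(next_token_cross_entropy(logits, targets))  # this scalar is what backpropagation differentiates
```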
Visualization
To understand each step with a concrete example, you can use this visualization tool: transformer-explainer
