DEV Community: Sofia

LLM Study Diary #3: PyTorch

Sofia — Thu, 07 May 2026 04:02:26 +0000

This is a continuation of the course. This lesson is mostly about PyTorch.

Tensor Basics & Memory

The lecture introduces tensors as the core building blocks for parameters, gradients, and optimizer states. The instructor then discusses floating-point representations, including FP32 (full precision), BF16 (brain float, often preferred for deep learning), and the move toward FP8 for efficiency.

Float Data Types
Several float types are discussed, such as float32, float16, bfloat16, and fp8. Training in float32 requires a lot of memory, while lower-precision types such as bfloat16 or fp8 save memory and compute but can make training less stable. Some people also mix the two, for example using float32 in the attention calculation and float16 in the feed-forward layers. In general, float32 (also called single precision or full precision) is used for storing parameters and optimizer states during training, to keep training numerically stable.
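
To make the memory trade-off concrete, here is a minimal sketch that counts the bytes one dense tensor needs in each format. The 4096 x 4096 shape is a made-up example, and `tensor_bytes` is a hypothetical helper, not something from the lecture.

```python
# Bytes per element for the float formats mentioned above.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}


def tensor_bytes(shape, dtype):
    """Memory for one dense tensor: number of elements times bytes per element."""
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements * BYTES_PER_ELEMENT[dtype]


# Hypothetical example: a single 4096 x 4096 weight matrix.
shape = (4096, 4096)
for dtype in BYTES_PER_ELEMENT:
    mib = tensor_bytes(shape, dtype) / 2**20
    print(f"{dtype:>8}: {mib:.0f} MiB")
```

Halving the bytes per element halves the memory for the same tensor, which is why lower precision is attractive when it does not hurt stability.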

Tensor Operations & Einstein Summation

He introduces einops as a more readable and robust alternative to standard PyTorch indexing (e.g., -1, -2), which helps developers keep track of dimensions. You can think of it as attaching name tags to the dimensions of a tensor. For example, in z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2") the output tensor has dimensions named batch, seq1, and seq2; hidden does not appear in the output, so it is summed over.
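
To see what that pattern computes, here is a plain-Python sketch of the same contraction using nested lists instead of tensors. It is only meant to spell out the meaning of the named dimensions. `pairwise_scores` is a hypothetical name and the inputs are toy values.

```python
def pairwise_scores(x, y):
    """Plain-Python equivalent of the pattern
    "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2".

    x: nested list with shape [batch][seq1][hidden]
    y: nested list with shape [batch][seq2][hidden]
    """
    z = []
    for b in range(len(x)):                # "batch" is kept in the output
        rows = []
        for i in range(len(x[b])):         # "seq1" is kept in the output
            row = []
            for j in range(len(y[b])):     # "seq2" is kept in the output
                # "hidden" is absent from the output, so it is summed over.
                row.append(sum(xv * yv for xv, yv in zip(x[b][i], y[b][j])))
            rows.append(row)
        z.append(rows)
    return z


x = [[[1.0, 0.0], [0.0, 2.0]]]                # batch=1, seq1=2, hidden=2
y = [[[1.0, 1.0], [3.0, 0.0], [0.0, 1.0]]]    # batch=1, seq2=3, hidden=2
z = pairwise_scores(x, y)
print(z)  # shape [1][2][3]
```

The named version replaces three nested loops and one sum with a single readable string, and you never have to remember which axis is -1 or -2.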

Compute Accounting (FLOPs)

A deep dive into counting the total number of floating-point operations. The instructor establishes the rule of thumb that training requires approximately 6 × parameters × tokens FLOPs (2 × parameters × tokens for the forward pass and 4 × parameters × tokens for the backward pass).
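
The rule of thumb is easy to turn into a small calculator. This is a rough sketch: the model size and token count below are made-up numbers for illustration, and `training_flops` is a hypothetical helper.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute using the 6 x parameters x tokens rule."""
    forward = 2 * num_params * num_tokens    # forward pass
    backward = 4 * num_params * num_tokens   # backward pass
    return forward + backward


# Hypothetical run: a 1B-parameter model trained on 20B tokens.
flops = training_flops(num_params=1_000_000_000, num_tokens=20_000_000_000)
print(f"{flops:.2e} FLOPs")
```

Dividing this total by the throughput your hardware actually sustains gives a first estimate of training time.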

Note: If you forgot what the forward pass and backpropagation are, here is a video that walks through the math behind training a simple neural network: The Math behind Neural Networks

Model Building & Optimization

He demonstrates building a simple linear model, implements custom optimizers like AdaGrad to show how optimizer state persists across steps, and explains the importance of proper initialization (e.g., Xavier initialization) for maintaining numerical stability in deep networks.
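
Here is a minimal sketch of the AdaGrad idea in plain Python, to show what "state that persists across steps" means. It works on a flat list of floats rather than PyTorch tensors, and the class layout, learning rate, and toy objective are all my own assumptions, not the lecture's code.

```python
import math


class AdaGrad:
    """Toy AdaGrad over a flat list of float parameters (updated in place)."""

    def __init__(self, params, lr=0.1, eps=1e-10):
        self.params = params
        self.lr = lr
        self.eps = eps
        # Optimizer state: running sum of squared gradients, one per parameter.
        # It is kept between calls to step(), which is the whole point.
        self.sum_sq_grads = [0.0] * len(params)

    def step(self, grads):
        for i, g in enumerate(grads):
            self.sum_sq_grads[i] += g * g
            scale = math.sqrt(self.sum_sq_grads[i]) + self.eps
            self.params[i] -= self.lr * g / scale


# Toy objective: f(w) = sum(w_i ** 2), whose gradient is 2 * w_i.
params = [1.0, -2.0]
optimizer = AdaGrad(params, lr=0.5)
for _ in range(100):
    grads = [2 * w for w in params]
    optimizer.step(grads)

print(params)  # both values end up very close to 0
```

Because the accumulated squared gradients only grow, the effective step size for each parameter shrinks over time. The state also costs memory: one extra number per parameter.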

Training Infrastructure

There is practical advice on data loading with memmap to handle massive datasets (only the part of the data being accessed is loaded into memory), the importance of checkpointing to avoid losing progress (similar to checkpointing in batch and stream processing), and the interplay between hardware constraints and model architecture.
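
The memory-mapping idea can be sketched with the standard library alone. Note the swap: this uses Python's `mmap` module as a stand-in for the NumPy-style memmap mentioned above, and it assumes token ids are stored as unsigned 16-bit integers in a flat binary file. `write_tokens` and `read_token_slice` are hypothetical helpers.

```python
import mmap
import os
import tempfile
from array import array


def write_tokens(path, tokens):
    """Store token ids as unsigned 16-bit integers in a flat binary file."""
    with open(path, "wb") as f:
        array("H", tokens).tofile(f)


def read_token_slice(path, start, length):
    """Read `length` tokens starting at index `start` without loading the whole file."""
    item_size = array("H").itemsize
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this byte range need to be brought into memory.
            raw = mm[start * item_size:(start + length) * item_size]
    chunk = array("H")
    chunk.frombytes(raw)
    return chunk.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.bin")
    write_tokens(path, range(10_000))            # pretend this is a huge dataset
    batch = read_token_slice(path, start=5_000, length=8)

print(batch)
```

A training loop would call something like `read_token_slice` with random offsets to sample batches, so the dataset can be far larger than RAM.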

LLM Study Diary #2: Tokenization

Sofia — Mon, 04 May 2026 22:06:33 +0000

Background

I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is publicly available online, and all the assignment resources are here: https://cs336.stanford.edu/. In this series, I will post summaries and notes, starting from lesson 1.

Tokenization

Tokenization is the very first step in an LLM pipeline. There are many different tokenization algorithms, such as character-based tokenization, byte-based tokenization, word-based tokenization, and Byte Pair Encoding (BPE).

Character-based Tokenization
Pros: Simple to define by mapping characters to code points.
Cons: Highly inefficient use of vocabulary because some characters are rare, and the compression ratio is suboptimal compared to more advanced methods.
Byte-based Tokenization
Pros: Uses a very small, fixed vocabulary (256 values, indices 0-255), avoiding sparsity issues.
Cons: Leads to very long sequences because the compression ratio is effectively 1:1 (one byte per token), which makes model training computationally expensive due to the quadratic nature of attention.
Word-based Tokenization
Pros: Captures semantic units through splitting strings by whitespace or regex.
Cons: Results in an unbounded vocabulary size; it struggles with rare or unseen words, often necessitating an "UNK" (unknown) token which creates significant challenges for model training and evaluation.
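
The three approaches are easy to compare on one string. This is a small sketch with an example sentence and a regex of my own choosing, not the ones used in the course.

```python
import re

text = "Hello, w\u00f6rld!"  # "\u00f6" is the character o-umlaut, which needs 2 bytes in UTF-8

# Character-based: one token per Unicode code point.
char_tokens = [ord(ch) for ch in text]

# Byte-based: one token per UTF-8 byte, so every id is in the range 0-255.
byte_tokens = list(text.encode("utf-8"))

# Word-based: split into runs of word characters and single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(char_tokens), len(byte_tokens), len(word_tokens))
print(word_tokens)
```

The byte sequence is the longest and the word sequence the shortest, but the word vocabulary has no upper bound: every new word needs a new entry.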

BPE

BPE is the most practical of these approaches. Here is how it works:

Convert to Bytes: First, represent the input string as a sequence of bytes (integers). This ensures every character, even rare ones, can be represented.
Count Frequencies: Scan the entire corpus to count the frequency of all adjacent pairs of bytes or existing tokens.
Merge the Most Frequent: Identify the pair that appears most often and merge them into a new, single token. Add this new token to your vocabulary.
Repeat: Repeat the process of counting and merging for a set number of iterations or until a desired vocabulary size is reached. This process allows the model to adaptively represent common sequences as single tokens and rare ones as multiple smaller components.
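
The four steps above can be sketched in a few lines of Python. This is a minimal, unoptimized version for a single string. The function names, the tie-breaking rule (first pair seen wins), and the example text are my own assumptions.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))               # step 1: convert to bytes
    vocab = {i: bytes([i]) for i in range(256)}    # base vocabulary: all byte values
    merges = {}
    for step in range(num_merges):                 # step 4: repeat
        counts = Counter(zip(ids, ids[1:]))        # step 2: count adjacent pairs
        if not counts:
            break
        # Step 3: pick the most frequent pair (ties go to the pair seen first).
        pair = max(counts, key=counts.get)
        new_id = 256 + step
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return ids, merges, vocab


text = "the cat in the hat"
ids, merges, vocab = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print([vocab[i] for i in ids])
```

Decoding is just concatenating the byte strings of each token, which is why no input ever needs an "UNK" token.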

Key Takeaways:

Efficiency: BPE is effective because it learns the statistics of your specific data set, rather than relying on predefined word boundaries.
Robustness: Unlike word-based tokenization, BPE handles unknown or rare words gracefully because it can always fall back to individual characters or smaller sub-word units, avoiding the need for "UNK" tokens.
Historical Context: Originally a data compression algorithm from 1994, it was adopted for NLP to improve neural machine translation and eventually became a standard backbone for models like GPT-2 and beyond.

LLM Study Diary #1: Transformer

Sofia — Fri, 01 May 2026 05:27:57 +0000

About Me

I have been working as a software engineer for almost 8 years, mostly on backend and infrastructure, including distributed systems, nearline processing, and batch processing. I learned some basic ML in school but have no experience with complex ML use cases. This series records what I learn about LLMs as a general software engineer. Feel free to comment if anything seems wrong, and leave your questions.

Transformer

This is a good source to understand each component in the transformer: Mastering Tensor Dimensions in Transformers. Decoder-only models (GPT family, Llama, Claude) are used for generation. Encoder-decoder models (BART, the original "Attention Is All You Need" Transformer) handle translation and summarization. Encoder-only models like BERT are used for classification and embeddings.

Here we talk about decoder-only LLMs. To summarize the architecture, the transformer block has two main components: Masked Multi-Head Attention (MMHA) and Feed Forward Network (FFN).

Masked Multi-Head Attention (MMHA)

The attention formula contains query (Q), key (K), and value (V).

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Q (Query) → what this token is looking for
K (Key) → what this token offers / represents
V (Value) → the actual content to retrieve

In the attention calculation, the softmax turns the scaled Q·K scores into attention weights, and the output is the weighted sum of the values. The intuition for training and inference is:

For training, everyone asks questions (Q) at the same time about everyone else (K/V), with masking.

For inference, only the newest token asks a question (Q), using stored memory (K/V) from the past.

Q/K/V in Training vs Inference

In training, because the full training sequence is available, the model can process all token positions in parallel. For each transformer layer, Q, K, and V are computed from the same input sequence of hidden states. A causal mask prevents each position from attending to future tokens.
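
Here is a single-head, plain-Python sketch of attention with a causal mask, following the formula above. Instead of adding negative infinity to the masked scores, it simply skips future positions, which gives the same result. The Q, K, and V values are toy numbers, and there are no learned projections.

```python
import math


def softmax(scores):
    m = max(scores)                                # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention over one sequence.

    Q, K: lists of shape [seq][d_k]; V: list of shape [seq][d_v].
    Position t may only attend to positions 0..t.
    """
    d_k = len(K[0])
    outputs = []
    for t, q in enumerate(Q):
        # Causal mask: only keys at positions <= t are scored.
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible values.
        out = [
            sum(w * V[s][j] for s, w in enumerate(weights))
            for j in range(len(V[0]))
        ]
        outputs.append(out)
    return outputs


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
outputs = causal_attention(Q, K, V)
print(outputs[0])  # the first token can only see itself, so this equals V[0]
```

The loop over t is written sequentially for clarity. In a real implementation all positions are computed at once as matrix operations, which is what makes training parallel.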

In inference, there are two phases:

Prefill phase:
The model processes the whole prompt. Q, K, and V are all computed from the prompt tokens. The model stores/caches only K/V for future generation. Q is used temporarily during the prompt forward pass and then discarded.
Decode/generation phase:
For each newly generated token, the model computes Q/K/V for that new token. The new token's Q attends to the cached K/V from the prompt plus previous generated tokens. Then the new token's own K/V are appended to the KV cache for future tokens.

KV Caching in Inference

The same author has another post about KV caching: KV Caching Explained: Optimizing Transformer Inference Efficiency. Without caching, K/V for every past token would have to be recomputed at every step, which is wasteful because they don't change. KV caching stores them so that each new step only computes Q/K/V for the current token and reuses the rest, which speeds up inference substantially.
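
To convince myself that caching does not change the result, here is a toy single-layer, single-head sketch that compares a cached decode step with recomputing everything from scratch. The projection matrices, inputs, and names (`KVCache`, `decode_step`, `recompute_last`) are all made up for illustration.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]


class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x_new, W_q, W_k, W_v, cache):
    """Compute Q/K/V for the new token only and reuse cached K/V for the past."""
    q = matvec(W_q, x_new)
    cache.append(matvec(W_k, x_new), matvec(W_v, x_new))
    return attend(q, cache.keys, cache.values)


def recompute_last(xs, W_q, W_k, W_v):
    """No cache: recompute K/V for every token, then attend from the last one."""
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return attend(matvec(W_q, xs[-1]), keys, values)


# Toy inputs: three token vectors and 2x2 projection matrices.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[1.0, 1.0], [0.0, 1.0]]

cache = KVCache()
cached_outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in xs]
full_outputs = [recompute_last(xs[: t + 1], W_q, W_k, W_v) for t in range(len(xs))]
print(cached_outputs == full_outputs)
```

Without the cache, step t redoes t key/value projections. With the cache, each step does one, at the cost of memory that grows with sequence length.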

As mentioned above, inference has two distinct phases: prefill (processing the prompt, where all prompt tokens compute Q in parallel just like training) and decode (autoregressive generation, one token at a time). This split is a foundational concept for inference systems: it drives latency characteristics, batching strategy, and how the KV cache gets populated.

Feed Forward Network (FFN)

This is an expand → nonlinearity → contract process.

$$\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$$

$W_1$ -> expansion weights
$W_2$ -> contraction weights

It’s like:

Expand = generate many candidate features
Activate = choose which matter
Contract = compress back into the residual stream

What is the target expansion dimension?
This is a hyperparameter, but not an arbitrary one. The standard rule of thumb is 4x the model dimension, as used in GPT-3.
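
A tiny plain-Python sketch of the expand, activate, contract steps, plus the parameter count that the 4x rule implies. I use ReLU for the nonlinearity as a simplifying assumption (the formula above only says σ). The weights are hand-picked toy values, not learned ones.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = relu(x W1 + b1) W2 + b2 for one token vector x.

    W1 has shape [d_model][d_ff]; W2 has shape [d_ff][d_model].
    """
    d_ff, d_model = len(b1), len(b2)
    # Expand to d_ff features, then activate (ReLU keeps only positive ones).
    hidden = [
        max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
        for j in range(d_ff)
    ]
    # Contract back to d_model.
    return [
        sum(hidden[j] * W2[j][k] for j in range(d_ff)) + b2[k]
        for k in range(d_model)
    ]


def ffn_param_count(d_model, expansion=4):
    d_ff = expansion * d_model
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


# Toy example with d_model = 2 and d_ff = 8 (4x expansion).
W1 = [[1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]]
b1 = [0.0] * 8
W2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0],
      [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b2 = [0.0, 0.0]

y = ffn([3.0, -2.0], W1, b1, W2, b2)
print(y)
print(ffn_param_count(768))  # FFN parameters in one block when d_model = 768
```

With these hand-picked weights the FFN reproduces its input, since relu(x) - relu(-x) = x. Trained weights would instead compute useful features in the expanded space.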

Weights vs Hyperparameters

The transformer is learning (tuning):

Attention projections: $W_Q$, $W_K$, $W_V$ (per head) and the attention output projection $W_O$
FFN weights and biases: $W_1$, $b_1$, $W_2$, $b_2$
Token + positional embeddings (positional only if learned, e.g. GPT-2; RoPE has no learned params)
LayerNorm scale/bias (γ, β)
Final output / unembedding matrix (often tied with the input embedding)

Loss Function

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_{t+1} \mid x_{\le t})$$
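
The loss is just the average negative log-probability the model assigns to the correct next token. A minimal sketch, where the probabilities are made-up numbers standing in for model outputs and `next_token_loss` is a hypothetical helper:

```python
import math


def next_token_loss(probs_of_correct_token):
    """Average negative log-likelihood over T positions.

    probs_of_correct_token[t] is the probability the model gave to the
    token that actually came next at position t.
    """
    T = len(probs_of_correct_token)
    return -sum(math.log(p) for p in probs_of_correct_token) / T


# Made-up model outputs for three positions.
loss = next_token_loss([0.5, 0.25, 0.125])
print(loss)

# Baseline: a model that guesses uniformly over a hypothetical 50,000-token vocabulary.
vocab_size = 50_000
uniform_loss = next_token_loss([1 / vocab_size] * 3)
print(uniform_loss)  # equals log(vocab_size)
```

A perfect model would have loss 0, and the uniform guesser gives a useful reference point: an untrained model should start near log(vocab_size).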

Backpropagation pushes gradients from the output loss back through every layer, updating all of these weights jointly to make the error smaller. Hyperparameters, in contrast, are things like learning rate, batch size, embedding dimensions, expansion dimensions, number of layers, and number of heads. They define the training procedure and the shape of the network, while weights are what gradient descent fills in.
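
To see how the hyperparameters fix the number of weights, here is a rough parameter counter. It assumes a GPT-2 style layout: learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and an unembedding matrix tied to the token embedding. Other architectures will count differently, and the function name is my own.

```python
def gpt2_style_param_count(vocab_size, context_length, d_model, num_layers, d_ff):
    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model               # learned positions
    # W_Q, W_K, W_V, W_O with biases. The number of heads does not change
    # the count, because the per-head dimensions add up to d_model.
    attention = 4 * (d_model * d_model + d_model)
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                  # scale and bias, twice per block
    per_block = attention + ffn + layer_norms
    final_norm = 2 * d_model
    # The tied unembedding matrix reuses token_emb, so it adds nothing.
    return token_emb + pos_emb + num_layers * per_block + final_norm


# GPT-2 small hyperparameters.
total = gpt2_style_param_count(
    vocab_size=50257, context_length=1024, d_model=768, num_layers=12, d_ff=4 * 768
)
print(f"{total:,}")
```

Changing any hyperparameter changes the shapes and therefore the count. Training never changes the count, only the values inside those shapes.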

Visualization

To understand each step with a specific example, you can use this visualization tool: transformer-explainer