<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sofia</title>
    <description>The latest articles on DEV Community by Sofia (@zhu_sofia_015552d01df4321).</description>
    <link>https://dev.to/zhu_sofia_015552d01df4321</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904860%2F60c78752-9e8d-4bd3-854b-1ab6ec187eb3.png</url>
      <title>DEV Community: Sofia</title>
      <link>https://dev.to/zhu_sofia_015552d01df4321</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zhu_sofia_015552d01df4321"/>
    <language>en</language>
    <item>
      <title>LLM Study Diary #3: PyTorch</title>
      <dc:creator>Sofia</dc:creator>
      <pubDate>Thu, 07 May 2026 04:02:26 +0000</pubDate>
      <link>https://dev.to/zhu_sofia_015552d01df4321/llm-study-diary-3-pytorch-4p62</link>
      <guid>https://dev.to/zhu_sofia_015552d01df4321/llm-study-diary-3-pytorch-4p62</guid>
      <description>&lt;p&gt;Continuing the course: this lesson is mostly about PyTorch.&lt;/p&gt;

&lt;h1&gt;
  
  
  Tensor Basics &amp;amp; Memory
&lt;/h1&gt;

&lt;p&gt;The lesson presents tensors as the core building blocks for parameters, gradients, and optimizer states. The instructor then discusses floating-point representations, including FP32 (full precision), BF16 (brain float, often preferred for deep learning), and the move toward FP8 for efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Float Data Types&lt;/strong&gt;&lt;br&gt;
Several float types are discussed, such as float32, float16, bfloat16, and fp8. Training entirely in float32 requires a lot of memory, while bfloat16 and fp8 save memory and compute at the cost of some numerical stability. Some people mix the two, for example float32 in the attention calculation and float16 in the feed-forward layers. Generally, float32 (also called single precision or full precision) is used for storing parameters and optimizer states during training, to keep training numerically stable.&lt;/p&gt;
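
&lt;p&gt;To make the memory point concrete, here is a small sketch that counts bytes per data type. The 1B parameter count is a made-up number for illustration, and the table only covers parameters, not gradients or optimizer states.&lt;/p&gt;

```python
# Bytes per element for the float formats mentioned above.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}

def tensor_bytes(num_elements, dtype):
    """Memory needed to store num_elements values of the given dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]

# Hypothetical model size, chosen only for illustration.
num_params = 1_000_000_000

for dtype in BYTES_PER_ELEMENT:
    gib = tensor_bytes(num_params, dtype) / 2**30
    print(f"{dtype:9s} {gib:6.2f} GiB for parameters alone")
```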

&lt;h1&gt;
  
  
  Tensor Operations &amp;amp; Einstein Summation
&lt;/h1&gt;

&lt;p&gt;He introduces einops as a more readable and robust alternative to standard PyTorch indexing (e.g., -1, -2), helping developers manage dimensions without confusion. You can think of it as giving each tensor dimension a name. For example, in &lt;code&gt;z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -&amp;gt; batch seq1 seq2")&lt;/code&gt; the output tensor's dimensions are named &lt;code&gt;batch seq1 seq2&lt;/code&gt;, and &lt;code&gt;hidden&lt;/code&gt;, which is missing from the output, is summed over.&lt;/p&gt;
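
&lt;p&gt;To check my understanding of that pattern, I wrote out what it computes with plain Python loops. This is only a sketch of the semantics on nested lists, not how einops implements it; the function name &lt;code&gt;batched_pairwise_dot&lt;/code&gt; is my own.&lt;/p&gt;

```python
def batched_pairwise_dot(x, y):
    """Pure-Python version of the einsum pattern
    'batch seq1 hidden, batch seq2 hidden' giving 'batch seq1 seq2'.

    x has shape (batch, seq1, hidden) and y has shape (batch, seq2, hidden),
    both as nested lists. hidden appears in both inputs but not in the output,
    so it is the dimension that gets summed over.
    """
    z = []
    for xb, yb in zip(x, y):                      # loop over batch
        rows = []
        for xi in xb:                             # loop over seq1
            # One dot product over hidden for every position in seq2.
            rows.append([sum(a * b for a, b in zip(xi, yj)) for yj in yb])
        z.append(rows)
    return z

x = [[[1.0, 2.0], [3.0, 4.0]]]                    # batch=1, seq1=2, hidden=2
y = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]        # batch=1, seq2=3, hidden=2
z = batched_pairwise_dot(x, y)
print(z)  # [[[1.0, 2.0, 3.0], [3.0, 4.0, 7.0]]]
```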

&lt;h1&gt;
  
  
  Compute Accounting (FLOPs)
&lt;/h1&gt;

&lt;p&gt;A deep dive into counting floating-point operations. The instructor establishes the rule of thumb that training requires approximately 6 × parameters × tokens FLOPs (2 × parameters × tokens for the forward pass and 4 × parameters × tokens for the backward pass).&lt;/p&gt;
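
&lt;p&gt;The rule of thumb is easy to turn into a calculator. The model and dataset sizes below are hypothetical numbers I picked for illustration.&lt;/p&gt;

```python
def training_flops(num_params, num_tokens):
    """Approximate training FLOPs using the 6 * parameters * tokens rule."""
    forward = 2 * num_params * num_tokens    # about one multiply and one add per parameter per token
    backward = 4 * num_params * num_tokens   # roughly twice the forward cost
    return forward + backward

# Hypothetical run: 1B parameters trained on 20B tokens.
flops = training_flops(1_000_000_000, 20_000_000_000)
print(f"{flops:.2e} FLOPs")  # 1.20e+20 FLOPs
```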

&lt;blockquote&gt;
&lt;p&gt;Note: If you forgot what the forward pass and backpropagation are, here is a video that walks through the math behind training a simple neural network: &lt;a href="https://www.youtube.com/watch?v=6C5HN_4SkFU" rel="noopener noreferrer"&gt;The Math behind Neural Networks&lt;/a&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Model Building &amp;amp; Optimization
&lt;/h2&gt;

&lt;p&gt;He demonstrates building a simple linear model, implementing custom optimizers like AdaGrad to understand how states persist across steps, and the importance of proper initialization (e.g., Xavier initialization) to maintain numerical stability in deep networks.&lt;/p&gt;
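
&lt;p&gt;Here is a minimal AdaGrad sketch in plain Python to show what "state persists across steps" means. It is not the lecture's PyTorch code: the class layout, the toy objective, and the learning rate are my own assumptions.&lt;/p&gt;

```python
import math

class AdaGrad:
    """Minimal AdaGrad for a flat list of float parameters."""

    def __init__(self, params, lr=0.1, eps=1e-8):
        self.params = params
        self.lr = lr
        self.eps = eps
        # Optimizer state: running sum of squared gradients, one per parameter.
        # It persists across calls to step(), which is the point of the demo.
        self.g2 = [0.0] * len(params)

    def step(self, grads):
        for i, g in enumerate(grads):
            self.g2[i] += g * g
            self.params[i] -= self.lr * g / (math.sqrt(self.g2[i]) + self.eps)

# Toy objective (an assumption for illustration): f(w) = sum of w_i ** 2, gradient 2 * w_i.
w = [1.0, -2.0]
opt = AdaGrad(w, lr=0.5)
for _ in range(100):
    opt.step([2.0 * wi for wi in w])
print(w, opt.g2)
```

&lt;p&gt;The accumulated &lt;code&gt;g2&lt;/code&gt; only grows, so the effective step size for each parameter shrinks over time.&lt;/p&gt;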

&lt;h2&gt;
  
  
  Training Infrastructure
&lt;/h2&gt;

&lt;p&gt;There is practical advice on data loading with &lt;code&gt;memmap&lt;/code&gt; to handle massive datasets (only the part of the data being read is loaded into memory), the importance of checkpointing to prevent progress loss (similar to checkpointing in batch and stream processing systems), and the synergy between hardware constraints and model architecture.&lt;/p&gt;
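
&lt;p&gt;The idea behind memory mapping can be sketched with Python's standard &lt;code&gt;mmap&lt;/code&gt; module. Note this is a stand-in I chose to keep the example dependency-free, not the &lt;code&gt;memmap&lt;/code&gt; API from the lesson; the file name, the 16-bit token format, and &lt;code&gt;read_token_window&lt;/code&gt; are all assumptions.&lt;/p&gt;

```python
import mmap
import os
import struct
import tempfile
from array import array

def read_token_window(path, start, length):
    """Read tokens[start:start+length] from a file of 16-bit token ids
    without loading the whole file into memory."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Each token id occupies 2 bytes, so scale the offset accordingly.
            return list(struct.unpack_from("=%dH" % length, mm, start * 2))
        finally:
            mm.close()

# Build a small stand-in dataset file (hypothetical; real corpora are far larger).
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
with open(path, "wb") as f:
    array("H", range(5000)).tofile(f)

print(read_token_window(path, 1000, 4))  # [1000, 1001, 1002, 1003]
```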

</description>
      <category>deeplearning</category>
      <category>devjournal</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LLM Study Diary #2: Tokenization</title>
      <dc:creator>Sofia</dc:creator>
      <pubDate>Mon, 04 May 2026 22:06:33 +0000</pubDate>
      <link>https://dev.to/zhu_sofia_015552d01df4321/llm-study-diary-2-tokenization-2f6k</link>
      <guid>https://dev.to/zhu_sofia_015552d01df4321/llm-study-diary-2-tokenization-2f6k</guid>
      <description>&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online and all the assignment resources are here: &lt;a href="https://cs336.stanford.edu/" rel="noopener noreferrer"&gt;https://cs336.stanford.edu/&lt;/a&gt;. In this series, I will post summaries and notes starting from lesson 1. &lt;/p&gt;

&lt;h1&gt;
  
  
  Tokenization
&lt;/h1&gt;

&lt;p&gt;Tokenization is the very first step of an LLM pipeline. There are several tokenization algorithms, such as Character-based Tokenization, Byte-based Tokenization, Word-based Tokenization and Byte Pair Encoding (BPE).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Character-based Tokenization&lt;/strong&gt;&lt;br&gt;
Pros: Simple to define by mapping characters to code points.&lt;br&gt;
Cons: Highly inefficient use of vocabulary because some characters are rare, and the compression ratio is suboptimal compared to more advanced methods. &lt;br&gt;
&lt;strong&gt;Byte-based Tokenization&lt;/strong&gt;&lt;br&gt;
Pros: Uses a very small, fixed vocabulary (256 values, indices 0-255), avoiding sparsity issues.&lt;br&gt;
Cons: Leads to very long sequences because the compression ratio is effectively 1:1 (one byte per token), which makes model training computationally expensive due to the quadratic nature of attention.&lt;br&gt;
&lt;strong&gt;Word-based Tokenization&lt;/strong&gt;&lt;br&gt;
Pros: Captures semantic units through splitting strings by whitespace or regex.&lt;br&gt;
Cons: Results in an unbounded vocabulary size; it struggles with rare or unseen words, often necessitating an "UNK" (unknown) token which creates significant challenges for model training and evaluation.&lt;/p&gt;
&lt;/blockquote&gt;
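
&lt;p&gt;A quick way to see the three schemes side by side is to tokenize one short string each way. The sample string and the whitespace-only word splitter are simplifications I chose for illustration.&lt;/p&gt;

```python
text = "hi \U0001F30D"   # "hi" followed by a globe emoji

# Character-based: one token per Unicode code point.
char_tokens = [ord(ch) for ch in text]

# Byte-based: one token per UTF-8 byte, so every id fits in range(256).
byte_tokens = list(text.encode("utf-8"))

# Word-based (simplified): split on whitespace.
word_tokens = text.split()

num_bytes = len(text.encode("utf-8"))
for name, tokens in [("char", char_tokens), ("byte", byte_tokens), ("word", word_tokens)]:
    # Bytes per token is the compression ratio mentioned above.
    print(name, tokens, "bytes per token:", num_bytes / len(tokens))
```

&lt;p&gt;The emoji is a single character token with a very large id, but four separate byte tokens.&lt;/p&gt;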

&lt;h2&gt;
  
  
  BPE
&lt;/h2&gt;

&lt;p&gt;BPE offers the best trade-off among these approaches. Here is how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert to Bytes: First, represent the input string as a sequence of bytes (integers). This ensures every character, even rare ones, can be represented.&lt;/li&gt;
&lt;li&gt;Count Frequencies: Scan the entire corpus to count the frequency of all adjacent pairs of bytes or existing tokens.&lt;/li&gt;
&lt;li&gt;Merge the Most Frequent: Identify the pair that appears most often and merge them into a new, single token. Add this new token to your vocabulary.&lt;/li&gt;
&lt;li&gt;Repeat: Repeat the process of counting and merging for a set number of iterations or until a desired vocabulary size is reached. This process allows the model to adaptively represent common sequences as single tokens and rare ones as multiple smaller components.&lt;/li&gt;
&lt;/ol&gt;
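
&lt;p&gt;The four steps above can be sketched in a few lines of Python. This is a toy version for understanding only: it has no pre-tokenization or special tokens, ties are broken by first occurrence, and the function names are my own.&lt;/p&gt;

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i != len(ids):
        if tuple(ids[i:i + 2]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Return (token ids, merges, vocab) after up to num_merges merges."""
    ids = list(text.encode("utf-8"))              # step 1: convert to bytes
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}
    for step in range(num_merges):                # step 4: repeat
        counts = Counter(zip(ids, ids[1:]))       # step 2: count adjacent pairs
        if not counts:
            break
        pair = max(counts, key=counts.get)        # step 3: most frequent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return ids, merges, vocab

def decode(ids, vocab):
    """Map token ids back to the original string."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")

ids, merges, vocab = train_bpe("the cat in the hat", 3)
print(ids, merges)
print(decode(ids, vocab))
```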

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Efficiency&lt;/em&gt;: BPE is effective because it learns the statistics of your specific data set, rather than relying on predefined word boundaries.&lt;br&gt;
&lt;em&gt;Robustness&lt;/em&gt;: Unlike word-based tokenization, BPE handles unknown or rare words gracefully because it can always fall back to individual characters or smaller sub-word units, avoiding the need for "UNK" tokens.&lt;br&gt;
&lt;em&gt;Historical Context&lt;/em&gt;: Originally a data compression algorithm from 1994, it was adopted for NLP to improve neural machine translation and eventually became a standard backbone for models like GPT-2 and beyond. &lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>devjournal</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
    <item>
      <title>LLM Study Diary #1: Transformer</title>
      <dc:creator>Sofia</dc:creator>
      <pubDate>Fri, 01 May 2026 05:27:57 +0000</pubDate>
      <link>https://dev.to/zhu_sofia_015552d01df4321/llm-study-dairy-1-transformer-59ip</link>
      <guid>https://dev.to/zhu_sofia_015552d01df4321/llm-study-dairy-1-transformer-59ip</guid>
      <description>&lt;h1&gt;
  
  
  About Me
&lt;/h1&gt;

&lt;p&gt;I have been working as a software engineer for almost 8 years, mostly on backend and infra, including distributed systems, nearline processing, batch processing, etc. I learned some basic ML in school but have no experience with complicated ML use cases. This series records what I learn about LLMs as a general software engineer. Feel free to comment if anything seems wrong, and leave your questions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Transformer
&lt;/h1&gt;

&lt;p&gt;This is a good source to understand each component in the transformer: &lt;a href="https://huggingface.co/blog/not-lain/tensor-dims" rel="noopener noreferrer"&gt;Mastering Tensor Dimensions in Transformers&lt;/a&gt;. Decoder-only models (GPT family, Llama, Claude) are used for generation. Encoder-decoder models (BART, the original "Attention Is All You Need" Transformer) handle translation and summarization. Encoder-only models like BERT are used for classification and embeddings.&lt;/p&gt;

&lt;p&gt;Here we focus on decoder-only LLMs. To summarize the architecture, the transformer block has two main components: Masked Multi-Head Attention (MMHA) and Feed Forward Network (FFN).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5nak3qjeslsbaraqc8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5nak3qjeslsbaraqc8q.png" alt=" " width="800" height="1056"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Masked Multi-Head Attention (MMHA)
&lt;/h2&gt;

&lt;p&gt;The attention formula contains query (Q), key (K), and value (V). &lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Attention(Q,K,V)=softmax(QKTdk)V
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Attention&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;softmax&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;&lt;span class="delimsizing size3"&gt;(&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;Q&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;&lt;span class="delimsizing size3"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Q (Query) → what this token is looking for&lt;br&gt;
K (Key) → what this token offers / represents&lt;br&gt;
V (Value) → the actual content to retrieve&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the attention calculation, the softmax turns the scaled Q/K scores into attention weights, and the output is the weighted sum of the values. The intuition for training and inference is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For training, everyone asks questions (Q) at the same time about everyone else (K/V), with masking;&lt;/p&gt;

&lt;p&gt;For inference, only the newest token asks a question (Q), using stored memory (K/V) from the past.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Q/K/V in Training vs Inference
&lt;/h3&gt;

&lt;p&gt;In training, because the full training sequence is available, the model can process all token positions in parallel. For each transformer layer, Q, K, and V are computed from the same input sequence of hidden states. A causal mask prevents each position from attending to future tokens.&lt;/p&gt;
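
&lt;p&gt;Here is a single-head sketch of the attention formula with a causal mask, in plain Python. Real models use batched tensors and learned projections to produce Q, K, and V; the vectors below are hypothetical values I made up.&lt;/p&gt;

```python
import math

def softmax(scores):
    m = max(scores)                               # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of vectors, one per token position.
    """
    d_k = len(K[0])
    outputs = []
    for i, q in enumerate(Q):
        # Causal mask: position i only sees keys and values at positions 0..i.
        keys, values = K[: i + 1], V[: i + 1]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(Q, K, V)
print(out[0])  # [1.0, 2.0], because the first token can only attend to itself
```

&lt;p&gt;Changing the last token's K/V leaves the outputs of earlier positions untouched, which is exactly what the mask guarantees.&lt;/p&gt;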

&lt;p&gt;In inference, there are two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prefill phase:&lt;br&gt;
The model processes the whole prompt. Q, K, and V are all computed from the prompt tokens. The model stores/caches only K/V for future generation. Q is used temporarily during the prompt forward pass and then discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Decode/generation phase:&lt;br&gt;
For each newly generated token, the model computes Q/K/V for that new token. The new token's Q attends to the cached K/V from the prompt and previously generated tokens, together with its own K/V, which are appended to the KV cache for future tokens.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
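
&lt;p&gt;The two phases can be sketched with a tiny cache class. This is my own toy illustration with made-up vectors and a hypothetical &lt;code&gt;KVCache&lt;/code&gt; class, not a real inference engine; the new token's K/V are appended before attending, so the token also attends to itself as it would under the causal mask in training.&lt;/p&gt;

```python
import math

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, K, V):
        # Prompt pass: store K/V for all prompt tokens; Q is not kept.
        self.keys.extend(K)
        self.values.extend(V)

    def decode_step(self, q, k, v):
        # Append the new token's K/V, then let its Q attend to everything cached.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Hypothetical Q/K/V vectors; in a real model they come from learned projections.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q_new = [0.5, -0.5]

cache = KVCache()
cache.prefill(K[:2], V[:2])                  # prompt = first two tokens
cached = cache.decode_step(q_new, K[2], V[2])
recomputed = attend(q_new, K, V)             # same result without a cache
print(cached == recomputed)  # True
```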

&lt;h3&gt;
  
  
  KV Caching in Inference
&lt;/h3&gt;

&lt;p&gt;The same author has another post about KV caching &lt;a href="https://huggingface.co/blog/not-lain/kv-caching?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;KV Caching Explained: Optimizing Transformer Inference Efficiency&lt;/a&gt;. Without caching, K/V for every past token would have to be recomputed every step — a waste, since they don't change. KV caching stores them so each new step only computes Q/K/V for the &lt;em&gt;current&lt;/em&gt; token and reuses the rest, which speeds up inference substantially.&lt;/p&gt;

&lt;p&gt;As mentioned above, inference has two distinct phases: &lt;strong&gt;prefill&lt;/strong&gt; (processing the prompt, where all prompt tokens compute Q in parallel just like training) and &lt;strong&gt;decode&lt;/strong&gt; (autoregressive generation, one token at a time). This split is a foundational concept for inference systems: it drives latency characteristics, batching strategy, and how the KV cache gets populated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feed Forward Network (FFN)
&lt;/h2&gt;

&lt;p&gt;This is an expand → nonlinearity → contract process.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;FFN(x)=σ(xW1+b1)W2+b2
\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;FFN&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;σ&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;b&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;
&lt;span class="katex-element"&gt;$W_1$&lt;/span&gt; -&amp;gt; expansion weights&lt;br&gt;
&lt;span class="katex-element"&gt;$W_2$&lt;/span&gt; -&amp;gt; contraction weights&lt;/p&gt;

&lt;p&gt;It’s like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Expand = generate many candidate features&lt;br&gt;
Activate = choose which matter&lt;br&gt;
Contract = compress back into the residual stream&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What is the target expansion dimension?&lt;/strong&gt;&lt;br&gt;
This is a &lt;strong&gt;hyperparameter&lt;/strong&gt;, but not arbitrary. The standard rule of thumb is 4x the model dimension, as used in GPT-3.&lt;/p&gt;
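
&lt;p&gt;Here is the expand, activate, contract flow as a plain-Python sketch. I am assuming ReLU for the nonlinearity and using hand-picked weights so the numbers are easy to follow; real models learn these weights and often use other activations.&lt;/p&gt;

```python
def relu(v):
    return max(0.0, v)

def linear(x, W, b):
    """Compute x @ W + b for a vector x and a matrix W given as a list of rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    hidden = [relu(h) for h in linear(x, W1, b1)]   # expand, then activate
    return linear(hidden, W2, b2)                   # contract back to d_model

d_model = 2
d_ff = 4 * d_model                                  # the 4x rule of thumb

# Hypothetical weights, fixed by hand so the result is easy to follow.
W1 = [[1.0 if j % d_model == i else 0.0 for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.25 if j == i % d_model else 0.0 for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model

y = ffn([2.0, -1.0], W1, b1, W2, b2)
print(y)  # [2.0, 0.0]: the negative feature was zeroed by the activation
```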
&lt;h2&gt;
  
  
  Weights vs Hyperparameters
&lt;/h2&gt;

&lt;p&gt;The transformer learns (tunes) the following weights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention projections: &lt;span class="katex-element"&gt;$W_Q$&lt;/span&gt;, &lt;span class="katex-element"&gt;$W_K$&lt;/span&gt;, &lt;span class="katex-element"&gt;$W_V$&lt;/span&gt; (per head) and the attention output projection &lt;span class="katex-element"&gt;$W_O$&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;Feed-forward weights and biases: &lt;span class="katex-element"&gt;$W_1$&lt;/span&gt;, &lt;span class="katex-element"&gt;$b_1$&lt;/span&gt;, &lt;span class="katex-element"&gt;$W_2$&lt;/span&gt;, &lt;span class="katex-element"&gt;$b_2$&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Token + positional embeddings (positional only if learned, e.g. GPT-2; RoPE has no learned params)&lt;/li&gt;
&lt;li&gt;LayerNorm scale/bias (γ, β)&lt;/li&gt;
&lt;li&gt;Final output / unembedding matrix (often tied with the input embedding)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Loss Function&lt;/strong&gt;&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L=−1T∑t=1Tlog⁡P(xt+1∣x≤t)
L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_{t+1} \mid x_{\le t})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 
mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;∣&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mrel mtight"&gt;≤&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
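
&lt;p&gt;To see the loss as numbers, here is a small sketch that averages the negative log-probability of the correct next tokens. The probabilities and targets are made-up values; a real model would produce them with a softmax over its vocabulary.&lt;/p&gt;

```python
import math

def next_token_loss(probs, targets):
    """Average negative log-likelihood of the correct next tokens.

    probs[t] is the model's predicted distribution over the vocabulary at
    position t, and targets[t] is the id of the token that actually came next.
    """
    T = len(targets)
    return -sum(math.log(probs[t][targets[t]]) for t in range(T)) / T

# Hypothetical predictions over a 3-token vocabulary for T = 2 positions.
probs = [[0.5, 0.25, 0.25],
         [0.1, 0.1, 0.8]]
targets = [0, 2]
loss = next_token_loss(probs, targets)
print(loss)
```

&lt;p&gt;A model that guesses uniformly over a vocabulary of size V gets a loss of log(V), which is a handy sanity check at the start of training.&lt;/p&gt;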


&lt;p&gt;Backpropagation pushes gradients from the output loss back through every layer, updating all of these weights jointly to make the error smaller. Hyperparameters, in contrast, are things like learning rate, batch size, embedding dimension, expansion dimension, number of layers, and number of heads. They define the &lt;em&gt;shape&lt;/em&gt; of the network and how it is trained, while weights are what gradient descent fills in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualization
&lt;/h2&gt;

&lt;p&gt;To understand each step with a specific example, you can use this visualization tool: &lt;a href="https://poloclub.github.io/transformer-explainer/" rel="noopener noreferrer"&gt;transformer-explainer&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>devjournal</category>
    </item>
  </channel>
</rss>
