Table of contents
- Introduction to FlashAttention
- Vanilla self-attention mechanism without FlashAttention
- Conceptual Overview: Fused Tiled Attention
- FlashAttention by hand
- Walkthrough of the FlashAttention diagram
- Appendix A - How to derive activation matrices from weight matrices
- Appendix B - The Paper's pseudocode vs. this walkthrough
Introduction to FlashAttention
FlashAttention builds upon the principles of online softmax to create an efficient single-pass algorithm for the entire self-attention mechanism. Its key innovation is to compute the final output O (where O = softmax(QK^T) · V) directly, without ever forming the full attention matrix A. This is achieved by fusing the matrix multiplications (QK^T and A · V) and the online softmax calculation into a single GPU kernel.
By avoiding the need to write and read the large L x L attention matrix (where L represents the number of input tokens) to and from global memory (DRAM), FlashAttention significantly reduces memory access, which is often the primary bottleneck in attention calculations. This makes it substantially faster and more memory-efficient, especially for long sequences.
This explanation will walk through the FlashAttention algorithm by hand, using the same tiled approach from the online softmax example in my previous article. This walkthrough is based on "From Online Softmax to FlashAttention" by Ye Zihao (2023).
This article was written with the assistance of Google Gemini 2.5 Pro.
Vanilla self-attention mechanism without FlashAttention
In order to better appreciate the efficiency gains with FlashAttention, we first begin with a vanilla implementation of self-attention. Let's illustrate this with our familiar 1 x 6
example used in the previous article on online softmax. Imagine this vector represents the dot products of a single query vector q
with six key vectors k
.
- Dot Products (Logits): X = [1, 2, 3, 6, 2, 1]
We also need a corresponding V
matrix. For simplicity, let's assume V
has 6 rows (one for each key) and a dimension of 2.
- Value Matrix: V = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]] (row i is v_i)
The standard process is broken down into three distinct, sequential stages:
- Calculate Logits: Compute X = Q · K^T.
- Calculate Attention Scores: Compute A = softmax(X). This creates a dense attention score matrix.
- Calculate Final Output: Compute O = A · V.
Since the logits X
are already given, our walkthrough will start from stage 2, calculating the attention scores.
(Note: Q
, K
and V
here are activation matrices which are the result of multiplying the input embeddings by the weight matrices W_Q
, W_K
and W_V
. These activation matrices are different for every input sequence and are the actual inputs to the attention calculation kernel that FlashAttention replaces. Refer to Appendix A to understand how these activation matrices are derived from the weight matrices.)
Step 1: Calculate the Full Attention Score Matrix (A = softmax(X))
This step computes the softmax function over the entire logit vector X
. This is a multi-pass process, requiring a pass to find the maximum, a pass to compute the denominator, and a pass to compute the final probabilities. The key difference from FlashAttention is that we must compute and store this entire attention vector A
before we can even begin to use the V
matrix.
1a. Find the Global Maximum (m)
First, we scan the entire logit vector X to find its maximum value for numerical stability (the "safe softmax" trick).
m = max(1, 2, 3, 6, 2, 1) = 6
1b. Compute the Exponentials and the Denominator (d)
Next, we subtract the maximum from each logit, exponentiate the result, and sum them all up to get the denominator.
- Subtract max: X - m = [1-6, 2-6, 3-6, 6-6, 2-6, 1-6] = [-5, -4, -3, 0, -4, -5]
- Exponentiate: e^(X-m) = [0.0067, 0.0183, 0.0498, 1, 0.0183, 0.0067]
- Sum to get the denominator d: d = 0.0067 + 0.0183 + 0.0498 + 1 + 0.0183 + 0.0067 ≈ 1.0998
1c. Normalize to get the Final Attention Vector A
Finally, we divide the exponentiated values by the denominator d to get the final probabilities. The result is the complete attention score vector A.
A = [0.0067, 0.0183, 0.0498, 1, 0.0183, 0.0067] / 1.0998 ≈ [0.0061, 0.0167, 0.0453, 0.9092, 0.0167, 0.0061]
At this point, the 1 x 6 vector A is fully computed and stored in memory.
Step 2: Multiply Attention Scores by the Value Matrix (O = A · V)
Now that we have the attention scores, we can perform the final step: a matrix multiplication between A
and V
. The attention scores in A
act as weights for the corresponding value vectors in V
. The output O
is the weighted sum of the value vectors.
The operation is O = A · V = Σ_i A_i · v_i.
Let's calculate the weighted sum:
O = 0.0061·[1, 1] + 0.0167·[2, 2] + 0.0453·[3, 3] + 0.9092·[4, 4] + 0.0167·[5, 5] + 0.0061·[6, 6]
Now we compute each term and then sum them up:
O = [0.0061, 0.0061] + [0.0333, 0.0333] + [0.1358, 0.1358] + [3.6367, 3.6367] + [0.0833, 0.0833] + [0.0368, 0.0368]
Summing the vectors component-wise:
O ≈ [3.932, 3.932]
This result, [3.932, 3.932], is the final output vector O, the weighted combination of all the value vectors.
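To make the three stages concrete, here is a minimal NumPy sketch of the vanilla path on the toy numbers above (the variable names are my own, not from any library). It materializes the full attention vector `A` before touching `V`, which is exactly the memory traffic FlashAttention is designed to avoid.

```python
import numpy as np

# Toy inputs from the walkthrough: one query's logits against six keys,
# and a 6 x 2 value matrix.
X = np.array([1.0, 2.0, 3.0, 6.0, 2.0, 1.0])          # logits x_i = q . k_i
V = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
              [4.0, 4.0], [5.0, 5.0], [6.0, 6.0]])

# Stage 2: safe softmax over the whole logit vector (multi-pass).
m = X.max()                      # pass 1: global maximum
e = np.exp(X - m)                # pass 2: exponentials
d = e.sum()                      #         denominator
A = e / d                        # pass 3: the full attention vector is stored

# Stage 3: only now can we use V.
O = A @ V
print(A.round(4))                # [0.0061 0.0167 0.0453 0.9092 0.0167 0.0061]
print(O.round(3))                # [3.932 3.932]
```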
Comparison and Key Differences
- Intermediate Matrix: The standard method materialized the full 1 x 6 attention vector `A`. In a real-world scenario with a sequence length `L`, this would be a large L x L matrix. This is the main bottleneck. I will demonstrate how FlashAttention computes the final output without ever creating or storing this matrix.
- Memory Access: The standard method requires at least two major memory operations: writing the entire `A` matrix to memory (DRAM), and then reading it all back in to multiply with `V`. I will show how FlashAttention fuses these operations, keeping tiles of `Q`, `K`, and `V` in fast SRAM and avoiding the slow roundtrip to DRAM.
- Computation Flow: The process is strictly sequential. You cannot start the multiplication until the entire softmax calculation for `A` is complete. We will show how FlashAttention integrates these steps, updating a running output vector as it iterates through tiles of the `K` and `V` matrices.
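To put rough numbers on that bottleneck, here is a back-of-the-envelope sketch; the sequence length, dtype, block size, and head dimension are hypothetical values chosen only for illustration.

```python
# Hypothetical sizes: sequence length 4096, fp16 elements (2 bytes each).
L, bytes_per_elem = 4096, 2
attn_matrix_bytes = L * L * bytes_per_elem
print(attn_matrix_bytes / 2**20, "MiB")   # 32.0 MiB written to and read from DRAM

# A FlashAttention-style tile of K and V (block of 128 rows, head dim 64, fp16).
block, head_dim = 128, 64
tile_bytes = 2 * block * head_dim * bytes_per_elem
print(tile_bytes / 2**10, "KiB")          # 32.0 KiB kept in fast SRAM instead
```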
Conceptual Overview: Fused Tiled Attention
FlashAttention processes the input matrices Q
, K
, and V
in a tiled manner. For each row of the output matrix O
, it iterates through the corresponding rows of K
and V
in blocks. At each step, it calculates the attention scores for just that block, updates the running statistics (the maximum and the denominator, just like in online softmax), and immediately applies these scores to the corresponding block of V
to update a running output vector.
The core idea is to maintain three running statistics for each row of the output:
- `m_running`: The running maximum of the dot products (q · k).
- `d_running`: The running denominator of the softmax.
- `o_running`: The running output vector, which is a weighted sum of the `V` vectors, scaled by the current (and incomplete) softmax probabilities.
We will use the exact same input data as before with the 1 x 6
example. This vector represents the dot products of a single query vector q
with six key vectors k
.
- Dot Products (Logits): X = [1, 2, 3, 6, 2, 1]
We also need a corresponding V
matrix. For simplicity, let's assume V
has 6 rows (one for each key) and a dimension of 2.
- Value Matrix: V = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]] (row i is v_i)
We will process this with a tile size of 3, breaking X
and V
into two tiles:
- Tile 1: Logits [1, 2, 3] and Values v_1, v_2, v_3
- Tile 2: Logits [6, 2, 1] and Values v_4, v_5, v_6
The algorithm follows the logic of "Algorithm FlashAttention (Tiling)" on page 6 of "From Online Softmax to FlashAttention" by Ye Zihao (2023).
(Note: For intuitive clarity, this walkthrough maintains an un-normalized running output o_running
and performs a single normalization at the end. The paper's pseudocode maintains a normalized running output o'
at each step. The final result is mathematically identical. An explanation of why this is true is in Appendix B.)
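Before working through the numbers by hand, here is a minimal sketch of what one tile contributes to the running state. It follows the un-normalized bookkeeping described in the note above; the function name `update_tile` and the variable names are this walkthrough's, not the paper's.

```python
import numpy as np

def update_tile(m_running, d_running, o_running, logits_tile, v_tile):
    """Fold one tile of logits and value rows (NumPy arrays) into the running state.

    Keeps o_running *un-normalized*, as in this walkthrough; the single
    division by d_running happens only after the last tile.
    """
    m_new = max(m_running, logits_tile.max())   # new overall maximum
    scores = np.exp(logits_tile - m_new)        # un-normalized scores for this tile
    scale = np.exp(m_running - m_new)           # rescaling factor for the old state
    d_new = d_running * scale + scores.sum()
    o_new = o_running * scale + scores @ v_tile
    return m_new, d_new, o_new
```

Calling it once per tile and dividing `o_running` by `d_running` after the last call reproduces, step for step, the numbers computed by hand below.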
FlashAttention by hand
Before the main loop begins, the running statistics are initialized.
- `m_running = -∞`
- `d_running = 0`
- `o_running = [0, 0]` (a zero vector of the same dimension as a `v` vector)
Link back to the FlashAttention algorithm
These initial values correspond to the state before the for loop, where m_0 = -∞, d_0 = 0, and o'_0 = [0, 0].
Step 1: Process Tile 1 (i=1)
We now begin the first iteration of the for i ← 1, #tiles do
loop.
1a. Find the New Maximum
First, we calculate the dot products (logits) for this tile and find the maximum value.
- Logits for Tile 1: x_1 = [q·k_1, q·k_2, q·k_3] = [1, 2, 3]
- Local max of Tile 1: m_local = max(1, 2, 3) = 3
- New overall maximum: m_new = max(m_running, m_local) = max(-∞, 3) = 3
Link back to the FlashAttention algorithm
- The calculation of logits corresponds to the first line inside the loop: x_i ← Q[k,:] K^T[:, (i-1)b : ib]
- Finding the local max corresponds to the second line: m_local = max_j(x_i[j])
- Updating the running max corresponds to the third line: m_i ← max(m_{i-1}, m_local)
(Here, m_{i-1} is our `m_running` from before the step.)
1b. Calculate Local Denominator and Local Output
Next, we compute the un-normalized attention scores for this tile using m_new
, and then use them to calculate a local denominator d_local
and a local weighted output o_local
.
- Un-normalized scores: [e^(1-3), e^(2-3), e^(3-3)] = [0.1353, 0.3679, 1]
- Local denominator: d_local = 0.1353 + 0.3679 + 1 = 1.5032
- Local output (un-normalized weighted sum of V vectors): o_local = 0.1353·[1, 1] + 0.3679·[2, 2] + 1·[3, 3] = [3.8711, 3.8711]
Link back to the FlashAttention algorithm
- The calculation of d_local corresponds to the summation part of the denominator update rule: Σ_j e^(x_i[j] - m_i)
- The calculation of o_local corresponds to the numerator of the second term in the output update rule (before normalization): Σ_j e^(x_i[j] - m_i) · V[j + (i-1)b, :]
1c. Update Running Statistics
Now, we update our global running statistics.
- `m_old = -∞`, `d_old = 0`, `o_old = [0, 0]`
- `m_new = 3`
- `d_local = 1.5032`, `o_local = [3.8711, 3.8711]`

Update the denominator: d_running = d_old · e^(m_old - m_new) + d_local = 0 + 1.5032 = 1.5032
Update the output: o_running = o_old · e^(m_old - m_new) + o_local = [0, 0] + [3.8711, 3.8711] = [3.8711, 3.8711]

After processing the first tile, our statistics are: m_running = 3, d_running ≈ 1.5032, o_running ≈ [3.8711, 3.8711].
Link back to the FlashAttention algorithm
This entire step corresponds to the final two lines of the loop body for i=1
.
- Denominator update: d_i ← d_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i)
  Since d_0 = 0, the first term is zero, and d_1 becomes just the local sum, matching our result.
- Output update: o'_i ← o'_{i-1} · (d_{i-1} · e^(m_{i-1} - m_i)) / d_i + (Σ_j e^(x_i[j] - m_i) · V[j + (i-1)b, :]) / d_i
  Similarly, since d_0 and o'_0 are zero, the first term vanishes. Our un-normalized `o_running` is equivalent to o'_1 · d_1, which is simply the numerator of the second term, matching our calculation of o_local.
Step 2: Process Tile 2 (i=2)
We now proceed to the second and final iteration of the loop.
2a. Find the New Maximum
- Logits for Tile 2: x_2 = [q·k_4, q·k_5, q·k_6] = [6, 2, 1]
- Local max of Tile 2: m_local = max(6, 2, 1) = 6
- New overall maximum: m_new = max(m_running, m_local) = max(3, 6) = 6
Link back to the FlashAttention algorithm
This again maps to the first three lines of the loop body for i=2
. This time, the local max (6) is greater than m_1 = 3, so the new maximum is correctly found as m_2 = 6.
2b. Calculate Local Denominator and Local Output
We repeat the process for the second tile's data, using the new max, 6.
- Un-normalized scores: [e^(6-6), e^(2-6), e^(1-6)] = [1, 0.0183, 0.0067]
- Local denominator: d_local = 1 + 0.0183 + 0.0067 = 1.025
- Local output: o_local = 1·[4, 4] + 0.0183·[5, 5] + 0.0067·[6, 6] = [4.1317, 4.1317]
Link back to the FlashAttention algorithm
This again corresponds to the summation parts of the update rules for i=2
.
2c. Update Running Statistics
Now we perform the final update, including the crucial rescaling step.
- `m_old = 3`, `d_old ≈ 1.5032`, `o_old ≈ [3.8711, 3.8711]`
- `m_new = 6`
- `d_local ≈ 1.025`, `o_local ≈ [4.1317, 4.1317]`

First, update the denominator:
d_running = d_old · e^(m_old - m_new) + d_local = 1.5032 · e^(3-6) + 1.025 ≈ 0.0748 + 1.025 ≈ 1.0998
Next, update the output vector:
o_running = o_old · e^(m_old - m_new) + o_local = [3.8711, 3.8711] · e^(3-6) + [4.1317, 4.1317] ≈ [0.1927, 0.1927] + [4.1317, 4.1317] ≈ [4.3244, 4.3244]
Link back to the FlashAttention algorithm
This step corresponds to the full update rules for i=2
.
- Denominator update: d_2 ← d_1 · e^(m_1 - m_2) + Σ_j e^(x_2[j] - m_2)
  The term d_1 · e^(m_1 - m_2) is the crucial rescaling factor, which perfectly matches the d_old · e^(3-6) part of our calculation.
- Output update: o'_2 ← o'_1 · (d_1 · e^(m_1 - m_2)) / d_2 + (Σ_j e^(x_2[j] - m_2) · V[j + b, :]) / d_2
  Again, our un-normalized `o_running` is equivalent to o'_2 · d_2. If you multiply the paper's update rule by d_2, you get o'_2 · d_2 = o'_1 · d_1 · e^(m_1 - m_2) + Σ_j e^(x_2[j] - m_2) · V[j + b, :]. This exactly matches our formula: o_running = o_old · e^(m_old - m_new) + o_local.
Final Result
After the loop finishes, we have the final un-normalized output and the final denominator.
- `d_final ≈ 1.0998`
- `o_final ≈ [4.3244, 4.3244]`

The last step is to normalize the output vector by dividing by the final denominator:
O = o_final / d_final = [4.3244, 4.3244] / 1.0998 ≈ [3.932, 3.932]
This matches the output we obtained with the vanilla three-stage computation.
Link back to the FlashAttention algorithm
This final normalization step is implicitly the result of the algorithm. The final output of the loop, o'_N (here o'_2), is the correctly normalized output row. Our method simply defers this division to the very end for clarity. The final output vector in the algorithm is this final, normalized value.
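As a numerical sanity check, here is a short sketch that runs the same two-tile loop end to end on the walkthrough's data; the tile size of 3 and the names mirror this article rather than the paper's pseudocode.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 6.0, 2.0, 1.0])            # logits for one query
V = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
              [4.0, 4.0], [5.0, 5.0], [6.0, 6.0]])
tile = 3

m_running, d_running, o_running = float("-inf"), 0.0, np.zeros(2)

for start in range(0, len(X), tile):                     # one iteration per tile
    x_tile, v_tile = X[start:start + tile], V[start:start + tile]
    m_new = max(m_running, x_tile.max())                 # new overall maximum
    scores = np.exp(x_tile - m_new)                      # un-normalized scores
    scale = np.exp(m_running - m_new)                    # rescale the old state
    d_running = d_running * scale + scores.sum()
    o_running = o_running * scale + scores @ v_tile      # un-normalized output
    m_running = m_new

print((o_running / d_running).round(3))                  # [3.932 3.932]
```

The printed result matches both the hand calculation above and the vanilla three-stage computation.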
Walkthrough of the FlashAttention diagram
Having completed the step-by-step walkthrough of the FlashAttention algorithm, we can now walk through the FlashAttention schematic from the original paper ["FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"](https://arxiv.org/abs/2205.14135) by Tri Dao et al. (2022).
High-Level Overview
The diagram shows a tiled computation where the goal is to compute one block of the final output matrix, softmax(QK^T)V
, at a time. The loops control which tiles (blocks) of the input matrices (Q
, K
, V
) are loaded from slow global memory (HBM) into fast local memory (SRAM) to perform a piece of the calculation.
Our walkthrough, where we computed a single output row vector, corresponds to one full pass of the "Inner Loop" in this diagram.
Mapping the Diagram Components to Our Walkthrough
Let's look at each element of the diagram and connect it to our example.
1. The Loops
- Inner Loop (Blue Arrows): This loop iterates over the queries (rows of `Q`) and the corresponding output rows. In our example, we only had a single query vector `q` that produced a single output row `o`. Therefore, our entire walkthrough represents a single iteration of the Inner Loop.
  - The `Q` matrix block being copied is our single query `q`.
  - The `Output to HBM` block at the bottom is our `o_running` vector.
- Outer Loop (Red Arrows): This loop iterates over the key-value pairs (columns of `K^T` and rows of `V`). This loop is the core of the FlashAttention mechanism and maps directly to the steps of our walkthrough.
  - Iteration 1 of the Outer Loop corresponds to our "Step 1: Process Tile 1".
  - Iteration 2 of the Outer Loop corresponds to our "Step 2: Process Tile 2".
2. The Memory Hierarchy (SRAM vs. HBM)
- HBM (High-Bandwidth Memory): This represents the GPU's main global memory (DRAM). This is where the full, large `Q`, `K`, `V`, and final `O` matrices reside.
- SRAM (Fast On-Chip Memory): This is the small but extremely fast "workbench" memory. The diagram shows that we only ever copy small blocks (orange squares) into SRAM to work on them. In our walkthrough, the SRAM would hold:
  - Our query vector `q`.
  - The current block of keys and values we are processing (e.g., k_1, k_2, k_3 and v_1, v_2, v_3 for Tile 1).
  - The running statistics: `m_running`, `d_running`, and `o_running`.
3. Putting it all Together: Tracing Our Walkthrough on the Diagram
Let's trace the flow for our single query q
.
Initialization:
- Before the loops start, the
Output to HBM
block (ouro_running
vector) is initialized to[0, 0]
. The running statisticsm_running = -∞
andd_running = 0
are initialized in SRAM.
Inner Loop Begins (One and only one iteration for our example):
- `Copy` from `Q`: Our query vector `q` is loaded from HBM into SRAM. It will stay in SRAM for the entire duration of the Outer Loop.
Outer Loop - Iteration 1 (Our "Process Tile 1"):
- `Copy` from `K^T` and `V`: The first block of `K^T` (corresponding to logits `1, 2, 3`) and the first block of `V` (`v_1, v_2, v_3`) are loaded from HBM into SRAM.
- `Compute Block on SRAM`: This is the central computation.
  - The dot product is calculated.
  - The local max, local denominator, and local output are computed.
  - The running statistics `m_running`, `d_running`, and `o_running` (which live in SRAM) are updated. After this step, `o_running` is [3.8711, 3.8711]. The `+` sign with the purple dotted arrow signifies this update step.
Outer Loop - Iteration 2 (Our "Process Tile 2"):
- `Copy` from `K^T` and `V`: The previous blocks of `K^T` and `V` are discarded. The second block (logits `6, 2, 1` and values `v_4, v_5, v_6`) is loaded into SRAM.
- `Compute Block on SRAM`: The computation is repeated.
  - A new, larger global maximum (`m_new = 6`) is found.
  - This triggers the crucial rescaling of the existing `d_running` and `o_running` vectors.
  - The local contributions are calculated and added. After this step, the final un-normalized `o_running` is [4.3244, 4.3244].
Outer Loop Finishes:
- The loop is complete. The final normalization is performed in SRAM (o_running / d_running = [4.3244, 4.3244] / 1.0998) to get the final output vector [3.932, 3.932].
Output to HBM:
- The final, correct output vector is written from SRAM back to its designated row in the main output matrix in HBM.
Inner Loop Finishes:
- If there are more rows in
Q
, the Inner Loop will continue to the next row and repeat the entire Outer Loop process.
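To connect the diagram's two loops to code, here is a hedged sketch that extends the single-query loop to a full `Q` matrix: the outer loop streams blocks of `K` and `V` (the red arrows), while every query row's running statistics are updated inside it (the work the blue-arrow loop distributes across rows). This is a plain NumPy simulation of the access pattern, not a real fused GPU kernel, and the function name is made up for this article.

```python
import numpy as np

def flash_attention_sketch(Q, K, V, block=3):
    """Tiled attention for all rows of Q without materializing softmax(QK^T)."""
    L, d_out = Q.shape[0], V.shape[1]
    O = np.zeros((L, d_out))                       # final output (lives in "HBM")
    m = np.full(L, -np.inf)                        # running max per query row
    d = np.zeros(L)                                # running denominator per row

    for j in range(0, K.shape[0], block):          # outer loop: blocks of K and V
        Kj, Vj = K[j:j + block], V[j:j + block]    # "copy block to SRAM"
        S = Q @ Kj.T                               # logits of this block for all rows
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale the old statistics
        P = np.exp(S - m_new[:, None])             # un-normalized scores
        d = d * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vj            # running un-normalized output
        m = m_new

    return O / d[:, None]                          # single normalization at the end
```

With `block` equal to the full sequence length this degenerates into the standard three-stage computation; smaller blocks trade more passes over `K` and `V` for a smaller SRAM working set.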
Appendix A - How to derive activation matrices from weight matrices
To keep the focus on the core concepts of FlashAttention, the walkthrough covered only the core attention calculation step: O = softmax(QK^T) · V. This is the specific operation that FlashAttention optimizes.
This appendix provides the "prequel" step that was omitted to derive the activation matrices.
The Two Sets of Q, K, V
It's crucial to distinguish between:
- The Weight Matrices (`W_Q`, `W_K`, `W_V`): These are the trainable parameters learned during the training of the Transformer model. They are part of a standard Linear layer. Their job is to project the input token embeddings into the query, key, and value spaces. They are the same for every input sequence.
- The Activation Matrices (`Q`, `K`, `V`): These are the intermediate representations or activations. They are the result of multiplying the input embeddings by the weight matrices. These matrices are different for every input sequence and are the actual inputs to the attention calculation kernel that FlashAttention replaces. They are not trainable parameters themselves.
Think of it like a recipe:
- The weight matrices (`W_Q`, `W_K`, `W_V`) are the instructions in the recipe book (fixed, learned).
- The activation matrices (`Q`, `K`, `V`) are the actual ingredients you've prepared for one specific meal (changes every time you cook).
FlashAttention's innovation is in how to efficiently combine the prepared ingredients (Q
, K
, V
), not in the initial preparation step itself.
The Omitted "Prequel" Step: Creating Q, K, and V
Here is the step that happens before our walkthrough begins.
Let's assume we have an input sequence of 6 tokens, and each token has an embedding dimension of 4. This is our input matrix X
(not to be confused with the logit vector X
from the walkthrough).
- Input Embeddings
X
(size 6x4):
Now, let's define our trainable weight matrices. Let's say the head dimension d
is 2. The weight matrices must project from the embedding dimension (4) to the head dimension (2). So, they will all be size 4x2.
- Weight Matrices
W_Q
,W_K
,W_V
(size 4x2, trainable):
The Q
, K
, and V
activation matrices are created with standard matrix multiplication: Q = X · W_Q, K = X · W_K, V = X · W_V.
The V
matrix we get from this calculation is precisely the V
matrix we used in the walkthrough: V = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6]].
Connecting to the Logits
In our walkthrough, we focused on a single query vector q
attending to all the keys. This corresponds to taking the first row of the Q
matrix.
The logit vector X from the walkthrough would then be calculated as X = q · K^T.
This multiplication would result in the 1 x 6 vector we started with: X = [1, 2, 3, 6, 2, 1].
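Since the concrete embedding and weight values are not reproduced here, the sketch below uses randomly generated placeholder matrices purely to show the shapes and operations; only the shapes (6 x 4 embeddings, 4 x 2 weights) and the formulas Q = X·W_Q, K = X·W_K, V = X·W_V come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values: 6 tokens with embedding dimension 4
# (shapes from the text, numbers invented for illustration only).
X_emb = rng.normal(size=(6, 4))

# Hypothetical trainable weight matrices projecting 4 -> 2 (the head dimension).
W_Q = rng.normal(size=(4, 2))
W_K = rng.normal(size=(4, 2))
W_V = rng.normal(size=(4, 2))

# The "prequel" step: plain GEMMs that produce the activation matrices.
Q = X_emb @ W_Q          # shape (6, 2)
K = X_emb @ W_K          # shape (6, 2)
V = X_emb @ W_V          # shape (6, 2)

# One query row attending to all keys gives a 1 x 6 logit vector,
# the starting point of the walkthrough.
q = Q[0]                 # first row of Q
logits = q @ K.T         # shape (6,)
print(Q.shape, K.shape, V.shape, logits.shape)
```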
Summary
So, the full, un-omitted process is:
- Linear Projections (The Omitted Prequel):
  - Start with input embeddings `X`.
  - Compute Q = X · W_Q, K = X · W_K, and V = X · W_V using the trainable weight matrices. This is done with standard, highly optimized matrix multiplication libraries (GEMM).
- FlashAttention Calculation (The Walkthrough):
  - Take the resulting `Q`, `K`, and `V` activation matrices as input.
  - Efficiently compute O = softmax(QK^T) · V in a single kernel without materializing the full attention matrix.
Appendix B - The Paper's pseudocode vs. this walkthrough
The goal of my walkthrough was to make the calculation as intuitive as possible to follow by hand. To do this, I slightly rearranged the math to keep the intermediate numbers as simple as possible, while still being mathematically equivalent to the paper's algorithm.
1. The Paper's "Algorithm FlashAttention (Tiling)"
The algorithm in the PDF maintains a normalized output vector o'
at every step. Let's look closely at the update rule for the output:
o'_i ← o'_{i-1} · (d_{i-1} · e^(m_{i-1} - m_i)) / d_i + (Σ_j e^(x_i[j] - m_i) · V[j + (i-1)b, :]) / d_i
Notice that both terms are divided by d_i, the new total denominator. This means that at the end of each iteration i, the vector o'_i is the correctly normalized attention output for all the data processed up to that tile.
2. My Walkthrough's Method
Doing this normalization (division) at every single step can be cumbersome for a manual example. It's easier to work with un-normalized sums and perform a single division at the very end.
So, this walkthrough maintains an un-normalized running output o_running
. Our update rule was:
o_running = o_old · e^(m_old - m_new) + o_local
This is mathematically equivalent to the numerator of the paper's formula. You can see this if you multiply the paper's entire update rule by d_i:
o'_i · d_i = o'_{i-1} · d_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j + (i-1)b, :]
If we define O_i as the paper's o'_i · d_i, then this equation is exactly the one used in the walkthrough:
- My o_running (after tile i) is O_i = o'_i · d_i
- My o_old · e^(m_old - m_new) is o'_{i-1} · d_{i-1} · e^(m_{i-1} - m_i) = O_{i-1} · e^(m_{i-1} - m_i)
- My o_local is Σ_j e^(x_i[j] - m_i) · V[j + (i-1)b, :]
The note was an attempt to explain this simplification: we chose to track the un-normalized numerator throughout the process for clarity and then perform the final division only once.
Why normalizing only at the last step works
The reason this works is that the update rules for the un-normalized numerator (o_running) and the denominator (d_running) are designed to maintain a consistent relationship. At every step i, the correctly normalized output is simply the ratio of the running numerator to the running denominator at that step. By deferring the division to the end, we arrive at the same final ratio. The proof by induction below demonstrates this formally.
The Two Methods
Let's formally define the two methods we are comparing. We will use the paper's notation where o'_i is the normalized output after tile i, and we'll introduce O_i (capital O) as our un-normalized running output numerator from the walkthrough.
Method 1: Normalize at Each Step (The Paper's Algorithm)
The state after tile i is defined by:
d_i = d_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i)
o'_i = (o'_{i-1} · d_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j, :]) / d_i
The final result is o'_N after the last tile N.
Method 2: Normalize Only at the End (The Walkthrough's Method)
The state after tile i is defined by an un-normalized numerator O_i and the same denominator d_i:
O_i = O_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j, :]
d_i = d_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i)
The final result is calculated as O_N / d_N at the very end.
The Proof of Equivalence
We want to prove that o'_N = O_N / d_N. We can prove this by induction, showing that the relationship o'_i = O_i / d_i holds true for every step i.
1. Base Case (i=1)
Let's check the first tile. Both methods start with m_0 = -∞, d_0 = 0, and o'_0 = O_0 = [0, 0].
- Method 1: o'_1 = (Σ_j e^(x_1[j] - m_1) · V[j, :]) / d_1 (the first term vanishes because d_0 = 0)
- Method 2: O_1 = Σ_j e^(x_1[j] - m_1) · V[j, :], so O_1 / d_1 = (Σ_j e^(x_1[j] - m_1) · V[j, :]) / d_1
Comparing the two, we see they are identical. The base case holds.
2. Inductive Hypothesis
Assume that the relationship is true for step i-1. That is, assume:
o'_{i-1} = O_{i-1} / d_{i-1}
3. Inductive Step
Now we must prove that the relationship holds for step i. Let's start with the formula for o'_i from Method 1 and show it equals O_i / d_i.
Start with the definition of o'_i:
o'_i = o'_{i-1} · (d_{i-1} · e^(m_{i-1} - m_i)) / d_i + (Σ_j e^(x_i[j] - m_i) · V[j, :]) / d_i
Let's combine the two fractions over the common denominator d_i:
o'_i = ((o'_{i-1} · d_{i-1}) · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j, :]) / d_i
Now, look at the term in the parentheses: o'_{i-1} · d_{i-1}. According to our Inductive Hypothesis, this is exactly equal to O_{i-1}. Let's substitute it in:
o'_i = (O_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j, :]) / d_i
Now, look at the entire numerator: O_{i-1} · e^(m_{i-1} - m_i) + Σ_j e^(x_i[j] - m_i) · V[j, :]. This is precisely the definition of O_i from Method 2.
So, we can substitute O_i for the numerator:
o'_i = O_i / d_i
This completes the proof. We have shown that if the relationship holds for step i-1
, it must hold for step i
. Since it holds for the base case i=1
, it holds for all steps.
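The induction can also be checked numerically. The sketch below runs both recurrences side by side on the toy data and asserts o'_i = O_i / d_i after every tile; the names follow this article, not the paper.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 6.0, 2.0, 1.0])
V = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
              [4.0, 4.0], [5.0, 5.0], [6.0, 6.0]])

m, d = float("-inf"), 0.0
o_norm = np.zeros(2)        # Method 1: kept normalized at every step
O_unnorm = np.zeros(2)      # Method 2: un-normalized numerator

for start in range(0, len(X), 3):
    x_t, v_t = X[start:start + 3], V[start:start + 3]
    m_new = max(m, x_t.max())
    p = np.exp(x_t - m_new)
    scale = np.exp(m - m_new)
    d_new = d * scale + p.sum()
    o_norm = o_norm * (d * scale / d_new) + (p @ v_t) / d_new   # normalize each step
    O_unnorm = O_unnorm * scale + p @ v_t                       # normalize at the end
    assert np.allclose(o_norm, O_unnorm / d_new)                # o'_i == O_i / d_i
    m, d = m_new, d_new

print(o_norm.round(3), (O_unnorm / d).round(3))   # both print [3.932 3.932]
```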
Intuitive Analogy: Calculating a Weighted Average
Think about calculating the final grade in a class. You have different assignments with different weights (scores).
Method 1 (Normalize at Each Step): After you get your first grade, you calculate your current average. When you get your second grade, you update your average based on the new grade and its weight relative to the first. You keep re-calculating your "running average" after every single assignment.
Method 2 (Normalize at the End): You collect all your points scored for each assignment (
score * weight
) in one column. You collect the total possible points (sum of all weights
) in another column. You do this for the whole semester. At the very end, you do one single division:(total points scored) / (total possible points)
.
Both methods give you the exact same final grade. The second method simply defers the division. We can apply the same principle in FlashAttention: accumulate the un-normalized numerator (o_running
) and the un-normalized denominator (d_running
) separately and effectively perform the division at the end.
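The grade analogy fits in a few lines of Python as well; the scores and weights below are invented for the example.

```python
scores = [0.8, 0.9, 0.7]           # hypothetical assignment scores
weights = [10, 30, 60]             # hypothetical assignment weights

# Method 1: re-normalize the running average after every assignment.
avg, total_w = 0.0, 0.0
for s, w in zip(scores, weights):
    total_w += w
    avg += (s * w - avg * w) / total_w   # fold in the new grade

# Method 2: accumulate numerator and denominator, divide once at the end.
num = sum(s * w for s, w in zip(scores, weights))
den = sum(weights)

print(round(avg, 4), round(num / den, 4))   # both 0.77
```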