Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.
Large language models keep getting bigger.
Context windows have grown from a few thousand tokens to hundreds of thousands, and some models now advertise context lengths measured in millions of tokens.
Yet for years, one part of the Transformer threatened to become the bottleneck:
Attention.
Not because it required too much math.
Because it moved too much data.
FlashAttention is one of the most important optimizations in modern AI infrastructure because it attacks exactly that problem. It doesn't change the Transformer architecture, approximate attention, or introduce a new model design. Instead, it rethinks how attention is executed on GPUs.
The result is dramatically lower memory usage, faster training, faster inference, and practical long-context models.
Let's see how it works.
1. The Real Cost of Attention
Every Transformer layer computes attention using three matrices:
- Query (Q)
- Key (K)
- Value (V)
The standard attention equation is:
Attention(Q,K,V) = softmax(QK^T / sqrt(d))V
where d is the dimension of each query and key vector.
Conceptually, every token compares itself against every other token.
For a sequence of N tokens, the attention score matrix contains:
N x N
elements.
A sequence length of 16,384 tokens requires:
16,384 x 16,384 ≈ 268 million
attention scores per head, in every layer.
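To get a feel for the size, the arithmetic above can be turned into a quick back-of-the-envelope calculation. This is a minimal sketch: the helper name `score_matrix_bytes` and the choice of 2 bytes per score (16-bit precision) are assumptions for illustration, not something fixed by the attention formula.

```python
def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one full N x N attention score matrix.

    bytes_per_score=2 assumes 16-bit storage (an illustrative assumption).
    """
    return seq_len * seq_len * bytes_per_score


n = 16_384
num_scores = n * n
size_mib = score_matrix_bytes(n) / 2**20

print(f"{num_scores:,} scores")
print(f"{size_mib:.0f} MiB for one head at 16-bit precision")
```

Under these assumptions, a single head in a single layer already needs half a gibibyte just to hold its scores.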
The first instinct is usually:
"That's a lot of computation."
But on modern GPUs, the bigger problem is often memory traffic.
2. GPUs Are Often Memory-Bound, Not Compute-Bound
Modern GPUs can perform enormous numbers of floating-point operations per second.
What they cannot do nearly as efficiently is move massive amounts of data between memory levels.
A simplified GPU memory hierarchy looks like:
HBM (GPU memory)
|
L2 Cache
|
Shared Memory
|
Registers
The farther away the data is from the compute units, the more expensive it is to access.
Traditional attention repeatedly writes the full attention matrix to HBM and reads it back.
The workflow looks roughly like this:
Load Q
Load K
Compute QK^T
Write attention matrix to memory
Read attention matrix
Apply softmax
Write result
Read result
Multiply by V
Write output
A huge amount of time is spent moving data rather than performing useful computation.
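A rough way to see this is to count how many elements each step of the workflow above moves to or from HBM. The sketch below is a simplified model, not a profiler: the function name `standard_attention_traffic`, the head dimension of 128, and the explicit "load V" step (implied by the multiplication by V) are assumptions, and caching is ignored.

```python
def standard_attention_traffic(n: int, d: int) -> dict[str, int]:
    """Elements moved to or from HBM per step of unfused attention (simplified)."""
    return {
        "load Q": n * d,
        "load K": n * d,
        "write scores": n * n,
        "read scores": n * n,
        "write softmax result": n * n,
        "read softmax result": n * n,
        "load V": n * d,  # implied by "Multiply by V"
        "write output": n * d,
    }


n, d = 16_384, 128  # d = 128 is an assumed head dimension
traffic = standard_attention_traffic(n, d)
total = sum(traffic.values())

# Steps that touch an N x N intermediate (the scores or the softmax result).
intermediate = sum(
    count for step, count in traffic.items()
    if "scores" in step or "softmax" in step
)

print(f"total elements moved: {total:,}")
print(f"share spent on N x N intermediates: {intermediate / total:.1%}")
```

In this model, almost all of the traffic comes from writing and re-reading the intermediate matrices, not from the inputs and outputs the computation actually needs.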
This is the problem FlashAttention solves.
3. The Core Insight: Never Materialize the Attention Matrix
The key observation behind FlashAttention is surprisingly simple:
You don't actually need to store the full attention matrix.
Traditional attention explicitly creates:
QK^T
and writes it to memory.
FlashAttention never does.
Instead, it processes attention in small blocks.
Conceptually:
Q Block 1 x K Block 1
Q Block 1 x K Block 2
Q Block 1 x K Block 3
...
Each block is loaded, processed, contributes to the final result, and is discarded.
The gigantic attention matrix never exists in memory.
This immediately removes one of the largest memory bottlenecks in Transformer execution.
4. Why This Is Harder Than It Sounds
At first glance, block-wise processing seems impossible.
The softmax operation requires information from an entire row.
softmax(x_i) = exp(x_i) / sum_j exp(x_j)
To compute a single probability, you need a denominator that depends on all scores.
If only one block is visible at a time, how can softmax be computed correctly?
This is where FlashAttention becomes clever.
Instead of storing all scores, it maintains running statistics while processing blocks.
These include:
- Running maximum
- Running normalization term
- Running output accumulator
As each block arrives, these values are updated.
When all blocks have been processed, the result is mathematically identical to standard attention.
Not approximate.
Not a close estimate.
The same exact attention, up to ordinary floating-point rounding.
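In symbols, the bookkeeping for one query row can be sketched as follows. The notation is an assumption introduced here for illustration: $m$ is the running maximum, $\ell$ the running normalization term, $o$ the running output accumulator, and $s_k$, $v_k$ are the scores and value vectors in the newly arrived block $B$.

```latex
\begin{aligned}
m_{\text{new}} &= \max\bigl(m_{\text{old}},\ \max_{k \in B} s_k\bigr) \\
\ell_{\text{new}} &= e^{\,m_{\text{old}} - m_{\text{new}}}\,\ell_{\text{old}} + \sum_{k \in B} e^{\,s_k - m_{\text{new}}} \\
o_{\text{new}} &= e^{\,m_{\text{old}} - m_{\text{new}}}\,o_{\text{old}} + \sum_{k \in B} e^{\,s_k - m_{\text{new}}}\, v_k
\end{aligned}
```

Starting from $m = -\infty$, $\ell = 0$, and $o = 0$, the attention output for that row after the last block is $o / \ell$. The factor $e^{m_{\text{old}} - m_{\text{new}}}$ rescales everything accumulated so far whenever a larger maximum shows up.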
5. Online Softmax: The Trick That Makes It Work
The technique that makes this work is usually called online softmax, and the idea predates FlashAttention itself.
Imagine attention scores arriving in chunks:
Block A
Block B
Block C
A naive implementation would need every score before computing softmax.
Online Softmax instead maintains enough information to update the result incrementally.
The algorithm keeps track of:
Current maximum score
Current normalization factor
Current weighted output
When a new block arrives:
Update maximum
Rescale previous values
Accumulate new contributions
This allows FlashAttention to process attention as a stream rather than as a giant matrix.
The memory savings are enormous.
More importantly, the final output is identical to what standard attention would have produced.
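A minimal sketch of that streaming update for a single row of scores, in plain Python. The function name `online_softmax_stats` and the block contents are hypothetical, and the weighted output is left out to keep the focus on the maximum and the normalization factor.

```python
import math


def online_softmax_stats(blocks):
    """Stream over blocks of scores, keeping only a running max and normalizer."""
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(score - running_max) over scores seen so far
    for block in blocks:
        new_max = max(running_max, max(block))
        # Rescale the previous sum so it is expressed relative to the new maximum.
        running_sum *= math.exp(running_max - new_max)
        # Accumulate the new block's contributions.
        running_sum += sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_max, running_sum


blocks = [[1.0, 3.0], [2.0, 7.0], [0.5, 4.0]]  # Block A, Block B, Block C
m, l = online_softmax_stats(blocks)

# Reference: the usual two-pass softmax statistics over all scores at once.
scores = [s for block in blocks for s in block]
ref_max = max(scores)
ref_sum = sum(math.exp(s - ref_max) for s in scores)

print("max matches:", m == ref_max)
print("normalizer matches:", math.isclose(l, ref_sum))
```

Once the final maximum `m` and normalizer `l` are known, the probability for any score `s` is `exp(s - m) / l`, which is what a two-pass softmax produces, up to floating-point rounding.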
6. Tiling: Making GPUs Happy
FlashAttention is often described as an IO-aware algorithm.
IO-aware means the algorithm is designed around memory movement rather than purely around arithmetic complexity.
The implementation uses tiling.
Instead of operating on huge matrices, FlashAttention loads small chunks into fast on-chip memory:
Load tile into shared memory
Compute attention
Update softmax statistics
Accumulate output
Discard tile
Load next tile
Because shared memory is dramatically faster than HBM, the GPU spends much more time performing useful work.
A useful mental model is:
Traditional Attention:
Memory -> Compute -> Memory -> Compute -> Memory
FlashAttention:
Memory -> Compute -> Compute -> Compute
The arithmetic still scales as O(N^2) with sequence length, but memory traffic is dramatically reduced.
That's where most of the speedup comes from.
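Putting the pieces together, the tile loop above can be sketched in plain Python and checked against a version that builds every full row of scores. This illustrates the idea only, with hypothetical function names and a tiny tile size. Real FlashAttention kernels run on the GPU, process blocks of query rows in parallel, and keep tiles in on-chip memory, none of which plain Python models.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(Q, K, V):
    """Reference: computes each full row of scores before applying softmax."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) / total
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, tile=2):
    """Same result, but at most `tile` scores per query row exist at any time."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        # Running max, running normalizer, running output accumulator.
        m, l, acc = -math.inf, 0.0, [0.0] * dv
        for start in range(0, len(K), tile):          # load next tile
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            scores = [dot(q, k) * scale for k in k_tile]
            m_new = max(m, max(scores))
            rescale = math.exp(m - m_new)             # 0.0 on the first tile
            w = [math.exp(s - m_new) for s in scores]
            l = l * rescale + sum(w)
            acc = [a * rescale + sum(wi * v[c] for wi, v in zip(w, v_tile))
                   for c, a in enumerate(acc)]
            m = m_new                                 # scores are discarded here
        out.append([a / l for a in acc])
    return out


Q = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0],
     [1.0, 1.0, 1.0], [0.0, 3.0, 1.0]]

reference = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, tile=2)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
print(f"largest difference: {max_diff:.2e}")
```

The two functions do the same arithmetic in a different order, so any difference between them is floating-point rounding. What changes is how much intermediate data has to exist at once.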
7. FlashAttention-2, FlashAttention-3, and Production Systems
The original FlashAttention paper introduced the core idea.
FlashAttention-2 improved:
- GPU utilization
- Parallelism
- Work partitioning
- Training throughput
The result was significantly higher performance on modern accelerators.
FlashAttention-3 pushed things further for newer NVIDIA Hopper GPUs and introduced support for modern low-precision formats such as FP8.
Today, FlashAttention is used throughout the AI ecosystem.
You'll find it in:
- PyTorch
- Hugging Face Transformers
- vLLM
- TensorRT-LLM
- Many open-weight LLMs
For many developers, enabling FlashAttention can be as simple as selecting an optimized attention backend.
The model architecture remains unchanged.
The performance characteristics improve dramatically.
What FlashAttention Does Not Fix
FlashAttention is powerful, but it doesn't magically eliminate all scaling problems.
The computation still scales as:
O(N^2)
with respect to sequence length.
Very long contexts still require substantial compute.
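FlashAttention primarily reduces memory traffic and memory footprint: without the stored N x N matrix, the extra memory attention needs grows linearly with sequence length instead of quadratically.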
It makes attention much more efficient, but it does not change the fundamental quadratic interaction pattern of standard attention.
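The quadratic growth is easy to check with a rough count of the two matrix multiplications in attention. This sketch counts a multiply-add as two floating-point operations and ignores softmax; the function name `attention_matmul_flops` and the head dimension of 128 are assumptions for illustration.

```python
def attention_matmul_flops(n: int, d: int) -> int:
    """Approximate FLOPs for QK^T plus the multiplication by V, for one head."""
    qk = 2 * n * n * d  # n x n dot products, each of length d
    pv = 2 * n * n * d  # n x n weights applied to d-dimensional values
    return qk + pv


d = 128  # assumed head dimension
for n in (16_384, 32_768, 65_536):
    tflop = attention_matmul_flops(n, d) / 1e12
    print(f"N = {n:>6,}: {tflop:.2f} TFLOP per head")

ratio = attention_matmul_flops(32_768, d) / attention_matmul_flops(16_384, d)
print(f"doubling N multiplies the work by {ratio:.0f}x")
```

Doubling the sequence length quadruples the arithmetic regardless of how efficiently the data is moved.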
That's why researchers continue exploring alternatives such as:
- Sliding-window attention
- Linear attention
- State-space models
- Hybrid architectures
These approaches attempt to address the computational scaling problem itself rather than the memory movement problem.
Conclusion
FlashAttention is one of the rare breakthroughs that became foundational almost immediately.
It didn't replace Transformers.
It didn't invent a new attention mechanism.
It didn't require retraining models.
Instead, it recognized that modern GPUs spend an enormous amount of time moving data around and redesigned attention to minimize that movement.
By treating memory access as the bottleneck rather than arithmetic, FlashAttention transformed attention from a memory-heavy operation into a much more hardware-efficient one.
Many of today's long-context LLMs would be significantly slower, more expensive, or simply impractical without it.
As AI systems continue to scale, FlashAttention serves as an important reminder that sometimes the biggest breakthroughs don't come from changing what is computed. They come from understanding how that computation interacts with real hardware.
What do you think has been the most important infrastructure breakthrough for LLMs so far: FlashAttention, quantization, KV caching, speculative decoding, or something else entirely?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.