Shrijith Venkatramana

Posted on Jun 11

FlashAttention Explained: The Optimization That Made Modern LLMs Practical

Large language models keep getting bigger.

Context windows have grown from a few thousand tokens to hundreds of thousands, and some models now advertise context lengths measured in millions of tokens.

Yet for years, one part of the Transformer threatened to become the bottleneck:

Attention.

Not because it required too much math.

Because it moved too much data.

FlashAttention is one of the most important optimizations in modern AI infrastructure because it attacks exactly that problem. It doesn't change the Transformer architecture, approximate attention, or introduce a new model design. Instead, it rethinks how attention is executed on GPUs.

The result is dramatically lower memory usage, faster training, faster inference, and practical long-context models.

Let's see how it works.

1. The Real Cost of Attention

Every Transformer layer computes attention using three matrices:

Query (Q)
Key (K)
Value (V)

The standard attention equation is:

Attention(Q,K,V) = softmax(QK^T / sqrt(d))V

Conceptually, every token compares itself against every other token.

For a sequence of N tokens, the attention score matrix contains:

N x N

elements.

A sequence length of 16,384 tokens requires:

16,384 x 16,384 = 268,435,456 (about 268 million)

attention scores for each attention head.
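To put that count in memory terms, here is a back-of-the-envelope sketch. The FP16 storage (2 bytes per score), the single attention head, and the batch size of 1 are assumptions for illustration, not figures from any particular model.

```python
# Back-of-the-envelope size of one full attention score matrix.
# Assumptions (illustrative only): FP16 scores (2 bytes each),
# a single attention head, batch size 1.

def score_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to store an N x N score matrix."""
    return seq_len * seq_len * bytes_per_score

n = 16_384
num_scores = n * n
size_bytes = score_matrix_bytes(n)

print(f"{num_scores:,} scores")                 # 268,435,456 scores
print(f"{size_bytes / 2**20:.0f} MiB in FP16")  # 512 MiB in FP16
```

Under those assumptions, a single head needs 512 MiB just for the scores, and that figure is multiplied by the number of heads and the batch size.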

The first instinct is usually:

"That's a lot of computation."

But on modern GPUs, the bigger problem is often memory traffic.

2. GPUs Are Often Memory-Bound, Not Compute-Bound

Modern GPUs can perform enormous numbers of floating-point operations per second.

What they cannot do nearly as efficiently is move massive amounts of data between memory levels.

A simplified GPU memory hierarchy looks like:

HBM (GPU memory)
    |
L2 Cache
    |
Shared Memory
    |
Registers

The farther away the data is from the compute units, the more expensive it is to access.

Traditional attention writes the full attention matrix to HBM and then reads it back, more than once.

The workflow looks roughly like this:

Load Q
Load K

Compute QK^T

Write attention matrix to memory

Read attention matrix

Apply softmax

Write result

Read result

Multiply by V

Write output

A huge amount of time is spent moving data rather than performing useful computation.

This is the problem FlashAttention solves.
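The workflow above can be sketched in plain Python. This is a CPU illustration only: the function name `naive_attention` and the toy inputs are hypothetical, and the lists `S` and `P` stand in for the intermediate matrices that a GPU implementation would write to and read back from HBM.

```python
import math

def naive_attention(Q, K, V):
    """Standard attention on lists of row vectors; keeps all N x N scores."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # Step 1: full score matrix S = Q K^T / sqrt(d), N x N values held at once.
    S = [[scale * sum(q_i * k_i for q_i, k_i in zip(q, k)) for k in K] for q in Q]
    # Step 2: row-wise softmax, which produces a second full N x N matrix P.
    P = []
    for row in S:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        P.append([e / total for e in exps])
    # Step 3: output O = P V.
    dv = len(V[0])
    return [[sum(p * v[c] for p, v in zip(row, V)) for c in range(dv)] for row in P]

# Toy inputs: 2 tokens, head dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = naive_attention(Q, K, V)
print(out)
```

Both `S` and `P` grow with the square of the sequence length, which is exactly the data that has to travel to and from slow memory.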

3. The Core Insight: Never Materialize the Attention Matrix

The key observation behind FlashAttention is surprisingly simple:

You don't actually need to store the full attention matrix.

Traditional attention explicitly creates:

QK^T

and writes it to memory.

FlashAttention never does.

Instead, it processes attention in small blocks.

Conceptually:

Q Block 1 x K Block 1
Q Block 1 x K Block 2
Q Block 1 x K Block 3
...

Each block is loaded, processed, contributes to the final result, and is discarded.

The gigantic attention matrix never exists in memory.

This immediately removes one of the largest memory bottlenecks in Transformer execution.

4. Why This Is Harder Than It Sounds

At first glance, block-wise processing seems impossible.

The softmax operation requires information from an entire row.

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

To compute a single probability, you need a denominator that depends on all scores.

If only one block is visible at a time, how can softmax be computed correctly?

This is where FlashAttention becomes clever.

Instead of storing all scores, it maintains running statistics while processing blocks.

These include:

Running maximum
Running normalization term
Running output accumulator

As each block arrives, these values are updated.

When all blocks have been processed, the result is mathematically identical to standard attention.

Not approximate.

Not close.

Exact, up to the usual floating-point rounding differences that come from a different order of operations.
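One common way to write the update for a single query row is shown below. The symbols are assumptions chosen for this sketch rather than notation from the article: m is the running maximum, \ell the running normalization term, o the running output accumulator, B the set of positions in the newly arrived block, x_j the scores, and v_j the matching rows of V. The state starts at m = -\infty, \ell = 0, o = 0.

```latex
\begin{aligned}
m_{\text{new}}    &= \max\Bigl(m,\ \max_{j \in B} x_j\Bigr) \\
\ell_{\text{new}} &= e^{\,m - m_{\text{new}}}\,\ell + \sum_{j \in B} e^{\,x_j - m_{\text{new}}} \\
o_{\text{new}}    &= e^{\,m - m_{\text{new}}}\,o + \sum_{j \in B} e^{\,x_j - m_{\text{new}}}\,v_j
\end{aligned}
```

After the last block, the attention output for that row is o / \ell. The factor e^{m - m_new} re-expresses everything accumulated so far relative to the new maximum, which is why no earlier score needs to be kept.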

5. Online Softmax: The Trick That Makes It Work

The trick that makes this possible is usually called online softmax, a technique that predates FlashAttention and is reused inside it.

Imagine attention scores arriving in chunks:

Block A
Block B
Block C

A naive implementation would need every score before computing softmax.

Online Softmax instead maintains enough information to update the result incrementally.

The algorithm keeps track of:

Current maximum score
Current normalization factor
Current weighted output

When a new block arrives:

Update maximum

Rescale previous values

Accumulate new contributions

This allows FlashAttention to process attention as a stream rather than as a giant matrix.

The extra memory attention needs drops from quadratic to linear in sequence length.

More importantly, the final output matches what standard attention would have produced, up to floating-point rounding.
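A minimal sketch of the streaming idea, restricted to the softmax itself: the function name `online_softmax_stats`, the example scores, and the three-way split into blocks are all hypothetical. It assumes every block is non-empty and contains finite scores.

```python
import math

def online_softmax_stats(blocks):
    """Stream over blocks of scores; return (running max, running normalizer)."""
    m = float("-inf")  # current maximum score
    l = 0.0            # current normalization factor: sum of exp(score - m)
    for block in blocks:
        m_new = max(m, max(block))                 # update maximum
        # Rescale the previous sum to the new maximum, then add the new block.
        l = l * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in block)
        m = m_new
    return m, l

scores = [2.0, -1.0, 0.5, 3.0, 1.5, -2.0]
blocks = [scores[0:2], scores[2:4], scores[4:6]]   # Block A, Block B, Block C

m, l = online_softmax_stats(blocks)
probs = [math.exp(x - m) / l for x in scores]

# Reference: softmax computed with the whole row visible at once.
m_ref = max(scores)
denom = sum(math.exp(x - m_ref) for x in scores)
reference = [math.exp(x - m_ref) / denom for x in scores]

max_diff = max(abs(p - r) for p, r in zip(probs, reference))
print(max_diff)
```

Only two numbers survive from one block to the next, regardless of how many scores have been seen. Full attention adds a third piece of state, the weighted output, which is rescaled by the same factor.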

6. Tiling: Making GPUs Happy

FlashAttention is often described as an IO-aware algorithm.

IO-aware means the algorithm is designed around memory movement rather than purely around arithmetic complexity.

The implementation uses tiling.

Instead of operating on huge matrices, FlashAttention loads small chunks into fast on-chip memory:

Load tile into shared memory

Compute attention

Update softmax statistics

Accumulate output

Discard tile

Load next tile

Because shared memory is dramatically faster than HBM, the GPU spends much more time performing useful work.

A useful mental model is:

Traditional Attention:

Memory -> Compute -> Memory -> Compute -> Memory

FlashAttention:

Memory -> Compute -> Compute -> Compute -> Memory

The computation remains approximately O(N^2), but memory traffic is dramatically reduced.

That's where most of the speedup comes from.
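The tile loop can be imitated on the CPU to check that nothing is lost by never building the full matrix. This is a sketch of the data flow, not a GPU kernel: the names `tiled_attention` and `reference_attention`, the `block_size` parameter, and the toy inputs are hypothetical, and a real kernel would also process queries in blocks and keep tiles in shared memory.

```python
import math

def tiled_attention(Q, K, V, block_size=2):
    """Attention over lists of row vectors, visiting K/V one tile at a time.

    Only one tile of scores exists at any moment; no N x N matrix is built.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running maximum
        l = 0.0             # running normalization term
        acc = [0.0] * dv    # running (unnormalized) output accumulator
        for start in range(0, len(K), block_size):
            k_tile = K[start:start + block_size]       # "load tile"
            v_tile = V[start:start + block_size]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(s))
            alpha = math.exp(m - m_new)                # rescale factor for old state
            p = [math.exp(x - m_new) for x in s]
            l = l * alpha + sum(p)                     # update softmax statistics
            acc = [alpha * a + sum(pj * v[c] for pj, v in zip(p, v_tile))
                   for c, a in enumerate(acc)]         # accumulate output
            m = m_new                                  # the tile is dropped here
        out.append([a / l for a in acc])
    return out

def reference_attention(Q, K, V):
    """softmax(QK^T / sqrt(d)) V with every score of a row in memory at once."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        w = [math.exp(x - m) for x in s]
        total = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / total
                    for c in range(len(V[0]))])
    return out

# Toy inputs: 5 tokens, so the last tile is smaller than the others.
Q = [[0.1, 0.2], [0.5, -0.3], [1.0, 0.0], [-0.4, 0.9], [0.3, 0.3]]
K = [[0.2, 0.1], [-0.5, 0.4], [0.7, 0.7], [0.0, -1.0], [1.2, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]

tiled = tiled_attention(Q, K, V, block_size=2)
ref = reference_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(tiled, ref) for a, b in zip(ra, rb))
print(max_diff)
```

The two results agree to within floating-point rounding for any tile size, while the tiled version only ever holds `block_size` scores per query.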

7. FlashAttention-2, FlashAttention-3, and Production Systems

The original FlashAttention paper introduced the core idea.

FlashAttention-2 improved:

GPU utilization
Parallelism
Work partitioning
Training throughput

The result was significantly higher performance on modern accelerators.

FlashAttention-3 pushed things further for newer NVIDIA Hopper GPUs and introduced support for modern low-precision formats such as FP8.

Today, FlashAttention is used throughout the AI ecosystem.

You'll find it in:

PyTorch
Hugging Face Transformers
vLLM
TensorRT-LLM
Many open-weight LLMs

For many developers, enabling FlashAttention can be as simple as selecting an optimized attention backend.

The model architecture remains unchanged.

The performance characteristics improve dramatically.

What FlashAttention Does Not Fix

FlashAttention is powerful, but it doesn't magically eliminate all scaling problems.

The computation is still approximately:

O(N^2)

with respect to sequence length.

Very long contexts still require substantial compute.

FlashAttention primarily reduces memory traffic and memory footprint.

It makes attention much more efficient, but it does not change the fundamental quadratic interaction pattern of standard attention.
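A rough estimate makes the quadratic term concrete. The sketch below counts only the two matrix multiplications for one head (QK^T and the multiplication by V, each about 2 x N^2 x d floating-point operations) and ignores the softmax itself; the head dimension of 128 is an assumed value.

```python
def attention_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for one head: QK^T and PV each cost 2 * N^2 * d."""
    return 4 * seq_len * seq_len * head_dim

head_dim = 128  # assumed head dimension, for illustration only

for n in (4_096, 8_192, 16_384):
    print(n, attention_flops(n, head_dim))

ratio = attention_flops(16_384, head_dim) / attention_flops(8_192, head_dim)
print(ratio)  # 4.0
```

Doubling the sequence length quadruples the arithmetic, with or without FlashAttention.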

That's why researchers continue exploring alternatives such as:

Sliding-window attention
Linear attention
State-space models
Hybrid architectures

These approaches attempt to address the computational scaling problem itself rather than the memory movement problem.

Conclusion

FlashAttention is one of the rare breakthroughs that became foundational almost immediately.

It didn't replace Transformers.

It didn't invent a new attention mechanism.

It didn't require retraining models.

Instead, it recognized that modern GPUs spend an enormous amount of time moving data around and redesigned attention to minimize that movement.

By treating memory access as the bottleneck rather than arithmetic, FlashAttention transformed attention from a memory-heavy operation into a much more hardware-efficient one.

Many of today's long-context LLMs would be significantly slower, more expensive, or simply impractical without it.

As AI systems continue to scale, FlashAttention serves as an important reminder that sometimes the biggest breakthroughs don't come from changing the algorithm itself. They come from understanding how that algorithm interacts with real hardware.

What do you think has been the most important infrastructure breakthrough for LLMs so far: FlashAttention, quantization, KV caching, speculative decoding, or something else entirely?

