
Shrijith Venkatramana


FlashAttention Explained: The Optimization That Made Modern LLMs Practical

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


Large language models keep getting bigger.

Context windows have grown from a few thousand tokens to hundreds of thousands, and some models now advertise context lengths measured in millions of tokens.

Yet for years, one part of the Transformer threatened to become the bottleneck:

Attention.

Not because it required too much math.

Because it moved too much data.

FlashAttention is one of the most important optimizations in modern AI infrastructure because it attacks exactly that problem. It doesn't change the Transformer architecture, approximate attention, or introduce a new model design. Instead, it rethinks how attention is executed on GPUs.

The result is dramatically lower memory usage, faster training, faster inference, and practical long-context models.

Let's see how it works.

1. The Real Cost of Attention

Every Transformer layer computes attention using three matrices:

  • Query (Q)
  • Key (K)
  • Value (V)

The standard attention equation is:

Attention(Q,K,V) = softmax(QK^T / sqrt(d))V

Conceptually, every token compares itself against every other token.
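
As a reference point, here is a minimal pure-Python sketch of the equation above. It is written for readability rather than speed, and the function names (`softmax`, `naive_attention`) are illustrative, not taken from any library. Note that it builds the full N x N matrices `S` and `P` explicitly.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_attention(Q, K, V):
    """Standard attention; Q, K, V are lists of rows (N x d)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    # Step 1: materialize the full N x N score matrix S = QK^T / sqrt(d).
    S = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    # Step 2: row-wise softmax, producing a second N x N matrix P.
    P = [softmax(row) for row in S]
    # Step 3: each output row is a weighted sum of the value rows.
    return [[sum(p * v[c] for p, v in zip(p_row, V)) for c in range(len(V[0]))]
            for p_row in P]

Q = [[1.0, 2.0]]
K = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [3.0, 4.0]]
# Both keys score equally, so the output is the mean of the value rows.
print(naive_attention(Q, K, V))  # [[2.0, 2.0]]
```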

For a sequence of N tokens, the attention score matrix contains:

N x N

elements.

A sequence length of 16,384 tokens requires:

16,384 x 16,384 = 268,435,456 (about 268 million)

attention scores for a single head in a single layer.
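
To put that count in terms of memory, the sketch below converts it to bytes. It assumes 2 bytes per score (FP16 or BF16) and counts a single N x N matrix; the helper name `score_matrix_bytes` and the other sequence lengths are illustrative choices, not figures from any particular model.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to store one N x N score matrix."""
    return seq_len * seq_len * bytes_per_element

for n in (2_048, 16_384, 131_072):
    mib = score_matrix_bytes(n) / 2**20
    print(f"N={n:>7,}: {n * n:>15,} scores, {mib:>8,.0f} MiB at 2 bytes each")
```

At 16,384 tokens, one such matrix takes 512 MiB under these assumptions, and the size quadruples every time the sequence length doubles.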

The first instinct is usually:

"That's a lot of computation."

But on modern GPUs, the bigger problem is often memory traffic.

2. GPUs Are Often Memory-Bound, Not Compute-Bound

Modern GPUs can perform enormous numbers of floating-point operations per second.

What they cannot do nearly as efficiently is move massive amounts of data between memory levels.

A simplified GPU memory hierarchy looks like:

HBM (GPU memory)
    |
L2 Cache
    |
Shared Memory
    |
Registers

The farther away the data is from the compute units, the more expensive it is to access.

Traditional attention repeatedly writes the full attention matrix to HBM and then reads it back.

The workflow looks roughly like this:

Load Q
Load K

Compute QK^T

Write attention matrix to memory

Read attention matrix

Apply softmax

Write result

Read result

Multiply by V

Write output

A huge amount of time is spent moving data rather than performing useful computation.

This is the problem FlashAttention solves.
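
A back-of-the-envelope estimate makes the imbalance concrete. The sketch below assumes each step in the workflow above makes one full pass over the N x N matrix, that values take 2 bytes each, and that the head dimension is 128 (a hypothetical but typical-looking value). Real kernels may fuse some of these steps, so treat the output as illustrative only.

```python
def unfused_attention_traffic(seq_len, head_dim, bytes_per_element=2):
    """Rough bytes moved to and from HBM by the unfused workflow, for one head."""
    n, d, b = seq_len, head_dim, bytes_per_element
    # Read Q, K, V once and write the output once: four N x d matrices.
    inputs_and_output = 4 * n * d * b
    # Write scores, read scores, write softmax result, read it back:
    # four passes over an N x N matrix.
    intermediates = 4 * n * n * b
    return inputs_and_output, intermediates

io_bytes, intermediate_bytes = unfused_attention_traffic(seq_len=16_384, head_dim=128)
print(f"Q, K, V and output:  {io_bytes / 2**20:,.0f} MiB")
print(f"N x N intermediates: {intermediate_bytes / 2**20:,.0f} MiB")
print(f"ratio: {intermediate_bytes // io_bytes}x")
```

Under these assumptions the intermediate traffic outweighs the unavoidable input and output traffic by a factor of N / d, which is 128 here.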

3. The Core Insight: Never Materialize the Attention Matrix

The key observation behind FlashAttention is surprisingly simple:

You don't actually need to store the full attention matrix.

Traditional attention explicitly creates:

QK^T

and writes it to memory.

FlashAttention never does.

Instead, it processes attention in small blocks.

Conceptually:

Q Block 1 x K Block 1
Q Block 1 x K Block 2
Q Block 1 x K Block 3
...

Each block is loaded, processed, contributes to the final result, and is discarded.

The gigantic attention matrix never exists in memory.

This immediately removes one of the largest memory bottlenecks in Transformer execution.

4. Why This Is Harder Than It Sounds

At first glance, block-wise processing seems impossible.

The softmax operation requires information from an entire row.

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

To compute a single probability, you need a denominator that depends on all scores.

If only one block is visible at a time, how can softmax be computed correctly?

This is where FlashAttention becomes clever.

Instead of storing all scores, it maintains running statistics while processing blocks.

These include:

  • Running maximum
  • Running normalization term
  • Running output accumulator

As each block arrives, these values are updated.

When all blocks have been processed, the result is mathematically identical to standard attention.

Not approximate.

Not close.

Exactly identical, up to ordinary floating-point rounding.
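
The running maximum and running normalization term can be sketched in a few lines of plain Python. This is a simplified illustration on a single row of scores, not FlashAttention's actual kernel; `online_softmax_stats` is a hypothetical helper name, and each block is assumed to be non-empty.

```python
import math

def softmax(scores):
    """Reference softmax that sees the whole row at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def online_softmax_stats(blocks):
    """Stream over blocks of scores, keeping only a running max and normalizer."""
    running_max = float("-inf")
    running_sum = 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # Re-express the old sum relative to the new maximum.
        running_sum *= math.exp(running_max - new_max)
        # Add this block's contribution, also relative to the new maximum.
        running_sum += sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_max, running_sum

scores = [1.0, 3.0, 2.0, 7.0, 0.5, 4.0]
blocks = [scores[0:2], scores[2:4], scores[4:6]]

m, l = online_softmax_stats(blocks)
streamed = [math.exp(s - m) / l for s in scores]
reference = softmax(scores)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(streamed, reference)))
```

The streamed version never holds more than one block of scores while it builds the statistics, yet it recovers the same probabilities as the full-row softmax.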

5. Online Softmax: The Trick That Makes It Work

The technique that makes this work is usually called Online Softmax.

Imagine attention scores arriving in chunks:

Block A
Block B
Block C

A naive implementation would need every score before computing softmax.

Online Softmax instead maintains enough information to update the result incrementally.

The algorithm keeps track of:

Current maximum score
Current normalization factor
Current weighted output

When a new block arrives:

Update maximum

Rescale previous values

Accumulate new contributions

This allows FlashAttention to process attention as a stream rather than as a giant matrix.

The memory savings are enormous.

More importantly, the final output is identical to what standard attention would have produced.
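
One common way to write the three steps (update maximum, rescale previous values, accumulate new contributions) for a single query row is shown below. The symbols are notation chosen for this sketch: s_j are the scores in the incoming block, v_j the matching value rows, m the running maximum, \ell the running normalization factor, and o the running unnormalized output. The state starts at m = -\infty, \ell = 0, o = 0.

```latex
\begin{aligned}
m_{\text{new}}    &= \max\bigl(m_{\text{old}},\ \max_j s_j\bigr) \\
\ell_{\text{new}} &= e^{\,m_{\text{old}} - m_{\text{new}}}\,\ell_{\text{old}}
                     + \sum_j e^{\,s_j - m_{\text{new}}} \\
o_{\text{new}}    &= e^{\,m_{\text{old}} - m_{\text{new}}}\,o_{\text{old}}
                     + \sum_j e^{\,s_j - m_{\text{new}}}\,v_j \\
\text{output}     &= o_{\text{final}} \,/\, \ell_{\text{final}}
\end{aligned}
```

The rescaling factor works because e^{s - m_old} multiplied by e^{m_old - m_new} equals e^{s - m_new}, so every earlier contribution ends up expressed relative to the latest maximum, exactly as if the whole row had been visible from the start.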

6. Tiling: Making GPUs Happy

FlashAttention is often described as an IO-aware algorithm.

IO-aware means the algorithm is designed around memory movement rather than purely around arithmetic complexity.

The implementation uses tiling.

Instead of operating on huge matrices, FlashAttention loads small chunks into fast on-chip memory:

Load tile into shared memory

Compute attention

Update softmax statistics

Accumulate output

Discard tile

Load next tile

Because shared memory is dramatically faster than HBM, the GPU spends much more time performing useful work.

A useful mental model is:

Traditional Attention:

Memory -> Compute -> Memory -> Compute -> Memory

FlashAttention:

Memory -> Compute -> Compute -> Compute

The arithmetic is still O(N^2) in sequence length, but far fewer bytes travel between HBM and the compute units.

That's where most of the speedup comes from.
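
The tile loop can be sketched in plain Python to show that it reproduces the unfused result. This is only a model of the control flow: Python has no notion of shared memory, the sketch handles one query row at a time where a real kernel would process a block of rows, and the names `reference_attention`, `tiled_attention`, and `block_size` are illustrative.

```python
import math
import random

def reference_attention(Q, K, V):
    """Unfused baseline that builds every score row in full."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        total = sum(p)
        out.append([sum(p_j * v[c] for p_j, v in zip(p, V)) / total
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_size=2):
    """Visit K and V one tile at a time, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = float("-inf")           # running maximum
        l = 0.0                     # running normalization term
        acc = [0.0] * len(V[0])     # running unnormalized output
        for start in range(0, len(K), block_size):
            k_tile = K[start:start + block_size]   # "load tile"
            v_tile = V[start:start + block_size]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(s))
            correction = math.exp(m - m_new)       # rescales earlier tiles
            p = [math.exp(x - m_new) for x in s]
            l = l * correction + sum(p)
            acc = [a * correction + sum(p_j * v[c] for p_j, v in zip(p, v_tile))
                   for c, a in enumerate(acc)]
            m = m_new                              # tile is now discarded
        out.append([a / l for a in acc])
    return out

random.seed(0)
n, d = 7, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = reference_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, block_size=3)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print(f"largest difference: {max_diff:.2e}")
```

The largest difference should be on the order of floating-point rounding error, even though the tiled version never holds more than `block_size` scores for a row at once.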

7. FlashAttention-2, FlashAttention-3, and Production Systems

The original FlashAttention paper introduced the core idea.

FlashAttention-2 improved:

  • GPU utilization
  • Parallelism
  • Work partitioning
  • Training throughput

The result was significantly higher performance on modern accelerators.

FlashAttention-3 pushed things further for newer NVIDIA Hopper GPUs and introduced support for modern low-precision formats such as FP8.

Today, FlashAttention is used throughout the AI ecosystem.

You'll find it in:

  • PyTorch
  • Hugging Face Transformers
  • vLLM
  • TensorRT-LLM
  • Many open-weight LLMs

For many developers, enabling FlashAttention can be as simple as selecting an optimized attention backend.

The model architecture remains unchanged.

The performance characteristics improve dramatically.

What FlashAttention Does Not Fix

FlashAttention is powerful, but it doesn't magically eliminate all scaling problems.

The computation still scales as:

O(N^2)

with respect to sequence length.

Very long contexts still require substantial compute.

FlashAttention primarily reduces memory traffic and memory footprint: the extra memory attention needs grows linearly with sequence length instead of quadratically.

It makes attention much more efficient, but it does not change the fundamental quadratic interaction pattern of standard attention.

That's why researchers continue exploring alternatives such as:

  • Sliding-window attention
  • Linear attention
  • State-space models
  • Hybrid architectures

These approaches attempt to address the computational scaling problem itself rather than the memory movement problem.
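
The quadratic compute cost discussed in this section is easy to check with a rough count. The sketch below assumes one multiply-add counts as 2 FLOPs and ignores the softmax itself, masking, and multiple heads; `attention_flops` is a hypothetical helper, and the head dimension of 128 is an illustrative choice.

```python
def attention_flops(seq_len, head_dim):
    """Approximate FLOPs for one head: QK^T plus the multiplication by V."""
    qk = 2 * seq_len * seq_len * head_dim   # N x N scores, each a length-d dot product
    pv = 2 * seq_len * seq_len * head_dim   # N x d outputs, each a length-N weighted sum
    return qk + pv

for n in (8_192, 16_384, 32_768):
    print(f"N={n:>6,}: {attention_flops(n, 128) / 1e9:,.1f} GFLOPs per head")
```

Doubling the sequence length quadruples the count, with or without FlashAttention.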

Conclusion

FlashAttention is one of the rare breakthroughs that became foundational almost immediately.

It didn't replace Transformers.

It didn't invent a new attention mechanism.

It didn't require retraining models.

Instead, it recognized that modern GPUs spend an enormous amount of time moving data around and redesigned attention to minimize that movement.

By treating memory access as the bottleneck rather than arithmetic, FlashAttention transformed attention from a memory-heavy operation into a much more hardware-efficient one.

Many of today's long-context LLMs would be significantly slower, more expensive, or simply impractical without it.

As AI systems continue to scale, FlashAttention serves as an important reminder that sometimes the biggest breakthroughs don't come from changing the algorithm itself. They come from understanding how that algorithm interacts with real hardware.

What do you think has been the most important infrastructure breakthrough for LLMs so far: FlashAttention, quantization, KV caching, speculative decoding, or something else entirely?


*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
