DEV Community

Shrijith Venkatramana

Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel Fusion

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


Every few months, a new LLM appears claiming to be 2× faster, 3× cheaper, or capable of serving millions more tokens per second.

Many developers assume the gains come from better GPUs or smaller models.

Often, the real answer is far less glamorous:

Someone removed a few trips to memory.

One of the most important performance techniques in modern LLM inference is kernel fusion. It doesn't change the model architecture. It doesn't improve accuracy. It doesn't make the AI smarter.

It simply makes the hardware spend less time waiting and more time computing.

And in large-scale AI systems, that can mean the difference between serving thousands of users and serving millions.

Let's dig into how fused kernels work, starting from intuition and moving down to GPU-level details.

1. Why LLMs Are Often Memory-Bound

When developers first think about neural network performance, they usually focus on FLOPS.

Modern GPUs advertise enormous numbers:

  • Tens of TFLOPS
  • Hundreds of TFLOPS
  • Even PFLOPS for specialized workloads

Yet many LLM operations don't come close to using that compute capacity.

The reason is that a GPU spends a surprising amount of time moving data around.

Imagine a simple operation:

y = gelu(x + bias)

Conceptually this is tiny.

But naively, the GPU may:

  1. Load x
  2. Load bias
  3. Compute addition
  4. Write result to memory
  5. Read result back
  6. Compute GELU
  7. Write final output

The arithmetic is cheap.

The memory traffic is expensive.

As models grow into billions of parameters, memory movement becomes one of the dominant costs.
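
To put rough numbers on the steps above, here is a back-of-the-envelope sketch. The tensor shape, the 2-byte (fp16) element size, and the function name `bias_gelu_traffic_bytes` are assumptions chosen for illustration, and the estimate ignores caching entirely.

```python
def bias_gelu_traffic_bytes(rows, cols, bytes_per_elem=2, fused=False):
    """Estimate global-memory bytes moved for y = gelu(x + bias).

    Assumes x is (rows, cols), bias is (cols,), and nothing stays in cache
    between kernels. These are simplifying assumptions, not measurements.
    """
    n = rows * cols
    elems = n + cols      # load x, load bias
    if not fused:
        elems += n        # write the result of x + bias
        elems += n        # read that result back for GELU
    elems += n            # write the final output
    return elems * bytes_per_elem


naive = bias_gelu_traffic_bytes(4096, 4096)
fused = bias_gelu_traffic_bytes(4096, 4096, fused=True)
print(f"naive: {naive / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB")
```

Under these assumptions, skipping the intermediate write and read-back roughly halves the bytes moved, while the arithmetic stays the same.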

2. What Is a GPU Kernel?

Before understanding fusion, we need to understand kernels.

A GPU kernel is a function launched from the host that runs on the GPU, executed in parallel by many threads.

For example:

z = x + y

might launch one kernel.

Then:

output = relu(z)

might launch another.

Then:

output = output * scale

might launch a third.

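Each separate kernel carries its own costs:

  • Launching and scheduling the work
  • Reading its inputs from global memory
  • Writing its outputs to global memory
  • Synchronization between dependent kernels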

The GPU repeatedly moves intermediate results between global memory and compute units.

Those extra movements add up quickly.

3. The Core Idea of Kernel Fusion

Kernel fusion combines multiple operations into a single GPU kernel.

Instead of:

z = x + bias
a = gelu(z)
output = a * scale

we create one fused operation:

output = scale * gelu(x + bias)

Now the GPU can:

  1. Load input once
  2. Perform all calculations
  3. Write output once

No intermediate tensors are stored in global memory.

Visually:

Without fusion

Memory → Add
          ↓
       Memory
          ↓
        GELU
          ↓
       Memory
          ↓
       Scale
          ↓
       Memory

With fusion

Memory → Add → GELU → Scale → Memory

The computation is identical.

The data movement is dramatically reduced.
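
The contrast can be made concrete with a toy simulation. This is plain Python, not GPU code: the `GlobalMemory` class is a hypothetical stand-in that just counts element transfers, and for simplicity `bias` is assumed to have the same length as `x`.

```python
import math


class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element transfers."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, buf):
        self.reads += len(buf)
        return list(buf)

    def store(self, values):
        self.writes += len(values)
        return list(values)


def gelu(v):
    # Exact GELU using the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def unfused(mem, x, bias, scale):
    # Three "kernels": each one loads its inputs and stores its result.
    xs, bs = mem.load(x), mem.load(bias)
    z = mem.store([a + b for a, b in zip(xs, bs)])
    a = mem.store([gelu(v) for v in mem.load(z)])
    return mem.store([scale * v for v in mem.load(a)])


def fused(mem, x, bias, scale):
    # One "kernel": intermediates live in local variables (think registers).
    xs, bs = mem.load(x), mem.load(bias)
    return mem.store([scale * gelu(a + b) for a, b in zip(xs, bs)])


x = [0.5, -1.0, 2.0, 0.0]
bias = [0.1, 0.2, -0.3, 0.4]
m1, m2 = GlobalMemory(), GlobalMemory()
out_unfused = unfused(m1, x, bias, scale=2.0)
out_fused = fused(m2, x, bias, scale=2.0)
print("transfers:", m1.reads + m1.writes, "vs", m2.reads + m2.writes)
```

Both functions return the same values. Only the number of simulated trips to memory differs.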

4. Where Fusion Appears Inside LLMs

Modern transformers contain many opportunities for fusion.

A few common examples:

Bias + GELU Fusion

Instead of:

hidden = linear(x)
hidden += bias
hidden = gelu(hidden)

the bias addition and the activation are fused into one kernel, so the output of the linear layer is read and written only once.

This is common in transformer MLP blocks.

LayerNorm Fusion

Layer normalization requires:

  • Mean computation
  • Variance computation
  • Normalization
  • Scale
  • Shift

Naively these can involve multiple passes through memory.

Optimized kernels perform much of the work in one fused operation.
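
One common way to cut a pass is to compute the mean and variance together in a single sweep, for example with Welford's algorithm. The sketch below is plain Python and only illustrates the arithmetic; the function names are hypothetical, and whether a given library kernel uses Welford or another scheme is not something this article confirms.

```python
import math


def layernorm_two_pass(x, gamma, beta, eps=1e-5):
    # Reference: separate sweeps for mean, variance, and normalization.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def layernorm_fused(x, gamma, beta, eps=1e-5):
    # Sweep 1: Welford's update yields mean and variance in one read of x.
    mean, m2 = 0.0, 0.0
    for i, v in enumerate(x, start=1):
        delta = v - mean
        mean += delta / i
        m2 += delta * (v - mean)
    inv_std = 1.0 / math.sqrt(m2 / len(x) + eps)
    # Sweep 2: normalize, scale, and shift together, producing the output once.
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0, 1.0, 2.0, 0.5]
beta = [0.0, 0.1, 0.0, -0.2]
print(layernorm_fused(row, gamma, beta))
```

In a real kernel the row would typically stay in registers or shared memory between the two sweeps, so global memory is read once and written once.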

Softmax Fusion

Attention layers require softmax:

softmax(QK^T)

Implementations often fuse:

  • Scaling
  • Masking
  • Softmax

into a single kernel.

This reduces memory traffic significantly.
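
As a rough picture of what such a fused routine computes for one row of attention scores, here is a minimal Python sketch. The function name, the boolean mask convention, and the handling of fully masked rows are assumptions made for this example.

```python
import math


def fused_scale_mask_softmax(scores, mask, scale):
    """Scale, mask, and softmax one row of attention scores in one routine.

    mask[i] is True where position i may be attended to.
    """
    neg_inf = float("-inf")
    # Scale and mask while reading the inputs. In a real kernel these values
    # would stay in registers; a local list stands in for them here.
    scaled = [s * scale if keep else neg_inf for s, keep in zip(scores, mask)]
    row_max = max(scaled)
    if row_max == neg_inf:
        # Fully masked row: return zeros (a convention chosen for this sketch).
        return [0.0] * len(scores)
    # Subtract the row maximum before exponentiating for numerical stability.
    exps = [math.exp(v - row_max) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = fused_scale_mask_softmax([1.0, 2.0, 3.0], [True, True, False], 0.5)
print(probs)
```

The masked position ends up with probability zero, and the remaining entries sum to one.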

5. FlashAttention: The Famous Fusion Example

One of the best-known examples of fusion is Tri Dao's FlashAttention.

The traditional attention pipeline looks roughly like:

QK^T
 ↓
Store matrix
 ↓
Mask
 ↓
Store matrix
 ↓
Softmax
 ↓
Store matrix
 ↓
Multiply by V

The intermediate attention matrix can be enormous.

For long contexts it becomes a major bottleneck.

FlashAttention reorganizes the computation so that large intermediate matrices never need to be materialized in global memory.

Instead:

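  • Q, K, and V are processed in tiles
  • Each tile is kept in fast on-chip shared memory while it is being used
  • The score computation, masking, softmax, and multiplication by V are fused, with the softmax computed incrementally across tiles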

The result is dramatically lower memory usage (attention memory grows linearly with sequence length instead of quadratically) and substantially higher throughput.

This single optimization helped unlock much longer context windows for modern LLMs.
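
The piece that makes tiling possible is a running ("online") softmax: keep a running maximum and running sum, and rescale what has been accumulated whenever a new tile raises the maximum. The sketch below shows that idea for a single query vector in plain Python. It is a simplified illustration, not FlashAttention's actual kernel, and the function names and tile size are assumptions.

```python
import math


def attention_naive(q, keys, values):
    # Materializes the full score row, as in the traditional pipeline.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def attention_tiled(q, keys, values, tile=2):
    # Visits keys/values one tile at a time; only one tile of scores exists
    # at any moment, never the full row.
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_tiled(q, keys, values))
```

Both functions agree up to floating-point rounding, but the tiled version never holds more than `tile` scores at a time.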

6. How Fusion Works at the GPU Level

Let's go one level deeper.

Modern GPUs have a hierarchy:

Global Memory (HBM)
        ↓
L2 Cache
        ↓
Shared Memory
        ↓
Registers

Global memory is large but relatively slow.

Registers are extremely fast but tiny.

Fusion attempts to keep intermediate values as close to registers as possible.

Instead of:

Compute
 ↓
Write to HBM
 ↓
Read from HBM
 ↓
Compute

we get:

Compute
 ↓
Register
 ↓
Compute
 ↓
Register
 ↓
Compute

This drastically increases arithmetic intensity, the ratio of computation performed to bytes moved to and from global memory:

Useful Computation
------------------
Bytes Moved

Higher arithmetic intensity generally means better GPU utilization.

This is why fusion often produces large speedups even when the number of mathematical operations stays exactly the same.
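
A roofline-style estimate shows why. For a memory-bound kernel, attainable throughput is capped at arithmetic intensity times memory bandwidth. The device numbers and the FLOP count per element below are made up for illustration, not specifications of any real GPU, and the function name is hypothetical.

```python
def attainable_tflops(flops_per_elem, bytes_per_elem, peak_tflops, bandwidth_tb_s):
    """Roofline-style upper bound: min(compute peak, intensity * bandwidth)."""
    intensity = flops_per_elem / bytes_per_elem   # FLOPs per byte moved
    return min(peak_tflops, intensity * bandwidth_tb_s)


# Hypothetical device: 300 TFLOPS peak, 2 TB/s memory bandwidth.
PEAK, BW = 300.0, 2.0
# Assumed cost of add + GELU + scale: about 10 FLOPs per element, fp16 data.
FLOPS = 10
unfused_bound = attainable_tflops(FLOPS, 6 * 2, PEAK, BW)  # 3 reads + 3 writes
fused_bound = attainable_tflops(FLOPS, 2 * 2, PEAK, BW)    # 1 read + 1 write
print(f"unfused <= {unfused_bound:.2f} TFLOPS, fused <= {fused_bound:.2f} TFLOPS")
```

Under these assumptions both versions sit far below the compute peak, and the fused bound is three times higher purely because fewer bytes move.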

7. Why Fused Kernels Are Hard to Build

If fusion is so beneficial, why not fuse everything?

Because fusion introduces complexity.

Several challenges emerge:

Register Pressure

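Every intermediate value kept on-chip consumes registers.

When a kernel needs too many registers per thread, fewer threads can be resident at once (lower occupancy), and values may spill to much slower memory.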
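
A simplified calculation shows the trade-off. The per-SM limits used as defaults here are assumptions for illustration, and the estimate ignores register allocation granularity, shared memory limits, and block size, all of which matter on real hardware.

```python
def max_resident_threads(regs_per_thread, regs_per_sm=65536, max_threads_per_sm=2048):
    """Simplified register-limited estimate of resident threads on one SM.

    The default limits are illustrative assumptions, not a spec lookup.
    """
    by_registers = regs_per_sm // regs_per_thread
    return min(max_threads_per_sm, by_registers)


for regs in (32, 64, 128):
    threads = max_resident_threads(regs)
    print(f"{regs} regs/thread -> {threads} threads ({threads / 2048:.0%} of max)")
```

Each doubling of registers per thread past the break-even point halves the number of threads that can be resident, which leaves the hardware with fewer threads to switch to while others wait on memory.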

Compilation Complexity

A fused kernel may contain dozens of operations.

Generating optimal GPU code becomes difficult.

Hardware Differences

A kernel optimized for:

  • NVIDIA H100
  • NVIDIA A100
  • AMD MI300

may require different strategies.

Debugging

Instead of debugging:

Add
GELU
Multiply

you debug:

FusedAddGeluMultiplyLayerNormKernel_v7

which is considerably less pleasant.

This is one reason projects such as:

  • Triton
  • NVIDIA CUDA
  • TorchInductor

have become increasingly important.

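Triton and TorchInductor help automate kernel generation and fusion, while CUDA provides the underlying programming model for writing fused kernels by hand.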

Conclusion

Fused kernels are one of those optimizations that seem almost boring at first glance.

No new model architecture.

No breakthrough algorithm.

No clever prompting technique.

Yet they are responsible for a significant portion of the performance gains that make modern LLM systems practical.

The key insight is simple:

In large-scale AI systems, moving data is often more expensive than computing on it.

Kernel fusion reduces unnecessary memory traffic, keeps data closer to the GPU's compute units, and allows the hardware to spend more time doing useful work.

The next time you hear that a new LLM stack is dramatically faster, don't just ask about quantization, caching, or model architecture.

Ask:

How much of that speedup came from fused kernels?

Question for readers: Have you ever profiled an ML workload and discovered that memory movement—not computation—was the real bottleneck?


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

