Shrijith Venkatramana

Posted on Jun 15

Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel Fusion

#ai #productivity #programming #webdev

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star the project to help developers discover it, give it a try, and share your feedback to help improve the product.

Every few months, a new LLM appears claiming to be 2× faster, 3× cheaper, or capable of serving millions more tokens per second.

Many developers assume the gains come from better GPUs or smaller models.

Often, the real answer is far less glamorous:

Someone removed a few trips to memory.

One of the most important performance techniques in modern LLM inference is kernel fusion. It doesn't change the model architecture. It doesn't improve accuracy. It doesn't make the AI smarter.

It simply makes the hardware spend less time waiting and more time computing.

And in large-scale AI systems, that can mean the difference between serving thousands of users and serving millions.

Let's dig into how fused kernels work, starting from intuition and moving down to GPU-level details.

1. Why LLMs Are Often Memory-Bound

When developers first think about neural network performance, they usually focus on FLOPS.

Modern GPUs advertise enormous numbers:

Tens of TFLOPS
Hundreds of TFLOPS
Even PFLOPS for specialized workloads

Yet many LLM operations don't come close to using that compute capacity.

The reason is that a GPU spends a surprising amount of time moving data around.

Imagine a simple operation:

y = gelu(x + bias)

Conceptually this is tiny.

But naively, the GPU may:

Load x
Load bias
Compute addition
Write result to memory
Read result back
Compute GELU
Write final output

The arithmetic is cheap.

The memory traffic is expensive.

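As models grow to billions of parameters, memory movement becomes one of the dominant costs. During token-by-token generation, the weights typically have to be streamed from memory again for every new token.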
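To put rough numbers on the steps listed above, here is a minimal sketch that counts the bytes moved for y = gelu(x + bias), with and without the round trip through memory. The function name, the tensor shape, and the 2-byte (half-precision) element size are assumptions made for illustration, and caches are ignored entirely.

```python
def bias_gelu_traffic(rows, hidden, bytes_per_elem=2):
    """Return (unfused_bytes, fused_bytes) moved through global memory.

    Hypothetical helper: x has shape (rows, hidden), bias has shape (hidden,).
    """
    x_bytes = rows * hidden * bytes_per_elem
    bias_bytes = hidden * bytes_per_elem

    # Add kernel: load x, load bias, write the intermediate result.
    add_kernel = x_bytes + bias_bytes + x_bytes
    # GELU kernel: read the intermediate back, write the final output.
    gelu_kernel = x_bytes + x_bytes
    unfused = add_kernel + gelu_kernel

    # Fused kernel: load x, load bias, write the final output.
    fused = x_bytes + bias_bytes + x_bytes
    return unfused, fused


unfused, fused = bias_gelu_traffic(rows=4096, hidden=4096)
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB")
```

Under these assumptions the unfused version moves about twice as many bytes for exactly the same arithmetic.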

2. What Is a GPU Kernel?

Before understanding fusion, we need to understand kernels.

A GPU kernel is essentially a function that the host program launches on the GPU, where it runs across many threads in parallel.

For example:

z = x + y

might launch one kernel.

Then:

output = relu(z)

might launch another.

Then:

output = output * scale

might launch a third.

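Each separate kernel carries its own costs:

Launch and scheduling overhead
Reading its inputs from global memory
Writing its outputs to global memory
Synchronization with the kernels that depend on it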

The GPU repeatedly moves intermediate results between global memory and compute units.

Those extra movements add up quickly.
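The three-kernel chain above can be mimicked in plain Python. This is only a toy model: `GlobalMemory` and the `*_kernel` functions are hypothetical names, and "memory" is just a dictionary that counts how many elements are read and written.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element accesses."""

    def __init__(self):
        self.buffers = {}
        self.reads = 0
        self.writes = 0

    def load(self, name):
        values = self.buffers[name]
        self.reads += len(values)
        return values

    def store(self, name, values):
        self.buffers[name] = list(values)
        self.writes += len(values)


def add_kernel(mem, a, b, out):
    x, y = mem.load(a), mem.load(b)
    mem.store(out, [p + q for p, q in zip(x, y)])


def relu_kernel(mem, src, out):
    mem.store(out, [max(v, 0.0) for v in mem.load(src)])


def scale_kernel(mem, src, out, scale):
    mem.store(out, [v * scale for v in mem.load(src)])


def run_unfused(x, y, scale):
    mem = GlobalMemory()
    # Inputs are assumed to be resident already, so placing them is not counted.
    mem.buffers["x"], mem.buffers["y"] = list(x), list(y)
    add_kernel(mem, "x", "y", "z")              # z = x + y
    relu_kernel(mem, "z", "tmp")                # tmp = relu(z)
    scale_kernel(mem, "tmp", "output", scale)   # output = tmp * scale
    return mem.buffers["output"], mem.reads, mem.writes


output, reads, writes = run_unfused([1.0, -2.0, 3.0], [0.5, 0.5, 0.5], scale=2.0)
print(output, reads, writes)
```

For three elements the chain performs 12 element reads and 9 element writes, even though only 6 reads (the two inputs) and 3 writes (the final output) are strictly needed.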

3. The Core Idea of Kernel Fusion

Kernel fusion combines multiple operations into a single GPU kernel.

Instead of:

z = x + bias
a = gelu(z)
output = a * scale

we create one fused operation:

output = scale * gelu(x + bias)

Now the GPU can:

Load input once
Perform all calculations
Write output once

No intermediate tensors are stored in global memory.

Visually:

Without fusion

Memory → Add
          ↓
       Memory
          ↓
        GELU
          ↓
       Memory
          ↓
       Scale
          ↓
       Memory

With fusion

Memory → Add → GELU → Scale → Memory

The computation is identical.

The data movement is dramatically reduced.
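As a rough CPU-side analogy, the two versions can be written out in Python. The names `unfused` and `fused` are made up for this sketch; the point is that the fused loop keeps `z` and `a` in local variables (standing in for registers) rather than building full intermediate lists (standing in for tensors in global memory).

```python
import math


def gelu(v):
    # Exact GELU using the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def unfused(x, bias, scale):
    z = [xi + bi for xi, bi in zip(x, bias)]   # "kernel" 1 materializes z
    a = [gelu(zi) for zi in z]                 # "kernel" 2 materializes a
    return [ai * scale for ai in a]            # "kernel" 3 writes the output


def fused(x, bias, scale):
    out = []
    for xi, bi in zip(x, bias):
        z = xi + bi            # intermediate stays in a local variable
        a = gelu(z)            # no list of intermediates is ever built
        out.append(a * scale)  # single write per element
    return out


x = [0.5, -1.0, 2.0, 0.0]
bias = [0.1, 0.2, -0.3, 0.0]
print(fused(x, bias, 1.5) == unfused(x, bias, 1.5))
```

Both functions apply the same operations in the same order to each element, so the results match; only the amount of intermediate storage differs.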

4. Where Fusion Appears Inside LLMs

Modern transformers contain many opportunities for fusion.

A few common examples:

Bias + GELU Fusion

Instead of:

hidden = linear(x)
hidden += bias
hidden = gelu(hidden)

the bias addition and the activation are fused into one kernel, so the hidden tensor makes one trip through memory instead of two.

This is common in transformer MLP blocks.

LayerNorm Fusion

Layer normalization requires:

Mean computation
Variance computation
Normalization
Scale
Shift

Naively these can involve multiple passes through memory.

Optimized kernels perform much of the work in one fused operation.
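One way to cut down the passes is sketched below, assuming a single row and Welford's running-statistics method for mean and variance. This is an illustration of the idea, not how any particular library implements LayerNorm; the function names and the `eps` default are assumptions.

```python
import math


def layernorm_reference(row, gamma, beta, eps=1e-5):
    """Step-by-step version: every line is a separate pass over the data."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n
    normalized = [(v - mean) / math.sqrt(var + eps) for v in row]
    scaled = [v * g for v, g in zip(normalized, gamma)]
    return [v + b for v, b in zip(scaled, beta)]


def layernorm_fused(row, gamma, beta, eps=1e-5):
    """Mean and variance in one pass, then normalize, scale and shift together."""
    mean, m2 = 0.0, 0.0
    for count, v in enumerate(row, start=1):
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)       # Welford update of the squared deviations
    inv_std = 1.0 / math.sqrt(m2 / len(row) + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
print(layernorm_fused(row, [1.0] * 4, [0.0] * 4))
```

In a real GPU kernel the row would typically stay in registers or shared memory between the two loops, so global memory is read once and written once.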

Softmax Fusion

Attention layers require a softmax over the scaled scores, where d_k is the key dimension:

softmax(QK^T / sqrt(d_k))

Implementations often fuse:

Scaling
Masking
Softmax

into a single kernel.

This reduces memory traffic significantly.
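Here is a minimal sketch of scaling, masking and softmax handled together for one row of attention scores. The function name and the boolean mask convention (True means "keep") are assumptions, and the sketch assumes at least one position is kept.

```python
import math


def fused_scale_mask_softmax(scores, mask, scale):
    """Scale, mask and softmax one row without storing scaled or masked copies."""
    # Pass 1: find the maximum of the scaled, unmasked scores (for stability).
    row_max = -math.inf
    for s, keep in zip(scores, mask):
        if keep:
            row_max = max(row_max, s * scale)

    # Pass 2: exponentiate; masked positions contribute exactly zero.
    exps = [math.exp(s * scale - row_max) if keep else 0.0
            for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


probs = fused_scale_mask_softmax([2.0, 4.0, 6.0], [True, True, False], scale=0.5)
print(probs)
```

The scaled and masked versions of the row never exist as separate arrays; scaling and masking happen on the fly while the softmax is computed.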

5. FlashAttention: The Famous Fusion Example

One of the best-known examples of fusion is Tri Dao's FlashAttention.

The traditional attention pipeline looks roughly like:

QK^T
 ↓
Store matrix
 ↓
Mask
 ↓
Store matrix
 ↓
Softmax
 ↓
Store matrix
 ↓
Multiply by V

The intermediate attention matrix can be enormous: its size grows with the square of the sequence length.

For long contexts it becomes a major bottleneck.

FlashAttention reorganizes the computation so that large intermediate matrices never need to be materialized in global memory.

Instead:

Data is processed in tiles
Shared memory is heavily used
Multiple attention steps are effectively fused together

The result is dramatically lower memory usage and substantially higher throughput.

This single optimization helped unlock much longer context windows for modern LLMs.
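The tiling idea can be illustrated for a single query vector in plain Python. This is a simplified sketch of the running ("online") softmax trick that makes tile-by-tile processing possible, not the actual FlashAttention kernel; the function names and the tile size are assumptions, and masking is left out.

```python
import math


def naive_attention(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    # The full row of scores is materialized before the softmax.
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

The tiled version only ever holds one tile of scores at a time, yet produces the same output up to floating-point rounding.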

6. How Fusion Works at the GPU Level

Let's go one level deeper.

Modern GPUs have a hierarchy:

Global Memory (HBM)
        ↓
L2 Cache
        ↓
Shared Memory
        ↓
Registers

Global memory is large but relatively slow.

Registers are extremely fast but tiny.

Fusion attempts to keep intermediate values in registers or shared memory instead of sending them back to global memory.

Instead of:

Compute
 ↓
Write to HBM
 ↓
Read from HBM
 ↓
Compute

we get:

Compute
 ↓
Register
 ↓
Compute
 ↓
Register
 ↓
Compute

This drastically increases arithmetic intensity, the ratio of work done to data moved:

Useful Computation (FLOPs)
--------------------------
Bytes Moved

Higher arithmetic intensity generally means better GPU utilization.

This is why fusion often produces large speedups even when the number of mathematical operations stays exactly the same.
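A quick roofline-style estimate makes this concrete. Everything numeric below is an assumption chosen for round figures: the GPU is hypothetical (300 TFLOP/s peak, 2 TB/s memory bandwidth), the chain is add, GELU, multiply on 2-byte elements, and the cost of about 10 FLOPs per element is a guess rather than a measurement.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from global memory."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline-style bound: capped by compute or by memory bandwidth."""
    return min(peak_flops, intensity * bandwidth)


# Hypothetical GPU, not a real product's specification.
PEAK_FLOPS = 300e12   # 300 TFLOP/s
BANDWIDTH = 2e12      # 2 TB/s

FLOPS_PER_ELEM = 10   # assumed cost of add + GELU + multiply
UNFUSED_BYTES = 12    # 3 kernels, each reads 2 bytes and writes 2 bytes
FUSED_BYTES = 4       # 1 kernel, reads 2 bytes and writes 2 bytes

unfused_rate = attainable_flops(
    arithmetic_intensity(FLOPS_PER_ELEM, UNFUSED_BYTES), PEAK_FLOPS, BANDWIDTH)
fused_rate = attainable_flops(
    arithmetic_intensity(FLOPS_PER_ELEM, FUSED_BYTES), PEAK_FLOPS, BANDWIDTH)

print(f"unfused: {unfused_rate / 1e12:.2f} TFLOP/s")
print(f"fused:   {fused_rate / 1e12:.2f} TFLOP/s")
print(f"speedup: {fused_rate / unfused_rate:.1f}x")
```

Under these assumptions both versions are still far below the compute peak, so they remain memory-bound, but the fused one moves a third of the bytes and is bounded about three times higher.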

7. Why Fused Kernels Are Hard to Build

If fusion is so beneficial, why not fuse everything?

Because fusion introduces complexity.

Several challenges emerge:

Register Pressure

Every intermediate value consumes registers.

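When each thread needs too many registers, fewer threads can be resident on the GPU at once, which reduces occupancy. Values that do not fit in registers spill to slower memory.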
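The relationship between register use and occupancy can be estimated with simple arithmetic. The sketch below assumes 65,536 registers and at most 2,048 resident threads per streaming multiprocessor; check the documentation for your GPU, since these limits vary. It also ignores allocation granularity, shared memory limits and per-block caps, so treat it as a rough upper bound.

```python
def estimated_occupancy(regs_per_thread, threads_per_block,
                        regs_per_sm=65536, max_threads_per_sm=2048):
    """Fraction of the maximum resident threads that fit on one SM.

    Hypothetical helper: only the register and thread limits are modeled.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    resident_threads = min(blocks_by_regs, blocks_by_threads) * threads_per_block
    return resident_threads / max_threads_per_sm


for regs in (32, 64, 128):
    occ = estimated_occupancy(regs, threads_per_block=256)
    print(f"{regs} registers per thread: {occ:.0%} occupancy")
```

With these assumed limits, doubling the registers per thread from 32 to 64 halves the number of threads that can be resident, which is the trade-off a heavily fused kernel has to manage.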

Compilation Complexity

A fused kernel may contain dozens of operations.

Generating optimal GPU code becomes difficult.

Hardware Differences

A kernel tuned for one of these GPUs:

NVIDIA H100
NVIDIA A100
AMD MI300

may need a different strategy to perform well on the others.

Debugging

Instead of debugging:

Add
GELU
Multiply

you debug:

FusedAddGeluMultiplyLayerNormKernel_v7

which is considerably less pleasant.

This is one reason compiler projects such as:

Triton
TorchInductor

have become increasingly important alongside hand-written NVIDIA CUDA kernels.

They help automate kernel generation and fusion.

Conclusion

Fused kernels are one of those optimizations that seem almost boring at first glance.

No new model architecture.

No breakthrough algorithm.

No clever prompting technique.

Yet they are responsible for a significant portion of the performance gains that make modern LLM systems practical.

The key insight is simple:

In large-scale AI systems, moving data is often more expensive than computing on it.

Kernel fusion reduces unnecessary memory traffic, keeps data closer to the GPU's compute units, and allows the hardware to spend more time doing useful work.

The next time you hear that a new LLM stack is dramatically faster, don't just ask about quantization, caching, or model architecture.

Ask:

How much of that speedup came from fused kernels?

Question for readers: Have you ever profiled an ML workload and discovered that memory movement, not computation, was the real bottleneck?

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

HexmosTech / git-lrc
