The Real Cost of LLM Inference: Memory Bandwidth, Not FLOPs

For years, AI performance discussions focused on a single metric: FLOPs — floating-point operations per second.

But in 2025, FLOPs are no longer the real bottleneck for LLM inference.

If you run any modern model (Llama 3, Qwen2.5, Mistral, Gemma, DeepSeek), you’ll notice something strange:

Your GPU’s compute units sit idle while its memory bus chokes.

This is not a software issue.

It’s a fundamental hardware constraint.

This post explains why.


1. LLMs Don’t Compute — They Fetch

During inference, an LLM does almost no “heavy math.”

Each token requires only thin matrix-vector multiplies, which are cheap arithmetic relative to the data they touch.

The real work is loading billions of parameters from memory into the GPU compute cores, over and over again, once for every generated token.

Those parameters live in VRAM (or, worse, system RAM), so the GPU must continuously stream them into its tensor cores.

And memory bandwidth is finite.

For example:

  • A100 GPU memory bandwidth: 2 TB/s
  • RTX 4090 memory bandwidth: 1 TB/s
  • Llama-3-70B FP16 weights: 140 GB
  • Llama-3-70B Q4_K_M weights: ~38 GB

Even with quantization, you cannot move tens of gigabytes per token through the memory bus fast enough to keep the compute units fed.

So compute sits idle.
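A quick back-of-the-envelope check shows how hard that ceiling is. In single-stream decoding, every weight byte has to cross the memory bus once per generated token, so bandwidth divided by model size gives an upper bound on tokens per second. A minimal sketch (it ignores KV-cache and activation traffic, which only make things worse):

```python
# Bandwidth-limited ceiling on decode speed: every parameter byte
# must be streamed from VRAM once for each generated token.
def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound; ignores KV-cache reads and activations."""
    return bandwidth_gb_s / weights_gb

print(max_tokens_per_sec(2000, 140))  # Llama-3-70B FP16 on an A100 (~2 TB/s): ~14 tok/s
print(max_tokens_per_sec(1000, 38))   # Llama-3-70B Q4_K_M on an RTX 4090 (~1 TB/s): ~26 tok/s
```

No amount of extra FLOPs raises those ceilings. Only more bandwidth, or fewer bytes per token, does.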


2. Why FLOPs Are Misleading for LLMs

LLM inference is not like a vision model’s forward pass, where a large batch of work arrives at once.

Autoregressive decoding generates tokens one at a time, and every new token requires a full pass over the model’s weights.

For each token, the model must:

  1. Read every attention layer’s parameters
  2. Read every MLP block’s parameters
  3. Read rotary / positional data and normalization weights
  4. Run a tiny amount of math on them
  5. Output logits over the whole vocabulary (tens of thousands of values or more)

In most layers—especially attention—the math is tiny compared to the weight-loading cost.

So even if your GPU is rated for 100 TFLOPS, single-stream decoding will use only a small fraction of that peak.

Because compute waits for memory.
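You can see this with a roofline-style calculation. A dense transformer does roughly two FLOPs per parameter per token (a multiply and an add), while each FP16 parameter is two bytes, so batch-1 decoding offers about one FLOP of work per byte fetched. The GPU needs roughly two orders of magnitude more than that to keep its tensor cores busy. A rough sketch, using approximate published A100 numbers:

```python
# Arithmetic intensity of batch-1 decoding vs. what the GPU needs to stay busy.
params = 70e9                      # dense 70B model
flops_per_token = 2 * params       # ~1 multiply + 1 add per weight
bytes_per_token = params * 2       # FP16: 2 bytes per weight

intensity = flops_per_token / bytes_per_token     # FLOPs per byte moved -> 1.0

peak_flops = 312e12                # A100 FP16 tensor-core peak (approx.)
bandwidth = 2e12                   # A100 HBM bandwidth, bytes/s (approx.)
machine_balance = peak_flops / bandwidth          # ~156 FLOPs needed per byte

# Roofline upper bound on compute utilization at batch size 1:
print(intensity / machine_balance)                # ~0.6% of peak FLOPs
```

The exact percentage varies by model and kernel quality, but the conclusion doesn’t: at batch size 1, the tensor cores are starved long before the FLOP budget matters.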


3. A Simple Example: Why Bigger Models Don’t Always Run Slower

Consider two models:

  • Qwen2.5-7B — 7 billion params
  • Llama3-8B — 8 billion params, a similar size

Both might run at:

  • ~40 tokens/s on an RTX 4090
  • ~200 tokens/s aggregate on an A100 with batching

Now scale to a 70B model:

  • 2–4 tokens/s on a consumer GPU
  • 12–15 tokens/s on powerful A100/H100 clusters

Compute did not get 10× slower; memory movement grew 10× larger.

Every layer now loads roughly 10× more weights for each token, and the bandwidth bottleneck explodes.
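A two-line sanity check shows the slowdown tracking bytes, not FLOPs. The sizes below are the approximate 4-bit (Q4_K_M-style) file sizes of a ~7B and a ~70B model:

```python
# Decode speed scales with bytes streamed per token, not with GPU FLOPs.
small_gb = 4.5   # ~7B model, 4-bit quantized (approximate)
large_gb = 40.0  # ~70B model, 4-bit quantized (approximate)

print(large_gb / small_gb)   # ~9x more data moved per token
```

The observed drop from ~40 tok/s to 2–4 tok/s on the same consumer GPU is roughly that same factor; the card’s FLOP rating never changed.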


4. Why Quantization Helps So Much

Quantization is not magic.

It doesn’t “optimize math.”

It solves a different bottleneck:

It reduces the amount of data that must be read each token.

Examples:

| Conversion | Size vs FP16 | Result |
| --- | --- | --- |
| FP16 → INT8 | 2× smaller | 2× less memory bandwidth per token |
| FP16 → Q4 | ~4× smaller | ~4× faster weight loading |
| FP16 → Q2 | ~8× smaller | Only practical for small models |

Quantization makes models memory-bandwidth friendly, not “compute-friendly.”

That’s why Qwen2.5-3B-Q4 can run >150 tok/s on a laptop.
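The same bandwidth arithmetic explains that laptop number. Treat the figures below as assumptions for illustration: ~300 GB/s of unified-memory bandwidth (roughly Apple-Silicon-class) and ~4.5 bits per weight for a Q4-style quantization:

```python
# Bytes per token and the resulting decode ceiling for a 7B model
# on an assumed 300 GB/s laptop memory bus.
PARAMS = 7e9
BANDWIDTH = 300e9  # bytes/s (assumed)

for name, bits in [("FP16", 16), ("INT8", 8), ("Q4", 4.5)]:
    bytes_per_token = PARAMS * bits / 8
    ceiling = BANDWIDTH / bytes_per_token
    print(f"{name}: {bytes_per_token / 1e9:.1f} GB/token -> <= {ceiling:.0f} tok/s")

# FP16: 14.0 GB/token -> <= 21 tok/s
# INT8:  7.0 GB/token -> <= 43 tok/s
# Q4:    3.9 GB/token -> <= 76 tok/s
```

Scale that down to a 3B model and the Q4 ceiling more than doubles, which is exactly the territory of the >150 tok/s figure above.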


5. KV Cache: The Hidden Memory Killer

During inference, every token’s attention keys and values are cached for each layer, so they never have to be recomputed.

For long contexts (100K–1M tokens), KV cache becomes massive:

  • Qwen2.5-7B → 80–120 MB per 1K tokens
  • Llama3-70B → 600–800 MB per 1K tokens

Even if weights fit in VRAM, the KV cache bandwidth becomes the new bottleneck.
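The per-token cost is easy to compute from the model shape: two cached tensors (keys and values) per layer, each with n_kv_heads × head_dim elements. A sketch with an assumed Llama-3-70B-like shape (80 layers, grouped-query attention with 8 KV heads, head_dim 128) and an FP16 cache:

```python
# KV-cache bytes added per token: K and V for every layer.
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_bytes_per_token(80, 8, 128)       # FP16 cache
print(per_token / 1e6)                           # ~0.33 MB per token
print(per_token * 100_000 / 1e9)                 # ~33 GB of cache at a 100K-token context
```

Keep the cache in FP32, or use a model without grouped-query attention, and these numbers multiply again, which is where the per-1K-token ranges above come from. At long contexts the cache rivals the weights themselves, and every decode step has to re-read it.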

This is why:

  • Long contexts slow down generation
  • Sliding-window attention models run faster
  • Mamba / RWKV / SSMs are becoming popular

Transformer inference breaks under the weight of its own memory access patterns.


6. Why Future LLMs Must Be “Memory-First” Models

Model architectures that solve the memory bottleneck will dominate.

Four directions are already emerging:

1. State-Space Models (SSMs) — Mamba, RWKV

They avoid quadratic attention and replace the growing KV cache with a small fixed-size state, so memory traffic per token stays flat regardless of context length.

2. Sparse / MoE architectures

Only the routed experts’ weights are read for each token, not the full parameter set (see the sketch at the end of this section).

3. Flash Attention / Flash Decoding

Fused kernels that avoid materializing large intermediate attention matrices, so each token costs far fewer memory reads and writes.

4. On-device compression formats

LLM weights stored in compressed form and decompressed during compute.

All aim at one thing:

Reduce memory traffic.
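The MoE case is worth putting numbers on, as promised above. Using approximate Mixtral-8x7B-style figures (~47B total parameters, ~13B active per token) at a 4-bit-ish quantization:

```python
# MoE: only the routed experts' weights are streamed for each token.
total_params = 47e9    # approximate Mixtral-8x7B total
active_params = 13e9   # approximate active parameters per token
bits = 4.5             # Q4-style quantization

dense_gb_per_token = total_params * bits / 8 / 1e9    # ~26 GB if it were dense
active_gb_per_token = active_params * bits / 8 / 1e9  # ~7 GB actually moved
print(dense_gb_per_token, active_gb_per_token)
```

The model carries 47B parameters’ worth of capacity but pays the memory-bandwidth bill of a ~13B dense model on every token.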


7. The Hard Truth: GPUs Are Overpowered for LLMs

Modern GPUs like the A100, H100, and RTX 4090 have compute units so fast that the memory system cannot feed them transformer weights quickly enough.

This is why:

  • Token generation rates plateau
  • Adding more GPUs doesn’t scale linearly
  • Smaller models feel “snappier” than huge ones
  • Flash decoding gives big gains
  • CPU inference is becoming viable again
  • On-device LLMs are exploding

The bottleneck is bandwidth — not FLOPs, not cores, not tensor units.


Final Thoughts

If you want to optimize LLM inference:

  • Don’t chase FLOPs
  • Optimize memory
  • Quantize aggressively
  • Use SSMs where possible
  • Reduce context window
  • Monitor KV cache growth (a quick budgeting sketch follows this list)
  • Use Flash-specific kernels
  • Keep batch small unless you’re serving multiple users
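For the KV-cache point in particular, a small budgeting sketch makes the trade-off concrete. The model shape and VRAM figures below are assumptions for illustration (an 8B-class model with 32 layers, 8 KV heads, head_dim 128, ~5 GB of 4-bit weights, on a 24 GB consumer GPU):

```python
# How much context fits in VRAM once the weights are loaded?
def max_context_tokens(vram_gb, weights_gb, n_layers, n_kv_heads, head_dim,
                       kv_bytes_per_elem=2):
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_elem
    free_bytes = (vram_gb - weights_gb) * 1e9
    return int(free_bytes // kv_per_token)

print(max_context_tokens(24, 5, 32, 8, 128))   # ~145K tokens before VRAM runs out
```

Even when the cache fits, all of it is re-read on every decode step, so generation slows down well before you hit that limit.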

Modern LLM speed depends on how fast your hardware can move bytes, not how fast it can multiply matrices.

The future of AI is not compute-first.

It’s memory-first architecture.

