vaibhav ahluwalia

Caching Strategies for LLM Systems (Part 2): KV Cache and the Mathematics of Fast Transformer Inference

Figure: self-attention in a transformer. Inputs are projected into Q (queries), K (keys), and V (values), and attention weights are computed from scaled dot products. (Source: ResearchGate)

Autoregressive decoding in transformers is computationally expensive due to repeated self-attention over an ever-growing context. A naïve implementation recomputes attention over all previous tokens at every decoding step, leading to quadratic complexity per token. Transformer Key-Value (KV) caching eliminates this redundancy by storing attention keys and values from previous steps and reusing them during decoding.

If you missed it, in Part 1 of this series, we explored Exact-Match and Semantic Caching for LLMs, showing how we can reuse previous responses to reduce latency and token cost. In this post, we move to KV caching, a complementary technique that optimizes autoregressive decoding itself.


The Autoregressive Problem

Language models generate text one token at a time.

Mathematically, a model learns:

p(x₁, x₂, …, x_T) = ∏ₜ p(x_t | x₁, …, x_{t-1})

Which means:

Each new token depends on all previous tokens.

So when generating token x_t, the model must look at the entire prefix:

x₁, x₂, …, x_{t-1}

This is where attention comes in.
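To see the per-token loop concretely, here is a minimal sketch of naïve greedy decoding, assuming the Hugging Face transformers library with gpt2 as a stand-in model and a placeholder prompt; passing use_cache=False forces the model to re-run attention over the full prefix at every step.

```python
# Minimal sketch of naive greedy decoding: every step re-runs the model
# over the whole prefix (use_cache=False disables the KV cache).
# Assumes the Hugging Face transformers library; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("KV caching makes decoding", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids, use_cache=False).logits  # full-prefix recompute
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0]))
```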


The Attention Equation (Core of Transformers)

Inside every transformer layer, attention is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

Where:

  • Q = Query vectors
  • K = Key vectors
  • V = Value vectors
  • d = dimension of each attention head

For a sequence of length t, we compute:

Q ∈ R^(t × d)
K ∈ R^(t × d)
V ∈ R^(t × d)

So attention builds a t × t similarity matrix.
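As a reference implementation of this equation, here is a minimal single-head PyTorch sketch (no causal mask, shapes exactly as defined above):

```python
# Minimal sketch of scaled dot-product attention for one head.
# Q, K, V each have shape (t, d), matching the definitions in the post.
import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5   # (t, t) similarity matrix
    weights = F.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V                          # (t, d) output

t, d = 5, 64
Q, K, V = torch.randn(t, d), torch.randn(t, d), torch.randn(t, d)
print(attention(Q, K, V).shape)  # torch.Size([5, 64])
```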


The Hidden Inefficiency

During generation, tokens arrive one by one:

x₁ → x₂ → x₃ → ... → x_t

At step t, the model computes attention over:

x₁, x₂, ..., x_t

But notice something important:

The Key and Value vectors of previous tokens never change.

Once token x₁ is processed:

  • its Key = fixed
  • its Value = fixed

Yet naïve decoding recomputes them again and again.

So at step t, the model recomputes:

K₁, K₂, ..., K_{t-1}
V₁, V₂, ..., V_{t-1}

even though they were already computed at step t-1.

This is pure wasted compute.


Complexity Explosion

Let’s look at the cost.

At step t, attention costs:

O(t²)

So total decoding cost becomes:

Σₜ O(t²) = O(T³)

(since Σₜ₌₁ᵀ t² = T(T+1)(2T+1)/6 ≈ T³/3)

Which means:

  • Latency explodes with context length
  • Long prompts become impractically slow
  • Real-time chat breaks

This is why naïve transformer decoding, without caching, is painfully slow.


The Key Insight: Keys and Values Are Invariant

Here is the beautiful observation:

Once a token is processed, its Key and Value never change.

So instead of recomputing them, we store them.

This is Key-Value caching.
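A tiny numerical check of this invariance, using a random projection matrix as a stand-in for one attention head: because causal masking means the hidden states of earlier positions do not change when new tokens arrive, the keys (and values) of those positions are the same whether computed at step 9 or step 10.

```python
# Sketch: keys/values are per-token projections of hidden states that
# causal attention never revisits, so growing the prefix leaves the
# already-computed K/V rows unchanged (random weights as stand-ins).
import torch

torch.manual_seed(0)
d = 64
W_k = torch.randn(d, d)

x = torch.randn(10, d)        # hidden states for 10 tokens
K_at_step_9 = x[:9] @ W_k     # keys available after processing 9 tokens
K_at_step_10 = x @ W_k        # keys recomputed naively at step 10

# The first 9 rows match: recomputing them was wasted work.
print(torch.allclose(K_at_step_9, K_at_step_10[:9]))  # True
```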


KV Cache: The Mathematical Fix

Figure: the KV caching process in transformer inference, split into two phases: prefill (process the prompt and fill the cache) and decode (generate tokens one at a time using the cached keys and values).

At each step t, we do:

  1. Compute the Key and Value for the new token:
   k_t, v_t
  2. Append them to the cache:
   K₁:t = [K₁:t-1, k_t]
   V₁:t = [V₁:t-1, v_t]
  3. Compute attention using the new query and the cached keys and values:
   Attention(q_t, K₁:t, V₁:t)

Now we only compute attention between:

  • 1 query vector
  • t cached key vectors

So the attention cost per generated token becomes linear in the context length.
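Here is a minimal single-head sketch of this loop in plain PyTorch (random weights stand in for a trained model): the cache grows by one key/value row per step, and each step attends with a single query vector.

```python
# Sketch of single-head decoding with a KV cache. Random weights stand in
# for a real model; no causal mask is needed because the only query is the
# newest token, which may attend to every cached position.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

K_cache = torch.empty(0, d)   # grows to shape (t, d)
V_cache = torch.empty(0, d)

for step in range(8):
    x_t = torch.randn(1, d)                    # hidden state of the new token
    q_t, k_t, v_t = x_t @ W_q, x_t @ W_k, x_t @ W_v

    # Append the new key/value to the cache.
    K_cache = torch.cat([K_cache, k_t], dim=0)
    V_cache = torch.cat([V_cache, v_t], dim=0)

    # One query against t cached keys: O(t) instead of O(t²).
    scores = q_t @ K_cache.T / d**0.5          # (1, t)
    out = F.softmax(scores, dim=-1) @ V_cache  # (1, d)

print(K_cache.shape)  # torch.Size([8, 64])
```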


Complexity Reduction

Figure: attention computation without a KV cache versus with one; cached keys and values avoid recomputation for previous tokens.

Without KV Cache

Per token cost:

O(t²)

Total decoding:

O(T³)

With KV Cache

Per token cost:

O(t)

Total decoding:

O(T²)

This is a massive reduction.
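A rough back-of-the-envelope comparison, counting only query-key score computations over a full generation of T tokens:

```python
# Count of query-key dot products over a generation of T tokens,
# ignoring constant factors and the rest of the network.
T = 4096
without_cache = sum(t * t for t in range(1, T + 1))  # ≈ T³ / 3
with_cache = sum(t for t in range(1, T + 1))         # ≈ T² / 2

print(f"{without_cache:,} vs {with_cache:,}")
print(f"ratio ≈ {without_cache / with_cache:.0f}")   # ≈ 2731 at T = 4096
```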

This is why modern LLMs can handle:

  • Long conversations
  • Streaming responses
  • Real-time chat

Memory Tradeoff

KV cache trades compute for memory.

Each token stores:

  • its Key vector
  • its Value vector

So memory grows linearly with context length:

Memory ∝ T

This is why:

  • Long context models need large GPUs
  • Paged KV cache exists
  • KV quantization is popular

KV Cache Memory Analysis

KV caching trades extra memory for faster computation during autoregressive inference.
Each new token contributes one key and one value vector per attention head per layer.


Notation

| Symbol | Meaning |
| --- | --- |
| B | Batch size (number of sequences processed together) |
| N | Sequence length (context + generated tokens) |
| L | Number of transformer layers |
| H | Number of attention heads per layer |
| d_head | Dimension of each attention head |
| d_model | Hidden size of the model, typically d_model = H × d_head |
| b | Bits per element (e.g., 16 for FP16, 4 for int4) |

KV Cache Formula

For a single batch element, each token stores one key vector and one value vector of size d_model in every layer (since H × d_head = d_model), so it requires:

numbers_per_token = 2 × L × d_model

For a full sequence of length N, the total number of elements stored in the KV cache is:

total_numbers = B × N × 2 × L × d_model

Converting to bytes, for b bits per number:

KV_cache_bytes = B × N × 2 × L × d_model × (b / 8)

Example: 7B Model (FP16)

Consider a 7B-parameter model with:

  • L = 32 layers
  • d_model = 4096 hidden size
  • Sequence length N = 4096
  • Batch size B = 1
  • FP16 → b = 16 bits (2 bytes per float)

KV_cache_bytes = 1 × 4096 × 2 × 32 × 4096 × 2
               = 2,147,483,648 bytes
               ≈ 2 GB

This shows how KV cache grows linearly with layers L, sequence length N, and batch B — which can become a memory bottleneck for large models.
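The same arithmetic as a small helper, a sketch in which the function name kv_cache_bytes is purely illustrative (standard multi-head attention is assumed; models using grouped-query attention cache fewer heads and would need an extra factor):

```python
# Sketch of the KV cache size formula: B × N × 2 × L × d_model × (b / 8).
# Assumes standard multi-head attention (KV width equals d_model).
def kv_cache_bytes(B: int, N: int, L: int, d_model: int, bits: int) -> float:
    return B * N * 2 * L * d_model * bits / 8

# 7B-class example from above: 32 layers, d_model = 4096, N = 4096, FP16.
size = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=16)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```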


Optimization: 4-bit KV Cache Quantization

If we store keys and values in 4-bit precision instead of FP16:

KV_cache_4bit = 2 GB × (4 / 16) = 0.5 GB

This dramatically reduces memory, enabling:

  • Longer context windows
  • Larger batch sizes
  • GPU-friendly inference
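Using the same illustrative kv_cache_bytes helper from the previous section with bits=4 reproduces the number above:

```python
# Same 7B example, with keys and values stored in 4-bit precision.
size_4bit = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=4)
print(f"{size_4bit / 2**30:.2f} GiB")  # 0.50 GiB
```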

Quick Reference Table

| Model | Seq Len | Layers | Hidden | Precision | KV Memory |
| --- | --- | --- | --- | --- | --- |
| 7B | 4096 | 32 | 4096 | FP16 | 2 GB |
| 7B | 4096 | 32 | 4096 | 4-bit | 0.5 GB |

Sources: Hugging Face Transformers documentation & blog (on KV caching and cache quantization), and NVIDIA Developer blog on LLM inference memory formulas.


Final Intuition

The KV cache turns transformer decoding from a process that reprocesses the entire context at every step into a fast, streaming one.

It is the difference between waiting minutes for a response and seeing tokens appear instantly.


Connect with me:

LinkedIn: https://www.linkedin.com/in/vaibhav-ahluwalia-83887a227/
