Diagram of self‑attention in transformers: inputs are transformed into Q (queries), K (keys), and V (values), and attention weights are computed using scaled dot products.
(Source: ResearchGate)
Autoregressive decoding in transformers is computationally expensive due to repeated self-attention over an ever-growing context. A naïve implementation recomputes attention over all previous tokens at every decoding step, leading to quadratic complexity per token. Transformer Key-Value (KV) caching eliminates this redundancy by storing attention keys and values from previous steps and reusing them during decoding.
If you missed it, in Part 1 of this series, we explored Exact-Match and Semantic Caching for LLMs, showing how we can reuse previous responses to reduce latency and token cost. In this post, we move to KV caching, a complementary technique that optimizes autoregressive decoding itself.
The Autoregressive Problem
Language models generate text one token at a time.
Mathematically, a model learns:
p(x₁, x₂, …, x_T) = ∏ₜ p(x_t | x₁, …, x_{t-1})
Which means:
Each new token depends on all previous tokens.
So when generating token x_t, the model must look at the entire prefix:
x₁, x₂, …, x_{t-1}
This is where attention comes in.
The Attention Equation (Core of Transformers)
Inside every transformer layer, attention is computed as:
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
Where:
- Q = Query vectors
- K = Key vectors
- V = Value vectors
- d = dimension of each attention head
For a sequence of length t, we compute:
Q ∈ R^(t × d)
K ∈ R^(t × d)
V ∈ R^(t × d)
So attention builds a t × t similarity matrix.
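To make the shapes concrete, here is a minimal NumPy sketch of this equation for a single attention head. The function name and the toy sizes (t = 5, d = 8) are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵀ / √d) · V for one attention head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (t, t) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (t, d) output

# Toy example: a sequence of t = 5 tokens, head dimension d = 8
t, d = 5, 8
Q, K, V = np.random.randn(t, d), np.random.randn(t, d), np.random.randn(t, d)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)
```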
The Hidden Inefficiency
During generation, tokens arrive one by one:
x₁ → x₂ → x₃ → ... → x_t
At step t, the model computes attention over:
x₁, x₂, ..., x_t
But notice something important:
The Key and Value vectors of previous tokens never change.
Once token x₁ is processed:
- its Key = fixed
- its Value = fixed
Yet naïve decoding recomputes them again and again.
So at step t, the model recomputes:
K₁, K₂, ..., K_{t-1}
V₁, V₂, ..., V_{t-1}
even though they were already computed at step t-1.
This is pure wasted compute.
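To see the waste in code, here is a deliberately naive decoding loop. The projection matrices `W_q`, `W_k`, `W_v` and the toy dimension are illustrative stand-ins for one attention head, not any real model's weights:

```python
import numpy as np

d = 8
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
prefix = []   # hidden states of every token seen so far

def naive_step(x_t):
    """One decoding step WITHOUT caching: re-projects the whole prefix."""
    prefix.append(x_t)
    X = np.stack(prefix)                    # (t, d): all tokens so far
    Q = X @ W_q
    K = X @ W_k                             # recomputed for ALL previous tokens
    V = X @ W_v                             # recomputed for ALL previous tokens
    scores = Q @ K.T / np.sqrt(d)           # (t, t): quadratic in t
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ V)[-1]                # only the newest token's output is needed

for _ in range(4):                          # four toy decoding steps
    naive_step(np.random.randn(d))
```

Every call rebuilds `K` and `V` from scratch, even though only the last row is new.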
Complexity Explosion
Let’s look at the cost.
At step t, attention costs:
O(t²)
So total decoding cost becomes:
Σₜ O(t²) = O(T³)
Which means:
- Latency explodes with context length
- Long prompts become impractical
- Real-time chat breaks
This is why early transformers were painfully slow at inference.
The Key Insight: Keys and Values Are Invariant
Here is the beautiful observation:
Once a token is processed, its Key and Value never change.
So instead of recomputing them, we store them.
This is Key-Value caching.
KV Cache: The Mathematical Fix
At each step t, we do:
- Compute Key and Value for the new token:
k_t, v_t
- Append them to cache:
K₁:t = [K₁:t-1, k_t]
V₁:t = [V₁:t-1, v_t]
- Compute attention using cached keys and values:
Attention(q_t, K₁:t, V₁:t)
Now we only compute attention between:
- 1 query vector
- t cached key vectors
So attention becomes linear.
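Here is the same toy loop rewritten with a KV cache, matching the update rule above. Only the new token's key and value are computed at each step; `K_cache` and `V_cache` are illustrative names:

```python
import numpy as np

d = 8
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
K_cache = np.zeros((0, d))                  # grows to (t, d)
V_cache = np.zeros((0, d))                  # grows to (t, d)

def cached_step(x_t):
    """One decoding step WITH a KV cache: one new key, one new value."""
    global K_cache, V_cache
    q_t = x_t @ W_q                         # 1 query for the new token only
    k_t = x_t @ W_k                         # 1 new key
    v_t = x_t @ W_v                         # 1 new value
    K_cache = np.vstack([K_cache, k_t])     # K₁:t = [K₁:t-1, k_t]
    V_cache = np.vstack([V_cache, v_t])     # V₁:t = [V₁:t-1, v_t]
    scores = q_t @ K_cache.T / np.sqrt(d)   # (t,): linear in t, not quadratic
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                # attention output for the new token

for _ in range(4):                          # four toy decoding steps
    cached_step(np.random.randn(d))
```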
Complexity Reduction
Without KV Cache
Per token cost:
O(t²)
Total decoding:
O(T³)
With KV Cache
Per token cost:
O(t)
Total decoding:
O(T²)
This is a massive reduction.
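A quick back-of-the-envelope check, counting only query-key dot products per step for T = 4096 (constant factors ignored):

```python
# Rough count of attention-score computations over T = 4096 decoding steps
T = 4096
without_cache = sum(t * t for t in range(1, T + 1))   # ≈ T³ / 3
with_cache    = sum(t for t in range(1, T + 1))       # ≈ T² / 2

print(f"without KV cache: {without_cache:.2e} score computations")
print(f"with KV cache:    {with_cache:.2e} score computations")
print(f"reduction factor: {without_cache / with_cache:.0f}x")   # roughly 2T/3
```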
This is why modern LLMs can handle:
- Long conversations
- Streaming responses
- Real-time chat
Memory Tradeoff
KV cache trades compute for memory.
Each token stores:
- its Key vector
- its Value vector
So memory grows linearly with context length:
Memory ∝ T
This is why:
- Long context models need large GPUs
- Paged KV cache exists
- KV quantization is popular
KV Cache Memory Analysis
KV caching trades extra memory for faster computation during autoregressive inference.
Each new token contributes one key and one value vector per attention head per layer.
Notation
| Symbol | Meaning |
|---|---|
| `B` | Batch size (number of sequences processed together) |
| `N` | Sequence length (context + generated tokens) |
| `L` | Number of transformer layers |
| `H` | Number of attention heads per layer |
| `d_head` | Dimension of each attention head |
| `d_model` | Hidden size of the model, typically `d_model = H × d_head` |
| `b` | Bits per element (e.g., 16 for FP16, 4 for int4) |
KV Cache Formula
For a single batch element, each token requires:
numbers_per_token = 2 × L × d_model
For a full sequence of length N, the total number of floats stored in KV cache is:
total_numbers = B × N × 2 × L × d_model
Converting to bytes, for b bits per number:
KV_cache_bytes = B × N × 2 × L × d_model × (b / 8)
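A small helper that mirrors this formula; the function name is illustrative, and the numbers plugged in match the worked example below:

```python
def kv_cache_bytes(B, N, L, d_model, bits):
    """KV_cache_bytes = B × N × 2 × L × d_model × (bits / 8)."""
    return B * N * 2 * L * d_model * bits / 8

# 7B-class model: L = 32, d_model = 4096, N = 4096, B = 1
fp16 = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=16)
int4 = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=4)
print(f"FP16 KV cache:  {fp16 / 2**30:.2f} GB")   # ≈ 2.00 GB
print(f"4-bit KV cache: {int4 / 2**30:.2f} GB")   # ≈ 0.50 GB
```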
Example: 7B Model (FP16)
Consider a 7B-parameter model with:
- `L = 32` layers
- `d_model = 4096` hidden size
- Sequence length `N = 4096`
- Batch size `B = 1`
- FP16 → `b = 16` bits (2 bytes per float)
KV_cache_bytes = 1 × 4096 × 2 × 32 × 4096 × 2
= 2,147,483,648 bytes
≈ 2 GB
This shows how KV cache grows linearly with layers L, sequence length N, and batch B — which can become a memory bottleneck for large models.
Optimization: 4-bit KV Cache Quantization
If we store keys and values in 4-bit precision instead of FP16:
KV_cache_4bit = 2 GB × (4 / 16) = 0.5 GB
This dramatically reduces memory, enabling:
- Longer context windows
- Larger batch sizes
- GPU-friendly inference
Quick Reference Table
| Model | Seq Len | Layers | Hidden | Precision | KV Memory |
|---|---|---|---|---|---|
| 7B | 4096 | 32 | 4096 | FP16 | 2 GB |
| 7B | 4096 | 32 | 4096 | 4-bit | 0.5 GB |
Sources: Hugging Face Transformers documentation & blog (on KV caching and cache quantization), and NVIDIA Developer blog on LLM inference memory formulas.
Final Intuition
KV cache turns transformers from slow batch models into fast streaming systems.
It is the difference between waiting minutes for a response and seeing tokens appear instantly.
Connect with me:
LinkedIn: https://www.linkedin.com/in/vaibhav-ahluwalia-83887a227/


