Diagram of self‑attention in transformers: inputs are transformed into Q (queries), K (keys), and V (values), and attention weights are computed using scaled dot products.
(Source: ResearchGate)
Autoregressive decoding in transformers is computationally expensive due to repeated self-attention over an ever-growing context. A naïve implementation recomputes attention over all previous tokens at every decoding step, leading to quadratic complexity per token. Transformer Key-Value (KV) caching eliminates this redundancy by storing attention keys and values from previous steps and reusing them during decoding.
If you missed it, in Part 1 of this series, we explored Exact-Match and Semantic Caching for LLMs, showing how we can reuse previous responses to reduce latency and token cost. In this post, we move to KV caching, a complementary technique that optimizes autoregressive decoding itself.
The Autoregressive Problem
Language models generate text one token at a time.
Mathematically, a model learns:
p(x₁, x₂, …, x_T) = ∏ₜ p(x_t | x₁, …, x_{t-1})
Which means:
Each new token depends on all previous tokens.
So when generating token x_t, the model must look at the entire prefix:
x₁, x₂, …, x_{t-1}
This is where attention comes in.
The Attention Equation (Core of Transformers)
Inside every transformer layer, attention is computed as:
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
Where:
- Q = Query vectors
- K = Key vectors
- V = Value vectors
- d = dimension of each attention head
For a sequence of length t, we compute:
Q ∈ R^(t × d)
K ∈ R^(t × d)
V ∈ R^(t × d)
So attention builds a t × t similarity matrix.
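To make the shapes concrete, here is a minimal NumPy sketch of this equation for a single attention head. The function name and the toy sizes (t = 5, d = 8) are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q Kᵀ / √d) · V for one attention head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (t, t) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (t, d) output

# Toy example: a sequence of t = 5 tokens, head dimension d = 8
t, d = 5, 8
Q, K, V = np.random.randn(t, d), np.random.randn(t, d), np.random.randn(t, d)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 8)
```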
The Hidden Inefficiency
During generation, tokens arrive one by one:
x₁ → x₂ → x₃ → ... → x_t
At step t, the model computes attention over:
x₁, x₂, ..., x_t
But notice something important:
The Key and Value vectors of previous tokens never change.
Once token x₁ is processed:
- its Key = fixed
- its Value = fixed
Yet naïve decoding recomputes them again and again.
So at step t, the model recomputes:
K₁, K₂, ..., K_{t-1}
V₁, V₂, ..., V_{t-1}
even though they were already computed at step t-1.
This is pure wasted compute.
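To see the waste in code, here is a deliberately naive decoding loop. The projection matrices `W_q`, `W_k`, `W_v` and the toy dimension are illustrative stand-ins for one attention head, not any real model's weights:

```python
import numpy as np

d = 8
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
prefix = []   # hidden states of every token seen so far

def naive_step(x_t):
    """One decoding step WITHOUT caching: re-projects the whole prefix."""
    prefix.append(x_t)
    X = np.stack(prefix)                    # (t, d): all tokens so far
    Q = X @ W_q
    K = X @ W_k                             # recomputed for ALL previous tokens
    V = X @ W_v                             # recomputed for ALL previous tokens
    scores = Q @ K.T / np.sqrt(d)           # (t, t): quadratic in t
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ V)[-1]                # only the newest token's output is needed

for _ in range(4):                          # four toy decoding steps
    naive_step(np.random.randn(d))
```

Every call rebuilds `K` and `V` from scratch, even though only the last row is new.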
Complexity Explosion
Let’s look at the cost.
At step t, attention costs:
O(t²)
So total decoding cost becomes:
Σₜ O(t²) = O(T³)
Which means:
- Latency explodes with context length
- Long prompts become impractical
- Real-time chat breaks
This is why early transformers were painfully slow at inference.
The Key Insight: Keys and Values Are Invariant
Here is the beautiful observation:
Once a token is processed, its Key and Value never change.
So instead of recomputing them, we store them.
This is Key-Value caching.
KV Cache: The Mathematical Fix
At each step t, we do:
- Compute Key and Value for the new token:
k_t, v_t
- Append them to cache:
K₁:t = [K₁:t-1, k_t]
V₁:t = [V₁:t-1, v_t]
- Compute attention using cached keys and values:
Attention(q_t, K₁:t, V₁:t)
Now we only compute attention between:
- 1 query vector
- t cached key vectors
So attention becomes linear.
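Here is the same toy loop rewritten with a KV cache, matching the update rule above. Only the new token's key and value are computed at each step; `K_cache` and `V_cache` are illustrative names:

```python
import numpy as np

d = 8
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
K_cache = np.zeros((0, d))                  # grows to (t, d)
V_cache = np.zeros((0, d))                  # grows to (t, d)

def cached_step(x_t):
    """One decoding step WITH a KV cache: one new key, one new value."""
    global K_cache, V_cache
    q_t = x_t @ W_q                         # 1 query for the new token only
    k_t = x_t @ W_k                         # 1 new key
    v_t = x_t @ W_v                         # 1 new value
    K_cache = np.vstack([K_cache, k_t])     # K₁:t = [K₁:t-1, k_t]
    V_cache = np.vstack([V_cache, v_t])     # V₁:t = [V₁:t-1, v_t]
    scores = q_t @ K_cache.T / np.sqrt(d)   # (t,): linear in t, not quadratic
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                # attention output for the new token

for _ in range(4):                          # four toy decoding steps
    cached_step(np.random.randn(d))
```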
Complexity Reduction
Without KV Cache
Per token cost:
O(t²)
Total decoding:
O(T³)
With KV Cache
Per token cost:
O(t)
Total decoding:
O(T²)
This is a massive reduction.
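A quick back-of-the-envelope check, counting only query-key dot products per step for T = 4096 (constant factors ignored):

```python
# Rough count of attention-score computations over T = 4096 decoding steps
T = 4096
without_cache = sum(t * t for t in range(1, T + 1))   # ≈ T³ / 3
with_cache    = sum(t for t in range(1, T + 1))       # ≈ T² / 2

print(f"without KV cache: {without_cache:.2e} score computations")
print(f"with KV cache:    {with_cache:.2e} score computations")
print(f"reduction factor: {without_cache / with_cache:.0f}x")   # roughly 2T/3
```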
This is why modern LLMs can handle:
- Long conversations
- Streaming responses
- Real-time chat
Memory Tradeoff
KV cache trades compute for memory.
Each token stores:
- its Key vector
- its Value vector
So memory grows linearly with context length:
Memory ∝ T
This is why:
- Long context models need large GPUs
- Paged KV cache exists
- KV quantization is popular
KV Cache Memory Analysis
KV caching trades extra memory for faster computation during autoregressive inference.
Each new token contributes one key and one value vector per attention head per layer.
Notation
| Symbol | Meaning |
|---|---|
| `B` | Batch size (number of sequences processed together) |
| `N` | Sequence length (context + generated tokens) |
| `L` | Number of transformer layers |
| `H` | Number of attention heads per layer |
| `d_head` | Dimension of each attention head |
| `d_model` | Hidden size of the model, typically `d_model = H × d_head` |
| `b` | Bits per element (e.g., 16 for FP16, 4 for int4) |
KV Cache Formula
For a single batch element, each token requires:
numbers_per_token = 2 × L × d_model
For a full sequence of length N, the total number of floats stored in KV cache is:
total_numbers = B × N × 2 × L × d_model
Converting to bytes, for b bits per number:
KV_cache_bytes = B × N × 2 × L × d_model × (b / 8)
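A small helper that mirrors this formula; the function name is illustrative, and the numbers plugged in match the worked example below:

```python
def kv_cache_bytes(B, N, L, d_model, bits):
    """KV_cache_bytes = B × N × 2 × L × d_model × (bits / 8)."""
    return B * N * 2 * L * d_model * bits / 8

# 7B-class model: L = 32, d_model = 4096, N = 4096, B = 1
fp16 = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=16)
int4 = kv_cache_bytes(B=1, N=4096, L=32, d_model=4096, bits=4)
print(f"FP16 KV cache:  {fp16 / 2**30:.2f} GB")   # ≈ 2.00 GB
print(f"4-bit KV cache: {int4 / 2**30:.2f} GB")   # ≈ 0.50 GB
```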
Example: 7B Model (FP16)
Consider a 7B-parameter model with:
- `L = 32` layers
- `d_model = 4096` hidden size
- Sequence length `N = 4096`
- Batch size `B = 1`
- FP16 → `b = 16` bits (2 bytes per float)
KV_cache_bytes = 1 × 4096 × 2 × 32 × 4096 × 2
= 2,147,483,648 bytes
≈ 2 GB
This shows how KV cache grows linearly with layers L, sequence length N, and batch B — which can become a memory bottleneck for large models.
Optimization: 4-bit KV Cache Quantization
If we store keys and values in 4-bit precision instead of FP16:
KV_cache_4bit = 2 GB × (4 / 16) = 0.5 GB
This dramatically reduces memory, enabling:
- Longer context windows
- Larger batch sizes
- GPU-friendly inference
Quick Reference Table
| Model | Seq Len | Layers | Hidden | Precision | KV Memory |
|---|---|---|---|---|---|
| 7B | 4096 | 32 | 4096 | FP16 | 2 GB |
| 7B | 4096 | 32 | 4096 | 4-bit | 0.5 GB |
Sources: Hugging Face Transformers documentation & blog (on KV caching and cache quantization), and NVIDIA Developer blog on LLM inference memory formulas.
Final Intuition
KV cache turns transformers from slow batch models into fast streaming systems.
It is the difference between waiting minutes for a response and seeing tokens appear instantly.
Connect with me:
LinkedIn: https://www.linkedin.com/in/vaibhav-ahluwalia-83887a227/


