The O(N²) Killer: How the KV Cache Supercharges LLM Inference
⁉️ Introduction — What Is the Key-Value Cache and Why Do We Need It?
📜 My Journey into the LLM Landscape
While I don’t hail from a traditional background in data science or deep learning research, my immersion into the fascinating world of AI and Generative AI over the last two to three years has driven me to approach concepts like the KV Cache from a pragmatic perspective. My primary method for building a solid understanding involves a combination of relatable explanations from blogs and technical books, and perhaps most crucially, working through sample code. This approach allows me to translate complex mathematical contexts into tangible, functional components, ensuring I grasp not just what an optimization does, but how it fundamentally changes the execution flow.

Source: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
The KV Cache (Key-Value Cache) is a crucial optimization technique used in Transformer-based models, especially Large Language Models (LLMs), to accelerate inference (text generation).
It is a dedicated memory space used to store the intermediate Key (K) and Value (V) vectors that are calculated during the self-attention mechanism for tokens that have already been processed.
🚀 Why KV Cache is Necessary
Transformer models, like the GPT family, generate text in an autoregressive fashion, meaning they predict one token at a time based on all the preceding tokens.
Without the KV Cache, the model would be highly inefficient because at each new step:
- The model must re-process the entire sequence of previous tokens (the input prompt plus all generated tokens so far) to re-calculate their Key (K) and Value (V) vectors.
- This repetitive calculation causes the computational cost to grow quadratically (O(n²)) with the sequence length n, making long-sequence generation extremely slow and computationally expensive (the quick counting sketch below makes this concrete).
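To make this quadratic growth concrete, here is a tiny counting sketch (purely my own illustration: it treats one K/V computation per token as the unit of work, and the function name is hypothetical):
def kv_computations(n, use_cache):
    """Count how many K/V vectors must be computed to generate n tokens."""
    total = 0
    for step in range(1, n + 1):
        # Without a cache, every token in the history is re-processed at each step;
        # with a cache, only the newest token's K and V are computed.
        total += 1 if use_cache else step
    return total

for n in (10, 100, 1000):
    print(n, kv_computations(n, use_cache=False), kv_computations(n, use_cache=True))
# 10 55 10          -> roughly n^2/2 without the cache vs n with it
# 100 5050 100
# 1000 500500 1000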
🧠 How KV Cache Works
The KV Cache addresses this inefficiency by introducing a form of short-term memory:
1️⃣ First Token / Prefill Phase: When the model first processes the input prompt, it computes the Query (Q), Key (K), and Value (V) vectors for all input tokens. The K and V vectors are stored in the KV Cache.
2️⃣ Subsequent Tokens / Decode Phase: When the model generates the next token:
- It only computes the Query (Q) vector for the new token.
- It re-uses the previously computed K and V vectors directly from the KV Cache, instead of recomputing them from scratch.
- The new token’s own K and V vectors are then calculated and appended to the cache.
3️⃣ Result: This significantly reduces redundant computation. The attention work for each new token now scales linearly (O(n)) with the sequence length, leading to dramatically faster response times and improved efficiency. (A minimal code sketch of this two-phase prefill/decode flow follows below.)
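Here is a minimal sketch of the prefill/decode split using the Hugging Face transformers library (an assumption on my side: it requires the torch and transformers packages and the small "gpt2" checkpoint, and the exact type returned for the cache varies across library versions):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("Hello, world", return_tensors="pt")

with torch.no_grad():
    # Prefill phase: process the whole prompt once; K and V for every prompt
    # token are computed and handed back as the cache.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values

    # Decode phase: pick the next token and feed ONLY that token plus the cache.
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    out = model(input_ids=next_id, past_key_values=past, use_cache=True)
    print(tokenizer.decode(next_id[0]))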
⚖️ Trade-off
The main trade-off of using a KV Cache is that it requires more GPU memory (VRAM) to store the cached K and V vectors, which can become the dominant factor in memory consumption for very long sequences or large batch sizes.
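To get a feel for the memory side, here is a rough back-of-the-envelope sketch (my own numbers, assuming a Llama-2-7B-like configuration: 32 layers, 32 KV heads, head dimension 128, and an fp16 cache at 2 bytes per value):
def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    # The factor 2 accounts for storing both K and V per layer, head, and token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

for seq_len in (1024, 4096, 32768):
    gib = kv_cache_bytes(seq_len) / (1024 ** 3)
    print(f"{seq_len} tokens -> {gib:.1f} GiB of KV cache per sequence")
# 1024 tokens -> 0.5 GiB, 4096 tokens -> 2.0 GiB, 32768 tokens -> 16.0 GiB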

Source: https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms (Sebastian Raschka, PhD)
Explanation Through Sample Conceptual Code
- As mentioned above, I often need conceptual code to understand the “stuff” I learn. So let’s imagine a simple function that simulates the attention step for token generation.
KV_CACHE = {
    'keys': [],    # stores the K vector of every token processed so far
    'values': []   # stores the V vector of every token processed so far
}

def generate_next_token(new_token, sequence_so_far):
    """
    Simulates the attention step for a new token, using and updating the KV Cache.
    Note: sequence_so_far is intentionally unused -- the cache makes re-processing
    the history unnecessary.
    """
    print(f"\n--- Processing Token: '{new_token}' ---")
    Q_new = f"Q_vec({new_token})"
    print(f"1. Computed Query (Q) for new token: {Q_new}")

    # Compute K and V for the *new* token only
    K_new = f"K_vec({new_token})"
    V_new = f"V_vec({new_token})"
    print(f"2. Computed Key (K) and Value (V) for new token: {K_new}, {V_new}")

    # The optimization: reuse the cached K/V vectors instead of recomputing them
    K_full = KV_CACHE['keys'] + [K_new]      # cached K + new K
    V_full = KV_CACHE['values'] + [V_new]    # cached V + new V
    print(f"3. Full Attention Keys (Cached + New): {K_full}")

    Attention_Output = f"Result of Attention({Q_new}, {K_full}, {V_full})"
    print(f"4. Attention Calculation: {Attention_Output}")

    # Append the new token's K and V to the cache for the next step
    KV_CACHE['keys'].append(K_new)
    KV_CACHE['values'].append(V_new)
    print(f"5. KV Cache Updated. Current size: {len(KV_CACHE['keys'])} tokens.")
    return "Predicted_Token"

# Initial prompt
print("=== Initial Prompt Phase: 'Hello, world' ===")
prompt_tokens = ["Hello,", "world"]

# Token #1: "Hello,"
generate_next_token(prompt_tokens[0], [])

# Token #2: "world"
generate_next_token(prompt_tokens[1], prompt_tokens[:1])

# Next token generation...
print("\n=== Generation Phase: Predicting the 3rd token ===")
next_token = "(Model predicts 'how')"
generate_next_token(next_token, prompt_tokens)
🔑 Key Takeaways from the Code
- Cache Utilization: When processing a new token (e.g., the 3rd token), the model uses KV_CACHE['keys'] and KV_CACHE['values'] which already contain the vectors for all previous tokens.
- Minimal Computation: The model only computes Q_new, K_new, and V_new for the single, newest token.
- Efficiency: Without the cache, K and V would have to be re-computed for every token in the entire history at every single step; the cache avoids this redundant work (a cache-free version of the same step is sketched below for contrast).
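For contrast, here is what the same toy step looks like without a cache (a hypothetical helper in the same style as above): every call rebuilds K and V for the entire history.
def generate_next_token_no_cache(new_token, sequence_so_far):
    """Toy cache-free attention step: K and V are rebuilt for the whole history."""
    full_sequence = sequence_so_far + [new_token]
    # O(n) recomputation at every step -> O(n^2) over a whole generation.
    K_full = [f"K_vec({tok})" for tok in full_sequence]
    V_full = [f"V_vec({tok})" for tok in full_sequence]
    Q_new = f"Q_vec({new_token})"
    print(f"Recomputed {len(K_full)} K/V pairs for: {full_sequence}")
    return f"Result of Attention({Q_new}, {K_full}, {V_full})"

generate_next_token_no_cache("(Model predicts 'how')", ["Hello,", "world"])  # recomputes 3 K/V pairs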
Another piece of conceptual code: a PyTorch-based simulation is a good way to understand the K and V concatenation logic (sources in the link section). This simple conceptual class clearly shows the growing cache (self.cache_k/self.cache_v) and the constant Q size, which is the core principle of the KV Cache optimization.
Instead of using a full, production-ready model like those served by Ollama, you would implement the Multi-Head Attention (MHA) layer from scratch using a framework like PyTorch (which is common for LLM development). This allows you to directly control the K, V, and Q tensors and implement the caching logic yourself.
- We can create a custom attention class (CustomAttentionWithCache below) that explicitly holds the cache tensors (self.cache_k / self.cache_v). Printing the tensor shapes at each step visibly shows:
- The Query (Q) tensor’s length is always 1 (representing the new token).
- The Key (K) and Value (V) cache tensors’ lengths grow linearly (representing all past tokens plus the new one).
- The model performs the full attention operation using the small Q and the large, concatenated K/V cache.
import torch

class CustomAttentionWithCache(torch.nn.Module):
    def __init__(self, head_dim):
        super().__init__()
        self.head_dim = head_dim
        self.cache_k = None  # grows by one entry per processed token
        self.cache_v = None

    def forward(self, q_new, k_new, v_new):
        # Append the new token's K and V to the cache (or initialize it)
        if self.cache_k is None:
            self.cache_k = k_new
            self.cache_v = v_new
        else:
            self.cache_k = torch.cat([self.cache_k, k_new], dim=1)
            self.cache_v = torch.cat([self.cache_v, v_new], dim=1)

        # Scaled dot-product attention: small Q (1 token) against the full cached K
        scores = torch.matmul(q_new, self.cache_k.transpose(1, 2)) / (self.head_dim ** 0.5)
        weights = torch.nn.functional.softmax(scores, dim=-1)

        # Weighted sum over the cached V
        output = torch.matmul(weights, self.cache_v)
        return output, self.cache_k.shape[1]  # also return the current cache length

# One token (B=1, T=1, D=64) - q, k, v are the computed vectors for the newest token
q1, k1, v1 = torch.rand(1, 1, 64), torch.rand(1, 1, 64), torch.rand(1, 1, 64)
attn = CustomAttentionWithCache(head_dim=64)
output1, seq_len1 = attn(q1, k1, v1)
print(f"Token 1: Cache Length = {seq_len1}")  # Output: 1

# Second token
q2, k2, v2 = torch.rand(1, 1, 64), torch.rand(1, 1, 64), torch.rand(1, 1, 64)
output2, seq_len2 = attn(q2, k2, v2)
print(f"Token 2: Cache Length = {seq_len2}")  # Output: 2
Conclusion
To wrap up, understanding the KV Cache moves us past treating Large Language Models as black boxes and reveals one of the most critical optimizations driving real-world Generative AI performance. We’ve seen that the core idea is simple yet powerful: by caching the Key (K) and Value (V) vectors computed for previous tokens, we prevent the model from performing redundant, computationally expensive O(N²) recalculations at every step. This shifts the decoding phase to efficient linear O(N) growth per generated token, making long-sequence generation feasible and fast. Crucially, I recognize I’m still very much on a learning curve (despite my age 😉) with many of these advanced concepts, but that’s precisely why I love translating these ideas — from the quadratic problem to the conceptual PyTorch solution — and sharing my learnings with peers. This collaborative approach not only aids my own comprehension but helps build a deeper, collective understanding of the LLM landscape.
Links and References
- Transformers KV Caching Explained: http://medium.com/@joaolages/kv-caching-explained-276520203249 (João Lages)
- Transformers Key-Value Caching Explained: https://neptune.ai/blog/transformers-key-value-caching (Michał Oleszak)
- KV Caching Explained - Optimizing Transformer Inference Efficiency: https://huggingface.co/blog/not-lain/kv-caching
- The KV Cache-Memory Usage in Transformers: https://www.youtube.com/watch?v=80bIUggRJf4
- Understanding the KV-Cache In LLMs: https://medium.com/data-science-collective/understanding-the-kv-cache-in-llms-822446560161 (Dr. Leon Eversberg)
- KV Caching in LLM Inference A Comprehensive Review: https://www.rohan-paul.com/p/kv-caching-in-llm-inference-a-comprehensive (Rohan’s Bytes)
- KV Caching Illustrated: https://www.kapilsharma.dev/posts/kv-caching-visualized (Kapil Sharma)
- Unlocking the Power of KV Cache: How to Speed Up LLM Inference and Cut Costs (Part 1): https://datasciencedojo.com/blog/kv-cache-how-to-speed-up-llm-inference
- Key-Value (KV) Caching: https://apxml.com/courses/how-to-build-a-large-language-model/chapter-28-efficient-inference-strategies/key-value-kv-caching
- KV Cache 101 - How Large Language Models Remember and Reuse Information: https://turbonext.ai/kv-cache-101-how-large-language-models-remember-and-reuse-information (Kamal Sekhon)
- Understanding and Coding the KV Cache in LLMs from Scratch: https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms (Sebastian Raschka, PhD)
- Transformers Optimization — Part 1 — KV Cache: https://r4j4n.github.io/blogs/posts/kv/ (Rajan Ghimire)
- The Secret Behind Fast LLM Inference — Unlocking the KV Cache: https://pub.towardsai.net/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d (Burak Degirmencioglu)
- Mastering LLM Techniques: Inference Optimization: https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
