Jun Bae

KV Cache and Prompt Caching: How to Leverage them to Cut Time and Costs

Introduction

A Problem of LLM Inference

In the transformer architecture, the model calculates the K and V matrices using weight matrices W. When an input vector x_0 enters the model, it is first multiplied by the W_q, W_k, and W_v matrices, yielding the q_0, k_0, and v_0 vectors. As you repeat this process and stack the k and v vectors, they form the K and V matrices.

Assume you have successfully generated an output token after completing the transformer pass, and that token is x_1. Here lies the problem: for the next inference step, the model must calculate not only k_1 and v_1 but also k_0 and v_0 again. Because the attention score is calculated as q_1 K^T V, it requires the k and v vectors from all previous inputs. This results in redundant computation every time a new token is generated, and as the number of input tokens grows, these recomputations become significantly time-consuming.


What is KV cache exactly?

If you understand the problem, you might have already thought of the solution. Yes, the solution is to store the previous k and v vectors in what is known as a “cache”. It is a relatively straightforward concept.

A Simple example of KV cache

Let’s continue with the example from the introduction. When generating the token after x_1, what needs to be computed? First, we calculate q_1 = x_1 W_q. Then we need the K and V matrices. Let’s look at this step by step.

  1. We need to form K from the vectors k_0 and k_1.

  2. This requires computing k_0 = x_0 W_k and k_1 = x_1 W_k. Notice that we have just recomputed k_0.

  3. Similarly, we need V, composed of the vectors v_0 and v_1.

  4. Following the same logic as step 2, we calculate v_0 = x_0 W_v and v_1 = x_1 W_v.

  5. Finally, with the K and V matrices ready, we can compute q_1 K^T V.

As you can see, the initial computation for x_0 required three vector calculations (q_0, k_0, v_0). For the next token, that number jumps to five (k_0, v_0, q_1, k_1, v_1). It grows to seven for the third token and nine for the fourth. If your input is 1,000 tokens long, you would have to perform 2,001 vector computations just for the last token. This is computationally expensive and highly inefficient.

This is where the "cache" comes in. If we store the entire K and V matrices after each step, we no longer need to recompute k_0 and v_0. The number of computations remains constant at three per token (q_n, k_n, v_n), regardless of how many tokens have already been processed.
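The idea above can be sketched in a few lines of NumPy. This is a toy single-head decoder step with random weights, not a real model (all names here are illustrative); the point is that each new token costs exactly three projections, while the K and V caches simply grow by one row:

```python
import numpy as np

d = 8                                     # toy embedding size
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))                # cached k vectors, one row per token
V_cache = np.empty((0, d))                # cached v vectors

def decode_step(x):
    """One attention step: only three vector products, thanks to the cache."""
    global K_cache, V_cache
    q, k, v = x @ W_q, x @ W_k, x @ W_v   # only q_n, k_n, v_n are computed
    K_cache = np.vstack([K_cache, k])     # append instead of recomputing
    V_cache = np.vstack([V_cache, v])
    scores = q @ K_cache.T / np.sqrt(d)   # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache              # attention output

for t in range(4):                        # the cache grows by one row per token
    out = decode_step(rng.normal(size=d))

print(K_cache.shape)  # (4, 8)
```

Without the cache, `decode_step` would have to rebuild K and V from every past input on every call, which is exactly where the 2n+1 computations per token came from.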

  • Note: This explanation assumes a decoder-only architecture. You might wonder why we don't cache the Q matrix in an encoder; however, since most modern LLMs are decoder-only, that isn't a concern here. Additionally, "hitting the cache" requires both the token and its position to be identical due to positional embeddings.

How much faster is it?

Various engineering blogs and papers (e.g., from NVIDIA, Hugging Face, and Databricks) have quantified the benefits of KV caching.

Benchmark: Generating 1,000 Tokens (Llama-2-7B)

  • With KV Cache: ~20–30 seconds (consistent speed per token).

  • Without KV Cache: > 2–3 minutes (each token takes progressively longer than the last).

As the number of tokens grows, the gap widens significantly. Given that some SOTA models now support context lengths of 1M tokens, this type of cache engineering has become indispensable.

How to leverage KV cache when serving models?

In most cases, you don’t need to worry about complex configurations. Since the KV cache has become the standard serving method, libraries like vLLM and Hugging Face Transformers activate it automatically by default, provided you have enough memory.

In Transformers, the use_cache parameter in model.generation_config controls this behavior; its default value is True.
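As a quick check, you can inspect and override the flag yourself. A minimal sketch using a tiny, randomly initialized GPT-2 (sized down so it runs instantly with no downloads; the shapes are arbitrary):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# A tiny random GPT-2 so the example runs without fetching real weights.
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100)
model = GPT2LMHeadModel(config)

print(model.generation_config.use_cache)  # True by default

input_ids = torch.tensor([[1, 2, 3, 4]])
# use_cache can also be overridden per call; use_cache=False forces the
# model to recompute every past key/value at each decoding step.
out = model.generate(input_ids, max_new_tokens=8, do_sample=False, use_cache=True)
print(out.shape)  # 4 prompt tokens + 8 new tokens
```

Setting `use_cache=False` on a real model makes per-token latency grow with context length, which is an easy way to observe the effect described above.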

In vLLM, the engine allocates whatever GPU memory remains after the model weights are loaded, up to the fraction set by gpu_memory_utilization, to the KV cache.
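For example (the model name is illustrative, and this sketch needs a CUDA GPU with vLLM installed to actually run):

```python
from vllm import LLM, SamplingParams

# Allow vLLM to use up to 90% of GPU memory; whatever is left after the
# weights are loaded becomes KV-cache blocks managed by PagedAttention.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.9)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

Raising `gpu_memory_utilization` leaves more room for KV-cache blocks and therefore more concurrent sequences, at the cost of headroom for other processes on the GPU.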

vLLM is particularly well known for its sophisticated cache engineering techniques, such as PagedAttention. Because of this, vLLM offers massive throughput with almost no memory fragmentation, making it a top-tier choice for efficient LLM serving.


The Problem with KV Caching

Memory Consumption

The memory required for the KV cache increases linearly with sequence length and batch size.

KV Cache Memory Formula:

Memory_KV = 2 × Batch × SeqLen × Layers × HiddenDim × Precision

  • 2: One copy for keys and one for values.

  • Batch: Number of concurrent users being served.

  • SeqLen: The length of the context/sequence.

  • Layers: Number of transformer layers; each layer keeps its own K and V.

  • HiddenDim: The hidden dimension of the K/V projections in each layer.

  • Precision: Bytes per value (e.g., 2 bytes for FP16, 1 byte for FP8).

Let me give you a simple example of how this grows with a typical LLM.

Example: Llama-3-70B (FP16)

  • Model Weights: about 140GB

  • KV Cache (1 user, 1k tokens): ~0.3GB

  • KV Cache (1 user, 128k tokens): ~40GB

  • KV Cache (64 users, 16k tokens): ~310GB

  • KV Cache (64 users, 128k tokens): ~2,560GB

As you can see, the KV cache footprint explodes as you increase context length or batch size. Even with a 1 GB allocation per user, you might need more than one H100 GPU just to serve 100 concurrent users.
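You can reproduce these figures directly from the formula. A quick sanity check in Python, assuming Llama-3-70B's 80 layers and, importantly, its grouped-query KV dimension of 1,024 (8 KV heads × 128 head dim; GQA shrinks the effective HiddenDim well below the full 8,192 hidden size):

```python
def kv_cache_gib(batch, seq_len, layers=80, kv_dim=1024, bytes_per_elem=2):
    """Memory_KV = 2 x Batch x SeqLen x Layers x HiddenDim x Precision."""
    return 2 * batch * seq_len * layers * kv_dim * bytes_per_elem / 2**30

print(f"{kv_cache_gib(1, 1_000):.2f} GiB")     # 1 user, 1k tokens   -> ~0.31
print(f"{kv_cache_gib(1, 128_000):.1f} GiB")   # 1 user, 128k tokens -> ~39.1
print(f"{kv_cache_gib(64, 16_000):.0f} GiB")   # 64 users, 16k       -> ~312
print(f"{kv_cache_gib(64, 128_000):.0f} GiB")  # 64 users, 128k      -> ~2500
```

These land close to the bullet-point figures; the small gaps come from rounding conventions (GB vs. GiB, 128k taken as 128,000 vs. 131,072).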

Beyond memory capacity, this also impacts latency. While GPUs are exceptionally fast at mathematics, they are limited by memory bandwidth. The GPU must constantly fetch the cached data from VRAM to the chip to generate each new token. This is why you may see high latency even when GPU compute utilization is only at 10%—the memory bandwidth has become the bottleneck.
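A back-of-the-envelope estimate shows the effect. In the memory-bound decode regime, every generated token must stream the model weights plus the entire KV cache from VRAM; assuming roughly 3.35 TB/s of HBM bandwidth (an H100-class figure) and the 70B example above:

```python
weights_gb = 140        # Llama-3-70B in FP16
kv_cache_gb = 40        # one user at a 128k-token context
bandwidth_gb_s = 3350   # ~H100 SXM HBM3 bandwidth (assumed)

# Lower bound on decode latency: bytes moved per token / bandwidth.
ms_per_token = (weights_gb + kv_cache_gb) / bandwidth_gb_s * 1000
print(f"{ms_per_token:.1f} ms/token")  # ~53.7 ms, i.e. under 19 tokens/s
```

The compute units sit mostly idle during this transfer, which is why low GPU utilization and high latency can coexist.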

There are emerging solutions to these problems, such as Grouped-Query Attention (GQA), PagedAttention, and quantization. However, I don’t think these are enough; the KV cache is genuinely hindering the scalability of LLMs today. If LLMs eventually hit a "dead end," as Yann LeCun has suggested, I believe the inefficiencies of the KV cache may be a contributing factor. It has become so indispensable that developers cannot afford to abandon it, yet if a new model is too massive to leverage the KV cache with current computing resources, it may be useless in the real world, since consumers will not tolerate prohibitively slow inference speeds.

What was once a solution is now a massive problem. So ironic. We’ll have to see how big tech companies tackle this tricky challenge in the future. Isn’t it exciting?


Prompt (Context) caching

This is another caching method in LLM engineering. While different from standard KV caching, it significantly improves efficiency and reduces computation time. Furthermore, if you use LLM services via API, such as Gemini or OpenAI, this method can save you both time and money.

What is Prompt Caching?

Standard KV caching typically occurs within a single turn. Once a turn ends, the cache is cleared and the process restarts from scratch for the next turn.

Prompt caching extends this concept across multiple turns. After one turn is completed, the server saves the K and V matrices and retains them for a period of time. When the next turn begins, the model can reuse those exact same matrices. Let me give you a simple example.

Example: A Multi-Turn Conversation

First turn

  • System prompt: Answer user’s question.

  • User prompt: Hi

  • Assistant prompt: Hi! How can I help you?

Second turn

  • User prompt: Who are you?

  • Assistant prompt:

In the first turn, the input tokens consist of (Answer), (user’s), (question), (Hi) (I knowingly omit the ‘\n’, spacing, BOS/EOS, and role-start tokens for convenience). The model will compute the K and V matrices for all of these tokens and begin generating a response. After it finishes generating the entire assistant message, it will also have stacked the k and v vectors for (Hi!), (How), (can), (I), (help), (you?) onto those matrices.

Then, in a multi-turn structure, what happens during the second turn? As you might expect, the model reprocesses the k and v vectors for the previous context: (Answer), (user’s), (question), (Hi) and (Hi!), (How), (can), (I), (help), (you?). It then computes the vectors for the new user prompt and begins generating an answer.

Of course, the model could rebuild a KV cache from scratch, but that would be highly inefficient: it would spend time performing exactly the same operations as before. If we instead cache the K and V matrices from the previous turn, we avoid recomputing the k and v vectors for the initial instructions and the first exchange. By retrieving these vectors from memory, the model can begin processing immediately from the new prompt, “Who are you?”. OpenAI’s prompt-caching documentation illustrates this with a diagram.

As that diagram shows, the token positions must remain exactly the same. If even a single word is inserted, the cache is invalidated from that point forward because the positional encodings shift. Therefore, to leverage prompt caching effectively, keep your token sequences as consistent as possible.
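To see whether a request actually hit the cache, the OpenAI API reports cached prompt tokens in the response's usage object. A minimal sketch (the prompt content and model name are illustrative, it needs a valid API key, and caching only triggers once the prompt exceeds 1,024 tokens):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Keep the long, stable prefix (system prompt, tools, few-shot examples)
# first, so repeated calls share the same cached prefix.
long_system_prompt = "You are a helpful assistant. " * 200  # >1,024 tokens

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Who are you?"},
    ],
)

# On a cache hit, cached_tokens > 0 and those tokens are billed at a discount.
print(resp.usage.prompt_tokens_details.cached_tokens)
```

The first call typically reports 0 cached tokens; an identical-prefix call made shortly afterward should report a nonzero count.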

  • Note: Currently, the terminology for this technique has not been standardized. OpenAI calls it ‘Prompt Caching,’ vLLM uses ‘Prefix Caching,’ and Google refers to it as ‘Context Caching’. They all refer to the same concept. Unfortunately, this kind of naming inconsistency happens occasionally in the AI field. :(

With this method, tokens from previous turns don’t even need to be re-processed by the transformer layers. As long as the cache exists, the model can skip the heavy computation for those segments entirely. You simply load the cache and move forward without re-evaluating the earlier parts of the conversation. It’s an incredibly efficient way to handle long-context interactions.

Saving Money with Prompt Caching

When you successfully "hit" the prompt cache during a multi-turn conversation, several LLM API providers—such as OpenAI and Google Gemini—offer significant discounts. Below, I’ve broken down the current discount policies for both platforms.

OpenAI API

OpenAI applies a comprehensive half-price policy to cached input tokens.

While OpenAI’s pricing chart lists specific example models, most recent models, including the GPT-5 family, benefit from this pricing. They still haven’t updated the chart, despite it being over half a year since GPT-5 launched.

  • Activation: The minimum token length to trigger prompt caching is 1,024 tokens.

  • TTL (Time to Live): The cache typically persists for 5 to 10 minutes during peak hours, and up to one hour during off-peak times.

  • Extended Caching: Certain models (such as gpt-5.2 and gpt-4.1) support extended prompt caching, allowing the cache to be retained for up to 24 hours.

Google Gemini API

Gemini offers two distinct caching tiers: Implicit and Explicit.

  • Implicit Caching: This is automatically enabled for most Gemini models. The minimum threshold is 1,024 tokens for Flash models and 4,096 tokens for Pro models. However, Google does not strictly guarantee cost savings with this tier; it is primarily an optimization for latency.

  • Explicit Caching: This allows for a cost-saving guarantee but must be configured manually. You can define specific parameters like this:

from google import genai
from google.genai import types

client = genai.Client()
cache = client.caches.create(
    model=model,
    config=types.CreateCachedContentConfig(
        display_name='sherlock jr movie',  # used to identify the cache
        system_instruction=system_prompt,
        contents=content,
        ttl="300s",
    ),
)

The default TTL is 1 hour, with no minimum or maximum bound. Billing is based on these factors:

  1. Cache token count: The number of input tokens cached, billed at a reduced rate when included in subsequent prompts.

  2. Storage duration: The amount of time cached tokens are stored (the TTL), billed per cached token for the duration of the TTL. There are no minimum or maximum bounds on the TTL.

  3. Other factors: Other charges apply, such as for non-cached input tokens and output tokens.

Google also publishes the exact per-token prices for cached tokens. The Gemini-3-Pro pricing is one example; the rates vary by model.

Important Note: As seen in the Gemini 3 Pro pricing, you are charged a storage fee. Be mindful when setting a long TTL.


Conclusion

In this post, we have explored two pillars of LLM optimization: KV cache and Prompt caching. These techniques are indispensable for reducing latency and lowering operational costs. However, these efficiencies come with a significant trade-off: a massive increase in memory consumption.

This "memory wall" has forced developers and enterprises to secure high-performance GPUs with vast amounts of VRAM, not just for training, but for inference as well. It is no surprise that memory manufacturers like Samsung, SK Hynix, and Micron have seen such a surge in demand; their hardware is the literal foundation upon which these caching methods reside.

Currently, the memory bottleneck is one of the most significant "sticking points" in LLM architecture. Many of the world's leading tech companies are racing to find more efficient ways to handle these tricky difficulties. If the industry can overcome these scaling limits with new, innovative architectures, it will represent the next major breakthrough in the evolution of Artificial Intelligence.
