Shrijith Venkatramana

Posted on Jun 13

KV Cache in LLMs: The Optimization That Makes Modern AI Models Feel Fast

#ai #productivity #programming #webdev

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

Large Language Models can generate surprisingly intelligent responses. But there's a hidden engineering challenge behind every answer:

LLMs generate text one token at a time. To predict each new token, a transformer model processes the entire sequence of tokens seen so far and uses its attention mechanism to determine which earlier tokens are most relevant for the next prediction. Naively, this means that when generating the 1,000th token, the model would need to repeatedly compute representations for the previous 999 tokens even though those tokens have not changed.

How do you generate the 1,000th token without recomputing information for the previous 999 tokens at every step?

If models had to recompute everything from scratch for every generated token, response times would be painfully slow and inference costs would explode.

The solution is one of the most important optimizations in modern LLM serving infrastructure:

KV Cache.

If you've ever worked with transformers, built AI products, or wondered why prompt length affects latency and memory, understanding KV Cache is essential.

Let's break it down from intuition to implementation.

The Problem: Autoregressive Generation Is Repetitive

LLMs generate text one token at a time.

Imagine the model receives:

The capital of France is

The model predicts:

Paris

Now the input becomes:

The capital of France is Paris

To generate the next token, the model runs another forward pass.

Then:

The capital of France is Paris .

And another forward pass.

And another.

The key observation is that most of the sequence remains unchanged between steps.

The capital of France is

has already been processed.

Recomputing representations for those old tokens every generation step would be wasteful.

This is exactly what KV Cache avoids.

Understanding Attention First

To understand KV Cache, we need a quick refresher on self-attention.

For each token, the transformer computes three vectors:

Query (Q)
Key (K)
Value (V)

A simplified attention calculation looks like:

Attention(Q, K, V)
    = softmax(QKᵀ / √dₖ)V

where dₖ is the dimension of each Key vector.

Each token creates its own K and V vectors.
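As a rough illustration, here is a minimal single-head sketch of that calculation in plain Python. The helper names `softmax` and `attend` and the toy vectors are made up for this example. It uses the standard scaled dot-product form, dividing the scores by the square root of the vector dimension.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy vectors (invented): the query lines up with the first key, not the second.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output = attend([5.0, 0.0], keys, values)
print(output)  # most of the weight lands on the first value vector
```

The Keys and Values go in as plain lists, and nothing about them depends on which query is asking. That property is what makes them cacheable.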

During generation, when a new token arrives, it needs to attend to all previous tokens.

For example:

Token 1 → K₁, V₁
Token 2 → K₂, V₂
Token 3 → K₃, V₃
...

When generating token 1000, the model needs access to:

K₁ ... K₉₉₉
V₁ ... V₉₉₉

The question becomes:

Why recompute them if they never changed?

The Core Idea of KV Cache

Instead of recalculating Keys and Values for previous tokens, we simply store them.

When token N is generated:

Compute K and V for the new token.
Append them to the cache.
Reuse all previously stored K and V tensors.

Visually:

Step 1

Token A
  ↓
Compute K₁,V₁
  ↓
Store in cache


Cache:
[K₁]
[V₁]

Step 2

Token B
  ↓
Compute K₂,V₂

Cache:
[K₁ K₂]
[V₁ V₂]

Step 3

Token C
  ↓
Compute K₃,V₃

Cache:
[K₁ K₂ K₃]
[V₁ V₂ V₃]

Now each step only needs to compute the Query, Key, and Value for the newest token, then attend using the cached Keys and Values from earlier tokens.

This dramatically reduces computation.
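To make the append-and-reuse steps concrete, here is a toy sketch in plain Python for a single layer and a single head. The projection weights `W_Q`, `W_K`, `W_V`, the `KVCache` class, and the token embeddings are all invented for illustration. A real model works on tensors across many layers and heads.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def project(matrix, vector):
    """Multiply a weight matrix (list of rows) by a token embedding."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Made-up 2x2 projection weights standing in for a trained layer.
W_Q = [[1.0, 0.5], [0.0, 1.0]]
W_K = [[0.5, 1.0], [1.0, 0.0]]
W_V = [[1.0, 0.0], [0.5, 2.0]]

class KVCache:
    """Keeps the Keys and Values of every token seen so far (one layer, one head)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, embedding):
        # Project only the new token, append its K and V, then attend over the cache.
        self.keys.append(project(W_K, embedding))
        self.values.append(project(W_V, embedding))
        return attend(project(W_Q, embedding), self.keys, self.values)

def recompute_from_scratch(embeddings):
    """Uncached baseline: rebuild every K and V, then attend from the last token."""
    keys = [project(W_K, e) for e in embeddings]
    values = [project(W_V, e) for e in embeddings]
    return attend(project(W_Q, embeddings[-1]), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings for tokens A, B, C
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
```

Each `cached_outputs[i]` should match `recompute_from_scratch(tokens[:i + 1])`. The cache changes how much work is done, not the result.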

What Actually Gets Saved?

Many developers initially assume the cache stores hidden states.

It doesn't.

The cache stores:

Keys
Values

for every attention layer.

Suppose a model has:

32 layers
32 attention heads

Each layer maintains its own KV cache.

Conceptually:

Layer 1
 ├── Keys
 └── Values

Layer 2
 ├── Keys
 └── Values

...

Layer 32
 ├── Keys
 └── Values

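This means cache memory grows with:

Number of layers
Number of key/value heads
Head dimension
Sequence length
Batch size (number of sequences being served at once)
Bytes per stored value (numeric precision)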

This is why long-context inference can become memory-intensive.
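A back-of-the-envelope estimate makes this tangible. The sketch below multiplies those factors together for the 32-layer, 32-head model above. The head dimension of 128, the 16-bit precision (2 bytes per value), and the function name `kv_cache_bytes` are assumptions added for the example, not figures for any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Estimate KV cache size: one Key and one Value vector per token, head, and layer."""
    keys_and_values = 2
    return (keys_and_values * layers * kv_heads * head_dim
            * seq_len * bytes_per_value * batch)

# 32 layers and 32 heads come from the example above; head_dim=128 and
# 2 bytes per value (16-bit) are assumed for illustration.
per_token = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)

print(per_token / 2**20)     # 0.5  (MiB per token)
print(full_context / 2**30)  # 2.0  (GiB for a 4,096-token sequence)
```

Under these assumptions a single 4,096-token sequence holds about 2 GiB of cache, separate from the model weights.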

Why KV Cache Makes Inference Faster

Without caching:

Generation Step 1000

Recompute tokens:
1...999

Then compute token 1000

With caching:

Generation Step 1000

Reuse:
1...999

Compute only:
1000

The complexity improvement is substantial.

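Naively, step t reprocesses all t tokens, and attention over t tokens costs on the order of t² operations. Summed across n generation steps, total attention cost grows roughly as:

O(n³)

With KV caching, step t only compares one new Query against t cached Keys, so the total across n steps grows roughly as:

O(n²)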

The exact complexity depends on implementation details, but the key takeaway is that cached inference avoids repeatedly processing the entire prefix.
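One way to get a feel for the gap is to count attention scores (query-key comparisons) over a whole generation. This is a simplified counting model, not a benchmark. The function names are invented, and it ignores everything outside attention. A causal mask would roughly halve the uncached count without changing its growth rate.

```python
def attention_scores_without_cache(n):
    """Step t reruns attention for all t tokens over the prefix: about t * t scores."""
    return sum(t * t for t in range(1, n + 1))

def attention_scores_with_cache(n):
    """Step t compares one new query against t cached keys: t scores."""
    return sum(t for t in range(1, n + 1))

naive = attention_scores_without_cache(1000)
cached = attention_scores_with_cache(1000)
print(naive, cached, naive // cached)  # 333833500 500500 667
```

For 1,000 generated tokens, this toy count puts the uncached path at several hundred times more attention work, and the ratio keeps growing with sequence length.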

In production systems, this difference is enormous.

Without KV caching, modern chat systems would be far slower and significantly more expensive to operate.

The Hidden Tradeoff: Memory

KV Cache speeds up computation, but memory usage increases.

A rough intuition:

Longer conversation
    ↓
More tokens
    ↓
Larger KV cache
    ↓
More GPU memory consumed

This creates one of the biggest bottlenecks in LLM serving.

For example:

1 user
    = small cache

10,000 users
    = 10,000 caches

Serving infrastructure must allocate GPU memory for every active session.

This is why inference platforms spend significant effort on:

Cache compression
Cache sharing
Paged attention
Prefix caching
Quantized KV caches

In large deployments, memory often becomes the limiting factor before raw compute.
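A quick capacity estimate shows why. The numbers below are assumptions chosen for illustration: 0.5 MiB of cache per token (roughly a 32-layer, 32-head model with a head dimension of 128 in 16-bit precision) and 40 GiB of GPU memory left for caches after loading weights. The helper `max_concurrent_sessions` is hypothetical.

```python
def max_concurrent_sessions(cache_budget_bytes, bytes_per_token, tokens_per_session):
    """How many sessions fit if each one holds a KV cache of the given length."""
    return cache_budget_bytes // (bytes_per_token * tokens_per_session)

BYTES_PER_TOKEN = 512 * 1024  # assumed: 0.5 MiB of KV cache per token
budget = 40 * 2**30           # assumed: 40 GiB available for KV caches

short_chats = max_concurrent_sessions(budget, BYTES_PER_TOKEN, 1_000)
long_chats = max_concurrent_sessions(budget, BYTES_PER_TOKEN, 8_000)
print(short_chats, long_chats)  # 81 10
```

With these assumed figures, going from 1,000-token to 8,000-token conversations cuts the number of sessions that fit on the device from 81 to 10, with no change in compute capacity.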

Advanced Optimization: Prefix Reuse

Suppose many users share the same system prompt:

You are a helpful coding assistant...

Without optimization:

User A → Build KV cache
User B → Build KV cache
User C → Build KV cache

The same work is repeated.

Modern inference engines often support prefix caching.

Shared Prompt
      ↓
Shared KV Cache
      ↓
Reused Across Requests

Frameworks such as vLLM and other high-performance serving systems heavily exploit this idea.

For workloads with large shared prompts, the savings can be dramatic.
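The lookup idea can be sketched in a few lines. This is a toy illustration only, and not how vLLM or any specific engine implements it. `PrefixCache`, `fake_build_kv`, and the token ids are all made up, and the "KV cache" is a stand-in dictionary rather than real tensors.

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: reuse a stored KV cache when the prompt prefix matches exactly."""
    def __init__(self, build_kv):
        self.build_kv = build_kv  # expensive function: token ids -> KV cache
        self.store = {}
        self.builds = 0           # how many times the expensive build ran

    def get(self, prefix_token_ids):
        key = hashlib.sha256(repr(tuple(prefix_token_ids)).encode()).hexdigest()
        if key not in self.store:
            self.builds += 1
            self.store[key] = self.build_kv(prefix_token_ids)
        return self.store[key]

def fake_build_kv(token_ids):
    # Placeholder for a real prefill pass over the prefix.
    return {"cached_tokens": len(token_ids)}

system_prompt = [101, 7, 42, 9]  # made-up token ids for a shared system prompt
prefix_cache = PrefixCache(fake_build_kv)
for _user in ("A", "B", "C"):
    kv = prefix_cache.get(system_prompt)

print(prefix_cache.builds)  # 1
```

Three requests trigger one build. The reuse only happens on an exact match of the prefix tokens, so any per-user difference inside the prefix produces a different key and a fresh build.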

How KV Cache Appears in Code

In Hugging Face Transformers, KV Cache is often exposed as:

past_key_values

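A simplified greedy generation loop looks like this, where model is a causal language model, input_ids holds the tokenized prompt, and max_new_tokens is a loop limit chosen for illustration:

cache = None  # nothing is cached before the first pass
for _ in range(max_new_tokens):
    outputs = model(
        input_ids=input_ids,  # full prompt on the first pass, one token afterwards
        past_key_values=cache,
        use_cache=True,
    )
    cache = outputs.past_key_values
    # Greedy pick; only this new token is fed in on the next pass
    input_ids = outputs.logits[:, -1:].argmax(dim=-1)

The first pass processes the whole prompt and creates the cache.

Subsequent passes feed in only the newest token and reuse the cache.

Under the hood, the model computes Queries, Keys, and Values only for the new tokens while reusing cached Keys and Values from earlier tokens.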

Most developers never need to implement KV caching manually, but understanding it helps explain performance behavior.

Why Every LLM Engineer Should Understand KV Cache

When developers encounter:

Slower responses on long prompts
GPU memory explosions
Context-length limitations
Throughput bottlenecks
Inference scaling challenges

KV Cache is often part of the explanation.

It is one of those rare optimizations that fundamentally changed the economics of LLM serving.

The transformer architecture made large language models possible.

KV Cache made them practical.

Without it, the conversational AI products we use every day would feel dramatically slower and cost far more to operate.

What other LLM inference optimization would you like to see explained next—Paged Attention, Speculative Decoding, Continuous Batching, or FlashAttention?


While ChatGPT is a well-known example, KV Cache is not specific to ChatGPT. It is used across most transformer-based autoregressive models, including GPT-style models, Llama, Mistral, Claude, Gemini, and many open-source LLMs.


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.


Top comments (1)

Mudassir Khan • Jun 20

the prefix caching section is where this gets weird in production. when you have 10k users on the same system prompt, the shared KV cache optimization is massive. but personalized context in the prompt (account level instructions, org specific tool descriptions) makes the prefix hash diverge per user — and you lose the benefit entirely.

we ran into this building an agent with MCP tool descriptions in the system prompt. every schema added made prefix caching effectively useless because each user's prompt was unique after the shared preamble.

curious if you've seen setups that split the prompt into a shared cacheable prefix + a per user suffix? or is quantized KV cache the more practical path once the prefix breaks down?