DEV Community

Shrijith Venkatramana

KV Cache in LLMs: The Optimization That Makes Modern AI Models Feel Fast

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


Large Language Models can generate surprisingly intelligent responses. But there's a hidden engineering challenge behind every answer:

LLMs generate text one token at a time. To predict each new token, a transformer model processes the entire sequence of tokens seen so far and uses its attention mechanism to determine which earlier tokens are most relevant for the next prediction. Naively, this means that when generating the 1,000th token, the model would need to repeatedly compute representations for the previous 999 tokens even though those tokens have not changed.

How do you generate the 1,000th token without recomputing information for the previous 999 tokens at every step?

If models had to recompute everything from scratch for every generated token, response times would be painfully slow and inference costs would explode.

The solution is one of the most important optimizations in modern LLM serving infrastructure:

KV Cache.

If you've ever worked with transformers, built AI products, or wondered why prompt length affects latency and memory, understanding KV Cache is essential.

Let's break it down from intuition to implementation.

The Problem: Autoregressive Generation Is Repetitive

LLMs generate text one token at a time.

Imagine the model receives:

The capital of France is

The model predicts:

Paris

Now the input becomes:

The capital of France is Paris

To generate the next token, the model runs another forward pass.

Then:

The capital of France is Paris .

And another forward pass.

And another.

And another.

The key observation is that most of the sequence remains unchanged between steps.

The capital of France is

has already been processed.

Recomputing representations for those old tokens every generation step would be wasteful.

This is exactly what KV Cache avoids.

Understanding Attention First

To understand KV Cache, we need a quick refresher on self-attention.

For each token, the transformer computes three vectors:

  • Query (Q)
  • Key (K)
  • Value (V)

A simplified attention calculation looks like:

Attention(Q, K, V)
    = softmax(QKᵀ)V

Each token creates its own K and V vectors.
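
The simplified formula above leaves out two details that standard transformer decoders add: scores are divided by the square root of the head dimension, and a causal mask stops a token from attending to later positions. A minimal NumPy sketch of single-head attention with both, using random arrays as stand-ins for real Q, K, and V:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Single-head attention over arrays of shape (seq_len, head_dim)."""
    seq_len, head_dim = Q.shape
    scores = Q @ K.T / np.sqrt(head_dim)      # (seq_len, seq_len)
    # Causal mask: position i may only look at positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V                # (seq_len, head_dim)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))  # illustrative sizes
out = causal_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

The first token can only attend to itself, so its output row is exactly its own value vector.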

During generation, when a new token arrives, it needs to attend to all previous tokens.

For example:

Token 1 → K₁, V₁
Token 2 → K₂, V₂
Token 3 → K₃, V₃
...

When generating token 1000, the model needs access to:

K₁ ... K₉₉₉
V₁ ... V₉₉₉

The question becomes:

Why recompute them if they never changed?
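
The claim that earlier Keys and Values never change can be checked directly. The sketch below runs one causal self-attention layer on a 4-token prefix, then on the same prefix plus one extra token. The weights and embeddings are random placeholders, not a real model:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """One single-head causal self-attention layer; X is (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return K, V, weights @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((5, d))  # embeddings for 5 tokens

K4, V4, out4 = causal_self_attention(X[:4], W_q, W_k, W_v)  # 4-token prefix
K5, V5, out5 = causal_self_attention(X, W_q, W_k, W_v)      # prefix + 1 new token

# Keys, values, and layer outputs for the first 4 positions are unchanged.
print(np.allclose(K4, K5[:4]), np.allclose(V4, V5[:4]), np.allclose(out4, out5[:4]))
```

Because the layer outputs for old positions are also unchanged, the inputs to the next layer stay the same, so the argument carries through every layer of a causal model.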

The Core Idea of KV Cache

Instead of recalculating Keys and Values for previous tokens, we simply store them.

When token N is generated:

  1. Compute K and V for the new token.
  2. Append them to the cache.
  3. Reuse all previously stored K and V tensors.

Visually:

Step 1

Token A
  ↓
Compute K₁,V₁
  ↓
Store in cache


Cache:
[K₁]
[V₁]

Step 2

Token B
  ↓
Compute K₂,V₂

Cache:
[K₁ K₂]
[V₁ V₂]

Step 3

Token C
  ↓
Compute K₃,V₃

Cache:
[K₁ K₂ K₃]
[V₁ V₂ V₃]

Now attention only requires computing the Query, Key, and Value for the newest token. The Keys and Values for all earlier tokens come straight from the cache.

This dramatically reduces computation.
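
The three steps above can be sketched as a toy cache for one layer and one head. `KVCache` and `decode_step` are hypothetical names for this illustration, and the weights are random placeholders:

```python
import numpy as np

class KVCache:
    """Toy single-layer, single-head cache; grows by one row per token."""
    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(x, cache, W_q, W_k, W_v):
    """Process one new token embedding x of shape (d_model,)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # step 1: new token only
    cache.append(k, v)                               # step 2: store K and V
    scores = cache.keys @ q / np.sqrt(q.shape[-1])   # step 3: reuse stored K
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values                    # ...and stored V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((5, d))  # embeddings for 5 tokens, fed one at a time

cache = KVCache(d)
outputs = np.stack([decode_step(x, cache, W_q, W_k, W_v) for x in X])
print(cache.keys.shape)  # (5, 8)
```

Feeding tokens one at a time through the cache gives the same outputs as running full causal attention over all five tokens at once. Growing the arrays with `np.vstack` keeps the sketch short; a real serving system would more likely preallocate memory instead of copying on every step.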

What Actually Gets Saved?

Many developers initially assume the cache stores hidden states.

It doesn't.

The cache stores:

Keys
Values

for every attention layer.

Suppose a model has:

32 layers
32 attention heads

Each layer maintains its own KV cache.

Conceptually:

Layer 1
 ├── Keys
 └── Values

Layer 2
 ├── Keys
 └── Values

...

Layer 32
 ├── Keys
 └── Values

This means cache memory grows with:

  • Number of layers
  • Number of heads
  • Head dimension
  • Sequence length

This is why long-context inference can become memory-intensive.
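
A back-of-the-envelope estimate makes the growth concrete. The function below multiplies the factors listed above, plus two that the list leaves implicit: the batch size and the number of bytes per stored value. The 32 layers and 32 heads come from the example model; the head dimension of 128, the 4,096-token context, and 16-bit storage are assumptions for illustration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and position."""
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=1)

print(f"{size / 2**30:.1f} GiB")       # 2.0 GiB
print(f"{per_token // 1024} KiB per token")  # 512 KiB per token
```

The parameter is named `num_kv_heads` because some models share key/value heads across several query heads, which shrinks the cache proportionally.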

Why KV Cache Makes Inference Faster

Without caching:

Generation Step 1000

Recompute tokens:
1...999

Then compute token 1000

With caching:

Generation Step 1000

Reuse:
1...999

Compute only:
1000

The complexity improvement is substantial.

Naively:

O(n³)

behavior emerges across n generation steps, because each step redoes attention over the entire prefix.

With KV caching:

O(n²)

total generation cost, because each step only scores the newest token against the cached Keys.

The exact complexity depends on implementation details, but the key takeaway is that cached inference avoids repeatedly processing the entire prefix.

In production systems, this difference is enormous.

Without KV caching, modern chat systems would be far slower and significantly more expensive to operate.
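
The gap between cubic and quadratic growth can be made concrete by counting query-key score computations. This is a rough sketch that counts attention scores only, for a single layer and head, and ignores projections and feed-forward layers:

```python
def attention_pairs(num_tokens, use_cache):
    """Count query-key scores computed while generating num_tokens tokens."""
    total = 0
    for t in range(1, num_tokens + 1):
        if use_cache:
            # Only the newest query is scored against t keys.
            total += t
        else:
            # Causal attention is recomputed for all t positions.
            total += t * (t + 1) // 2
    return total

for n in (100, 1000):
    naive = attention_pairs(n, use_cache=False)
    cached = attention_pairs(n, use_cache=True)
    print(n, naive, cached, round(naive / cached, 1))
```

For 1,000 tokens this comes to about 167 million scores without the cache versus about half a million with it, and the ratio keeps growing with sequence length.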

The Hidden Tradeoff: Memory

KV Cache speeds up computation, but memory usage increases.

A rough intuition:

Longer conversation
    ↓
More tokens
    ↓
Larger KV cache
    ↓
More GPU memory consumed

This creates one of the biggest bottlenecks in LLM serving.

For example:

1 user
    = small cache

10,000 users
    = 10,000 caches

Serving infrastructure must allocate GPU memory for every active session.

This is why inference platforms spend significant effort on:

  • Cache compression
  • Cache sharing
  • Paged attention
  • Prefix caching
  • Quantized KV caches

In large deployments, memory often becomes the limiting factor before raw compute.
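
To see why memory can run out before compute does, here is a rough capacity estimate. Every number is an illustrative assumption rather than a measurement, and the sketch ignores activations, fragmentation, and other overheads:

```python
def max_concurrent_sessions(gpu_bytes, weight_bytes, bytes_per_token,
                            tokens_per_session):
    """How many full-length sessions fit in the memory left after the weights."""
    free = max(gpu_bytes - weight_bytes, 0)
    return free // (bytes_per_token * tokens_per_session)

GiB = 2**30
sessions = max_concurrent_sessions(
    gpu_bytes=80 * GiB,          # hypothetical accelerator memory
    weight_bytes=14 * GiB,       # hypothetical space taken by model weights
    bytes_per_token=512 * 1024,  # 2 (K and V) * 32 layers * 32 heads * 128 dims * 2 bytes
    tokens_per_session=4096,
)
print(sessions)  # 33
```

Under these assumptions a single accelerator holds only a few dozen full-length sessions, which is the pressure behind the techniques listed above.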

Advanced Optimization: Prefix Reuse

Suppose many users share the same system prompt:

You are a helpful coding assistant...

Without optimization:

User A → Build KV cache
User B → Build KV cache
User C → Build KV cache

The same work is repeated.

Modern inference engines often support prefix caching.

Shared Prompt
      ↓
Shared KV Cache
      ↓
Reused Across Requests

Frameworks such as vLLM and other high-performance serving systems heavily exploit this idea.

For workloads with large shared prompts, the savings can be dramatic.
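
The idea can be sketched with a dictionary keyed by the prompt's token ids. `PrefixCache` and `compute_kv` are hypothetical names, and the embeddings and weights are random placeholders. This is a toy for one layer, not how any particular engine implements prefix caching; real systems typically also handle partial matches and eviction:

```python
import numpy as np

class PrefixCache:
    """Toy prefix cache: maps a tuple of token ids to its precomputed (K, V)."""
    def __init__(self, compute_kv):
        self.compute_kv = compute_kv  # function: token ids -> (K, V)
        self.store = {}
        self.misses = 0

    def get(self, prefix_ids):
        key = tuple(prefix_ids)
        if key not in self.store:
            self.misses += 1          # first request pays for the computation
            self.store[key] = self.compute_kv(prefix_ids)
        return self.store[key]

rng = np.random.default_rng(0)
embedding = rng.standard_normal((100, 8))  # hypothetical 100-token vocabulary
W_k, W_v = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))

def compute_kv(token_ids):
    x = embedding[list(token_ids)]
    return x @ W_k, x @ W_v

system_prompt = [5, 17, 42, 8]             # stand-in ids for the shared prompt
cache = PrefixCache(compute_kv)
for user in ("A", "B", "C"):
    K, V = cache.get(system_prompt)        # users B and C reuse A's work
print(cache.misses)  # 1
```

Three requests arrive, but the Keys and Values for the shared prompt are computed only once.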

How KV Cache Appears in Code

In Hugging Face Transformers, KV Cache is often exposed as:

past_key_values

A simplified generation loop looks like:

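# Assumes `model` is a causal LM, `input_ids` holds the tokenized prompt,
# and `max_new_tokens` is set by the caller.
cache = None  # nothing is cached before the first pass
for _ in range(max_new_tokens):
    outputs = model(
        input_ids=input_ids,
        past_key_values=cache,
        use_cache=True
    )
    cache = outputs.past_key_values
    # Greedy choice; only the new token is fed next, the cache covers the rest.
    input_ids = outputs.logits[:, -1:].argmax(dim=-1)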

The first pass creates the cache.

Subsequent passes reuse it.

Under the hood, the model only computes attention state for newly generated tokens while leveraging cached Keys and Values from earlier tokens.

Most developers never need to implement KV caching manually, but understanding it helps explain performance behavior.

Why Every LLM Engineer Should Understand KV Cache

When developers encounter:

  • Slower responses on long prompts
  • GPU memory explosions
  • Context-length limitations
  • Throughput bottlenecks
  • Inference scaling challenges

KV Cache is often part of the explanation.

It is one of those rare optimizations that fundamentally changed the economics of LLM serving.

The transformer architecture made large language models possible.

KV Cache made them practical.

Without it, the conversational AI products we use every day would feel dramatically slower and cost far more to operate.

What other LLM inference optimization would you like to see explained next—Paged Attention, Speculative Decoding, Continuous Batching, or FlashAttention?



*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
