Shrijith Venkatramana

Posted on Jun 14

Prompt Caching in LLMs: The Hidden Optimization Saving Millions of GPU Hours

#ai #productivity #programming #webdev


Every developer eventually discovers the same frustrating pattern.

Your application sends a 20,000-token prompt to an LLM. The first request takes 2 seconds. The next request contains the exact same 20,000 tokens plus a tiny user message at the end.

And somehow the model processes the entire thing again.

At least, that's what many developers assume.

Modern LLM systems have a trick called prompt caching that can dramatically reduce latency and cost by reusing work from previous requests. But unlike traditional application caches, prompt caching isn't storing generated text. It's storing something much deeper inside the model.

To understand how prompt caching works, we need to follow a prompt all the way through the transformer itself.

The Expensive Part of Processing a Prompt

When a prompt enters a transformer model, it isn't immediately generating text.

First, the model must process every input token through every layer of the network.

Imagine a prompt like:

System: You are a helpful coding assistant.

Project Documentation:
[20,000 tokens of documentation]

User: How does authentication work?

Before generating a single output token, the model tokenizes the prompt and looks up an embedding for each token. Then every transformer layer applies:

Multi-head attention
Feed-forward networks
Layer normalization

...and this repeats across dozens or even hundreds of transformer layers.

For a large model, this phase, usually called prefill, often costs more compute than generating a short answer.

If another user asks:

System: You are a helpful coding assistant.

Project Documentation:
[Same 20,000 tokens]

User: Explain the database schema.

Most of the prompt is identical.

Without caching, the model would recompute everything from scratch.

Prompt caching exists to avoid that waste.
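To put rough numbers on that waste, here is a back-of-the-envelope sketch. It uses the common rule of thumb of about 2 FLOPs per parameter per token for a forward pass. The 70B parameter count and the 200-token answer are assumptions chosen for illustration, not figures from any particular provider.

```python
# Rule of thumb: a dense transformer forward pass costs roughly
# 2 FLOPs per parameter per token (extra attention cost over long
# contexts is ignored here, so this understates prefill if anything).
PARAMS = 70e9            # assumed model size
PROMPT_TOKENS = 20_000   # the documentation-heavy prompt from the example
OUTPUT_TOKENS = 200      # assumed length of a "short answer"


def forward_flops(num_tokens: int, params: float = PARAMS) -> float:
    """Approximate forward-pass FLOPs needed to process num_tokens tokens."""
    return 2.0 * params * num_tokens


prefill = forward_flops(PROMPT_TOKENS)
decode = forward_flops(OUTPUT_TOKENS)

print(f"prefill ~ {prefill:.2e} FLOPs")
print(f"decode  ~ {decode:.2e} FLOPs")
print(f"prefill / decode ~ {prefill / decode:.0f}x")
```

Under these assumptions the prompt accounts for about 100 times the arithmetic of the answer. Wall-clock time does not follow the same ratio, because token-by-token generation tends to be limited by memory bandwidth rather than raw compute, but the prompt is still where most of the work goes.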

The Key Insight: Cache Internal Transformer State, Not Text

A common misconception is that prompt caching stores prompt text.

That's not particularly useful because the model would still need to process the text again.

Instead, modern systems cache the transformer's internal representations.

After processing a token through the network, the model produces vectors that represent the token's state at various stages.

The most important cached data is usually:

Key vectors (K)
Value vectors (V)

These are generated during self-attention.

Once a prefix has been processed, those K/V tensors can often be reused.

Conceptually:

Prompt
  ↓
Token Embeddings
  ↓
Transformer Layers
  ↓
Key/Value Tensors
  ↓
Cache

When a future request begins with the same prefix, the system loads the cached tensors rather than recomputing them.

The model effectively starts from the middle of the computation.

Understanding KV Cache First

Prompt caching builds directly on a mechanism called the KV cache.

During inference, each attention layer creates:

Q = Query
K = Key
V = Value

Attention is computed roughly as:

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$

Here d is the dimension of each key vector.

When generating token 501, the model doesn't need to recompute the keys and values for tokens 1-500. Only the new token's query, key, and value are computed.

Instead it stores the previous K and V tensors.

This is the standard KV cache used during autoregressive generation.

Prompt caching extends the same idea across requests.

Instead of caching:

Request A tokens 1-500

it caches:

Shared prompt prefix

which can then be reused by:

Request B
Request C
Request D

as long as the prefix remains identical.
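The KV cache idea can be checked with a toy example. This is a minimal single-head sketch in NumPy, with random weights and a made-up dimension of 16. It only shows that appending one K row and one V row per step gives the same attention output as recomputing K and V for the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy head dimension (assumed)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend_last_token(x_new, K, V):
    """Attention output for the newest token, given all keys/values so far."""
    q = x_new @ W_q                      # only the new token needs a query
    scores = (q @ K.T) / np.sqrt(d)
    return softmax(scores) @ V


X = rng.standard_normal((6, d))          # embeddings for 6 toy tokens

# Without a cache: recompute K and V for every token at the final step.
full = attend_last_token(X[-1], X @ W_k, X @ W_v)

# With a cache: each step appends a single K row and a single V row.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for x in X:
    K_cache = np.vstack([K_cache, x @ W_k])
    V_cache = np.vstack([V_cache, x @ W_v])
    out = attend_last_token(x, K_cache, V_cache)

print(np.allclose(out, full))
```

The cached path does one row of projection work per step instead of reprojecting the full sequence, which is the saving that prompt caching carries over from one request to the next.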

What Actually Gets Stored?

Let's use a realistic example.

Suppose we have:

System Prompt: 2,000 tokens
Repository Documentation: 18,000 tokens
User Message: 100 tokens

Total:

20,100 tokens

Assume a model has:

80 layers
Hidden size 8,192
Multiple attention heads

For each layer, the system stores K and V tensors for every processed token.

Conceptually:

Layer 1:
  K[20000]
  V[20000]

Layer 2:
  K[20000]
  V[20000]

...

Layer 80:
  K[20000]
  V[20000]

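For a model of this size, the cache for a single 20,000-token prefix can run from a few gigabytes to tens of gigabytes (smaller models may need only hundreds of megabytes), depending on:

Context length
Layer count
Precision (FP16, BF16, FP8)
Number of key/value heads (grouped-query attention keeps fewer of these than query heads)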

This is why prompt caching isn't free.

The system trades memory for computation.

GPU memory is expensive, but recomputing a 20,000-token prompt repeatedly is often even more expensive.
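The memory side of that trade can be estimated directly. The sketch below assumes the hidden size of 8,192 is split into 64 heads of dimension 128, and it compares full multi-head attention with a grouped-query layout that keeps 8 key/value heads. Both splits are assumptions for illustration, since the example above only says "multiple attention heads".

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """K and V each hold tokens * kv_heads * head_dim values in every layer."""
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_value


TOKENS, LAYERS, HIDDEN = 20_000, 80, 8_192   # figures from the example above
HEADS, HEAD_DIM = 64, 128                    # assumed split of the hidden size
assert HEADS * HEAD_DIM == HIDDEN

configs = {
    "FP16, 64 KV heads (full multi-head)": (HEADS, 2),
    "FP16, 8 KV heads (grouped-query)": (8, 2),
    "FP8, 8 KV heads (grouped-query)": (8, 1),
}
for name, (kv_heads, nbytes) in configs.items():
    size = kv_cache_bytes(TOKENS, LAYERS, kv_heads, HEAD_DIM, nbytes)
    print(f"{name}: {size / 1e9:.1f} GB")
```

With these assumptions the three configurations come to roughly 52.4 GB, 6.6 GB, and 3.3 GB for one cached prefix, which is why head layout and precision matter as much as context length.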

Prefix Matching: Why Exact Equality Matters

Most production systems perform prompt caching using prefix matching.

Consider:

[System Prompt]
[Documentation]
User: Explain auth

and

[System Prompt]
[Documentation]
User: Explain database

The shared prefix is:

[System Prompt]
[Documentation]

Everything after that differs.

The cache can be reused because the transformer state for the shared prefix is identical.

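But even a small change invalidates the cache from that point onward:

Version 1:
Repository version: 2.1

Version 2:
Repository version: 2.2

That tiny change alters the token sequence.

Different tokens produce different embeddings.

Different embeddings produce different K/V tensors.

Because every token attends to the tokens before it, the K/V tensors of everything after the change differ as well. Only the part of the prefix before the edit stays reusable.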

This is why prompt caching systems often require exact token-level matches rather than semantic similarity.
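The reasoning behind exact matching can be demonstrated on a toy model. The sketch below builds a two-layer causal transformer in NumPy with random weights (all sizes and token IDs are made up) and compares the per-layer K/V tensors of three prompts: two that share a prefix, and one where a single early token is edited.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, LAYERS = 50, 8, 2               # toy sizes, chosen for illustration
EMBED = rng.standard_normal((VOCAB, D))
W = rng.standard_normal((LAYERS, 3, D, D)) / np.sqrt(D)   # W_q, W_k, W_v per layer


def kv_per_layer(token_ids):
    """Run a toy causal transformer and return each layer's (K, V)."""
    x = EMBED[np.asarray(token_ids)]
    n = len(token_ids)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    cache = []
    for W_q, W_k, W_v in W:
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = np.where(future, -np.inf, q @ k.T / np.sqrt(D))
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        x = x + weights @ v                # residual connection
        cache.append((k, v))
    return cache


def same_kv(a, b, start, stop):
    """True if positions start..stop-1 have equal K and V in every layer."""
    return all(np.allclose(ka[start:stop], kb[start:stop]) and
               np.allclose(va[start:stop], vb[start:stop])
               for (ka, va), (kb, vb) in zip(a, b))


prefix = [3, 14, 15, 9, 26]
base = kv_per_layer(prefix + [5, 35])
other_suffix = kv_per_layer(prefix + [8, 9, 7])
edited = kv_per_layer([3, 14, 16, 9, 26, 5, 35])   # token at position 2 changed

print(same_kv(base, other_suffix, 0, 5))   # shared prefix: True
print(same_kv(base, edited, 0, 2))         # positions before the edit: True
print(same_kv(base, edited, 3, 7))         # positions after the edit: False
```

The tokens after the edit are unchanged, yet their K/V tensors differ from the second layer on, because their hidden states have attended to the edited token. That is why a cache hit is defined on exact token prefixes rather than on overall similarity.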

How Providers Implement Prompt Caching

Different providers implement prompt caching differently, but the general architecture is similar.

Incoming Request
       ↓
Prefix Detection
       ↓
Cache Lookup
       ↓
Cache Hit?
    /      \
  Yes      No
   |         |
Load KV   Compute KV
   |         |
Generate Response
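One way the lookup step in that flow could work is to split the token sequence into fixed-size blocks and give each block a key that depends on everything before it. The sketch below is an illustration under that assumption, not a description of any specific provider. `PrefixCache`, `block_keys`, and the block size of 4 are hypothetical, and a string stands in for the real K/V tensors.

```python
import hashlib

BLOCK = 4   # tokens per cache block (assumed; kept tiny for the demo)


def block_keys(token_ids):
    """Chain-hash full blocks so each key identifies the whole prefix so far."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        running.update(repr(token_ids[i:i + BLOCK]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}                  # key -> stand-in for a block's K/V

    def lookup(self, token_ids):
        """Return how many leading tokens already have cached K/V."""
        hit = 0
        for key in block_keys(token_ids):
            if key not in self.blocks:
                break
            hit += BLOCK
        return hit

    def store(self, token_ids):
        for key in block_keys(token_ids):
            self.blocks.setdefault(key, "kv-placeholder")


cache = PrefixCache()
docs = list(range(100, 112))              # 12 "documentation" tokens
request_a = docs + [1, 2, 3]
request_b = docs + [7, 8, 9, 10, 11]

print(cache.lookup(request_a))            # cold cache: 0 tokens reusable
cache.store(request_a)
print(cache.lookup(request_b))            # 12 tokens reusable
```

Because each key is a hash of the running prefix, a block can only match when every earlier block matched too, so the lookup naturally stops at the first point where two prompts diverge.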

The difficult engineering problems include:

Cache Eviction

GPU memory is limited.

Providers must decide:

Which caches stay resident?
Which caches get removed?
Which users receive priority?

This resembles operating system page management more than traditional web caching.
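As a minimal sketch of one possible eviction policy, the class below keeps cached prefixes under a byte budget and drops the least recently used entry first. LRU is an assumption here, chosen because it is simple to show. Real serving systems may weigh priority, size, or recomputation cost as well. `LRUKVCache` and the prefix names are hypothetical.

```python
from collections import OrderedDict


class LRUKVCache:
    """Keeps cached prefixes under a byte budget, evicting least recently used."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()      # prefix key -> size in bytes

    def get(self, key):
        if key not in self.entries:
            return False
        self.entries.move_to_end(key)     # mark as most recently used
        return True

    def put(self, key, size):
        if size > self.capacity:
            return                        # too large to cache at all
        if key in self.entries:
            self.used -= self.entries.pop(key)
        while self.used + size > self.capacity:
            _, freed = self.entries.popitem(last=False)   # drop the oldest entry
            self.used -= freed
        self.entries[key] = size
        self.used += size


cache = LRUKVCache(capacity_bytes=10)
cache.put("docs-v1", 4)
cache.put("docs-v2", 4)
cache.get("docs-v1")                      # docs-v1 becomes the most recent
cache.put("docs-v3", 4)                   # over budget, so docs-v2 is evicted
print(list(cache.entries))                # ['docs-v1', 'docs-v3']
```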

Distributed Inference

Large serving systems spread requests across many GPUs.

A cached prefix may exist on GPU A while the next request arrives on GPU B.

Providers must either:

Move cache data
Route requests intelligently
Recompute the prefix
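The "route requests intelligently" option can be sketched as prefix-affinity routing: hash the leading tokens and use the hash to pick a worker, so that requests sharing a prefix tend to land where the cache already lives. The worker names and the choice of 8 routing tokens are hypothetical.

```python
import hashlib

WORKERS = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]   # hypothetical worker names
ROUTING_TOKENS = 8                               # assumed: route on the first 8 tokens


def pick_worker(token_ids, workers=WORKERS):
    """Map the leading tokens to a worker so shared prefixes land together."""
    head = repr(list(token_ids[:ROUTING_TOKENS])).encode()
    digest = hashlib.sha256(head).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]


docs = list(range(500, 520))
same = pick_worker(docs + [1, 2]) == pick_worker(docs + [9, 9, 9])
print(same)   # True: both requests start with the same tokens
```

A plain modulo like this reshuffles most prefixes whenever the worker list changes and can overload the worker that owns a popular prefix, so a production router would likely combine affinity with load awareness.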

Multi-Tenant Isolation

A cache created for one customer should not leak information to another customer.

Production systems must maintain strict isolation boundaries.
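One simple way to draw such a boundary is to make the tenant part of the cache key, so identical prompts from different customers never resolve to the same entry. The sketch below assumes that approach. `cache_key` and the tenant names are hypothetical, and real systems may isolate at other levels too.

```python
import hashlib


def cache_key(tenant_id: str, token_ids) -> str:
    """Derive a cache key that is scoped to one tenant."""
    h = hashlib.sha256()
    h.update(tenant_id.encode())
    h.update(b"\x00")                     # separator between tenant and tokens
    h.update(repr(list(token_ids)).encode())
    return h.hexdigest()


prompt = [101, 2023, 2003, 1037, 3231]
print(cache_key("tenant-a", prompt) == cache_key("tenant-a", prompt))   # True
print(cache_key("tenant-a", prompt) == cache_key("tenant-b", prompt))   # False
```

The cost of this design is that two tenants with the same public documentation each pay for their own copy of the cache.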

Why RAG Applications Benefit So Much

Retrieval-Augmented Generation systems that send the same context on every request are strong candidates for prompt caching.

Imagine a code assistant.

Every request includes:

System Prompt
Repository Rules
Architecture Docs
Coding Standards

Only the user question changes.

Without caching:

20,000 tokens processed
20,000 tokens processed
20,000 tokens processed
20,000 tokens processed

With caching:

20,000-token prefix processed once

Request 2:
reuse cache

Request 3:
reuse cache

Request 4:
reuse cache

Latency drops.

GPU compute per request drops.

Cost drops.

This is one reason why modern coding assistants can feel much faster than their raw context sizes would suggest.
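The four-request comparison above works out as follows, assuming each user question is about 100 tokens as in the earlier example and that every request after the first gets a full cache hit.

```python
PREFIX_TOKENS = 20_000    # shared system prompt and documentation
QUESTION_TOKENS = 100     # size of each user question
REQUESTS = 4

# Without caching, every request processes the prefix and the question.
without_cache = REQUESTS * (PREFIX_TOKENS + QUESTION_TOKENS)

# With caching, the prefix is processed once and only questions are new.
with_cache = PREFIX_TOKENS + REQUESTS * QUESTION_TOKENS

saved = 1 - with_cache / without_cache
print(without_cache, with_cache, f"{saved:.1%}")   # 80400 20400 74.6%
```

The saving grows with the number of requests that share the prefix, since the one-time cost is spread over more of them.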

The Future: Beyond Exact Prefix Caching

Today's prompt caching mostly relies on exact token matches.

Researchers are exploring more ambitious ideas:

Partial-prefix reuse
Semantic cache matching
Attention state compression
Cross-session cache persistence
Hierarchical context caching

The challenge is preserving correctness.

Exact matches guarantee identical transformer states.

Approximate matches introduce uncertainty.

Future systems may combine both approaches, using exact caches when possible and semantic reuse when beneficial.

Final Thoughts

Prompt caching is one of the least visible but most impactful optimizations in modern LLM serving.

The important realization is that the cache is not storing text and it is not storing generated responses.

It is storing the expensive internal transformer state—primarily key and value tensors—that would otherwise need to be recomputed.

Once you understand that, prompt caching starts looking less like an application-level optimization and more like a CPU instruction cache or an operating system memory cache: a mechanism for avoiding repeated work by preserving computation that has already been paid for.

As context windows continue growing from tens of thousands to millions of tokens, do you think exact prefix caching will remain dominant, or will future LLM systems need semantic and approximate caching techniques to stay efficient?

