Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.
Every developer eventually discovers the same frustrating pattern.
Your application sends a 20,000-token prompt to an LLM. The first request takes 2 seconds. The next request contains the exact same 20,000 tokens plus a tiny user message at the end.
And somehow the model processes the entire thing again.
At least, that's what many developers assume.
Modern LLM systems have a trick called prompt caching that can dramatically reduce latency and cost by reusing work from previous requests. But unlike traditional application caches, prompt caching isn't storing generated text. It's storing something much deeper inside the model.
To understand how prompt caching works, we need to follow a prompt all the way through the transformer itself.
The Expensive Part of Processing a Prompt
When a prompt enters a transformer model, the model doesn't start generating text immediately.
First, the model must process every input token through every layer of the network.
Imagine a prompt like:
System: You are a helpful coding assistant.
Project Documentation:
[20,000 tokens of documentation]
User: How does authentication work?
Before generating a single output token, the model performs:
- Tokenization
- Embedding lookup
- Multi-head attention
- Feed-forward networks
- Layer normalization
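Tokenization and embedding lookup happen once. Attention, feed-forward, and normalization then repeat in every one of the model's dozens or even hundreds of transformer layers.
For a large model, this prompt-processing phase (usually called prefill) is often more expensive than generating a short answer.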
If another user asks:
System: You are a helpful coding assistant.
Project Documentation:
[Same 20,000 tokens]
User: Explain the database schema.
Most of the prompt is identical.
Without caching, the model would recompute everything from scratch.
Prompt caching exists to avoid that waste.
The Key Insight: Cache Internal Transformer State, Not Text
A common misconception is that prompt caching stores prompt text.
That's not particularly useful because the model would still need to process the text again.
Instead, modern systems cache the transformer's internal representations.
After processing a token through the network, the model produces vectors that represent the token's state at various stages.
The most important cached data is usually:
- Key vectors (K)
- Value vectors (V)
These are generated during self-attention.
Once a prefix has been processed, those K/V tensors can often be reused.
Conceptually:
Prompt
↓
Token Embeddings
↓
Transformer Layers
↓
Key/Value Tensors
↓
Cache
When a future request begins with the same prefix, the system loads the cached tensors rather than recomputing them.
The model effectively starts from the middle of the computation.
Understanding KV Cache First
Prompt caching builds directly on a mechanism called the KV cache.
During inference, each attention layer creates:
Q = Query
K = Key
V = Value
Attention is computed roughly as:
\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V
When generating token 501, the model doesn't want to recompute the keys and values for tokens 1-500. They depend only on earlier tokens, so they never change once computed.
Instead it stores the previous K and V tensors.
This is the standard KV cache used during autoregressive generation.
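As a rough illustration of the idea, here is a minimal single-head sketch in plain Python. The projection matrices, the 2-dimensional embeddings, and the `KVCache` class are made up for this example; a real model has many heads and layers. It only shows that appending the new token's K and V to a cache gives the same attention output as recomputing K and V for every token.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]

# Toy projection matrices (hypothetical values, 2-dimensional model).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [-0.2, 0.3]]
W_V = [[0.2, 0.4], [0.6, -0.1]]

def full_recompute(xs):
    """Output for the last token, recomputing K and V for every token."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append to the cache, then attend."""
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

embeddings = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.25]]
cache = KVCache()
for x in embeddings:
    cached_out = cache.step(x)  # one K/V projection per step
full_out = full_recompute(embeddings)  # three K/V projections at once
```

With the cache, each step projects one token instead of the whole sequence, which is where the savings come from as the sequence grows.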
Prompt caching extends the same idea across requests.
Instead of caching:
Request A token 1-500
it caches:
Shared prompt prefix
which can then be reused by:
Request B
Request C
Request D
as long as the prefix remains identical.
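A small sketch of that cross-request reuse, under simplifying assumptions: `kv_entry` is a hypothetical stand-in that hashes the tokens up to a position instead of running a transformer, which mimics the fact that a position's K/V depends only on the tokens at or before it. The token strings are also invented.

```python
import hashlib

compute_calls = 0  # counts how many token positions were actually processed

def kv_entry(tokens, i):
    """Stand-in for the K/V of position i: depends on tokens[0..i] only."""
    global compute_calls
    compute_calls += 1
    data = "\x1f".join(tokens[: i + 1]).encode()
    return hashlib.sha256(data).hexdigest()[:8]

def compute_kv(tokens, start=0):
    """'Process' positions start..end, returning one entry per position."""
    return [kv_entry(tokens, i) for i in range(start, len(tokens))]

shared_prefix = ["<system>", "doc-1", "doc-2", "doc-3"]
prefix_kv = compute_kv(shared_prefix)  # paid once: 4 positions

def serve(user_tokens):
    tokens = shared_prefix + user_tokens
    # Reuse the stored prefix entries; compute only the new suffix positions.
    return prefix_kv + compute_kv(tokens, start=len(shared_prefix))

kv_b = serve(["explain", "auth"])       # 2 new positions
kv_c = serve(["explain", "database"])   # 2 new positions
calls_with_reuse = compute_calls        # 4 + 2 + 2

# Reference: the same request processed with no cache at all.
from_scratch = compute_kv(shared_prefix + ["explain", "auth"])
```

The reused result matches the from-scratch one position by position, while the two served requests only paid for their own suffix tokens.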
What Actually Gets Stored?
Let's use a realistic example.
Suppose we have:
System Prompt: 2,000 tokens
Repository Documentation: 18,000 tokens
User Message: 100 tokens
Total:
20,100 tokens
Assume a model has:
- 80 layers
- Hidden size 8,192
- Multiple attention heads
For each layer, the system stores K and V tensors for every processed token.
Conceptually:
Layer 1:
K[20000]
V[20000]
Layer 2:
K[20000]
V[20000]
...
Layer 80:
K[20000]
V[20000]
For a model of this size, the cache can occupy anywhere from several gigabytes to tens of gigabytes depending on:
- Context length
- Layer count
- Precision (FP16, BF16, FP8)
- Number of key/value heads (grouped-query attention uses fewer than the number of query heads)
This is why prompt caching isn't free.
The system trades memory for computation.
GPU memory is expensive, but recomputing a 20,000-token prompt repeatedly is often even more expensive.
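To get a feel for the numbers, here is a back-of-the-envelope calculation for the 20,000-token prefix and 80-layer model above. The split of the 8,192 hidden size into 64 heads of 128 dimensions, and the 8-KV-head grouped-query variant, are assumptions made for illustration, not properties of any particular model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    # Two tensors (K and V) per layer, each of shape tokens x kv_heads x head_dim.
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_value

# Full multi-head attention: 64 heads x 128 dims = hidden size 8,192 (assumed split).
mha = kv_cache_bytes(20_000, 80, 64, 128, 2)  # 2 bytes per value for FP16/BF16

# Grouped-query attention with 8 KV heads (assumed), same head size.
gqa = kv_cache_bytes(20_000, 80, 8, 128, 2)

print(f"MHA, FP16: {mha / 1e9:.1f} GB")  # 52.4 GB
print(f"GQA, FP16: {gqa / 1e9:.1f} GB")  # 6.6 GB
```

Halving the precision (for example FP8 at 1 byte per value) halves both figures, which is why precision and KV head count matter so much for how many prefixes fit in memory.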
Prefix Matching: Why Exact Equality Matters
Most production systems perform prompt caching using prefix matching.
Consider:
[System Prompt]
[Documentation]
User: Explain auth
and
[System Prompt]
[Documentation]
User: Explain database
The shared prefix is:
[System Prompt]
[Documentation]
Everything after that differs.
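The cache can be reused because the transformer state for the shared prefix is identical. With causal attention, each token only attends to the tokens before it, so the K/V tensors for the prefix do not depend on what comes after.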
But even small changes can invalidate the cache:
Version 1:
Repository version: 2.1
Version 2:
Repository version: 2.2
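That tiny change produces different tokens at that position.
Different tokens produce different embeddings.
Different embeddings produce different K/V tensors.
And because every later token attends to the changed one, the computation for everything after it changes too. Only the part of the prompt before the first differing token can still be reused.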
This is why prompt caching systems often require exact token-level matches rather than semantic similarity.
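A quick way to see this is to measure how many leading tokens two prompts share. The sketch below uses a whitespace split as a hypothetical stand-in for a real subword tokenizer, and the prompt text is invented for the example.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def toy_tokenize(text):
    # Hypothetical stand-in: real tokenizers use subword units, not whitespace.
    return text.split()

template = (
    "You are a coding assistant. Repository version: {version} "
    "Docs: auth uses JWT. User: Explain {topic}"
)

auth_v21 = toy_tokenize(template.format(version="2.1", topic="auth"))
db_v21 = toy_tokenize(template.format(version="2.1", topic="database"))
auth_v22 = toy_tokenize(template.format(version="2.2", topic="auth"))

# Same docs, different question: everything up to the question is reusable.
reusable_same_docs = shared_prefix_len(auth_v21, db_v21)          # 14 of 15
# Same question, version bumped early in the prompt: reuse stops at the change.
reusable_after_version_bump = shared_prefix_len(auth_v21, auth_v22)  # 7 of 15
```

The practical consequence is that stable content should come first in the prompt and anything that changes often should come as late as possible.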
How Providers Implement Prompt Caching
Different providers implement prompt caching differently, but the general architecture is similar.
Incoming Request
↓
Prefix Detection
↓
Cache Lookup
↓
Cache Hit?
/ \
Yes No
| |
Load KV Compute KV
| |
Generate Response
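The lookup step in the diagram above can be sketched as follows. This is one possible design, loosely in the spirit of block-based prefix caches in open-source inference servers, and not a description of any specific provider. The block size, the `store` dictionary, and the `"kv-block"` placeholder are all assumptions for illustration.

```python
import hashlib

BLOCK = 4  # tokens per cache block (assumed; real systems use larger blocks)

def block_keys(tokens):
    """Chain-hash full blocks so each key identifies the whole prefix up to it."""
    keys, parent = [], ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        payload = parent + "|" + "\x1f".join(tokens[i : i + BLOCK])
        parent = hashlib.sha256(payload.encode()).hexdigest()
        keys.append(parent)
    return keys

store = {}  # block key -> placeholder for that block's K/V tensors

def handle(tokens):
    """Return (tokens served from cache, tokens that had to be computed)."""
    keys = block_keys(tokens)
    hit_blocks = 0
    for key in keys:
        if key not in store:
            break  # first miss ends the reusable prefix
        hit_blocks += 1
    for key in keys[hit_blocks:]:
        store[key] = "kv-block"  # cache miss: compute and store
    cached = hit_blocks * BLOCK
    return cached, len(tokens) - cached

docs = [f"doc{i}" for i in range(8)]
first = handle(docs + ["explain", "auth"])       # (0, 10): cold cache
second = handle(docs + ["explain", "database"])  # (8, 2): prefix blocks hit
```

Because each block key includes its parent's key, a hit on a block implies that every token before it matched as well.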
The difficult engineering problems include:
Cache Eviction
GPU memory is limited.
Providers must decide:
- Which caches stay resident?
- Which caches get removed?
- Which users receive priority?
This resembles operating system page management more than traditional web caching.
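As one concrete example of an eviction policy, here is a least-recently-used (LRU) sketch. Whether a given provider uses LRU, and how it weighs user priority, is not something the policy below claims to know; the class name, capacity, and prefix ids are hypothetical.

```python
from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.used = 0
        self.entries = OrderedDict()  # prefix id -> size in tokens, oldest first

    def get(self, prefix_id):
        if prefix_id not in self.entries:
            return False
        self.entries.move_to_end(prefix_id)  # mark as most recently used
        return True

    def put(self, prefix_id, size_tokens):
        if size_tokens > self.capacity:
            return  # can never fit, so don't evict anything for it
        if prefix_id in self.entries:
            self.used -= self.entries.pop(prefix_id)
        # Evict least recently used entries until the new one fits.
        while self.entries and self.used + size_tokens > self.capacity:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[prefix_id] = size_tokens
        self.used += size_tokens

cache = LRUPrefixCache(capacity_tokens=50_000)
cache.put("docs-a", 20_000)
cache.put("docs-b", 20_000)
cache.get("docs-a")           # docs-a is now the most recently used
cache.put("docs-c", 20_000)   # does not fit: evicts docs-b, the oldest
resident = list(cache.entries)  # ["docs-a", "docs-c"]
```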
Distributed Inference
Large serving systems spread requests across many GPUs.
A cached prefix may exist on GPU A while the next request arrives on GPU B.
Providers must either:
- Move cache data
- Route requests intelligently
- Recompute the prefix
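The "route requests intelligently" option can be as simple as sending every request with the same leading tokens to the same worker. The sketch below is an assumption about how that might look, with hypothetical worker names and an arbitrary choice of how many tokens decide the route.

```python
import hashlib

WORKERS = ["gpu-a", "gpu-b", "gpu-c", "gpu-d"]  # hypothetical worker names
ROUTING_TOKENS = 4  # how many leading tokens decide the route (assumed)

def route(tokens):
    """Pick a worker from a hash of the request's leading tokens."""
    head = "\x1f".join(tokens[:ROUTING_TOKENS]).encode()
    digest = hashlib.sha256(head).digest()
    return WORKERS[int.from_bytes(digest[:8], "big") % len(WORKERS)]

prefix = ["<system>", "rules", "arch-docs", "standards"]
worker_1 = route(prefix + ["explain", "auth"])
worker_2 = route(prefix + ["explain", "database"])
# Both requests share the routing prefix, so they land on the same worker.
```

A plain modulo like this remaps most prefixes whenever the number of workers changes; consistent hashing is a common way to reduce that churn, at the cost of some extra bookkeeping.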
Multi-Tenant Isolation
A cache created for one customer should not leak information to another customer.
Production systems must maintain strict isolation boundaries.
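One straightforward way to enforce that boundary, shown here purely as an assumption and not as any provider's actual mechanism, is to make the tenant part of the cache key so identical prompts from different customers never resolve to the same entry. The tenant ids and `cache_key` helper are hypothetical.

```python
import hashlib

def cache_key(tenant_id, tokens):
    """Derive a cache key that is scoped to a single tenant."""
    # The NUL separator keeps the tenant id from blending into the tokens.
    payload = tenant_id + "\x00" + "\x1f".join(tokens)
    return hashlib.sha256(payload.encode()).hexdigest()

prompt = ["<system>", "internal", "docs"]
key_acme = cache_key("tenant-acme", prompt)
key_globex = cache_key("tenant-globex", prompt)
# Same prompt, different tenants: the keys differ, so no cross-tenant hits.
```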
Why RAG Applications Benefit So Much
Retrieval-Augmented Generation systems are strong candidates for prompt caching, provided the retrieved context stays the same from one request to the next.
Imagine a code assistant.
Every request includes:
System Prompt
Repository Rules
Architecture Docs
Coding Standards
Only the user question changes.
Without caching:
20,000 tokens processed
20,000 tokens processed
20,000 tokens processed
20,000 tokens processed
With caching:
20,000-token prefix processed once
Request 2:
reuse cache
Request 3:
reuse cache
Request 4:
reuse cache
Latency drops.
GPU compute per request drops.
Cost drops.
This is one reason why modern coding assistants can feel much faster than their raw context sizes would suggest.
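The arithmetic behind the comparison above is short enough to write out. The 100-token question size is an assumption, and the count only covers prompt tokens that need full processing; it ignores the cheaper work of new tokens attending over the cached prefix.

```python
PREFIX_TOKENS = 20_000
QUESTION_TOKENS = 100  # assumed size of each user question
REQUESTS = 4

# Every request reprocesses the full prompt.
without_cache = REQUESTS * (PREFIX_TOKENS + QUESTION_TOKENS)

# The prefix is processed once; each request only adds its own question.
with_cache = PREFIX_TOKENS + REQUESTS * QUESTION_TOKENS

savings = 1 - with_cache / without_cache
print(without_cache, with_cache, f"{savings:.0%}")  # 80400 20400 75%
```

The savings grow with the number of requests that share the prefix, approaching the prefix's share of the total prompt.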
The Future: Beyond Exact Prefix Caching
Today's prompt caching mostly relies on exact token matches.
Researchers are exploring more ambitious ideas:
- Partial-prefix reuse
- Semantic cache matching
- Attention state compression
- Cross-session cache persistence
- Hierarchical context caching
The challenge is preserving correctness.
Exact matches guarantee identical transformer states.
Approximate matches introduce uncertainty.
Future systems may combine both approaches, using exact caches when possible and semantic reuse when beneficial.
Final Thoughts
Prompt caching is one of the least visible but most impactful optimizations in modern LLM serving.
The important realization is that the cache is not storing text and it is not storing generated responses.
It is storing the expensive internal transformer state—primarily key and value tensors—that would otherwise need to be recomputed.
Once you understand that, prompt caching starts looking less like an application-level optimization and more like a CPU instruction cache or an operating system memory cache: a mechanism for avoiding repeated work by preserving computation that has already been paid for.
As context windows continue growing from tens of thousands to millions of tokens, do you think exact prefix caching will remain dominant, or will future LLM systems need semantic and approximate caching techniques to stay efficient?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.