How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results
LLM inference is expensive. The prefill step, which processes the prompt, is one of the biggest costs, especially for long prompts. If you've seen the same prompt before, you shouldn't have to recompute it.
That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?
I built llm-serving-cache to answer that question — a metadata-driven control plane for LLM KV-cache placement and routing.
The Problem
In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.
In a distributed setup:
- Cached prefixes are scattered across nodes
- The same prompt might be cached on node-a but the request lands on node-b
- Cache misses are expensive — full prefill cost, every time
- GPU memory is finite, so you need admission control and eviction

You need a control plane that knows where every cached prefix lives and routes requests intelligently.
The Architecture
Client Request
↓
Router
↓
Session Affinity Check → route to same node if session exists
↓
Exact Cache Hit? → reuse cached result, skip prefill
↓
Prefix Match? → reuse partial computation
↓
Cache Miss → select best node, trigger cache fill
↓
[If full] Evict → remove oldest inactive request
↓
Inference + Register → store new cache entry
↓
WAL-backed Metadata Store
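The routing order in the diagram above can be sketched in a few lines. This is a minimal illustration, not the project's actual router: the `Router` and `Decision` names, the token-tuple cache keys, and the linear prefix scan are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Decision:
    kind: str                # "session", "exact_hit", "prefix_hit", or "miss"
    node: Optional[str]      # None on a miss: placement picks the node later
    reused_tokens: int = 0   # prompt tokens whose KV state can be reused


@dataclass
class Router:
    # Hypothetical in-memory metadata; the real system persists this.
    sessions: dict = field(default_factory=dict)  # session_id -> node
    cache: dict = field(default_factory=dict)     # token tuple -> node

    def register(self, tokens, node, session_id=None):
        """Record that `node` holds the KV cache for `tokens`."""
        self.cache[tuple(tokens)] = node
        if session_id is not None:
            self.sessions[session_id] = node

    def route(self, tokens, session_id=None):
        tokens = tuple(tokens)
        # 1. Session affinity: stick to the node that served this session.
        if session_id in self.sessions:
            return Decision("session", self.sessions[session_id])
        # 2. Exact hit: the whole prompt is already cached somewhere.
        if tokens in self.cache:
            return Decision("exact_hit", self.cache[tokens], len(tokens))
        # 3. Prefix match: pick the longest cached prefix of this prompt.
        best_len, best_node = 0, None
        for cached, node in self.cache.items():
            if len(cached) > best_len and tokens[:len(cached)] == cached:
                best_len, best_node = len(cached), node
        if best_node is not None:
            return Decision("prefix_hit", best_node, best_len)
        # 4. Miss: the placement policy chooses a node and fills the cache.
        return Decision("miss", None)


router = Router()
router.register([1, 2, 3], "node-a", session_id="s1")
print(router.route([1, 2, 3]).kind)       # exact_hit
print(router.route([1, 2, 3, 4]).kind)    # prefix_hit
print(router.route([9, 9]).kind)          # miss
print(router.route([9, 9], "s1").kind)    # session
```

A production router would likely replace the linear scan with a trie or hashed block prefixes, but the decision order is the part that matters here.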
Core Components
Router — handles exact hits, prefix matches, session affinity, and cache misses.
Node Registry — tracks available nodes, GPU memory, and utilization.
Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).
Placement Policy — best-fit node selection based on available GPU memory blocks.
Benchmark Results
I ran controlled benchmarks across five cache strategies:
| Scenario | Avg Latency | P95 Latency | Hit Rate | Rejection Rate |
|---|---|---|---|---|
| No Cache | 1405 ms | 1405 ms | 0% | 0% |
| Prefix Reuse | 985 ms | 1405 ms | 50% | 0% |
| Exact Cache | 205 ms | 205 ms | 100% | 0% |
| GPU-Aware | 843 ms | 1405 ms | 25% | 25% |
| GPU-Aware + Eviction | 1895 ms | 4205 ms | 25% | 0% |
Key observations:
- Exact cache reuse reduces latency by ~85% vs no cache
- Prefix reuse improves average latency but not tail latency — P95 stays high when misses are still present
- Eviction reduces rejection but increases latency by admitting previously rejected expensive requests
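To see why a 50% hit rate moves the average but not the tail, here is a small sketch using the table's numbers. The 565 ms prefix-hit latency is an assumption: it is not reported above, only inferred as the value that makes a 50/50 mix with 1405 ms misses average to 985 ms.

```python
import math
from statistics import mean


def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


MISS_MS = 1405        # "No Cache" row
EXACT_HIT_MS = 205    # "Exact Cache" row
PREFIX_HIT_MS = 565   # assumption: inferred, not measured

# 50% prefix hits, 50% misses, as in the "Prefix Reuse" row.
prefix_reuse = [PREFIX_HIT_MS] * 50 + [MISS_MS] * 50

avg_ms = mean(prefix_reuse)
tail_ms = p95(prefix_reuse)
reduction = 1 - EXACT_HIT_MS / MISS_MS

print(avg_ms)               # 985
print(tail_ms)              # 1405
print(f"{reduction:.1%}")   # 85.4%
```

As long as more than 5% of requests miss, the 95th-percentile request is a miss, so P95 stays at the full prefill cost.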
Real Inference Validation (Ollama)
Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:
| Scenario | Total Latency | Prompt Eval | Decode |
|---|---|---|---|
| Cold request | ~8,488 ms | 177 ms | 5,238 ms |
| Warm request | ~5,520 ms | 47 ms | 5,372 ms |
| Prefix-related | ~5,891 ms | 47 ms | 5,747 ms |
Warm requests drop prompt evaluation from 177ms → 47ms. But total latency is still ~5.5 seconds because decode dominates.
This is the key insight: caching helps prefill, but token generation is the real bottleneck in real inference systems.
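The arithmetic behind that claim, as a quick sketch using the cold and warm rows from the table. The "other" bucket is simply whatever the total contains beyond prompt eval and decode; the table does not break it down, so the code does not interpret it. Totals are approximate in the source, so treat the shares as rough.

```python
runs = {
    "cold": {"total": 8488, "prompt_eval": 177, "decode": 5238},
    "warm": {"total": 5520, "prompt_eval": 47, "decode": 5372},
}


def breakdown(run):
    """Share of total latency per phase, plus the unexplained remainder."""
    other = run["total"] - run["prompt_eval"] - run["decode"]
    return {
        "prefill_share": run["prompt_eval"] / run["total"],
        "decode_share": run["decode"] / run["total"],
        "other_ms": other,
    }


for name, run in runs.items():
    b = breakdown(run)
    print(f"{name}: prefill {b['prefill_share']:.1%}, "
          f"decode {b['decode_share']:.1%}, other {b['other_ms']} ms")

# How much of the cold-to-warm improvement came from prompt evaluation?
prefill_saved_ms = runs["cold"]["prompt_eval"] - runs["warm"]["prompt_eval"]
total_saved_ms = runs["cold"]["total"] - runs["warm"]["total"]
print(prefill_saved_ms, total_saved_ms)   # 130 2968
```

On the warm request, prompt evaluation is under 1% of total latency and decode is about 97%, so even a perfect prefill cache cannot move the total much for this workload.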
GPU Memory Model
GPU memory is modeled as discrete fixed-size blocks (16MB each):
total_blocks = total_vram_mb / block_size
required_blocks = ceil(kv_size_mb / block_size)
Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
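A minimal sketch of the block math and best-fit selection described above. The function names, the dictionary of free blocks per node, and the name-based tie-break are assumptions for illustration, not the project's API.

```python
import math

BLOCK_MB = 16  # fixed block size from the memory model


def required_blocks(kv_size_mb):
    """Blocks needed for a KV cache of the given size, rounded up."""
    return math.ceil(kv_size_mb / BLOCK_MB)


def best_fit(free_blocks_by_node, kv_size_mb):
    """Return the node with the fewest leftover blocks, or None if none fit."""
    need = required_blocks(kv_size_mb)
    leftover = {
        node: free - need
        for node, free in free_blocks_by_node.items()
        if free >= need
    }
    if not leftover:
        return None
    # Ties are broken by node name so the choice is deterministic.
    return min(leftover, key=lambda node: (leftover[node], node))


free = {"node-a": 40, "node-b": 10, "node-c": 7}
print(required_blocks(100))   # 7
print(best_fit(free, 100))    # node-c
print(best_fit(free, 1000))   # None
```

A 100 MB cache needs 7 blocks, and node-c is chosen because it is left with zero spare blocks, which keeps the larger free regions on node-a and node-b intact for bigger requests.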
Under memory pressure:
1. Attempt allocation
2. If insufficient → trigger eviction of oldest inactive request
3. Retry allocation
4. If still insufficient → reject request with explicit reason
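The four steps above can be sketched as a small allocator. This is an illustration under assumptions: the `NodeMemory` class, the single-eviction retry, and the wording of the rejection reason are hypothetical and may differ from the project's implementation.

```python
import math
from collections import OrderedDict


class NodeMemory:
    def __init__(self, total_vram_mb, block_mb=16):
        self.block_mb = block_mb
        self.total_blocks = total_vram_mb // block_mb
        # request_id -> (blocks, active); insertion order is oldest first.
        self.entries = OrderedDict()

    def free_blocks(self):
        return self.total_blocks - sum(b for b, _ in self.entries.values())

    def _oldest_inactive(self):
        return next(
            (rid for rid, (_, active) in self.entries.items() if not active),
            None,
        )

    def allocate(self, request_id, kv_size_mb):
        """Return (accepted, reason)."""
        need = math.ceil(kv_size_mb / self.block_mb)
        # Steps 1-2: if the first attempt would not fit, evict once.
        if need > self.free_blocks():
            victim = self._oldest_inactive()
            if victim is not None:
                del self.entries[victim]
        # Steps 3-4: retry, then reject with an explicit reason.
        if need > self.free_blocks():
            return False, f"need {need} blocks, only {self.free_blocks()} free"
        self.entries[request_id] = (need, True)
        return True, "ok"

    def finish(self, request_id):
        """Mark a request inactive so its blocks become evictable."""
        blocks, _ = self.entries[request_id]
        self.entries[request_id] = (blocks, False)


mem = NodeMemory(total_vram_mb=64)     # 4 blocks
mem.allocate("r1", 32)                 # uses 2 blocks
mem.allocate("r2", 32)                 # uses 2 blocks, node is now full
rejected = mem.allocate("r3", 16)      # nothing inactive to evict
mem.finish("r1")
accepted = mem.allocate("r3", 16)      # evicts r1, then fits
print(rejected)   # (False, 'need 1 blocks, only 0 free')
print(accepted)   # (True, 'ok')
```

Only inactive entries are eviction candidates, so an in-flight request never loses its cache to make room for a new one.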
Admission Control Under Load
The most important result from the concurrent benchmark:
| Concurrency | Avg Latency | P95 Latency | Throughput |
|---|---|---|---|
| 1 | 5,771 ms | 5,771 ms | 0.17 req/s |
| 3 | 10,963 ms | 16,299 ms | 0.18 req/s |
| 5 | 16,560 ms | 27,744 ms | 0.18 req/s |
| 10 | 29,040 ms | 53,525 ms | 0.19 req/s |
Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
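A back-of-the-envelope model makes the pattern concrete. The sketch below assumes the runtime behaves like a single server that processes one request at a time, with all requests arriving together and each taking the single-request latency of 5.771 s. That is a modeling assumption, not something measured here.

```python
import math


def serial_queue(n, service_s):
    """n requests arrive together; the server runs them one after another."""
    latencies = [service_s * (i + 1) for i in range(n)]  # completion times
    avg = sum(latencies) / n
    p95 = latencies[math.ceil(95 * n / 100) - 1]         # nearest-rank P95
    throughput = n / latencies[-1]                       # requests per second
    return avg, p95, throughput


for n in (1, 3, 5, 10):
    avg, p95, tput = serial_queue(n, 5.771)
    print(f"concurrency={n}: avg {avg:.1f}s, p95 {p95:.1f}s, {tput:.2f} req/s")
```

Under this model throughput is fixed at 1 / 5.771 ≈ 0.17 req/s and average latency grows as (n + 1) / 2 times the service time. Its predictions land within about 10% of the measured averages and P95s in the table, which is consistent with the inference runtime serializing the work.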
With admission control (--max-active=3):
| | No Control | With Control |
|---|---|---|
| Accepted | 10 | 3 |
| P95 Latency | ~53.5s | ~20.7s |
Good systems don't try to serve everyone. They protect latency by rejecting excess load.
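The core of an admission gate is a bounded counter. The sketch below is a hypothetical `AdmissionController` that mirrors the behavior of `--max-active=3`; it is not the project's code, and it rejects immediately rather than queueing.

```python
import threading


class AdmissionController:
    def __init__(self, max_active):
        self.max_active = max_active
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self):
        """Admit the request if a slot is free; otherwise reject at once."""
        with self._lock:
            if self._active >= self.max_active:
                return False
            self._active += 1
            return True

    def release(self):
        """Call when an admitted request finishes."""
        with self._lock:
            self._active -= 1


gate = AdmissionController(max_active=3)
results = [gate.try_admit() for _ in range(10)]  # 10 simultaneous arrivals
print(results.count(True), results.count(False))  # 3 7

gate.release()            # one request completes
print(gate.try_admit())   # True
```

With ten simultaneous arrivals, three are accepted and seven are rejected up front, matching the Accepted row in the table.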
What I Learned
Prefix reuse is valuable but not sufficient. Caching cuts prefill cost (prompt evaluation dropped from 177 ms to 47 ms in the Ollama runs), but generation cost dominates real LLM serving. Effective optimization needs to address both.
Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=10 was more than 9× the single-request latency (~53.5 s vs ~5.8 s).
Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.
WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20 ms, which is negligible next to multi-second inference latency. Persistence is effectively free at this scale.
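For readers unfamiliar with the pattern, here is a generic write-ahead-log sketch: every mutation is appended and synced to disk before the in-memory map changes, and recovery replays the log. The `WalStore` class and its JSON-lines record format are assumptions for illustration only and say nothing about how VeriStore is implemented.

```python
import json
import os
import tempfile


class WalStore:
    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()                       # recover state from the log
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break                    # torn final write: stop here
                if rec["op"] == "put":
                    self.data[rec["key"]] = rec["value"]
                else:
                    self.data.pop(rec["key"], None)

    def _append(self, rec):
        self._log.write(json.dumps(rec) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())         # durable before we acknowledge

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def close(self):
        self._log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.wal")
    store = WalStore(path)
    store.put("prefix:abc", "node-a")
    store.put("prefix:def", "node-b")
    store.delete("prefix:def")
    store.close()

    recovered = WalStore(path)               # simulates a restart
    recovered_data = dict(recovered.data)
    recovered.close()

print(recovered_data)   # {'prefix:abc': 'node-a'}
```

Replay is a sequential read of small records, which is why recovering a few thousand entries can be cheap compared with a single inference request.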
Try It Yourself
git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build
./build/routing_demo
./build/cache_register_demo
GitHub: https://github.com/NasitSony/llm-serving-cache
This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.
If you found this useful, a ⭐ on GitHub goes a long way!