Nasit Sony
How I Built a KV-Cache Control Plane for LLM Inference — With Real Benchmark Results

LLM inference is expensive. The prefill step, which processes the prompt, is a significant part of that cost, especially for long prompts. If you've seen the same prompt before, you shouldn't have to recompute it.

That's the core idea behind KV-cache reuse. But in a distributed system with multiple inference nodes, a new problem emerges: where is the cached prefix stored, and how do you route requests to maximize reuse?

To answer that question, I built llm-serving-cache: a metadata-driven control plane for LLM KV-cache placement and routing.


The Problem

In a single-node setup, KV-cache reuse is straightforward. The cache is local and the router is trivial.

In a distributed setup:

  • Cached prefixes are scattered across nodes
  • The same prompt might be cached on node-a but the request lands on node-b
  • Cache misses are expensive — full prefill cost, every time
  • GPU memory is finite, so you need admission control and eviction

You need a control plane that knows where every cached prefix lives and routes requests intelligently.

The Architecture

Client Request
      ↓
Router
      ↓
Session Affinity Check   → route to same node if session exists
      ↓
Exact Cache Hit?         → reuse cached result, skip prefill
      ↓
Prefix Match?            → reuse partial computation
      ↓
Cache Miss               → select best node, trigger cache fill
      ↓
[If full] Evict          → remove oldest inactive request
      ↓
Inference + Register     → store new cache entry
      ↓
WAL-backed Metadata Store

Core Components

Router — handles exact hits, prefix matches, session affinity, and cache misses.

Node Registry — tracks available nodes, GPU memory, and utilization.

Metadata Store — persists cache entries and session routes via a WAL-backed KV engine (VeriStore).

Placement Policy — best-fit node selection based on available GPU memory blocks.
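To make the routing order concrete, here is a minimal Python sketch of the decision chain from the diagram above. It is an illustration, not the project's actual code: the `Router` class, its `sessions` and `cache` dictionaries, and the `fallback_node` parameter are all hypothetical names, and prompts are modeled as plain token tuples.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Hypothetical in-memory stand-ins for the metadata store.
    sessions: dict = field(default_factory=dict)  # session_id -> node
    cache: dict = field(default_factory=dict)     # token tuple -> node

    def route(self, session_id, tokens, fallback_node):
        """Return (node, decision), checking in the order the diagram shows."""
        key = tuple(tokens)
        if session_id in self.sessions:
            return self.sessions[session_id], "session_affinity"
        if key in self.cache:
            node, decision = self.cache[key], "exact_hit"
        else:
            # Longest cached entry that is a prefix of this request, if any.
            prefixes = [p for p in self.cache if key[:len(p)] == p]
            if prefixes:
                node, decision = self.cache[max(prefixes, key=len)], "prefix_hit"
            else:
                # Miss: use the caller's placement choice and register the entry.
                node, decision = fallback_node, "miss"
                self.cache[key] = node
        self.sessions[session_id] = node
        return node, decision


router = Router()
first = router.route("s1", [1, 2, 3], "node-a")      # nothing cached yet
exact = router.route("s2", [1, 2, 3], "node-b")      # same prompt, new session
prefix = router.route("s3", [1, 2, 3, 4], "node-b")  # extends a cached prompt
sticky = router.route("s1", [9, 9], "node-b")        # known session
print(first, exact, prefix, sticky)
```

In this sketch the fallback node is only used on a miss. In the real system that choice would come from the placement policy rather than from the caller.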


Benchmark Results

I ran controlled benchmarks across five cache strategies:

| Scenario | Avg Latency | P95 Latency | Hit Rate | Rejection Rate |
| --- | --- | --- | --- | --- |
| No Cache | 1405 ms | 1405 ms | 0% | 0% |
| Prefix Reuse | 985 ms | 1405 ms | 50% | 0% |
| Exact Cache | 205 ms | 205 ms | 100% | 0% |
| GPU-Aware | 843 ms | 1405 ms | 25% | 25% |
| GPU-Aware + Eviction | 1895 ms | 4205 ms | 25% | 0% |

Key observations:

  • Exact cache reuse reduces average latency by ~85% vs no cache (1405 ms to 205 ms)
  • Prefix reuse improves average latency but not tail latency: P95 stays at 1405 ms because half the requests still miss
  • Eviction removes rejections (25% to 0%) but raises average latency from 843 ms to 1895 ms by admitting previously rejected expensive requests
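The gap between average and tail latency is easy to reproduce with a toy sample. The sketch below assumes four requests, two misses at 1405 ms and two prefix hits at 565 ms. The 565 ms figure is my assumption, chosen so the mean matches the Prefix Reuse row; the benchmark's per-request latencies are not listed here. The nearest-rank percentile helper is also just one common definition of P95.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request latencies in ms: two misses, two prefix hits.
latencies_ms = [1405, 565, 1405, 565]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
exact_reduction = (1405 - 205) / 1405  # No Cache vs Exact Cache, from the table

print(f"mean={mean_ms} ms, p95={p95_ms} ms, exact-cache reduction={exact_reduction:.0%}")
```

As long as any request in the sample pays the full miss cost, the high percentiles stay pinned to that cost even though the mean drops.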

Real Inference Validation (Ollama)

Benchmarks are useful, but I wanted to validate against real inference. I integrated Ollama running Llama 3.1 8B and ran controlled experiments:

| Scenario | Total Latency | Prompt Eval | Decode |
| --- | --- | --- | --- |
| Cold request | ~8,488 ms | 177 ms | 5,238 ms |
| Warm request | ~5,520 ms | 47 ms | 5,372 ms |
| Prefix-related | ~5,891 ms | 47 ms | 5,747 ms |

Warm requests cut prompt evaluation from 177 ms to 47 ms, but total latency is still ~5.5 seconds because decode dominates.

This is the key insight: caching helps prefill, but in this setup token generation is the real bottleneck.
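A quick back-of-the-envelope check on the table makes the point. This snippet only does arithmetic on the cold and warm rows above; the dictionary layout and variable names are mine.

```python
# Figures from the Ollama table above, in milliseconds.
cold = {"total": 8488, "prompt_eval": 177, "decode": 5238}
warm = {"total": 5520, "prompt_eval": 47, "decode": 5372}

prefill_saved_ms = cold["prompt_eval"] - warm["prompt_eval"]
decode_share_warm = warm["decode"] / warm["total"]
prefill_share_warm = warm["prompt_eval"] / warm["total"]

print(f"prefill saved by the warm cache: {prefill_saved_ms} ms")
print(f"decode share of warm latency:    {decode_share_warm:.1%}")
print(f"prefill share of warm latency:   {prefill_share_warm:.1%}")
```

Even a perfect prefill cache could only remove the prompt-eval slice of the warm request, so the remaining latency has to be attacked on the decode side.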


GPU Memory Model

GPU memory is modeled as discrete fixed-size blocks (16 MB each):

total_blocks = floor(total_vram_mb / block_size)
required_blocks = ceil(kv_size_mb / block_size)

Best-fit placement selects the node with the minimum leftover blocks after allocation, reducing fragmentation.
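As a rough illustration of the block math and best-fit rule, here is a small Python sketch. The function names, the `free_blocks_by_node` dictionary, and the example node sizes are hypothetical; only the 16 MB block size and the "minimum leftover" rule come from the description above.

```python
import math

BLOCK_SIZE_MB = 16  # block size stated in the article


def required_blocks(kv_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Blocks needed for a KV cache of kv_size_mb, rounded up to whole blocks."""
    return math.ceil(kv_size_mb / block_size_mb)


def best_fit(free_blocks_by_node, needed):
    """Pick the node with the fewest blocks left over after allocation, or None."""
    leftovers = {
        node: free - needed
        for node, free in free_blocks_by_node.items()
        if free >= needed
    }
    if not leftovers:
        return None  # no node can hold the request
    return min(leftovers, key=leftovers.get)


# Hypothetical cluster state: free blocks per node.
free = {"node-a": 64, "node-b": 10, "node-c": 7}
needed = required_blocks(100)   # a 100 MB KV cache
chosen = best_fit(free, needed)
print(needed, chosen)
```

Choosing the tightest fit keeps the large free region on node-a intact for a later, bigger request, which is the fragmentation argument in practice.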

Under memory pressure:

  1. Attempt allocation
  2. If insufficient → trigger eviction of oldest inactive request
  3. Retry allocation
  4. If still insufficient → reject request with explicit reason
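The four steps above can be sketched for a single node as follows. This is a simplified model under my own assumptions: the `NodeMemory` class, its method names, and the reason strings are hypothetical, and "oldest" is taken to mean insertion order.

```python
from collections import OrderedDict


class NodeMemory:
    """Toy block allocator for one node; entries are kept oldest first."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.entries = OrderedDict()  # request_id -> [blocks, active]

    def free_blocks(self):
        return self.total_blocks - sum(blocks for blocks, _ in self.entries.values())

    def finish(self, request_id):
        # The cache entry stays resident but becomes eligible for eviction.
        self.entries[request_id][1] = False

    def admit(self, request_id, blocks):
        """Return (accepted, reason)."""
        evicted = False
        if self.free_blocks() < blocks:                      # step 1 failed
            victim = next(
                (rid for rid, (_, active) in self.entries.items() if not active),
                None,
            )
            if victim is not None:                           # step 2: evict oldest inactive
                del self.entries[victim]
                evicted = True
        if self.free_blocks() < blocks:                      # step 3: retry
            return False, "rejected: insufficient GPU memory"  # step 4
        self.entries[request_id] = [blocks, True]
        return True, "admitted after eviction" if evicted else "admitted"


mem = NodeMemory(total_blocks=8)
r1 = mem.admit("r1", 5)
mem.finish("r1")
r2 = mem.admit("r2", 5)   # needs r1's blocks, and r1 is inactive
r3 = mem.admit("r3", 5)   # r2 is still active, so nothing can be evicted
print(r1, r2, r3)
```

Note that this sketch evicts at most one entry per attempt, matching the wording of step 2. A production allocator might keep evicting until the request fits, or skip eviction when it cannot help.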

Admission Control Under Load

The most important result from the concurrent benchmark:

| Concurrency | Avg Latency | P95 Latency | Throughput |
| --- | --- | --- | --- |
| 1 | 5,771 ms | 5,771 ms | 0.17 req/s |
| 3 | 10,963 ms | 16,299 ms | 0.18 req/s |
| 5 | 16,560 ms | 27,744 ms | 0.18 req/s |
| 10 | 29,040 ms | 53,525 ms | 0.19 req/s |

Throughput stays flat while latency explodes. This is classic queuing behavior — the bottleneck is the inference runtime, not the control plane.
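One plausible way to read these numbers is a toy model in which all N requests arrive together and the runtime serves them one at a time. The sketch below assumes exactly that, plus a per-request service time of 5.5 s picked by eye from the table. Neither assumption is a statement about how the benchmark or the inference runtime actually schedules work.

```python
def serial_batch(n_requests, service_s):
    """Latency stats when n_requests arrive at once and are served one by one."""
    # Request i waits for the i requests ahead of it, then runs.
    latencies = [service_s * (i + 1) for i in range(n_requests)]
    avg_s = sum(latencies) / n_requests
    worst_s = latencies[-1]
    throughput = n_requests / worst_s  # requests completed per second
    return avg_s, worst_s, throughput


SERVICE_S = 5.5  # assumed per-request service time, not a measured value

for n in (1, 3, 5, 10):
    avg_s, worst_s, throughput = serial_batch(n, SERVICE_S)
    print(f"concurrency={n:2d} avg={avg_s:5.1f}s worst={worst_s:5.1f}s "
          f"throughput={throughput:.2f} req/s")
```

Under these assumptions throughput is fixed at one request per service time, while average and worst-case latency grow linearly with concurrency. That is the same shape as the measured table.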

With admission control (--max-active=3):

| Metric | No Control | With Control |
| --- | --- | --- |
| Accepted | 10 | 3 |
| P95 Latency | ~53.5s | ~20.7s |

Good systems don't try to serve everyone. They protect latency by rejecting excess load.
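A max-active limit like this can be sketched with a non-blocking semaphore. The `AdmissionGate` class below is a hypothetical illustration in Python, not the project's implementation. Only the limit of 3 and the burst of 10 requests mirror the experiment above.

```python
import threading


class AdmissionGate:
    """Hypothetical gate that admits at most max_active requests at a time."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def try_enter(self):
        # Non-blocking: returns False immediately when every slot is taken.
        return self._slots.acquire(blocking=False)

    def leave(self):
        # Call once for each successful try_enter, when the request finishes.
        self._slots.release()


gate = AdmissionGate(max_active=3)
decisions = [gate.try_enter() for _ in range(10)]  # 10 requests arrive together
accepted = sum(decisions)
rejected = len(decisions) - accepted
print(f"accepted={accepted} rejected={rejected}")
```

Rejecting immediately, rather than queuing, is what keeps the tail latency of the admitted requests bounded. The rejected callers get a fast, explicit answer and can retry or go elsewhere.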


What I Learned

Prefix reuse is valuable but not sufficient. Caching reduces prefill cost, but generation cost dominated in my measurements. Effective optimization needs to address both.

Single-request latency is misleading. Always benchmark under concurrency. P95 at concurrency=10 was more than 9× the single-request time (53,525 ms vs 5,771 ms).

Admission control is more important than caching. A system that accepts everything under load will have terrible tail latency. Reject early, protect your SLA.

WAL-backed metadata is fast. Storage recovery for 5,000 cache entries takes ~20 ms, which is invisible next to multi-second inference latency. Persistence is effectively free at this scale.
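For readers who have not worked with a write-ahead log, here is a generic Python sketch of the append-then-replay idea. It is not VeriStore's format or API: the `WalStore` class, the JSON-lines record layout, and the key names are all assumptions made for illustration, and it ignores details such as torn writes and log compaction.

```python
import json
import os
import tempfile


class WalStore:
    """Toy key-value store: every write is appended to a log before it is applied."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Recovery: replay the log from the start to rebuild in-memory state.
            with open(path, "r", encoding="utf-8") as log:
                for line in log:
                    self._apply(json.loads(line))
        self._log = open(path, "a", encoding="utf-8")

    def _apply(self, record):
        if record["op"] == "put":
            self.data[record["key"]] = record["value"]
        elif record["op"] == "delete":
            self.data.pop(record["key"], None)

    def _append(self, record):
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # make the record durable before applying it
        self._apply(record)

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})

    def delete(self, key):
        self._append({"op": "delete", "key": key})

    def close(self):
        self._log.close()


with tempfile.TemporaryDirectory() as tmp:
    wal_path = os.path.join(tmp, "metadata.wal")
    store = WalStore(wal_path)
    store.put("prefix:abc", "node-a")
    store.put("session:s1", "node-a")
    store.delete("session:s1")
    store.close()

    recovered = WalStore(wal_path)  # simulates a restart
    recovered_data = dict(recovered.data)
    recovered.close()

print(recovered_data)
```

Recovery cost in a design like this is a sequential read of the log, which is why replaying a few thousand small records is cheap compared with a single inference call.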


Try It Yourself

git clone --recurse-submodules https://github.com/NasitSony/llm-serving-cache.git
cd llm-serving-cache
cmake -S . -B build
cmake --build build

./build/routing_demo
./build/cache_register_demo

GitHub: https://github.com/NasitSony/llm-serving-cache


This project is the inference serving layer of a larger AI infrastructure stack. The storage layer underneath is VeriStore. The workload orchestration layer above is Veriflow.

If you found this useful, a ⭐ on GitHub goes a long way!
