prasanna kanagasabai

tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

The Problem

When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs the full prefill from scratch, a cost that grows quadratically with sequence length. On a 30,000-token document that's 10+ seconds of prefill, paid again every time the same prompt reappears.
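
For intuition, here's a back-of-envelope extrapolation from that single measured point. This is a rough sketch: real prefill also has linear terms, so a naive quadratic fit overstates the high end.

# Naive O(n^2) extrapolation of prefill cost from one measured point.
# Treat the numbers as an upper bound, not a benchmark.
measured_n, measured_s = 30_000, 10.75  # tokens, seconds (benchmark below)

def prefill_estimate(n: int) -> float:
    return measured_s * (n / measured_n) ** 2

for n in (8_000, 30_000, 64_000, 128_000):
    print(f"{n:>7} tokens ≈ {prefill_estimate(n):6.1f} s per repeated prompt")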

tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required.


Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)

We ran the Apple FY2025 10-K filing through three scenarios:

  • Full cold prefill (no cache): 10.75 s
  • GPU cache hit (blocks already in VRAM): 1.19 s
  • Cold vault restore: 0.52 s

The vault restore is 20× faster than the full prefill, and faster than the GPU cache hit.

Vault restore beats GPU cache hit because it bypasses attention computation entirely. GPU hits still run partial attention; vault blocks go straight into the buffer. The gap widens with context length — projected ~35× speedup at 128k tokens since prefill is O(n²) and restore is O(n) + network.

tierKV also supports EXO via a post-install patch. On an 8,000-token prompt: 30.83s cold → 4.11s restored (7.5×).


Architecture

Three tiers:

[Hot]  GPU KV cache  — VRAM, in-engine prefix cache
[Cold] KV vault      — LAN machine RAM, ~0.5ms away, gRPC
[Cold] SSM vault     — separate LAN machine for SSM/linear-attention layers

Eviction path: GPU block evicted → TurboQuant INT8 encode → fire-and-forget gRPC Store → GPU block freed immediately.

Restore path: Cache miss → BatchPromote RPC (all layers, one round-trip) → parallel rayon decode (GIL released) → tensors injected into paged KV buffer.
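
To make the two paths concrete, here's a minimal self-contained sketch in Python. The names and the dict-backed vault are illustrative only; in tierKV the store is a remote gRPC service and the codec is the Rust TurboQuant described next.

import torch

vault = {}  # stand-in for the remote KV vault (a gRPC service in tierKV)

def on_evict(block_hash: str, kv_block: torch.Tensor) -> None:
    # Eviction path: INT8-encode, hand off, free the GPU block immediately.
    scale = (kv_block.abs().amax() / 127.0).clamp(min=1e-8)
    q = (kv_block / scale).round().clamp(-127, 127).to(torch.int8)
    vault[block_hash] = (q.cpu(), scale.cpu())  # fire-and-forget Store

def on_miss(block_hashes: list, kv_buffer: torch.Tensor, slots: list) -> None:
    # Restore path: fetch all blocks (one BatchPromote round-trip in tierKV),
    # decode, and copy straight into the paged KV buffer. No attention re-run.
    for slot, h in zip(slots, block_hashes):
        q, scale = vault[h]
        kv_buffer[slot].copy_(q.to(kv_buffer.dtype) * scale)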

TurboQuant is a per-group INT8 quantizer written in Rust. Groups are aligned to attention head boundaries (group size = head dim, e.g. 256 for Qwen3.6-35B-A3B), so outlier heads can't corrupt neighboring groups. Result: 3.9× compression at ≥52 dB SNR.
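The grouping scheme is easy to sketch in plain PyTorch. This is a simplified stand-in for the Rust codec, assuming symmetric quantization with one fp16 scale per head-sized group; the production encoder will differ in details.

import torch

def quantize_grouped(x: torch.Tensor, group: int = 256):
    # One symmetric INT8 scale per group. With group == head dim, an
    # outlier head only inflates its own scale, never a neighbor's.
    g = x.reshape(-1, group)
    scales = (g.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = (g / scales).round().clamp(-127, 127).to(torch.int8)
    return q, scales.to(torch.float16)

def dequantize_grouped(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales.float()).reshape(-1)

# Quick SNR check on data shaped like one KV block (16 groups of 256).
x = torch.randn(16 * 256)
q, s = quantize_grouped(x)
noise = x - dequantize_grouped(q, s)
snr_db = (10 * torch.log10(x.pow(2).mean() / noise.pow(2).mean())).item()
print(f"SNR: {snr_db:.1f} dB")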

Hybrid models like Qwen3.6-35B-A3B (10 full-attention + 30 linear-attention layers) route the two layer types to separate vaults automatically — no manual config per model.
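
The routing itself amounts to a lookup from layer type to the vault endpoints already named in tierkv.toml. A hypothetical sketch of the idea (tierKV derives the real mapping from the model config):

# Hypothetical per-layer vault routing for a hybrid model.
ATTN, SSM = "full_attention", "linear_attention"

ENDPOINTS = {
    ATTN: "192.168.1.10:50051",  # kv_cold in tierkv.toml
    SSM:  "192.168.1.11:50051",  # ssm_cold in tierkv.toml
}

def vault_for(layer_index: int, layer_types: list) -> str:
    return ENDPOINTS[layer_types[layer_index]]

# e.g. a 40-layer hybrid stack (layout illustrative, not the real order)
layer_types = [ATTN] * 10 + [SSM] * 30
assert vault_for(3, layer_types) == ENDPOINTS[ATTN]
assert vault_for(20, layer_types) == ENDPOINTS[SSM]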


Setup

Step 1 — Install on all machines:

pip install tierkv

No Cargo, no cmake. The Rust core is bundled in the wheel.

Step 2 — Configure each machine (tierkv.toml):

Inference node:

[cluster]
role = "inference"
kv_cold = "192.168.1.10:50051"
ssm_cold = "192.168.1.11:50051"

[turbo_quant]
enabled = true
kv_dim = 256  # match your model's attention head dimension

KV vault machine:

[cluster]
role = "kv_cold"

[vault]
max_bytes = 24_000_000_000  # 24 GB

Step 3 — Start vault servers on cold machines:

tierkv vault

Step 4 — Verify connectivity:

tierkv status

Step 5 — Launch vLLM:

vllm serve Qwen/Qwen3-30B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
  }' \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --block-size 16

That's it — no vLLM source changes, no rebuilding. tierKV intercepts eviction and restore automatically.
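
A quick way to see the effect end to end: time the same long prompt twice against the server. The first request pays full prefill; the second hits either the GPU prefix cache or, if the blocks were evicted, the vault. A minimal sketch against vLLM's OpenAI-compatible endpoint (the file path is a placeholder):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
doc = open("apple_fy2025_10k.txt").read()  # placeholder: any long document

for run in ("first (cold)", "second (cached/restored)"):
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",
        messages=[{"role": "user", "content": doc + "\n\nSummarize this filing."}],
        max_tokens=1,  # time prefill, not decode
    )
    print(f"{run}: {time.perf_counter() - t0:.2f}s")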

EXO users: tierkv install --exo-path /path/to/exo patches EXO in place. Then launch EXO as normal.


Our Test Cluster

  • Inference node: NVIDIA DGX Spark (GB10, 96 GB HBM) — runs vLLM or EXO
  • KV cold vault: Apple Mac Pro (M2 Pro, 32 GB RAM) — 24 GB reserved for KV blocks
  • SSM cold vault: Apple MacBook Air (M2, 16 GB RAM) — 12 GB reserved for SSM states
  • Network: 5GbE LAN, ~0.5ms RTT

Deliberately modest hardware. The vault nodes are otherwise idle machines — no GPU required.


When tierKV Helps

  • Repeated long-context prompts (RAG over fixed docs, chat history, system prompts)
  • Multi-user serving with shared prefixes — first request warms the vault, all others benefit
  • Hybrid MoE + SSM models where both layer types need separate cold storage
  • Tight VRAM budget relative to context length

When It Doesn't Help

  • Single-shot prompts that never repeat
  • High-latency networks (WiFi, WAN) — assumes sub-5ms LAN RTT
  • Tensor-parallel multi-GPU inference — not yet supported
  • Very short prompts on hybrid models (below HMA block size threshold)
  • Applications requiring bit-for-bit identical output (use turbo_quant = false)
