The Problem
When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch — quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time the same prompt reappears.
tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required.
Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)
We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took 10.75 seconds. A GPU cache hit (blocks already in VRAM) dropped that to 1.19 seconds. The cold vault restore came in at 0.52 seconds — 20× faster than a full prefill, and faster than the GPU cache hit.
Vault restore beats GPU cache hit because it bypasses attention computation entirely. GPU hits still run partial attention; vault blocks go straight into the buffer. The gap widens with context length — projected ~35× speedup at 128k tokens since prefill is O(n²) and restore is O(n) + network.
tierKV also supports EXO via a post-install patch. On an 8,000-token prompt: 30.83s cold → 4.11s restored (7.5×).
Architecture
Three tiers:
- [Hot] GPU KV cache — VRAM, in-engine prefix cache
- [Cold] KV vault — LAN machine RAM, ~0.5ms away, gRPC
- [Cold] SSM vault — separate LAN machine for SSM/linear-attention layers
Eviction path: GPU block evicted → TurboQuant INT8 encode → fire-and-forget gRPC Store → GPU block freed immediately.
Restore path: Cache miss → BatchPromote RPC (all layers, one round-trip) → parallel rayon decode (GIL released) → tensors injected into paged KV buffer.
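To make the two paths concrete, here is a minimal sketch of the flow in plain Python. It is not tierKV's implementation: a local dict stands in for the gRPC vault, a single per-block INT8 scale stands in for TurboQuant's per-group encoding, and every name is illustrative.
import torch

# A plain dict stands in for the remote vault (in reality a gRPC service on another machine).
vault: dict[tuple[int, int], tuple[torch.Tensor, torch.Tensor]] = {}

def encode_int8(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Stand-in for TurboQuant: one scale for the whole block instead of one per head group.
    scale = block.abs().amax().clamp(min=1e-8) / 127.0
    return (block / scale).round().to(torch.int8), scale

def decode_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

def on_evict(layer: int, block_id: int, block: torch.Tensor) -> None:
    # Eviction path: encode and ship the block; the engine frees the GPU copy immediately.
    vault[(layer, block_id)] = encode_int8(block.detach().cpu())  # fire-and-forget Store in reality

def on_cache_miss(wanted: list[tuple[int, int]], paged_kv: dict) -> None:
    # Restore path: fetch every layer's blocks in one round-trip (BatchPromote),
    # decode them, and write the tensors straight into the paged KV buffer slots.
    for key in wanted:
        q, scale = vault[key]                  # one batched RPC in reality
        paged_kv[key] = decode_int8(q, scale)  # no attention recomputation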
TurboQuant is a per-group INT8 quantizer written in Rust. Groups are aligned to attention head boundaries (group size = head dim, e.g. 256 for Qwen3.6-35B-A3B), so outlier heads can't corrupt neighboring groups. Result: 3.9× compression at ≥52 dB SNR.
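A rough numpy rendering of the per-group scheme (an approximation, not the Rust code; head_dim=256 mirrors the example above, and the printed numbers are for synthetic data, not a claim about real KV tensors):
import numpy as np

def quantize_per_group(block: np.ndarray, head_dim: int = 256):
    # One INT8 scale per head-dim-sized group, so an outlier head only widens its own
    # group's scale and cannot corrupt neighboring heads.
    groups = block.astype(np.float32).reshape(-1, head_dim)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

def snr_db(original: np.ndarray, restored: np.ndarray) -> float:
    err = original.astype(np.float32).reshape(restored.shape) - restored
    return float(10 * np.log10(np.square(original.astype(np.float32)).mean() / np.square(err).mean()))

block = np.random.randn(16, 8 * 256).astype(np.float16)  # 16 tokens x 8 heads of dim 256
q, s = quantize_per_group(block)
restored = dequantize(q, s)
ratio = block.nbytes / (q.nbytes + s.nbytes)  # ~2x from an fp16 source; the headline 3.9x depends on source dtype/layout
print(f"{ratio:.1f}x smaller, {snr_db(block, restored):.1f} dB SNR on synthetic data")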
Hybrid models like Qwen3.6-35B-A3B (10 full-attention + 30 linear-attention layers) route the two layer types to separate vaults automatically — no manual config per model.
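Conceptually the routing is a per-layer switch. A minimal sketch (hypothetical names; the real connector gets layer types from the engine's metadata) that maps each layer type to the endpoints from tierkv.toml:
from dataclasses import dataclass

@dataclass
class VaultEndpoints:
    kv_cold: str   # full-attention KV blocks
    ssm_cold: str  # SSM / linear-attention states

def vault_for_layer(layer_type: str, endpoints: VaultEndpoints) -> str:
    if layer_type == "full_attention":
        return endpoints.kv_cold
    if layer_type == "linear_attention":
        return endpoints.ssm_cold
    raise ValueError(f"unknown layer type: {layer_type}")

endpoints = VaultEndpoints(kv_cold="192.168.1.10:50051", ssm_cold="192.168.1.11:50051")
hybrid_stack = ["linear_attention", "linear_attention", "full_attention", "linear_attention"]
targets = [vault_for_layer(t, endpoints) for t in hybrid_stack]  # picked per layer, no manual config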
Setup
Step 1 — Install on all machines:
pip install tierkv
No Cargo, no cmake. The Rust core is bundled in the wheel.
Step 2 — Configure each machine (tierkv.toml):
Inference node:
[cluster]
role = "inference"
kv_cold = "192.168.1.10:50051"
ssm_cold = "192.168.1.11:50051"
[turbo_quant]
enabled = true
kv_dim = 256 # match your model's attention head dimension
KV vault machine:
[cluster]
role = "kv_cold"
[vault]
max_bytes = 24_000_000_000 # 24 GB
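max_bytes is the vault's RAM budget; once it is exceeded the vault has to drop something. As a rough mental model only (tierKV's actual eviction policy isn't specified here), a byte-capped LRU store behaves like this:
from collections import OrderedDict
from typing import Optional

class ByteCappedStore:
    # Drops the least recently used entries once the configured byte budget is exceeded.
    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self._data: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, blob: bytes) -> None:
        if key in self._data:
            self.used -= len(self._data.pop(key))
        self._data[key] = blob
        self.used += len(blob)
        while self.used > self.max_bytes and self._data:
            _, dropped = self._data.popitem(last=False)
            self.used -= len(dropped)

    def get(self, key: str) -> Optional[bytes]:
        blob = self._data.get(key)
        if blob is not None:
            self._data.move_to_end(key)  # refresh recency when a block is promoted
        return blob

store = ByteCappedStore(max_bytes=24_000_000_000)  # mirrors the config above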
Step 3 — Start vault servers on cold machines:
tierkv vault
Step 4 — Verify connectivity:
tierkv status
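If tierkv status reports a vault as unreachable, it is almost always a plain networking issue. A quick, tierKV-independent way to probe the ports from the inference node (addresses taken from the config above):
import socket

for name, addr in {"kv_cold": "192.168.1.10:50051", "ssm_cold": "192.168.1.11:50051"}.items():
    host, port = addr.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=2):
            print(f"{name} {addr}: reachable")
    except OSError as err:
        print(f"{name} {addr}: unreachable ({err})")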
Step 5 — Launch vLLM:
vllm serve Qwen/Qwen3-30B-A3B \
--kv-transfer-config '{
"kv_connector": "TierKVConnector",
"kv_connector_module_path": "tierkv.connectors.vllm.connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
}' \
--enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--block-size 16
That's it — no vLLM source changes, no rebuilding. tierKV intercepts eviction and restore automatically.
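If you drive vLLM from Python instead of the CLI, the same settings should carry over. The sketch below assumes vLLM's KVTransferConfig accepts the same keys as the --kv-transfer-config JSON above; the config surface has shifted between vLLM releases, so check yours:
from vllm import LLM
from vllm.config import KVTransferConfig

# Same settings as the CLI JSON above, passed programmatically.
kv_cfg = KVTransferConfig(
    kv_connector="TierKVConnector",
    kv_connector_module_path="tierkv.connectors.vllm.connector",
    kv_role="kv_both",
    kv_connector_extra_config={"config_path": "/path/to/tierkv.toml"},
)

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    kv_transfer_config=kv_cfg,
    enable_prefix_caching=True,
    block_size=16,
    # the hybrid-KV-cache-manager flag from the CLI command is left at its default here
)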
EXO users:
tierkv install --exo-path /path/to/exo
This patches EXO in place. Then launch EXO as normal.
Our Test Cluster
- Inference node: NVIDIA DGX Spark (GB10, 96 GB HBM) — runs vLLM or EXO
- KV cold vault: Apple Mac Pro (M2 Pro, 32 GB RAM) — 24 GB reserved for KV blocks
- SSM cold vault: Apple MacBook Air (M2, 16 GB RAM) — 12 GB reserved for SSM states
- Network: 5GbE LAN, ~0.5ms RTT
Deliberately modest hardware. The vault nodes are otherwise idle machines — no GPU required.
When tierKV Helps
- Repeated long-context prompts (RAG over fixed docs, chat history, system prompts)
- Multi-user serving with shared prefixes — first request warms the vault, all others benefit
- Hybrid MoE + SSM models where both layer types need separate cold storage
- Tight VRAM budget relative to context length
When It Doesn't Help
- Single-shot prompts that never repeat
- High-latency networks (WiFi, WAN) — assumes sub-5ms LAN RTT
- Tensor-parallel multi-GPU inference — not yet supported
- Very short prompts on hybrid models (below HMA block size threshold)
- Applications requiring bit-for-bit identical output (set enabled = false under [turbo_quant] in tierkv.toml)
Links
- GitHub: github.com/tierkv/tierkv
- Full writeup: Substack