The Problem
When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch — quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time the same prompt reappears.
tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required.
Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)
We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took 10.75 seconds. A GPU cache hit (blocks already in VRAM) dropped that to 1.19 seconds. The cold vault restore came in at 0.52 seconds — 20× faster than a full prefill, and faster than the GPU cache hit.
Vault restore beats GPU cache hit because it bypasses attention computation entirely. GPU hits still run partial attention; vault blocks go straight into the buffer. The gap widens with context length — projected ~35× speedup at 128k tokens since prefill is O(n²) and restore is O(n) + network.
tierKV also supports EXO via a post-install patch. On an 8,000-token prompt: 30.83s cold → 4.11s restored (7.5×).
Architecture
Three tiers:
- [Hot] GPU KV cache — VRAM, in-engine prefix cache
- [Cold] KV vault — LAN machine RAM, ~0.5ms away, gRPC
- [Cold] SSM vault — separate LAN machine for SSM/linear-attention layers
Eviction path: GPU block evicted → TurboQuant INT8 encode → fire-and-forget gRPC Store → GPU block freed immediately.
Restore path: Cache miss → BatchPromote RPC (all layers, one round-trip) → parallel rayon decode (GIL released) → tensors injected into paged KV buffer.
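To make the two paths concrete, here is a minimal sketch of the flow in plain Python. It is not tierKV's implementation: a local dict stands in for the gRPC vault, a single per-block INT8 scale stands in for TurboQuant's per-group encoding, and every name is illustrative.
import torch

# A plain dict stands in for the remote vault (in reality a gRPC service on another machine).
vault: dict[tuple[int, int], tuple[torch.Tensor, torch.Tensor]] = {}

def encode_int8(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Stand-in for TurboQuant: one scale for the whole block instead of one per head group.
    scale = block.abs().amax().clamp(min=1e-8) / 127.0
    return (block / scale).round().to(torch.int8), scale

def decode_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

def on_evict(layer: int, block_id: int, block: torch.Tensor) -> None:
    # Eviction path: encode and ship the block; the engine frees the GPU copy immediately.
    vault[(layer, block_id)] = encode_int8(block.detach().cpu())  # fire-and-forget Store in reality

def on_cache_miss(wanted: list[tuple[int, int]], paged_kv: dict) -> None:
    # Restore path: fetch every layer's blocks in one round-trip (BatchPromote),
    # decode them, and write the tensors straight into the paged KV buffer slots.
    for key in wanted:
        q, scale = vault[key]                  # one batched RPC in reality
        paged_kv[key] = decode_int8(q, scale)  # no attention recomputation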
TurboQuant is a per-group INT8 quantizer written in Rust. Groups are aligned to attention head boundaries (group size = head dim, e.g. 256 for Qwen3.6-35B-A3B), so outlier heads can't corrupt neighboring groups. Result: 3.9× compression at ≥52 dB SNR.
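A rough numpy rendering of the per-group scheme (an approximation, not the Rust code; head_dim=256 mirrors the example above, and the printed numbers are for synthetic data, not a claim about real KV tensors):
import numpy as np

def quantize_per_group(block: np.ndarray, head_dim: int = 256):
    # One INT8 scale per head-dim-sized group, so an outlier head only widens its own
    # group's scale and cannot corrupt neighboring heads.
    groups = block.astype(np.float32).reshape(-1, head_dim)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

def snr_db(original: np.ndarray, restored: np.ndarray) -> float:
    err = original.astype(np.float32).reshape(restored.shape) - restored
    return float(10 * np.log10(np.square(original.astype(np.float32)).mean() / np.square(err).mean()))

block = np.random.randn(16, 8 * 256).astype(np.float16)  # 16 tokens x 8 heads of dim 256
q, s = quantize_per_group(block)
restored = dequantize(q, s)
ratio = block.nbytes / (q.nbytes + s.nbytes)  # ~2x from an fp16 source; the headline 3.9x depends on source dtype/layout
print(f"{ratio:.1f}x smaller, {snr_db(block, restored):.1f} dB SNR on synthetic data")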
Hybrid models like Qwen3.6-35B-A3B (10 full-attention + 30 linear-attention layers) route the two layer types to separate vaults automatically — no manual config per model.
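Conceptually the routing is a per-layer switch. A minimal sketch (hypothetical names; the real connector gets layer types from the engine's metadata) that maps each layer type to the endpoints from tierkv.toml:
from dataclasses import dataclass

@dataclass
class VaultEndpoints:
    kv_cold: str   # full-attention KV blocks
    ssm_cold: str  # SSM / linear-attention states

def vault_for_layer(layer_type: str, endpoints: VaultEndpoints) -> str:
    if layer_type == "full_attention":
        return endpoints.kv_cold
    if layer_type == "linear_attention":
        return endpoints.ssm_cold
    raise ValueError(f"unknown layer type: {layer_type}")

endpoints = VaultEndpoints(kv_cold="192.168.1.10:50051", ssm_cold="192.168.1.11:50051")
hybrid_stack = ["linear_attention", "linear_attention", "full_attention", "linear_attention"]
targets = [vault_for_layer(t, endpoints) for t in hybrid_stack]  # picked per layer, no manual config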
Setup
Step 1 — Install on all machines:
pip install tierkv
No Cargo, no cmake. The Rust core is bundled in the wheel.
Step 2 — Configure each machine (tierkv.toml):
Inference node:
[cluster]
role = "inference"
kv_cold = "192.168.1.10:50051"
ssm_cold = "192.168.1.11:50051"
[turbo_quant]
enabled = true
kv_dim = 256 # match your model's attention head dimension
KV vault machine:
[cluster]
role = "kv_cold"
[vault]
max_bytes = 24_000_000_000 # 24 GB
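max_bytes is the vault's RAM budget; once it is exceeded the vault has to drop something. As a rough mental model only (tierKV's actual eviction policy isn't specified here), a byte-capped LRU store behaves like this:
from collections import OrderedDict
from typing import Optional

class ByteCappedStore:
    # Drops the least recently used entries once the configured byte budget is exceeded.
    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self._data: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, blob: bytes) -> None:
        if key in self._data:
            self.used -= len(self._data.pop(key))
        self._data[key] = blob
        self.used += len(blob)
        while self.used > self.max_bytes and self._data:
            _, dropped = self._data.popitem(last=False)
            self.used -= len(dropped)

    def get(self, key: str) -> Optional[bytes]:
        blob = self._data.get(key)
        if blob is not None:
            self._data.move_to_end(key)  # refresh recency when a block is promoted
        return blob

store = ByteCappedStore(max_bytes=24_000_000_000)  # mirrors the config above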
Step 3 — Start vault servers on cold machines:
tierkv vault
Step 4 — Verify connectivity:
tierkv status
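If tierkv status reports a vault as unreachable, it is almost always a plain networking issue. A quick, tierKV-independent way to probe the ports from the inference node (addresses taken from the config above):
import socket

for name, addr in {"kv_cold": "192.168.1.10:50051", "ssm_cold": "192.168.1.11:50051"}.items():
    host, port = addr.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=2):
            print(f"{name} {addr}: reachable")
    except OSError as err:
        print(f"{name} {addr}: unreachable ({err})")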
Step 5 — Launch vLLM:
vllm serve Qwen/Qwen3-30B-A3B \
--kv-transfer-config '{
"kv_connector": "TierKVConnector",
"kv_connector_module_path": "tierkv.connectors.vllm.connector",
"kv_role": "kv_both",
"kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
}' \
--enable-prefix-caching \
--no-disable-hybrid-kv-cache-manager \
--block-size 16
That's it — no vLLM source changes, no rebuilding. tierKV intercepts eviction and restore automatically.
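If you drive vLLM from Python instead of the CLI, the same settings should carry over. The sketch below assumes vLLM's KVTransferConfig accepts the same keys as the --kv-transfer-config JSON above; the config surface has shifted between vLLM releases, so check yours:
from vllm import LLM
from vllm.config import KVTransferConfig

# Same settings as the CLI JSON above, passed programmatically.
kv_cfg = KVTransferConfig(
    kv_connector="TierKVConnector",
    kv_connector_module_path="tierkv.connectors.vllm.connector",
    kv_role="kv_both",
    kv_connector_extra_config={"config_path": "/path/to/tierkv.toml"},
)

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    kv_transfer_config=kv_cfg,
    enable_prefix_caching=True,
    block_size=16,
    # the hybrid-KV-cache-manager flag from the CLI command is left at its default here
)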
EXO users:
tierkv install --exo-path /path/to/exo
This patches EXO in place. Then launch EXO as normal.
Our Test Cluster
- Inference node: NVIDIA DGX Spark (GB10, 96 GB HBM) — runs vLLM or EXO
- KV cold vault: Apple Mac Pro (M2 Pro, 32 GB RAM) — 24 GB reserved for KV blocks
- SSM cold vault: Apple MacBook Air (M2, 16 GB RAM) — 12 GB reserved for SSM states
- Network: 5GbE LAN, ~0.5ms RTT
Deliberately modest hardware. The vault nodes are otherwise idle machines — no GPU required.
When tierKV Helps
- Repeated long-context prompts (RAG over fixed docs, chat history, system prompts)
- Multi-user serving with shared prefixes — first request warms the vault, all others benefit
- Hybrid MoE + SSM models where both layer types need separate cold storage
- Tight VRAM budget relative to context length
When It Doesn't Help
- Single-shot prompts that never repeat
- High-latency networks (WiFi, WAN) — assumes sub-5ms LAN RTT
- Tensor-parallel multi-GPU inference — not yet supported
- Very short prompts on hybrid models (below HMA block size threshold)
- Applications requiring bit-for-bit identical output (set enabled = false under [turbo_quant] in tierkv.toml)
Links
- GitHub: github.com/tierkv/tierkv
- Full writeup: Substack