<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: prasanna kanagasabai</title>
    <description>The latest articles on DEV Community by prasanna kanagasabai (@prasanna_kanagasabai_4ae7).</description>
    <link>https://dev.to/prasanna_kanagasabai_4ae7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920980%2Fe8bb3720-6d13-40eb-8899-14a351df4181.png</url>
      <title>DEV Community: prasanna kanagasabai</title>
      <link>https://dev.to/prasanna_kanagasabai_4ae7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prasanna_kanagasabai_4ae7"/>
    <language>en</language>
    <item>
      <title>tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits</title>
      <dc:creator>prasanna kanagasabai</dc:creator>
      <pubDate>Sat, 09 May 2026 03:01:14 +0000</pubDate>
      <link>https://dev.to/prasanna_kanagasabai_4ae7/tierkv-a-distributed-kv-cache-that-makes-evicted-blocks-faster-to-restore-than-gpu-cache-hits-1ghd</link>
      <guid>https://dev.to/prasanna_kanagasabai_4ae7/tierkv-a-distributed-kv-cache-that-makes-evicted-blocks-faster-to-restore-than-gpu-cache-hits-1ghd</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs the full prefill from scratch — work that scales quadratically with sequence length. On a 30,000-token document that means 10+ seconds of redundant compute every time the same prompt reappears.&lt;/p&gt;

&lt;p&gt;tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's &lt;code&gt;KVConnectorBase_V1&lt;/code&gt; plugin API with no source changes required.&lt;/p&gt;
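
&lt;p&gt;For orientation, a skeleton of that integration point is below. The import path and method names follow vLLM's v1 connector interface at the time of writing, but treat them as assumptions and check &lt;code&gt;KVConnectorBase_V1&lt;/code&gt; in your installed vLLM; the bodies are placeholders, not tierKV's code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a KV-connector plugin. Names mirror vLLM's v1 connector
# interface; verify the import path and signatures against your vLLM
# release. Scheduler-side hooks are omitted for brevity.
from vllm.distributed.kv_transfer.kv_connector.v1 import (
    KVConnectorBase_V1, KVConnectorRole)

class SketchConnector(KVConnectorBase_V1):
    def __init__(self, vllm_config, role):
        super().__init__(vllm_config, role)

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        # Worker side: where saveable/evicted KV can be quantized
        # and shipped off-GPU.
        pass

    def start_load_kv(self, forward_context, **kwargs):
        # Worker side: kick off loads of externally cached KV into
        # the paged buffer before the forward pass needs it.
        pass

    def wait_for_layer_load(self, layer_name):
        # Block until a given layer's KV is resident.
        pass

    def wait_for_save(self):
        # Drain outstanding saves before blocks are freed.
        pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;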




&lt;h2&gt;
  
  
  Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)
&lt;/h2&gt;

&lt;p&gt;We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took &lt;strong&gt;10.75 seconds&lt;/strong&gt;. A GPU cache hit (blocks already in VRAM) dropped that to &lt;strong&gt;1.19 seconds&lt;/strong&gt;. The cold vault restore came in at &lt;strong&gt;0.52 seconds&lt;/strong&gt; — 20× faster than a full prefill, and faster than the GPU cache hit.&lt;/p&gt;

&lt;p&gt;Vault restore beats the GPU cache hit because it bypasses attention computation entirely: a GPU hit still runs partial attention, while vault blocks go straight into the paged buffer. The gap widens with context length — we project a ~35× speedup at 128k tokens, since prefill is O(n²) while restore is O(n) plus a fixed network cost.&lt;/p&gt;

&lt;p&gt;tierKV also supports &lt;strong&gt;EXO&lt;/strong&gt; via a post-install patch. On an 8,000-token prompt: 30.83s cold → 4.11s restored (7.5×).&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Three tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Hot]  GPU KV cache  — VRAM, in-engine prefix cache
[Cold] KV vault      — LAN machine RAM, ~0.5ms away, gRPC
[Cold] SSM vault     — separate LAN machine for SSM/linear-attention layers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Eviction path:&lt;/strong&gt; GPU block evicted → TurboQuant INT8 encode → fire-and-forget gRPC &lt;code&gt;Store&lt;/code&gt; → GPU block freed immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Restore path:&lt;/strong&gt; Cache miss → &lt;code&gt;BatchPromote&lt;/code&gt; RPC (all layers, one round-trip) → parallel rayon decode (GIL released) → tensors injected into paged KV buffer.&lt;/p&gt;
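
&lt;p&gt;In code, the two paths look roughly like this. Everything here — the in-memory vault, the stub codec — is a hypothetical stand-in that mirrors the prose above, not tierKV's actual internals (the real vault is a gRPC service and the codec is TurboQuant in Rust):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the eviction and restore paths described above.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

class InMemoryVault:
    def __init__(self):
        self._blobs = {}

    def store_async(self, key, blob):
        # Real path: fire-and-forget gRPC Store; no ack awaited.
        self._blobs[key] = blob

    def batch_promote(self, keys):
        # Real path: one BatchPromote RPC returns all layers per round-trip.
        return [self._blobs[k] for k in keys]

def encode(block):
    # Stand-in for TurboQuant INT8 encode (quantizer sketch below).
    return block.astype(np.float16).tobytes()

def decode(blob):
    # Stand-in for the parallel rayon decode (GIL released in the real core).
    return np.frombuffer(blob, dtype=np.float16)

vault = InMemoryVault()

# Eviction path: encode, ship, free the GPU block immediately.
for layer in range(4):
    kv_block = np.random.default_rng(layer).random((16, 256), dtype=np.float32)
    vault.store_async(("prefix0", layer), encode(kv_block))

# Restore path: one batched fetch, decode in parallel, inject into the buffer.
keys = [("prefix0", layer) for layer in range(4)]
with ThreadPoolExecutor() as pool:
    tensors = list(pool.map(decode, vault.batch_promote(keys)))
# tensors would now be copied into the paged KV buffer; no attention recompute.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;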

&lt;p&gt;&lt;strong&gt;TurboQuant&lt;/strong&gt; is a per-group INT8 quantizer written in Rust. Groups are aligned to attention head boundaries (group size = head dim, e.g. 256 for Qwen3.6-35B-A3B), so outlier heads can't corrupt neighboring groups. Result: &lt;strong&gt;3.9× compression&lt;/strong&gt; at &lt;strong&gt;≥52 dB SNR&lt;/strong&gt;.&lt;/p&gt;
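
&lt;p&gt;The scheme is easy to sanity-check in NumPy. Below is a minimal re-implementation of head-aligned per-group symmetric INT8 — an illustration, not tierKV's Rust code. Note that the quoted 3.9× is what you get if the source tensors are FP32 with one FP32 scale per 256-wide group (4 bytes/value down to ~1.02), while the SNR you measure depends on the data distribution, so synthetic inputs won't reproduce the ≥52 dB figure exactly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal NumPy re-implementation of per-group symmetric INT8, with
# groups aligned to the attention head dimension. Illustrative only.
import numpy as np

def quantize_int8(x, group=256):
    g = x.reshape(-1, group)                      # one row per head-sized group
    scale = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero groups
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096 * 256).astype(np.float32)
x[:256] *= 50.0   # an outlier head: head-aligned groups contain the damage

q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)

snr_db = 10 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))
ratio = x.nbytes / (q.nbytes + scale.nbytes)
print(f"SNR {snr_db:.1f} dB, compression {ratio:.2f}x")   # ratio ~3.94x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;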

&lt;p&gt;Hybrid models like Qwen3.6-35B-A3B (10 full-attention + 30 linear-attention layers) route the two layer types to separate vaults automatically — no manual config per model.&lt;/p&gt;
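
&lt;p&gt;Conceptually the routing is just a per-layer lookup derived from the model config; a hypothetical sketch (layer counts from the model above, vault addresses from the sample &lt;code&gt;tierkv.toml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of automatic layer-type routing. tierKV derives
# layer kinds from the model itself; nothing here is per-model config.
KV_COLD = "192.168.1.10:50051"    # full-attention KV blocks
SSM_COLD = "192.168.1.11:50051"   # SSM / linear-attention states

def vault_for(layer_kind):
    return KV_COLD if layer_kind == "full_attention" else SSM_COLD

# e.g. a hybrid stack: 10 full-attention + 30 linear-attention layers
layers = ["full_attention"] * 10 + ["linear_attention"] * 30
routes = {i: vault_for(kind) for i, kind in enumerate(layers)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;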




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Install on all machines:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;tierkv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Cargo, no CMake. The Rust core is bundled in the wheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Configure each machine (&lt;code&gt;tierkv.toml&lt;/code&gt;):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inference node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[cluster]&lt;/span&gt;
&lt;span class="py"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"inference"&lt;/span&gt;
&lt;span class="py"&gt;kv_cold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"192.168.1.10:50051"&lt;/span&gt;
&lt;span class="py"&gt;ssm_cold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"192.168.1.11:50051"&lt;/span&gt;

&lt;span class="nn"&gt;[turbo_quant]&lt;/span&gt;
&lt;span class="py"&gt;enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;kv_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;  &lt;span class="c"&gt;# match your model's attention head dimension&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
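
&lt;p&gt;If you're not sure of your model's head dimension, you can read it from the Hugging Face config. A quick check, assuming the config exposes &lt;code&gt;head_dim&lt;/code&gt; or the usual &lt;code&gt;hidden_size / num_attention_heads&lt;/code&gt; fallback applies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Look up the attention head dimension to use as kv_dim.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("your-org/your-model")  # placeholder model id
head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads
print(head_dim)  # set [turbo_quant] kv_dim to this value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;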



&lt;p&gt;KV vault machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[cluster]&lt;/span&gt;
&lt;span class="py"&gt;role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"kv_cold"&lt;/span&gt;

&lt;span class="nn"&gt;[vault]&lt;/span&gt;
&lt;span class="py"&gt;max_bytes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24_000_000_000&lt;/span&gt;  &lt;span class="c"&gt;# 24 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 — Start vault servers on cold machines:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tierkv vault
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 — Verify connectivity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tierkv status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5 — Launch vLLM:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vllm serve Qwen/Qwen3-30B-A3B &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--kv-transfer-config&lt;/span&gt; &lt;span class="s1"&gt;'{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-disable-hybrid-kv-cache-manager&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--block-size&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
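
&lt;p&gt;For offline inference, the same wiring can be passed programmatically. This sketch assumes your vLLM release exposes &lt;code&gt;KVTransferConfig&lt;/code&gt; with the same fields the CLI JSON accepts; check &lt;code&gt;vllm.config&lt;/code&gt; for your version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Programmatic equivalent of the serve command above. Field names assume
# KVTransferConfig matches the CLI JSON in your vLLM release; verify.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    enable_prefix_caching=True,
    block_size=16,
    kv_transfer_config=KVTransferConfig(
        kv_connector="TierKVConnector",
        kv_connector_module_path="tierkv.connectors.vllm.connector",
        kv_role="kv_both",
        kv_connector_extra_config={"config_path": "/path/to/tierkv.toml"},
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;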



&lt;p&gt;That's it — no vLLM source changes, no rebuilding. tierKV intercepts eviction and restore automatically.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;EXO users:&lt;/strong&gt; &lt;code&gt;tierkv install --exo-path /path/to/exo&lt;/code&gt; patches EXO in place. Then launch EXO as normal.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Our Test Cluster
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inference node:&lt;/strong&gt; NVIDIA DGX Spark (GB10, 128 GB unified memory) — runs vLLM or EXO&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cold vault:&lt;/strong&gt; Apple Mac mini (M2 Pro, 32 GB RAM) — 24 GB reserved for KV blocks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSM cold vault:&lt;/strong&gt; Apple MacBook Air (M2, 16 GB RAM) — 12 GB reserved for SSM states&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network:&lt;/strong&gt; 5GbE LAN, ~0.5ms RTT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deliberately modest hardware. The vault nodes are otherwise idle machines — no GPU required.&lt;/p&gt;




&lt;h2&gt;
  
  
  When tierKV Helps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repeated long-context prompts (RAG over fixed docs, chat history, system prompts)&lt;/li&gt;
&lt;li&gt;Multi-user serving with shared prefixes — first request warms the vault, all others benefit&lt;/li&gt;
&lt;li&gt;Hybrid MoE + SSM models where both layer types need separate cold storage&lt;/li&gt;
&lt;li&gt;Tight VRAM budget relative to context length&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When It Doesn't Help
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single-shot prompts that never repeat&lt;/li&gt;
&lt;li&gt;High-latency networks (WiFi, WAN) — assumes sub-5ms LAN RTT&lt;/li&gt;
&lt;li&gt;Tensor-parallel multi-GPU inference — not yet supported&lt;/li&gt;
&lt;li&gt;Very short prompts on hybrid models (below HMA block size threshold)&lt;/li&gt;
&lt;li&gt;Applications requiring bit-for-bit identical output (set &lt;code&gt;enabled = false&lt;/code&gt; under &lt;code&gt;[turbo_quant]&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/tierkv/tierkv" rel="noopener noreferrer"&gt;github.com/tierkv/tierkv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full writeup:&lt;/strong&gt; &lt;a href="https://prasannakanagasabai126786.substack.com/p/your-llm-is-doing-math-it-already" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip install tierkv&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
