Two separate problems with local LLMs, one mechanism fixes both:
Long context doesn't fit. The KV-cache grows linearly with every token; on 8 GB a normal full cache runs out of memory around 65k tokens.
Memory doesn't survive a restart. Kill the process and the cache is gone — next session re-reads everything from scratch (that's all RAG is: re-reading text, every query).
What the KV-cache actually is: as the model reads, each layer stores a key/value vector per token — this is the model's working state, the thing it attends over to pick the next token. It's not text, it's computed internal state. Normally it lives entirely in VRAM and only exists while the process is alive.
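To put a number on "a key/value vector per token", here is a rough size estimate. It is a back-of-the-envelope sketch, not a measurement: the shape used (28 layers, 4 KV heads, head dimension 128) is my assumption for a Qwen2.5-7B-class model, and the helper names are made up for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: float) -> float:
    """Bytes of KV-cache one token occupies across all layers."""
    # One key vector and one value vector per KV head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def cache_gib(tokens: int, per_token: float) -> float:
    """Total cache size in GiB for a given token count."""
    return tokens * per_token / 2**30


# Assumed shape (not taken from the post): 28 layers, 4 KV heads, head_dim 128.
FP16 = kv_bytes_per_token(28, 4, 128, 2)   # 16-bit cache
INT8 = kv_bytes_per_token(28, 4, 128, 1)   # 8-bit cache

for n in (65_000, 500_000, 800_000):
    print(f"{n:>7} tokens: fp16 {cache_gib(n, FP16):6.2f} GiB, "
          f"int8 {cache_gib(n, INT8):6.2f} GiB")
```

Under those assumptions a token costs 56 KiB in fp16, so 65k tokens is already about 3.5 GiB of cache before the weights are counted.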
The mechanism:
Keep only the last N tokens' KV in VRAM (the hot window).
Stream everything older to CPU RAM (8-bit), then disk. It's not deleted, just moved somewhere 10× cheaper.
Keep a tiny index in VRAM: one vector per sentence. On a query, that index finds the few relevant chunks and copies only those KV vectors back into VRAM for the attention step.
So VRAM stays flat — it holds the hot window plus a small retrieval budget, regardless of whether total context is 50k or 800k. The cap moves from VRAM to your RAM stick.
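The three steps above can be sketched in a few dozen lines of NumPy. This is a toy, single-head illustration and not the repo's code: `TieredKV`, `quantize`, the mean-key index and the `budget` parameter are all names and choices I made up, plain arrays stand in for VRAM and CPU RAM, and positions are ignored.

```python
from collections import deque
import numpy as np


def quantize(x):
    """Symmetric per-chunk int8 quantization; returns (int8 array, scale)."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    return np.round(x / scale).astype(np.int8), scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


class TieredKV:
    def __init__(self, hot_tokens):
        self.hot_tokens = hot_tokens
        self.hot = deque()  # float32 (k, v) chunks: stands in for VRAM
        self.cold = []      # int8 chunks: stands in for CPU RAM / disk
        self.index = []     # one vector per cold chunk (here: the mean key)

    def append(self, k, v):
        """Add one sentence-sized chunk; evict the oldest past the hot window."""
        self.hot.append((k, v))
        while sum(len(c[0]) for c in self.hot) > self.hot_tokens and len(self.hot) > 1:
            old_k, old_v = self.hot.popleft()
            self.index.append(old_k.mean(axis=0))
            self.cold.append((quantize(old_k), quantize(old_v)))

    def attend(self, q, budget=2):
        """Attend over the hot window plus the `budget` best-matching cold chunks."""
        ks = [k for k, _ in self.hot]
        vs = [v for _, v in self.hot]
        if self.cold:
            scores = np.stack(self.index) @ q
            for i in np.argsort(scores)[::-1][:budget]:
                (qk, sk), (qv, sv) = self.cold[i]
                ks.append(dequantize(qk, sk))
                vs.append(dequantize(qv, sv))
        K, V = np.concatenate(ks), np.concatenate(vs)
        logits = K @ q / np.sqrt(q.size)
        w = np.exp(logits - logits.max())
        return (w / w.sum()) @ V, len(K)


rng = np.random.default_rng(0)
store = TieredKV(hot_tokens=8)
for _ in range(50):  # 50 chunks of 4 tokens each
    store.append(rng.standard_normal((4, 16)).astype(np.float32),
                 rng.standard_normal((4, 16)).astype(np.float32))
out, resident = store.attend(rng.standard_normal(16).astype(np.float32))
print("tokens attended over:", resident, "| cold chunks:", len(store.cold))
```

However many chunks are appended, the attention step only ever sees the hot window plus `budget` retrieved chunks, which is the "VRAM stays flat" property in miniature.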
The part that makes it memory and not just offload: you write that cold KV to disk and reload it in a different process days later. The model resumes from its own stored state — a few ms of memcpy, no re-reading the source. Two things make the positions line up across the gap: keys are stored already rotary-rotated at their absolute position, and the query is injected at its true position, so relative distances stay exact even between tokens 700k apart.
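Why storing pre-rotated keys is enough can be checked numerically. The sketch below assumes standard rotary embeddings in the "rotate-half" layout with base 10000; the `rope` helper, the file name and the positions are illustrative choices of mine, and `np.save`/`np.load` stand in for whatever on-disk format the repo actually uses.

```python
import os
import tempfile
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate an even-length vector x to absolute position pos."""
    half = x.size // 2
    angle = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angle) - x2 * np.sin(angle),
                           x1 * np.sin(angle) + x2 * np.cos(angle)])


rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# "Session 1": rotate the key at its absolute position and write it to disk.
key_pos = 10
path = os.path.join(tempfile.mkdtemp(), "cold_keys.npy")
np.save(path, rope(k, key_pos))

# "Session 2": reload the stored key and inject the query at its true position.
stored = np.load(path)
query_pos = 700_010
score_resumed = rope(q, query_pos) @ stored

# Reference: the same relative distance (700,000) computed from position zero.
score_reference = rope(q, 700_000) @ rope(k, 0)

print(np.isclose(score_resumed, score_reference))
```

The attention score depends only on the difference between the two positions, so a key rotated once and stored stays valid for any later query, as long as that query is rotated at its real absolute position.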
Result (RTX 5060 8 GB, 4-bit): a planted fact is the top prediction at 120k–500k tokens of context and top-2 at 800k, with VRAM flat at ~6 GB, where a normal full cache OOMs at 65k. The live store behind my assistant is 2.5M tokens of past sessions in 467 MB, answered in ~1s on CPU. A miss returns LOW CONFIDENCE, not a hallucination.
Why Qwen2.5-7B-1M and not anything else — two independent requirements, most models fail one:
Architecture must keep a growing KV. Mamba/SSM and linear-attention models (Granite-4, RWKV, Qwen3.5's hybrid layers) compress history into one fixed-size state — there's no per-token KV to stream out or pull back. The whole method is inapplicable to them.
The model must actually read at that range. Effective context is ~50–70% of the advertised number (RULER). MiniCPM-1B (128k nominal) goes to garbage by ~120k even with a perfect full cache in VRAM — that's the model's limit, not the cache's. No infrastructure fixes it.
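Qwen2.5-7B-1M passes both and fits 8 GB in 4-bit. On its own, the model OOMs at 65k with a normal full cache and can't persist across restarts; the streaming cache on its own can't make a weaker model read further than it already does. The combination is the whole point.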
Limits: confidence drops with distance (p 0.98 → 0.21 at 800k); many near-identical facts confuse the cheap index (need full attention there); on messy logs generation can override the retrieved KV with its own prior, so I quote stored text verbatim instead. Research PoC — the core (spilling KV to CPU) has precedent (ArkVale, RetrievalAttention), credited in the README.
Code (MIT): https://github.com/helgard-orlm/living-kv-cache