Aman Sachan

KVQuant: Run 70B LLMs on 8GB RAM with Real-Time KV Cache Compression

I built KVQuant because I wanted to run 70B parameter models on my gaming laptop. The problem? Even with 4-bit quantization, a 128K context window needs 256GB RAM just for the KV cache.

The Problem

When you run an LLM with a long context, the memory bottleneck is not the model weights; it is the KV cache, which grows linearly with context length.

| Model | Weights (4-bit) | KV Cache (128K ctx) | Total |
| --- | --- | --- | --- |
| Llama-3-8B | 5GB | 64GB | 69GB |
| Llama-3-70B | 40GB | 256GB | 296GB |
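For intuition, the 64GB figure for Llama-3-8B falls out of a back-of-the-envelope calculation, assuming the cache is kept in fp16 across all 32 attention heads (a full multi-head attention layout; grouped-query attention would shrink this, so treat the exact numbers as the post's assumptions):

```python
# KV cache size = 2 (K and V) * layers * heads * head_dim * seq_len * bytes per value
layers, heads, head_dim = 32, 32, 128        # Llama-3-8B shape, full MHA assumed
seq_len, fp16_bytes = 128 * 1024, 2

bytes_per_token = 2 * layers * heads * head_dim * fp16_bytes
total_gb = bytes_per_token * seq_len / 2**30
print(f"{bytes_per_token / 1024:.0f}KB per token, {total_gb:.0f}GB at 128K context")
# -> 512KB per token, 64GB at 128K context
```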

The Solution

KVQuant compresses the KV cache in real time using per-position adaptive quantization.

Result: 4-6x compression with less than 1% perplexity increase.
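The post doesn't show the kernel itself, but here is a minimal sketch of the core idea behind per-position quantization: each token position in the cache gets its own scale and zero-point, so a few outlier positions don't stretch the quantization range for every other token. The function names and the fixed 4-bit setting are mine, not KVQuant's API, and the "adaptive" part (varying bit-width per position) is omitted for brevity:

```python
import torch

def quantize_per_position(kv: torch.Tensor, bits: int = 4):
    """Quantize a K or V cache slice of shape [seq_len, hidden].
    Hypothetical helper for illustration, not the library's actual kernel."""
    qmax = 2**bits - 1
    lo = kv.min(dim=-1, keepdim=True).values  # one zero-point per position
    hi = kv.max(dim=-1, keepdim=True).values
    scale = (hi - lo).clamp(min=1e-8) / qmax  # one scale per position
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    # uint8 storage is wasteful for 4-bit values; a real kernel would pack
    # two values per byte to get the full 4x compression over fp16.
    return q, scale, lo

def dequantize(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor):
    return q.to(scale.dtype) * scale + lo

kv = torch.randn(128, 4096)  # 128 cached positions, fp32 for the demo
q, scale, lo = quantize_per_position(kv)
print((dequantize(q, scale, lo) - kv).abs().mean())  # small reconstruction error
```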

Usage

```python
from kvquant import KVQuant

# model: a causal LM you have already loaded (e.g. via transformers)
compressor = KVQuant(target_memory_gb=8)  # budget for the compressed KV cache
model = compressor.wrap(model)            # cache compression is applied transparently from here on
```
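Assuming `wrap()` returns a standard transformers model (the post doesn't show this, so treat it as an assumption), the wrapped model should drop into an ordinary generation loop unchanged:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant

name = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = KVQuant(target_memory_gb=8).wrap(AutoModelForCausalLM.from_pretrained(name))

inputs = tok("The KV cache is", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```

Taking a memory budget (`target_memory_gb`) instead of a bit-width is a nice interface choice: presumably the library, not the user, decides how aggressively to quantize as the context grows.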

GitHub: https://github.com/AmSach/kvquant
