I built KVQuant because I wanted to run 70B-parameter models on my gaming laptop. The problem? Even with 4-bit weight quantization, a 128K context window still needs 256GB of RAM just for the KV cache.
## The Problem
When you run an LLM with a long context, the memory bottleneck is not the model weights; it is the KV cache, which grows linearly with context length.
| Model | Weights (4-bit) | KV Cache (128K ctx, fp16) | Total |
|---|---|---|---|
| Llama-3-8B | 5GB | 64GB | 69GB |
| Llama-3-70B | 40GB | 256GB | 296GB |
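
For intuition, the KV cache footprint follows directly from the model shape: every token stores one key and one value vector per layer per attention head. Here is a minimal back-of-the-envelope sketch, assuming full multi-head attention and an fp16 cache (the model configs are assumptions; GQA models cache fewer heads and land much lower):

```python
def kv_cache_gib(layers: int, heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """2 tensors (K and V) * layers * heads * head_dim * tokens * bytes."""
    return 2 * layers * heads * head_dim * context_len * bytes_per_elem / 2**30

# Assumed Llama-3-8B shape: 32 layers, 32 heads, head_dim 128, fp16 cache
print(kv_cache_gib(32, 32, 128, 128 * 1024))  # 64.0 GiB at 128K context
```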
## The Solution
KVQuant compresses the KV cache in real time using per-position adaptive quantization.
Result: 4-6x compression with less than 1% perplexity increase.
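
To make the idea concrete, here is a minimal sketch of per-position quantization, assuming PyTorch tensors: each cached position carries its own scale and zero-point, so one outlier position cannot blow up the error for its neighbors, and an adaptive scheme would additionally pick the bit-width per position to hit the memory target. The `quantize_position`/`dequantize_position` helpers are hypothetical illustrations, not KVQuant's actual internals:

```python
import torch

def quantize_position(x: torch.Tensor, bits: int):
    """Asymmetric uniform quantization of a single position's K or V vector.

    Real implementations would pack two 4-bit codes per byte; we keep one
    code per uint8 element for clarity.
    """
    qmax = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax          # this position's own scale
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize_position(q: torch.Tensor, scale: torch.Tensor, lo: torch.Tensor):
    return q.float() * scale + lo

# One position's key vector, quantized to 4 bits and reconstructed
k = torch.randn(128)
q, scale, lo = quantize_position(k, bits=4)
k_hat = dequantize_position(q, scale, lo)
print((k - k_hat).abs().max())  # small per-element reconstruction error
```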
## Usage
```python
from kvquant import KVQuant

# Cap total KV cache memory at 8 GB; KVQuant picks the compression to fit
compressor = KVQuant(target_memory_gb=8)
model = compressor.wrap(model)  # returns the wrapped, drop-in model
```
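
For context, here is how that two-line wrap might slot into a typical Hugging Face pipeline. The `transformers` calls are standard; the model name is just an example, and the assumption is that `wrap` returns a drop-in replacement, which is all the snippet above implies:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant

name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Wrap once; generation below proceeds as usual
model = KVQuant(target_memory_gb=8).wrap(model)

inputs = tokenizer("The KV cache is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```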