KVQuant: Run 70B LLMs on 8GB RAM with KV Cache Quantization

#python #llm #quantization

I built KVQuant because running large LLMs locally is a nightmare — not because of model weights, but because of the KV cache.

The Problem

Model	Weights (4-bit)	KV Cache (128K ctx)	Total
Llama-3-70B	40GB	256GB	296GB

Existing quantization (llama.cpp, etc.) only compresses weights. The KV cache still explodes your memory on long conversations.

What KVQuant Does

Compresses the KV cache with adaptive quantization based on token importance:

Token Position	Bits	Reason
Recent (0-256)	4-bit	Attention often attends here
Mid (256-1024)	3-bit	Medium importance
Old (1024+)	2-bit	Distant context

Features

4-6x KV cache compression with <1% perplexity increase
Drop-in — single pip install, no model recompilation
Real-time — adds <5ms latency per token
Cross-platform — CUDA, MPS (Apple Silicon), CPU

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")

with KVQuant(model, target_memory_gb=4.0):
    inputs = tokenizer("Hello, how are you?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)