
Aman Sachan

KVQuant: Run 70B LLMs on 8GB RAM with KV Cache Quantization

I built KVQuant because running large LLMs locally is a nightmare — not because of model weights, but because of the KV cache.

The Problem

|  | Model Weights (4-bit) | KV Cache (128K ctx) | Total |
|---|---|---|---|
| Llama-3-70B | 40GB | 256GB | 296GB |

Existing quantization (llama.cpp, etc.) only compresses the weights. The KV cache grows linearly with context length, so long conversations still blow past available memory.
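To see why the cache dominates, here is a back-of-envelope calculation. This is just the standard size formula, not code from KVQuant, and the model dimensions in the example are illustrative (FP16 cache, grouped-query attention), not the exact configs behind the table above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for storing both keys and values; FP16 = 2 bytes per value
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative config: 32 layers, 8 KV heads, 128-dim heads, 128K context
print(kv_cache_bytes(32, 8, 128, 128 * 1024) / 1e9)  # ~17 GB, growing linearly with context
```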

What KVQuant Does

Compresses the KV cache with adaptive quantization based on token importance (see the sketch after the table):

| Token Position | Bits | Reason |
|---|---|---|
| Recent (0-256) | 4-bit | Attention often attends here |
| Mid (256-1024) | 3-bit | Medium importance |
| Old (1024+) | 2-bit | Distant context |
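To make the policy concrete, here is a minimal sketch of position-based bit allocation with simple symmetric fake-quantization. This is not KVQuant's actual kernel; the function names and per-row scaling are assumptions for illustration, and a real implementation would pack the integers and keep scales per group.

```python
import torch

def bits_for_distance(distance_from_newest: int) -> int:
    # Thresholds taken from the table above
    if distance_from_newest < 256:
        return 4
    if distance_from_newest < 1024:
        return 3
    return 2

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric quantize -> dequantize, just to show the precision loss;
    # a real kernel would store packed integers plus a scale per group
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

# Keys for one attention head: [seq_len, head_dim], newest token last
keys = torch.randn(2048, 128)
seq_len = keys.shape[0]
for pos in range(seq_len):
    keys[pos] = fake_quantize(keys[pos], bits_for_distance(seq_len - 1 - pos))
```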

Features

  • 4-6x KV cache compression with <1% perplexity increase
  • Drop-in — single pip install, no model recompilation
  • Real-time — adds <5ms latency per token
  • Cross-platform — CUDA, MPS (Apple Silicon), CPU

Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

with KVQuant(model, target_memory_gb=4.0):
    inputs = tokenizer("Hello, how are you?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
```

Benchmarks

| Model | Context | Original KV | Compressed KV | Ratio |
|---|---|---|---|---|
| Llama-3-8B | 128K | 32GB | 8GB | 4x |
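If you want to sanity-check numbers like these yourself, a rough approach on CUDA is to compare peak allocated memory with and without the context manager. This is my own measurement sketch, not a script from the repo; the peak includes weights and activations, so the delta between the two runs is the interesting part.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from kvquant import KVQuant  # same API as in the Quick Start above

model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")

# Baseline run: full-precision KV cache
torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, max_new_tokens=512)
baseline = torch.cuda.max_memory_allocated()

# Compressed run: same generation inside the KVQuant context manager
torch.cuda.reset_peak_memory_stats()
with KVQuant(model, target_memory_gb=4.0):
    model.generate(**inputs, max_new_tokens=512)
compressed = torch.cuda.max_memory_allocated()

print(f"peak memory: {baseline / 1e9:.1f} GB -> {compressed / 1e9:.1f} GB")
```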

GitHub: https://github.com/AmSach/kvquant
