I compressed GPT-2 to run on an Arduino! Here's how I did it with KVQuant.
The Problem: LLMs need large amounts of memory for their key-value (KV) caches during inference.
The Solution: 4-bit KV cache quantization, which cuts that memory 4x with <1% accuracy loss.
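For intuition, here's a minimal sketch of what 4-bit quantization of a cache tensor can look like: symmetric per-group quantization with one fp32 scale per group. This is illustrative only; `quantize_4bit`, `dequantize_4bit`, and the group size are my own names and choices, and the repo's actual scheme (per-channel scales, non-uniform codebooks, outlier handling, etc.) may differ.

```python
# Sketch of symmetric per-group INT4 quantization for a KV cache tensor.
# Not the repo's exact method; shapes and helper names are hypothetical.
import numpy as np

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Quantize to 4-bit codes in [-8, 7] with one scale per group."""
    groups = x.reshape(-1, group_size)
    # Map the largest magnitude in each group to the code 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.maximum(scales, 1e-8)  # guard all-zero groups
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an fp32 approximation of the original tensor."""
    return (codes.astype(np.float32) * scales).reshape(-1)

# Quantize a fake KV cache slice and measure the reconstruction error.
kv = np.random.randn(4096).astype(np.float32)
codes, scales = quantize_4bit(kv)
kv_hat = dequantize_4bit(codes, scales)
print("mean abs error:", np.abs(kv - kv_hat).mean())
```

Note that the codes above sit in int8 for clarity; a real kernel packs two 4-bit codes per byte, which (ignoring the small per-group scale overhead) is where the 4x saving over fp16 comes from.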
Results:
- GPT-2: 512MB → 128MB (4x reduction)
- LLaMA-7B: 8GB → 2GB
- LLaMA-70B: 280GB → 70GB
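A quick sanity check on the 4x in the list above: fp16 stores 16 bits per cached value, so 4-bit codes shrink raw storage by 4x regardless of model size. Here's the back-of-the-envelope cache-size math, with hypothetical transformer shapes that won't reproduce the exact numbers above (those depend on model config and sequence length):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits_per_value):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * heads * head_dim * seq_len * bits_per_value // 8

# Hypothetical shapes, just to show the ratio.
fp16 = kv_cache_bytes(layers=48, heads=25, head_dim=64, seq_len=1024, bits_per_value=16)
int4 = kv_cache_bytes(layers=48, heads=25, head_dim=64, seq_len=1024, bits_per_value=4)
print(f"{fp16 / 2**20:.0f} MiB (fp16) -> {int4 / 2**20:.0f} MiB (int4)")  # 300 -> 75, i.e. 4x
```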
Code: github.com/AmSach/kvquant