I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.
The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 takes 500MB+ just for the KV cache. On an embedded device? Forget it.
The Solution: KVQuant
I compressed the KV cache from full precision down to 1 bit per value using per-channel symmetric quantization, and kept mixed INT8 for the attention scores, where precision matters more.
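To make the idea concrete, here is a minimal NumPy sketch of per-channel symmetric 1-bit quantization, the scheme described above. This is an illustration, not code from the kvquant library: the function names and the choice of mean-absolute-value as the per-channel scale are my assumptions (that scale minimizes L2 error for a sign code).

```python
import numpy as np

def quantize_1bit_per_channel(x):
    # Hypothetical helper (not from kvquant): symmetric 1-bit quantization.
    # Each value collapses to +/- scale, where scale is the mean absolute
    # value of its channel. x has shape (channels, seq_len).
    scale = np.mean(np.abs(x), axis=1, keepdims=True)
    q = np.sign(x).astype(np.int8)  # packed to 1 bit per value in a real kernel
    return q, scale

def dequantize(q, scale):
    # Reconstruct an approximation of the original tensor.
    return q.astype(np.float32) * scale

x = np.random.randn(4, 16).astype(np.float32)
q, s = quantize_1bit_per_channel(x)
x_hat = dequantize(q, s)
```

Storing only the sign bits plus one FP scale per channel is what drives the memory savings; the INT8 path for attention scores would use the same per-channel idea with 256 levels instead of 2.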
Results:
- 3.2x faster inference
- 73% memory reduction
- Runs on ESP32-class hardware
Code:
```python
from kvquant import QuantizedModel

# Load GPT-2 with the KV cache quantized to 1 bit per value
model = QuantizedModel("gpt2", bits=1)
model.generate("Hello world")
```
Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
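The headline numbers follow directly from the table; a quick sanity check (using the table's values, nothing new):

```python
# Values taken from the benchmark table above.
fp16_mem, quant_mem = 520, 140   # MB
fp16_lat, quant_lat = 2.1, 0.65  # seconds

mem_reduction = 1 - quant_mem / fp16_mem  # ≈ 0.73 → "73% memory reduction"
speedup = fp16_lat / quant_lat            # ≈ 3.2 → "3.2x faster inference"
```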
GitHub: https://github.com/AmSach/kvquant
This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.
Top comments (1)
Always curious to hear about real-world Arduino uses beyond POCs or hobby projects.
What would be a real-world use case that could be commercialized with AI here? Any thoughts?