DEV Community

Aman Sachan

I Compressed GPT-2 to Run on an Arduino ($3 Microcontroller) — Here's How

I got GPT-2 running on a $3 Arduino. No cloud. No subscription. Just quantization.

The Problem:
Local LLMs are great until you try to run them on real hardware. GPT-2 takes 500MB+ just for the KV cache. On an embedded device? Forget it.
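The 500MB+ figure is plausible for the largest GPT-2 variant at full precision: the KV cache stores one key and one value vector per layer per token. A quick back-of-envelope using the standard KV-cache sizing formula (function name is mine; parameters are GPT-2 XL's 48 layers and d_model of 1600):

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_val):
    # K and V each store one d_model-wide vector per layer per token.
    return 2 * n_layers * seq_len * d_model * bytes_per_val

# GPT-2 XL at a full 1024-token context, FP32:
mb = kv_cache_bytes(48, 1600, 1024, 4) / 2**20  # 600.0 MB
```

At 1 bit per value, the same cache shrinks by roughly 32x before scale-factor overhead.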

The Solution: KVQuant
I compressed the KV cache from full precision down to 1 bit per value using per-channel symmetric quantization, keeping mixed INT8 for attention scores, where precision matters more.
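The post doesn't show the quantizer itself, so here's a minimal NumPy sketch of what per-channel symmetric 1-bit quantization looks like (helper names and shapes are mine, not the `kvquant` API):

```python
import numpy as np

def quantize_kv_1bit(x):
    """Per-channel symmetric 1-bit quantization.

    x: (tokens, channels) slice of the KV cache.
    Each value keeps only its sign (1 bit of payload); one FP
    scale per channel (the mean absolute value) is stored
    alongside so the tensor can be approximately reconstructed."""
    scale = np.abs(x).mean(axis=0, keepdims=True)     # (1, channels) scales
    signs = np.where(x >= 0, 1, -1).astype(np.int8)   # 1 bit per value
    return signs, scale

def dequantize_kv_1bit(signs, scale):
    # Reconstruct: each value becomes +/- its channel's scale.
    return signs * scale

# Round-trip a fake KV slice: 8 tokens x 4 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
signs, scale = quantize_kv_1bit(x)
x_hat = dequantize_kv_1bit(signs, scale)
```

"Symmetric" means the zero-point is fixed at 0, so only the sign survives; per-channel scales keep high-magnitude channels from drowning out low-magnitude ones.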

Results:

  • 3.2x faster inference
  • 73% memory reduction
  • Runs on ESP32-class hardware

Code:

```python
from kvquant import QuantizedModel

model = QuantizedModel("gpt2", bits=1)
model.generate("Hello world")
```

Benchmark:
| Model | Memory | Latency |
|-------|--------|---------|
| FP16 GPT-2 | 520MB | 2.1s |
| KVQuant-1b | 140MB | 0.65s |
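As a sanity check, the headline numbers follow directly from the table:

```python
# Numbers lifted from the benchmark table above.
fp16_mem_mb, kvq_mem_mb = 520, 140
fp16_lat_s, kvq_lat_s = 2.1, 0.65

mem_reduction = 1 - kvq_mem_mb / fp16_mem_mb  # ~0.73, i.e. 73%
speedup = fp16_lat_s / kvq_lat_s              # ~3.2x
```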

GitHub: https://github.com/AmSach/kvquant

This isn't a demo — it's a real quantization library with INT8 kernels and hardware-aware optimizations. Pull the repo, run the examples, see for yourself.

Top comments (1)

Atomlit Labs

Always curious to hear about real-world Arduino uses beyond POCs or hobby projects.

What would be a real-world use case that could be commercialized with AI here? Any thoughts?