How does a 70-billion-parameter model run on a gaming GPU, or a 7B model on your laptop? Quantization. Store each weight in fewer bits and the model gets 4–8× smaller than FP32 (4× at INT8, 8× at INT4) and usually faster, with surprisingly little quality loss. Here's the idea, live.
🗜️ Quantize real weights (FP32 → INT4): https://dev48v.infy.uk/ai/days/day20-quantization.html
The memory math
A weight in FP32 takes 4 bytes, so a 7B model needs 7B × 4 bytes = 28 GB for the weights alone, more than most consumer GPUs can hold. Drop to:
- FP16/BF16 → 14 GB
- INT8 → 7 GB
- INT4 → 3.5 GB
Same model, an eighth of the FP32 memory at INT4 (a quarter at INT8).
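If you want to redo the arithmetic for other model sizes, here is a minimal sketch. The helper name `weight_memory_gb` and the table `BYTES_PER_WEIGHT` are made up for this example, 1 GB is taken as 1e9 bytes, and the small overhead of storing scale factors is ignored.

```python
# Bytes needed to store one weight at each precision (INT4 packs two weights per byte).
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring scale-factor overhead."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e9

for fmt in BYTES_PER_WEIGHT:
    size_7b = weight_memory_gb(7e9, fmt)
    size_70b = weight_memory_gb(70e9, fmt)
    print(f"{fmt:>9}: 7B = {size_7b:5.1f} GB, 70B = {size_70b:6.1f} GB")
```

These numbers cover the weights only; activations and the KV cache need memory on top of this at inference time.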
How it works
Map the weights' range (min…max) to a small set of discrete levels (2^bits of them) with a scale factor, then round each weight to the nearest level. Store the tiny integer + the scale; reconstruct (dequantize) on the fly. The demo snaps continuous weights onto the INT4 grid and shows the rounding error per weight.
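The steps above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the demo page's exact code: the function names `quantize` and `dequantize` and the sample weights are invented here, and the sketch divides the range by `levels - 1` so that both the minimum and the maximum land exactly on a level (the demo may use a slightly different convention).

```python
def quantize(weights, bits=4):
    """Min-max quantization: return integer codes plus the scale and offset to undo it."""
    levels = 2 ** bits                      # 16 levels for INT4
    w_min, w_max = min(weights), max(weights)
    # levels - 1 steps separate the lowest and highest level; fall back to 1.0 for a flat range.
    scale = (w_max - w_min) / (levels - 1) or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize(codes, scale, w_min):
    """Reconstruct approximate weights from the stored integers."""
    return [w_min + c * scale for c in codes]

weights = [-0.62, -0.11, 0.03, 0.27, 0.45, 0.88]   # made-up sample values
codes, scale, w_min = quantize(weights, bits=4)
restored = dequantize(codes, scale, w_min)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.3f} -> code {c:2d} -> {r:+.3f} (error {abs(w - r):.4f})")
```

Every code fits in 4 bits (0 to 15), and no weight moves by more than half a step, i.e. `scale / 2`.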
The tradeoff
Fewer bits = smaller + faster + cheaper, with a little accuracy loss (negligible at INT8, noticeable but usable at INT4). Tricks like keeping outlier weights in higher precision (mixed precision) and quantization-aware training claw most of it back.
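To get a feel for the tradeoff, here is a rough experiment on synthetic weights (random Gaussian values, not a real model). It compares the round-trip error at 8 and 4 bits, then shows how a single outlier stretches the range and why holding it out in full precision helps. The helper `roundtrip_error` and the outlier value 2.0 are assumptions made for this sketch.

```python
import random

def roundtrip_error(weights, bits):
    """Mean absolute error after min-max quantizing to `bits` bits and dequantizing."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (2 ** bits - 1) or 1.0
    restored = [w_min + round((w - w_min) / scale) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(10_000)]  # synthetic stand-in for a layer

err8 = roundtrip_error(weights, 8)
err4 = roundtrip_error(weights, 4)

# One large outlier stretches min..max, so every other weight gets a much coarser grid.
with_outlier = weights + [2.0]
err4_outlier = roundtrip_error(with_outlier, 4)

# Mixed-precision idea: store the outlier unquantized (zero error), quantize the rest.
err4_mixed = roundtrip_error(weights, 4) * len(weights) / len(with_outlier)

print(f"INT8 mean error:                 {err8:.6f}")
print(f"INT4 mean error:                 {err4:.6f}")
print(f"INT4 with one outlier:           {err4_outlier:.6f}")
print(f"INT4, outlier kept in full prec: {err4_mixed:.6f}")
```

Raw rounding error is only a proxy: what matters in practice is the effect on the model's outputs, which is why methods such as quantization-aware training exist.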
In the wild
You'll meet quantization as GGUF files (llama.cpp), GPTQ and AWQ checkpoints, and the load_in_4bit option backed by bitsandbytes. This is what democratized running big models locally.
🔨 Built from scratch (scale = range/levels → round → dequantize → measure error+size) on the page: https://dev48v.infy.uk/ai/days/day20-quantization.html
Part of AIFromZero. 🌐 https://dev48v.infy.uk