Quantization: How 70B Models Run on a Gaming GPU

#ai #beginners #llm #machinelearning

How does a 70-billion-parameter model run on a gaming GPU, or a 7B model on your laptop? Quantization. Store each weight in fewer bits and the model gets 4–8× smaller (FP16 → INT4 is 4×, FP32 → INT4 is 8×) and usually faster, with surprisingly little quality loss. Here's the idea, live.

🗜️ Quantize real weights (FP32 → INT4): https://dev48v.infy.uk/ai/days/day20-quantization.html

The memory math

A weight in FP32 takes 4 bytes, so a 7B model needs 7B × 4 bytes = 28 GB for the weights alone, which is too much for most consumer GPUs. Drop to:

FP16/BF16 → 14 GB
INT8 → 7 GB
INT4 → 3.5 GB

Same model, one eighth of the FP32 memory at INT4 (a quarter of FP16).
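The arithmetic above can be sketched in a few lines. This is a weights-only estimate: the helper name `weight_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and the small overhead of stored scale factors, plus activations and the KV cache, is ignored.

```python
# Bytes needed to store one weight in each format.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params: float, bytes_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_weight / 1e9

for name, size in BYTES_PER_WEIGHT.items():
    small = weight_memory_gb(7e9, size)   # 7B-parameter model
    large = weight_memory_gb(70e9, size)  # 70B-parameter model
    print(f"{name:>9}: 7B = {small:5.1f} GB | 70B = {large:6.1f} GB")
```

Running the same function on 70e9 parameters shows why the bit width matters so much for the larger models.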

How it works

Map the weights' range (min…max) to a small set of discrete levels (2^bits of them, so 16 for INT4) with a scale factor, then round each weight to the nearest level. Store the tiny integer plus the scale, and reconstruct (dequantize) the approximate weight on the fly. The demo snaps continuous weights onto the INT4 grid and shows the rounding error per weight.
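Here is a minimal NumPy sketch of that round trip, not the demo's actual code. It assumes one scale for the whole tensor and divides the range by (levels − 1) so that both min and max land exactly on the grid; the function names `quantize` and `dequantize` and the random test weights are illustrative only.

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int = 4):
    """Min-max quantization to 2**bits levels (bits <= 8, codes fit in uint8)."""
    levels = 2 ** bits
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (levels - 1)
    if scale == 0.0:          # all weights identical: avoid dividing by zero
        scale = 1.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)
    return codes, scale, w_min

def dequantize(codes: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    """Rebuild approximate float weights from the integer codes."""
    return codes.astype(np.float32) * scale + w_min

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=8).astype(np.float32)  # made-up weights

codes, scale, w_min = quantize(weights, bits=4)
restored = dequantize(codes, scale, w_min)

print("codes:        ", codes)
print("max abs error:", np.abs(weights - restored).max())
print("half a step:  ", scale / 2)
```

Rounding to the nearest level means no weight moves by more than half a step (scale / 2), which is the per-weight error the demo visualizes.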

The tradeoff

Fewer bits = smaller + faster + cheaper, at the cost of some accuracy (usually negligible at INT8, noticeable but usable at INT4). Tricks like keeping outlier weights in higher precision (mixed precision) and quantization-aware training claw most of that loss back.
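A toy experiment makes the tradeoff concrete. This sketch uses synthetic weights with four hand-planted outliers (an assumption for illustration, not data from a real model) and compares the round-trip error at INT8, at INT4, and at INT4 with the outliers left untouched in full precision. The helpers `roundtrip` and `roundtrip_keep_outliers` are hypothetical names, and real mixed-precision schemes are more involved than this.

```python
import numpy as np

def roundtrip(weights: np.ndarray, bits: int) -> np.ndarray:
    """Min-max quantize to 2**bits levels, then dequantize."""
    levels = 2 ** bits
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (levels - 1) or 1.0
    codes = np.round((weights - w_min) / scale)
    return (codes * scale + w_min).astype(np.float32)

def roundtrip_keep_outliers(weights: np.ndarray, bits: int, n_outliers: int) -> np.ndarray:
    """Quantize everything except the n largest-magnitude weights."""
    is_outlier = np.zeros(weights.shape, dtype=bool)
    if n_outliers > 0:
        is_outlier[np.argsort(np.abs(weights))[-n_outliers:]] = True
    out = weights.copy()                      # outliers stay in full precision
    out[~is_outlier] = roundtrip(weights[~is_outlier], bits)
    return out

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
weights[:4] = [0.5, -0.6, 0.7, -0.8]          # synthetic outliers stretch the range

err_int8 = rmse(weights, roundtrip(weights, 8))
err_int4 = rmse(weights, roundtrip(weights, 4))
err_int4_mixed = rmse(weights, roundtrip_keep_outliers(weights, 4, n_outliers=4))

print(f"INT8 RMSE:                 {err_int8:.5f}")
print(f"INT4 RMSE:                 {err_int4:.5f}")
print(f"INT4 + 4 outliers in FP32: {err_int4_mixed:.5f}")
```

The outliers hurt because they stretch min…max, so the 16 INT4 levels are spread over a range most weights never use. Taking them out of the quantized set shrinks the step size for everything else.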

In the wild

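You'll meet quantization as GGUF (the quantized model file format used by llama.cpp), GPTQ and AWQ (post-training quantization methods), and the bitsandbytes load_in_4bit option in Hugging Face Transformers. This is what democratized running big models locally.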
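For the bitsandbytes route, loading a model in 4-bit through Hugging Face Transformers looks roughly like the sketch below. It assumes `transformers`, `bitsandbytes`, `accelerate`, and `torch` are installed and a CUDA GPU is available. The wrapper name `load_4bit_model`, the NF4 quantization type, and the bfloat16 compute dtype are choices made here for illustration, and the model ID is a placeholder you supply.

```python
def load_4bit_model(model_id: str):
    """Load a causal LM with its weights quantized to 4-bit via bitsandbytes."""
    # Imported lazily so this file can be read or imported without a GPU setup.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4-bit
        bnb_4bit_quant_type="nf4",              # assumption: NF4 rather than plain FP4
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the matmuls
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                      # needs the accelerate package
    )

# Usage (downloads the model, so it is not run here):
# model = load_4bit_model("your-org/your-7b-model")  # placeholder model ID
```

The weights are quantized as they load, so there is no separate conversion step. GGUF, GPTQ, and AWQ models are typically quantized ahead of time and distributed already compressed.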

🔨 Built from scratch (scale = range/levels → round → dequantize → measure error+size) on the page: https://dev48v.infy.uk/ai/days/day20-quantization.html

Part of AIFromZero. 🌐 https://dev48v.infy.uk
