Rijul Rajesh
From FP16 to Q4: Understanding Quantization in Ollama

If you run your own local models, you might have heard of the term quantization.

What is quantization?

A normal LLM stores its weights as float32 (FP32) or float16 (FP16) numbers.

Quantization means storing (and computing with) those weights using fewer bits.

For example:

  • FP16 – 16 bits
  • INT8 – 8 bits
  • INT4 – 4 bits
  • INT2 – 2 bits

So if we take a number like:

0.12345678 (32-bit float)

We can approximate it as:

0.12 (8-bit/4-bit)
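To make that concrete, here is a minimal Python sketch of the basic idea behind integer quantization: map floats onto a small integer range with a scale factor, then map them back. The values and scheme are purely illustrative; Ollama's GGUF formats use more sophisticated block-wise variants of this.

```python
import numpy as np

# Toy symmetric int8 quantization: scale floats into [-127, 127] integers,
# then dequantize back. This is the core idea; real formats (Q4_K, Q8_0, ...)
# apply it per block of weights, storing a scale for each block.

weights = np.array([0.12345678, -0.5, 0.9, 0.003], dtype=np.float32)

scale = np.abs(weights).max() / 127            # one scale for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # quantize: 8-bit integers
dequant = q.astype(np.float32) * scale         # dequantize: approximate floats

print(q)        # e.g. [ 17 -71 127   0]
print(dequant)  # close to the originals, but not exact
```

The quantized values take a quarter of the memory of FP32, and the dequantized values are only approximately equal to the originals; that small loss of precision is the tradeoff quantization makes.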

Ollama quantization formats

Now let's look at how to read a model name and work out its quantization.

When you see model names like:

llama3:8b-q4_K_M
mistral:7b-q8_0

The suffix is the quantization format.

Here is a quick reference table:

| Format | Bits | Meaning                                        |
| ------ | ---- | ---------------------------------------------- |
| Q2     | ~2   | Extreme compression, noticeable quality loss   |
| Q4_0   | 4    | Fast, lower quality                            |
| Q4_K   | 4    | k-quant (block-wise) scheme, better than Q4_0  |
| Q4_K_M | 4    | Best Q4 size/quality tradeoff                  |
| Q5_K_M | 5    | Better quality, more RAM                       |
| Q6_K   | 6    | Near-FP16 quality                              |
| Q8_0   | 8    | Very high quality                              |
| FP16   | 16   | Almost original (unquantized half precision)   |
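To get a feel for why the bit count matters, here is a rough back-of-the-envelope sketch in Python of how much memory the weights alone need at different bit widths. The numbers are approximate: real GGUF formats store extra per-block scale metadata, and this ignores the KV cache and other runtime overhead.

```python
# Rough estimate: weight memory ≈ number of parameters * bits per weight / 8 bytes.
# Actual files are slightly larger because quant formats also store block scales.

def approx_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB of the weight tensors alone."""
    return num_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model (e.g. llama3:8b) at different nominal bit widths:
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4), ("Q2", 2)]:
    print(f"{name:7s} ~{approx_weight_memory_gb(8e9, bits):.1f} GB")
```

This is why an 8B model that needs roughly 16 GB in FP16 can fit comfortably on a machine with 8 GB of free memory once it is quantized to Q4.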

Wrapping up

Hopefully you now have a better understanding of what quantization means and what those suffixes actually tell you.

Running an LLM locally gives you many things to learn, and this is just one of them.

If you are looking to improve your workflow and get access to a good toolset, here is a suggestion for you.

If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.

👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools
