If you run your own local models, you might have heard of the term quantization.
What is quantization?
A standard LLM stores its weights as 32-bit floats (FP32) or 16-bit floats (FP16).
Quantization means storing (and often computing with) those weights using fewer bits.
For example:
- FP16 – 16 bits
- INT8 – 8 bits
- INT4 – 4 bits
- INT2 – 2 bits
So if we take a weight like:
0.12345678 (32-bit float)
we approximate it with something coarser, roughly:
0.12 (once it is stored in 8 or 4 bits)
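To make that concrete, here is a minimal sketch of symmetric int8 quantization in Python. The weight values and the resulting scale are made-up illustrative numbers, not taken from any real model, and real quantizers (like the ones behind Ollama's formats) work per block with extra metadata:

```python
import numpy as np

# A made-up slice of FP32 weights (illustrative values only)
weights_fp32 = np.array([0.12345678, -0.5, 0.9, -0.03], dtype=np.float32)

# Symmetric int8 quantization: map the largest absolute weight to 127
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)

# Dequantize to see how much precision was lost
weights_restored = weights_int8.astype(np.float32) * scale

print(weights_int8)      # e.g. [ 17 -71 127  -4]
print(weights_restored)  # close to, but not exactly, the originals
```

Each weight is now a single byte plus a shared scale factor, which is where the memory savings come from; the small rounding error is the quality you trade away.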
Ollama quantization formats
Now let's look at how to read a model name and figure out its quantization.
When you see model names like:
llama3:8b-q4_K_M
mistral:7b-q8_0
The suffix is the quantization format.
Here is a table to help you read them:
| Format | Bits | Meaning |
|---|---|---|
| Q2 | ~2 | Extreme compression, bad quality |
| Q4_0 | 4 | Fast, lower quality |
| Q4_K | 4 | K-quant (block-wise), better accuracy than Q4_0 |
| Q4_K_M | 4 | Best Q4 tradeoff |
| Q5_K_M | 5 | Better quality, more RAM |
| Q6_K | 6 | Near-FP16 quality |
| Q8_0 | 8 | Very high quality |
| FP16 | 16 | Almost original |
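The bit width also tells you roughly how much memory the weights need. Here is a back-of-envelope estimate, assuming an 8-billion-parameter model and treating each format as exactly its nominal bit width (real K-quants use slightly more per weight, and this ignores the KV cache and other overhead):

```python
# Rough weight-memory estimate for an 8B-parameter model (assumed size),
# ignoring KV cache, activations, and per-block quantization metadata.
params = 8_000_000_000

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4), ("Q2", 2)]:
    gib = params * bits / 8 / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")

# FP16: ~14.9 GiB, Q8_0: ~7.5 GiB, Q4_K_M: ~3.7 GiB, Q2: ~1.9 GiB
```

This is why a Q4 model often fits comfortably on a machine where the FP16 version would not.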
Wrapping up
Hopefully you now have a basic understanding of what quantization means and what those suffixes actually tell you.
Running an LLM locally exposes you to a lot of new concepts, and this is just one of them.
If you are looking to improve your workflow and get access to a solid toolset, here is a suggestion for you.
If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.
👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools
