Rijul Rajesh
From FP16 to Q4: Understanding Quantization in Ollama

If you run your own local models, you might have heard of the term quantization.

What is quantization?

A normal LLM stores its weights as float32 (FP32) or float16 (FP16) numbers.

Quantization means storing (and computing with) those weights using fewer bits.

For example:

  • FP16 – 16 bits
  • INT8 – 8 bits
  • INT4 – 4 bits
  • INT2 – 2 bits

So if we take a number like:

0.12345678 (32-bit float)

We can approximate it as:

0.12 (8-bit/4-bit)
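To make that concrete, here is a minimal Python sketch of the basic idea behind integer quantization: map floats onto a small integer range with a scale factor, then map them back. The values and scheme are purely illustrative; Ollama's GGUF formats use more sophisticated block-wise variants of this.

```python
import numpy as np

# Toy symmetric int8 quantization: scale floats into [-127, 127] integers,
# then dequantize back. This is the core idea; real formats (Q4_K, Q8_0, ...)
# apply it per block of weights, storing a scale for each block.

weights = np.array([0.12345678, -0.5, 0.9, 0.003], dtype=np.float32)

scale = np.abs(weights).max() / 127            # one scale for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # quantize: 8-bit integers
dequant = q.astype(np.float32) * scale         # dequantize: approximate floats

print(q)        # e.g. [ 17 -71 127   0]
print(dequant)  # close to the originals, but not exact
```

The quantized values take a quarter of the memory of FP32, and the dequantized values are only approximately equal to the originals; that small loss of precision is the tradeoff quantization makes.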

Ollama quantization formats

Now let's look at how to read a model name and work out its quantization.

When you see model names like:

llama3:8b-q4_K_M
mistral:7b-q8_0

The suffix is the quantization format.

Here is a quick reference table:

| Format | Bits | Meaning                                        |
| ------ | ---- | ---------------------------------------------- |
| Q2     | ~2   | Extreme compression, noticeable quality loss   |
| Q4_0   | 4    | Fast, lower quality                            |
| Q4_K   | 4    | k-quant (block-wise) scheme, better than Q4_0  |
| Q4_K_M | 4    | Best Q4 size/quality tradeoff                  |
| Q5_K_M | 5    | Better quality, more RAM                       |
| Q6_K   | 6    | Near-FP16 quality                              |
| Q8_0   | 8    | Very high quality                              |
| FP16   | 16   | Almost original (unquantized half precision)   |
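To get a feel for why the bit count matters, here is a rough back-of-the-envelope sketch in Python of how much memory the weights alone need at different bit widths. The numbers are approximate: real GGUF formats store extra per-block scale metadata, and this ignores the KV cache and other runtime overhead.

```python
# Rough estimate: weight memory ≈ number of parameters * bits per weight / 8 bytes.
# Actual files are slightly larger because quant formats also store block scales.

def approx_weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size in GB of the weight tensors alone."""
    return num_params * bits_per_weight / 8 / 1e9

# An 8B-parameter model (e.g. llama3:8b) at different nominal bit widths:
for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q6_K", 6), ("Q4_K_M", 4), ("Q2", 2)]:
    print(f"{name:7s} ~{approx_weight_memory_gb(8e9, bits):.1f} GB")
```

This is why an 8B model that needs roughly 16 GB in FP16 can fit comfortably on a machine with 8 GB of free memory once it is quantized to Q4.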

Wrapping up

Hopefully you now have a better understanding of what quantization means and what those suffixes actually tell you.

Running an LLM locally gives you many things to learn, and this is just one of them.

If you are looking to improve your workflow and get access to a good toolset, here is a suggestion for you.

If you’ve ever struggled with repetitive tasks, obscure commands, or debugging headaches, this platform is here to make your life easier. It’s free, open-source, and built with developers in mind.

👉 Explore the tools: FreeDevTools
👉 Star the repo: freedevtools
