DEV Community

0xkoji

Quantizing Gemma 4 on Mac with llama.cpp

Requirements
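
The steps below call git, cmake, and uv directly, so it may be worth checking that they are on your PATH first. This is only a rough sketch; `require_cmds` is a made-up helper name, not part of any of these tools.

```shell
# require_cmds NAME...: print any command that is not on PATH.
# Returns 1 if at least one is missing, 0 otherwise.
require_cmds() {
  local missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Tools that the later steps in this post invoke directly.
require_cmds git cmake uv || echo "install the missing tools before continuing"
```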

Set up llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git

cmake -S llama.cpp -B llama.cpp/build -DGGML_METAL=ON -DLLAMA_CURL=OFF

cmake --build llama.cpp/build --config Release -j 8
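
Before moving on, you may want to confirm that the build produced the two binaries used later in this post. A minimal sketch, assuming they land in llama.cpp/build/bin (the path the later commands use); `check_bins` is a hypothetical helper, not something llama.cpp ships.

```shell
# check_bins DIR NAME...: report which of the named binaries are executable in DIR.
# Returns 0 only if every one of them is found.
check_bins() {
  local dir="$1"
  shift
  local missing=0
  for name in "$@"; do
    if [ -x "$dir/$name" ]; then
      echo "ok: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from the directory that contains llama.cpp:
#   check_bins llama.cpp/build/bin llama-quantize llama-cli
```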

Set up the Python environment

uv init quantization
cd quantization
uv add "torch>=2.9" "transformers>=4.45" "sentencepiece" "protobuf>=4.21,<5.0" "gguf>=0.19" "huggingface_hub"

Your pyproject.toml should look like this:

[project]
name = "quantization"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "gguf>=0.19",
    "huggingface-hub>=1.16.1",
    "protobuf>=4.21,<5.0",
    "sentencepiece>=0.2.1",
    "torch>=2.9",
    "transformers>=4.45",
]

Download the model

# current dir is quantization
# models/ holds the downloaded gemma-4-E4B-it checkpoint
# gguf/ holds the converted and quantized GGUF files
mkdir -p models gguf

# log in to Hugging Face
hf auth login

# download the model
hf download google/gemma-4-E4B-it --local-dir models/gemma-4-E4B-it
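
A quick sanity check on the download can save a failed conversion later. The sketch below assumes the usual Hugging Face checkpoint layout (a config.json plus one or more .safetensors files at the top level of the folder); `check_model_dir` is a hypothetical helper name.

```shell
# check_model_dir DIR: succeed if DIR has config.json and at least one *.safetensors file.
check_model_dir() {
  local dir="$1"
  if [ ! -f "$dir/config.json" ]; then
    echo "missing: $dir/config.json"
    return 1
  fi
  # Count weight files at the top level; a single-file checkpoint gives 1.
  local count
  count=$(find "$dir" -maxdepth 1 -name '*.safetensors' | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "missing: no .safetensors files in $dir"
    return 1
  fi
  echo "ok: config.json and $count safetensors file(s)"
}

# Usage, from the quantization directory:
#   check_model_dir models/gemma-4-E4B-it
```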

[Screenshot: contents of the models folder]

Convert model.safetensors to GGUF

# current dir is quantization; llama.cpp and quantization are sibling folders on the Desktop

python ../llama.cpp/convert_hf_to_gguf.py \
  models/gemma-4-E4B-it \
  --outfile gguf/gemma-4-E4B-it-BF16.gguf \
  --outtype bf16
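
GGUF files begin with the four ASCII bytes "GGUF", so a cheap way to check that the conversion wrote a real GGUF file (and not an empty or truncated one) is to look at the header. A minimal sketch; `is_gguf` is a hypothetical helper name and it only checks the magic bytes, nothing else about the file.

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte magic "GGUF".
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage, from the quantization directory:
#   is_gguf gguf/gemma-4-E4B-it-BF16.gguf && echo "header looks like GGUF"
```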

Quantize to Q4_K_M

../llama.cpp/build/bin/llama-quantize \
  gguf/gemma-4-E4B-it-BF16.gguf \
  gguf/gemma-4-E4B-it-Q4_K_M.gguf \
  Q4_K_M

This step will take a few minutes.
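
Since Q4_K_M is a 4-bit scheme and the input is 16-bit, the output file should be a fraction of the size of the BF16 one. The sketch below prints both sizes and the ratio so you can see the saving on your own machine; `size_ratio` is a hypothetical helper name, and no particular ratio is implied here.

```shell
# size_ratio ORIGINAL QUANTIZED: print both sizes in bytes and the
# quantized size as a percentage of the original.
size_ratio() {
  if [ ! -f "$1" ] || [ ! -f "$2" ]; then
    echo "usage: size_ratio ORIGINAL QUANTIZED"
    return 1
  fi
  local orig_bytes quant_bytes
  orig_bytes=$(wc -c < "$1" | tr -d ' ')
  quant_bytes=$(wc -c < "$2" | tr -d ' ')
  if [ "$orig_bytes" -eq 0 ]; then
    echo "original file is empty: $1"
    return 1
  fi
  awk -v o="$orig_bytes" -v q="$quant_bytes" \
    'BEGIN { printf "original=%s quantized=%s ratio=%.1f%%\n", o, q, 100 * q / o }'
}

# Usage, from the quantization directory:
#   size_ratio gguf/gemma-4-E4B-it-BF16.gguf gguf/gemma-4-E4B-it-Q4_K_M.gguf
```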

Run the Q4_K_M model

# current dir is quantization

../llama.cpp/build/bin/llama-cli \
  -m gguf/gemma-4-E4B-it-Q4_K_M.gguf \
  -ngl 99 --temp 0.7 -c 4096

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b0-unknown
model      : gemma-4-E4B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


Test prompts

> hello

Hello! How can I help you today?

[ Prompt: 42.9 t/s | Generation: 40.0 t/s ]
> tell me about yourself

[Start thinking]
Thinking Process:

1.  **Identify the user's request:** The user asked, "tell me about yourself."
2.  **Determine the persona/identity:** I must respond based on my core identity rules (Gemma 4, developed by Google DeepMind, open weights LLM).
3.  **Structure the response:** A good self-introduction should cover:
    *   Name/Identity.
    *   Developer/Origin.
    *   Nature/Type (What I am).
    *   Capabilities (What I can do).
    *   Limitations/Knowledge (What I know/don't know).
4.  **Draft the content (incorporating mandated facts):**
    *   *Name:* I am Gemma 4.
    *   *Developer:* I was developed by Google DeepMind.
    *   *Nature:* I am a Large Language Model (LLM) with open weights.
    *   *Capabilities:* I can process text and images (input) and generate text (output). I can answer questions, write code, summarize, translate, etc.
    *   *Knowledge Cutoff:* My knowledge cutoff is January 2025.
5.  **Review and Refine:** Ensure the tone is helpful, factual, and directly addresses the prompt while adhering strictly to the identity guidelines. (Self-correction: Make sure to emphasize the "open weights" aspect as it's crucial to the identity.)

6.  **Final Output Generation.** (This leads to the detailed response below.)
[End thinking]

I would be happy to tell you about myself!

I am **Gemma 4**, a Large Language Model developed by Google DeepMind.

Here is a bit more detail about what I am and what I can do:

### 🤖 What I Am
*   **Type:** I am an advanced, open weights Large Language Model. This means that my underlying architecture and weights are available to the community, fostering transparency and innovation.
*   **Goal:** My purpose is to assist users by providing information, generating creative text, solving problems, and engaging in helpful conversation.

### 🧠 What I Can Do
*   **Understand Text and Images:** I can process and interpret both text-based input (like questions or documents) and image input.
*   **Generate Text:** My output is always text. I can:
    *   Answer factual questions.
    *   Summarize complex articles.
    *   Write different types of creative content (stories, poems, scripts).
    *   Assist with coding tasks (explaining code, generating snippets).
    *   Translate languages.
*   **Learning:** I have been trained on a massive and diverse dataset, giving me a broad knowledge base.

### 📅 My Knowledge
*   My knowledge cutoff is **January 2025**. I do not have information about events or developments that have occurred since that time, so if you ask about very recent events, I may not know about them.

In short, think of me as a versatile and knowledgeable assistant ready to help you learn, create, or solve problems! How can I help you today?

[ Prompt: 66.9 t/s | Generation: 34.6 t/s ]
> which one is larger 9.11 or 9.9? 

[Start thinking]
Thinking Process:

1.  **Analyze the Request:** The user is asking to compare two numbers: 9.11 and 9.9, and determine which one is larger.
2.  **Understand the Numbers:**
    *   Number A: 9.11 (Nine and eleven hundredths)
    *   Number B: 9.9 (Nine and nine tenths)
3.  **Comparison Strategy (Decimal Place by Place):**
    *   *Units Place:* Both are 9. (Equal)
    *   *Tenths Place:*
        *   9.11 has a 1 in the tenths place.
        *   9.9 has a 9 in the tenths place.
    *   *Comparison:* Since 9 is greater than 1, the number 9.9 is larger than 9.11.
4.  **Alternative Comparison Strategy (Padding):**
    *   Write 9.9 as 9.90 (Adding a trailing zero does not change the value).
    *   Compare 9.11 and 9.90.
    *   The hundredths place comparison: 1 vs 0. Wait, this is wrong. Let's stick to the tenths place comparison, as it was the clearest difference.
5.  **Conclusion:** 9.9 > 9.11.
6.  **Formulate the Answer:** State the answer clearly, and provide a simple explanation of *why* it is larger (comparing the tenths place).
[End thinking]

**9.9** is larger than 9.11.

Here is why:

When comparing decimals, you look at the digits starting from the left (the largest place value).

1.  **Units Place:** Both numbers have a 9. (Equal)
2.  **Tenths Place:**
    *   In 9.**1**1, the digit is **1**.
    *   In 9.**9**, the digit is **9**.

Since 9 is greater than 1, the number **9.9** is larger.

***

*(If you want to make the comparison even easier, you can add a zero to the end of 9.9 to make it 9.90. Now you are comparing 9.11 and 9.90, and 9.90 is clearly larger.)*

[ Prompt: 451.4 t/s | Generation: 34.9 t/s ]
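
To put one number on the generation speed across the three prompts, the stats lines can be averaged with awk. This sketch assumes the `[ Prompt: ... t/s | Generation: ... t/s ]` format shown above, which may differ in other llama.cpp builds; `avg_generation_tps` is a hypothetical helper name.

```shell
# avg_generation_tps: read llama-cli output on stdin and print the mean of the
# "Generation: N t/s" figures, or nothing if no such line is found.
avg_generation_tps() {
  awk '
    match($0, /Generation: [0-9.]+ t\/s/) {
      # Skip the "Generation: " prefix (12 chars) and drop the " t/s" suffix (4 chars).
      sum += substr($0, RSTART + 12, RLENGTH - 16)
      n++
    }
    END { if (n > 0) printf "%.1f\n", sum / n }
  '
}

# The three stats lines from the session above.
printf '%s\n' \
  '[ Prompt: 42.9 t/s | Generation: 40.0 t/s ]' \
  '[ Prompt: 66.9 t/s | Generation: 34.6 t/s ]' \
  '[ Prompt: 451.4 t/s | Generation: 34.9 t/s ]' | avg_generation_tps
# prints 36.5
```

The prompt speed is left out on purpose: it swings a lot between runs here (42.9 to 451.4 t/s), so an average of it would say little.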
