DEV Community

0xkoji

Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4

I tested Gemma-4 E2B-it with llama.cpp on a Raspberry Pi 4.

How to convert Gemma-4 E2B-it to GGUF

models
https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M

ggml-org / llama.cpp: LLM inference in C/C++

Step 1 clone the repo

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Step 2 build

cmake -B build -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release

Step 3 run the model

The command is run from inside the llama.cpp folder, with gemma-4-E2B-it-Q4_K_M.gguf placed in a sibling models folder.

Folder structure

llama.cpp   models

./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context/intent:** This is a standard social greeting.
3.  **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).
4.  **Refine the response:** Keep it open-ended and welcoming.

*Self-Correction/Refinement:* A simple "hello" back is fine, but adding a follow-up makes the interaction more engaging.

5.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
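For reference, here is a sketch of how the run command in this step breaks down flag by flag. The variable names are hypothetical and exist only to carry the comments; the values are the ones used in the post. With `-t 4` and `-tb 4`, all four cores of the Pi 4 are used for both generation and prompt processing. The script only prints the assembled command unless the last line is uncommented.

```shell
MODEL=../models/gemma-4-E2B-it-Q4_K_M.gguf  # path relative to the llama.cpp folder
THREADS=4        # -t:  CPU threads for token generation
BATCH_THREADS=4  # -tb: CPU threads for prompt (batch) processing
CTX=2048         # -c:  context window size in tokens

# -fa auto  lets llama.cpp decide whether to use flash attention
# --prio 3  requests the highest process priority level
# -p        sets the initial prompt
set -- ./build/bin/llama-cli -m "$MODEL" -t "$THREADS" -tb "$BATCH_THREADS" \
  -c "$CTX" -fa auto --prio 3 -p "hello"

echo "$@"   # dry run: print the assembled command
# "$@"      # uncomment to actually run it from the llama.cpp folder
```

Keeping the values in variables makes it easier to repeat the same measurement with a different thread count or context size.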

Step 4 build with clang

sudo apt install -y clang
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_ARM_NEON=ON

cmake --build build --config Release -j
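Note that installing clang does not by itself make CMake use it: unless a compiler is named, CMake usually keeps the system default (typically GCC on Raspberry Pi OS). Below is a minimal sketch of a configure step that selects clang explicitly through the standard `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` variables. The function name and the `build-clang` directory are hypothetical, and the sketch assumes `clang` and `clang++` are on the PATH. It also limits parallel jobs to the core count, which may be gentler on the Pi's memory than a bare `-j`.

```shell
# Configure and build llama.cpp with clang selected explicitly.
# The build directory defaults to the hypothetical name "build-clang"
# so it does not overwrite an existing GCC build.
build_with_clang() {
  build_dir="${1:-build-clang}"
  cmake -B "$build_dir" -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ || return 1
  cmake --build "$build_dir" --config Release -j "$(nproc)" || return 1
  # Print the compiler CMake recorded, to confirm clang was picked up.
  grep CMAKE_CXX_COMPILER "$build_dir/CMakeCache.txt"
}

# Usage, from the llama.cpp folder:
#   build_with_clang
```

With this layout the binary would end up at ./build-clang/bin/llama-cli, so both builds can be timed against the same model file.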

Step 5 run the model

./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern


> hello

[Start thinking]
Thinking Process:

1.  **Analyze the input:** The input is "hello".
2.  **Determine the context:** This is a simple, friendly greeting.
3.  **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).
4.  **Draft potential responses:**
    *   "Hello!" (Too brief, but fine.)
    *   "Hi there." (Friendly.)
    *   "Hello! How can I help you today?" (Polite, proactive.)
    *   "Hello! What can I do for you?" (Direct, service-oriented.)
5.  **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.

6.  **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]

Result

Prompt ↗️ (1.3 → 2.4 t/s) but Generation ↘️ (1.8 → 1.5 t/s).
Unfortunately, at these speeds it is too slow to be usable for an agent.
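To put the two runs side by side, here is a rough sketch that computes the relative change between the measured rates and estimates how long a reply would take. The helper names are hypothetical. The 300-token reply length is an illustrative assumption, not a measurement, and the model's thinking trace counts toward generated tokens as well.

```shell
# Relative change from an old rate to a new rate, as a signed percentage.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

# Seconds needed to generate a number of tokens at a given rate (tokens/s).
gen_seconds() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", tokens / rate }'
}

pct_change 1.3 2.4    # prompt rate, first vs second build:     prints +84.6%
pct_change 1.8 1.5    # generation rate, first vs second build: prints -16.7%
gen_seconds 300 1.8   # 300 tokens at 1.8 t/s: prints 167
gen_seconds 300 1.5   # 300 tokens at 1.5 t/s: prints 200
```

At roughly three minutes per 300 generated tokens, a multi-step agent loop would spend most of its time waiting on generation.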

I also tried running LiquidAI/LFM2.5-8B-A1B-GGUF (from huggingface.co).

The result was Prompt: 0.3 t/s | Generation: 0.5 t/s ↘️

Conclusion

Raspberry Pi 5 costs around $305, so if you want to run an LLM with fewer than 10B parameters, it seems better to buy a mini PC with 16GB RAM in the $300–400 range.
