Tested Gemma-4 E2B-it on Raspberry Pi 4.
How to convert Gemma-4 E2B-it to GGUF
models
https://huggingface.co/baxin/gemma-4-E4B-it-E2B-it-Q4_K_M
llama.cpp
Step 1 clone the repo
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Step 2 build
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
Step 3 run the model
The command was run from the llama.cpp folder, and gemma-4-E2B-it-Q4_K_M.gguf is placed in the models folder next to it.
folder structure
llama.cpp/
models/
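Because the run command in this step uses a relative path (../models/...), it only works when started from inside the llama.cpp folder. A small pre-flight check can catch a misplaced model file before llama-cli starts. This is a minimal sketch; `check_model` is a hypothetical helper name and not part of llama.cpp.

```shell
#!/bin/sh
# Hypothetical helper: verify that a GGUF model file exists and is non-empty.
# Prints a short message and returns 0 on success, 1 otherwise.
check_model() {
  model_path="$1"
  if [ -s "$model_path" ]; then
    echo "ok: $model_path"
    return 0
  fi
  echo "missing: $model_path (run this from the llama.cpp folder)" >&2
  return 1
}

# Usage, from the llama.cpp folder:
#   check_model ../models/gemma-4-E2B-it-Q4_K_M.gguf \
#     && ./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -p "hello"
```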
./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"
build : b9425-0821c5fcf
model : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
> hello
[Start thinking]
Thinking Process:
1. **Analyze the input:** The input is "hello".
2. **Determine the context/intent:** This is a standard social greeting.
3. **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).
4. **Refine the response:** Keep it open-ended and welcoming.
*Self-Correction/Refinement:* A simple "hello" back is fine, but adding a follow-up makes the interaction more engaging.
5. **Final Output Generation.**
[End thinking]
Hello! How can I help you today?
[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
Step 4 build with clang
sudo apt install -y clang
rm -rf build
cmake -B build -DCMAKE_BUILD_TYPE=Release \
-DLLAMA_NATIVE=ON \
-DLLAMA_ARM_NEON=ON
cmake --build build --config Release -j
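Installing clang does not by itself make CMake use it: unless a compiler is named, CMake picks the system default (usually GCC on Raspberry Pi OS). A minimal sketch of how one might select clang explicitly follows, assuming clang and clang++ are on PATH. `clang_cmake_flags` is a hypothetical helper name; `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` are standard CMake variables.

```shell
#!/bin/sh
# Hypothetical helper: print CMake flags that select clang, but only when both
# clang and clang++ are found on PATH. Prints nothing otherwise.
clang_cmake_flags() {
  if command -v clang >/dev/null 2>&1 && command -v clang++ >/dev/null 2>&1; then
    echo "-DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++"
  fi
}

# Usage, from the llama.cpp folder (start from a clean build directory so the
# compiler choice is not taken from an old cache):
#   rm -rf build
#   cmake -B build -DCMAKE_BUILD_TYPE=Release $(clang_cmake_flags)
#   cmake --build build --config Release -j
# Afterwards, check which compiler was actually used:
#   grep CMAKE_CXX_COMPILER: build/CMakeCache.txt
```

Checking CMakeCache.txt is worthwhile because a benchmark comparison between compilers only means something if the second build really used a different compiler.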
Step 5 run the model
./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"
build : b9425-0821c5fcf
model : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
> hello
[Start thinking]
Thinking Process:
1. **Analyze the input:** The input is "hello".
2. **Determine the context:** This is a simple, friendly greeting.
3. **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).
4. **Draft potential responses:**
* "Hello!" (Too brief, but fine.)
* "Hi there." (Friendly.)
* "Hello! How can I help you today?" (Polite, proactive.)
* "Hello! What can I do for you?" (Direct, service-oriented.)
5. **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.
6. **Final Output Generation.**
[End thinking]
Hello! How can I help you today?
[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]
Result
Prompt speed went up (1.3 to 2.4 t/s), but generation speed went down (1.8 to 1.5 t/s).
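The change between the two runs can be quantified directly from the stats lines printed in Step 3 and Step 5. The sketch below parses those lines and prints the percentage change; `parse_tps` and `pct_change` are hypothetical helper names, and a single run per build is too little data to treat the numbers as precise.

```shell
#!/bin/sh
# Hypothetical helper: extract prompt and generation speeds from a llama-cli
# stats line such as "[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]".
parse_tps() {
  echo "$1" | sed -n 's/.*Prompt: *\([0-9.]*\) *t\/s.*Generation: *\([0-9.]*\) *t\/s.*/\1 \2/p'
}

# Percentage change from the first value to the second, rounded to an integer.
pct_change() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.0f%%\n", (b - a) / a * 100 }'
}

# Stats lines copied from the Step 3 and Step 5 runs.
set -- $(parse_tps "[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]"); p1=$1; g1=$2
set -- $(parse_tps "[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]"); p2=$1; g2=$2

echo "prompt: $(pct_change "$p1" "$p2")"       # prints "prompt: +85%"
echo "generation: $(pct_change "$g1" "$g2")"   # prints "generation: -17%"
```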
Unfortunately, at these speeds it is too slow to use for an agent.
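A rough back-of-the-envelope estimate shows why these speeds are a problem for agent use. The sketch below assumes a hypothetical agent step with a 1000-token prompt and a 300-token reply (illustrative numbers, not measurements) and plugs in the Step 5 speeds; `estimate_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# Estimated seconds for one step:
#   prompt_tokens / prompt_tps + generated_tokens / generation_tps
estimate_seconds() {
  awk -v pt="$1" -v ptps="$2" -v gt="$3" -v gtps="$4" \
    'BEGIN { printf "%.0f\n", pt / ptps + gt / gtps }'
}

# Assumed workload: 1000 prompt tokens, 300 generated tokens.
# Speeds: 2.4 t/s prompt, 1.5 t/s generation (from the Step 5 run).
estimate_seconds 1000 2.4 300 1.5   # prints 617
```

Under these assumptions a single step takes about ten minutes, and an agent typically chains several such steps.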
I also tried running LiquidAI/LFM2.5-8B-A1B-GGUF.
The result was Prompt: 0.3 t/s | Generation: 0.5 t/s.
Conclusion
A Raspberry Pi 5 costs around $305, so if you want to run an LLM with fewer than 10B parameters, it seems better to buy a mini PC with 16GB RAM in the $300–400 range.