llama.cpp

LLM inference in C/C++

Recent API changes

Hot topics

Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.
guide : using the new WebUI of llama.cpp
guide : running gpt-oss with llama.cpp
[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗
Support for the gpt-oss model with native MXFP4 format has been added | PR | Collaboration with NVIDIA | Comment
Multimodal support arrived in llama-server: #12898 | documentation
VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
Hugging Face Inference Endpoints now support GGUF out of the box! #9669
Hugging Face GGUF editor: discussion | tool
WebGPU support is now available in the browser, see a blog/demo introducing it here.

Quick start

…

./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello

[Start thinking]
Thinking Process:
1. **Analyze the input:** The input is "hello".
2. **Determine the context/intent:** This is a standard social greeting.
3. **Formulate an appropriate response:** The response should be friendly, polite, and acknowledge the greeting. Standard responses include reciprocating the greeting and offering further interaction (e.g., asking how the user is or offering assistance).
4. **Refine the response:** Keep it open-ended and welcoming. *Self-Correction/Refinement:* A simple "hello" back is fine, but adding a follow-up makes the interaction more engaging.
5. **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
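The short flags in that command are easy to misread, so here is a small sketch that spells the same invocation with long option names. The paths are copied from the command in this post. The `llama_cmd` helper is a hypothetical name used only for illustration, and the per-flag notes are my reading of the llama.cpp help text, so check them against `llama-cli --help` for your build.

```shell
#!/bin/sh
# Paths as used in this post; adjust them to your own layout.
LLAMA_CLI=./build/bin/llama-cli
MODEL=../models/gemma-4-E2B-it-Q4_K_M.gguf

# Flag notes (assumed from llama.cpp's help text):
#   -m  / --model          path to the GGUF model file
#   -t  / --threads        threads used during generation
#   -tb / --threads-batch  threads used for prompt (batch) processing
#   -c  / --ctx-size       context window size in tokens
#   -fa / --flash-attn     Flash Attention mode; "auto" lets llama.cpp decide
#   --prio                 process priority; 3 is the highest (realtime) level
#   -p  / --prompt         initial prompt

# llama_cmd: hypothetical helper that only prints the command line (dry run).
llama_cmd() {
  echo "$LLAMA_CLI" --model "$MODEL" \
    --threads 4 --threads-batch 4 \
    --ctx-size 2048 --flash-attn auto \
    --prio 3 --prompt '"hello"'
}

llama_cmd
# To actually start the model: eval "$(llama_cmd)"
```

Printing the command first makes it easy to compare two invocations side by side before spending minutes on a run at these speeds.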

./build/bin/llama-cli -m ../models/gemma-4-E2B-it-Q4_K_M.gguf -t 4 -tb 4 -c 2048 -fa auto --prio 3 -p "hello"

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b9425-0821c5fcf
model      : gemma-4-E2B-it-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern

> hello

[Start thinking]
Thinking Process:
1. **Analyze the input:** The input is "hello".
2. **Determine the context:** This is a simple, friendly greeting.
3. **Formulate the response goal:** The response should be equally friendly, polite, and open-ended (inviting further conversation).
4. **Draft potential responses:**
   * "Hello!" (Too brief, but fine.)
   * "Hi there." (Friendly.)
   * "Hello! How can I help you today?" (Polite, proactive.)
   * "Hello! What can I do for you?" (Direct, service-oriented.)
5. **Select the best response:** A standard friendly greeting followed by an invitation to continue the interaction is usually best.
6. **Final Output Generation.**
[End thinking]

Hello! How can I help you today?

[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]

Result

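Between the two runs, prompt processing went up ↗️ (from 1.3 to 2.4 t/s) but generation went down ↘️ (from 1.8 to 1.5 t/s).
Unfortunately, at these speeds it doesn't work for an agent.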
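To put those numbers in perspective, the following sketch computes the relative change between the two runs and a rough wait time for one reply. The throughput values come from the two runs above. The 200-token reply length is an assumption chosen for illustration, not a measurement, and `pct_change` and `seconds_for` are hypothetical helper names.

```shell
#!/bin/sh
# pct_change OLD NEW: signed percentage change from OLD to NEW, one decimal.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f\n", (new - old) / old * 100 }'
}

# seconds_for TOKENS RATE: seconds needed to emit TOKENS at RATE tokens/s.
seconds_for() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

echo "prompt:     $(pct_change 1.3 2.4) %"                   # +84.6 %
echo "generation: $(pct_change 1.8 1.5) %"                   # -16.7 %
# Assumed reply length of 200 tokens (illustrative only).
echo "200 tokens at 1.5 t/s: $(seconds_for 200 1.5) s"       # 133 s
```

A reply that takes about two minutes to generate, before counting the model's thinking tokens, is the kind of latency that makes multi-step agent loops impractical on this hardware.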

I also tried running LiquidAI/LFM2.5-8B-A1B-GGUF.


The result was even slower: Prompt: 0.3 t/s | Generation: 0.5 t/s ↘️
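For comparing several runs, it can help to pull the two numbers out of the summary line instead of copying them by hand. This is a minimal sketch that assumes the summary keeps the `Prompt: X t/s | Generation: Y t/s` shape seen in this post; `parse_tps` is a hypothetical helper name, and the pattern also tolerates a missing space before `t/s`.

```shell
#!/bin/sh
# parse_tps: read llama-cli output on stdin and print "PROMPT GENERATION"
# (tokens per second) for every summary line found.
parse_tps() {
  sed -n 's/.*Prompt: *\([0-9.]*\) *t\/s *| *Generation: *\([0-9.]*\) *t\/s.*/\1 \2/p'
}

# The three results reported in this post.
parse_tps <<'EOF'
[ Prompt: 1.3 t/s | Generation: 1.8 t/s ]
[ Prompt: 2.4 t/s | Generation: 1.5 t/s ]
[ Prompt: 0.3 t/s | Generation: 0.5 t/s ]
EOF
```

In practice the input would be a saved log, for example `parse_tps < run.log`, where `run.log` is a placeholder file name.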

DEV Community

Run Gemma-4 E2B-it with llama.cpp on Raspberry Pi4


ggml-org / llama.cpp

LLM inference in C/C++

llama.cpp

Recent API changes

Hot topics

Quick start

Step 1 clone the repo

Step 2 build

Step 3 run the model

Step 4 build with `clang`

Step 5 run the model

Result

LiquidAI/LFM2.5-8B-A1B-GGUF · Hugging Face

Conclusion
