## Part 1: Config

- GPU: AMD Radeon RX 7800 XT
- Driver version: 25.30.27.02-260217a-198634C-AMD-Software-Adrenalin-Edition
- llama.cpp SHA: ecd99d6a9acbc436bad085783bcd5d0b9ae9e9e9
- OS: Windows 11 (10.0.26200 Build 26200)
- Ubuntu version: 24.04

Consult the ROCm compatibility matrix (linked in Part 4) to confirm that your ROCm version, GPU, GFX driver, and Ubuntu version are a supported combination.
## Part 2: CPU Inference Baseline
Set up WSL and the Ubuntu VM:

```powershell
wsl --install -d Ubuntu-24.04
```

Launch "Ubuntu" from the Windows Start menu.
Grab some utilities:

```shell
sudo apt update
sudo apt install -y git build-essential cmake curl
```
Clone llama.cpp:

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

ecd99d6a9acb was the latest commit at the time of writing; for maximum reproducibility you can pin it with `git checkout ecd99d6a9acbc436bad085783bcd5d0b9ae9e9e9`.
Grab the model:

```shell
cd models
curl -L -o mistral.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
cd ..
```
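A quick sanity check that the download is a real model file and not an HTML error page: GGUF files begin with the 4-byte ASCII magic `GGUF`.

```shell
# Print the first four bytes of the file; a valid GGUF model starts with "GGUF"
head -c 4 models/mistral.gguf; echo
```

If you see HTML instead, the Hugging Face URL likely redirected to an error or login page.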
Build llama.cpp:

```shell
cmake -B build
cmake --build build --config Release
```
Do CPU inference:

```shell
./build/bin/llama-cli -m models/mistral.gguf
```
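llama.cpp picks a CPU thread count automatically, but on a known chip like the Ryzen 5 3600 it can be worth setting it explicitly. A sketch, assuming `llama-cli`'s `-t` flag (number of CPU threads) and using `nproc` to read the core count the WSL VM exposes:

```shell
# Run CPU inference with one thread per core visible to the VM
./build/bin/llama-cli -m models/mistral.gguf -t "$(nproc)"
```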
## Part 3: GPU Acceleration
Install ROCm:

```shell
sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
amdgpu-install -y --usecase=wsl,rocm --no-dkms
```
Check your ROCm install:

```shell
rocminfo | grep "gfx"
```

You should see output confirming that ROCm detects your GPU.
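The exact GFX target name matters if you later need to pin the build architecture, so it is handy to pull it out of the `rocminfo` output on its own:

```shell
# Print just the first GFX target name rocminfo reports (e.g. gfx1101)
rocminfo | grep -o 'gfx[0-9a-f]\+' | head -n 1
```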
Build llama.cpp with HIP support:

```shell
rm -rf build
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
```
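If CMake selects the wrong GPU architecture, the target can be pinned explicitly. This is a sketch under the assumption that llama.cpp's HIP build honors the `AMDGPU_TARGETS` CMake variable; `gfx1101` is the target `rocminfo` reports for the RX 7800 XT:

```shell
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1101
cmake --build build --config Release
```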
Do inference on the GPU. `-ngl 999` offloads up to 999 model layers to the GPU; any value larger than the model's layer count effectively means "offload everything":

```shell
./build/bin/llama-cli -m models/mistral.gguf -ngl 999
```
```text
nick@NickWiseman-PC:~/llama/llama.cpp$ ./build/bin/llama-cli -m models/mistral.gguf -ngl 999
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32
Loading model...

  ▄▄ ▄▄
  ██ ██
  ██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
  ██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
  ██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                      ██    ██
                                      ▀▀    ▀▀

build      : b8199-d969e933e
model      : mistral.gguf
modalities : text

available commands:
  /exit or Ctrl+C    stop or exit
  /regen             regenerate the last response
  /clear             clear the chat history
  /read              add a text file

> Write a short love poem

In the quiet of the moonlit night,
Two hearts entwined, a tender sight,
A dance of souls in gentle grace,
In love's sweet embrace, we find our place.

Your eyes, a mirror to my own,
Reflecting passion, love, and home,
Your voice, a melody that sings,
In every beat, my heart takes wings.

Together we weave a tapestry,
Of promises, of memories,
A bond that's woven strong and bright,
A love that shines, a beacon of light.

In this moment, in this stolen time,
Our hearts unite, two souls entwined,
A love so pure, a love so true,
A love that's mine, a love that's you.

[ Prompt: 149.0 t/s | Generation: 79.7 t/s ]
```
Note the line confirming that the gfx1101 device is being used.
Mistral 7B Inference Perf Comparison
| Device | Prompt Speed (tok/sec) | Generation Speed (tok/sec) |
|---|---|---|
| AMD Ryzen 5 3600 (CPU) | 1.5 | 6.4 |
| AMD Radeon RX 7800 XT (HIP / ROCm) | 149.0 | 79.7 |
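The speedup implied by the table works out to roughly 99x on prompt processing and 12x on generation:

```shell
# Compute GPU-over-CPU speedups from the measured tokens/sec
awk 'BEGIN { printf "prompt: %.1fx, generation: %.1fx\n", 149.0/1.5, 79.7/6.4 }'
# prints: prompt: 99.3x, generation: 12.5x
```

Generation speed gains less than prompt speed because token generation is memory-bandwidth bound, while prompt processing is compute bound and parallelizes well on the GPU.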
## Part 4: Resources

- llama.cpp (LLM inference in C/C++): https://github.com/ggml-org/llama.cpp

