DEV Community

Nicholas Wiseman


Local LLM Inference on Windows 11 and AMD GPU using WSL and llama.cpp

Part 1: Config

GPU: AMD Radeon RX 7800 XT
Driver Version: 25.30.27.02-260217a-198634C-AMD-Software-Adrenalin-Edition
llama.cpp SHA: ecd99d6a9acbc436bad085783bcd5d0b9ae9e9e9
OS: Windows 11 (10.0.26200 Build 26200)
Ubuntu version: 24.04

Consult the ROCm compatibility matrix (linked in Part 4) to confirm that your ROCm version, GPU, graphics driver, and Ubuntu version are a supported combination.

Part 2: CPU Inference Baseline

Set up WSL and an Ubuntu VM:

```
wsl --install -d Ubuntu-24.04
```

Launch "Ubuntu" from the Windows Start menu.

Grab some utilities

```
sudo apt update
sudo apt install -y git build-essential cmake curl
```

Clone llama.cpp

```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

Commit ecd99d6a9acb was the latest at the time of writing; run git checkout ecd99d6a9acb for maximum reproducibility.

Grab the model

```
cd models
curl -L -o mistral.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
cd ..
```
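The download is several gigabytes, so it's worth a cheap sanity check before building anything. One simple heuristic (an assumption about the file format, not an official tool): valid GGUF files start with the four ASCII magic bytes "GGUF", while a failed download often leaves behind an HTML error page.

```shell
# Quick sanity check: a valid GGUF file begins with the 4-byte magic "GGUF".
check_gguf() {
  if [ "$(head -c4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "looks like a GGUF file"
  else
    echo "not a GGUF file"
  fi
}
check_gguf models/mistral.gguf
```

If this prints "not a GGUF file", inspect the first few hundred bytes with head to see whether you downloaded an error page instead of the model.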

Build llama.cpp

```
cmake -B build
cmake --build build --config Release
```
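The build can take a while on a single job; cmake forwards a -j flag to the underlying build tool, and nproc (from GNU coreutils) reports the CPU count, so a common variant of the build step is:

```
cmake -B build
cmake --build build --config Release -j"$(nproc)"
```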

Do CPU inference

```
./build/bin/llama-cli -m models/mistral.gguf
```
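With no extra arguments, llama-cli drops into an interactive chat. For a scripted smoke test you can pass a prompt and a token limit directly; -p (prompt), -n (tokens to generate), and -t (threads) are standard llama-cli flags, though the values here are just an example:

```
./build/bin/llama-cli -m models/mistral.gguf \
  -p "Explain WSL in one sentence." -n 64 -t "$(nproc)"
```

Depending on your llama.cpp revision, you may also need -no-cnv to opt out of conversation mode for a one-shot completion; check llama-cli --help for your build.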

Part 3: GPU Acceleration

Install ROCm

```
sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
amdgpu-install -y --usecase=wsl,rocm --no-dkms
```

Check your ROCm install

```
rocminfo | grep "gfx"
```

You should see output confirming that ROCm detects your GPU, e.g. a line naming gfx1101 for the RX 7800 XT.
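rocminfo prints a lot of output; if you only want the architecture string, a grep one-liner works. The sample variable below stands in for one line of real rocminfo output (on a live system you would pipe rocminfo itself):

```shell
# Extract gfx architecture names from rocminfo-style output.
sample='  Name:                    gfx1101'
echo "$sample" | grep -o 'gfx[0-9a-f]*' | sort -u
```

Against the live tool the equivalent is: rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u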

Build llama.cpp with HIP support

```
rm -rf build
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
```
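By default the HIP build may compile kernels for several GPU architectures, which lengthens the build. llama.cpp's build documentation describes pinning the build to a single target; the variable has been AMDGPU_TARGETS in those docs, though newer trees may spell it GPU_TARGETS, so check docs/build.md in your checkout. A variant of the build above, assuming the gfx1101 target from Part 1:

```
rm -rf build
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1101
cmake --build build --config Release
```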

Run inference on the GPU

```
./build/bin/llama-cli -m models/mistral.gguf -ngl 999
```

```
nick@NickWiseman-PC:~/llama/llama.cpp$ ./build/bin/llama-cli -m models/mistral.gguf -ngl 999
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32

Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8199-d969e933e
model      : mistral.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file

> Write a short love poem

 In the quiet of the moonlit night,
Two hearts entwined, a tender sight,
A dance of souls in gentle grace,
In love's sweet embrace, we find our place.
Your eyes, a mirror to my own,
Reflecting passion, love, and home,
Your voice, a melody that sings,
In every beat, my heart takes wings.
Together we weave a tapestry,
Of promises, of memories,
A bond that's woven strong and bright,
A love that shines, a beacon of light.
In this moment, in this stolen time,
Our hearts unite, two souls entwined,
A love so pure, a love so true,
A love that's mine, a love that's you.

[ Prompt: 149.0 t/s | Generation: 79.7 t/s ]
```
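That closing bracketed line is the throughput summary. If you want to log numbers across runs, the two figures can be scraped out of it with standard text tools; this is a plain shell sketch, not a llama.cpp feature:

```shell
# Extract prompt and generation tok/s from llama-cli's summary line as CSV.
line='[ Prompt: 149.0 t/s | Generation: 79.7 t/s ]'
echo "$line" | grep -oE '[0-9]+\.[0-9]+' | paste -sd, -
```

This prints 149.0,79.7 for the run above; appending it to a CSV file gives a cheap benchmark log.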

Note the ggml_cuda_init line confirming that the gfx1101 device is in use.
Mistral 7B Inference Perf Comparison

| Device | Prompt Speed (tok/sec) | Generation Speed (tok/sec) |
| --- | --- | --- |
| AMD Ryzen 5 3600 (CPU) | 1.5 | 6.4 |
| AMD Radeon RX 7800 XT (HIP / ROCm) | 149.0 | 79.7 |
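The raw numbers in the table work out to roughly a 99x prompt-processing and 12x generation speedup; awk can do the arithmetic:

```shell
# Speedups implied by the table above (GPU figure divided by CPU figure).
awk 'BEGIN {
  printf "prompt speedup:     %.0fx\n", 149.0 / 1.5
  printf "generation speedup: %.1fx\n", 79.7 / 6.4
}'
```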

Part 4: Resources

Set up a WSL development environment | Microsoft Learn — a step-by-step guide to running Ubuntu, Visual Studio Code or Visual Studio, Git, Windows Credential Manager, MongoDB, MySQL, Docker remote containers, and more. (learn.microsoft.com)

ggml-org/llama.cpp — LLM inference in C/C++. (github.com)
TheBloke/Mistral-7B-Instruct-v0.2-GGUF at main — quantized GGUF builds of the model used above. (huggingface.co)

Compatibility matrix | ROCm documentation — supported ROCm, GPU, driver, and OS combinations. (rocm.docs.amd.com)

