DEV Community

Nicholas Wiseman


Local LLM Inference on Windows 11 and AMD GPU using WSL and llama.cpp

Part 1: Config

GPU: AMD Radeon RX 7800 XT
Driver Version: 25.30.27.02-260217a-198634C-AMD-Software-Adrenalin-Edition
llama.cpp SHA: ecd99d6a9acbc436bad085783bcd5d0b9ae9e9e9
OS: Windows 11 (10.0.26200 Build 26200)
Ubuntu version: 24.04

Consult the ROCm compatibility matrix (linked in Part 4) to confirm that your ROCm version, GPU, graphics driver, and Ubuntu version are a supported combination.

Part 2: CPU Inference Baseline

Set up WSL and an Ubuntu VM:

```
wsl --install -d Ubuntu-24.04
```

Launch "Ubuntu" from the Windows Start menu.

Grab some utilities

```
sudo apt update
sudo apt install -y git build-essential cmake curl
```

Clone llama.cpp

```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```

Commit ecd99d6a9acb was the latest at the time of writing; run git checkout ecd99d6a9acb for maximum reproducibility.

Grab the model

```
cd models
curl -L -o mistral.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
cd ..
```
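The download is several gigabytes, so it's worth a cheap sanity check before building anything. One simple heuristic (an assumption about the file format, not an official tool): valid GGUF files start with the four ASCII magic bytes "GGUF", while a failed download often leaves behind an HTML error page.

```shell
# Quick sanity check: a valid GGUF file begins with the 4-byte magic "GGUF".
check_gguf() {
  if [ "$(head -c4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "looks like a GGUF file"
  else
    echo "not a GGUF file"
  fi
}
check_gguf models/mistral.gguf
```

If this prints "not a GGUF file", inspect the first few hundred bytes with head to see whether you downloaded an error page instead of the model.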

Build llama.cpp

```
cmake -B build
cmake --build build --config Release
```
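The build can take a while on a single job; cmake forwards a -j flag to the underlying build tool, and nproc (from GNU coreutils) reports the CPU count, so a common variant of the build step is:

```
cmake -B build
cmake --build build --config Release -j"$(nproc)"
```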

Do CPU inference

```
./build/bin/llama-cli -m models/mistral.gguf
```
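With no extra arguments, llama-cli drops into an interactive chat. For a scripted smoke test you can pass a prompt and a token limit directly; -p (prompt), -n (tokens to generate), and -t (threads) are standard llama-cli flags, though the values here are just an example:

```
./build/bin/llama-cli -m models/mistral.gguf \
  -p "Explain WSL in one sentence." -n 64 -t "$(nproc)"
```

Depending on your llama.cpp revision, you may also need -no-cnv to opt out of conversation mode for a one-shot completion; check llama-cli --help for your build.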

Part 3: GPU Acceleration

Install ROCm

```
sudo apt update
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo apt install ./amdgpu-install_7.2.70200-1_all.deb
amdgpu-install -y --usecase=wsl,rocm --no-dkms
```

Check your ROCm install

```
rocminfo | grep "gfx"
```

You should see output confirming that ROCm detects your GPU, e.g. a line naming gfx1101 for the RX 7800 XT.
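rocminfo prints a lot of output; if you only want the architecture string, a grep one-liner works. The sample variable below stands in for one line of real rocminfo output (on a live system you would pipe rocminfo itself):

```shell
# Extract gfx architecture names from rocminfo-style output.
sample='  Name:                    gfx1101'
echo "$sample" | grep -o 'gfx[0-9a-f]*' | sort -u
```

Against the live tool the equivalent is: rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u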

Build llama.cpp with HIP support

```
rm -rf build
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release
```
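By default the HIP build may compile kernels for several GPU architectures, which lengthens the build. llama.cpp's build documentation describes pinning the build to a single target; the variable has been AMDGPU_TARGETS in those docs, though newer trees may spell it GPU_TARGETS, so check docs/build.md in your checkout. A variant of the build above, assuming the gfx1101 target from Part 1:

```
rm -rf build
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1101
cmake --build build --config Release
```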

Run inference on the GPU

```
./build/bin/llama-cli -m models/mistral.gguf -ngl 999
```

```
nick@NickWiseman-PC:~/llama/llama.cpp$ ./build/bin/llama-cli -m models/mistral.gguf -ngl 999
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, gfx1101 (0x1101), VMM: no, Wave Size: 32

Loading model...

▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8199-d969e933e
model      : mistral.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file

> Write a short love poem

 In the quiet of the moonlit night,
Two hearts entwined, a tender sight,
A dance of souls in gentle grace,
In love's sweet embrace, we find our place.
Your eyes, a mirror to my own,
Reflecting passion, love, and home,
Your voice, a melody that sings,
In every beat, my heart takes wings.
Together we weave a tapestry,
Of promises, of memories,
A bond that's woven strong and bright,
A love that shines, a beacon of light.
In this moment, in this stolen time,
Our hearts unite, two souls entwined,
A love so pure, a love so true,
A love that's mine, a love that's you.

[ Prompt: 149.0 t/s | Generation: 79.7 t/s ]
```
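That closing bracketed line is the throughput summary. If you want to log numbers across runs, the two figures can be scraped out of it with standard text tools; this is a plain shell sketch, not a llama.cpp feature:

```shell
# Extract prompt and generation tok/s from llama-cli's summary line as CSV.
line='[ Prompt: 149.0 t/s | Generation: 79.7 t/s ]'
echo "$line" | grep -oE '[0-9]+\.[0-9]+' | paste -sd, -
```

This prints 149.0,79.7 for the run above; appending it to a CSV file gives a cheap benchmark log.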

Note the ggml_cuda_init line confirming that the gfx1101 device is in use.
Mistral 7B Inference Perf Comparison

| Device | Prompt Speed (tok/sec) | Generation Speed (tok/sec) |
| --- | --- | --- |
| AMD Ryzen 5 3600 (CPU) | 1.5 | 6.4 |
| AMD Radeon RX 7800 XT (HIP / ROCm) | 149.0 | 79.7 |
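The raw numbers in the table work out to roughly a 99x prompt-processing and 12x generation speedup; awk can do the arithmetic:

```shell
# Speedups implied by the table above (GPU figure divided by CPU figure).
awk 'BEGIN {
  printf "prompt speedup:     %.0fx\n", 149.0 / 1.5
  printf "generation speedup: %.1fx\n", 79.7 / 6.4
}'
```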

Part 4: Resources

Set up a WSL development environment | Microsoft Learn — a step-by-step guide to running Ubuntu, Visual Studio Code or Visual Studio, Git, Windows Credential Manager, MongoDB, MySQL, Docker remote containers, and more. (learn.microsoft.com)

ggml-org/llama.cpp — LLM inference in C/C++. (github.com)
TheBloke/Mistral-7B-Instruct-v0.2-GGUF at main — quantized GGUF builds of the model used above. (huggingface.co)

Compatibility matrix | ROCm documentation — supported ROCm, GPU, driver, and OS combinations. (rocm.docs.amd.com)

