This article was originally published on runaihome.com
TL;DR: ROCm 7.2.3 (released May 4, 2026) is the stable Ubuntu path for AMD GPU inference — RDNA 3 setup is rock-solid in under 20 minutes, RDNA 4 works with one Docker workaround for a known gfx1201 bug. AMD delivers 85–92% of equivalent NVIDIA throughput at a lower price point on Ubuntu.
What you'll be able to do after this guide:
- Install ROCm 7.2.3 on Ubuntu 24.04 LTS and run ollama serve with full GPU acceleration in under 20 minutes
- Build llama.cpp with the HIP backend for maximum throughput on RDNA 3 and RDNA 4
- Identify and work around the gfx1201 rocBLASLt crash that kills model loads on the RX 9070 XT
Honest take: On Ubuntu, a used AMD Radeon RX 7900 XTX at ~$800 is the best AMD card for local LLMs in 2026: 24 GB VRAM, up to ~96 tok/s on Llama 3.1 8B Q4_K_M (community results range from 66 to 96), and a ROCm 7.2.3 install that needs no environment variable hacks.
AMD's local AI story on Linux has changed substantially. A year ago you'd fight missing kernel modules and half-broken pip wheels. Today, if you're on a supported card and Ubuntu 24.04, setup is close to the CUDA experience: download a .deb, add yourself to two groups, reboot once, and ollama pull llama3.1:8b works.
The catches are smaller than they used to be, but they exist. RDNA 4 support (RX 9000 series) is still maturing in a specific way — one rocBLASLt lookup bug can SIGKILL your model load at the 2-minute mark every single time. Knowing where the landmine is before you start saves 90 minutes of frustrating debugging.
This guide covers the native Ubuntu install path. If you need AMD on Windows, see the AMD ROCm 7.2 on Windows guide — RDNA 3 is Linux-only for ROCm and the Windows path is a different story entirely.
Which cards are actually supported on Ubuntu 24.04
ROCm 7.2.3 divides AMD's consumer lineup into three buckets:
Fully supported on Linux:
- RDNA 4: RX 9070 XT, RX 9070, RX 9060 XT LP, Radeon AI PRO R9600D (gfx1201 / gfx1200)
- RDNA 3: RX 7900 XTX, RX 7900 XT, RX 7900 GRE, RX 7800 XT, RX 7700 XT (gfx1100 / gfx1101 / gfx1102)
RDNA 3 consumer cards are Linux-only for ROCm. On Windows, the ROCm stack officially supports only RDNA 4 chips (see the Windows guide mentioned above).
Supported via Vulkan only (no ROCm HIP):
RX 7600, RX 6000 series, anything older. These cards can run inference through llama.cpp's Vulkan backend, but won't get vLLM, PyTorch ROCm, or HIP acceleration.
Not supported at all:
RDNA 1 (RX 5000 series) and older. Vulkan may work, but inference speed makes these impractical for anything beyond tiny models.
VRAM is still the ceiling
For context on what each card can run:
- RX 7900 XTX (24 GB): Qwen3-30B-A3B at Q4_K_M fits cleanly. Llama 3.3 70B Q4 in CPU-offload mode. Anything under 20B at Q4 is comfortable.
- RX 9070 XT (16 GB): Llama 3.1 8B at full speed and Qwen3-14B at Q4. 27B MoE models technically fit but saturate the memory bus (6.3 tok/s on Qwen3.5-27B-A3B at Q4).
- RX 9070 GRE (16 GB): Launched globally at $549 on June 2, 2026 — same VRAM and gfx1201 architecture as the 9070 XT, slightly less shader compute.
For a broader AMD vs NVIDIA VRAM comparison at the 16 GB tier, see AMD RX 9070 XT vs RTX 5060 Ti 16GB.
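The fit claims in the list above come down to simple arithmetic: weight size is roughly parameter count times bits per weight, divided by eight. Here is a minimal sketch of that check. The function name `fits_in_vram` is hypothetical, the ~4.85 bits per weight figure for Q4_K_M is an approximation, and the 10% headroom for KV cache and buffers is an assumption that grows with context length.

```shell
#!/usr/bin/env bash
# Rough VRAM fit check. The constants are assumptions, not measurements.
# Usage: fits_in_vram <params_in_billions> <bits_per_weight> <vram_gb>
fits_in_vram() {
  awk -v p="$1" -v b="$2" -v v="$3" 'BEGIN {
    weights_gb = p * b / 8        # billions of params * bits / 8 = GB of weights
    need_gb = weights_gb * 1.10   # assumed 10% headroom for KV cache and buffers
    if (need_gb <= v) print "fits"; else print "offload"
  }'
}

# Examples (Q4_K_M taken as roughly 4.85 bits per weight):
#   fits_in_vram 30 4.85 24   # 30B-class model on an RX 7900 XTX
#   fits_in_vram 70 4.85 24   # 70B model on the same card
```

Treat the result as a first guess only. Long contexts can add several GB of KV cache on top of the weights.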
Step 1: Install ROCm 7.2.3 on Ubuntu 24.04
Start from Ubuntu 24.04.3 LTS. The amdgpu-install tool handles both the kernel driver and the ROCm userspace stack in a single package.
# Download the installer for Ubuntu 24.04 (noble)
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb
# Install it
sudo apt install ./amdgpu-install_7.2.3.70203-1_all.deb
sudo apt update
# Install ROCm with the rocm usecase
sudo amdgpu-install --usecase=rocm --no-dkms
The --no-dkms flag skips DKMS kernel module compilation. On Ubuntu 24.04.3 with a 6.8.x kernel, RDNA 3 and RDNA 4 are already supported by the packaged kernel — invoking DKMS wastes 10 minutes and sometimes fails on systems with secure boot or custom kernels.
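Since the installer URL above is specific to the noble release, it can be worth confirming the codename before downloading anything. The sketch below reads the standard `VERSION_CODENAME` field from `/etc/os-release`; the helper names `ubuntu_codename` and `preflight_noble` are made up for this example.

```shell
#!/usr/bin/env bash
# Print the release codename from an os-release style file
# (defaults to /etc/os-release).
ubuntu_codename() {
  local file="${1:-/etc/os-release}"
  # Ubuntu 24.04 reports VERSION_CODENAME=noble; strip optional quotes.
  sed -n 's/^VERSION_CODENAME=//p' "$file" | tr -d '"'
}

# Succeed only if the codename matches the noble installer package.
preflight_noble() {
  [ "$(ubuntu_codename "${1:-}")" = "noble" ]
}

# Usage:
#   preflight_noble && echo "OK to use the noble installer" \
#     || echo "Not Ubuntu 24.04 (noble); pick the matching installer instead"
```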
Step 2: Add user groups and reboot
This step trips up almost every first-time installer, and the error messages you get when you skip it are not helpful. The ROCm compute stack requires your user to be in the render and video groups to access /dev/kfd (the GPU compute device node) without root.
sudo usermod -a -G render,video $USER
Reboot now. A newgrp session is not sufficient — the group membership must be part of your login session from the start. After reboot, verify:
groups
# Expected: ... render video ...
rocminfo | head -25
Expected output from rocminfo on an RX 9070 XT:
ROCk module is loaded
...
Agent 2
Name: gfx1201
Uuid: GPU-XXXXXXXXXXXXXX
Marketing Name: Radeon RX 9070 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
If rocminfo hangs for more than 30 seconds or shows only Agent 1 (the CPU), you have a group issue or driver conflict. Check dmesg | grep amdgpu first — firmware errors here usually mean you need a linux-firmware update.
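The two checks above (group membership and a visible GPU agent) can be wrapped into small helpers so the failure mode is explicit. This is a sketch under assumptions: `has_groups` and `has_gpu_agent` are hypothetical names, and the pattern assumes the agent line looks like the `Name: gfx1201` line in the sample output.

```shell
#!/usr/bin/env bash
# Succeed if every required group appears in a space-separated group list.
# Usage: has_groups "$(id -nG)" render video
has_groups() {
  local have=" $1 "; shift
  local g
  for g in "$@"; do
    case "$have" in
      *" $g "*) ;;                                   # group present
      *) echo "missing group: $g" >&2; return 1 ;;   # report the first gap
    esac
  done
}

# Succeed if rocminfo output on stdin contains at least one gfx agent,
# meaning a GPU is visible and not just the CPU.
has_gpu_agent() {
  grep -qE 'Name:[[:space:]]+gfx[0-9a-f]+'
}

# Usage (timeout guards against the hang described above):
#   has_groups "$(id -nG)" render video || echo "log out or reboot after usermod"
#   timeout 30 rocminfo | has_gpu_agent || sudo dmesg | grep amdgpu
```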
Step 3: Verify with rocm-smi
rocm-smi
This displays real-time GPU stats including temperature, power draw, and memory usage. At idle you'll see 0% utilization — that's normal. Run a model in the next step and check again to confirm the GPU is actually being used.
Step 4: Install Ollama with ROCm support
Ollama ships its own bundled ROCm libraries and auto-detects AMD GPUs on Linux:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M
While inference is running, open a second terminal and run watch -n 1 rocm-smi. You should see GPU memory jump to ~5.5 GB and compute utilization hit 90–95%.
If memory shows 0 MB allocated despite the model loading, Ollama may be running on the CPU. Stop the background service (sudo systemctl stop ollama) so the port is free, then run OLLAMA_DEBUG=1 ollama serve in the foreground and check the startup logs: they report which ROCm libraries were found and whether the GPU was initialized.
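As a cross-check that does not depend on rocm-smi, the amdgpu kernel driver exposes VRAM and utilization counters in sysfs. The sketch below reads them for one device directory. The function name `amdgpu_usage` is hypothetical, and the card index on your machine is an assumption (an integrated GPU may take card0).

```shell
#!/usr/bin/env bash
# Print "<used_mib> <total_mib> <busy_percent>" for one amdgpu device
# directory, e.g. /sys/class/drm/card0/device.
amdgpu_usage() {
  local dev="$1"
  local used total busy
  used="$(cat "$dev/mem_info_vram_used")"     # bytes currently allocated
  total="$(cat "$dev/mem_info_vram_total")"   # bytes of VRAM on the card
  busy="$(cat "$dev/gpu_busy_percent")"       # 0-100
  echo "$((used / 1048576)) $((total / 1048576)) $busy"
}

# Usage: print one line per card that exposes the amdgpu counters.
#   for d in /sys/class/drm/card*/device; do
#     if [ -r "$d/mem_info_vram_used" ]; then amdgpu_usage "$d"; fi
#   done
```

If the used figure stays near idle levels while a prompt is generating, the model is most likely running on the CPU.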
Step 5 (optional): Build llama.cpp with HIP
For direct llama.cpp inference — more control over layer offloading and context window than Ollama provides — the HIP backend delivers the best AMD throughput:
sudo apt install cmake git build-essential
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1201 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Replace gfx1201 with your card's architecture:
| Card | Target arch |
|---|---|
| RX 9070 XT / 9070 | gfx1201 |
| RX 9060 XT | gfx1200 |
| RX 7900 XTX / 7900 XT / 7900 GRE | gfx1100 |
| RX 7800 XT | gfx1101 |
| RX 7700 XT | gfx1101 |
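Instead of looking the target up in the table, you can derive it from what rocminfo reports. This is a minimal sketch with a hypothetical helper, `cmake_gpu_targets`. It matches only agent lines of the form `Name: gfxNNNN`, which is an assumption about the rocminfo layout based on the sample output earlier in this guide.

```shell
#!/usr/bin/env bash
# Read rocminfo output on stdin and print a semicolon-separated list of
# unique gfx targets, the list format CMake expects.
cmake_gpu_targets() {
  # Keep only lines whose whole value is a gfx name; ISA lines such as
  # "amdgcn-amd-amdhsa--gfx1201" do not match and are skipped.
  sed -nE 's/^[[:space:]]*Name:[[:space:]]+(gfx[0-9a-f]+)[[:space:]]*$/\1/p' \
    | sort -u | paste -sd ';' -
}

# Usage:
#   cmake -B build -DGGML_HIP=ON \
#     -DAMDGPU_TARGETS="$(rocminfo | cmake_gpu_targets)" \
#     -DCMAKE_BUILD_TYPE=Release
```

On machines with an integrated GPU the list may include its target as well. Trim it by hand if you only want to build for the discrete card.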
Run the server:
./build/bin/llama-server \
  -m /path/to/model.gguf \
  --n-gpu-layers 99 \
  --host 0.0.0.0 --port 8080
--n-gpu-layers 99 offloads all layers to GPU. If you hit VRAM limits, reduce this number to shift layers to CPU — useful when running 27B+ models on a 16 GB card.
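When a model does not fit, a starting value for --n-gpu-layers can be estimated instead of guessed. The sketch below assumes all layers are the same size and ignores KV cache and compute buffers, so treat the result as an upper bound and step down from there. The function name `gpu_layers_estimate` and the example numbers are illustrative only.

```shell
#!/usr/bin/env bash
# Estimate how many layers fit in a VRAM budget, assuming uniform layer size.
# Usage: gpu_layers_estimate <model_file_gb> <total_layers> <vram_budget_gb>
gpu_layers_estimate() {
  awk -v m="$1" -v n="$2" -v v="$3" 'BEGIN {
    layers = int(v * n / m)      # budget divided by average per-layer size
    if (layers > n) layers = n   # never more layers than the model has
    if (layers < 0) layers = 0
    print layers
  }'
}

# Example: a 42 GB file with 80 layers and 22 GB of usable VRAM
#   gpu_layers_estimate 42 80 22
```

Leave a few GB of the card out of the budget for context. If the load still fails, lower the value by a few layers at a time.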
Real benchmarks
Results from community benchmarks on ROCm 7.x, Ubuntu 24.04, Ollama and llama.cpp HIP:
| Card | VRAM | Model | Quant | tok/s |
|---|---|---|---|---|
| RX 7900 XTX | 24 GB | Llama 3.1 8B | Q4_K_M | 66–96 |
| RX 9070 XT | 16 GB | Llama 3.1 8B | Q4_K_M | ~56 |
| RX 9070 XT | 16 GB | Qwen3:14B | Q4 | 52.2 |
| RX 9070 XT | 16 GB | GPT-OSS:20B | Q4 | 91.9 |
| RX 9070 XT | 16 GB | Qwen3.5:27B-A3B | Q4 (MoE) | 6.3 |
The 66–96 tok/s variance on the RX 7900 XTX reflects different llama.cpp versions and batch size settings across community tests. Mid-range is roughly 80 tok/s for a clean Q4_K_M Llama 8B run.
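To see where your own card lands in that range, you can compute tok/s from Ollama's API response. A non-streaming /api/generate reply includes eval_count (generated tokens) and eval_duration (nanoseconds). The helper below is a sketch with a hypothetical name, `toks_per_sec`, and it pulls the two fields out with grep rather than a real JSON parser.

```shell
#!/usr/bin/env bash
# Read one non-streaming /api/generate response on stdin and print tokens/sec.
toks_per_sec() {
  local json count dur
  json="$(cat)"
  # The leading quote keeps prompt_eval_count / prompt_eval_duration from matching.
  count="$(printf '%s' "$json" | grep -oE '"eval_count":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$')"
  dur="$(printf '%s' "$json" | grep -oE '"eval_duration":[[:space:]]*[0-9]+' | grep -oE '[0-9]+$')"
  if [ -z "$count" ] || [ -z "$dur" ]; then
    echo "eval_count or eval_duration not found" >&2
    return 1
  fi
  awk -v c="$count" -v d="$dur" 'BEGIN { printf "%.1f\n", c * 1e9 / d }'
}

# Usage (assumes Ollama is listening on its default port, 11434):
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model":"llama3.1:8b-instruct-q4_K_M","prompt":"Explain ROCm in one paragraph.","stream":false}' \
#     | toks_per_sec
```

Run it a few times with the same prompt and compare against the table. The first call includes model load time in other fields, but eval_duration covers generation only.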
For comparison: an RTX 4070 Super (12 GB, 504 GB/s) delivers roughly 62–70 tok/s on the same model. The RX 9070 XT at 640 GB/s memory bandwidth edges it out