Working with ROCm and Strix Halo drivers is generally a PITA, but thanks to https://github.com/kyuz0 for creating https://github.com/kyuz0/amd-strix-halo-llm-finetuning and making our lives easier.
1. Introduction
This document details the experience, challenges, and results of fine-tuning Google's Gemma models on an AMD Strix Halo system running Fedora 43. The goal was to leverage the Ryzen AI Max 390's integrated NPU/GPU capabilities via ROCm 7.9 nightly builds.
We are using https://github.com/kyuz0/amd-strix-halo-llm-finetuning to build the Docker image and run the fine-tuning process.
2. System Configuration
The test bench was an ASUS ROG Flow Z13 (2025) with the following specifications:
- CPU/GPU: AMD Ryzen AI Max 390 (Strix Halo)
- 12 Cores / 24 Threads @ 5.06 GHz
- Integrated Radeon 8050S Graphics
- RAM: 32 GB LPDDR5X (Unified Memory)
- OS: Fedora Linux 43 (Workstation Edition)
- Kernel: Linux 6.18.0-rc6
- ROCm Version: 7.9.0 (Nightly)
3. Challenges & Solutions
A. Docker Build Issues (Fedora 43)
The initial build failed for two reasons: a Python version mismatch and a missing utility.
- Issue: Fedora 43 defaults to Python 3.14, but the ROCm nightly wheels (`torch`, `torchvision`) currently only support up to Python 3.13.
- Fix: Modified the Dockerfile to explicitly install `python3.13` and `python3.13-devel` instead of the system default (see the sketch after this list).
- Issue: The build failed at the final stage with `jq: command not found`.
- Fix: Added `jq` to the `dnf install` list in the Dockerfile.
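A minimal sketch of what that Dockerfile change can look like, assuming the base image uses `dnf`; the exact layer and package list in kyuz0's Dockerfile may differ:

```dockerfile
# Sketch only: pin Python 3.13 for the ROCm nightly wheels and add jq.
# Adjust to match the existing install layer in the upstream Dockerfile.
RUN dnf install -y python3.13 python3.13-devel jq && dnf clean all
```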
Building the image and setting up the toolbox:
# Build the image
docker build -t amd-strix-halo-llm:latest .
# Transfer to Podman (for Toolbox)
docker save amd-strix-halo-llm:latest | podman load
# Create the toolbox:
toolbox create strix-halo-llm-finetuning \
--image docker.io/library/amd-strix-halo-llm:latest \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --security-opt seccomp=unconfined
# Enter the toolbox:
toolbox enter strix-halo-llm-finetuning
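Once inside the toolbox, a quick sanity check confirms the ROCm nightly build of PyTorch can see the GPU (assuming the image installs it under Python 3.13 as described above):

```bash
# ROCm builds of PyTorch expose the GPU through the CUDA API surface,
# so torch.version.hip is set and cuda.is_available() should print True.
python3.13 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```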
B. GPU Memory Crashes (The "32GB RAM" Bottleneck)
Running the training initially caused immediate system crashes with `AcceleratorError: HIP error: unspecified launch failure` and kernel page faults.
- Cause: The GPU can overcommit system RAM unless kernel parameters cap its allocation.
- Fix: Tuned the kernel parameters to fit the 32GB physical limit, allocating ~16GB to the GPU GTT while leaving ~7GB for the OS (a verification sketch follows the command below).
sudo grubby --update-kernel=ALL --args="amd_iommu=off amdgpu.gttsize=16384 ttm.pages_limit=4194304"
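After rebooting, it is worth confirming the parameters actually took effect. A minimal check, assuming the ROCm CLI tools (`rocm-smi`) are available on the host or inside the toolbox:

```bash
# The new arguments should appear on the kernel command line
cat /proc/cmdline
# GTT size and usage as seen by the amdgpu driver
rocm-smi --showmeminfo gtt
```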
4. Finetuning Results: Gemma-3 270M
We successfully fine-tuned the Gemma-3 270M-IT model using four different methods. Below are the observed memory footprints.
| Method | Trainable Params | Trainable % | Peak Training Memory | Weights Footprint | Training Time |
|---|---|---|---|---|---|
| Full Finetune | 271.9 M | 100% | 12.42 GB | 0.54 GB | 2m 52s |
| LoRA | 3.8 M | 1.40% | 11.40 GB | 0.55 GB | 2m 0s |
| 8-bit + LoRA | 3.8 M | 1.40% | ~11 GB | 0.79 GB | 8m 0s |
| QLoRA (4-bit) | 3.8 M | 1.40% | ~11 GB | 0.79 GB | 9m 0s |
Note: The 12.42 GB peak memory for full fine-tuning fits comfortably within the ~25GB effective VRAM (9GB Hardware Reserved + 16GB GTT).
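To sanity-check peak-memory figures like these during your own runs, you can watch the driver's counters from the host while training is in progress; this is just one way to do it:

```bash
# Poll VRAM and GTT usage every 2 seconds while training runs in the toolbox
watch -n 2 rocm-smi --showmeminfo vram gtt
```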
5. Inference & Validation
Inference was validated using the fine-tuned checkpoints.
- FlashAttention 2: Successfully enabled (`attn_implementation="flash_attention_2"`) for both base model loading and the inference pipeline (a quick availability check is sketched below).
- Output: The model generated coherent text responses to prompts (e.g., quotes about "love").
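Before relying on `attn_implementation="flash_attention_2"`, it can help to confirm the FlashAttention package is importable inside the toolbox (assuming the image ships a ROCm-compatible `flash_attn` wheel):

```bash
# transformers will error if flash_attention_2 is requested but this import fails
python3.13 -c "import flash_attn; print(flash_attn.__version__)"
```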
6. Conclusion
Finetuning LLMs on AMD Strix Halo is viable even on 32GB systems, provided that:
- Python versions are strictly managed (3.13 max for current ROCm wheels).
- Kernel parameters are tuned to prevent the GPU from overcommitting system RAM.
- Model selection is realistic (Gemma 270M and 1B are safe; 4B requires QLoRA; 12B is borderline).