Working with ROCm and Strix Halo drivers is generally a PITA, but thanks to https://github.com/kyuz0 for creating https://github.com/kyuz0/amd-strix-halo-llm-finetuning and making our lives easier.
1. Introduction
This document details the experience, challenges, and results of fine-tuning Google's Gemma models on an AMD Strix Halo system running Fedora 43. The goal was to leverage the Ryzen AI Max 390's integrated NPU/GPU capabilities via ROCm 7.9 nightly builds.
We are using https://github.com/kyuz0/amd-strix-halo-llm-finetuning to build the Docker image and run the fine-tuning process.
2. System Configuration
The test bench was an ASUS ROG Flow Z13 (2025) with the following specifications:
- CPU/GPU: AMD Ryzen AI Max 390 (Strix Halo)
- 12 Cores / 24 Threads @ 5.06 GHz
- Integrated Radeon 8050S Graphics
- RAM: 32 GB LPDDR5X (Unified Memory)
- OS: Fedora Linux 43 (Workstation Edition)
- Kernel: Linux 6.18.0-rc6
- ROCm Version: 7.9.0 (Nightly)
3. Challenges & Solutions
A. Docker Build Issues (Fedora 43)
The initial build failed for two reasons: a Python version mismatch and a missing utility.
- Issue: Fedora 43 defaults to Python 3.14, but the ROCm nightly wheels (`torch`, `torchvision`) currently only support up to Python 3.13.
- Fix: Modified the Dockerfile to explicitly install `python3.13` and `python3.13-devel` instead of the system default (see the sketch after this list).
- Issue: The build failed at the final stage with `jq: command not found`.
- Fix: Added `jq` to the `dnf install` list in the Dockerfile.
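A minimal sketch of what that Dockerfile change can look like, assuming the base image uses `dnf`; the exact layer and package list in kyuz0's Dockerfile may differ:

```dockerfile
# Sketch only: pin Python 3.13 for the ROCm nightly wheels and add jq.
# Adjust to match the existing install layer in the upstream Dockerfile.
RUN dnf install -y python3.13 python3.13-devel jq && dnf clean all
```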
Building the image and setting up the toolbox:
# Build the image
docker build -t amd-strix-halo-llm:latest .
# Transfer to Podman (for Toolbox)
docker save amd-strix-halo-llm:latest | podman load
# Create the toolbox:
toolbox create strix-halo-llm-finetuning \
--image docker.io/library/amd-strix-halo-llm:latest \
-- --device /dev/dri --device /dev/kfd \
--group-add video --group-add render --security-opt seccomp=unconfined
# Enter the toolbox:
toolbox enter strix-halo-llm-finetuning
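Once inside the toolbox, a quick sanity check confirms the ROCm nightly build of PyTorch can see the GPU (assuming the image installs it under Python 3.13 as described above):

```bash
# ROCm builds of PyTorch expose the GPU through the CUDA API surface,
# so torch.version.hip is set and cuda.is_available() should print True.
python3.13 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```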
B. GPU Memory Crashes (The "32GB RAM" Bottleneck)
Running the training initially caused immediate system crashes with `AcceleratorError: HIP error: unspecified launch failure` and kernel page faults.
- Cause: The GPU can overcommit system RAM unless kernel parameters cap its allocation.
- Fix: Tuned the kernel parameters to fit the 32GB physical limit, allocating ~16GB to the GPU GTT while leaving ~7GB for the OS (a verification sketch follows the command below).
sudo grubby --update-kernel=ALL --args="amd_iommu=off amdgpu.gttsize=16384 ttm.pages_limit=4194304"
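After rebooting, it is worth confirming the parameters actually took effect. A minimal check, assuming the ROCm CLI tools (`rocm-smi`) are available on the host or inside the toolbox:

```bash
# The new arguments should appear on the kernel command line
cat /proc/cmdline
# GTT size and usage as seen by the amdgpu driver
rocm-smi --showmeminfo gtt
```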
4. Finetuning Results: Gemma-3 270M
We successfully fine-tuned the Gemma-3 270M-IT model using four different methods. Below are the observed memory footprints.
| Method | Trainable Params | Trainable % | Peak Training Memory | Weights Footprint | Training Time |
|---|---|---|---|---|---|
| Full Finetune | 271.9 M | 100% | 12.42 GB | 0.54 GB | 2m 52s |
| LoRA | 3.8 M | 1.40% | 11.40 GB | 0.55 GB | 2m 0s |
| 8-bit + LoRA | 3.8 M | 1.40% | ~11 GB | 0.79 GB | 8m 0s |
| QLoRA (4-bit) | 3.8 M | 1.40% | ~11 GB | 0.79 GB | 9m 0s |
Note: The 12.42 GB peak memory for full fine-tuning fits comfortably within the ~25GB effective VRAM (9GB Hardware Reserved + 16GB GTT).
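To sanity-check peak-memory figures like these during your own runs, you can watch the driver's counters from the host while training is in progress; this is just one way to do it:

```bash
# Poll VRAM and GTT usage every 2 seconds while training runs in the toolbox
watch -n 2 rocm-smi --showmeminfo vram gtt
```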
5. Inference & Validation
Inference was validated using the fine-tuned checkpoints.
- FlashAttention 2: Successfully enabled (`attn_implementation="flash_attention_2"`) for both base model loading and the inference pipeline (a quick availability check is sketched below).
- Output: The model generated coherent text responses to prompts (e.g., quotes about "love").
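Before relying on `attn_implementation="flash_attention_2"`, it can help to confirm the FlashAttention package is importable inside the toolbox (assuming the image ships a ROCm-compatible `flash_attn` wheel):

```bash
# transformers will error if flash_attention_2 is requested but this import fails
python3.13 -c "import flash_attn; print(flash_attn.__version__)"
```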
6. Conclusion
Finetuning LLMs on AMD Strix Halo is viable even on 32GB systems, provided that:
- Python versions are strictly managed (3.13 max for current ROCm wheels).
- Kernel parameters are tuned to prevent the GPU from overcommitting system RAM.
- Model selection is realistic (Gemma 270M and 1B are safe; 4B requires QLoRA; 12B is borderline).