DEV Community

Ankit Khandelwal
Ankit Khandelwal

Posted on

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

Tested on Fedora 44, kernel 7.0.12, ROG Flow Z13 (Ryzen AI Max 390 / Strix Halo NPU).

Goal: copy-paste setup that gets flm validate and flm run working on Fedora.


TL;DR

You need four layers working together:

Layer What it does
Kernel + DKMS driver (amdxdna) Creates /dev/accel/accel0, loads NPU firmware
XRT base AMD runtime installed to /opt/xilinx/xrt
XRT NPU plugin (xrt_plugin RPM) Provides libxrt_driver_xdna.so so XRT sees the NPU
FastFlowLM (flm) Runs LLMs on the NPU

On Fedora there is no prebuilt PPA like Ubuntu. You build XRT, the NPU plugin, and FastFlowLM from source.

Two non-obvious blockers we hit:

  1. amd_iommu=off in kernel cmdline — common for GPU LLM tuning, but breaks the NPU
  2. Symlinking xrt-smi — the script is path-sensitive; use a wrapper instead

Hardware tested

Item Value
Machine ASUS ROG Flow Z13 GZ302EA
CPU / NPU AMD Ryzen AI Max 390 (Strix Halo)
NPU PCI ID 1022:17f0 rev 11
OS Fedora Linux 44 Workstation
Kernel 7.0.12-201.fc44.x86_64
NPU firmware 1.1.2.65
XRT 2.25.0 (built from amd/xdna-driver)

Also works on other XDNA2 NPUs (Strix, Strix Halo, Kraken, Gorgon Point) with the same stack.


Before you start

Kernel requirements

  • Linux 7.0+ (Fedora 44 ships this) with in-tree amdxdna support
  • For Strix Halo (rev 11), prefer the out-of-tree DKMS driver from xdna-driver over the stock in-tree module
  • IOMMU must be enabled — see IOMMU section

Check your kernel cmdline now

cat /proc/cmdline
Enter fullscreen mode Exit fullscreen mode

If you see amd_iommu=off, remove it before spending time on driver builds. Details below.


1. Install build dependencies

sudo dnf install -y \
  git jq dkms \
  kernel-devel-$(uname -r) kernel-headers \
  gcc gcc-c++ make cmake ninja-build \
  boost-devel boost-filesystem boost-program-options boost-static \
  elfutils-devel libdrm-devel libuuid-devel libcurl-devel \
  openssl-devel zlib-static glibc-static libstdc++-static \
  protobuf-devel protobuf-compiler \
  json-glib-devel libyaml-devel libudev-devel \
  rpm-build curl pciutils \
  fftw-devel \
  opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel
Enter fullscreen mode Exit fullscreen mode

Run AMD's dependency scripts (optional but helpful):

git clone --recursive https://github.com/amd/xdna-driver.git ~/repos/xdna-driver
cd ~/repos/xdna-driver
sudo ./tools/amdxdna_deps.sh
sudo ./xrt/src/runtime_src/tools/scripts/xrtdeps.sh
Enter fullscreen mode Exit fullscreen mode

Fedora OpenCL note: Fedora 44 uses OpenCL-ICD-Loader, not the older ocl-icd package. If the XRT build fails on OpenCL ICD layout or RPM dependencies, see Fedora XRT build fixes.

cmake3 wrapper (Fedora ships CMake 4.x as cmake)

XRT build scripts look for cmake3 on Fedora. Create a local wrapper — no system symlink needed:

mkdir -p ~/.local/xrt-build/bin
ln -sf /usr/bin/cmake ~/.local/xrt-build/bin/cmake3
echo 'export PATH="$HOME/.local/xrt-build/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/xrt-build/bin:$PATH"
Enter fullscreen mode Exit fullscreen mode

2. Build and install XRT

cd ~/repos/xdna-driver/xrt/build

./build.sh -npu -opt -disable-werror -noinit -j $(nproc)
Enter fullscreen mode Exit fullscreen mode

Install the RPMs (version string may differ slightly):

cd ~/repos/xdna-driver/xrt/build/Release
sudo dnf install -y xrt-base-*.rpm xrt-base-devel-*.rpm xrt-npu-*.rpm
Enter fullscreen mode Exit fullscreen mode

Register XRT libraries system-wide

Without this, flm fails with libxrt_coreutil.so.2: cannot open shared object file:

echo '/opt/xilinx/xrt/lib64' | sudo tee /etc/ld.so.conf.d/xrt.conf
sudo ldconfig
Enter fullscreen mode Exit fullscreen mode

Verify:

ldconfig -p | grep xrt
Enter fullscreen mode Exit fullscreen mode

3. Build and install the NPU plugin (DKMS driver + XRT shim)

This step provides libxrt_driver_xdna.so and replaces the in-tree kernel module with the DKMS build.

export PATH="$HOME/.local/xrt-build/bin:$PATH"
cd ~/repos/xdna-driver/build
./build.sh -release -j $(nproc)
Enter fullscreen mode Exit fullscreen mode

Install the plugin RPM:

sudo dnf install ~/repos/xdna-driver/build/Release/xrt_plugin.*.rpm
Enter fullscreen mode Exit fullscreen mode

Verify the DKMS module is active (path should contain extra/ or updates/dkms/, not kernel/drivers/accel/):

modinfo -F filename amdxdna
# e.g. /lib/modules/7.0.12-201.fc44.x86_64/extra/amdxdna.ko.xz
Enter fullscreen mode Exit fullscreen mode

Reboot if the module was just installed:

sudo reboot
Enter fullscreen mode Exit fullscreen mode

4. Fix memlock limit

The NPU needs locked memory. Check current limit:

ulimit -l
Enter fullscreen mode Exit fullscreen mode

If not unlimited:

sudo tee /etc/security/limits.d/99-memlock.conf <<'EOF'
*    soft    memlock    unlimited
*    hard    memlock    unlimited
EOF
Enter fullscreen mode Exit fullscreen mode

Log out and back in (or reboot), then confirm:

ulimit -l
# unlimited
Enter fullscreen mode Exit fullscreen mode

5. Critical: do NOT use amd_iommu=off

What is amd_iommu?

IOMMU (AMD-Vi on AMD platforms) mediates how PCIe devices access memory. The NPU driver uses PASID / SVA (Shared Virtual Addressing) so the NPU can share your process's virtual address space — this requires IOMMU.

amd_iommu=off disables IOMMU entirely. Strix Halo users often add it for 5–12% faster GPU inference in llama.cpp. That trade-off kills NPU support.

Symptoms with IOMMU off

[ERROR]  No NPU device found.
amdxdna_sva_init: SVA bind device failed, ret -19
PASID unavailable and carveout not configured
Open /dev/accel/accel0 failed (err=-22): Invalid argument
Enter fullscreen mode Exit fullscreen mode

Fix

Edit /etc/default/grub and remove amd_iommu=off:

sudo nano /etc/default/grub
Enter fullscreen mode Exit fullscreen mode

Change:

GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=24576 ..."
Enter fullscreen mode Exit fullscreen mode

To (keep your GPU tuning flags, drop only the IOMMU disable):

GRUB_CMDLINE_LINUX="... amdgpu.gttsize=24576 ttm.pages_limit=6291456"
Enter fullscreen mode Exit fullscreen mode

Regenerate grub and reboot:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
Enter fullscreen mode Exit fullscreen mode

After reboot:

cat /proc/cmdline | grep amd_iommu || echo "OK: amd_iommu not disabled"
Enter fullscreen mode Exit fullscreen mode

Optional middle ground: Some users use iommu=pt instead of amd_iommu=off for slightly less IOMMU overhead while keeping NPU working. Note: amd_iommu=pt is invalid on AMD — use iommu=pt (no amd_ prefix).


6. Build and install FastFlowLM

sudo dnf install -y \
  libavformat-devel libavutil-devel libavcodec-devel \
  libswresample-devel libswscale-devel

git clone --recursive https://github.com/FastFlowLM/FastFlowLM.git ~/repos/FastFlowLM
cd ~/repos/FastFlowLM/src
cmake --preset linux-default
cd build
cmake --build . -j $(nproc)
sudo cmake --install .
Enter fullscreen mode Exit fullscreen mode

flm installs to /opt/fastflowlm/bin/flm (symlinked to /usr/local/bin/flm).


7. Make xrt-smi available (optional but useful)

xrt-smi lives in /opt/xilinx/xrt/bin/. You can source the environment script:

source /opt/xilinx/xrt/setup.sh
Enter fullscreen mode Exit fullscreen mode

Do not symlink xrt-smi to /usr/local/bin — the wrapper script uses dirname "$0" and breaks when symlinked:

/usr/local/bin/xrt-smi: line 46: /usr/local/bin/unwrapped/xrt-smi: No such file or directory
Enter fullscreen mode Exit fullscreen mode

Instead, create a small wrapper:

sudo tee /usr/local/bin/xrt-smi <<'EOF'
#!/bin/sh
exec /opt/xilinx/xrt/bin/xrt-smi "$@"
EOF
sudo chmod +x /usr/local/bin/xrt-smi
Enter fullscreen mode Exit fullscreen mode

Or add XRT to PATH permanently in ~/.bashrc:

export PATH="/opt/xilinx/xrt/bin:$PATH"
Enter fullscreen mode Exit fullscreen mode

8. Validate everything

Run these in order:

# Kernel driver + firmware
flm validate
Enter fullscreen mode Exit fullscreen mode

Expected:

[Linux]  Kernel: 7.0.12-201.fc44.x86_64
[Linux]  NPU: /dev/accel/accel0 with 8 columns
[Linux]  NPU FW Version: 1.1.2.65
[Linux]  amdxdna version: 0.15
[Linux]  Memlock Limit: infinity
Enter fullscreen mode Exit fullscreen mode
# XRT layer
xrt-smi examine
Enter fullscreen mode Exit fullscreen mode

Expected: one NPU Strix Halo device at [0000:c5:00.1].

# Hardware self-test
xrt-smi validate
Enter fullscreen mode Exit fullscreen mode

Expected: gemm, latency, and throughput tests PASSED.

Important: flm validate checks the kernel DRM device. flm run uses XRT. Both must pass before running models.


9. Run your first model

flm run gemma4-it:e4b
flm list
flm serve gemma4-it:e4b     # OpenAI-compatible server on port 52625
Enter fullscreen mode Exit fullscreen mode

Models download from HuggingFace on first run. Default storage: ~/.config/flm/.

Inside an interactive flm run session, toggle performance reporting:

/verbose    # per-turn TTFT, prefill tok/s, decoding tok/s
/status     # token counts and throughput summary
Enter fullscreen mode Exit fullscreen mode

Formal benchmarks across context lengths:

flm bench gemma4-it:e4b
Enter fullscreen mode Exit fullscreen mode

10. Monitor NPU stats

There is no Linux equivalent to amdgpu_top or Windows Task Manager's NPU tab yet. Use a combination of XRT (device-level) and FLM (inference-level) tools.

Quick reference

What you want Command
Device info, firmware, topology xrt-smi examine
Power, partitions, platform xrt-smi examine -r all -d 0000:c5:00.1
Hardware benchmark (TOPS, latency) xrt-smi validate
Live-ish polling watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'
Inference speed while chatting /verbose and /status in flm run
Formal model benchmarks flm bench <model>

Replace 0000:c5:00.1 with your NPU BDF from xrt-smi examine if it differs.

XRT — device-level snapshots

xrt-smi examine
xrt-smi examine -r all -d 0000:c5:00.1
xrt-smi validate
Enter fullscreen mode Exit fullscreen mode

Poll while a model runs in another terminal:

watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'
Enter fullscreen mode Exit fullscreen mode

Power modes (some require root):

xrt-smi configure --pmode performance -d 0000:c5:00.1
# modes: default, powersaver, balanced, performance, turbo
Enter fullscreen mode Exit fullscreen mode

FLM — inference metrics (most useful in practice)

Terminal 1 — run a model with verbose output:

flm run gemma4-it:e4b
# then type /verbose
Enter fullscreen mode Exit fullscreen mode

Terminal 2 — watch the NPU device:

watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'
Enter fullscreen mode Exit fullscreen mode

Terminal 3 (optional) — GPU is separate from NPU:

amdgpu_top    # Radeon iGPU only, not the NPU
Enter fullscreen mode Exit fullscreen mode

Kernel debugfs (low-level, requires root)

sudo ls /sys/kernel/debug/accel/
sudo ls /sys/kernel/debug/dri/

# When present (exact path varies by kernel/driver):
sudo cat /sys/kernel/debug/dri/0/telemetry_profiling
sudo cat /sys/kernel/debug/dri/0/powerstate
sudo cat /sys/kernel/debug/dri/0/get_app_health
Enter fullscreen mode Exit fullscreen mode

These are read-on-demand debug interfaces, not a live dashboard.

What does NOT show NPU utilization

  • htop / top — CPU and RAM only
  • amdgpu_top / radeontop — GPU only
  • /sys/class/accel/accel0/ — device node metadata, no utilization graph

11. Real-world benchmark (ROG Flow Z13)

Measured on the same machine as this guide after a successful setup (Fedora 44, Ryzen AI Max 390, NPU firmware 1.1.2.65, IOMMU enabled).

gemma4-it:e4b

flm run gemma4-it:e4b
# /verbose enabled during session
Enter fullscreen mode Exit fullscreen mode
Metric Result
TTFT (time to first token) 1.21 s
Prefill speed 18 tok/s
Decoding speed 11 tok/s

These numbers come from FLM's /verbose output (prefill and decoding tokens/s). Your results will vary with prompt length, context size, power mode, and background load.

xrt-smi validate on the same hardware reported 4.4 TOPS (gemm), 52 µs average latency, and ~76k op/s throughput — useful as a hardware sanity check, not directly comparable to LLM tok/s.


Troubleshooting

Symptom Cause Fix
libxrt_coreutil.so.2: cannot open shared object file XRT libs not in loader cache Add /opt/xilinx/xrt/lib64 to ld.so.conf.d, run sudo ldconfig
No NPU device found + clean dmesg IOMMU disabled Remove amd_iommu=off, reboot
PASID unavailable and carveout not configured Same as above Enable IOMMU
Memlock limit is too low (8MB) Default ulimit too low /etc/security/limits.d/99-memlock.conf, re-login
xrt-smi: 0 devices found Missing NPU plugin Install xrt_plugin RPM from xdna-driver/build
/dev/accel/accel0 exists but open fails (ENODEV) In-tree driver failed probe Install DKMS driver via xrt_plugin RPM
flm validate OK but flm run fails with No such device with index '0' XRT can't see NPU Fix XRT plugin + xrt-smi examine
xrt-smi: unwrapped/xrt-smi: No such file Bad symlink Use wrapper script (see section 7)
cmake3 is not installed Fedora CMake naming Create cmake3 wrapper pointing to /usr/bin/cmake
XRT build fails on OpenCL ICD Fedora OpenCL 3.0 layout See Fedora XRT build fixes below
Link errors for libfftw3 Missing dev package sudo dnf install fftw-devel

Useful debug commands

# NPU PCI device
lspci -nn | grep 17f0

# Device node
ls -la /dev/accel/

# Kernel messages
sudo dmesg | grep -iE 'amdxdna|xdna|pasid|firmware|17f0'

# Which driver module is loaded
modinfo -F filename amdxdna
lsmod | grep amdxdna

# Firmware files (rev 11 = 17f0_11)
ls -la /usr/lib/firmware/amdnpu/17f0_11/

# IOMMU status
cat /proc/cmdline
Enter fullscreen mode Exit fullscreen mode

Fedora XRT build fixes

Fedora 44 changed OpenCL packaging. Upstream XRT may fail to build or produce RPMs with wrong dependencies. Symptoms:

  • Compile error in ocl_icd_bindings.cpp (OpenCL 3.0 ICD struct layout)
  • RPM dependency conflict between ocl-icd and OpenCL-ICD-Loader

Workarounds applied in our build (track upstream fix):

  • Patch xrt/src/runtime_src/xocl/api/icd/ocl_icd_bindings.cpp for OpenCL 3.0 ICD compatibility
  • Patch xrt/src/CMake/cpackLin.cmake to require OpenCL-ICD-Loader >= 3.0 on Fedora

Upstream issue: Xilinx/XRT #9163

Install these before building if xrtdeps.sh fails on OpenCL packages:

sudo dnf install -y opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel
Enter fullscreen mode Exit fullscreen mode

Build flags that helped on Fedora:

./build.sh -npu -opt -disable-werror -noinit -j $(nproc)
Enter fullscreen mode Exit fullscreen mode

Use -j $(nproc) with a space — some build scripts break on -j$(nproc).


Architecture: why so many pieces?

┌─────────────────────────────────────────┐
│  flm run / flm serve                    │  ← FastFlowLM (user-facing)
├─────────────────────────────────────────┤
│  libxrt_driver_xdna.so (XRT plugin)     │  ← xrt_plugin RPM
├─────────────────────────────────────────┤
│  libxrt_core.so (XRT base)              │  ← xrt-base RPM
├─────────────────────────────────────────┤
│  amdxdna.ko (DKMS kernel driver)        │  ← xrt_plugin RPM (postinst)
├─────────────────────────────────────────┤
│  NPU firmware (amdnpu/17f0_11/)       │  ← linux-firmware + plugin
├─────────────────────────────────────────┤
│  /dev/accel/accel0                      │  ← kernel DRM device node
└─────────────────────────────────────────┘
         ▲
         │ requires IOMMU (PASID/SVA)
         │ requires memlock = unlimited
Enter fullscreen mode Exit fullscreen mode

Quick re-setup checklist (future you)

After a fresh Fedora install or kernel upgrade:

# 1. Confirm IOMMU is NOT disabled
grep -q 'amd_iommu=off' /proc/cmdline && echo "FIX GRUB FIRST" || echo "IOMMU OK"

# 2. Rebuild DKMS if kernel changed
sudo dkms autoinstall -k "$(uname -r)"
sudo depmod -a

# 3. Check memlock
ulimit -l

# 4. Validate
flm validate && xrt-smi examine && xrt-smi validate
Enter fullscreen mode Exit fullscreen mode

References


Written from a working Fedora 44 + ROG Flow Z13 setup. If AMD ships Fedora packages later, prefer those over building from source — but the troubleshooting sections above still apply.

Top comments (0)