Running LLMs on AMD NPU with FastFlowLM - Fedora Guide
Tested on Fedora 44, kernel 7.0.12, ROG Flow Z13 (Ryzen AI Max 390 / Strix Halo NPU).
Goal: copy-paste setup that getsflm validateandflm runworking on Fedora.
TL;DR
You need four layers working together:
| Layer | What it does |
|---|---|
Kernel + DKMS driver (amdxdna) |
Creates /dev/accel/accel0, loads NPU firmware |
| XRT base | AMD runtime installed to /opt/xilinx/xrt
|
XRT NPU plugin (xrt_plugin RPM) |
Provides libxrt_driver_xdna.so so XRT sees the NPU |
FastFlowLM (flm) |
Runs LLMs on the NPU |
On Fedora there is no prebuilt PPA like Ubuntu. You build XRT, the NPU plugin, and FastFlowLM from source.
Two non-obvious blockers we hit:
-
amd_iommu=offin kernel cmdline — common for GPU LLM tuning, but breaks the NPU -
Symlinking
xrt-smi— the script is path-sensitive; use a wrapper instead
Hardware tested
| Item | Value |
|---|---|
| Machine | ASUS ROG Flow Z13 GZ302EA |
| CPU / NPU | AMD Ryzen AI Max 390 (Strix Halo) |
| NPU PCI ID |
1022:17f0 rev 11 |
| OS | Fedora Linux 44 Workstation |
| Kernel | 7.0.12-201.fc44.x86_64 |
| NPU firmware | 1.1.2.65 |
| XRT | 2.25.0 (built from amd/xdna-driver) |
Also works on other XDNA2 NPUs (Strix, Strix Halo, Kraken, Gorgon Point) with the same stack.
Before you start
Kernel requirements
-
Linux 7.0+ (Fedora 44 ships this) with in-tree
amdxdnasupport - For Strix Halo (rev 11), prefer the out-of-tree DKMS driver from
xdna-driverover the stock in-tree module - IOMMU must be enabled — see IOMMU section
Check your kernel cmdline now
cat /proc/cmdline
If you see amd_iommu=off, remove it before spending time on driver builds. Details below.
1. Install build dependencies
sudo dnf install -y \
git jq dkms \
kernel-devel-$(uname -r) kernel-headers \
gcc gcc-c++ make cmake ninja-build \
boost-devel boost-filesystem boost-program-options boost-static \
elfutils-devel libdrm-devel libuuid-devel libcurl-devel \
openssl-devel zlib-static glibc-static libstdc++-static \
protobuf-devel protobuf-compiler \
json-glib-devel libyaml-devel libudev-devel \
rpm-build curl pciutils \
fftw-devel \
opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel
Run AMD's dependency scripts (optional but helpful):
git clone --recursive https://github.com/amd/xdna-driver.git ~/repos/xdna-driver
cd ~/repos/xdna-driver
sudo ./tools/amdxdna_deps.sh
sudo ./xrt/src/runtime_src/tools/scripts/xrtdeps.sh
Fedora OpenCL note: Fedora 44 uses
OpenCL-ICD-Loader, not the olderocl-icdpackage. If the XRT build fails on OpenCL ICD layout or RPM dependencies, see Fedora XRT build fixes.
cmake3 wrapper (Fedora ships CMake 4.x as cmake)
XRT build scripts look for cmake3 on Fedora. Create a local wrapper — no system symlink needed:
mkdir -p ~/.local/xrt-build/bin
ln -sf /usr/bin/cmake ~/.local/xrt-build/bin/cmake3
echo 'export PATH="$HOME/.local/xrt-build/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/xrt-build/bin:$PATH"
2. Build and install XRT
cd ~/repos/xdna-driver/xrt/build
./build.sh -npu -opt -disable-werror -noinit -j $(nproc)
Install the RPMs (version string may differ slightly):
cd ~/repos/xdna-driver/xrt/build/Release
sudo dnf install -y xrt-base-*.rpm xrt-base-devel-*.rpm xrt-npu-*.rpm
Register XRT libraries system-wide
Without this, flm fails with libxrt_coreutil.so.2: cannot open shared object file:
echo '/opt/xilinx/xrt/lib64' | sudo tee /etc/ld.so.conf.d/xrt.conf
sudo ldconfig
Verify:
ldconfig -p | grep xrt
3. Build and install the NPU plugin (DKMS driver + XRT shim)
This step provides libxrt_driver_xdna.so and replaces the in-tree kernel module with the DKMS build.
export PATH="$HOME/.local/xrt-build/bin:$PATH"
cd ~/repos/xdna-driver/build
./build.sh -release -j $(nproc)
Install the plugin RPM:
sudo dnf install ~/repos/xdna-driver/build/Release/xrt_plugin.*.rpm
Verify the DKMS module is active (path should contain extra/ or updates/dkms/, not kernel/drivers/accel/):
modinfo -F filename amdxdna
# e.g. /lib/modules/7.0.12-201.fc44.x86_64/extra/amdxdna.ko.xz
Reboot if the module was just installed:
sudo reboot
4. Fix memlock limit
The NPU needs locked memory. Check current limit:
ulimit -l
If not unlimited:
sudo tee /etc/security/limits.d/99-memlock.conf <<'EOF'
* soft memlock unlimited
* hard memlock unlimited
EOF
Log out and back in (or reboot), then confirm:
ulimit -l
# unlimited
5. Critical: do NOT use amd_iommu=off
What is amd_iommu?
IOMMU (AMD-Vi on AMD platforms) mediates how PCIe devices access memory. The NPU driver uses PASID / SVA (Shared Virtual Addressing) so the NPU can share your process's virtual address space — this requires IOMMU.
amd_iommu=off disables IOMMU entirely. Strix Halo users often add it for 5–12% faster GPU inference in llama.cpp. That trade-off kills NPU support.
Symptoms with IOMMU off
[ERROR] No NPU device found.
amdxdna_sva_init: SVA bind device failed, ret -19
PASID unavailable and carveout not configured
Open /dev/accel/accel0 failed (err=-22): Invalid argument
Fix
Edit /etc/default/grub and remove amd_iommu=off:
sudo nano /etc/default/grub
Change:
GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=24576 ..."
To (keep your GPU tuning flags, drop only the IOMMU disable):
GRUB_CMDLINE_LINUX="... amdgpu.gttsize=24576 ttm.pages_limit=6291456"
Regenerate grub and reboot:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
After reboot:
cat /proc/cmdline | grep amd_iommu || echo "OK: amd_iommu not disabled"
Optional middle ground: Some users use
iommu=ptinstead ofamd_iommu=offfor slightly less IOMMU overhead while keeping NPU working. Note:amd_iommu=ptis invalid on AMD — useiommu=pt(noamd_prefix).
6. Build and install FastFlowLM
sudo dnf install -y \
libavformat-devel libavutil-devel libavcodec-devel \
libswresample-devel libswscale-devel
git clone --recursive https://github.com/FastFlowLM/FastFlowLM.git ~/repos/FastFlowLM
cd ~/repos/FastFlowLM/src
cmake --preset linux-default
cd build
cmake --build . -j $(nproc)
sudo cmake --install .
flm installs to /opt/fastflowlm/bin/flm (symlinked to /usr/local/bin/flm).
7. Make xrt-smi available (optional but useful)
xrt-smi lives in /opt/xilinx/xrt/bin/. You can source the environment script:
source /opt/xilinx/xrt/setup.sh
Do not symlink xrt-smi to /usr/local/bin — the wrapper script uses dirname "$0" and breaks when symlinked:
/usr/local/bin/xrt-smi: line 46: /usr/local/bin/unwrapped/xrt-smi: No such file or directory
Instead, create a small wrapper:
sudo tee /usr/local/bin/xrt-smi <<'EOF'
#!/bin/sh
exec /opt/xilinx/xrt/bin/xrt-smi "$@"
EOF
sudo chmod +x /usr/local/bin/xrt-smi
Or add XRT to PATH permanently in ~/.bashrc:
export PATH="/opt/xilinx/xrt/bin:$PATH"
8. Validate everything
Run these in order:
# Kernel driver + firmware
flm validate
Expected:
[Linux] Kernel: 7.0.12-201.fc44.x86_64
[Linux] NPU: /dev/accel/accel0 with 8 columns
[Linux] NPU FW Version: 1.1.2.65
[Linux] amdxdna version: 0.15
[Linux] Memlock Limit: infinity
# XRT layer
xrt-smi examine
Expected: one NPU Strix Halo device at [0000:c5:00.1].
# Hardware self-test
xrt-smi validate
Expected: gemm, latency, and throughput tests PASSED.
Important:
flm validatechecks the kernel DRM device.flm runuses XRT. Both must pass before running models.
9. Run your first model
flm run gemma4-it:e4b
flm list
flm serve gemma4-it:e4b # OpenAI-compatible server on port 52625
Models download from HuggingFace on first run. Default storage: ~/.config/flm/.
Inside an interactive flm run session, toggle performance reporting:
/verbose # per-turn TTFT, prefill tok/s, decoding tok/s
/status # token counts and throughput summary
Formal benchmarks across context lengths:
flm bench gemma4-it:e4b
10. Monitor NPU stats
There is no Linux equivalent to amdgpu_top or Windows Task Manager's NPU tab yet. Use a combination of XRT (device-level) and FLM (inference-level) tools.
Quick reference
| What you want | Command |
|---|---|
| Device info, firmware, topology | xrt-smi examine |
| Power, partitions, platform | xrt-smi examine -r all -d 0000:c5:00.1 |
| Hardware benchmark (TOPS, latency) | xrt-smi validate |
| Live-ish polling | watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1' |
| Inference speed while chatting |
/verbose and /status in flm run
|
| Formal model benchmarks | flm bench <model> |
Replace 0000:c5:00.1 with your NPU BDF from xrt-smi examine if it differs.
XRT — device-level snapshots
xrt-smi examine
xrt-smi examine -r all -d 0000:c5:00.1
xrt-smi validate
Poll while a model runs in another terminal:
watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'
Power modes (some require root):
xrt-smi configure --pmode performance -d 0000:c5:00.1
# modes: default, powersaver, balanced, performance, turbo
FLM — inference metrics (most useful in practice)
Terminal 1 — run a model with verbose output:
flm run gemma4-it:e4b
# then type /verbose
Terminal 2 — watch the NPU device:
watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'
Terminal 3 (optional) — GPU is separate from NPU:
amdgpu_top # Radeon iGPU only, not the NPU
Kernel debugfs (low-level, requires root)
sudo ls /sys/kernel/debug/accel/
sudo ls /sys/kernel/debug/dri/
# When present (exact path varies by kernel/driver):
sudo cat /sys/kernel/debug/dri/0/telemetry_profiling
sudo cat /sys/kernel/debug/dri/0/powerstate
sudo cat /sys/kernel/debug/dri/0/get_app_health
These are read-on-demand debug interfaces, not a live dashboard.
What does NOT show NPU utilization
-
htop/top— CPU and RAM only -
amdgpu_top/radeontop— GPU only -
/sys/class/accel/accel0/— device node metadata, no utilization graph
11. Real-world benchmark (ROG Flow Z13)
Measured on the same machine as this guide after a successful setup (Fedora 44, Ryzen AI Max 390, NPU firmware 1.1.2.65, IOMMU enabled).
gemma4-it:e4b
flm run gemma4-it:e4b
# /verbose enabled during session
| Metric | Result |
|---|---|
| TTFT (time to first token) | 1.21 s |
| Prefill speed | 18 tok/s |
| Decoding speed | 11 tok/s |
These numbers come from FLM's /verbose output (prefill and decoding tokens/s). Your results will vary with prompt length, context size, power mode, and background load.
xrt-smi validate on the same hardware reported 4.4 TOPS (gemm), 52 µs average latency, and ~76k op/s throughput — useful as a hardware sanity check, not directly comparable to LLM tok/s.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
libxrt_coreutil.so.2: cannot open shared object file |
XRT libs not in loader cache | Add /opt/xilinx/xrt/lib64 to ld.so.conf.d, run sudo ldconfig
|
No NPU device found + clean dmesg |
IOMMU disabled | Remove amd_iommu=off, reboot |
PASID unavailable and carveout not configured |
Same as above | Enable IOMMU |
Memlock limit is too low (8MB) |
Default ulimit too low |
/etc/security/limits.d/99-memlock.conf, re-login |
xrt-smi: 0 devices found |
Missing NPU plugin | Install xrt_plugin RPM from xdna-driver/build
|
/dev/accel/accel0 exists but open fails (ENODEV) |
In-tree driver failed probe | Install DKMS driver via xrt_plugin RPM |
flm validate OK but flm run fails with No such device with index '0'
|
XRT can't see NPU | Fix XRT plugin + xrt-smi examine
|
xrt-smi: unwrapped/xrt-smi: No such file |
Bad symlink | Use wrapper script (see section 7) |
cmake3 is not installed |
Fedora CMake naming | Create cmake3 wrapper pointing to /usr/bin/cmake
|
| XRT build fails on OpenCL ICD | Fedora OpenCL 3.0 layout | See Fedora XRT build fixes below |
Link errors for libfftw3
|
Missing dev package | sudo dnf install fftw-devel |
Useful debug commands
# NPU PCI device
lspci -nn | grep 17f0
# Device node
ls -la /dev/accel/
# Kernel messages
sudo dmesg | grep -iE 'amdxdna|xdna|pasid|firmware|17f0'
# Which driver module is loaded
modinfo -F filename amdxdna
lsmod | grep amdxdna
# Firmware files (rev 11 = 17f0_11)
ls -la /usr/lib/firmware/amdnpu/17f0_11/
# IOMMU status
cat /proc/cmdline
Fedora XRT build fixes
Fedora 44 changed OpenCL packaging. Upstream XRT may fail to build or produce RPMs with wrong dependencies. Symptoms:
- Compile error in
ocl_icd_bindings.cpp(OpenCL 3.0 ICD struct layout) - RPM dependency conflict between
ocl-icdandOpenCL-ICD-Loader
Workarounds applied in our build (track upstream fix):
- Patch
xrt/src/runtime_src/xocl/api/icd/ocl_icd_bindings.cppfor OpenCL 3.0 ICD compatibility - Patch
xrt/src/CMake/cpackLin.cmaketo requireOpenCL-ICD-Loader >= 3.0on Fedora
Upstream issue: Xilinx/XRT #9163
Install these before building if xrtdeps.sh fails on OpenCL packages:
sudo dnf install -y opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel
Build flags that helped on Fedora:
./build.sh -npu -opt -disable-werror -noinit -j $(nproc)
Use -j $(nproc) with a space — some build scripts break on -j$(nproc).
Architecture: why so many pieces?
┌─────────────────────────────────────────┐
│ flm run / flm serve │ ← FastFlowLM (user-facing)
├─────────────────────────────────────────┤
│ libxrt_driver_xdna.so (XRT plugin) │ ← xrt_plugin RPM
├─────────────────────────────────────────┤
│ libxrt_core.so (XRT base) │ ← xrt-base RPM
├─────────────────────────────────────────┤
│ amdxdna.ko (DKMS kernel driver) │ ← xrt_plugin RPM (postinst)
├─────────────────────────────────────────┤
│ NPU firmware (amdnpu/17f0_11/) │ ← linux-firmware + plugin
├─────────────────────────────────────────┤
│ /dev/accel/accel0 │ ← kernel DRM device node
└─────────────────────────────────────────┘
▲
│ requires IOMMU (PASID/SVA)
│ requires memlock = unlimited
Quick re-setup checklist (future you)
After a fresh Fedora install or kernel upgrade:
# 1. Confirm IOMMU is NOT disabled
grep -q 'amd_iommu=off' /proc/cmdline && echo "FIX GRUB FIRST" || echo "IOMMU OK"
# 2. Rebuild DKMS if kernel changed
sudo dkms autoinstall -k "$(uname -r)"
sudo depmod -a
# 3. Check memlock
ulimit -l
# 4. Validate
flm validate && xrt-smi examine && xrt-smi validate
References
- FastFlowLM — NPU-first LLM runtime
- FastFlowLM Linux docs
- amd/xdna-driver — XRT + NPU plugin source
- Lemonade NPU Linux guide
- XRT OpenCL Fedora issue #9163
- Strix Halo IOMMU discussion — GPU vs NPU trade-off
Written from a working Fedora 44 + ROG Flow Z13 setup. If AMD ships Fedora packages later, prefer those over building from source — but the troubleshooting sections above still apply.

Top comments (0)