Ankit Khandelwal

Posted on Jun 28

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

#amd #npu #llm #ai

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

Tested on Fedora 44, kernel 7.0.12, ROG Flow Z13 (Ryzen AI Max 390 / Strix Halo NPU).

Goal: copy-paste setup that gets flm validate and flm run working on Fedora.

TL;DR

You need four layers working together:

Layer	What it does
Kernel + DKMS driver (`amdxdna`)	Creates `/dev/accel/accel0`, loads NPU firmware
XRT base	AMD runtime installed to `/opt/xilinx/xrt`
XRT NPU plugin (`xrt_plugin` RPM)	Provides `libxrt_driver_xdna.so` so XRT sees the NPU
FastFlowLM (`flm`)	Runs LLMs on the NPU

On Fedora there is no prebuilt PPA like Ubuntu. You build XRT, the NPU plugin, and FastFlowLM from source.

Two non-obvious blockers we hit:

amd_iommu=off in kernel cmdline — common for GPU LLM tuning, but breaks the NPU
Symlinking xrt-smi — the script is path-sensitive; use a wrapper instead

Hardware tested

Item	Value
Machine	ASUS ROG Flow Z13 GZ302EA
CPU / NPU	AMD Ryzen AI Max 390 (Strix Halo)
NPU PCI ID	`1022:17f0` rev 11
OS	Fedora Linux 44 Workstation
Kernel	7.0.12-201.fc44.x86_64
NPU firmware	1.1.2.65
XRT	2.25.0 (built from `amd/xdna-driver`)

Also works on other XDNA2 NPUs (Strix, Strix Halo, Kraken, Gorgon Point) with the same stack.

Before you start

Kernel requirements

Linux 7.0+ (Fedora 44 ships this) with in-tree amdxdna support
For Strix Halo (rev 11), prefer the out-of-tree DKMS driver from xdna-driver over the stock in-tree module
IOMMU must be enabled — see IOMMU section

Check your kernel cmdline now

cat /proc/cmdline

If you see amd_iommu=off, remove it before spending time on driver builds. Details below.

1. Install build dependencies

sudo dnf install -y \
  git jq dkms \
  kernel-devel-$(uname -r) kernel-headers \
  gcc gcc-c++ make cmake ninja-build \
  boost-devel boost-filesystem boost-program-options boost-static \
  elfutils-devel libdrm-devel libuuid-devel libcurl-devel \
  openssl-devel zlib-static glibc-static libstdc++-static \
  protobuf-devel protobuf-compiler \
  json-glib-devel libyaml-devel libudev-devel \
  rpm-build curl pciutils \
  fftw-devel \
  opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel

Run AMD's dependency scripts (optional but helpful):

git clone --recursive https://github.com/amd/xdna-driver.git ~/repos/xdna-driver
cd ~/repos/xdna-driver
sudo ./tools/amdxdna_deps.sh
sudo ./xrt/src/runtime_src/tools/scripts/xrtdeps.sh

Fedora OpenCL note: Fedora 44 uses OpenCL-ICD-Loader, not the older ocl-icd package. If the XRT build fails on OpenCL ICD layout or RPM dependencies, see Fedora XRT build fixes.

cmake3 wrapper (Fedora ships CMake 4.x as `cmake`)

XRT build scripts look for cmake3 on Fedora. Create a local wrapper — no system symlink needed:

mkdir -p ~/.local/xrt-build/bin
ln -sf /usr/bin/cmake ~/.local/xrt-build/bin/cmake3
echo 'export PATH="$HOME/.local/xrt-build/bin:$PATH"' >> ~/.bashrc
export PATH="$HOME/.local/xrt-build/bin:$PATH"

2. Build and install XRT

cd ~/repos/xdna-driver/xrt/build

./build.sh -npu -opt -disable-werror -noinit -j $(nproc)

Install the RPMs (version string may differ slightly):

cd ~/repos/xdna-driver/xrt/build/Release
sudo dnf install -y xrt-base-*.rpm xrt-base-devel-*.rpm xrt-npu-*.rpm

Register XRT libraries system-wide

Without this, flm fails with libxrt_coreutil.so.2: cannot open shared object file:

echo '/opt/xilinx/xrt/lib64' | sudo tee /etc/ld.so.conf.d/xrt.conf
sudo ldconfig

Verify:

ldconfig -p | grep xrt

3. Build and install the NPU plugin (DKMS driver + XRT shim)

This step provides libxrt_driver_xdna.so and replaces the in-tree kernel module with the DKMS build.

export PATH="$HOME/.local/xrt-build/bin:$PATH"
cd ~/repos/xdna-driver/build
./build.sh -release -j $(nproc)

Install the plugin RPM:

sudo dnf install ~/repos/xdna-driver/build/Release/xrt_plugin.*.rpm

Verify the DKMS module is active (path should contain extra/ or updates/dkms/, not kernel/drivers/accel/):

modinfo -F filename amdxdna
# e.g. /lib/modules/7.0.12-201.fc44.x86_64/extra/amdxdna.ko.xz

Reboot if the module was just installed:

sudo reboot

4. Fix memlock limit

The NPU needs locked memory. Check current limit:

ulimit -l

If not unlimited:

sudo tee /etc/security/limits.d/99-memlock.conf <<'EOF'
*    soft    memlock    unlimited
*    hard    memlock    unlimited
EOF

Log out and back in (or reboot), then confirm:

ulimit -l
# unlimited

5. Critical: do NOT use `amd_iommu=off`

What is `amd_iommu`?

IOMMU (AMD-Vi on AMD platforms) mediates how PCIe devices access memory. The NPU driver uses PASID / SVA (Shared Virtual Addressing) so the NPU can share your process's virtual address space — this requires IOMMU.

amd_iommu=off disables IOMMU entirely. Strix Halo users often add it for 5–12% faster GPU inference in llama.cpp. That trade-off kills NPU support.

Symptoms with IOMMU off

[ERROR]  No NPU device found.
amdxdna_sva_init: SVA bind device failed, ret -19
PASID unavailable and carveout not configured
Open /dev/accel/accel0 failed (err=-22): Invalid argument

Fix

Edit /etc/default/grub and remove amd_iommu=off:

sudo nano /etc/default/grub

Change:

GRUB_CMDLINE_LINUX="... amd_iommu=off amdgpu.gttsize=24576 ..."

To (keep your GPU tuning flags, drop only the IOMMU disable):

GRUB_CMDLINE_LINUX="... amdgpu.gttsize=24576 ttm.pages_limit=6291456"

Regenerate grub and reboot:

sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

After reboot:

cat /proc/cmdline | grep amd_iommu || echo "OK: amd_iommu not disabled"

Optional middle ground: Some users use iommu=pt instead of amd_iommu=off for slightly less IOMMU overhead while keeping NPU working. Note: amd_iommu=pt is invalid on AMD — use iommu=pt (no amd_ prefix).

6. Build and install FastFlowLM

sudo dnf install -y \
  libavformat-devel libavutil-devel libavcodec-devel \
  libswresample-devel libswscale-devel

git clone --recursive https://github.com/FastFlowLM/FastFlowLM.git ~/repos/FastFlowLM
cd ~/repos/FastFlowLM/src
cmake --preset linux-default
cd build
cmake --build . -j $(nproc)
sudo cmake --install .

flm installs to /opt/fastflowlm/bin/flm (symlinked to /usr/local/bin/flm).

7. Make `xrt-smi` available (optional but useful)

xrt-smi lives in /opt/xilinx/xrt/bin/. You can source the environment script:

source /opt/xilinx/xrt/setup.sh

Do not symlink xrt-smi to /usr/local/bin — the wrapper script uses dirname "$0" and breaks when symlinked:

/usr/local/bin/xrt-smi: line 46: /usr/local/bin/unwrapped/xrt-smi: No such file or directory

Instead, create a small wrapper:

sudo tee /usr/local/bin/xrt-smi <<'EOF'
#!/bin/sh
exec /opt/xilinx/xrt/bin/xrt-smi "$@"
EOF
sudo chmod +x /usr/local/bin/xrt-smi

Or add XRT to PATH permanently in ~/.bashrc:

export PATH="/opt/xilinx/xrt/bin:$PATH"

8. Validate everything

Run these in order:

# Kernel driver + firmware
flm validate

Expected:

[Linux]  Kernel: 7.0.12-201.fc44.x86_64
[Linux]  NPU: /dev/accel/accel0 with 8 columns
[Linux]  NPU FW Version: 1.1.2.65
[Linux]  amdxdna version: 0.15
[Linux]  Memlock Limit: infinity

# XRT layer
xrt-smi examine

Expected: one NPU Strix Halo device at [0000:c5:00.1].

# Hardware self-test
xrt-smi validate

Expected: gemm, latency, and throughput tests PASSED.

Important: flm validate checks the kernel DRM device. flm run uses XRT. Both must pass before running models.

9. Run your first model

flm run gemma4-it:e4b
flm list
flm serve gemma4-it:e4b     # OpenAI-compatible server on port 52625

Models download from HuggingFace on first run. Default storage: ~/.config/flm/.

Inside an interactive flm run session, toggle performance reporting:

/verbose    # per-turn TTFT, prefill tok/s, decoding tok/s
/status     # token counts and throughput summary

Formal benchmarks across context lengths:

flm bench gemma4-it:e4b

10. Monitor NPU stats

There is no Linux equivalent to amdgpu_top or Windows Task Manager's NPU tab yet. Use a combination of XRT (device-level) and FLM (inference-level) tools.

Quick reference

What you want	Command
Device info, firmware, topology	`xrt-smi examine`
Power, partitions, platform	`xrt-smi examine -r all -d 0000:c5:00.1`
Hardware benchmark (TOPS, latency)	`xrt-smi validate`
Live-ish polling	`watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'`
Inference speed while chatting	`/verbose` and `/status` in `flm run`
Formal model benchmarks	`flm bench <model>`

Replace 0000:c5:00.1 with your NPU BDF from xrt-smi examine if it differs.

XRT — device-level snapshots

xrt-smi examine
xrt-smi examine -r all -d 0000:c5:00.1
xrt-smi validate

Poll while a model runs in another terminal:

watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'

Power modes (some require root):

xrt-smi configure --pmode performance -d 0000:c5:00.1
# modes: default, powersaver, balanced, performance, turbo

FLM — inference metrics (most useful in practice)

Terminal 1 — run a model with verbose output:

flm run gemma4-it:e4b
# then type /verbose

Terminal 2 — watch the NPU device:

watch -n1 'xrt-smi examine -r all -d 0000:c5:00.1'

Terminal 3 (optional) — GPU is separate from NPU:

amdgpu_top    # Radeon iGPU only, not the NPU

Kernel debugfs (low-level, requires root)

sudo ls /sys/kernel/debug/accel/
sudo ls /sys/kernel/debug/dri/

# When present (exact path varies by kernel/driver):
sudo cat /sys/kernel/debug/dri/0/telemetry_profiling
sudo cat /sys/kernel/debug/dri/0/powerstate
sudo cat /sys/kernel/debug/dri/0/get_app_health

These are read-on-demand debug interfaces, not a live dashboard.

What does NOT show NPU utilization

htop / top — CPU and RAM only
amdgpu_top / radeontop — GPU only
/sys/class/accel/accel0/ — device node metadata, no utilization graph

11. Real-world benchmark (ROG Flow Z13)

Measured on the same machine as this guide after a successful setup (Fedora 44, Ryzen AI Max 390, NPU firmware 1.1.2.65, IOMMU enabled).

`gemma4-it:e4b`

flm run gemma4-it:e4b
# /verbose enabled during session

Metric	Result
TTFT (time to first token)	1.21 s
Prefill speed	18 tok/s
Decoding speed	11 tok/s

These numbers come from FLM's /verbose output (prefill and decoding tokens/s). Your results will vary with prompt length, context size, power mode, and background load.

xrt-smi validate on the same hardware reported 4.4 TOPS (gemm), 52 µs average latency, and ~76k op/s throughput — useful as a hardware sanity check, not directly comparable to LLM tok/s.

Troubleshooting

Symptom	Cause	Fix
`libxrt_coreutil.so.2: cannot open shared object file`	XRT libs not in loader cache	Add `/opt/xilinx/xrt/lib64` to `ld.so.conf.d`, run `sudo ldconfig`
`No NPU device found` + clean dmesg	IOMMU disabled	Remove `amd_iommu=off`, reboot
`PASID unavailable and carveout not configured`	Same as above	Enable IOMMU
`Memlock limit is too low (8MB)`	Default ulimit too low	`/etc/security/limits.d/99-memlock.conf`, re-login
`xrt-smi: 0 devices found`	Missing NPU plugin	Install `xrt_plugin` RPM from `xdna-driver/build`
`/dev/accel/accel0` exists but open fails (ENODEV)	In-tree driver failed probe	Install DKMS driver via `xrt_plugin` RPM
`flm validate` OK but `flm run` fails with `No such device with index '0'`	XRT can't see NPU	Fix XRT plugin + `xrt-smi examine`
`xrt-smi: unwrapped/xrt-smi: No such file`	Bad symlink	Use wrapper script (see section 7)
`cmake3 is not installed`	Fedora CMake naming	Create `cmake3` wrapper pointing to `/usr/bin/cmake`
XRT build fails on OpenCL ICD	Fedora OpenCL 3.0 layout	See Fedora XRT build fixes below
Link errors for `libfftw3`	Missing dev package	`sudo dnf install fftw-devel`

Useful debug commands

# NPU PCI device
lspci -nn | grep 17f0

# Device node
ls -la /dev/accel/

# Kernel messages
sudo dmesg | grep -iE 'amdxdna|xdna|pasid|firmware|17f0'

# Which driver module is loaded
modinfo -F filename amdxdna
lsmod | grep amdxdna

# Firmware files (rev 11 = 17f0_11)
ls -la /usr/lib/firmware/amdnpu/17f0_11/

# IOMMU status
cat /proc/cmdline

Fedora XRT build fixes

Fedora 44 changed OpenCL packaging. Upstream XRT may fail to build or produce RPMs with wrong dependencies. Symptoms:

Compile error in ocl_icd_bindings.cpp (OpenCL 3.0 ICD struct layout)
RPM dependency conflict between ocl-icd and OpenCL-ICD-Loader

Workarounds applied in our build (track upstream fix):

Patch xrt/src/runtime_src/xocl/api/icd/ocl_icd_bindings.cpp for OpenCL 3.0 ICD compatibility
Patch xrt/src/CMake/cpackLin.cmake to require OpenCL-ICD-Loader >= 3.0 on Fedora

Upstream issue: Xilinx/XRT #9163

Install these before building if xrtdeps.sh fails on OpenCL packages:

sudo dnf install -y opencl-headers opencl-filesystem OpenCL-ICD-Loader-devel

Build flags that helped on Fedora:

./build.sh -npu -opt -disable-werror -noinit -j $(nproc)

Use -j $(nproc) with a space — some build scripts break on -j$(nproc).

Architecture: why so many pieces?

┌─────────────────────────────────────────┐
│  flm run / flm serve                    │  ← FastFlowLM (user-facing)
├─────────────────────────────────────────┤
│  libxrt_driver_xdna.so (XRT plugin)     │  ← xrt_plugin RPM
├─────────────────────────────────────────┤
│  libxrt_core.so (XRT base)              │  ← xrt-base RPM
├─────────────────────────────────────────┤
│  amdxdna.ko (DKMS kernel driver)        │  ← xrt_plugin RPM (postinst)
├─────────────────────────────────────────┤
│  NPU firmware (amdnpu/17f0_11/)       │  ← linux-firmware + plugin
├─────────────────────────────────────────┤
│  /dev/accel/accel0                      │  ← kernel DRM device node
└─────────────────────────────────────────┘
         ▲
         │ requires IOMMU (PASID/SVA)
         │ requires memlock = unlimited

Quick re-setup checklist (future you)

After a fresh Fedora install or kernel upgrade:

# 1. Confirm IOMMU is NOT disabled
grep -q 'amd_iommu=off' /proc/cmdline && echo "FIX GRUB FIRST" || echo "IOMMU OK"

# 2. Rebuild DKMS if kernel changed
sudo dkms autoinstall -k "$(uname -r)"
sudo depmod -a

# 3. Check memlock
ulimit -l

# 4. Validate
flm validate && xrt-smi examine && xrt-smi validate

References

FastFlowLM — NPU-first LLM runtime
FastFlowLM Linux docs
amd/xdna-driver — XRT + NPU plugin source
Lemonade NPU Linux guide
XRT OpenCL Fedora issue #9163
Strix Halo IOMMU discussion — GPU vs NPU trade-off

Written from a working Fedora 44 + ROG Flow Z13 setup. If AMD ships Fedora packages later, prefer those over building from source — but the troubleshooting sections above still apply.

DEV Community

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

TL;DR

Hardware tested

Before you start

Kernel requirements

Check your kernel cmdline now

1. Install build dependencies

cmake3 wrapper (Fedora ships CMake 4.x as `cmake`)

2. Build and install XRT

Register XRT libraries system-wide

3. Build and install the NPU plugin (DKMS driver + XRT shim)

4. Fix memlock limit

5. Critical: do NOT use `amd_iommu=off`

What is `amd_iommu`?

Symptoms with IOMMU off

Fix

6. Build and install FastFlowLM

7. Make `xrt-smi` available (optional but useful)

8. Validate everything

9. Run your first model

10. Monitor NPU stats

Quick reference

XRT — device-level snapshots

FLM — inference metrics (most useful in practice)

Kernel debugfs (low-level, requires root)

What does NOT show NPU utilization

11. Real-world benchmark (ROG Flow Z13)

`gemma4-it:e4b`

Troubleshooting

Useful debug commands

Fedora XRT build fixes

Architecture: why so many pieces?

Quick re-setup checklist (future you)

References

Top comments (0)

Running LLMs on AMD NPU with FastFlowLM - Fedora Guide

TL;DR

Hardware tested

Before you start

Kernel requirements

Check your kernel cmdline now

1. Install build dependencies

cmake3 wrapper (Fedora ships CMake 4.x as cmake)

2. Build and install XRT

Register XRT libraries system-wide

3. Build and install the NPU plugin (DKMS driver + XRT shim)

4. Fix memlock limit

5. Critical: do NOT use amd_iommu=off

What is amd_iommu?

Symptoms with IOMMU off

Fix

6. Build and install FastFlowLM

7. Make xrt-smi available (optional but useful)

8. Validate everything

9. Run your first model

10. Monitor NPU stats

Quick reference

XRT — device-level snapshots

FLM — inference metrics (most useful in practice)

Kernel debugfs (low-level, requires root)

What does NOT show NPU utilization

11. Real-world benchmark (ROG Flow Z13)

gemma4-it:e4b

Troubleshooting

Useful debug commands

Fedora XRT build fixes

Architecture: why so many pieces?

Quick re-setup checklist (future you)

References

cmake3 wrapper (Fedora ships CMake 4.x as `cmake`)

5. Critical: do NOT use `amd_iommu=off`

What is `amd_iommu`?

7. Make `xrt-smi` available (optional but useful)

`gemma4-it:e4b`