Last week I published an architecture. This week it runs.
The R.A.G-Race-Router engine now dispatches real workloads across all three processors on my GPD Pocket 4 (AMD Ryzen AI 9 HX 370): CPU (Zen 5), iGPU (Radeon 890M via Vulkan), and NPU (XDNA 2 at 50 TOPS). Here's what actually happened when I wired it up.
The Three-Processor Demo
Processors: CPU ok GPU (Vulkan) ok NPU (XDNA) ok
GPU Temp: 43C | VRAM: 1732/8192 MB | Pulse: READY
tokenize -> CPU (7ms) [lightweight, no dispatch overhead]
embed -> NPU (282ms) [embedding lookup, NPU efficient]
matmul -> GPU (5ms) [512x512 via Vulkan SPIR-V shader]
attention -> GPU (5ms) [scaled dot-product, pulsed burst]
normalize -> NPU (5ms) [RMS norm, NPU sweet spot]
project -> GPU (6ms) [linear projection, pulsed burst]
decode -> CPU (6ms) [greedy argmax, trivial]
Pipeline: 316ms total (16ms GPU, 287ms NPU, 13ms CPU)
Each operation is dispatched to the device the engine thinks is best, based on heuristics during the first few runs and learned routing rules after that. The personality database records every execution and gradually builds a profile of this specific chip.
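To make the "heuristics first, learned rules later" idea concrete, here is a minimal sketch of how such a router could pick a device. It is not the engine's actual code: the `Personality` class, the `HEURISTIC` table, and the `MIN_SAMPLES` threshold are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical static preferences used before enough timing data exists.
HEURISTIC = {
    "tokenize": "cpu", "embed": "npu", "matmul": "gpu", "attention": "gpu",
    "normalize": "npu", "project": "gpu", "decode": "cpu",
}
MIN_SAMPLES = 5  # assumed number of runs before a learned rule is trusted


class Personality:
    """Records per-(operation, device) latencies and derives routing rules."""

    def __init__(self):
        self.samples = defaultdict(list)  # (op, device) -> [latency_ms, ...]

    def record(self, op, device, ms):
        self.samples[(op, device)].append(ms)

    def best_device(self, op):
        # Only devices with enough samples are eligible for a learned rule.
        learned = {
            dev: mean(times)
            for (recorded_op, dev), times in self.samples.items()
            if recorded_op == op and len(times) >= MIN_SAMPLES
        }
        if learned:
            return min(learned, key=learned.get)  # lowest average latency
        return HEURISTIC.get(op, "cpu")  # fall back to the static heuristic


profile = Personality()
print(profile.best_device("matmul"))  # heuristic answer: gpu
for _ in range(MIN_SAMPLES):
    profile.record("matmul", "gpu", 5.0)
    profile.record("matmul", "cpu", 7.6)
print(profile.best_device("matmul"))  # learned answer: gpu
```

One limitation of this sketch: it only learns about devices it has actually tried, so a real router would also need some way to sample the non-preferred devices occasionally.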
The NPU Was Broken — We Fixed It
The NPU on Strix Point (the HX 370's platform) refused to initialize. The kernel driver's SMU (System Management Unit) init failed with "smu cmd 4 failed, 0xff" on every boot. Three sessions of debugging later:
Root cause: The driver calls SMU init before loading firmware via PSP (Platform Security Processor). On this chip, the SMU doesn't respond until firmware is loaded. Classic init-order bug.
The fix: A three-line patch to the out-of-tree amdxdna driver that skips SMU init when it fails, loads firmware via PSP anyway, and continues without power management. As a result, the NPU runs at default BIOS clocks.
Result: Llama 3.2 1B at 40-46 tok/s prefill, 14-24 tok/s decode, running on the NPU via FastFlowLM. The patched driver loads automatically on boot via a systemd service.
Vulkan Is Still Faster Than ROCm on This GPU
Updated benchmarks with the engine's Kompute integration:
| Backend | Workload | Performance |
|---|---|---|
| CPU (NumPy/BLAS) | 512x512 matmul | 7.6ms |
| GPU (Vulkan/Kompute) | 512x512 matmul | 5.0ms |
| GPU (Vulkan/IREE) | 1024x1024 matmul | 1,085 GFLOPS |
| NPU (FLM) | Llama 3.2 1B prefill | 40-46 tok/s |
The Vulkan path uses pre-compiled SPIR-V shaders (matmul, attention, fused add-scale) dispatched through Kompute 0.9.0. ROCm's hipMallocManaged remains broken on gfx1150, and Vulkan can access the full VRAM+GTT pool while HIP sees only the BIOS carveout.
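The table mixes milliseconds and GFLOPS, so a quick conversion helps compare rows. The sketch below assumes the usual 2·n³ operation count for an n×n by n×n matmul (one multiply and one add per inner-product term); the helper names are mine, not the project's.

```python
def matmul_flops(n):
    """Floating-point operations for an n x n by n x n matmul (multiply + add)."""
    return 2 * n ** 3


def gflops_from_ms(n, ms):
    """Achieved GFLOPS for an n x n matmul that took `ms` milliseconds."""
    return matmul_flops(n) / (ms / 1000) / 1e9


def ms_from_gflops(n, gflops):
    """Time in milliseconds for an n x n matmul at a given GFLOPS rate."""
    return matmul_flops(n) / (gflops * 1e9) * 1000


print(f"CPU 512x512:     {gflops_from_ms(512, 7.6):.1f} GFLOPS")
print(f"Kompute 512x512: {gflops_from_ms(512, 5.0):.1f} GFLOPS")
print(f"IREE 1024x1024:  {ms_from_gflops(1024, 1085):.2f} ms")
```

Under that assumption, the 512x512 rows work out to roughly 35 GFLOPS on CPU and 54 GFLOPS through Kompute, while the IREE figure corresponds to about 2 ms per 1024x1024 matmul. The gap between the two GPU rows suggests the small matmul is dominated by fixed per-dispatch cost rather than compute, though the post's numbers alone don't prove that.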
The Engine Learns Your Chip
After 5 runs, the personality database starts encoding routing rules. Here is the profile after 35 runs:
Hardware Personality (35 runs):
Operation Best on Avg (ms) Confidence
tokenize cpu 0.04 100%
embed npu 199.04 100%
matmul gpu 0.50 (learning)
attention gpu 0.53 (learning)
decode cpu 0.06 100%
The system learns that embedding is best on NPU, tokenization belongs on CPU, and matrix ops go to GPU. When GPU temperature spikes, the dispatcher reroutes small ops to NPU or CPU. Every reroute is logged and fed back into the personality.
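A thermal reroute of this kind could look roughly like the sketch below. The temperature limit, the "small op" cutoff, the `NPU_CAPABLE` set, and the `route` function itself are assumptions made for illustration; the engine's real thresholds and fallback logic may differ.

```python
import time

GPU_TEMP_LIMIT_C = 70.0               # assumed threshold, not taken from the project
SMALL_OP_ELEMENTS = 512 * 512         # assumed cutoff for what counts as a "small" op
NPU_CAPABLE = {"embed", "normalize"}  # assumed set of ops the NPU can take over


def route(op, preferred, n_elements, gpu_temp_c, reroute_log):
    """Return the device to run on, moving small GPU ops off a hot GPU."""
    hot = gpu_temp_c >= GPU_TEMP_LIMIT_C
    small = n_elements <= SMALL_OP_ELEMENTS
    if preferred == "gpu" and hot and small:
        fallback = "npu" if op in NPU_CAPABLE else "cpu"
        # Every reroute is recorded so it can be fed back into the profile.
        reroute_log.append({
            "time": time.time(),
            "op": op,
            "from": preferred,
            "to": fallback,
            "gpu_temp_c": gpu_temp_c,
        })
        return fallback
    return preferred


log = []
print(route("matmul", "gpu", 512 * 512, 45.0, log))  # cool GPU: stays on gpu
print(route("matmul", "gpu", 512 * 512, 80.0, log))  # hot GPU: moves to cpu
print(len(log))                                      # one reroute recorded
```

Large ops stay on the GPU even when it is hot in this sketch, on the assumption that moving them to a slower device would cost more than a short cooldown.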
Thermal Stress Test
I pushed the GPU for 30 seconds of continuous compute to test adaptive rerouting:
- 438 operations dispatched
- 54 reroutes (matmul -> CPU when GPU was busy)
- GPU temperature: 45C -> 50C (well within thermal budget)
- Distribution: 380 GPU, 53 NPU, 5 CPU
The pulsed execution model (burst on GPU, check temperature, cooldown if needed) prevents thermal throttling. The engine's pulse controller adapts the burst/cooldown ratio based on real-time temperature readings from amdgpu_top.
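The burst/check/cooldown loop can be sketched as follows. This is an illustration under stated assumptions, not the engine's pulse controller: `PulseController`, its default temperatures and durations, and the linear cooldown ramp are all hypothetical, and `read_temp` is a placeholder callable standing in for however the temperature is actually read.

```python
import time


class PulseController:
    """Alternates bursts of work with temperature-dependent cooldowns."""

    def __init__(self, read_temp, target_c=60.0, limit_c=75.0,
                 burst_s=0.5, max_cooldown_s=2.0):
        self.read_temp = read_temp        # callable returning GPU temp in Celsius
        self.target_c = target_c          # assumed: no cooldown at or below this
        self.limit_c = limit_c            # assumed: full cooldown at or above this
        self.burst_s = burst_s
        self.max_cooldown_s = max_cooldown_s

    def cooldown_for(self, temp_c):
        """Linear ramp from 0 s at target_c to max_cooldown_s at limit_c."""
        if temp_c <= self.target_c:
            return 0.0
        if temp_c >= self.limit_c:
            return self.max_cooldown_s
        frac = (temp_c - self.target_c) / (self.limit_c - self.target_c)
        return frac * self.max_cooldown_s

    def run(self, work, duration_s):
        """Call work() in bursts for about duration_s seconds; return op count."""
        ops = 0
        end = time.monotonic() + duration_s
        while time.monotonic() < end:
            burst_end = time.monotonic() + self.burst_s
            while time.monotonic() < burst_end:
                work()
                ops += 1
            pause = self.cooldown_for(self.read_temp())
            if pause > 0:
                time.sleep(pause)
        return ops


controller = PulseController(read_temp=lambda: 50.0, burst_s=0.01)
print(controller.cooldown_for(67.5))  # halfway up the ramp
print(controller.run(lambda: None, duration_s=0.03) > 0)
```

With the GPU sitting at 45-50C as in the stress test, a controller like this would never pause, which matches the observation that the run stayed well within its thermal budget.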
What's Next
This is still pre-alpha. Dispatch overhead matters: for tiny operations, routing through the engine is slower than just running on CPU. The win is thermal management and sustained throughput for long-running workloads.
Next steps:
- Route MusicGen audio generation through the engine (text encoder on CPU, decoder on pulsed GPU, EnCodec on CPU)
- Reduce dispatch overhead for small ops (batch scheduling)
- IREE integration for compiled NPU kernels
- Upstream the amdxdna SMU bypass patch
The code is at Peterc3-dev/rag-race-router. MIT license.
This project is part of CIN (Collaborative Intelligence Network), a distributed inference system spanning a ThinkCentre M70q hub and this GPD Pocket 4 mobile workstation, connected via Tailscale mesh.